
WO2023093689A1 - Computational graph optimization method and apparatus, and device - Google Patents


Info

Publication number
WO2023093689A1
WO2023093689A1 (PCT/CN2022/133304)
Authority
WO
WIPO (PCT)
Prior art keywords
node
memory
node set
tensor
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/133304
Other languages
French (fr)
Chinese (zh)
Inventor
高雄
吴一迪
程彬
张兆创
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of WO2023093689A1 publication Critical patent/WO2023093689A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular to a calculation graph optimization method, device and equipment.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence: to perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • Artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and to produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory, among others.
  • AI networks have also come into wide use.
  • AI networks are becoming more and more complex, and the scale of their network models shows a growing trend.
  • The number of network layers, the number of parameters, and the size of the data sets of AI networks are all increasing, so the memory consumed by the network model of an AI network grows correspondingly.
  • However, the memory of current computer hardware is relatively small (commonly 16 gigabytes (GB) or 32 GB), so as the memory consumption of network models keeps growing, current computer hardware struggles to support them. How to reduce the memory consumption of the network model of an AI network is therefore a technical problem that urgently needs to be solved.
  • the present application provides a calculation graph optimization method, device, equipment, computer storage medium, computer program product, and chip.
  • The computation graph is optimized by combining recomputation and operator fusion, achieving a significant reduction in memory usage without introducing large recomputation overhead, and solving the problem that networks with one or more very large tensors cannot run.
  • In a first aspect, the present application provides a computation graph optimization method, including: obtaining a second computation graph from a first computation graph based on parameters and the data dependencies between nodes in the first computation graph, the parameters including one or more of operator fusion rules, a memory threshold on the memory occupied by the tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to the nodes in the first computation graph.
  • The peak memory threshold is a threshold on the memory occupied by all the tensors in use at the moment a node executes during execution of the computation graph.
  • The data mutation threshold is a threshold on the data mutation generated by at least one node in the first computation graph.
  • The second computation graph and the first computation graph are merged to obtain a third computation graph, wherein, in the third computation graph, there is a first directed edge between the first node that outputs a first tensor and the second node that takes the first tensor as input.
  • The second computation graph, which needs to be recomputed, is screened out from the first computation graph; the second computation graph is merged with the first computation graph to obtain a new computation graph; operator fusion is performed on the new computation graph; and the fused computation graph is executed.
  • In this way, the computation graph is optimized through the combination of recomputation and operator fusion, which achieves a significant reduction in memory usage without introducing large recomputation overhead and solves the problem that networks with one or more very large tensors cannot run.
  • Obtaining the second computation graph from the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically includes: based on the parameters and the data dependencies, obtaining N first node sets from the producers included in the first computation graph, where N is a positive integer greater than or equal to 1 and a first node set includes at least one node. The nodes contained in one first node set are all directly or indirectly related to a tensor output by one producer in the first computation graph, and directly or indirectly related to at least one tensor input by that producer. Recomputation is then performed on each first node set to obtain N recomputation subgraphs, and the N recomputation subgraphs constitute the second computation graph.
  • the second computation graph can be obtained from the first computation graph according to the data dependencies and parameters among the nodes in the first computation graph.
  • the parameter is a memory threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold.
  • In this way, a node set whose output tensor occupies too much memory can automatically be taken as the required node set, thereby addressing the memory occupied by a single operator and/or a single tensor.
  • The parameter is the peak memory threshold; there is an indirect path between the producer and the consumer corresponding to the first node set, and the peak memory when each node between that producer and consumer executes is higher than the peak memory threshold.
  • the node set corresponding to the high peak memory usage can be automatically used as the required node set, thereby solving the problem of memory usage of a single operator and/or tensor.
  • The parameter is a data mutation threshold; the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
  • The parameter is an operator fusion rule; either the first node set and the consumer corresponding to the first node set can be fused, or they cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be filtered out through the operator fusion rules.
  • The parameters are the memory threshold and the peak memory threshold; there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them is higher than the peak memory threshold.
  • In this way, the required node set can be filtered out by combining the memory threshold and the peak memory threshold, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold and the data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the memory threshold and the data mutation threshold, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold and the operator fusion rules; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the peak memory threshold and the data mutation threshold; there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them during execution is higher than the peak memory threshold, and/or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the peak memory threshold and the data mutation threshold, improving the efficiency and accuracy of screening.
  • The parameters are the peak memory threshold and the operator fusion rules; the first node set meets one or more of the following conditions: there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them during execution is higher than the peak memory threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set.
  • In this way, the required node set can be screened out by combining the peak memory threshold and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the data mutation threshold and the operator fusion rules; the first node set meets one or more of the following conditions: the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the data mutation threshold and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold, the peak memory threshold, and the data mutation threshold; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them is higher than the peak memory threshold; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the memory threshold, the peak memory threshold, and the data mutation threshold, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold, the peak memory threshold, and the operator fusion rules; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them is higher than the peak memory threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, the peak memory threshold, and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold, the data mutation threshold, and the operator fusion rules; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, the data mutation threshold, and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the peak memory threshold, the data mutation threshold, and the operator fusion rules; the first node set meets one or more of the following conditions: there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them during execution is higher than the peak memory threshold; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set.
  • In this way, the required node set can be screened out by combining the peak memory threshold, the data mutation threshold, and the operator fusion rules, improving the efficiency and accuracy of screening.
  • The parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rules; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between them during execution is higher than the peak memory threshold; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumer corresponding to the first node set can be fused; or the first node set and its consumer cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set.
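The parameter combinations above all reduce to four predicates over a candidate node set. The following Python sketch is illustrative only, not the claimed implementation: the `Candidate` fields, function name, and the use of a ratio as the deviation value are assumptions made for the example. It evaluates whichever thresholds and rules are supplied and keeps a set when any enabled condition holds:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    output_bytes: int            # memory of the set's output tensor
    max_input_bytes: int         # largest input tensor of the set
    peak_bytes: int              # peak memory along the producer->consumer path
    has_indirect_path: bool      # indirect path between producer and consumer?
    fusible_with_consumer: bool  # fusible per the operator fusion rules?

def select_for_recompute(c: Candidate,
                         memory_threshold: Optional[int] = None,
                         peak_threshold: Optional[int] = None,
                         mutation_threshold: Optional[float] = None,
                         use_fusion_rules: bool = False) -> bool:
    checks = []
    if memory_threshold is not None:      # large single output tensor
        checks.append(not c.has_indirect_path
                      and c.output_bytes > memory_threshold)
    if peak_threshold is not None:        # high peak memory on the path
        checks.append(c.has_indirect_path and c.peak_bytes > peak_threshold)
    if mutation_threshold is not None:    # data mutation (ratio as deviation)
        checks.append(c.output_bytes / c.max_input_bytes > mutation_threshold)
    if use_fusion_rules:                  # fusible, or unfusible with a path
        checks.append(c.fusible_with_consumer
                      or (not c.fusible_with_consumer and c.has_indirect_path))
    return any(checks)

big = Candidate(output_bytes=8 * 2**30, max_input_bytes=2**20,
                peak_bytes=0, has_indirect_path=False,
                fusible_with_consumer=False)
print(select_for_recompute(big, memory_threshold=2**30))  # True
```

Supplying several thresholds at once corresponds to the combined screening conditions described above, since the candidate passes as soon as any one enabled predicate is met.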
  • Obtaining at least one first node set from the producers included in the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically includes:
  • obtaining a first list, where the first list includes the correspondence between producers and consumers in the first computation graph; obtaining candidate node sets according to the tensor output by each producer in the first list, the candidate node sets including at least the first node set, where, for any producer in the first list, the set of nodes directly and indirectly related to the tensor output by that producer is taken as one node set; and selecting the first node set from the candidate node sets based on the parameters. In this way, the first node set can be obtained from the producers in the first computation graph.
  • Recomputing each first node set to obtain N recomputation subgraphs specifically includes: for any node set among the N first node sets, copying the producer subgraph corresponding to that node set; and deleting from the copied producer subgraph the nodes other than those contained in the node set, together with the edges not related to the node set, to obtain the recomputation subgraph corresponding to that node set.
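The copy-and-prune step above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the adjacency-dict representation and the node names (including `Cast`, a node assumed to lie outside the selected set) are assumptions for the example.

```python
import copy

def recompute_subgraph(producer_graph, node_set):
    """producer_graph: node -> list of successor nodes (adjacency dict)."""
    g = copy.deepcopy(producer_graph)   # copy the producer subgraph
    return {                            # keep only the selected nodes and
        node: [s for s in succs if s in node_set]   # the edges between them
        for node, succs in g.items()
        if node in node_set
    }

# Hypothetical producer subgraph; Cast lies outside the selected node set.
producer = {"Gather": ["Scatter"], "Scatter": ["Div"],
            "Div": [], "Cast": ["Div"]}
print(recompute_subgraph(producer, {"Gather", "Scatter", "Div"}))
# {'Gather': ['Scatter'], 'Scatter': ['Div'], 'Div': []}
```

The `Cast` node and its edge into `Div` are dropped because they are not part of the selected node set, which matches the deletion rule described above.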
  • Merging the second computation graph with the first computation graph to obtain the third computation graph specifically includes: for each of the N recomputation subgraphs, constructing a directed edge between the node that inputs a first target tensor in the recomputation subgraph and the node that outputs the first target tensor in the first computation graph, and a directed edge between the node that outputs a second target tensor in the recomputation subgraph and the node that inputs the second target tensor in the first computation graph, to obtain the third computation graph.
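The merge step above can be sketched with plain adjacency dicts. This is an illustrative sketch under assumed names (the primed nodes stand for recomputed copies; `FuseOp1`, `Div0`, and `Out` are hypothetical), not the claimed implementation: one edge connects the original producer of the subgraph's input tensor to the subgraph's entry node, and another connects the subgraph's exit node to the consumer of its output tensor.

```python
def merge_recompute(main_graph, recompute, entry, exit_, feeds_entry, consumer):
    """main_graph / recompute: node -> list of successor nodes."""
    merged = {n: list(s) for n, s in main_graph.items()}
    for n, s in recompute.items():              # bring in the recompute nodes
        merged.setdefault(n, []).extend(s)
    merged[feeds_entry].append(entry)   # producer of first target tensor -> entry
    merged[exit_].append(consumer)      # exit -> consumer of second target tensor
    return merged

main = {"FuseOp1": ["Div0"], "Div0": ["Out"], "Out": []}
rec = {"Gather'": ["Div'"], "Div'": []}   # recomputation subgraph (primed copies)
g = merge_recompute(main, rec, entry="Gather'", exit_="Div'",
                    feeds_entry="FuseOp1", consumer="Out")
print(g["FuseOp1"], g["Div'"])  # ['Div0', "Gather'"] ['Out']
```

After the merge, `Out` can read the recomputed tensor from `Div'` instead of keeping the original tensor alive, which is what allows the memory saving.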
  • The method further includes: fusing the operators in the computation graph. This improves the performance of the operators in the computation graph; in addition, this first round of operator fusion reduces the number of operators, which reduces the overhead of analyzing operators in the subsequent recomputation process, and it can also expand the scope of the operators analyzed in the subsequent recomputation process, improving the effect of recomputation.
  • The present application provides a computation graph optimization device, including: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory. When the programs stored in the memory are executed, the processor is used to execute the method provided in the first aspect.
  • The present application provides a device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory. When the program stored in the memory is executed, the processor is used to execute the method provided in the first aspect.
  • the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on an electronic device, the electronic device executes the method provided in the first aspect.
  • the present application provides a computer program product, which causes the electronic device to execute the method provided in the first aspect when the computer program product is run on the electronic device.
  • The present application provides a chip, including at least one processor and an interface; the interface is used to provide program instructions or data for the at least one processor, and the at least one processor is used to execute the program instructions to implement the method provided in the first aspect.
  • Fig. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of the present application;
  • Fig. 2 is a schematic diagram of a training process of a network model of an AI network provided by an embodiment of the present application;
  • Fig. 3 is a schematic diagram of the training process of the network model of another AI network provided by an embodiment of the present application;
  • Fig. 4 is a schematic diagram of a comparison before and after operator fusion provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of a peak memory comparison before and after operator fusion provided by an embodiment of the present application;
  • Fig. 6 is a schematic diagram of a graph-kernel fusion process under an open-source AI computing framework provided by an embodiment of the present application;
  • Fig. 7 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;
  • Fig. 8 is a schematic diagram of topologically sorting the operators in a computation graph provided by an embodiment of the present application;
  • Fig. 9 is a schematic diagram of steps for screening recomputation node sets by memory provided by an embodiment of the present application;
  • Fig. 10 is a schematic diagram of further steps for screening recomputation node sets by memory provided by an embodiment of the present application;
  • Fig. 11 is a schematic diagram of a computation graph provided by an embodiment of the present application;
  • Fig. 12 is a schematic diagram of a peak memory comparison before and after recomputation of the computation graph provided by an embodiment of the present application;
  • Fig. 13 is a schematic diagram of steps for screening recomputation node sets by operator fusion rules provided by an embodiment of the present application;
  • Fig. 14 is a schematic diagram of a process of determining a recomputation graph provided by an embodiment of the present application;
  • Fig. 15 is a schematic diagram of a process of merging the determined recomputation graph with the computation graph obtained after operator fusion of the original computation graph provided by an embodiment of the present application;
  • Fig. 16 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;
  • Fig. 17 is a schematic diagram of a comparison of the memory required before and after optimizing the computation graph by means of operator fusion and recomputation provided by an embodiment of the present application;
  • Fig. 18 is a schematic flow chart of a computation graph optimization method provided by an embodiment of the present application;
  • Fig. 19 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Terms such as "first" and "second" in the specification and claims herein are used to distinguish different objects, rather than to describe a specific order of the objects.
  • For example, a first response message and a second response message are used to distinguish different response messages, rather than to describe a specific order of the response messages.
  • Words such as "exemplary" or "for example" are used to present examples or illustrations. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as more preferred or more advantageous than other embodiments or designs; rather, the use of such words is intended to present related concepts in a concrete manner.
  • "Multiple" means two or more; for example, multiple processing units refers to two or more processing units, and multiple components refers to two or more components.
  • Graph-kernel fusion is a network performance optimization technique unique to MindSpore. It automatically analyzes and optimizes the logic of the existing network computation graph and, combined with the capabilities of the target hardware, optimizes the computation graph through computation simplification and substitution, operator splitting and fusion, and operator-specific compilation, so as to improve the utilization of device computing resources and achieve overall optimization of network performance.
  • Compared with traditional optimization techniques, graph-kernel fusion has unique advantages such as joint optimization across operator boundaries, cross-layer collaboration with MindAKG (a polyhedral-based operator compiler), and just-in-time compilation.
  • Graph-kernel fusion only requires the user to enable the corresponding configuration; the entire optimization process is then completed automatically, without additional awareness from network developers, allowing users to focus on implementing the network algorithm.
  • A tensor is a data structure used to represent data in an AI framework. It usually consists of a shape (representing the size and dimensionality of the data), a data type (such as float32 or int), a memory address, and other related attributes.
  • In the field of deep learning, an operator generally refers to a computing unit that completes a simple or complex calculation such as addition, subtraction, multiplication, or division; on acceleration hardware, execution usually proceeds in units of operators.
  • Memory multiplexing is an optimization technique: by pointing the tensors of different layers at the same physical address, physical memory is reused and memory usage is reduced.
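A small standard-library sketch of this idea (illustrative only; real frameworks manage device memory, not Python arrays): two "tensors" from different layers share the same physical buffer, with the second computed in place once the first is no longer needed.

```python
import array

buf = array.array("f", [0.0] * 4)   # one physical float32 buffer
layer1_out = memoryview(buf)        # layer 1's tensor points at buf
for i, v in enumerate([1.0, 2.0, 3.0, 4.0]):
    layer1_out[i] = v

layer2_out = memoryview(buf)        # layer 2's tensor reuses the same address
for i in range(len(buf)):           # computed in place: layer 1's tensor
    layer2_out[i] = layer1_out[i] * 2.0   # is no longer needed afterwards

print(buf.tolist())                 # [2.0, 4.0, 6.0, 8.0]
```

Only one allocation exists throughout, so peak memory is halved compared with giving each layer's output its own buffer.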
  • a computation graph is a computation function expressed by a directed graph with Operators as nodes.
  • Computational graphs are mainly represented by nodes and edges.
  • Nodes in a computation graph can represent operators.
  • the edge between two operators in the computation graph can represent the relationship between the two operators. Among them, the edge can have a direction, and the direction of the edge can be the flow direction of data between operators. For example, when the edge between operator a and operator b in the calculation graph is from operator a to operator b, then the tensor output by operator a is the input of operator b.
  • The calculation function expressed by a computation graph is evaluated by sequentially invoking the operator nodes of the directed graph on the input tensors, so as to obtain the final output tensor.
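The execution model above can be sketched in a few lines of Python (an illustrative toy, not any framework's runtime; the four operators and their functions are assumptions): operators are visited in topological order so that every input tensor is ready before its consumer runs.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Hypothetical four-operator graph: a feeds b and c, which both feed d.
# Each entry maps an operator name to (function, list of input operators).
ops = {
    "a": (lambda ins: 2.0, []),                   # source operator
    "b": (lambda ins: ins[0] + 1.0, ["a"]),
    "c": (lambda ins: ins[0] * 3.0, ["a"]),
    "d": (lambda ins: ins[0] + ins[1], ["b", "c"]),
}

def run(graph):
    # Visit operators in topological order so every input is already computed.
    deps = {name: set(srcs) for name, (_, srcs) in graph.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        fn, srcs = graph[name]
        values[name] = fn([values[s] for s in srcs])
    return values

print(run(ops)["d"])  # (2+1) + (2*3) = 9.0
```

The directed edges here are exactly the data-flow edges described above: an edge from `a` to `b` means the tensor output by `a` is an input of `b`.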
  • Peak memory refers to the memory occupied by all the tensors that need to be in use when an operator executes during execution of the computation graph. For example, refer to (A) in Fig. 5:
  • when operator D executes, the tensor output by operator A, the tensor output by operator C, and the tensor to be output by operator D all need to be in use, so the peak memory at this moment is T0+T1+T2; here, the input tensor of operator B and the output tensor of operator C reuse one piece of memory.
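The peak-memory notion can be computed with a simple liveness sweep. This is a hedged sketch (the chain, sizes, and function name are illustrative assumptions, not the Fig. 5 example): at each step, sum the sizes of every tensor that has been produced but still has a pending consumer, including the tensor being produced at that step.

```python
def peak_memory(order, sizes, consumers):
    """order: execution order of nodes; sizes: output-tensor size per node;
    consumers: node -> list of nodes that read its output tensor."""
    remaining = {n: len(consumers.get(n, [])) for n in order}
    live, peak = set(), 0
    for node in order:
        live.add(node)                       # node's output becomes live
        peak = max(peak, sum(sizes[n] for n in live))
        for src, outs in consumers.items():  # node has now consumed its inputs
            if node in outs:
                remaining[src] -= 1
                if remaining[src] == 0:
                    live.discard(src)        # tensor is no longer needed
    return peak

# Hypothetical chain A -> B -> C -> D with illustrative tensor sizes.
sizes = {"A": 4, "B": 2, "C": 4, "D": 1}
consumers = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(peak_memory(["A", "B", "C", "D"], sizes, consumers))  # 6
```

The peak of 6 occurs while B executes (A's output of size 4 plus B's output of size 2), which is the kind of hotspot the recomputation screening above is designed to detect.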
  • A producer is, of two computation subgraphs with a data dependency in the computation graph, the subgraph that generates data, that is, an operator that generates data.
  • For example, if operator O1 and operator O2 have a data dependency in which the output tensor of operator O1 is the input tensor of operator O2, then operator O1 can be called a producer.
  • A consumer is, of two computation subgraphs with a data dependency in the computation graph, the subgraph that consumes data, that is, an operator that consumes data.
  • For example, if operator O1 and operator O2 have a data dependency in which the output tensor of operator O1 is the input tensor of operator O2, then operator O2 can be called a consumer.
  • a node set refers to a node or a set of multiple nodes.
  • the nodes contained in a node set are all directly or indirectly related to a tensor output by a producer in the calculation graph, and are directly or indirectly related to at least one tensor input by the producer.
  • In other words, the computation graph composed of the multiple nodes can obtain at least one tensor from the outside and outputs one tensor to the outside; that is, among the multiple nodes, at least one node can receive tensors input from the outside, and only one node outputs a tensor to the outside, while the tensors output by the other nodes are used only inside this computation graph.
• for example, Gather, Scatter and Div can constitute a node set, and these three nodes form a calculation graph in which Gather, Scatter and Div are connected in sequence: there is an edge between Gather and Scatter, and an edge between Scatter and Div. These three nodes are directly or indirectly related to the tensor B output by the corresponding producer FuseOp1, and are directly or indirectly related to the tensor A input by the producer FuseOp1;
• the calculation graph composed of these three nodes can obtain tensor A from the outside through Gather, and can output tensor B to the outside through Div; neither Gather nor Scatter can output tensors to the outside of the calculation graph.
  • the data mutation threshold refers to the threshold at which at least one node in the calculation graph generates a data mutation.
• for a single node, the data mutation threshold is the threshold of the deviation value (such as a difference or ratio) between the memory occupied by the tensor output by the node and the memory occupied by the tensor input by the node.
• for multiple nodes, these nodes can constitute the node set described above, and the data mutation threshold is the threshold of the deviation value (such as a difference or ratio) between the memory occupied by the output tensor corresponding to the node set and the memory occupied by the at least one input tensor corresponding to the node set.
• the output tensor corresponding to the node set refers to the tensor that the calculation graph composed of the nodes in the node set outputs to the outside;
• the input tensor corresponding to the node set refers to a tensor that the calculation graph composed of the nodes in the node set obtains from the outside.
  • FIG. 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system, and is applicable to general artificial intelligence field requirements.
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
• the IT value chain reflects the value brought by artificial intelligence to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provided and processed by technology) to the industrial ecological process of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
• the basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image processing identification, etc.
• Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they package the overall artificial intelligence solution, commercializing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, automatic driving, safe city, smart terminals, etc.
  • FIG. 2 shows a training process of a network model of an AI network.
  • the training process of the network model may include a forward calculation process and a backward calculation process.
  • the part on the left side of the dotted line x is the forward calculation process
  • the part on the right side of the dotted line x is the backward calculation process
  • each block in the figure represents a tensor.
• tensors in the same row reuse the same memory address, for example: tensor A and tensor A b reuse a memory address, tensor B and tensor B b reuse a memory address, ..., tensor F and tensor F b reuse a memory address, and so on.
• tensor A b , tensor B b , tensor C b , tensor D b , tensor E b , and tensor F b are the reverse-operator data corresponding to tensor A, tensor B, tensor C, tensor D, tensor E, and tensor F, respectively.
  • the forward calculation can be performed first, and then the backward calculation can be performed. Among them, the tensor obtained by the forward calculation will be stored in the memory of the corresponding computing device.
  • the calculation of tensor A b depends on tensor In and tensor B b
  • the calculation of tensor B b depends on tensor A and tensor C b
• the calculation of tensor C b depends on tensor B and tensor D b
  • the calculation of tensor D b depends on tensor C and tensor E b , etc.
  • recomputation or operator fusion can be used to solve the problem of memory usage.
• for example, as shown in Figure 3, a time-for-space method can be adopted: the memory occupied by part of the results obtained by the forward calculation in Figure 2 is released in advance, and this part of the results is recomputed in the subsequent backward calculation.
  • the tensor A obtained by the forward calculation may continue to occupy memory before the tensor B b is obtained by the backward calculation.
  • the tensor D obtained by the forward calculation can continue to occupy memory before the tensor E b is obtained by the backward calculation.
• for tensor B, tensor C, tensor E, and tensor F obtained from the forward calculation, the memory occupied by these four tensors can be released immediately after they are used in the forward calculation process, and the calculation processes corresponding to the four tensors can be re-executed during the backward calculation. For example, the memory occupied by tensor B can be released after tensor C is obtained by the forward calculation, and the memory occupied by tensor C can be released after tensor D is obtained by the forward calculation.
• when tensor D b is calculated backward, tensor C can be recalculated, and when tensor C b is calculated backward, tensor B can be recalculated. In this way, the purpose of reducing memory usage is achieved by recalculation.
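The time-for-space idea above can be illustrated with a toy forward/backward pair; the operator bodies and names here are invented stand-ins, not the actual network.

```python
def forward(a, keep_b=True):
    """Toy forward pass: B = A*2 stands in for an intermediate tensor."""
    b = a * 2
    c = b + 1
    saved = {"a": a}
    if keep_b:
        saved["b"] = b   # store B for the backward pass (space cost)
    return c, saved

def get_b_for_backward(saved):
    """If B was released after the forward pass, recompute it from A
    (time cost) instead of keeping it resident in memory."""
    if "b" in saved:
        return saved["b"]
    return saved["a"] * 2  # recomputation
```

Dropping `b` from `saved` frees its memory early; the backward step then pays the recomputation cost instead.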
• the manual recalculation-point solution often requires the user to identify the recalculation split points themselves, which demands a lot of the user and must be set separately for different models and different hardware backends, which is inefficient. The automatic recalculation-point solution takes a long time to solve (hour-level solution time, longer than most training tasks themselves), or needs to collect statistics at runtime, and cannot be combined with operator fusion.
• the recalculation method can only recalculate across data dependencies between the forward and backward passes; it cannot recalculate across data dependencies within the forward pass or within the backward pass, so its usage scenarios are relatively limited.
• the method of recalculation cannot solve the problem of the memory occupied by a single operator or a single super-large tensor. For example, continuing to refer to Figure 3, if the data volume of tensor A is too large, the memory it occupies will be large, and this large amount of memory will remain occupied for a long time.
• (A) in FIG. 4 shows the calculation before fusion: tensor A is input to operator O 1 to obtain tensor B, tensor B is input to operator O 2 to obtain tensor C, and tensor C is input to operator O 3 to obtain tensor D.
  • Operators O 1 , O 2 and O 3 can be fused by using the preset fusion rules, and these three operators can be fused into one operator.
• in this way, the process of obtaining tensor D from tensor A is transformed into the form shown in (B) of Figure 4.
• in (B) of FIG. 4, tensor D can be obtained after inputting tensor A to the fused operator (O 1 +O 2 +O 3 ). Comparing (A) and (B) in Figure 4, there are 4 tensors that need to occupy memory in (A) of Figure 4, and only 2 tensors that need to occupy memory in (B) of Figure 4. It can be seen that operator fusion can reduce memory usage.
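The effect of the fusion in Figure 4 can be sketched as follows; the three operator bodies are invented for illustration. Before fusion, tensors B and C are materialized between the calls; the fused operator maps A to D directly, so only the input and output tensors are visible outside it.

```python
def o1(a): return a + 1   # stands in for operator O1
def o2(b): return b * 2   # stands in for operator O2
def o3(c): return c - 3   # stands in for operator O3

def unfused(a):
    b = o1(a)             # tensor B occupies memory
    c = o2(b)             # tensor C occupies memory
    return o3(c)

def fused(a):
    # the fused operator (O1+O2+O3): only A and D are visible outside
    return o3(o2(o1(a)))
```

Both variants compute the same tensor D; fusion only removes the externally visible intermediates.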
• however, operator fusion can only process operators that are spatially adjacent, and the scope and scale of fusion are limited.
• in addition, operator fusion extends the life cycles of different tensors to the entire fused operator, which can instead increase memory usage.
• for example, (A) in FIG. 5 shows the execution order of each operator before operator fusion, the life cycle of each operator's input tensor and output tensor, and the peak memory of each operator during execution;
• (B) in FIG. 5 shows the execution order of each operator after operator fusion, the life cycle of each operator's input tensor and output tensor, and the peak memory of each operator during execution.
  • operator A, operator B, and operator C may be fused into operator E.
• the present application provides a calculation graph optimization method, which mainly obtains a recalculation graph that needs to be recalculated from the calculation graph to be optimized, merges the obtained recalculation graph with the calculation graph to be optimized to generate a new calculation graph, and finally performs operator fusion on the new calculation graph. Therefore, by combining recalculation and operator fusion, memory usage is significantly reduced without introducing large recalculation overhead (including compile time and runtime), and the problem that a network with one or more super-large tensors cannot execute is solved.
  • FIG. 6 shows a schematic diagram of a graph-computing fusion process under an AI open-source computing framework.
  • the AI open source computing framework shown in FIG. 6 may be MindSpore.
• the AI open source computing framework may include a front-end presentation layer (not shown in the figure), a MindSpore layer, and a MindSpore Auto Kernel Generator layer (MindSpore AKG layer).
• the front-end presentation layer may provide at least one application programming interface (API) for the user.
  • users can select the desired network model 401 at the front-end presentation layer, and configure various data through the front-end presentation layer.
• the network model can be constructed in the MindSpore layer and automatically differentiated to generate the user-configured MindSpore Expression (ME) front-end representation.
  • a calculation graph can be automatically constructed based on the user-configured ME front-end representation.
  • public optimization can be performed on the calculation graph first, such as: eliminating unnecessary code, optimizing the same data that occurs multiple times, and so on.
• then, back-end optimization can be performed on the calculation graph, for example, selecting a more suitable expression according to the characteristics of different hardware.
  • an optimized calculation graph can be generated.
  • the process of obtaining the optimized calculation graph can be understood as a graph-computing fusion process.
  • the calculation graph optimized by graph computing fusion can contain several operator definition descriptions, which describe the calculation process of each operator in the calculation graph.
• these operator definition descriptions, in json format, can be passed to MindSpore AKG for compilation and optimization, where the code generator is called to generate the back-end hardware code, which is compiled into the back-end Kernel.
  • the back-end Kernel can be understood as the expression of the operator on the hardware, such as the binary file corresponding to the operator.
  • the MindSpore layer can call the Kernel in the order of the calculation graph and execute the corresponding calculation graph.
• the back-end optimization part in the MindSpore layer can reduce the memory consumption of the network model through recalculation and operator fusion processing, so as to provide performance and memory optimization for different types of hardware back-ends (such as graphics processing units, neural-network processing units, central processing units, etc.).
• the calculation graph optimization solution may be run on, but is not limited to, any device, platform, or device cluster with computing and processing capabilities.
• the computation graph optimization scheme mainly includes two steps: setting the configuration item interface and computation graph optimization.
• setting the configuration item interface mainly allows the user to configure some necessary parameters for subsequent use in the calculation graph optimization process;
  • calculation graph optimization is mainly a process of combining operator fusion and recomputation, see the description below for details.
  • MindSpore can be configured with a Context module and a graph-calculation fusion module.
  • the Context module can be used to receive the user's global settings for the running process.
  • the graph-calculation fusion module can, but is not limited to, be responsible for the graph-computation fusion of calculation graphs and operators.
  • the graph-calculation fusion module can be, but not limited to, configured in the AI compiler module in MindSpore, and the AI compiler module can be responsible for compiling and optimizing the calculation graph and operators.
• in some embodiments, two configuration item interfaces can be added in the Context module, but this is not limiting.
  • One configuration item interface can be used to configure the memory threshold and/or peak memory threshold
  • the other configuration item interface can be used to configure the data threshold.
• users can set the values of the above two configuration items according to the actual network and dataset scenario. When the user does not set them, a system default value can be used, or the values can be generated automatically according to the hardware environment; for example, the memory size of the graphics processing unit (GPU) can be used as the memory threshold and/or the peak memory threshold.
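A hedged sketch of the two configuration items described above; the class name, parameter names, and the environment-variable fallback are illustrative, not the actual MindSpore Context API.

```python
import os

class Context:
    """Holds the user's global settings for the optimization run."""
    def __init__(self, memory_threshold=None, peak_memory_threshold=None,
                 data_threshold=None):
        # When the user sets nothing, fall back to a default derived from the
        # hardware environment (here a hypothetical environment variable
        # standing in for the device memory size).
        default_mem = int(os.environ.get("DEVICE_MEMORY_BYTES", 8 * 2**30))
        self.memory_threshold = (memory_threshold
                                 if memory_threshold is not None else default_mem)
        self.peak_memory_threshold = (peak_memory_threshold
                                      if peak_memory_threshold is not None
                                      else default_mem)
        self.data_threshold = data_threshold if data_threshold is not None else 0
```

The graph-computation fusion module would read these thresholds from the Context when screening recomputation candidates.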
  • the interface for setting configuration items can be set in real time or in advance, which can be determined according to actual conditions, and is not limited here.
  • the process of computing graph optimization is mainly a process of combining operator fusion and recomputation. This process can be executed, but not limited to, in the graph-computing fusion module configured in MindSpore. Wherein, the graph computing fusion module can read the memory threshold and/or peak memory threshold and data threshold configured by the user in the Context module.
  • the calculation graph optimization process mainly includes primary operator fusion, recomputation, and secondary operator fusion. See the description below for details.
• in this step, operator fusion can first be performed on the obtained calculation graph according to preset operator fusion rules (such as: elementwise+elementwise operator fusion rules, elementwise+reduction operator fusion rules, segment+elementwise operator fusion rules, etc.) and the categories of the operators in the calculation graph (for example: computation-intensive, memory-intensive, etc.).
• An operator obtained after fusion is hereinafter referred to as a "fusion operator".
  • This improves the performance of operators in the calculation graph.
  • the number of operators can also be reduced, which can reduce the overhead of analyzing operators in the subsequent recalculation process.
  • (A) in Figure 7 is the initially obtained calculation graph, which includes 6 operators, and these 6 operators are Slice, Gather, ScatterAdd, MatMul, Mul and Div.
• the output of Slice is tensor A; the input of Gather is tensor A, and the output is tensor B; the input of ScatterAdd is tensor B, and the output is tensor C; the input of MatMul is tensor C, and the output is tensor D; the input of Mul is tensor D, and the output is tensor E; the input of Div is tensor B and tensor D.
• assuming the operator fusion rules are: fuse Gather with ScatterAdd, and fuse Mul with Div, then based on these rules, performing operator fusion on (A) in Figure 7 yields the calculation graph shown in (B) of Figure 7.
• (B) in FIG. 7 includes 4 calculation subgraphs, and these 4 calculation subgraphs correspond to 4 fusion operators, respectively: Slice, FuseOp1, MatMul and FuseOp2.
• the output of Slice is tensor A; the input of FuseOp1 is tensor A, and the outputs are tensor B and tensor C; the input of MatMul is tensor C, and the output is tensor D; the inputs of FuseOp2 are tensor D and tensor B. It can be understood that since (B) in FIG. 7 is a calculation graph obtained after operator fusion, each operator in (B) of FIG. 7 can be called a "fusion operator".
• it should be noted that, in some embodiments, the subsequent recalculation step may be performed after this step; in some other embodiments, this step may be skipped and the subsequent recalculation step performed directly.
• the recalculation process mainly includes one or more of: building a producer-consumer list, determining a recalculation candidate set, screening a recalculation point set from the recalculation candidate set, determining a recalculation graph, and generating a new calculation graph; see the description below for details.
• in this step, the nodes in the calculation graph can be topologically sorted first, for example, sorted according to the dependency relationships between operators in the calculation graph, and then the producer-consumer list can be constructed.
• the producer is the calculation subgraph that generates data in the two calculation subgraphs with data dependencies, that is, the operator that generates data;
• the consumer is the calculation subgraph that uses data in the two calculation subgraphs with data dependencies, that is, the operator that uses data.
  • (A) in FIG. 8 is a calculation subgraph obtained after one fusion. Based on the data dependencies among the nodes in the calculation graph, each node in the calculation graph shown in (A) of FIG. 8 is topologically sorted, and the sorting result shown in (B) of FIG. 8 can be obtained.
• in (B) of FIG. 8, Slice and FuseOp1 have a data dependency
  • FuseOp1 has a data dependency with MatMul and FuseOp2 respectively
  • MatMul and Div have a data dependency
  • Div and FuseOp2 have a data dependency.
  • Slice can generate data, and FuseOp1 can use the data generated by Slice, so Slice is the producer, and FuseOp1 is the consumer.
• for FuseOp1 and MatMul, FuseOp1 is the producer and MatMul is the consumer;
• for FuseOp1 and FuseOp2, FuseOp1 is the producer and FuseOp2 is the consumer;
• for MatMul and Div, MatMul is the producer and Div is the consumer;
• for Div and FuseOp2, Div is the producer and FuseOp2 is the consumer.
  • the producer-consumer list corresponding to the computation graph in (A) in FIG. 8 can be obtained from (B) in FIG. 8 , and the list can be shown in Table 1.
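The construction of the producer-consumer list for the graph of Figure 8 can be sketched with the standard-library topological sorter; the edge dictionary below encodes the dependencies described above, and the function names are illustrative.

```python
from graphlib import TopologicalSorter

def producer_consumer_list(consumers_of):
    """consumers_of: dict producer -> list of its consumers (graph edges).
    Returns the topological order of the nodes and the producer-consumer
    pairs in that order."""
    ts = TopologicalSorter()
    for producer, consumers in consumers_of.items():
        for consumer in consumers:
            ts.add(consumer, producer)   # consumer depends on producer
    order = list(ts.static_order())
    pairs = [(p, c) for p in order for c in consumers_of.get(p, [])]
    return order, pairs

# The graph of Figure 8 after one round of fusion.
edges = {"Slice": ["FuseOp1"], "FuseOp1": ["MatMul", "FuseOp2"],
         "MatMul": ["Div"], "Div": ["FuseOp2"]}
```

Applied to `edges`, this reproduces the producer-consumer pairs of Table 1.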
• after the producer-consumer list is constructed, for any producer in the list, each set of operators in the producer that is directly or indirectly related to a tensor output by the producer can be used as one entry in the recomputation candidate set; such a set can be called a recomputation node set.
• in this way, a recomputation candidate set is obtained.
  • the set of recomputing nodes may include one node, or may include multiple nodes.
  • the tensors output by the producer are tensor B and tensor C respectively.
  • the node directly related to tensor B is Gather, and there is no node indirectly related to tensor B.
  • the node directly related to tensor C is ScatterAdd, and the node indirectly related to tensor C is Gather. Therefore, the set of recomputation candidates obtained by the producer FuseOp1 is [(Gather), (Gather, ScatterAdd)].
• the recalculation candidate set includes two recalculation node sets: one is (Gather), which includes one node, namely Gather; the other is (Gather, ScatterAdd), which includes two nodes, namely Gather and ScatterAdd.
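Deriving the recomputation candidate set for a producer can be sketched as a reachability walk inside the producer's subgraph: for each tensor the producer outputs, collect the node that emits it (directly related) plus all of that node's ancestors (indirectly related). The dictionaries model FuseOp1 from the example and are illustrative.

```python
def recompute_candidates(internal_preds, output_node):
    """internal_preds: dict node -> predecessor nodes inside the producer;
    output_node: dict output tensor -> node that emits it. Returns one
    recomputation node set per output tensor."""
    def related(node):
        # the emitting node is directly related; its ancestors indirectly
        seen, stack = {node}, [node]
        while stack:
            for pred in internal_preds.get(stack.pop(), []):
                if pred not in seen:
                    seen.add(pred)
                    stack.append(pred)
        return seen
    return {t: related(n) for t, n in output_node.items()}
```

For FuseOp1 (Gather feeding ScatterAdd, with outputs B and C) this yields [(Gather), (Gather, ScatterAdd)] as in the text.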
• the methods of screening the recalculation point set from the recalculation candidate set may include one or more of: screening based on memory size, screening based on data mutation, and screening based on operator fusion rules. One of these three methods can be used alone, or they can be combined arbitrarily, as determined by actual conditions; this is not limited here. The three methods are described below.
• in this method, the memory occupied by the output tensor corresponding to each recalculation node set in the recalculation candidate set can be compared with the previously set memory threshold or peak memory threshold, and the required recalculation point set is thereby filtered out of the recalculation candidate set.
• the output tensor corresponding to the recalculation node set refers to a tensor output by the producer corresponding to the recalculation node set, where each node in the recalculation node set is directly or indirectly related to that tensor.
• the input tensor corresponding to the recalculation node set refers to at least one tensor input by the producer corresponding to the recalculation node set, where each node in the recalculation node set is directly or indirectly related to each tensor in the at least one tensor.
• in the process of obtaining the producer-consumer list, it is possible to record whether there is an indirect path between each producer and consumer, that is, to record whether the tensor output by the producer, in addition to being directly used as the consumer's input tensor, can also be indirectly used as the consumer's input tensor.
• for example, continuing to refer to Figure 8, the tensor B output by the producer FuseOp1 can be directly used as an input tensor of the consumer FuseOp2, and the tensor C output by the producer FuseOp1 can be indirectly used, via MatMul and Div, as an input tensor of FuseOp2; so there is an indirect path between FuseOp1 and FuseOp2.
• for a producer-consumer pair with no indirect path, the required recomputation node sets corresponding to the producer can be filtered out of the recalculation candidate set by comparing the memory occupied by the output tensor corresponding to each of the producer's recomputation node sets with the previously set memory threshold or peak memory threshold.
• for example, continuing to refer to Figure 8, for the producer-consumer pair MatMul-Div, there is no indirect path between the two. Suppose the memory threshold set earlier is 40 bytes and the memory occupied by the tensor output by the producer MatMul is 50 bytes. Since there is only a direct path between MatMul and Div and no indirect path, and the memory occupied by the tensor output by the producer MatMul is larger than the memory threshold, (MatMul) can be used as a required recomputation node set.
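The memory-size screening for producer-consumer pairs without an indirect path might look like the following sketch (all names and data structures hypothetical):

```python
def screen_by_memory(candidates, output_bytes, indirect_path, memory_threshold):
    """candidates: dict (producer, consumer) -> recomputation node set;
    output_bytes: dict producer -> bytes of its output tensor;
    indirect_path: dict (producer, consumer) -> bool. Keep a node set when
    there is no indirect path and the output tensor exceeds the threshold."""
    kept = []
    for (producer, consumer), node_set in candidates.items():
        if not indirect_path[(producer, consumer)] and \
                output_bytes[producer] > memory_threshold:
            kept.append(node_set)
    return kept
```

With the MatMul-Div numbers above (50-byte output, 40-byte threshold, no indirect path), (MatMul) is kept.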
• when determining the memory occupied by the tensor output by a producer, the determination may be based on, but is not limited to, the data size of that tensor.
• generally, after the calculation graph is determined, the structure (such as: dimensions and type) of the tensor output by each operator in the calculation graph can be determined; therefore, the memory occupied by the tensor output by each operator can be determined according to that structure.
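Since the structure (dimensions and element type) of each output tensor is known once the graph is fixed, its memory footprint follows directly; a minimal sketch with an illustrative element-size table:

```python
from math import prod

# Illustrative element sizes in bytes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}

def tensor_bytes(shape, dtype):
    """Memory occupied by a dense tensor of the given shape and type."""
    return prod(shape) * DTYPE_BYTES[dtype]
```

For instance, a 2x5 float32 tensor occupies 40 bytes, matching the byte counts used in the examples.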
• for a producer-consumer pair with an indirect path, the peak memory of each operator on the indirect path between the producer and the consumer during execution may be calculated separately.
• if at least one of these peak memories exceeds the previously set peak memory threshold, the sets of operators directly and indirectly related to each tensor input from the producer to the consumer are all used as required recomputation node sets; otherwise, those recomputation node sets are not required and are discarded. Thus, the required recalculation point set is filtered out from the recalculation candidate set.
• for example, FuseOp1 and FuseOp2 in FIG. 11 form a producer-consumer pair.
• the set of operators directly and indirectly related to tensor B is (Gather), and the set of operators directly and indirectly related to tensor E is (Gather, ScatterAdd, Div);
• the corresponding recomputation node sets are therefore (Gather) and (Gather, ScatterAdd, Div).
• in some embodiments, during screening, the producer-consumer pair and the tensor input from the producer to the consumer (that is, the tensor output by the producer) can be considered together: the size relationship between the memory occupied by that tensor and the memory occupied by the input tensors it depends on can be used as an additional judgment condition.
• specifically, if at least one of the peak memories of the operators on the indirect path between the producer and the consumer during execution is greater than the peak memory threshold, the sets of operators directly and indirectly related to each tensor input from the producer to the consumer can each be used as a required recomputation node set.
• for example, suppose the tensor output by operator 1 occupies memory T1 and the input tensors it depends on occupy memory T0. If T1 is less than T0, the peak memory corresponding to the calculation graph would increase after recalculation; if T1 is equal to T0, the peak memory corresponding to the calculation graph would be unchanged after recalculation, while recalculation would still add system overhead. Therefore, only when T1 is greater than T0 can operator 1 be used as a required recomputation node set.
• when screening based on data mutation, it can be determined whether there is a data mutation in each recalculation node set in the recalculation candidate set. If the deviation value (such as a difference or ratio) between the memory occupied by the output tensor corresponding to the recalculation node set and the memory occupied by the at least one input tensor corresponding to it is greater than or equal to the previously set data threshold, there is a data mutation, and the recalculation node set is kept; otherwise, it is discarded. In this way, excessive memory usage caused by data mutation is eliminated.
• for example, the recomputation node sets in the recalculation candidate set are [(Gather), (Gather-ScatterAdd-Div)]. If the memory occupied by tensor B minus the memory occupied by tensor A is greater than the previously set data threshold, there is a data mutation in the recalculation node set (Gather), and this set can be retained. If the memory occupied by tensor E minus the memory occupied by tensor A is less than the previously set data threshold, there is no data mutation in the recalculation node set (Gather-ScatterAdd-Div), and this set can be discarded. The recalculation point set thus filtered out is (Gather).
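The data-mutation screening can be sketched as follows, taking the deviation to be the difference in bytes between a set's output tensor and its input tensors (names and data structures hypothetical):

```python
def screen_by_data_mutation(candidates, data_threshold):
    """candidates: list of (node_set, output_bytes, input_bytes_list).
    Keep a recomputation node set when its output deviates from its inputs
    by at least the data threshold, i.e. a data mutation occurred."""
    kept = []
    for node_set, out_bytes, in_bytes in candidates:
        if out_bytes - sum(in_bytes) >= data_threshold:
            kept.append(node_set)
    return kept
```

With an illustrative 50-byte threshold, a set whose output grows from 10 to 100 bytes is kept, while one growing from 10 to 12 bytes is discarded.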
• (A) in Figure 12 shows the execution order of each operator before recalculation, the life cycle of each operator's input tensor and output tensor, and the peak memory of each operator during execution;
• (B) in Figure 12 shows the execution order of each operator after recalculation, the life cycle of each operator's input tensor and output tensor, and the peak memory of each operator during execution. Comparing (A) and (B) in Figure 12, when operator 4 is executing, the peak memory before recalculation is T1+T2+T3+T4, and the peak memory after recalculation is T0+T2+T3+T4.
• assuming operator 1 is a recomputation node set in the determined recalculation candidate set, it can be seen that when operator 4 is executing, the difference between the peak memory after recalculation and the peak memory before recalculation is small; in this case, operator 1 may or may not be selected as a recomputation node set.
• otherwise, the recomputation node set may be discarded.
• the required set of recomputation nodes is thus filtered out.
• step S1321 can also be understood as judging whether a node that outputs tensors in the recalculation node set and its corresponding consumer can be fused.
• assume that the determined recomputation node set in the recalculation candidate set is operator 1, and that operator 1 and operator 5 cannot be fused. Since operator 1 and operator 5 do not comply with the fusion rules but there is an indirect path between them, operator 1 can be used as a required recomputation node set. Comparing (A) and (B) in Figure 12, if the memory T1 occupied by the tensor output by operator 1 is too large, then in (B) of Figure 12 the life cycle of the tensor output by operator 1 can be greatly shortened, and the peak memory corresponding to the entire calculation graph is also significantly reduced.
  • a recomputation graph can be determined from the determined recomputation node set.
  • an edge between a recalculation node set and a consumer corresponding to a producer to which the recalculation node set belongs may be used as a recomputation point in the calculation graph.
• deleting all nodes in the producer subgraph except the recomputation node set corresponding to the producer subgraph can be understood as deleting, from the producer subgraph, the operators irrelevant to the tensor output by the recomputation node set.
• in some embodiments, the producer to which each recomputation node set belongs can be directly copied in the computation graph to obtain the producer subgraph corresponding to the recomputation node set. Then, all nodes in the producer subgraph other than the recomputation node set are deleted, and the edges in the producer subgraph unrelated to the recomputation node set are deleted, thereby obtaining the recomputation graph corresponding to each recomputation node set.
  • (A) in FIG. 14 is a fused subgraph obtained by performing operator fusion on the initial calculation graph. If the memory occupied by tensor B in (A) of FIG. 14 is higher than the memory threshold configured by the user, then the set of recalculation nodes can be determined as Gather, and the edge corresponding to tensor B can be determined as the recalculation point. Next, the producer subgraph corresponding to the recomputation point can be copied, that is, FuseOp1 can be copied to obtain the fusion operator shown in (B) of FIG. 14 .
  • copying the producer subgraph corresponding to the recomputation point can be understood as copying, from the fusion subgraph obtained by the first operator fusion, the fusion operator that outputs the tensor B.
  • operators other than the recomputation node set Gather can be deleted, that is, ScatterAdd can be deleted, and edges unrelated to the recomputation node set Gather can be deleted, that is, the edge between the operator Scatter and the operator Gather and the edge corresponding to the output tensor of the operator Scatter, so as to obtain the recomputation graph shown in (C) of Figure 14.
  • each recalculation graph is combined with the original computation graph to generate a new computation graph.
  • Tensor nodes have data dependencies.
  • regarding the original computation subgraph: if operator fusion has been performed on the original computation graph during the execution of the recomputation step, the original computation subgraph is the subgraph obtained after operator fusion; if operator fusion has not been performed on the original computation graph during the recomputation step, the original computation subgraph is the original computation graph.
  • the directed edge between the node inputting the first target tensor in each recomputation graph and the node outputting the first target tensor in the original calculation graph can be constructed, and the directed edge between the node outputting the second target tensor in each recomputation graph and the node inputting the second target tensor in the original calculation graph can be constructed, so as to obtain a new calculation graph.
  • (A) in Figure 15 is the initial calculation graph (that is, the original calculation graph).
  • (B) in FIG. 15 is the primary fusion subgraph. After recomputation is applied to the primary fusion subgraph shown in (B) of FIG. 15, the recomputation graph shown in (C) of FIG. 15 can be obtained.
  • the operator used to input the tensor A in the recomputation graph in (C) of Figure 15 can be connected with the operator Slice in (B) of Figure 15, that is, the directed edge between the two can be constructed; and the operator used to output the tensor B in the recomputation graph in (C) of Figure 15 can be connected with the operator FuseOp2 in (B) of Figure 15, that is, the directed edge between the two can be constructed. Therefore, after combining the recomputation graph shown in (C) of FIG. 15 with the primary fusion subgraph shown in (B) of FIG. 15, the new computation graph shown in (D) of FIG. 15 can be obtained.
  • operator fusion (that is, secondary fusion) can then be performed on the new computation graph.
  • the computation graph can be divided into a plurality of calculation subgraphs, wherein each calculation subgraph may correspond to an operator obtained after fusion (hereinafter referred to as a "fusion operator").
  • the calculation graph obtained by the final optimization can be compiled first and the compiled calculation graph executed, or the calculation graph can be executed directly; either way, it is essentially the finally optimized calculation graph that is executed, only in a different representation.
  • FIG. 16 shows a new calculation graph obtained after recalculation. If the preset operator fusion rule is: Gather and Mul fusion, then based on the operator fusion rule, the fusion subgraph shown in (B) in Figure 16 can be obtained after performing operator fusion on (A) in Figure 16 .
  • the nodes described in the embodiments of the present application may be, but not limited to, operators, and the operators described in the embodiments of the present application may also be, but not limited to, nodes, which may be determined according to actual conditions.
  • the fusion subgraph is obtained by first performing operator fusion on the calculation graph; the recomputation graph that needs to be recomputed is then determined from the fusion subgraph and combined with the fusion subgraph to generate a new calculation graph; finally, operator fusion is performed on the new calculation graph to obtain the required calculation graph. This achieves a significant reduction in memory usage without introducing large recomputation overhead, and solves the problem that networks with one or more very large tensors cannot execute.
  • M and L respectively represent the memory sizes occupied by the tensors output by the operators, where M represents a medium size (such as 56.8M) and L represents a very large size (such as 27G).
  • the calculation graph shown in (A) of FIG. 17 is a calculation graph without any processing, that is, an initial calculation graph.
  • the calculation graph shown in (B) of FIG. 17 is a calculation graph obtained by performing operator fusion on the calculation graph shown in FIG. 17 (A) based on a preset operator fusion rule, that is, a primary fusion subgraph.
  • the operator Slice and the operator MatMul reuse one block of memory, so the memory size required for the calculation graph shown in (B) of Figure 17 is: M1+M2+L1+L2.
  • the calculation graph shown in (C) of FIG. 17 is a calculation graph obtained after recalculating the calculation graph shown in FIG. 17 (B), that is, a new calculation graph.
  • memory cannot be reused between operators, so the memory size required for the calculation graph shown in (C) of Figure 17 is: M1+M2+M3+L1+L2.
  • the calculation graph shown in (D) of Figure 17 is a calculation graph obtained by performing secondary operator fusion on the calculation graph shown in Figure 17 (C) based on the preset operator fusion rules, that is, the secondary fusion subgraph , the computation graph is the desired computation graph.
  • the memory size required for the calculation graph shown in (D) of Figure 17 is: M1+M2+M3+L1.
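As a quick illustration, the totals above can be checked against the example sizes given earlier (M ≈ 56.8M, L ≈ 27G); the variable names below are illustrative assumptions, not part of the patent:

```python
# Hypothetical sketch: compare the memory totals of graphs (B), (C), (D)
# in Figure 17, using the example sizes M = 56.8 MB and L = 27 GB.
MB = 1.0
GB = 1024.0
M1 = M2 = M3 = 56.8 * MB
L1 = L2 = 27 * GB

after_first_fusion  = M1 + M2 + L1 + L2        # graph (B): Slice/MatMul reuse memory
after_recompute     = M1 + M2 + M3 + L1 + L2   # graph (C): no reuse between operators
after_second_fusion = M1 + M2 + M3 + L1        # graph (D): final, desired graph

# The final graph keeps only one L-sized tensor instead of two,
# so its footprint is roughly halved relative to graph (B).
print(after_second_fusion < after_first_fusion)  # True
```

The M3 added by recomputation is negligible next to the L-sized tensor removed by the secondary fusion, which is the trade-off the method exploits.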
  • the recomputation point can be determined based on memory size, data mutation, fusion rules, and the like, which not only effectively reduces memory usage but also improves the efficiency of identifying recomputation points.
  • tensors are converted to internal storage (such as register space).
  • FIG. 18 is a schematic flowchart of a computation graph optimization method provided by an embodiment of the present application. It can be understood that the method can be executed by any device, device, platform, or device cluster that has computing and processing capabilities. As shown in Figure 18, the calculation graph optimization method may include the following steps:
  • the parameters include the operator fusion rule and the memory threshold of the tensor output by a single node in the first computation graph; the memory threshold is the threshold of the memory occupied by the tensor to be used, and the data mutation threshold is the threshold for at least one node in the first calculation graph generating a data mutation.
  • the second computation graph can be obtained from the first computation graph based on parameters and data dependencies between nodes in the first computation graph.
  • the parameters include one or more of the following: the operator fusion rule, the memory threshold occupied by a tensor output by a single node in the first computation graph, the peak memory threshold corresponding to the first computation graph, and the data mutation threshold corresponding to nodes in the first computation graph.
  • the peak memory threshold is the threshold of memory occupied by all the tensors to be used at the time of execution of a node during the execution of the computation graph
  • the data mutation threshold is the threshold of at least one node in the first computation graph that generates a data mutation.
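To make the peak memory notion concrete, the following is a minimal sketch (data structures and names are our own assumptions, not the patent's implementation) that walks a graph in execution order and tracks which tensors are simultaneously alive:

```python
# Minimal sketch: peak memory during execution of a computation graph.
# A tensor is alive from the step its producer runs until the step of
# its last consumer. All structures below are illustrative assumptions.

def peak_memory(order, produces, consumers, size):
    """order:     nodes in topological execution order
    produces:  node   -> name of the tensor it outputs
    consumers: tensor -> list of nodes that read it
    size:      tensor -> memory it occupies"""
    last_use = {}
    for i, node in enumerate(order):
        for tensor, readers in consumers.items():
            if node in readers:
                last_use[tensor] = i
    live = peak = 0
    for i, node in enumerate(order):
        live += size[produces[node]]       # output comes alive at its producer
        peak = max(peak, live)
        for tensor, j in last_use.items():
            if j == i:
                live -= size[tensor]       # freed after its last consumer runs
    return peak
```

Recomputation shortens a tensor's life cycle, which lowers the `peak` such an accounting reports; the peak memory threshold is compared against this quantity.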
  • the user can set the memory threshold, peak memory threshold, and data mutation threshold through the Context module configured in MindSpore.
  • the memory threshold, the peak memory threshold, and the data mutation threshold may also be referred to by other names; for example, the data mutation threshold may also be called a data threshold.
  • a single node can be understood as one node, or as a node obtained by merging multiple nodes.
  • N first node sets are determined, where N is a positive integer greater than or equal to 1; each first node set includes at least one node, and the nodes contained in a first node set are all directly or indirectly related to a tensor output by a producer in the first calculation graph, and directly or indirectly related to at least one tensor input by that producer.
  • recomputation is performed on each first node set to obtain N recalculation subgraphs, wherein the N recalculation subgraphs constitute the second computation graph.
  • the second computation graph may include N recomputation subgraphs.
  • the first node set can be understood as the required recomputation node set described above
  • the recalculation subgraph can be understood as the recomputation graph described above.
  • performing recomputation on each first node set to obtain N recomputation subgraphs may be implemented in the following manner. Specifically, for any node set in the N first node sets, the producer subgraph (also called a producer) corresponding to that node set may be copied first. Then, the nodes in the producer subgraph other than the nodes contained in that node set are deleted, along with the edges in the producer subgraph that are unrelated to that node set, so that the recomputation subgraph corresponding to that node set is obtained. By traversing each node set in the N first node sets, N recomputation subgraphs can be obtained.
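The copy-then-prune procedure described above can be sketched as follows; the adjacency-dict representation (node → set of successors) is an illustrative assumption:

```python
import copy

# Sketch: build one recomputation subgraph from a producer subgraph.
# Copy the producer subgraph, drop every node outside the recomputation
# node set, and drop every edge whose endpoint is outside the set.
# The graph representation is an assumption, not the patent's.

def build_recompute_subgraph(producer_subgraph, node_set):
    g = copy.deepcopy(producer_subgraph)
    # delete nodes other than those contained in the recomputation node set
    for node in list(g):
        if node not in node_set:
            del g[node]
    # delete edges unrelated to the recomputation node set
    for node, successors in g.items():
        g[node] = {s for s in successors if s in node_set}
    return g
```

For the Figure 14 example, pruning a copied FuseOp1 of `{'Gather', 'ScatterAdd'}` down to the node set `{'Gather'}` leaves only the Gather operator and its remaining edges.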
  • the first list may be obtained based on the data dependencies between the nodes in the first computation graph, and the first list includes the correspondence between producers and consumers in the first computation graph.
  • the first list can be the producer-consumer list described above, which can be as shown in Table 1 above; this step can be understood as the step of constructing the producer-consumer list in the recomputation step described above.
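A minimal sketch of building such a producer-consumer list from per-node tensor inputs and outputs (the data structures, and the Slice/FuseOp1/FuseOp2 names echoing Figure 15, are illustrative assumptions):

```python
# Sketch: derive the producer-consumer list (as in Table 1) from which
# tensors each node outputs and inputs. Structures are assumptions.

def producer_consumer_list(outputs, inputs):
    """outputs: node -> set of tensors it produces
    inputs:  node -> set of tensors it consumes
    returns: producer node -> set of consumer nodes"""
    made_by = {t: n for n, tensors in outputs.items() for t in tensors}
    plist = {}
    for consumer, tensors in inputs.items():
        for t in tensors:
            producer = made_by.get(t)
            if producer is not None and producer != consumer:
                plist.setdefault(producer, set()).add(consumer)
    return plist
```

The resulting map is exactly the data dependency view the candidate-set and screening steps operate on.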
  • a set of candidate nodes is obtained, which includes at least the first node set; for any producer in the first list, the set of nodes in that producer that are directly or indirectly related to the tensor output by that producer is regarded as one node set.
  • this step can be understood as a step of determining a recalculation candidate set in the recalculation step described above.
  • the candidate node set can be understood as the recalculation candidate set described above.
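Collecting the nodes of a producer that are directly or indirectly related to its output tensor amounts to a reverse reachability walk from the node that emits the tensor; a hedged sketch (the predecessor-map representation is our assumption):

```python
# Sketch: for one producer subgraph, gather every internal node that is
# directly or indirectly related to the producer's output tensor, by
# walking predecessor edges from the output node. Illustrative structures.

def candidate_node_set(preds, output_node):
    """preds: node -> set of predecessor nodes inside the producer"""
    seen, stack = set(), [output_node]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)          # directly or indirectly feeds the output
            stack.extend(preds.get(node, ()))
    return seen
```

Nodes of the producer that never feed the output tensor (like `x` in the test below) are excluded, which is what distinguishes a candidate node set from the whole producer.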
  • the first node set is selected from the candidate node sets.
  • this step can be understood as the step, in the recomputation step described above, of determining the recomputation graph from the recomputation candidate set by using one or more of screening based on memory size, screening based on data mutation, and screening based on operator fusion rules.
  • the parameter is a memory threshold; the memory occupied by the output tensor corresponding to the first set of nodes is higher than the memory threshold.
  • the node set corresponding to an output tensor that occupies too much memory can be automatically used as the required node set, thereby solving the problem of excessive memory occupation by a single operator and/or tensor.
  • the parameter is the peak memory threshold; there is an indirect path between the producer and the consumer corresponding to the first node set, and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold during execution.
  • the node set corresponding to high peak memory usage can be automatically used as the required node set, thereby solving the problem of excessive memory occupation by a single operator and/or tensor.
  • the parameter is a data mutation threshold; the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold .
  • the node set with data mutation can be automatically used as the required node set, thereby eliminating the problem of excessive memory usage caused by data mutation.
  • the parameter is an operator fusion rule; the first node set and the consumers corresponding to the first node set can be fused; or, the first node set and the consumers corresponding to the first node set cannot be fused, and the first node set There is an indirect path between the producer and consumer corresponding to the set. In this way, the required node set can be filtered out through operator fusion rules.
  • the parameters are the memory threshold and the peak memory threshold; there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or, there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold during execution.
  • the required node set can be filtered out by combining the memory threshold and the peak memory threshold, improving the efficiency and accuracy of screening.
  • the parameters are a memory threshold and a data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or, the memory occupied by the output tensor corresponding to the first node set is different from A deviation value between memory occupied by at least one input tensor corresponding to the first node set is higher than a data mutation threshold.
  • the required node set can be screened out by combining the memory threshold and the data mutation threshold, improving the efficiency and accuracy of screening.
  • the parameters are a memory threshold and an operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out through the combination of the memory threshold and the operator fusion rule, improving the efficiency and accuracy of screening.
  • the parameters are the peak memory threshold and the data mutation threshold; there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold during execution; and/or, the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
  • the required node set can be screened out by combining the peak memory threshold and the data mutation threshold, improving the efficiency and accuracy of screening.
  • the parameters are the peak memory threshold and operator fusion rules; the first node set meets one or more of the following conditions: there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold during execution; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set.
  • the required node set can be screened out by combining the peak memory threshold and operator fusion rules, improving the efficiency and accuracy of screening.
  • the parameters are the data mutation threshold and operator fusion rules; the first node set meets one or more of the following conditions: the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the data mutation threshold and operator fusion rules, improving the efficiency and accuracy of screening.
  • the parameters are a memory threshold, a peak memory threshold, and a data mutation threshold; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer is higher than the peak memory threshold during execution; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be screened out by combining the memory threshold, peak memory threshold, and data mutation threshold, improving the efficiency and accuracy of screening.
  • the parameters are the memory threshold, the peak memory threshold, and operator fusion rules; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer is higher than the peak memory threshold during execution; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, peak memory threshold, and operator fusion rules, improving the efficiency and accuracy of screening.
  • the parameters are the memory threshold, the data mutation threshold, and operator fusion rules; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the memory threshold, data mutation threshold, and operator fusion rules, improving the efficiency and accuracy of screening.
  • the parameters are the peak memory threshold, the data mutation threshold, and operator fusion rules; the first node set meets one or more of the following conditions: there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer is higher than the peak memory threshold during execution; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set. In this way, the required node set can be screened out by combining the peak memory threshold, data mutation threshold, and operator fusion rules, improving the efficiency and accuracy of screening.
  • the parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and operator fusion rules; the first node set meets one or more of the following conditions: there is no indirect path between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or there is an indirect path between the producer and the consumer corresponding to the first node set and the peak memory of each node between the producer and the consumer is higher than the peak memory threshold during execution; or the deviation value between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set and the consumers corresponding to the first node set can be fused; or the first node set and the consumers corresponding to the first node set cannot be fused and there is an indirect path between the producer and the consumer corresponding to the first node set.
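The screening conditions enumerated above can be summarized in a single predicate; the dictionary fields and the `can_fuse` callback below are illustrative assumptions, not the patent's interface:

```python
# Sketch of the screening predicate combining the four criteria:
# memory threshold, peak memory threshold, data mutation threshold,
# and operator fusion rule. All field names are assumptions.

def select_for_recompute(ns, mem_thr, peak_thr, mut_thr, can_fuse):
    """ns: dict describing one candidate node set."""
    # 1. memory: no indirect path, and the output tensor is too large
    if not ns['indirect_path'] and ns['out_mem'] > mem_thr:
        return True
    # 2. peak memory: indirect path exists and every node between the
    #    producer and consumer executes above the peak threshold
    if (ns['indirect_path'] and ns['between_peaks']
            and all(p > peak_thr for p in ns['between_peaks'])):
        return True
    # 3. data mutation: output deviates too far from at least one input
    if any(ns['out_mem'] - m > mut_thr for m in ns['in_mems']):
        return True
    # 4. fusion rule: the node set and its consumer can be fused, or
    #    they cannot be fused but an indirect path exists between them
    if can_fuse(ns) or ns['indirect_path']:
        return True
    return False
```

A deployment would typically enable only the subset of criteria whose thresholds the user configured, matching the parameter combinations listed above.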
  • operators in the first computation graph may be fused first. This improves the performance of the operators in the calculation graph; in addition, the first operator fusion reduces the number of operators, which reduces the overhead of analyzing operators in the subsequent recomputation process; it can also expand the scope over which operators are analyzed in the subsequent recomputation process, improving the effect of recomputation.
  • operators in the first computation graph may be fused based on a preset operator fusion rule.
  • after obtaining the second calculation graph, S1802 can be executed.
  • the second computation graph can be merged with the first computation graph, thus obtaining the third computation graph.
  • in the third calculation graph, there is a first directed edge between the first node outputting the first tensor and the second node inputting the first tensor, and a second directed edge between the third node outputting the second tensor and the fourth node inputting the second tensor; the first directed edge points from the first node to the second node, and the second directed edge points from the third node to the fourth node; the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph.
  • the process of obtaining the third calculation graph can be understood as the step of generating a new calculation graph in the recalculation step described above.
  • the directed edge between the node inputting the first target tensor in each of the N recomputation subgraphs and the node outputting the first target tensor in the first computation graph can be constructed, and the directed edge between the node outputting the second target tensor in each of the N recomputation subgraphs and the node inputting the second target tensor in the first computation graph can be constructed, so as to obtain the third computation graph.
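A hedged sketch of these two edge constructions, representing edges as a set of (source, destination) pairs (our own representation, with node names echoing the Figure 15 example):

```python
# Sketch: merge one recomputation subgraph into the original graph by
# adding the two directed edges described above. Edge representation
# and parameter names are illustrative assumptions.

def merge_recompute(edges, recompute_in, orig_out, recompute_out, orig_in):
    """recompute_in : node in the recomputation subgraph inputting the
                       first target tensor (e.g. tensor A)
    orig_out      : node in the original graph outputting that tensor
    recompute_out : node in the recomputation subgraph outputting the
                       second target tensor (e.g. tensor B)
    orig_in       : node in the original graph inputting it"""
    merged = set(edges)
    merged.add((orig_out, recompute_in))   # feed inputs to the recomputation
    merged.add((recompute_out, orig_in))   # route the recomputed tensor onward
    return merged
```

Repeating this for each of the N recomputation subgraphs yields the third computation graph.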
  • operators in the third computation graph may be fused based on but not limited to a preset operator fusion rule, so as to obtain a fourth computation graph.
  • the fourth computation graph may also be called an optimized computation graph.
  • the calculation graph can be executed.
  • the calculation graph can be compiled first and the compiled calculation graph executed, or the calculation graph can be executed directly; either way, it is essentially the fourth calculation graph that is executed, only in a different form.
  • the recomputation graph is obtained by first recomputing the desired computation graph, the obtained recomputation graph is merged with the original computation graph into a new computation graph, and operator fusion is then performed on the new computation graph to obtain the optimized computation graph; after that, the optimized computation graph can be executed. Optimizing the computation graph by combining recomputation and operator fusion thus achieves a significant reduction in memory usage without introducing large recomputation overhead, and solves the problem that networks with one or more large tensors cannot execute.
  • FIG. 19 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • a chip 1900 includes one or more processors 1901 and an interface circuit 1902 .
  • the chip 1900 may also include a bus 1903.
  • the processor 1901 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1901 or instructions in the form of software.
  • the above-mentioned processor 1901 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the interface circuit 1902 can be used for sending or receiving data, instructions or information.
  • the processor 1901 can use the data, instructions or other information received by the interface circuit 1902 to process, and can send the processing completion information through the interface circuit 1902 .
  • the chip further includes a memory, which may include a read-only memory and a random access memory, and provides operation instructions and data to the processor.
  • a portion of the memory may also include non-volatile random access memory (NVRAM).
  • the memory stores executable software modules or data structures
  • the processor can execute corresponding operations by calling operation instructions stored in the memory (the operation instructions can be stored in the operating system).
  • the interface circuit 1902 may be used to output the execution result of the processor 1901 .
  • processor 1901 and the interface circuit 1902 can be realized by hardware design, software design, or a combination of software and hardware, which is not limited here.
  • processor in the embodiments of the present application may be a central processing unit (central processing unit, CPU), and may also be other general processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • a general-purpose processor can be a microprocessor, or any conventional processor.
  • the method steps in the embodiments of the present application may be implemented by means of hardware, or may be implemented by means of a processor executing software instructions.
  • the software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may also be a component of the processor.
  • the processor and storage medium can be located in the ASIC.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • software When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in or transmitted via a computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.


Abstract

A computational graph optimization method, comprising: obtaining a second computational graph from a first computational graph on the basis of parameters and the data dependencies between nodes in the first computational graph; combining the second computational graph with the first computational graph to obtain a new computational graph; fusing operators in the new computational graph to obtain an optimized computational graph; and finally executing the optimized computational graph. In this way, the computational graph is optimized by combining re-computation with operator fusion, so that memory occupation is significantly reduced without introducing large re-computation overhead, which solves the problem that a network having one or more very large tensors cannot be executed.

Description

A computational graph optimization method, apparatus, and device

This application claims priority to Chinese Patent Application No. 202111431551.X, filed with the China National Intellectual Property Administration on November 29, 2021 and entitled "A Computational Graph Optimization Method, Apparatus, and Device", which is incorporated herein by reference in its entirety.

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a computational graph optimization method, apparatus, and device.

Background

Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that respond in ways similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and so on.

At present, with the continuous development of computer technology, AI networks have been widely applied. Moreover, AI networks are becoming increasingly complex, and the scale of their network models keeps growing: the number of network layers, the number of parameters, and the size of data sets are all increasing, so the memory consumed by the corresponding network models grows as well. Because the memory of current computer hardware is relatively small (commonly 16 gigabytes (GB) or 32 GB), it is difficult for current hardware to support an AI network model whose memory consumption keeps growing. Therefore, how to reduce the memory consumption of AI network models is a technical problem that urgently needs to be solved.

Summary

The present application provides a computational graph optimization method, apparatus, device, computer storage medium, computer program product, and chip. The computation graph is optimized by combining recomputation with operator fusion, which significantly reduces memory occupation without introducing large recomputation overhead, and solves the problem that a network having one or more very large tensors cannot be executed.

In a first aspect, the present application provides a computational graph optimization method, comprising: obtaining a second computation graph from a first computation graph based on parameters and the data dependencies between nodes in the first computation graph, where the parameters include one or more of an operator fusion rule, a memory threshold for the memory occupied by a tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to nodes in the first computation graph; the peak memory threshold is a threshold for the memory occupied by all tensors in use at the moment a node executes during execution of the computation graph, and the data mutation threshold is a threshold for a data mutation produced by at least one node in the first computation graph; merging the second computation graph with the first computation graph to obtain a third computation graph, where, in the third computation graph, there is a first directed edge between a first node that outputs a first tensor and a second node that takes the first tensor as input, and a second directed edge between a third node that outputs a second tensor and a fourth node that takes the second tensor as input; the first directed edge points from the first node to the second node, and the second directed edge points from the third node to the fourth node; the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph; fusing operators in the third computation graph to obtain a fourth computation graph; and executing the fourth computation graph. In this way, the second computation graph that needs to be recomputed is selected from the first computation graph based on the parameters, the second computation graph is merged with the first computation graph to obtain a new computation graph, operator fusion is performed on the new computation graph, and the fused computation graph is executed. The computation graph is thus optimized by combining recomputation with operator fusion, which significantly reduces memory occupation without introducing large recomputation overhead, and solves the problem that a network having one or more very large tensors cannot be executed.
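The four-step flow of the first aspect (select a recomputation graph, merge it back, fuse operators, execute) can be illustrated with a toy sketch. All names (`Graph`, `select_second_graph`, `merge`) and the dict-based adjacency representation are invented for illustration and are not part of the claimed method:

```python
# Minimal sketch of the first aspect, assuming a dict-of-lists graph
# representation (hypothetical; the application prescribes no data structure).
from dataclasses import dataclass, field

@dataclass
class Graph:
    # adjacency: node name -> list of successor node names (directed edges)
    edges: dict = field(default_factory=dict)

    def add_edge(self, src, dst):
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])

def select_second_graph(g1, params):
    # Step 1: select the nodes to recompute based on the parameters (memory
    # threshold, peak memory threshold, data mutation threshold, fusion rule).
    # Here the selection is pre-computed and passed in for simplicity.
    g2 = Graph()
    for n in params["recompute_nodes"]:
        g2.edges.setdefault("copy_of_" + n, [])
    return g2

def merge(g1, g2):
    # Step 2: union of both graphs (the connecting directed edges between the
    # original and recomputed nodes would be added here as well).
    g3 = Graph()
    g3.edges = {**{k: list(v) for k, v in g1.edges.items()},
                **{k: list(v) for k, v in g2.edges.items()}}
    return g3

g1 = Graph(); g1.add_edge("A", "B"); g1.add_edge("B", "C")
g2 = select_second_graph(g1, {"recompute_nodes": ["B"]})
g3 = merge(g1, g2)
print(sorted(g3.edges))  # ['A', 'B', 'C', 'copy_of_B']
```

Steps 3 and 4 (operator fusion and execution) would then run on `g3`.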

In a possible implementation, obtaining the second computation graph from the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically includes: obtaining, based on the parameters and the data dependencies between nodes in the first computation graph, N first node sets from the producers included in the first computation graph, where N is a positive integer greater than or equal to 1, a first node set includes at least one node, and the nodes contained in one first node set are all directly or indirectly related to a tensor output by one producer in the first computation graph and are all directly or indirectly related to at least one tensor input to that producer; and recomputing each first node set to obtain N recomputation subgraphs, where the N recomputation subgraphs constitute the second computation graph. In this way, the second computation graph can be obtained from the first computation graph according to the parameters and the data dependencies between nodes in the first computation graph.

In a possible implementation, the parameter is the memory threshold, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold. In this way, the node set whose output tensor occupies too much memory can be automatically taken as the required node set, thereby solving the problem of a single operator and/or tensor occupying too much memory.
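The single-tensor memory condition above can be sketched as a simple filter. The field names, byte sizes, and threshold value below are hypothetical:

```python
# Sketch: keep only the node sets whose output tensor exceeds the
# single-tensor memory threshold (sizes in bytes; names invented).
def filter_by_memory_threshold(node_sets, memory_threshold):
    return [s for s in node_sets if s["output_bytes"] > memory_threshold]

sets_ = [
    {"name": "set1", "output_bytes": 4 << 30},   # 4 GiB output tensor
    {"name": "set2", "output_bytes": 64 << 20},  # 64 MiB output tensor
]
selected = filter_by_memory_threshold(sets_, memory_threshold=1 << 30)  # 1 GiB
print([s["name"] for s in selected])  # ['set1']
```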

In a possible implementation, the parameter is the peak memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold. In this way, the node set whose peak memory usage is too high can be automatically taken as the required node set, thereby solving the problem of a single operator and/or tensor occupying too much memory.
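The peak-memory condition can be illustrated by walking the nodes between a producer and its consumer in execution order and taking the maximum live memory. The schedule and byte figures below are invented for illustration:

```python
# Sketch of a peak-memory check over the region between a producer and its
# consumer (hypothetical node names and sizes).
def peak_memory(schedule, live_bytes):
    # schedule: node names in execution order
    # live_bytes: node -> total bytes of tensors live while that node runs
    return max(live_bytes[n] for n in schedule)

schedule = ["producer", "n1", "n2", "consumer"]
live = {"producer": 2 << 30, "n1": 6 << 30, "n2": 5 << 30, "consumer": 3 << 30}
exceeds = peak_memory(schedule, live) > (4 << 30)  # 4 GiB threshold
print(exceeds)  # True
```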

In a possible implementation, the parameter is the data mutation threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the node set in which a data mutation occurs can be automatically taken as the required node set, thereby eliminating the problem of excessive memory usage caused by data mutations.
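The data-mutation condition compares the memory of a node set's output tensor with the memory of its input tensor(s). The application does not fix a particular deviation measure, so the sketch below uses a plain difference against the largest input as one possible interpretation:

```python
# Sketch of a data-mutation check (deviation measure is an assumption).
def data_mutation(output_bytes, input_bytes_list):
    # Deviation between the output tensor's memory and the largest
    # input tensor's memory.
    return output_bytes - max(input_bytes_list)

# A node set that expands a 16 MiB input into a 4 GiB output:
deviation = data_mutation(4 << 30, [16 << 20, 8 << 20])
print(deviation > (1 << 30))  # True: exceeds a 1 GiB mutation threshold
```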

In a possible implementation, the parameter is the operator fusion rule; the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by means of the operator fusion rule.
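Logically, the fusion-rule condition reads "fusable with the consumer, or (not fusable and an indirect producer-consumer path exists)", which simplifies to a plain disjunction. The predicate names below are stand-ins for a framework's real fusion rules and reachability check:

```python
# Sketch of the fusion-rule filter; `can_fuse` and `has_indirect_path` are
# hypothetical stand-ins for the framework's fusion rules and path check.
def qualifies(node_set, can_fuse, has_indirect_path):
    # can_fuse(s) or (not can_fuse(s) and has_indirect_path(s))
    # simplifies to:
    return can_fuse(node_set) or has_indirect_path(node_set)

print(qualifies({"name": "s"}, lambda s: False, lambda s: True))  # True
```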

In a possible implementation, the parameters are the memory threshold and the peak memory threshold; no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold. In this way, the required node set can be selected by combining the memory threshold and the peak memory threshold, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold and the data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be selected by combining the memory threshold and the data mutation threshold, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold and the operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the memory threshold and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the peak memory threshold and the data mutation threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be selected by combining the peak memory threshold and the data mutation threshold, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the peak memory threshold and the operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the peak memory threshold and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the data mutation threshold and the operator fusion rule; the first node set meets one or more of the following conditions: the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the data mutation threshold and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold, the peak memory threshold, and the data mutation threshold; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, the required node set can be selected by combining the memory threshold, the peak memory threshold, and the data mutation threshold, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold, the peak memory threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the memory threshold, the peak memory threshold, and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the memory threshold, the data mutation threshold, and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the peak memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the peak memory threshold, the data mutation threshold, and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, the parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node set can be selected by combining the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule, improving the efficiency and accuracy of the selection.

In a possible implementation, obtaining at least one first node set from the producers included in the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically includes:

obtaining a first list based on the data dependencies between nodes in the first computation graph, where the first list includes the correspondence between producers and consumers in the first computation graph; obtaining candidate node sets according to the tensor output by each producer in the first list, where the candidate node sets include at least the first node set, and, for any producer in the first list, the set of nodes within that producer that are directly or indirectly related to the tensor output by that producer is taken as one node set; and selecting the first node set from the candidate node sets based on the parameters. In this way, the first node set can be obtained from the producers in the first computation graph.
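The derivation of the first list (producer-to-consumer correspondences) from the graph's data dependencies can be sketched as follows; the edge-pair representation and node names are illustrative:

```python
# Sketch: derive the "first list" (producer -> consumers) from a toy
# graph's directed edges (hypothetical representation).
def build_first_list(edges):
    # edges: (src, dst) pairs; a producer is any node with >= 1 consumer
    first_list = {}
    for src, dst in edges:
        first_list.setdefault(src, []).append(dst)
    return first_list

edges = [("A", "B"), ("A", "C"), ("B", "C")]
first_list = build_first_list(edges)
print(first_list["A"])  # ['B', 'C']
```

The candidate node sets would then be formed per producer from this list, and the first node sets selected from the candidates according to the parameters.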

In a possible implementation, recomputing each first node set to obtain N recomputation subgraphs specifically includes: for any node set among the N first node sets, copying the producer subgraph corresponding to that node set; and deleting, from the producer subgraph, the nodes other than those contained in that node set, as well as the edges unrelated to that node set, to obtain the recomputation subgraph corresponding to that node set.
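The copy-then-prune construction above can be sketched on a dict-of-lists adjacency; node names are illustrative:

```python
# Sketch of the recomputation-subgraph construction: copy the producer
# subgraph, then prune nodes outside the node set and edges not incident
# to it (hypothetical representation).
def recompute_subgraph(producer_subgraph, node_set):
    copied = {n: list(succ) for n, succ in producer_subgraph.items()}  # copy
    pruned = {n: [s for s in succ if s in node_set]
              for n, succ in copied.items() if n in node_set}          # prune
    return pruned

producer = {"x": ["y", "z"], "y": ["z"], "z": []}
print(recompute_subgraph(producer, {"x", "y"}))  # {'x': ['y'], 'y': []}
```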

In a possible implementation, merging the second computation graph with the first computation graph to obtain the third computation graph specifically includes: constructing, for each of the N recomputation subgraphs, a directed edge between the node in the recomputation subgraph that takes a first target tensor as input and the node in the first computation graph that outputs the first target tensor, and constructing, for each of the N recomputation subgraphs, a directed edge between the node in the recomputation subgraph that outputs a second target tensor and the node in the first computation graph that takes the second target tensor as input, to obtain the third computation graph.
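The merge step wires each recomputation subgraph into the first computation graph with the two directed edges described above. In this hypothetical sketch, the subgraph's entry node is connected to the node that outputs its input tensor, and its exit node is connected to the node that consumes its output tensor:

```python
# Sketch of the merge: adjacency dicts; all node names are illustrative.
def merge_with_edges(g1, recompute_sub, input_src, output_dst):
    g3 = {n: list(s) for n, s in g1.items()}
    for n, succ in recompute_sub.items():
        g3.setdefault(n, []).extend(succ)
    g3[input_src].append(next(iter(recompute_sub)))  # first directed edge
    last = [n for n, s in recompute_sub.items() if not s][0]
    g3[last].append(output_dst)                      # second directed edge
    return g3

g1 = {"A": ["B"], "B": ["C"], "C": []}
sub = {"B'": []}  # recomputation copy of B
g3 = merge_with_edges(g1, sub, input_src="A", output_dst="C")
print(g3["A"], g3["B'"])  # ['B', "B'"] ['C']
```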

In a possible implementation, before the second computation graph that needs to be recomputed is obtained from the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph, the method further includes: fusing operators in the first computation graph. This improves the performance of the operators in the computation graph. In addition, this first round of operator fusion reduces the number of operators, which reduces the overhead of analyzing operators in the subsequent recomputation process; it also expands the scope of operator analysis in the subsequent recomputation process, improving the effect of the recomputation.

In a second aspect, the present application provides a computational graph optimization apparatus, comprising: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method provided in the first aspect.

In a third aspect, the present application provides a device, comprising: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method provided in the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when run on an electronic device, causes the electronic device to perform the method provided in the first aspect.

In a fifth aspect, the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the method provided in the first aspect.

In a sixth aspect, the present application provides a chip, comprising at least one processor and an interface; the interface is configured to provide program instructions or data for the at least one processor; and the at least one processor is configured to execute the program instructions to implement the method provided in the first aspect.

It can be understood that, for the beneficial effects of the second to sixth aspects above, reference may be made to the related description in the first aspect, which is not repeated here.

Description of Drawings

下面对实施例或现有技术描述中所需使用的附图作简单地介绍。The following briefly introduces the drawings used in the embodiments or the description of the prior art.

图1是本申请实施例提供的一种人工智能主体框架的示意图;Fig. 1 is a schematic diagram of an artificial intelligence subject framework provided by an embodiment of the present application;

图2是本申请实施例提供的一种AI网络的网络模型的训练过程示意图;FIG. 2 is a schematic diagram of a training process of a network model of an AI network provided by an embodiment of the present application;

图3是本申请实施例提供的另一种AI网络的网络模型的训练过程示意图;Fig. 3 is a schematic diagram of the training process of the network model of another AI network provided by the embodiment of the present application;

图4是本申请实施例提供的一种算子融合前后的对比示意图;Fig. 4 is a schematic diagram of comparison before and after operator fusion provided by the embodiment of the present application;

图5是本申请实施例提供的一种算子融合前后的峰值内存对比示意图;Fig. 5 is a schematic diagram of peak memory comparison before and after operator fusion provided by the embodiment of the present application;

图6是本申请实施例提供的一种AI开源计算框架下的图算融合过程示意图;FIG. 6 is a schematic diagram of a graph-calculation fusion process under an AI open-source computing framework provided by an embodiment of the present application;

图7是本申请实施例提供的一种算子融合过程示意图;FIG. 7 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;

图8是本申请实施例提供的一种对计算图中算子进行拓扑排序的示意图;Fig. 8 is a schematic diagram of topologically sorting operators in a calculation graph provided by an embodiment of the present application;

图9是本申请实施例提供的一种通过内存方式筛选重计算节点集合的步骤示意图;FIG. 9 is a schematic diagram of steps of screening a recomputation node set based on memory according to an embodiment of the present application;

图10是本申请实施例提供的另一种通过内存方式筛选重计算节点集合的步骤示意图;FIG. 10 is a schematic diagram of other steps of screening a recomputation node set based on memory according to an embodiment of the present application;

图11是本申请实施例提供的一种计算图的示意图;Fig. 11 is a schematic diagram of a calculation graph provided by the embodiment of the present application;

图12是本申请实施例提供的一种对计算图进行重计算前后的峰值内存对比示意图;Fig. 12 is a schematic diagram of peak memory comparison before and after recalculation of the calculation graph provided by the embodiment of the present application;

图13是本申请实施例提供的一种通过算子融合规则筛选重计算节点集合的步骤示意图;FIG. 13 is a schematic diagram of steps of screening a recomputation node set based on operator fusion rules according to an embodiment of the present application;

图14是本申请实施例提供的一种确定重计算图的过程示意图;Fig. 14 is a schematic diagram of a process of determining a recalculation graph provided by an embodiment of the present application;

图15是本申请实施例提供的一种确定出的重计算图与对原计算图进行一次算子融合后得到的计算图合并的过程示意图;FIG. 15 is a schematic diagram of a process of merging a determined recomputation graph with a computation graph obtained after one round of operator fusion on the original computation graph according to an embodiment of the present application;

图16是本申请实施例提供的一种算子融合的过程示意图;FIG. 16 is a schematic diagram of an operator fusion process provided by an embodiment of the present application;

图17是本申请实施例提供的一种采用算子融合和重计算的方式对计算图进行优化前后所需的内存大小的对比示意图;FIG. 17 is a schematic comparison diagram of the memory sizes required before and after a computation graph is optimized by means of operator fusion and recomputation according to an embodiment of the present application;

图18是本申请实施例提供的一种计算图优化方法的流程示意图;Fig. 18 is a schematic flow chart of a calculation graph optimization method provided by an embodiment of the present application;

图19是本申请实施例提供的一种芯片的结构示意图。FIG. 19 is a schematic structural diagram of a chip provided by an embodiment of the present application.

具体实施方式Detailed Description of Embodiments

本文中术语“和/或”,是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。本文中符号“/”表示关联对象是或者的关系,例如A/B表示A或者B。The term "and/or" herein describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, or only B exists. The symbol "/" herein indicates an "or" relationship between the associated objects; for example, A/B indicates A or B.

本文中的说明书和权利要求书中的术语“第一”和“第二”等是用于区别不同的对象,而不是用于描述对象的特定顺序。例如,第一响应消息和第二响应消息等是用于区别不同的响应消息,而不是用于描述响应消息的特定顺序。The terms "first" and "second" and the like in the specification and claims herein are used to distinguish different objects, rather than to describe a specific order of objects. For example, the first response message and the second response message are used to distinguish different response messages, rather than describing a specific order of the response messages.

在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。In the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations or illustrations. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete manner.

在本申请实施例的描述中,除非另有说明,“多个”的含义是指两个或者两个以上,例如,多个处理单元是指两个或者两个以上的处理单元等;多个元件是指两个或者两个以上的元件等。In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more; for example, multiple processing units refer to two or more processing units, and multiple elements refer to two or more elements.

为便于理解,先对本方案中涉及的技术术语进行介绍。For ease of understanding, the technical terms involved in this solution are introduced first.

(1)图算融合(1) Graph computing fusion

图算融合是MindSpore特有的网络性能优化技术。它可以通过自动分析和优化现有网络计算图逻辑,并结合目标硬件能力,对计算图进行计算化简和替代、算子拆分和融合、算子特例化编译等优化,以提升设备计算资源利用率,实现对网络性能的整体优化。相比传统优化技术,图算融合具有多算子跨边界联合优化、与MindAKG(基于Polyhedral的算子编译器)跨层协同、即时编译等独特优势。另外,图算融合只需要用户打开对应配置后,整个优化过程即可自动完成,不需要网络开发人员进行其它额外感知,使得用户可以聚焦网络算法实现。Graph computing fusion is a network performance optimization technique specific to MindSpore. By automatically analyzing and optimizing the logic of an existing network computation graph, combined with the capabilities of the target hardware, it performs optimizations such as computation simplification and substitution, operator splitting and fusion, and operator-specialized compilation, so as to improve the utilization of device computing resources and achieve overall optimization of network performance. Compared with traditional optimization techniques, graph computing fusion has unique advantages such as joint cross-boundary optimization of multiple operators, cross-layer collaboration with MindAKG (a Polyhedral-based operator compiler), and just-in-time compilation. In addition, the user only needs to enable the corresponding configuration, after which the entire optimization process completes automatically without any additional awareness from network developers, allowing users to focus on implementing network algorithms.

(2)张量(Tensor)(2) Tensor

张量(Tensor)是AI框架中用于表示数据的一种数据结构,通常由形状(表示数据的大小与维度)、数据类型(如float32、int等)、内存地址等相关属性构成。A tensor is a data structure used to represent data in an AI framework. It typically consists of a shape (representing the size and dimensionality of the data), a data type (such as float32 or int), a memory address, and other related attributes.
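作为纯示意性说明(非实施例的一部分),上述的形状、数据类型、内存地址等张量属性可以用如下的Python结构来刻画;其中的属性名和每元素字节数均为假定值。As a purely illustrative sketch (not part of the embodiments), the tensor attributes above, namely shape, data type, and memory address, can be modeled with the following Python structure; the attribute names and per-element byte sizes are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    """Minimal tensor metadata as described above (illustrative only)."""
    shape: tuple  # size and dimensionality of the data
    dtype: str    # e.g. "float32", "int32"
    addr: int     # memory address (here just an integer handle)

    def nbytes(self) -> int:
        """Memory footprint: element count times assumed element size."""
        sizes = {"float32": 4, "float16": 2, "int32": 4}
        n = 1
        for d in self.shape:
            n *= d
        return n * sizes[self.dtype]

t = Tensor(shape=(2, 3), dtype="float32", addr=0)
print(t.nbytes())  # 24
```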

(3)算子(3) operator

在深度学习领域,算子一般指一个计算单元,可以完成类似加、减、乘、除等简单或复合的计算,在加速硬件上通常以算子为单位进行执行。In the field of deep learning, an operator generally refers to a computing unit that performs a simple or compound computation such as addition, subtraction, multiplication, or division; on acceleration hardware, execution is usually performed in units of operators.

(4)内存复用(4) Memory reuse

内存复用是一种优化技术。在深度学习网络中可以通过将不同层的Tensor指向相同的地址来实现对物理内存的复用,以达成内存占用降低的目的。Memory reuse is an optimization technique. In a deep learning network, physical memory can be reused by pointing tensors of different layers to the same address, thereby reducing memory usage.
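上述复用可以用一个最小的Python示意来说明:两个生命周期不重叠的逻辑张量绑定到同一块物理缓冲区;其中的bytearray缓冲区仅为设备内存的假定替身。The reuse described above can be illustrated with a minimal Python sketch in which two logical tensors with non-overlapping lifetimes are bound to the same physical buffer; the bytearray here is only an assumed stand-in for device memory:

```python
# One physical allocation standing in for device memory (assumption).
buffer = bytearray(16)

layer1_out = memoryview(buffer)   # tensor of an earlier layer
# ... layer1_out is consumed here, so its storage may be reused ...
layer2_out = memoryview(buffer)   # tensor of a later layer, same address

layer2_out[:4] = b"\x01\x02\x03\x04"
print(layer1_out[0])  # 1: both views alias the same physical memory
```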

(5)计算图(5) Calculation graph

计算图是一个以算子(Operators)为节点的有向图所表达的计算函数。计算图主要是用节点(nodes)和边(edges)表示。计算图中的节点可以代表算子。计算图中两个算子之间的边可以代表两个算子之间的关系。其中,边可以有方向,边的方向可以为算子间数据的流向。例如,当计算图中算子a与算子b之间的边是由算子a指向算子b时,则算子a输出的tensor为算子b的输入。在AI框架里面,计算图所表达的计算函数对输入tensor依次调用执行有向图中的算子节点,并得到最终的输出tensor。A computation graph is a computation function expressed by a directed graph whose nodes are operators. A computation graph is mainly represented by nodes and edges. A node in the computation graph can represent an operator. An edge between two operators in the computation graph can represent the relationship between the two operators. An edge can have a direction, and the direction of an edge can be the flow direction of data between the operators. For example, when the edge between operator a and operator b in the computation graph points from operator a to operator b, the tensor output by operator a is the input of operator b. In an AI framework, the computation function expressed by the computation graph invokes and executes the operator nodes in the directed graph in order on the input tensors, and obtains the final output tensor.
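上述“节点为算子、边为数据流向”的执行模型可以用如下示意代码表达;其中的算子函数与名称均为假定,并非实施例中的算子。The node-and-edge execution model described above can be sketched as follows; the operator functions and names are illustrative assumptions, not operators of the embodiments:

```python
# Minimal computation-graph sketch: nodes are operators, directed edges
# carry a tensor from the producing node to the consuming node.
class Node:
    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, list(inputs)

def run(graph_output, feed):
    """Evaluate the graph by executing each node's producers first."""
    cache = dict(feed)
    def eval_node(node):
        if node.name not in cache:
            args = [eval_node(i) for i in node.inputs]
            cache[node.name] = node.fn(*args)
        return cache[node.name]
    return eval_node(graph_output)

a = Node("a", None)                  # graph input
b = Node("b", lambda x: x + 1, [a])  # edge a -> b: a's output feeds b
c = Node("c", lambda x: x * 2, [b])  # edge b -> c
print(run(c, {"a": 3}))  # (3 + 1) * 2 = 8
```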

(6)峰值内存(6) Peak memory

峰值内存是指在计算图执行过程中,某个算子执行的时刻所有需使用的张量所占的内存。例如,请参阅图5的(A),在算子D执行的时刻,算子A输出的张量、算子C输出的张量和算子D所要输出的张量均为所需使用的张量,因此此时的峰值内存为T0+T1+T2;其中,算子B的输入张量与算子C的输出张量复用一个内存。Peak memory refers to the memory occupied by all the tensors that need to be used when an operator is executed during the execution of the calculation graph. For example, please refer to (A) in Figure 5. At the moment when operator D is executed, the tensors output by operator A, the tensors output by operator C, and the tensors to be output by operator D are all tensors that need to be used. Therefore, the peak memory at this time is T0+T1+T2; among them, the input tensor of operator B and the output tensor of operator C reuse one memory.
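在假定每个张量占一个单位内存、且内存复用以“张量在最后一次被使用的步骤之后即被释放”来建模的前提下,上述峰值内存的定义可以示意如下。Under the assumption that each tensor occupies one unit of memory and that memory reuse is modeled by freeing a tensor after the last step that needs it, the peak-memory definition above can be sketched as:

```python
# Illustrative peak-memory calculation: at each execution step, sum the
# sizes of all tensors that are live (already produced, not yet freed).
def peak_memory(schedule, size, live_until):
    """schedule: ordered op names; size[op]: memory of op's output;
    live_until[op]: index of the last step that needs op's output."""
    peak = 0
    for step, op in enumerate(schedule):
        live = sum(size[p] for p in schedule[:step + 1]
                   if live_until[p] >= step)
        peak = max(peak, live)
    return peak

# Assumed sizes T0 = T1 = T2 = 1 for a Figure 5 (A)-style example;
# B's output is consumed by C and then freed (its slot is reused).
sched = ["A", "B", "C", "D"]
size = {"A": 1, "B": 1, "C": 1, "D": 1}
live = {"A": 3, "B": 2, "C": 3, "D": 3}
print(peak_memory(sched, size, live))  # 3, i.e. T0 + T1 + T2
```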

(7)生产者(7) Producer

生产者是在计算图中具有数据依赖关系的两个计算子图中生成数据的计算子图,即生成数据的算子。例如,请参阅图4,以算子O1和算子O2为例,算子O1和算子O2具有数据依赖关系,其中,算子O1的输出张量为算子O2的输入张量,因此,算子O1可以称之为生产者。A producer is, of two computation subgraphs having a data dependency in the computation graph, the subgraph that generates data, that is, the operator that generates data. For example, referring to FIG. 4, taking operator O1 and operator O2 as an example, operator O1 and operator O2 have a data dependency, where the output tensor of operator O1 is the input tensor of operator O2; therefore, operator O1 may be called the producer.

(8)消费者(8) Consumers

消费者是在计算图中具有数据依赖关系的两个计算子图中使用数据的计算子图,即使用数据的算子。例如,请参阅图4,以算子O1和算子O2为例,算子O1和算子O2具有数据依赖关系,其中,算子O1的输出张量为算子O2的输入张量,因此,算子O2可以称之为消费者。A consumer is, of two computation subgraphs having a data dependency in the computation graph, the subgraph that uses data, that is, the operator that uses data. For example, referring to FIG. 4, taking operator O1 and operator O2 as an example, operator O1 and operator O2 have a data dependency, where the output tensor of operator O1 is the input tensor of operator O2; therefore, operator O2 may be called the consumer.

(9)节点集合(9) Node set

节点集合是指一个节点或多个节点构成的集合。一个节点集合所包含的节点均与计算图中一个生产者输出的一个张量直接相关或间接相关,且均与该生产者输入的至少一个张量直接相关或间接相关。此外,也可以理解为,当节点集合中包括多个节点时,多个节点组成的计算图可从外部获取至少一个张量并向外部输出一个张量,也即是说,这多个节点中至少一个节点可以获取到外部输入的张量,且仅有一个节点可以向外部输出张量,而其他的节点输出的张量仅能在该计算图内部使用。A node set refers to a set consisting of one node or multiple nodes. The nodes contained in a node set are all directly or indirectly related to a tensor output by one producer in the computation graph, and are all directly or indirectly related to at least one tensor input to that producer. In addition, it may also be understood that, when the node set includes multiple nodes, the computation graph composed of these nodes can obtain at least one tensor from the outside and output one tensor to the outside; that is, at least one of these nodes can obtain an externally input tensor, and only one node can output a tensor to the outside, while the tensors output by the other nodes can only be used inside that computation graph.

举例来说,如图11所示,Gather、Scatter和Div可以构成一个节点集合,在这三个节点所组成的计算图中,Gather、Scatter和Div依次相连,其中,Gather和Scatter间具有边,Scatter和Div间具有边,这三个节点均与其所对应的生产者FuseOp1输出的张量B直接相关或间接相关,且均与生产者FuseOp1输入的张量A直接相关或间接相关;换言之,这三个节点所组成的计算图可以通过Gather从外部获取一个张量A,并可以通过Div向外部输出一个张量B,其中,Gather和Scatter均不能向计算图外部输出张量,它们输出的张量仅能在计算图内部使用。如图14的(B)所示,Gather可以构成一个节点集合,该节点集合对应的计算图可以如图14的(C)所示,该计算图可以从外部获取一个张量A,并可以向外部输出一个张量B。For example, as shown in FIG. 11, Gather, Scatter, and Div may constitute a node set. In the computation graph composed of these three nodes, Gather, Scatter, and Div are connected in sequence: there is an edge between Gather and Scatter, and an edge between Scatter and Div. All three nodes are directly or indirectly related to the tensor B output by their corresponding producer FuseOp1, and are directly or indirectly related to the tensor A input to the producer FuseOp1. In other words, the computation graph composed of these three nodes can obtain a tensor A from the outside through Gather, and can output a tensor B to the outside through Div, where neither Gather nor Scatter can output a tensor to the outside of the computation graph; the tensors they output can only be used inside the computation graph. As shown in (B) of FIG. 14, Gather alone may constitute a node set; the computation graph corresponding to this node set may be as shown in (C) of FIG. 14, and this computation graph can obtain a tensor A from the outside and output a tensor B to the outside.
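上述“节点集合中仅有一个节点向外部输出张量”这一性质可以用一个小的检查函数示意;节点名沿用图11的例子,外部消费者映射为假定值。The single-external-output property of a node set described above can be checked with a small sketch; the node names follow the FIG. 11 example, and the external-consumer map is an assumption:

```python
# Sketch of the node-set property described above: within a candidate
# set, exactly one node may produce a tensor consumed outside the set.
def is_valid_node_set(nodes, external_consumers):
    """nodes: names in the candidate set; external_consumers[n]: nodes
    outside the set that consume n's output (empty if none)."""
    external_outputs = [n for n in nodes if external_consumers.get(n)]
    return len(external_outputs) == 1

# FIG. 11-style example: only Div's output leaves the set.
nodes = {"Gather", "Scatter", "Div"}
ext = {"Gather": [], "Scatter": [], "Div": ["FuseOp2"]}
print(is_valid_node_set(nodes, ext))  # True
```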

(10)节点间的数据依赖关系(10) Data dependencies between nodes

当一个节点所需的数据需要由另一个节点提供时,这两个节点之间则具有数据依赖关系。例如,当一个节点输出的张量为另一个节点输入的张量时,则这两个节点之间具有数据依赖关系。When the data required by one node needs to be provided by another node, there is a data dependency between the two nodes. For example, when a tensor output by one node is a tensor input by another node, there is a data dependency between the two nodes.

(11)数据突变阈值(11) Data mutation threshold

数据突变阈值是指计算图中至少一个节点产生数据突变的阈值。以一个节点为例,数据突变阈值则为该节点输出的张量所占的内存与该节点输入的张量所占的内存之间偏差值(比如:差值或比值等)的阈值。以多个节点为例,此时这多个节点可以构成上文所描述的节点集合,数据突变阈值则为这多个节点所构成的节点集合对应的输出张量所占的内存与该节点集合对应的至少一个输入张量所占的内存之间偏差值(比如:差值或比值等)的阈值。其中,节点集合对应的输出张量是指该节点集合中的节点所组成的计算图向外部输出的张量,节点集合对应的输入张量是指该节点集合中的节点所组成的计算图从外部获取到的张量。The data mutation threshold refers to a threshold at which at least one node in the computation graph produces a data mutation. Taking a single node as an example, the data mutation threshold is a threshold on the deviation (for example, a difference or a ratio) between the memory occupied by the tensor output by the node and the memory occupied by the tensor input to the node. Taking multiple nodes as an example, these nodes may constitute the node set described above, and the data mutation threshold is then a threshold on the deviation (for example, a difference or a ratio) between the memory occupied by the output tensor corresponding to the node set and the memory occupied by the at least one input tensor corresponding to the node set. Here, the output tensor corresponding to a node set refers to the tensor that the computation graph composed of the nodes in the set outputs to the outside, and the input tensor corresponding to a node set refers to a tensor that this computation graph obtains from the outside.
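以偏差取比值形式为例(其中的阈值和字节数均为假定值),数据突变的判断可以示意如下。Taking the ratio form of the deviation as an example (the threshold value and byte counts below are assumptions), the data-mutation test can be sketched as:

```python
# Sketch of the data-mutation test: compare the memory of a node set's
# external output tensor against its external input tensors, and flag a
# "mutation" when the deviation (here a ratio) reaches the threshold.
def exceeds_mutation_threshold(input_bytes, output_bytes, ratio_threshold):
    total_in = sum(input_bytes)
    return output_bytes / total_in >= ratio_threshold

# e.g. a Gather-like node set expanding a 4-byte input to 64 bytes.
print(exceeds_mutation_threshold([4], 64, ratio_threshold=8))  # True
```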

接下来对本申请实施例中的技术方案进行描述。Next, the technical solutions in the embodiments of the present application will be described.

示例性的,图1示出了一种人工智能主体框架的示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。Exemplarily, FIG. 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system, and is applicable to general artificial intelligence field requirements.

下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主体框架进行阐述。The following elaborates on the above artificial intelligence main framework from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data, information, knowledge, wisdom".

“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provisioning and processing technology implementations) to the industrial ecological process of the system.

(1)基础设施:(1) Infrastructure:

基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform. Communicate with the outside through sensors; computing power is provided by smart chips (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform includes distributed computing framework and network and other related platform guarantees and supports, which can include cloud storage and Computing, interconnection network, etc. For example, sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.

(2)数据(2) data

基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3)数据处理(3) Data processing

数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.

其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。Among them, machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.

推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies. The typical functions are search and matching.

决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.

(4)通用能力(4) General ability

对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。After the above-mentioned data processing is performed on the data, some general capabilities can further be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.

(5)智能产品及行业应用(5) Smart products and industry applications

智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, safe city, smart terminals, and the like.

需要说明的是,本申请涉及到的算子融合过程可以位于上述(3)中的数据处理阶段。It should be noted that the operator fusion process involved in this application can be located in the data processing stage in (3) above.

示例性的,图2示出了一种AI网络的网络模型的训练过程。该网络模型的训练过程可以包括前向计算过程和后向计算过程。如图2所示,虚线x左侧的部分为前向计算过程,虚线x右侧的部分为后向计算过程,图中每个方块均代表一个张量(tensor)。在图2中,处于同一行的张量复用同一个内存地址,比如:张量A和张量Ab复用一个内存地址,张量B和张量Bb复用一个内存地址,…,张量F和张量Fb复用一个内存地址等。此外,图2中,张量Ab、张量Bb、张量Cb、张量Db、张量Eb、张量Fb分别是与张量A、张量B、张量C、张量D、张量E、张量F对应的反向算子数据。在图2中,在网络模型的训练过程中,可以先进行前向计算,然后再进行后向计算。其中,前向计算得到的张量均会存储在相应的计算设备的内存中。在后向计算过程中,张量Ab的计算依赖于张量In和张量Bb,张量Bb的计算依赖于张量A和张量Cb,张量Cb的计算依赖于张量B和张量Db,张量Db的计算依赖于张量C和张量Eb,等等,其中,在后向计算过程中,计算得到一个张量后,则会释放该张量所对应的前向计算得到的张量的内存,例如,计算得到张量Bb后,则释放张量A所占的内存,这样后续计算得到的张量Ab就可以复用张量A所占的内存。可见,在图2中所示的网络模型的训练过程中,前向计算得到的数据会长时间占用内存,进而导致该网络模型的内存消耗过大。Exemplarily, FIG. 2 shows a training process of a network model of an AI network. The training process of the network model may include a forward computation process and a backward computation process. As shown in FIG. 2, the part to the left of the dotted line x is the forward computation process, the part to the right of the dotted line x is the backward computation process, and each block in the figure represents a tensor. In FIG. 2, tensors in the same row reuse the same memory address; for example, tensor A and tensor Ab reuse one memory address, tensor B and tensor Bb reuse one memory address, ..., and tensor F and tensor Fb reuse one memory address. In addition, in FIG. 2, tensor Ab, tensor Bb, tensor Cb, tensor Db, tensor Eb, and tensor Fb are the backward-operator data corresponding to tensor A, tensor B, tensor C, tensor D, tensor E, and tensor F, respectively. In FIG. 2, during the training of the network model, the forward computation may be performed first, and then the backward computation. The tensors obtained by the forward computation are all stored in the memory of the corresponding computing device. In the backward computation process, the computation of tensor Ab depends on tensor In and tensor Bb, the computation of tensor Bb depends on tensor A and tensor Cb, the computation of tensor Cb depends on tensor B and tensor Db, the computation of tensor Db depends on tensor C and tensor Eb, and so on. In the backward computation process, after a tensor is computed, the memory of the corresponding tensor obtained by the forward computation is released; for example, after tensor Bb is computed, the memory occupied by tensor A is released, so that the subsequently computed tensor Ab can reuse the memory occupied by tensor A. It can be seen that, in the training process of the network model shown in FIG. 2, the data obtained by the forward computation occupies memory for a long time, which leads to excessive memory consumption of the network model.

一般地,在图2中可以采用重计算的方式或者算子融合的方式解决内存占用的问题。其中,采用重计算的方式时,是在图2中采用时间换空间的方式,提前释放图2中前向计算得到的部分结果所占的内存,并在后续的后向计算时对这部分结果进行重新计算,从而缩短这部分数据的生命周期,进而达到降低内存的占用的目的。具体地,如图3所示,前向计算得到的张量A可以在后向计算得到张量Bb之前持续占用内存。前向计算得到的张量D可以在后向计算得到张量Eb之前持续占用内存。前向计算得到的张量B、张量C、张量E和张量F,这四个张量占用的内存可以在前向计算过程中被使用后立即被释放,并在反向计算时这四个张量分别对应的计算过程被重新执行。其中,在前向计算得到张量C后可以释放张量B占用的内存,在前向计算得到张量D后可以释放张量C占用的内存,当后向计算张量Db时,可以重新计算得到张量C,当后向计算张量Cb时,可以重新计算得到张量B。这样通过重计算的方式即可以达到降低内存占用的目的。继续参阅图2,若在图2中每个方块均占用一个内存,且在得到张量F时前向计算结束,则在前向计算结束后得到的数据总共占用7个内存;而在图3中,由于张量B、张量C、张量E和张量F在前向计算过程中被使用完后它们所占的内存会立即被释放,因此这四个张量可以认为是没有占用内存,所以在图3中前向计算结束后得到的数据总共占用3个内存。可见,通过重计算的方式可以达到降低内存占用的目的。Generally, in FIG. 2, recomputation or operator fusion may be used to address the memory usage problem. When recomputation is used, a time-for-space approach is adopted in FIG. 2: the memory occupied by part of the results obtained by the forward computation in FIG. 2 is released in advance, and this part of the results is recomputed during the subsequent backward computation, thereby shortening the life cycle of this part of the data and reducing memory usage. Specifically, as shown in FIG. 3, tensor A obtained by the forward computation may continue to occupy memory until tensor Bb is obtained by the backward computation, and tensor D obtained by the forward computation may continue to occupy memory until tensor Eb is obtained by the backward computation. For tensor B, tensor C, tensor E, and tensor F obtained by the forward computation, the memory occupied by these four tensors may be released immediately after being used in the forward computation process, and the computation processes corresponding to these four tensors are re-executed during the backward computation. Specifically, the memory occupied by tensor B may be released after tensor C is obtained by the forward computation, and the memory occupied by tensor C may be released after tensor D is obtained by the forward computation; when tensor Db is computed backward, tensor C may be recomputed, and when tensor Cb is computed backward, tensor B may be recomputed. In this way, memory usage can be reduced through recomputation. Continuing to refer to FIG. 2, if each block in FIG. 2 occupies one unit of memory and the forward computation ends when tensor F is obtained, the data obtained after the forward computation occupies 7 units of memory in total; whereas in FIG. 3, since the memory occupied by tensor B, tensor C, tensor E, and tensor F is released immediately after they are used in the forward computation process, these four tensors can be regarded as occupying no memory, so the data obtained after the forward computation in FIG. 3 occupies 3 units of memory in total. It can be seen that memory usage can be reduced through recomputation.
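上例中“7个内存对3个内存”的对比可以用一个最小示意重现;这里假定每个前向张量占一个单位内存。The 7-versus-3 memory count in the example above can be reproduced with a minimal sketch, under the assumption that each forward tensor occupies one unit of memory:

```python
# Illustrative count of forward-pass tensors that remain resident: with
# recomputation, tensors marked recomputable are freed right after use
# and regenerated during the backward pass.
def resident_after_forward(forward_tensors, recompute):
    return [t for t in forward_tensors if t not in recompute]

fw = ["In", "A", "B", "C", "D", "E", "F"]        # 7 units without recompute
kept = resident_after_forward(fw, recompute={"B", "C", "E", "F"})
print(len(kept))  # 3: In, A and D stay resident, matching the example
```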

但采用重计算的方式时,往往需要手动指定重计算点或者自动计算重计算点。其中,手动重计算点的方案往往需要用户识别重计算切分点,对用户能力要求高,并且针对不同模型、不同硬件后端需要分别设置,效率低下;自动计算重计算点的方案要么求解时间长(求解时间小时级别,超过大多数训练任务本身),要么需要在运行时统计,无法与算子融合结合。另外,重计算的方式只能针对前向-反向之间数据依赖的重计算,无法对前向-前向、反向-反向之间的数据依赖进行重计算,使用场景较小。此外,重计算的方式也不能解决单个算子和单个超大张量占用内存的问题。例如,继续参阅图3,若张量A的数据量过大,则其占用的内存会较大,这时就会出现内存持续被大幅占用的情况。However, when recomputation is used, recomputation points often need to be specified manually or computed automatically. Manually specifying recomputation points usually requires the user to identify the recomputation split points, which demands high user expertise, must be configured separately for different models and different hardware backends, and is therefore inefficient. Automatically computing recomputation points either takes a long time to solve (hour-level solving time, exceeding most training tasks themselves) or requires statistics collected at runtime and cannot be combined with operator fusion. In addition, recomputation can only be applied to data dependencies between the forward and backward passes; it cannot be applied to forward-forward or backward-backward data dependencies, so its usage scenarios are limited. Furthermore, recomputation cannot solve the problem of memory occupied by a single operator or a single oversized tensor. For example, continuing to refer to FIG. 3, if the data volume of tensor A is too large, the memory it occupies will be large, and the memory will remain heavily occupied.

采用算子融合的方式时,其通过将相邻多个算子融合成一个算子来提升网络性能,同时也可以降低内存的使用开销。具体地,如图4所示,图4的(A)为融合前的计算方式,其中,张量A输入到算子O1后得到张量B,张量B输入到算子O2后得到张量C,张量C输入到算子O3后得到张量D。利用预先设定的融合规则将算子O1、O2和O3进行融合,可以将这三个算子融合为一个算子,此时,由张量A得到张量D即转变为图4的(B)所示的方式。在图4的(B)中,将张量A输入到融合后的算子(O1+O2+O3)后可以得到张量D。将图4的(A)和(B)进行对比可以看出,在图4的(A)中有4个张量需要占用内存,而在图4的(B)中有2个张量需要占用内存。可见,采用算子融合的方式可以降低内存的占用。When operator fusion is used, network performance is improved by fusing multiple adjacent operators into one operator, and memory usage overhead can also be reduced. Specifically, as shown in FIG. 4, (A) of FIG. 4 is the computation before fusion, where tensor A is input to operator O1 to obtain tensor B, tensor B is input to operator O2 to obtain tensor C, and tensor C is input to operator O3 to obtain tensor D. By fusing operators O1, O2, and O3 according to preset fusion rules, these three operators can be fused into one operator; at this point, obtaining tensor D from tensor A becomes the computation shown in (B) of FIG. 4, in which tensor D is obtained after tensor A is input to the fused operator (O1+O2+O3). Comparing (A) and (B) of FIG. 4, it can be seen that 4 tensors occupy memory in (A) of FIG. 4, while only 2 tensors occupy memory in (B) of FIG. 4. It can be seen that operator fusion can reduce memory usage.
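图4中算子O1、O2、O3的融合可以用函数复合来示意;其中的具体算术运算仅为使示意可运行而作的假定。The fusion of operators O1, O2, and O3 in FIG. 4 can be sketched as function composition; the concrete arithmetic operations below are assumptions used only to make the sketch runnable:

```python
# Sketch of FIG. 4-style fusion: composing three operators into one
# means the intermediate tensors B and C are never materialized as
# separate memory-resident tensors.
def fuse(*ops):
    def fused(x):
        for op in ops:   # intermediates stay in locals/registers,
            x = op(x)    # never written out as standalone tensors
        return x
    return fused

o1 = lambda a: a + 1     # A -> B
o2 = lambda b: b * 2     # B -> C
o3 = lambda c: c - 3     # C -> D
fused = fuse(o1, o2, o3)
print(fused(5))  # ((5 + 1) * 2) - 3 = 9
```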

但受限于内部存储大小以及硬件处理能力,算子融合只能处理相邻空间的算子,其融合的范围和规模有限。而在前向和反向算子之间存在大量计算过程,无法直接融合,前向计算的结果依然需要在内存中暂存。此外,在部分多输入场景中,算子融合会将不同的张量的生命周期拉长到整个融合算子,反而增大了内存占用。示例性的,如图5所示,图5的(A)为算子融合前每个算子的执行顺序,每个算子的输入张量和输出张量的生命周期和每个算子在执行时的峰值内存,图5的(B)为算子融合后每个算子的执行顺序,每个算子的输入张量和输出张量的生命周期和每个算子在执行时的峰值内存。其中,在图5中,算子A、算子B和算子C可以融合为算子E。在图5的(A)中,算子B输出的张量与算子A输入的张量复用一个内存,算子C输出的张量与算子B输入的张量复用一个内存,算子D输出的张量与算子C输入的张量复用一个内存。在图5的(B)中,算子D输出的张量与算子E输入的一个张量可以复用一个内存,当算子C执行时,由于此时相当于算子E在执行,所以此时无法复用内存T0、T1和T2,只能再申请一个内存,即内存T3。由图5的(A)可知,算子融合前,使用该计算图时的峰值内存为(T0+T1+T2),由图5的(B)可知,算子融合后,使用该计算图时的峰值内存为(T0+T1+T2+T3)。可见,算子融合后的峰值内存比算子融合前的峰值内存明显增大。However, limited by internal storage size and hardware processing capability, operator fusion can only process operators in adjacent space, and the scope and scale of fusion are limited. Moreover, there are a large number of computation processes between the forward and backward operators that cannot be fused directly, so the results of the forward computation still need to be temporarily stored in memory. In addition, in some multi-input scenarios, operator fusion extends the life cycles of different tensors to that of the entire fused operator, which instead increases memory usage. Exemplarily, as shown in FIG. 5, (A) of FIG. 5 shows, before operator fusion, the execution order of the operators, the life cycles of each operator's input and output tensors, and the peak memory when each operator executes; (B) of FIG. 5 shows the same after operator fusion. In FIG. 5, operator A, operator B, and operator C may be fused into operator E. In (A) of FIG. 5, the tensor output by operator B and the tensor input to operator A reuse one memory, the tensor output by operator C and the tensor input to operator B reuse one memory, and the tensor output by operator D and the tensor input to operator C reuse one memory. In (B) of FIG. 5, the tensor output by operator D and a tensor input to operator E may reuse one memory; when operator C executes, since this is equivalent to operator E executing, memories T0, T1, and T2 cannot be reused at that moment, and another memory, namely memory T3, has to be allocated. From (A) of FIG. 5, before operator fusion, the peak memory when this computation graph is used is (T0+T1+T2); from (B) of FIG. 5, after operator fusion, the peak memory when this computation graph is used is (T0+T1+T2+T3). It can be seen that the peak memory after operator fusion is significantly larger than the peak memory before operator fusion.

由上述分析可知,采用重计算的方式或者算子融合的方式虽然可以解决一部分内存占用的问题,但仍然无法同时解决由于跨越长路径依赖的长生命周期的张量引起的高峰值内存占用导致网络无法执行的问题,以及单个算子及单个张量超过内存限制导致网络无法执行的问题。From the above analysis, although recomputation or operator fusion can solve part of the memory usage problem, neither can simultaneously solve the problem that the network cannot execute due to high peak memory usage caused by long-lifetime tensors spanning long dependency paths, and the problem that the network cannot execute because a single operator or a single tensor exceeds the memory limit.

In view of this, this application provides a computational graph optimization method. The method derives, from the computational graph to be optimized, a recomputation graph that needs to be recomputed, merges the obtained recomputation graph with the computational graph to be optimized to generate a new computational graph, and finally performs operator fusion on the new computational graph. By combining recomputation with operator fusion, the method significantly reduces memory usage without introducing large recomputation overhead (at both compile time and runtime), and solves the problem that a network with one or more oversized tensors cannot execute.

For example, Figure 6 shows a schematic diagram of the graph-kernel fusion process under an open-source AI computing framework. The framework shown in Figure 6 may be MindSpore. As shown in Figure 6, the framework may include a front-end presentation layer (not shown in the figure), a MindSpore graph layer, and a MindSpore Auto Kernel Generator layer (MindSpore AKG layer). The front-end presentation layer may provide the user with at least one application programming interface (API). When using the framework, the user can select the desired network model 401 in the front-end presentation layer and configure various data through it. Then, in the MindSpore graph layer, the network model is constructed and automatic differentiation generates the user-configured MindSpore Expression (ME) front-end representation. Next, a computational graph can be automatically constructed in the MindSpore graph layer based on the user-configured ME front-end representation. Common optimizations can then be performed on the computational graph first, such as eliminating unnecessary code and deduplicating data that appears multiple times. After that, back-end optimization is performed on the computational graph, for example, selecting expressions better suited to the characteristics of different hardware. After back-end optimization, an optimized computational graph can be generated; the process of obtaining it can be understood as the graph-kernel fusion process. The computational graph optimized by graph-kernel fusion may contain several operator definition descriptions, which describe the computation process of each operator in the graph. These can be passed in JSON format to MindSpore AKG for compilation and optimization, and the code generator is invoked to generate back-end hardware code and compile it into back-end kernels. A back-end kernel can be understood as the expression of an operator on the hardware, such as the binary file corresponding to the operator. After the graph and operators are compiled, the MindSpore graph layer can invoke the kernels in the order of the computational graph and execute the corresponding computational graph.

This application mainly focuses on the back-end optimization part of the MindSpore graph layer. That part can reduce the memory consumption of the network model through recomputation combined with operator fusion, thereby providing performance and memory optimizations for back-ends of different kinds of hardware (for example, graphics processing units, neural-network processing units, and central processing units).

Based on the MindSpore open-source AI computing framework described above, the computational graph optimization scheme provided by this application is described in detail below with reference to the accompanying drawings. It can be understood that the MindSpore framework described in the embodiments of this application is only illustrative; the computational graph optimization method provided in the embodiments of this application is not limited to this framework and can also be applied to other frameworks, which remains within the protection scope of this application.

For example, the computational graph optimization scheme can run on, but is not limited to, any apparatus, device, platform, or device cluster with computing and processing capabilities. The scheme mainly includes two steps: setting the configuration item interface and optimizing the computational graph. Setting the configuration item interface mainly lets the user configure necessary parameters for subsequent use in the optimization process; computational graph optimization is mainly a process combining operator fusion with recomputation, as described below.

(I) Setting the configuration item interface

MindSpore can be configured with a Context module and a graph-kernel fusion module. The Context module can be used to receive the user's global settings for the run. The graph-kernel fusion module can be, but is not limited to being, responsible for graph-kernel fusion of computational graphs and operators. The graph-kernel fusion module can be, but is not limited to being, configured in the AI compiler module of MindSpore, which is responsible for compiling and optimizing computational graphs and operators.

In the embodiments of this application, two configuration item interfaces can be, but are not limited to being, added to the Context module. One configuration item interface can be used to configure a memory threshold and/or a peak memory threshold, and the other can be used to configure a data threshold. The user can set the values of these two configuration items according to the actual network and dataset scenario. When the user does not set them, system defaults can be used, or the values can be generated automatically according to the hardware environment; for example, the video memory size of the graphics processing unit (GPU) can be, but is not limited to being, used as the memory threshold and/or peak memory threshold.

It can be understood that the configuration item interfaces can be set in real time or in advance, depending on the actual situation, which is not limited here.

(II) Computational graph optimization

The computational graph optimization process is mainly a process combining operator fusion with recomputation. The process can be, but is not limited to being, executed in the graph-kernel fusion module configured in MindSpore. The graph-kernel fusion module can read the memory threshold and/or peak memory threshold, and the data threshold, configured by the user in the Context module. The optimization process mainly includes first operator fusion, recomputation, and second operator fusion, as described below.

(1) First operator fusion

When optimizing the computational graph, operator fusion can first be performed on the obtained computational graph. For example, based on preset operator fusion rules (such as elementwise+elementwise, elementwise+reduction, or segment+elementwise fusion rules), and according to the categories of the operators in the computational graph (such as compute-bound or memory-bound) and the back-end characteristics of the hardware, the computational graph can be divided into multiple computational subgraphs, where each computational subgraph corresponds to one operator obtained after fusion (hereinafter a "fused operator"). This yields an optimized computational graph, that is, a first-fusion subgraph, and improves the performance of the operators in the computational graph. In addition, the first operator fusion reduces the number of operators, which reduces the overhead of analyzing operators in the subsequent recomputation process. It also enlarges the range of operators that can be analyzed during recomputation, improving the effect of recomputation. For example, referring to Figure 11, suppose that before operator fusion the operator Div shows no data mutation, and after operator fusion the output tensor B of FuseOp1 is much larger than its input tensor A. Before fusion, it would be difficult to select the operator Div as a recomputation node set by detecting a data mutation; after fusion, the fused operator FuseOp1 corresponding to Div does show a data mutation, so the set of operators related to Div's output tensor B can be taken as a recomputation node set. Operator fusion thus enlarges the range of operators analyzed during recomputation.

For example, as shown in Figure 7, (A) is the initially obtained computational graph, which includes six operators: Slice, Gather, ScatterAdd, MatMul, Mul, and Div. In (A), the output of Slice is tensor A; the input of Gather is tensor A and its output is tensor B; the input of ScatterAdd is tensor B and its output is tensor C; the input of MatMul is tensor C and its output is tensor D; the input of Mul is tensor D and its output is tensor E; and the inputs of Div are tensor B and tensor D. If the preset operator fusion rules are: fuse Gather with ScatterAdd, and fuse Mul with Div, then performing operator fusion on (A) based on these rules yields the computational graph shown in (B). (B) includes four computational subgraphs, corresponding to four fused operators: Slice, FuseOp1, MatMul, and FuseOp2. In (B), the output of Slice is tensor A; the input of FuseOp1 is tensor A and its outputs are tensor B and tensor C; the input of MatMul is tensor C and its output is tensor D; and the inputs of FuseOp2 are tensor D and tensor B. It can be understood that, since (B) is obtained after operator fusion, each operator in (B) can be called a "fused operator".

It can be understood that, in the embodiments of this application, the first operator fusion is optional and can be chosen according to the actual situation. In some embodiments, this step is performed before the subsequent recomputation step. In other embodiments, this step is skipped and the subsequent recomputation step is performed directly.

(2) Recomputation

The recomputation process mainly includes one or more of: constructing a producer-consumer list, determining a recomputation candidate set, screening recomputation node sets from the recomputation candidate set, determining the recomputation graph, and generating a new computational graph, as described below.

A) Constructing the producer-consumer list

According to the data dependencies between nodes in the computational graph, the nodes are first topologically sorted, for example, according to the physical relationships between the operators in the computational graph, and the producer-consumer list is then obtained from the sorting result. Of two computational subgraphs with a data dependency, the producer is the one that generates the data, that is, the operator that generates the data, and the consumer is the one that uses the data, that is, the operator that uses the data.

For example, as shown in Figure 8, (A) is the computational subgraph obtained after the first fusion. Based on the data dependencies between the nodes, topologically sorting the nodes of the computational graph in (A) yields the sorting result shown in (B): Slice and FuseOp1 have a data dependency; FuseOp1 has data dependencies with MatMul and with FuseOp2; MatMul and Div have a data dependency; and Div and FuseOp2 have a data dependency. For Slice and FuseOp1, Slice generates data and FuseOp1 uses the data generated by Slice, so Slice is the producer and FuseOp1 is the consumer. Similarly, for FuseOp1 and MatMul, FuseOp1 is the producer and MatMul the consumer; for FuseOp1 and FuseOp2, FuseOp1 is the producer and FuseOp2 the consumer; for MatMul and Div, MatMul is the producer and Div the consumer; and for Div and FuseOp2, Div is the producer and FuseOp2 the consumer. Thus, from (B), the producer-consumer list corresponding to the computational graph in (A) can be obtained, as shown in Table 1.

Table 1

Producer     Consumer
Slice        FuseOp1
FuseOp1      MatMul
FuseOp1      FuseOp2
MatMul       Div
Div          FuseOp2
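The construction of Table 1 can be sketched as follows. This is a minimal illustration using Python's standard-library topological sorter; the dependency encoding of Figure 8 is an assumption of this sketch, not part of any MindSpore implementation.

```python
# Sketch: topologically sort the operators of Figure 8 by their data
# dependencies, then emit one (producer, consumer) pair per dependency edge.
from graphlib import TopologicalSorter

# deps maps each consumer to the set of producers it depends on (Figure 8)
deps = {
    "FuseOp1": {"Slice"},
    "MatMul": {"FuseOp1"},
    "Div": {"MatMul"},
    "FuseOp2": {"FuseOp1", "Div"},
}

# producers come before their consumers in the topological order
order = list(TopologicalSorter(deps).static_order())

pairs = []
for node in order:  # walk nodes in topological order
    for producer in sorted(deps.get(node, ())):
        pairs.append((producer, node))

print(pairs)
```

For the graph of Figure 8 this yields the five producer-consumer pairs of Table 1, with each producer listed before any pair in which it appears as a consumer.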

B) Determining the recomputation candidate set

After the producer-consumer list is constructed, for any producer in the list, the set of operators inside that producer that are directly or indirectly related to a tensor output by the producer is taken as one entry in the recomputation candidate set; such an entry is called a recomputation node set. This yields the recomputation candidate set. A recomputation node set may include one node or multiple nodes.

For example, continuing with Figure 8 and taking the producer FuseOp1 as an example, the tensors output by this producer are tensor B and tensor C. The node directly related to tensor B is Gather, and no node is indirectly related to tensor B. The node directly related to tensor C is ScatterAdd, and the node indirectly related to tensor C is Gather. Therefore, the recomputation candidate set obtained from the producer FuseOp1 is [(Gather), (Gather, ScatterAdd)]. This candidate set includes two recomputation node sets: one is (Gather), which includes a single node, Gather; the other is (Gather, ScatterAdd), which includes two nodes, Gather and ScatterAdd.
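The traversal described above can be sketched as follows. The internal tensor/operator maps for FuseOp1 are assumed from Figure 8, and `related_ops` is a hypothetical helper, not MindSpore code.

```python
# Hedged sketch: for each tensor output by a producer, collect every
# internal operator that the tensor depends on, directly or transitively.
# The encodings below are assumptions describing FuseOp1 from Figure 8.

produced_by = {"B": "Gather", "C": "ScatterAdd"}   # tensor -> internal op
consumes = {"Gather": ["A"], "ScatterAdd": ["B"]}  # op -> input tensors
external_inputs = {"A"}  # tensors coming from outside the producer

def related_ops(tensor):
    """Operators directly or indirectly related to one output tensor."""
    ops = []
    stack = [tensor]
    while stack:
        t = stack.pop()
        if t in external_inputs:
            continue  # stop at the producer's boundary
        op = produced_by[t]
        if op not in ops:
            ops.append(op)
            stack.extend(consumes[op])  # follow the op's inputs backward
    return tuple(sorted(ops))

candidates = sorted({related_ops(t) for t in ("B", "C")})
print(candidates)  # [('Gather',), ('Gather', 'ScatterAdd')]
```

For FuseOp1 this reproduces the candidate set [(Gather), (Gather, ScatterAdd)] from the example above.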

C) Screening recomputation node sets from the recomputation candidate set

The ways of screening recomputation node sets from the recomputation candidate set may include one or more of: screening based on memory size, screening based on data mutation, and screening based on operator fusion rules. Any one of these three ways can be chosen, or they can be combined arbitrarily, depending on the actual situation, which is not limited here. The three ways are described separately below.

a) Screening based on memory size

When screening based on memory size, the required recomputation node sets corresponding to a producer can be selected from the recomputation candidate set according to the memory occupied by the tensor output by each recomputation node set in the candidate set. Specifically, as shown in Figure 9, in S901, the memory occupied by the output tensor corresponding to each recomputation node set in the candidate set can be compared with the previously set memory threshold or peak memory threshold. In S902, when the memory occupied by the output tensor corresponding to a recomputation node set is greater than or equal to the previously set memory threshold or peak memory threshold, the recomputation node set is determined to be a required recomputation node set and can be retained. In S903, when the memory occupied by the output tensor corresponding to a recomputation node set is less than the previously set memory threshold or peak memory threshold, the recomputation node set is determined not to be a required recomputation node set and can be discarded. The required recomputation node sets are thereby screened out of the candidate set. In one example, the output tensor corresponding to a recomputation node set refers to a tensor output by the producer corresponding to that set, where every node in the set is directly or indirectly related to that tensor. The input tensors corresponding to a recomputation node set refer to at least one tensor input to the producer corresponding to that set, where every node in the set is directly or indirectly related to each of those tensors.
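Steps S901-S903 can be sketched as follows. The node sets, sizes, and the 40-byte threshold are assumptions for illustration (the threshold matches the MatMul-Div example below).

```python
# Sketch of S901-S903: keep a recomputation node set only if the memory
# occupied by its output tensor reaches the memory threshold.

MEMORY_THRESHOLD = 40  # bytes, assumed

def screen_by_memory(candidate_sets, output_mem, threshold=MEMORY_THRESHOLD):
    """candidate_sets: list of node sets; output_mem: node set -> bytes."""
    kept = []
    for node_set in candidate_sets:
        if output_mem[node_set] >= threshold:  # S902: retain the set
            kept.append(node_set)
        # else S903: discard the set
    return kept

# assumed output-tensor sizes for two candidate node sets
sizes = {("Gather",): 50, ("Gather", "ScatterAdd"): 30}
print(screen_by_memory(list(sizes), sizes))  # [('Gather',)]
```

Only the set whose output tensor meets the threshold survives; lowering the threshold would keep both.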

As another possible implementation, while the producer-consumer list is being obtained, it can be recorded whether an indirect path exists between a producer and a consumer, that is, whether a tensor output by the producer, besides serving directly as an input tensor of the consumer, can also serve indirectly as an input tensor of that consumer. For example, continuing with Figure 8, for the producer-consumer pair FuseOp1-FuseOp2, the tensor B output by the producer FuseOp1 can serve directly as an input tensor of the consumer FuseOp2, and the tensor C output by FuseOp1 can serve indirectly, through MatMul and Div, as an input tensor of FuseOp2, so an indirect path exists between FuseOp1 and FuseOp2.
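Recording whether an indirect path exists can be sketched as a reachability check that ignores the direct edge. The `successors` encoding of Figure 8 and the helper name are assumptions of this sketch.

```python
# Sketch: a producer-consumer pair has an indirect path when the consumer
# is reachable from the producer through at least one intermediate operator.

successors = {  # Figure 8: operator -> operators that use its output
    "Slice": ["FuseOp1"],
    "FuseOp1": ["MatMul", "FuseOp2"],
    "MatMul": ["Div"],
    "Div": ["FuseOp2"],
    "FuseOp2": [],
}

def has_indirect_path(producer, consumer):
    # start from successors other than the direct edge to the consumer
    stack = [n for n in successors[producer] if n != consumer]
    seen = set()
    while stack:
        node = stack.pop()
        if node == consumer:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(successors[node])
    return False

print(has_indirect_path("FuseOp1", "FuseOp2"))  # True  (via MatMul and Div)
print(has_indirect_path("MatMul", "Div"))       # False (direct edge only)
```

This reproduces the two cases discussed in the text: FuseOp1-FuseOp2 has an indirect path, while MatMul-Div has only the direct one.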

For any producer-consumer pair, as shown in Figure 10, in S1001 it can be determined whether an indirect path exists between the producer and the consumer. When no indirect path exists, the required recomputation node sets corresponding to the producer can be selected from the recomputation candidate set according to the memory occupied by the output tensor corresponding to each of the producer's recomputation node sets. Specifically, in S1002, the memory occupied by the output tensor corresponding to each of the producer's recomputation node sets in the candidate set can be compared with the previously set memory threshold or peak memory threshold. In S1003, when the memory occupied by the output tensor corresponding to a recomputation node set is greater than or equal to the previously set memory threshold, the set is determined to be a required recomputation node set and can be retained. In S1004, when the memory occupied by the output tensor corresponding to a recomputation node set is less than the previously set memory threshold or peak memory threshold, the set is determined not to be a required recomputation node set and can be discarded. The required recomputation node sets are thereby screened out of the candidate set.

For example, continuing with Figure 8, no indirect path exists in the producer-consumer pair MatMul-Div. Suppose the previously set memory threshold is 40 bytes and the tensor output by the producer MatMul occupies 50 bytes. Since there is only a direct path between MatMul and Div, and the memory occupied by the tensor output by the producer MatMul is greater than the previously set memory threshold, MatMul can be taken as a required recomputation node set.

Exemplarily, the memory occupied by a tensor output by a producer can be determined based on, but not limited to, the data size of that tensor. Once the computational graph is determined, the structure (such as the dimensions and type) of the tensor output by each operator in the graph is determined, so the memory occupied by each operator's output tensor can be determined accordingly. For example, if a tensor has dimensions [4,4,3,2] and type float32, then since each float32 element occupies 4 bytes, the memory occupied by the tensor is 4×4×3×2×4 = 384 bytes.
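The byte-size rule above can be written as a small helper. The dtype-to-bytes table is an assumption covering only a few common types.

```python
# Sketch: tensor memory = product of dimensions x bytes per element.
from math import prod

DTYPE_BYTES = {"float32": 4, "float16": 2, "int64": 8}  # assumed subset

def tensor_bytes(dims, dtype):
    """Memory in bytes occupied by a tensor of the given shape and type."""
    return prod(dims) * DTYPE_BYTES[dtype]

print(tensor_bytes([4, 4, 3, 2], "float32"))  # 384
```

This reproduces the 384-byte figure from the example in the text.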

Continuing with Figure 10, in S1005, when an indirect path exists between the producer and the consumer, the peak memory during execution of each operator on the indirect path between them can be calculated. Next, in S1006, it is determined whether the peak memory corresponding to at least one operator is greater than or equal to the previously set peak memory threshold. In S1007, when the peak memory corresponding to at least one operator is greater than or equal to the previously set peak memory threshold, each set of operators in the producer-consumer pair that is directly or indirectly related to a tensor input from the producer to the consumer is taken as a required recomputation node set. In S1008, when the peak memories of all operators on the indirect path between the producer and the consumer are less than the previously set peak memory threshold, the sets of operators directly or indirectly related to the tensors input from the producer to the consumer are determined not to be required recomputation node sets and are discarded. The required recomputation node sets are thereby screened out of the candidate set.

For example, as shown in Figure 11, FuseOp1 and FuseOp2 form a producer-consumer pair. An indirect path exists between FuseOp1 and FuseOp2, and the operator on that path is MatMul. If the memory during the execution of MatMul (that is, the sum of the memory occupied by tensor B, tensor C, tensor D, and tensor E) is greater than the previously set peak memory threshold, the set of operators directly or indirectly related to tensor B and the set of operators directly or indirectly related to tensor E can each be taken as a required recomputation node set. The set of operators directly or indirectly related to tensor B is (Gather), and the set of operators directly or indirectly related to tensor E is (Gather, ScatterAdd, Div), so the recomputation node sets screened out at this point are (Gather) and (Gather, ScatterAdd, Div).

In some embodiments, to improve the accuracy of screening based on the peak memory threshold, an additional judgment condition can be used: comparing the memory occupied by a tensor input from the producer to the consumer (that is, a tensor output by the producer) with the memory occupied by the producer's input tensors related to that tensor. Specifically, when an indirect path exists between the producer and the consumer, if the memory occupied by the tensor input from the producer to the consumer is greater than the memory occupied by the producer's input tensors related to that tensor, and the peak memory during execution of at least one operator on the indirect path between the producer and the consumer is greater than or equal to the previously set peak memory threshold, then each set of operators in the producer-consumer pair that is directly or indirectly related to a tensor input from the producer to the consumer can be taken as a required recomputation node set.

For example, as shown in (A) of Figure 12, for the producer-consumer pair of operator 1 and operator 5, an indirect path exists between them, and the peak memory is highest when operator 4 executes, at (T1+T2+T3+T4). If the peak memory when operator 4 executes exceeds the previously set peak memory threshold, operator 1 can be taken as a required recomputation node set. As shown in (B) of Figure 12, when operator 1 is taken as the required recomputation node set and the peak memory after recomputation is (T0+T2+T3+T4): if T1 is greater than T0, recomputation reduces the peak memory of the computational graph; if T1 is less than T0, recomputation actually increases the peak memory; and if T1 equals T0, the peak memory is unchanged but the recomputation adds system overhead. Therefore, operator 1 should be taken as a required recomputation node set only when T1 is greater than T0.

b) Screening based on data mutation

When screening based on data mutation, it can be determined whether a recomputation node set in the recomputation candidate set exhibits a data mutation. If the deviation (for example, a difference or a ratio) between the memory occupied by the output tensor corresponding to the recomputation node set and the memory occupied by at least one of its corresponding input tensors is greater than or equal to the previously set data threshold, a data mutation exists and the recomputation node set is retained; otherwise, the recomputation node set is discarded. In this way, excessive memory usage caused by data mutations is eliminated.

For example, referring again to FIG. 11, suppose the recomputation candidate set contains the recomputation node sets [(Gather), (Gather-ScatterAdd-Div)]. If the memory occupied by tensor E minus the memory occupied by tensor A is greater than the previously set data threshold, the recomputation node set (Gather) exhibits a data mutation and can be retained. If the memory occupied by tensor B minus the memory occupied by tensor A is less than the previously set data threshold, the recomputation node set (Gather-ScatterAdd-Div) exhibits no data mutation and can be discarded. The recomputation node set thus screened out is (Gather).
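The data-mutation check can be sketched as follows. This is illustrative only: the tensor sizes and the threshold are assumed, and a difference is used as the deviation, although the text also allows a ratio.

```python
def has_data_mutation(out_mem, in_mems, data_threshold):
    # A data mutation exists when the output tensor of the recomputation
    # node set exceeds at least one of its input tensors by the threshold.
    return any(out_mem - m >= data_threshold for m in in_mems)

# FIG. 11 example with assumed sizes: tensor A feeds both candidate sets.
mem_A, mem_B, mem_E = 4, 6, 4096
candidates = {"Gather": (mem_E, [mem_A]),
              "Gather-ScatterAdd-Div": (mem_B, [mem_A])}
kept = [name for name, (out_m, in_ms) in candidates.items()
        if has_data_mutation(out_m, in_ms, data_threshold=1024)]
print(kept)  # ['Gather']
```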

As an example of eliminating excessive memory usage caused by a data mutation, as shown in FIG. 12, (A) of FIG. 12 shows, before recomputation, the execution order of the operators, the lifetimes of the input and output tensors of each operator, and the peak memory of each operator during execution; (B) of FIG. 12 shows the same information after recomputation. Comparing (A) and (B) of FIG. 12, when operator 4 executes, the peak memory before recomputation is T1+T2+T3+T4, and the peak memory after recomputation is T0+T2+T3+T4. If T1 minus T0 is greater than the previously set data threshold, and operator 1 is one of the recomputation node sets in the determined recomputation candidate set, then the peak memory after recomputation is significantly smaller than the peak memory before recomputation when operator 4 executes; in this case the peak memory can be significantly reduced, so operator 1 can be taken as a screened-out recomputation node set. If T1 minus T0 is less than the previously set data threshold, and operator 1 is one of the recomputation node sets in the determined recomputation candidate set, then the peak memory after recomputation differs little from the peak memory before recomputation when operator 4 executes; in this case operator 1 may or may not be taken as a screened-out recomputation node set.

c) Screening based on operator fusion rules

When screening recomputation node sets based on operator fusion rules, as shown in (A) of FIG. 13, at S1311 it can be judged, based on preset operator fusion rules, whether a recomputation node set in the recomputation candidate set can be fused with the consumer corresponding to that recomputation node set. At S1312, if they can be fused, the recomputation node set is retained. At S1313, if they cannot be fused, the recomputation node set can be discarded. In one example, S1311 can also be understood as judging whether the node in the recomputation node set that outputs a tensor externally can be fused with its corresponding consumer.

As another possible implementation, as shown in (B) of FIG. 13, at S1321 it can be judged, based on preset operator fusion rules, whether a recomputation node set in the recomputation candidate set can be fused with the consumer corresponding to that recomputation node set. At S1322, if they can be fused, the recomputation node set is retained. At S1323, if they cannot be fused, it can be judged whether an indirect path exists between the producer and the consumer corresponding to the recomputation node set; when an indirect path exists, the recomputation node set can be retained, that is, S1322 is executed. At S1324, when no indirect path exists, the recomputation node set can be discarded. The recomputation node sets are thereby screened out. When the operator fusion rules are not satisfied but an indirect path exists, recomputation can still shorten the lifetime of a tensor that occupies too much memory, and may thus yield a reduction in peak memory. In one example, S1321 can also be understood as judging whether the node in the recomputation node set that outputs a tensor externally can be fused with its corresponding consumer.
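The decision flow of (B) of FIG. 13 can be summarized as a small predicate (an illustrative sketch; the function and argument names are assumptions, not from the patent):

```python
def keep_recompute_set(can_fuse_with_consumer, has_indirect_path):
    # S1321/S1322: retain the set if it fuses with its consumer.
    if can_fuse_with_consumer:
        return True
    # S1323: otherwise retain it only when an indirect path exists between
    # its producer and consumer (recomputation may still cut peak memory).
    # S1324: discard it when there is no indirect path.
    return has_indirect_path

print(keep_recompute_set(True, False))   # True  (retained at S1322)
print(keep_recompute_set(False, True))   # True  (kept via the indirect path)
print(keep_recompute_set(False, False))  # False (discarded at S1324)
```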

For example, referring again to FIG. 12, suppose the recomputation node set determined in the recomputation candidate set is operator 1, and operator 1 and operator 5 cannot be fused. Although operator 1 and operator 5 do not satisfy the fusion rules, an indirect path exists between them, so operator 1 can be taken as a required recomputation node set. Comparing (A) and (B) of FIG. 12, if the memory T1 occupied by the tensor output by operator 1 is too large, the lifetime of that tensor can be greatly shortened in (B) of FIG. 12, and the peak memory corresponding to the entire computation graph is thereby significantly reduced.

D) Determining the recomputation graph

After the recomputation node sets are screened out, the recomputation graphs can be determined from them. Exemplarily, in the computation graph, the edge between a recomputation node set and the consumer corresponding to the producer to which that recomputation node set belongs can be taken as a recomputation point. Then, the producer subgraph corresponding to each recomputation point is copied, where the number of producer subgraphs obtained is equal to or greater than the number of screened-out recomputation node sets. Finally, in each producer subgraph, all nodes other than the recomputation node set corresponding to that producer subgraph are deleted, and the edges unrelated to that recomputation node set are deleted, yielding the recomputation graph corresponding to each recomputation node set. Because the probability of overlap between the recomputation node sets corresponding to the recomputation graphs obtained in this way is very small, the probability of duplicated computation is reduced, which reduces the amount of computation and saves recomputation overhead. In one example, deleting all nodes in a producer subgraph other than the corresponding recomputation node set can be understood as deleting the operators in the producer subgraph that are unrelated to the tensor output by that recomputation node set.

In one example, for any recomputation node set, the producer to which that recomputation node set belongs can also be copied directly in the computation graph, yielding the producer subgraph corresponding to that recomputation node set. Then, in each producer subgraph, all nodes other than the recomputation node set corresponding to that producer subgraph are deleted, and the edges unrelated to that recomputation node set are deleted, yielding the recomputation graph corresponding to each recomputation node set.

For example, as shown in FIG. 14, (A) of FIG. 14 is a fused subgraph obtained by performing one round of operator fusion on the initial computation graph. If the memory occupied by tensor B in (A) of FIG. 14 is higher than the memory threshold configured by the user, the recomputation node set can be determined as (Gather), and the edge corresponding to tensor B can be determined as the recomputation point. Next, the producer subgraph corresponding to the recomputation point, namely FuseOp1, can be copied to obtain the fused operator shown in (B) of FIG. 14. Copying the producer subgraph corresponding to the recomputation point can be understood as copying, from the fused subgraph obtained by the first round of operator fusion, the fused operator that outputs tensor B. Then, in (B) of FIG. 14, the operators other than the recomputation node set (Gather) can be deleted, that is, ScatterAdd is deleted, and the edges unrelated to the recomputation node set (Gather) are deleted, that is, the edge between operator Scatter and operator Gather and the edge corresponding to the output tensor of operator Scatter are deleted, thereby obtaining the recomputation graph shown in (C) of FIG. 14.
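The pruning in step D) can be sketched on a toy graph representation. This is illustrative only: graphs are encoded here as dicts mapping each node to its successor set, and while the node names follow the FIG. 14 example, the encoding itself is an assumption, not the patent's.

```python
import copy

def build_recompute_subgraph(producer_subgraph, recompute_set):
    g = copy.deepcopy(producer_subgraph)       # copy the producer subgraph
    for node in list(g):
        if node not in recompute_set:
            del g[node]                        # delete unrelated operators
        else:                                  # delete unrelated edges
            g[node] = {s for s in g[node] if s in recompute_set}
    return g

# FIG. 14 sketch: FuseOp1 contains Scatter -> Gather -> ScatterAdd; the
# recomputation node set is (Gather), so only Gather survives.
fuse_op1 = {"Scatter": {"Gather"}, "Gather": {"ScatterAdd"}, "ScatterAdd": set()}
print(build_recompute_subgraph(fuse_op1, {"Gather"}))  # {'Gather': set()}
```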

E) Generating the new computation graph

After the recomputation graphs are obtained, each recomputation graph is combined with the original computation graph according to the relationship between the nodes in each recomputation graph and the nodes in the original computation subgraph, thereby generating the new computation graph. Specifically, the node of the recomputation graph that inputs the first tensor has a data dependency on the node of the original computation subgraph that outputs the first tensor, and the node of the recomputation graph that outputs the second tensor has a data dependency on the node of the original computation subgraph that inputs the second tensor. It can be understood that, for the original computation subgraph, if operator fusion has already been performed on the original computation graph when the recomputation step is executed, the original computation subgraph is the subgraph obtained after operator fusion; if operator fusion has not been performed on the original computation graph when the recomputation step is executed, the original computation subgraph is the original computation graph.

Exemplarily, a directed edge can be constructed between the node of each recomputation graph that inputs the first target tensor and the node of the original computation graph that outputs the first target tensor, and a directed edge can be constructed between the node of each recomputation graph that outputs the second target tensor and the node of the original computation graph that inputs the second target tensor, so as to obtain the new computation graph.

For example, as shown in FIG. 15, (A) of FIG. 15 is the initial computation graph (that is, the original computation graph). After one round of operator fusion is performed on this computation graph, the first-round fused subgraph shown in (B) of FIG. 15 can be obtained. After recomputation is performed on the first-round fused subgraph shown in (B) of FIG. 15, the recomputation graph shown in (C) of FIG. 15 can be obtained. Since the input of the recomputation graph shown in (C) of FIG. 15 is tensor A and its output is tensor B, and in (B) of FIG. 15 the operator that outputs tensor A is the operator Slice and the operator that inputs tensor B is the operator FuseOp2, the operator of the recomputation graph in (C) of FIG. 15 that inputs tensor A can be connected to the operator Slice in (B) of FIG. 15, that is, a directed edge is constructed between them, and the operator of the recomputation graph in (C) of FIG. 15 that outputs tensor B can be connected to the operator FuseOp2 in (B) of FIG. 15, that is, a directed edge is constructed between them. Therefore, after the recomputation graph shown in (C) of FIG. 15 is combined with the first-round fused subgraph shown in (B) of FIG. 15, the new computation graph shown in (D) of FIG. 15 can be obtained.
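Step E) amounts to adding two directed edges per recomputation graph. A minimal sketch follows (edges are encoded as (from, to) pairs; the helper and its argument names are illustrative assumptions, not from the patent):

```python
def attach_recompute_graph(edges, orig_producer_of_A, rec_input_node,
                           rec_output_node, orig_consumer_of_B):
    # Connect the original node that outputs the first tensor (A) to the
    # recomputation-graph node that inputs it, and the recomputation-graph
    # node that outputs the second tensor (B) to the original node that
    # inputs it.
    new_edges = set(edges)
    new_edges.add((orig_producer_of_A, rec_input_node))
    new_edges.add((rec_output_node, orig_consumer_of_B))
    return new_edges

# FIG. 15 sketch: Slice outputs tensor A, FuseOp2 inputs tensor B.
edges = {("Slice", "FuseOp1")}
new_graph = attach_recompute_graph(edges, "Slice", "Recompute",
                                   "Recompute", "FuseOp2")
print(sorted(new_graph))
# [('Recompute', 'FuseOp2'), ('Slice', 'FuseOp1'), ('Slice', 'Recompute')]
```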

(3) Secondary operator fusion

Operator fusion (that is, secondary fusion) is performed on the obtained new computation graph to obtain a new fused subgraph, thereby obtaining the optimized computation graph. Exemplarily, based on preset operator fusion rules, and according to the categories of the operators in the computation graph (for example, compute-intensive or memory-access-intensive) and the back-end characteristics of the hardware, the computation graph can be divided into multiple computation subgraphs, where each computation subgraph can correspond to one operator obtained after fusion (hereinafter referred to as a "fused operator"). The optimized computation graph, namely the secondary fused subgraph, is thereby obtained; this computation graph is the finally optimized computation graph, which can then be executed. In one example, the finally optimized computation graph may first be compiled and the compiled computation graph then executed, or the computation graph may be executed directly; either way, what is executed is essentially the finally optimized computation graph, only in a different representation.

For example, as shown in FIG. 16, (A) of FIG. 16 shows the new computation graph obtained after recomputation. If the preset operator fusion rule is that Gather and Mul are fused, then performing operator fusion on (A) of FIG. 16 based on this rule yields the fused subgraph shown in (B) of FIG. 16.

In some embodiments, the nodes described in the embodiments of this application may refer to, but are not limited to, operators, and the operators described in the embodiments of this application may also refer to, but are not limited to, nodes, depending on the actual situation.

Thus, by first performing operator fusion on the computation graph to obtain the fused subgraph, then determining from the fused subgraph the recomputation graphs that need recomputation and combining the determined recomputation graphs with the fused subgraph to generate a new computation graph, and finally performing operator fusion on the new computation graph to obtain the required computation graph, memory usage is significantly reduced without introducing large recomputation overhead, solving the problem that a network with one or more very large tensors cannot execute.

For ease of understanding, the following example illustrates how the above solution reduces memory usage.

Exemplarily, as shown in FIG. 17, M and L respectively represent the sizes of the memory occupied by the tensors output by the operators, where M represents a medium size, such as 56.8 MB, and L represents a very large size, such as 27 GB. The computation graph shown in (A) of FIG. 17 is the computation graph before any processing, that is, the initial computation graph. In (A) of FIG. 17, operator Slice and operator ScatterAdd reuse one block of memory, and operator Gather and operator Div reuse one block of memory, so the memory required by the computation graph shown in (A) of FIG. 17 is M1+M2+L1+L2.

The computation graph shown in (B) of FIG. 17 is the computation graph obtained by performing one round of operator fusion on the computation graph shown in (A) of FIG. 17 based on preset operator fusion rules, that is, the first-round fused subgraph. In (B) of FIG. 17, operator Slice and operator MatMul reuse one block of memory, so the memory required by the computation graph shown in (B) of FIG. 17 is M1+M2+L1+L2.

The computation graph shown in (C) of FIG. 17 is the computation graph obtained after performing recomputation on the computation graph shown in (B) of FIG. 17, that is, the new computation graph. In (C) of FIG. 17, memory cannot be reused between the operators, so the memory required by the computation graph shown in (C) of FIG. 17 is M1+M2+M3+L1+L2.

The computation graph shown in (D) of FIG. 17 is the computation graph obtained by performing secondary operator fusion on the computation graph shown in (C) of FIG. 17 based on preset operator fusion rules, that is, the secondary fused subgraph; this computation graph is the required computation graph. In (D) of FIG. 17, memory cannot be reused between the operators, so the memory required by the computation graph shown in (D) of FIG. 17 is M1+M2+M3+L1.

Assuming M1=M2=M3=M and L1=L2=L, the computation graph shown in (A) of FIG. 17 occupies 2M+2L of memory, the computation graph shown in (B) of FIG. 17 occupies 2M+2L, the computation graph shown in (C) of FIG. 17 occupies 3M+2L, and the computation graph shown in (D) of FIG. 17 occupies 3M+L. Since M represents a medium size and L a very large size, the computation graph shown in (D) of FIG. 17 occupies (L-M) less memory than the computation graph shown in (A) of FIG. 17. In addition, comparing (A) and (B) of FIG. 17 also shows that the memory occupied by the computation graph does not change after the first round of operator fusion.
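The memory accounting of FIG. 17 can be verified with simple arithmetic, using the example sizes from the text (M = 56.8 MB, L = 27 GB):

```python
M = 56.8e6   # medium tensor, e.g. 56.8 MB
L = 27e9     # very large tensor, e.g. 27 GB

mem_A = 2 * M + 2 * L   # initial graph, FIG. 17 (A)
mem_B = 2 * M + 2 * L   # after first fusion, FIG. 17 (B): unchanged
mem_C = 3 * M + 2 * L   # after recomputation, FIG. 17 (C)
mem_D = 3 * M + 1 * L   # after secondary fusion, FIG. 17 (D)

print(mem_A == mem_B)            # True: first fusion does not change memory
print(mem_A - mem_D == L - M)    # True: the optimized graph saves (L - M)
```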

Further, during the recomputation process, the recomputation points can be determined based on memory size, data mutation, fusion rules, and the like. This not only effectively reduces memory usage and improves the efficiency of identifying recomputation points, but also allows tensors that occupy large amounts of memory to be converted to internal storage (such as register space).

Next, based on the computation graph optimization scheme described above, a computation graph optimization method provided by an embodiment of this application is introduced. It can be understood that this method is another expression of the computation graph optimization scheme described above, and the two are to be read together. The method is proposed based on the computation graph optimization scheme described above, and some or all of its content may, but is not limited to, refer to the description of that scheme above.

Referring to FIG. 18, FIG. 18 is a schematic flowchart of a computation graph optimization method provided by an embodiment of this application. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 18, the computation graph optimization method may include the following steps:

S1801: Based on parameters and the data dependencies between the nodes in the first computation graph, obtain a second computation graph from the first computation graph, where the parameters include one or more of operator fusion rules, a memory threshold on the memory occupied by the tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to the nodes in the first computation graph; the peak memory threshold is a threshold on the memory occupied by all tensors in use at the moment a node executes during execution of the computation graph, and the data mutation threshold is a threshold at which at least one node in the first computation graph produces a data mutation.

Specifically, when optimizing the computation graph, the second computation graph can be obtained from the first computation graph based on the parameters and the data dependencies between the nodes in the first computation graph. Exemplarily, the parameters include one or more of operator fusion rules, a memory threshold on the memory occupied by the tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to the nodes in the first computation graph. The peak memory threshold is a threshold on the memory occupied by all tensors in use at the moment a node executes during execution of the computation graph, and the data mutation threshold is a threshold at which at least one node in the first computation graph produces a data mutation. Exemplarily, when the computing framework corresponding to the computation graph is the MindSpore AI open-source computing framework, the user can set the memory threshold, the peak memory threshold, and the data mutation threshold through the Context module configured in MindSpore. Exemplarily, the memory threshold may also be called a memory limit, the peak memory threshold may also be called a peak memory limit, and the data mutation threshold may also be called a data limit. In some embodiments, a single node can be understood as one node, or as one node obtained after fusing multiple nodes.

As a possible implementation, when obtaining the second computation graph, N first node sets can first be obtained from the producers contained in the first computation graph based on the parameters and the data dependencies between the nodes in the first computation graph, where N is a positive integer greater than or equal to 1 and each first node set includes at least one node; the nodes contained in one first node set are all directly or indirectly related to one tensor output by one producer in the first computation graph, and are all directly or indirectly related to at least one tensor input by that producer. Then, recomputation is performed on each first node set to obtain N recomputation subgraphs, and the N recomputation subgraphs constitute the second computation graph. The second computation graph is thereby obtained. Exemplarily, the second computation graph may include the N recomputation subgraphs. Exemplarily, a first node set can be understood as the required recomputation node set described above, and a recomputation subgraph can be understood as the recomputation graph described above.

In one example, performing recomputation on each first node set to obtain the N recomputation subgraphs can be implemented as follows. Specifically, for any node set among the N first node sets, the producer subgraph (which may also be called the producer) corresponding to that node set can first be copied. Then, the nodes in the producer subgraph other than the nodes contained in that node set are deleted, and the edges in the producer subgraph unrelated to that node set are deleted, yielding the recomputation subgraph corresponding to that node set. By traversing each of the N first node sets, the N recomputation subgraphs can be obtained.

As for obtaining the first node sets, a first list can first be obtained based on the data dependencies between the nodes in the first computation graph, where the first list includes the correspondences between producers and consumers in the first computation graph. Exemplarily, the first list can be the producer-consumer list described above, as shown in Table 1 above; this step can be understood as the step of constructing the producer-consumer list in the recomputation step described above.

Then, candidate node sets are obtained according to the tensor output by each producer in the first list, where the candidate node sets include at least the first node sets; for any producer in the first list, the set of nodes within that producer that are directly or indirectly related to the tensor output by that producer is taken as one node set. This step can be understood as the step of determining the recomputation candidate set in the recomputation step described above, and a candidate node set can be understood as the recomputation candidate set described above.

Finally, the first node sets are selected from the candidate node sets based on the parameters. This step can be understood as the step, in the recomputation step described above, of determining the recomputation graphs from the recomputation candidate set by one or more of screening based on memory size, screening based on data mutation, and screening based on operator fusion rules.

In one example, the parameter is the memory threshold, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold. In this way, the node set corresponding to an output tensor that occupies too much memory can automatically be taken as a required node set, solving the problem of a single operator and/or tensor occupying too much memory.

In one example, the parameter is the peak memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold. In this way, the node set corresponding to excessively high peak memory usage can automatically be taken as a required node set, solving the problem of a single operator and/or tensor occupying too much memory.

In one example, the parameter is the data mutation threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. In this way, a node set exhibiting a data mutation can automatically be taken as a required node set, eliminating the problem of excessive memory usage caused by data mutations.

In one example, the parameter is an operator fusion rule; either the first node set can be fused with its corresponding consumer, or the first node set cannot be fused with its corresponding consumer but an indirect path exists between the producer and the consumer corresponding to the first node set. In this way, the required node sets are filtered out by means of operator fusion rules.

In one example, the parameters are a memory threshold and a peak memory threshold; no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or, an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold. Combining the memory threshold and the peak memory threshold in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold and a data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. Combining the memory threshold and the data mutation threshold in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold and an operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the memory threshold and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a peak memory threshold and a data mutation threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; and/or, the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. Combining the peak memory threshold and the data mutation threshold in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a peak memory threshold and an operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the peak memory threshold and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a data mutation threshold and an operator fusion rule; the first node set meets one or more of the following conditions: the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the data mutation threshold and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold, a peak memory threshold, and a data mutation threshold; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold. Combining the memory threshold, the peak memory threshold, and the data mutation threshold in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold, a peak memory threshold, and an operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the memory threshold, the peak memory threshold, and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold, a data mutation threshold, and an operator fusion rule; the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the memory threshold, the data mutation threshold, and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a peak memory threshold, a data mutation threshold, and an operator fusion rule; the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the peak memory threshold, the data mutation threshold, and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In one example, the parameters are a memory threshold, a peak memory threshold, a data mutation threshold, and an operator fusion rule; the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set and the peak memory during execution of the nodes between that producer and consumer is higher than the peak memory threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; the first node set can be fused with its corresponding consumer; or the first node set cannot be fused with its corresponding consumer and an indirect path exists between the producer and the consumer corresponding to the first node set. Combining the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rules in this way filters out the required node sets and improves the efficiency and accuracy of the filtering.

In some embodiments, before S1801, the operators in the first computation graph may first be fused. This improves the performance of the operators in the computation graph; in addition, this first round of operator fusion reduces the number of operators, which lowers the cost of analyzing operators in the subsequent recomputation process; furthermore, it widens the scope over which operators can be analyzed during that recomputation process, improving the effect of the recomputation. Exemplarily, but without limitation, the operators in the first computation graph may be fused based on preset operator fusion rules.

After the second computation graph is obtained, S1802 can be executed.

S1802: Merge the second computation graph with the first computation graph to obtain a third computation graph, wherein, in the third computation graph, there is a first directed edge between a first node that outputs a first tensor and a second node that inputs the first tensor, and a second directed edge between a third node that outputs a second tensor and a fourth node that inputs the second tensor; the first directed edge points from the first node to the second node, and the second directed edge points from the third node to the fourth node; the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph.

Specifically, after the second computation graph is obtained, it can be merged with the first computation graph to obtain the third computation graph. In the third computation graph, there is a first directed edge between the first node that outputs the first tensor and the second node that inputs the first tensor, and a second directed edge between the third node that outputs the second tensor and the fourth node that inputs the second tensor; the first directed edge points from the first node to the second node, and the second directed edge points from the third node to the fourth node; the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph. The process of obtaining the third computation graph can be understood as the step of generating the new computation graph in the recomputation procedure described above.

In one example, when the second computation graph is composed of N recomputation subgraphs, merging the second computation graph with the first computation graph can be done by constructing, for each of the N recomputation subgraphs, a directed edge between the node in that subgraph that inputs a first target tensor and the node in the first computation graph that outputs the first target tensor, and a directed edge between the node in that subgraph that outputs a second target tensor and the node in the first computation graph that inputs the second target tensor; this yields the third computation graph.
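The merge can be sketched by modeling each graph as a set of directed edges. The node names (`p`, `c`, `r1`, `r2`) and the edge-set representation are illustrative assumptions:

```python
def merge_graphs(first_graph, recompute_subgraphs, bridges):
    # Union the original graph's edges with every recomputation subgraph's
    # edges, then add the bridging directed edges described above:
    #   (node outputting the first target tensor -> subgraph node inputting it)
    #   (subgraph node outputting the second target tensor -> its consumer)
    merged = set(first_graph)
    for sub in recompute_subgraphs:
        merged |= set(sub)
    merged |= set(bridges)
    return merged

first_graph = {("p", "c")}              # producer p feeds consumer c
subgraph = {("r1", "r2")}               # recomputation copy of the path
bridges = {("p", "r1"), ("r2", "c")}    # the first and second directed edges
third_graph = merge_graphs(first_graph, [subgraph], bridges)
print(len(third_graph))  # 4
```

The bridging edges are what let the recomputed copy regenerate the tensor just before its consumer needs it, so the original tensor need not stay resident.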

S1803: Fuse the operators in the third computation graph to obtain a fourth computation graph.

Specifically, the operators in the third computation graph may be fused based on, but not limited to, preset operator fusion rules, thereby obtaining the fourth computation graph. Exemplarily, the fourth computation graph may also be called the optimized computation graph.

S1804: Execute the fourth computation graph.

Specifically, after the fourth computation graph is obtained, it can be executed.

In one example, executing the fourth computation graph may involve first compiling the graph and then executing the compiled result, or executing the graph directly; either way, it is essentially the fourth computation graph that is executed, only in a different representation.

In this way, recomputation is first applied to the original computation graph to obtain a recomputation graph, the recomputation graph is merged with the original computation graph into a new computation graph, and operator fusion is then applied to the new computation graph to obtain the optimized computation graph, which can then be executed. Optimizing the computation graph by combining recomputation with operator fusion in this way significantly reduces memory usage without introducing large recomputation overhead, solving the problem that networks with one or more extremely large tensors cannot be executed.
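The four steps S1801–S1804 can be summarized in a toy end-to-end sketch, under the assumption that a graph is just a set of directed edges; all function bodies here are illustrative stubs, not the embodiment's actual passes:

```python
def build_recompute_graph(graph):
    # S1801 (sketch): duplicate the selected nodes; primes mark the copies.
    return {(u + "'", v + "'") for (u, v) in graph}

def merge(graph, recompute_graph, bridges):
    # S1802: union of the two graphs plus the bridging directed edges.
    return graph | recompute_graph | bridges

def fuse_operators(graph):
    # S1803 (placeholder): a real pass would contract fusible edges
    # according to the preset operator fusion rules.
    return graph

first = {("a", "b")}
second = build_recompute_graph(first)                    # {("a'", "b'")}
third = merge(first, second, {("a", "a'"), ("b'", "b")})
fourth = fuse_operators(third)                           # S1804 executes this
print(len(fourth))  # 4
```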

Based on the methods described in the foregoing embodiments, an embodiment of the present application further provides a chip. Referring to FIG. 19, FIG. 19 is a schematic structural diagram of a chip provided by an embodiment of the present application. As shown in FIG. 19, the chip 1900 includes one or more processors 1901 and an interface circuit 1902. Optionally, the chip 1900 may further include a bus 1903, wherein:

The processor 1901 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 1901 or by instructions in the form of software. The processor 1901 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods and steps disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The interface circuit 1902 may be used to send or receive data, instructions, or information. The processor 1901 may process the data, instructions, or other information received through the interface circuit 1902, and may send the processed information out through the interface circuit 1902.

Optionally, the chip further includes a memory, which may include a read-only memory and a random access memory and provides operation instructions and data to the processor. A portion of the memory may also include a non-volatile random access memory (NVRAM).

Optionally, the memory stores executable software modules or data structures, and the processor may perform corresponding operations by invoking operation instructions stored in the memory (the operation instructions may be stored in an operating system).

Optionally, the interface circuit 1902 may be used to output the execution result of the processor 1901.

It should be noted that the functions corresponding to the processor 1901 and the interface circuit 1902 may each be implemented through hardware design, through software design, or through a combination of software and hardware, which is not limited here.

It should be understood that the steps of the foregoing method embodiments may be completed by logic circuits in the form of hardware or by instructions in the form of software in the processor.

It can be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.

The method steps in the embodiments of the present application may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC.

The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, via coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, via infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.

It can be understood that the various numerals involved in the embodiments of the present application are merely distinctions made for convenience of description and are not intended to limit the scope of the embodiments of the present application.

Claims (26)

A computation graph optimization method, characterized in that the method comprises: obtaining a second computation graph from a first computation graph based on parameters and data dependencies between nodes in the first computation graph, wherein the parameters comprise one or more of an operator fusion rule, a memory threshold on the memory occupied by a tensor output by a single node in the first computation graph, a peak memory threshold corresponding to the first computation graph, and a data mutation threshold corresponding to a node in the first computation graph, the peak memory threshold being a threshold on the memory occupied by all tensors in use at the moment a node is executed during execution of the computation graph, and the data mutation threshold being a threshold at which at least one node in the first computation graph produces a data mutation; merging the second computation graph with the first computation graph to obtain a third computation graph, wherein, in the third computation graph, there is a first directed edge between a first node that outputs a first tensor and a second node that inputs the first tensor, and a second directed edge between a third node that outputs a second tensor and a fourth node that inputs the second tensor, the first directed edge points from the first node to the second node, the second directed edge points from the third node to the fourth node, the first node and the fourth node correspond to nodes in the first computation graph, and the second node and the third node correspond to nodes in the second computation graph; fusing operators in the third computation graph to obtain a fourth computation graph; and executing the fourth computation graph.

The method according to claim 1, characterized in that obtaining the second computation graph from the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically comprises: obtaining N first node sets from producers contained in the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph, N being a positive integer greater than or equal to 1, each first node set comprising at least one node, wherein the nodes contained in one first node set are all directly or indirectly related to a tensor output by one producer in the first computation graph and are all directly or indirectly related to at least one tensor input by one producer in the first computation graph; and performing recomputation on each first node set to obtain N recomputation subgraphs, the N recomputation subgraphs constituting the second computation graph.

The method according to claim 2, characterized in that the parameter is the memory threshold; and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold.
The method according to claim 2, characterized in that the parameter is the peak memory threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of the nodes between the producer and the consumer corresponding to the first node set is higher than the peak memory threshold.

The method according to claim 2, characterized in that the parameter is the data mutation threshold; the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.

The method according to claim 2, characterized in that the parameter is the operator fusion rule; the first node set can be fused with the consumer corresponding to the first node set; or, the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
The method according to claim 2, wherein the parameters are the memory threshold and the peak memory threshold; no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; and/or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold.

The method according to claim 2, wherein the parameters are the memory threshold and the data mutation threshold; the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold, and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
The method according to claim 2, wherein the parameters are the memory threshold and the operator fusion rule, and the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.

The method according to claim 2, wherein the parameters are the peak memory threshold and the data mutation threshold; an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; and/or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
The method according to claim 2, wherein the parameters are the peak memory threshold and the operator fusion rule, and the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.

The method according to claim 2, wherein the parameters are the data mutation threshold and the operator fusion rule, and the first node set meets one or more of the following conditions: the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
The method according to claim 2, wherein the parameters are the memory threshold, the peak memory threshold, and the data mutation threshold, and the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold.
The method according to claim 2, wherein the parameters are the memory threshold, the peak memory threshold, and the operator fusion rule, and the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
The method according to claim 2, wherein the parameters are the memory threshold, the data mutation threshold, and the operator fusion rule, and the first node set meets one or more of the following conditions: the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
The method according to claim 2, wherein the parameters are the peak memory threshold, the data mutation threshold, and the operator fusion rule, and the first node set meets one or more of the following conditions: an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
The method according to claim 2, wherein the parameters are the memory threshold, the peak memory threshold, the data mutation threshold, and the operator fusion rule, and the first node set meets one or more of the following conditions: no indirect path exists between the producer and the consumer corresponding to the first node set, and the memory occupied by the output tensor corresponding to the first node set is higher than the memory threshold; or an indirect path exists between the producer and the consumer corresponding to the first node set, and the peak memory during execution of each node between that producer and that consumer is higher than the peak memory threshold; or the deviation between the memory occupied by the output tensor corresponding to the first node set and the memory occupied by at least one input tensor corresponding to the first node set is higher than the data mutation threshold; or the first node set can be fused with the consumer corresponding to the first node set; or the first node set cannot be fused with the consumer corresponding to the first node set, and an indirect path exists between the producer and the consumer corresponding to the first node set.
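The four-condition disjunction above can be read as a single predicate over a candidate node set. The sketch below is one possible interpretation; the dictionary keys and the way the conditions are measured are assumptions, not specified by the patent.

```python
def qualifies_for_recompute(node_set, memory_threshold, peak_threshold,
                            mutation_threshold):
    """Hypothetical predicate mirroring the claim: a first node set qualifies
    for recomputation if it meets one or more of the listed conditions."""
    out_mem = node_set["output_bytes"]
    in_mems = node_set["input_bytes"]            # one entry per input tensor
    has_indirect_path = node_set["indirect_path"]
    peak_mem = node_set["peak_bytes"]            # peak over nodes on the path
    fusable = node_set["fusable_with_consumer"]

    # Condition 1: no indirect path, and output tensor exceeds memory threshold.
    cond_mem = (not has_indirect_path) and out_mem > memory_threshold
    # Condition 2: indirect path exists, and peak memory exceeds its threshold.
    cond_peak = has_indirect_path and peak_mem > peak_threshold
    # Condition 3: output/input memory deviation exceeds the mutation threshold.
    cond_mut = any(abs(out_mem - m) > mutation_threshold for m in in_mems)
    # Condition 4: fusable with its consumer, or not fusable but indirect path.
    cond_fuse = fusable or (not fusable and has_indirect_path)
    return cond_mem or cond_peak or cond_mut or cond_fuse

# Example: a large output with no indirect path satisfies condition 1.
ns = {"output_bytes": 8_000_000, "input_bytes": [1_000_000],
      "indirect_path": False, "peak_bytes": 0,
      "fusable_with_consumer": False}
assert qualifies_for_recompute(ns, 4_000_000, 16_000_000, 10_000_000)
```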
The method according to any one of claims 2-17, wherein obtaining the at least one first node set from the producers included in the first computation graph based on the parameters and the data dependencies between nodes in the first computation graph specifically comprises: obtaining a first list based on the data dependencies between the nodes in the first computation graph, the first list including the correspondences between producers and consumers in the first computation graph; obtaining candidate node sets according to the tensor output by each producer in the first list, the candidate node sets including at least the first node set, wherein, for any producer in the first list, the set of nodes in that producer that are directly or indirectly related to the tensor output by that producer is taken as one node set; and selecting the first node set from the candidate node sets based on the parameters.

The method according to any one of claims 2-18, wherein performing recomputation on each first node set to obtain N recomputation subgraphs specifically comprises: for any node set among the N first node sets, copying the producer subgraph corresponding to that node set; and deleting the nodes in the producer subgraph other than the nodes included in that node set, and deleting the edges in the producer subgraph unrelated to that node set, to obtain the recomputation subgraph corresponding to that node set.
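The copy-then-prune step in the second claim above can be sketched with a plain adjacency-list graph. The representation (a dict mapping each node name to its successors) and the node names are illustrative assumptions, not part of the patent.

```python
import copy

def recompute_subgraph(producer_subgraph, node_set):
    """Sketch of the claimed step: copy the producer subgraph, then delete
    every node outside the first node set and every edge unrelated to it.
    `producer_subgraph` maps a node name to the list of its successors."""
    g = copy.deepcopy(producer_subgraph)          # copy the producer subgraph
    for node in list(g):
        if node not in node_set:
            del g[node]                           # delete nodes outside the set
    for node, succs in g.items():
        g[node] = [s for s in succs if s in node_set]   # delete unrelated edges
    return g

# Toy producer subgraph: load -> cast -> exp -> store; recompute only cast/exp.
producer = {"load": ["cast"], "cast": ["exp"], "exp": ["store"], "store": []}
sub = recompute_subgraph(producer, node_set={"cast", "exp"})
assert sub == {"cast": ["exp"], "exp": []}
```

The original subgraph is left untouched because the prune operates on a deep copy, matching the claim's "copy the producer subgraph" wording.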
The method according to any one of claims 2-19, wherein merging the second computation graph with the first computation graph to obtain the third computation graph specifically comprises: respectively constructing a directed edge between the node in each of the N recomputation subgraphs that inputs a first target tensor and the node in the first computation graph that outputs the first target tensor, and respectively constructing a directed edge between the node in each of the N recomputation subgraphs that outputs a second target tensor and the node in the first computation graph that inputs the second target tensor, to obtain the third computation graph.

The method according to any one of claims 1-20, wherein, before obtaining from the first computation graph the second computation graph that requires recomputation based on the parameters and the data dependencies between nodes in the first computation graph, the method further comprises: fusing the operators in the first computation graph.

A computational graph optimization apparatus, comprising: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, wherein, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1-21.
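The edge-construction step of the merge claim can be sketched as follows. The function and parameter names (`tensor_producers`, `sub_inputs`, etc.) are hypothetical conveniences for the sketch, not terminology from the patent.

```python
def merge_recompute(edges, tensor_producers, tensor_consumers,
                    sub_inputs, sub_outputs):
    """Sketch of the merge: for each tensor a recomputation subgraph reads,
    add a directed edge from the first graph's node that outputs that tensor;
    for each tensor it writes, add a directed edge to the first graph's node
    that inputs that tensor. `edges` is a list of (src, dst) pairs."""
    for tensor, sub_node in sub_inputs.items():
        edges.append((tensor_producers[tensor], sub_node))
    for tensor, sub_node in sub_outputs.items():
        edges.append((sub_node, tensor_consumers[tensor]))
    return edges

# Toy example: wire a single-node recompute subgraph between conv and relu.
edges = [("conv", "relu")]
merged = merge_recompute(
    edges,
    tensor_producers={"t1": "conv"},
    tensor_consumers={"t2": "relu"},
    sub_inputs={"t1": "conv_recompute"},
    sub_outputs={"t2": "conv_recompute"},
)
assert ("conv", "conv_recompute") in merged
assert ("conv_recompute", "relu") in merged
```

After the merge, the third computation graph contains both the original path and the recomputation path, so a downstream scheduler can free the original tensor and regenerate it on demand.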
A device, comprising: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, wherein, when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1-21.

A computer-readable storage medium storing a computer program, wherein, when the computer program is run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-21.

A computer program product, wherein, when the computer program product is run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-21.

A chip, comprising at least one processor and an interface; the interface is configured to provide program instructions or data for the at least one processor; and the at least one processor is configured to execute the program instructions to implement the method according to any one of claims 1-21.
PCT/CN2022/133304 2021-11-29 2022-11-21 Computational graph optimization method and apparatus, and device Ceased WO2023093689A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111431551.X 2021-11-29
CN202111431551.XA CN116204847A (en) 2021-11-29 2021-11-29 Calculation graph optimization method, device and equipment

Publications (1)

Publication Number Publication Date
WO2023093689A1 true WO2023093689A1 (en) 2023-06-01

Family

ID=86511609

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133304 Ceased WO2023093689A1 (en) 2021-11-29 2022-11-21 Computational graph optimization method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN116204847A (en)
WO (1) WO2023093689A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455005A (en) * 2023-10-17 2024-01-26 北京百度网讯科技有限公司 Model operator processing method, device, electronic equipment and storage medium
US20240037150A1 (en) * 2022-08-01 2024-02-01 Qualcomm Incorporated Scheduling optimization in sequence space
CN118152980A (en) * 2024-03-22 2024-06-07 上海壁仞科技股份有限公司 Bifurcation operator fusion method, bifurcation operator fusion device, bifurcation operator fusion equipment and bifurcation operator fusion storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117008916B (en) * 2023-07-06 2024-08-20 清华大学 Tensor program optimization method and device
CN117032954B (en) * 2023-07-17 2024-04-26 北京泛睿科技合伙企业(有限合伙) Memory optimization method, system, equipment and medium for terminal training model
CN119512724A (en) * 2023-08-22 2025-02-25 华为技术有限公司 Method and device for reusing neural network chip memory
CN120163671A (en) * 2023-12-15 2025-06-17 华为技术有限公司 A financial data processing method and related device
CN119806543B (en) * 2025-03-14 2025-06-10 上海燧原智能科技有限公司 Calculation graph compiling optimization method and device based on multi-stage memory chip
CN119829258A (en) * 2025-03-17 2025-04-15 宝德计算机系统股份有限公司 Method, system, electronic equipment and storage medium for optimizing calculation graph

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108292374A (en) * 2015-11-09 2018-07-17 谷歌有限责任公司 Training a neural network represented as a computational graph
CN110941494A (en) * 2019-12-02 2020-03-31 哈尔滨工程大学 Deep learning-oriented GPU parallel computing data processing method
WO2020182989A1 (en) * 2019-03-13 2020-09-17 Deepmind Technologies Limited Scheduling computation graphs using neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural Network Computational Graph Optimization Method
CN111338635B (en) * 2020-02-20 2023-09-12 腾讯科技(深圳)有限公司 Graph compiling method, device, equipment and storage medium for calculation graph
CN113703775B (en) * 2021-08-31 2023-11-28 上海阵量智能科技有限公司 Compiling method, compiling device, compiling equipment and storage medium


Also Published As

Publication number Publication date
CN116204847A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
WO2023093689A1 (en) Computational graph optimization method and apparatus, and device
US20240161474A1 (en) Neural Network Inference Acceleration Method, Target Detection Method, Device, and Storage Medium
JP6549332B2 (en) Network model construction method and apparatus based on machine learning
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CN114580263A (en) Knowledge graph-based information system fault prediction method and related equipment
WO2022143419A1 (en) Node fusion method for computational graph, and device
CN114691148A (en) Model inference acceleration method, device, electronic device and storage medium
CN113673568B (en) Method, system, computer device and storage medium for detecting tampered image
WO2022252694A1 (en) Neural network optimization method and apparatus
CN116011468A (en) Reasoning method of deep learning model, machine translation method and device
CN114780798A (en) BIM-based knowledge graph system
CN118377936A (en) Event association analysis method and device based on graph rules integrated into computing model
CN105205052A (en) Method and device for mining data
CN119760007A (en) Digital base fusion system and electronic equipment
CN112287663B (en) A text parsing method, device, terminal and storage medium
CN119179648B (en) Optimization method, device, equipment and medium for test excitation code
WO2022217419A1 (en) Neural network model inference method and apparatus, computer device, and storage medium
CN118897993A (en) A data association matching method and system based on deep learning
CN115730675A (en) Method and device for searching machine learning workflow and related equipment
Kholod et al. Decomposition of data mining algorithms into unified functional blocks
CN120315725B (en) Dependency graph generation method, risk assessment method, device, equipment and medium
CN119312810B (en) A big data relationship mining method based on graph computing
CN120706532A (en) A process rule processing method and device based on rule graph
CN117407314A (en) Performance micro-benchmark run-time prediction method based on multi-modal model
CN116755714A (en) Method, device, equipment and storage medium for operating deep neural network model

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22897772

Country of ref document: EP

Kind code of ref document: A1