WO2024247053A1 - Layer integration determination device, method, and program - Google Patents
- Publication number
- WO2024247053A1 (PCT/JP2023/019959)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- layer
- unit
- integration
- model
- layer integration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Definitions
- the disclosed technology relates to a layer integration determination device, a layer integration determination method, and a layer integration determination program.
- Deep learning is a machine learning method that uses a neural network that reproduces the mechanism of human nerve cells, and is applied to fields such as video processing and natural language processing.
- many other methods have been proposed using deep learning models, such as object detection, which determines the position and class of an object in an image, and segmentation, which infers the class of an object for each pixel.
- Methods using models equipped with an attention mechanism, such as the Transformer, are applied to tasks such as machine translation and summarization.
- Such deep learning models have achieved performance that exceeds that of conventional machine learning models, and there is a movement to utilize deep learning in various fields such as medicine and industry.
- Factors behind the improved performance of these deep learning models include the growth of computer computing power and the development of cloud technology. For example, large-scale parallel calculations have become possible by repurposing GPUs (Graphics Processing Units). In addition, the emergence of cloud services equipped with many GPUs, such as GCP (Google Cloud Platform) and AWS (Amazon Web Services), has made it easier to train large-scale deep learning models.
- Edge AI enables real-time data processing while protecting privacy, but one issue is the difficulty of securing power sources and computing resources.
- In Non-Patent Document 1, a technique called layer integration is introduced that integrates and processes the calculations of multiple layers included in a deep learning model.
- In layer integration, the convolution calculations of a deep learning model are not processed independently for each layer; instead, multiple convolution layers are integrated and processed together. This makes it possible to process the calculations using a small-capacity cache, thereby reducing access to external memory, shortening processing time, and improving power efficiency.
- a technique for automating this integration of calculations has also been proposed (Non-Patent Document 2).
- the disclosed technology has been developed in consideration of the above points, and aims to flexibly support a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolutional layers.
- a first aspect of the present disclosure is a layer integration determination device, which includes a generation unit that integrates operations corresponding to each of the convolutional layers and the surrounding layers associated with the convolutional layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model, among the operations corresponding to each layer of a deep learning model including multiple convolutional layers input in the form of a computation graph in which each layer is represented by a node, to generate a computationally integrated model; an extraction unit that extracts a subgraph from the computationally integrated model that matches a pattern registered in advance in a pattern description unit as a combination of convolutional layers that can be layer integrated based on the layer configuration of each of the convolutional layers; and a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
- the second aspect of the present disclosure is a layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, in which the generation unit integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; the extraction unit extracts a subgraph from the computation-integrated model that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
- the third aspect of the present disclosure is a layer integration determination program that causes a computer to function as each part of the layer integration determination device described above.
- the disclosed technology can flexibly accommodate a variety of deep learning models while reducing the time required to determine layer integration for deep learning models that include multiple convolutional layers.
- FIG. 1 is a block diagram showing the configuration of a conventional computation integration device.
- FIG. 2 is a diagram for explaining layers included in a deep learning model.
- FIG. 3 is a flowchart showing the flow of a conventional computation integration process.
- FIG. 4 is a block diagram showing the hardware configuration of a layer integration determination device according to the first and second embodiments.
- FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device according to the first embodiment.
- FIG. 6 is a diagram showing an example of a pattern registered in the pattern description unit.
- FIG. 7 is a diagram showing other examples of patterns registered in the pattern description unit.
- FIG. 8 is a flowchart showing the flow of the layer integration determination process according to the first embodiment.
- FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device according to the second embodiment.
- FIG. 11 is a flowchart showing the flow of the layer integration determination process according to the second embodiment.
- FIG. 1 shows the configuration of a computation integration device 1000 that uses a conventional method to automate the integration of computations in a deep learning model, such as the method described in Non-Patent Document 2.
- the deep learning model is input to the computation integration device 1000 in the form of a computation graph in which each layer of the deep learning model is represented by a node.
- the computation integration device 1000 scans the input computation graph, performs processing to integrate layers corresponding to computations that can be integrated, and outputs a computationally integrated model.
- the computation integration device 1000 functionally includes one description unit called the correspondence description unit 1040, and two processing units, a labeling unit 1022 and a computation integration unit 1024.
- As shown in FIG. 2, the calculations of the convolution (Conv) layer are integrated with the calculations of the surrounding layers associated with the convolution layer to improve calculation efficiency.
- the surrounding layers associated with the convolutional layer are the padding (Pad) layer, the batch normalization (BN) layer, and the activation function (ReLU) layer.
- In the correspondence description unit 1040, operations that can be processed by dedicated hardware (hereinafter also referred to as an "AI chip") that executes the inference processing of the deep learning model, and combinations of operations that can be integrated, are registered.
- Operations that can be processed by an AI chip are operations for which a dedicated circuit required to execute the processing of that operation is implemented on the AI chip.
- Combinations of operations that can be integrated are combinations of primitive operations that can be integrated, such as the combination of an operation of a convolution layer and the operations of each of the surrounding layers associated with it, as described above.
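As a concrete illustration of integrating a convolution with its associated surrounding layers, the batch normalization parameters can be folded into the convolution's weights so that Conv+BN+ReLU executes as one operation. This is a minimal sketch under assumed names and a 1x1-convolution restriction, not the patent's implementation:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return fused (w', b') such that BN(conv(x, w) + b) == conv(x, w') + b'."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel BN scale
    w_fused = w * scale[:, None, None, None]  # w: (oCH, iCH, kH, kW)
    b_fused = beta + (b - mean) * scale
    return w_fused, b_fused

def fused_conv_bn_relu_1x1(x, w, b):
    """1x1 convolution with folded BN, followed by ReLU; x: (iCH, H, W)."""
    y = np.einsum('oi,ihw->ohw', w[:, :, 0, 0], x) + b[:, None, None]
    return np.maximum(y, 0.0)                 # ReLU
```

With this folding, the intermediate feature maps between the BN and activation layers never need to be written to external memory, which is the cache-locality benefit layer integration targets.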
- The labeling unit 1022 individually determines whether the calculations of each layer of the deep learning model can be processed on the AI chip, and labels the layers so that the calculations to be processed on the AI chip can be identified. Specifically, the labeling unit 1022 determines whether each calculation matches the calculations described in the correspondence description unit 1040 as processable by the AI chip. The labeling unit 1022 attaches, to each layer whose calculations can be processed by the AI chip, a label indicating that the calculations will be processed by the AI chip, and attaches, to the other layers, a label indicating that they will be processed by general-purpose hardware (e.g., a CPU or GPU) that controls the AI chip. The labeling unit 1022 passes the labeled model, in which each layer of the deep learning model (each node of the computation graph) has been labeled, to the computation integration unit 1024.
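The labeling step can be sketched as follows, with a small set standing in for the operations registered in the correspondence description unit; the set contents and names are illustrative assumptions:

```python
# Operations assumed to have dedicated circuits on the AI chip (illustrative).
AI_CHIP_OPS = {"Pad", "Conv", "BN", "ReLU"}

def label_model(graph):
    """graph: list of op-name strings in execution order.

    Returns (op, label) pairs, where the label says whether the layer's
    computation runs on the AI chip or on general-purpose hardware.
    """
    return [(op, "ai_chip" if op in AI_CHIP_OPS else "general")
            for op in graph]
```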
- In the labeled model, the computation integration unit 1024 combines into one layer each group of layers that is labeled to be processed by the AI chip and whose combination of computations matches a combination of computations described in the correspondence description unit 1040.
- the computation integration unit 1024 thereby reconstructs a computation graph that represents the deep learning model, and generates and outputs a computation-integrated model.
- FIG. 3 shows a flowchart illustrating the flow of the computation integration process executed by the computation integration device 1000 using the conventional method.
- In step S1000, the labeling unit 1022 determines whether or not the computation of the processing target layer can be processed on the AI chip, based on the computations registered in the correspondence description unit 1040 as processable by the AI chip. If the computation can be processed, the process proceeds to step S1004; if it cannot, the process proceeds to step S1006.
- In step S1004, the labeling unit 1022 attaches a label to the processing target layer indicating that it will be processed by the AI chip.
- In step S1006, the labeling unit 1022 attaches a label to the processing target layer indicating that it will be processed by general-purpose hardware.
- In step S1008, the labeling unit 1022 determines whether scanning of the deep learning model is complete, i.e., whether labeling has finished for all layers included in the deep learning model. If scanning is complete, the process proceeds to step S1010; if not, the loop processing from step S1000 is repeated.
- In step S1010, the computation integration unit 1024 groups together sections of consecutive layers labeled to be processed by the AI chip, and reconstructs the computation graph.
- Next, the loop process of step S1012 is executed with each section as the processing target.
- Within this loop, the computation integration unit 1024 performs pattern matching between the combination of operations of the layers in the processing target section and the combinations of primitive operations registered in the correspondence description unit 1040 as integrable.
- The computation integration unit 1024 determines whether the layers in the processing target section can be integrated according to whether the combination of operations matches. If integration is possible, the process proceeds to step S1016; if not, the process proceeds to step S1018.
- In step S1016, the layer combinations corresponding to the operation combinations matched by pattern matching are integrated by replacing them with a single layer, and the computation graph is reconstructed.
- In step S1018, the computation integration unit 1024 determines whether scanning of the deep learning model is complete, that is, whether the integration determination has finished for all sections included in the deep learning model. If scanning is complete, the process proceeds to step S1020; if not, the loop process of step S1012 is repeated until no integrable sections remain.
- In step S1020, the computation integration unit 1024 outputs the computation graph finally reconstructed in step S1016 as the computation-integrated model, and the computation integration process ends.
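The grouping and integration steps above (S1010 through S1016) can be sketched as follows for a linear model; the integrable-combination set stands in for the correspondence description unit 1040, and all names are illustrative assumptions:

```python
# Combinations of primitive operations assumed integrable (illustrative).
INTEGRABLE = {("Pad", "Conv", "BN", "ReLU"), ("Conv", "BN", "ReLU")}

def group_sections(labeled):
    """Group consecutive 'ai_chip'-labeled layers into sections (S1010)."""
    sections, cur = [], []
    for op, label in labeled:
        if label == "ai_chip":
            cur.append(op)
        else:
            if cur:
                sections.append(cur)
                cur = []
            sections.append([op])   # general-purpose layer stays on its own
    if cur:
        sections.append(cur)
    return sections

def integrate(sections):
    """Replace each section matching an integrable combination (S1016)."""
    out = []
    for sec in sections:
        if tuple(sec) in INTEGRABLE:
            out.append("ConvBlock(" + "+".join(sec) + ")")  # fused layer
        else:
            out.extend(sec)
    return out
```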
- [First Embodiment] FIG. 4 is a block diagram showing the hardware configuration of the layer integration determination device 10 according to the first embodiment.
- the layer integration determination device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17.
- Each component is connected to each other via a bus 19 so as to be able to communicate with each other.
- the CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads out a program from the ROM 12 or storage 14, and executes the program using the RAM 13 as a working area. The CPU 11 controls each of the above components and performs various calculation processes according to the program stored in the ROM 12 or storage 14. In this embodiment, the layer integration determination program described below is stored in the ROM 12 or storage 14.
- ROM 12 stores various programs and data.
- RAM 13 temporarily stores programs or data as a working area.
- Storage 14 is made up of storage devices such as HDD (Hard Disk Drive) and SSD (Solid State Drive), and stores various programs and data including the operating system.
- the input unit 15 includes a pointing device such as a mouse and a keyboard, and is used to perform various input operations.
- the display unit 16 is, for example, a liquid crystal display, and displays various types of information.
- the display unit 16 may also function as the input unit 15 by employing a touch panel system.
- the communication I/F 17 is an interface for communicating with other devices.
- For the communication I/F 17, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used.
- FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device 10.
- the layer integration determination device 10 includes, as its functional configuration, a generation unit 21, an extraction unit 26, a determination unit 28, and an optimization unit 30.
- a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 10.
- Each functional configuration is realized by the CPU 11 reading out a layer integration determination program stored in the ROM 12 or storage 14, expanding it in the RAM 13, and executing it.
- the deep learning model to be processed by layer integration is input to the layer integration determination device 10 in the form of a computation graph in which each layer, such as a convolution (Conv) layer and an activation function (Activation) layer, is represented by a node.
- Each node holds parameter information related to the layer corresponding to the node.
- the parameters are, for example, the size of the input and output feature maps, the kernel size and number of channels used in the convolution operation, the number of multiplications of the convolution matrix operation and activation function, the number of additions for bias addition, etc.
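A minimal stand-in for such a node might look as follows; the field names and the multiply-accumulate estimate are assumptions for illustration, not the patent's data format:

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One node of the computation graph, holding its layer parameters."""
    op: str            # e.g. "Conv", "ReLU"
    in_shape: tuple    # input feature-map size (CH, H, W)
    out_shape: tuple   # output feature-map size (CH, H, W)
    kernel: int = 0    # kernel size k (0 for non-convolution layers)

    def macs(self):
        """Rough multiply-accumulate count of a convolution node."""
        if self.op != "Conv":
            return 0
        oCH, oH, oW = self.out_shape
        iCH = self.in_shape[0]
        return self.kernel * self.kernel * iCH * oCH * oH * oW
```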
- the correspondence description unit 40 is similar to the correspondence description unit 1040 of the conventional method described in FIG. 1. That is, the correspondence description unit 40 registers operations that can be processed by the AI chip and combinations of primitive operations that can be integrated.
- the generation unit 21 integrates the calculations corresponding to each layer of the input deep learning model, including the convolutional layer that can be processed by the AI chip and the surrounding layers associated with the convolutional layer, to generate a calculation-integrated model.
- the generation unit 21 includes a labeling unit 22 and a calculation integration unit 24.
- the labeling unit 22 identifiably labels, among the layers of the input deep learning model, layers of operations that match operations that can be processed by the AI chip and that are pre-registered in the correspondence description unit 40. Other specific details of the labeling unit 22 are similar to those of the labeling unit 1022 in FIG. 1, so detailed explanations will be omitted.
- the labeling unit 22 passes the labeled model (computation graph) to the computation integration unit 24.
- Based on the labeled model passed from the labeling unit 22, the computation integration unit 24 identifies the combinations of operations that correspond to layers labeled as processable by the AI chip. The computation integration unit 24 then groups, as processing blocks, those combinations that match the combinations of primitive operations preregistered in the correspondence description unit 40, and reconstructs the computation graph. In this way, the computation integration unit 24 generates a computation-integrated model. Other specific details of the computation integration unit 24 are similar to those of the computation integration unit 1024 in FIG. 1, so a detailed description is omitted. The computation integration unit 24 passes the generated computation-integrated model to the extraction unit 26.
- the pattern description unit 42 stores patterns of combinations of convolutional layers that can be merged based on the layer structure of each convolutional layer. Specifically, the pattern description unit 42 stores a comprehensive set of partial graph patterns of deep learning models that satisfy constraints that take into account the "number of convolutional layers that can be merged" and the "connections between layers" determined by the AI chip specifications.
- FIG. 6 shows an example of a subgraph pattern.
- FIG. 6(a) is a computation graph showing a combination of layers of integrable operations registered in the correspondence description unit 40 shown in FIG. 2. This is treated as one processing block (Conv Block), and a pattern combining multiple Conv Blocks, for example, as shown in FIG. 6(b), is registered in the pattern description unit 42.
- the pattern may also include other operators such as an upsampling layer.
- the pattern may also be a more complex pattern including a layer such as a residual layer (Res) that combines a skip connection and an addition layer.
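One hedged way to realize such a pattern description unit is to enumerate all linear chains of processing blocks up to a hardware-determined maximum length; the block types, the limit of three, and the omission of branching (skip-connection) patterns are assumptions for illustration:

```python
from itertools import product

MAX_CONVS = 3   # assumed AI-chip limit on integrable convolution layers

def enumerate_linear_patterns(block_types=("ConvBlock",), max_len=MAX_CONVS):
    """Return all linear chains of 1..max_len blocks as tuples of block names."""
    patterns = []
    for n in range(1, max_len + 1):
        patterns.extend(product(block_types, repeat=n))
    return patterns
```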
- The condition description unit 44 registers conditions for layer integration based on the AI chip specifications. For example, a conditional expression for the cache usage on the AI chip required to read the kernels of the convolution layers is specified from the parameters of each node, as in the following equation (1):

  Σ_{a=l_s}^{l_e} k_a² × iCH_a × oCH_a ≤ C_cache   … (1)

- Here, the layer integration section runs from the l_s-th layer to the l_e-th layer, k_a is the kernel size of the a-th convolution layer, iCH_a is the number of input channels of the a-th layer, and oCH_a is the number of output channels of the a-th layer. C_cache on the right-hand side represents the cache capacity, a value determined by the specifications of the AI chip.
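The cache usage condition can then be evaluated directly from each node's parameters; a sketch assuming the kernel cache usage of one convolution layer is k_a² × iCH_a × oCH_a weight values:

```python
def kernel_cache_usage(layers):
    """layers: list of (k, iCH, oCH) for each convolution in the section."""
    return sum(k * k * iCH * oCH for k, iCH, oCH in layers)

def satisfies_cache_condition(layers, c_cache):
    """True if the section's kernels fit in the AI chip's cache capacity."""
    return kernel_cache_usage(layers) <= c_cache
```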
- the extraction unit 26 extracts, as a subgraph, a portion of the computationally integrated model (computation graph) passed from the computation integration unit 24 that matches a pattern registered in advance in the pattern description unit 42. Specifically, the extraction unit 26 scans the computation graph indicating the computationally integrated model, matches it with the pattern registered in the pattern description unit 42, and extracts the section where the pattern matches as a subgraph that is a candidate for a layer integration section. The extraction unit 26 passes the extracted subgraph to the determination unit 28.
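For a linear computation graph, the extraction unit's scan-and-match step can be sketched as a sliding-window search for registered patterns; real computation graphs are DAGs, so this simplification, and all names, are assumptions:

```python
def extract_subgraphs(model, patterns):
    """Scan the block sequence and extract non-overlapping pattern matches.

    model: list of block names; patterns: iterable of tuples of block names
    (longer patterns should come first). Returns candidate sections [i, j).
    """
    matches, i = [], 0
    while i < len(model):
        for pat in patterns:
            if tuple(model[i:i + len(pat)]) == pat:
                matches.append((i, i + len(pat)))
                i += len(pat) - 1   # skip past the matched section
                break
        i += 1
    return matches
```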
- the determination unit 28 determines, from among the subgraphs passed from the extraction unit 26, a section corresponding to a subgraph that satisfies the conditions registered in advance in the condition description unit 44 as a layer integration section. Specifically, the determination unit 28 extracts parameters of each node in the subgraph, and reads out the conditional expression registered in the condition description unit 44. The determination unit 28 then uses the extracted parameters to calculate the read out conditional expression, and determines whether or not the subgraph satisfies the corresponding conditional expression. The determination unit 28 generates a layer-integrated model candidate by integrating each layer in the layer integration section that satisfies the condition and reconstructing the computationally integrated model. The determination unit 28 passes the generated layer-integrated model candidate to the optimization unit 30.
- the optimization unit 30 selects and outputs the optimal candidate as the layer-integrated model based on an optimization index from among multiple layer-integrated model candidates with different layer-integrated sections generated by executing the processes of the extraction unit 26 and the determination unit 28 multiple times.
- The optimization index is arbitrary, but it is preferable to use an index representing the processing time of each layer-integrated section, computed from at least one of the amount of calculation of the deep learning model and the amount of data exchanged between the external memory and the AI chip.
- For example, the optimization unit 30 calculates, as the optimization index, the total amount of product-sum calculations of each layer-integrated section and the variance of the read/write time of the convolution kernels and input/output feature maps.
- the optimization unit 30 selects and outputs the layer-integrated model candidate with the smallest variance as the final layer-integrated model. As a result, a candidate with the most uniform processing time for each layer-integrated section included in the layer-integrated model candidate is selected, and when assembling pipeline processing, processing delays due to waiting for the completion of the previous stage can be suppressed, enabling efficient execution of inference processing.
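The selection rule of the optimization unit can be sketched as picking the candidate whose per-section processing-time index has the smallest variance; the index values themselves are assumed inputs here:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def select_candidate(candidates):
    """candidates: dict name -> list of per-section processing-time indices.

    Returns the name of the candidate with the most uniform sections,
    i.e. the smallest variance, which suits pipeline execution.
    """
    return min(candidates, key=lambda name: variance(candidates[name]))
```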
- FIG. 8 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 10.
- The layer integration determination process is performed by the CPU 11 reading out the layer integration determination program from the ROM 12 or the storage 14, expanding it into the RAM 13, and executing it.
- The layer integration determination process is an example of the layer integration determination method of the present disclosure.
- step S10 the CPU 11, functioning as the generation unit 21, executes a computation integration process.
- the computation integration process is similar to the computation integration process of the conventional method shown in FIG. 3.
- In step S12, a loop is executed a specified number of times.
- the specified number of repetitions is specified externally as a hyperparameter.
- In step S14, the CPU 11, functioning as the extraction unit 26, changes the order in which the patterns registered in the pattern description unit 42 are matched.
- the order of the patterns can be set arbitrarily; for example, the order may be changed randomly, or the order may be changed regularly by grouping the patterns by the number of convolutional layers in the patterns.
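The reordering in step S14 can be sketched with the two strategies mentioned above; the mode names and the longest-first grouping are assumptions:

```python
import random

def reorder_patterns(patterns, mode="random", seed=None):
    """Reorder patterns before matching, randomly or by convolution count."""
    if mode == "random":
        rng = random.Random(seed)     # seeded for reproducibility
        shuffled = list(patterns)
        rng.shuffle(shuffled)
        return shuffled
    if mode == "by_conv_count":       # e.g. longest patterns matched first
        return sorted(patterns, key=len, reverse=True)
    raise ValueError(f"unknown mode: {mode}")
```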
- In step S16, the CPU 11 executes a layer-integrated model candidate generation process.
- In step S18, the CPU 11, functioning as the optimization unit 30, determines whether the repetition has reached the designated number of times. If it has, the process proceeds to step S20; if it has not, the loop process of step S12 is repeated.
- In the layer-integrated model candidate generation process, the CPU 11 sets each of the patterns registered in the pattern description unit 42 as the pattern to be processed, in the order changed in step S14, and executes the loop process of step S160. Specifically, the CPU 11 scans the computationally integrated model and executes the loop process of step S162.
- In step S164, the CPU 11, functioning as the extraction unit 26, determines whether any unintegrated sections remain, i.e., sections other than the layer integration sections extracted based on patterns other than the pattern to be processed. If any unintegrated sections remain, the process proceeds to step S166; if none remain, the process proceeds to step S176.
- In step S166, the CPU 11, functioning as the extraction unit 26, searches the unintegrated sections of the computationally integrated model (computation graph) for a subgraph that matches the pattern to be processed.
- The CPU 11, functioning as the extraction unit 26, determines whether a subgraph that matches the pattern to be processed exists. If such a subgraph exists, the process proceeds to step S168; if not, the loop process of step S162 ends.
- In step S168, the CPU 11, functioning as the extraction unit 26, extracts the subgraph that matches the pattern to be processed. Then, the CPU 11, functioning as the determination unit 28, extracts the parameters of each node in the subgraph and reads out the conditional expression registered in the condition description unit 44. Furthermore, the CPU 11, functioning as the determination unit 28, evaluates the conditional expression using the extracted parameters and determines whether the subgraph satisfies it. If the subgraph satisfies the condition, the process proceeds to step S170; if it does not, the loop process of step S162 ends.
- In step S170, the CPU 11, functioning as the extraction unit 26, determines whether scanning of the entire computationally integrated model has been completed. If scanning has not been completed, the loop process of step S162 is repeated; if it has been completed, the process proceeds to step S174.
- In step S174, the CPU 11, functioning as the extraction unit 26, determines whether the matching process against the computationally integrated model has been completed for all patterns registered in the pattern description unit 42. If unprocessed patterns remain, the loop process of step S160 is repeated for them; if all patterns have been processed, the process proceeds to step S176.
- Step S176 is reached when no unintegrated sections remain in the computationally integrated model, or when the matching process has been completed for all patterns. In step S176, the CPU 11, functioning as the determination unit 28, generates a layer-integrated model candidate by integrating the layers in each layer integration section extracted in step S170 and reconstructing the computationally integrated model. Note that layers labeled to be processed by the AI chip but not included in any layer integration section are treated as being processed as single layers, and are output together with the layer integration sections as part of the layer-integrated model candidate. The layer-integrated model candidate generation process then ends, and the process returns to the layer integration determination process (FIG. 8).
- the loop process (step S12) including the layer-integrated model candidate generation process in step S16 is repeated a specified number of times, generating a specified number of layer-integrated model candidates.
- In step S20, the CPU 11, functioning as the optimization unit 30, executes a loop with each generated layer-integrated model candidate as the processing target.
- In step S22, the CPU 11, functioning as the optimization unit 30, calculates, as the optimization index, the variance of an index representing the processing time of each layer-integrated section included in the layer-integrated model candidate to be processed.
- In step S24, the CPU 11, functioning as the optimization unit 30, determines whether the optimization index has been calculated for all layer-integrated model candidates. If any unprocessed candidates remain, the loop process of step S20 is repeated; if all have been processed, the process proceeds to step S26.
- In step S26, the CPU 11, functioning as the optimization unit 30, selects the optimal layer-integrated model from the layer-integrated model candidates based on the optimization index. For example, the CPU 11 selects and outputs, as the final layer-integrated model, the candidate with the smallest variance of the index representing the processing time of each layer-integrated section. The layer integration determination process then ends.
- the layer integration determination device integrates the operations corresponding to each of the convolution layers and the surrounding layers associated with the convolution layers that can be processed by the dedicated hardware that executes the inference processing of the deep learning model, among the operations corresponding to each layer of the deep learning model including multiple convolution layers input in the form of a computation graph in which each layer is represented by a node, to generate a computation-integrated model, extracts a subgraph from the computation-integrated model that matches a pattern preregistered in the pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers, determines a section corresponding to a subgraph that satisfies a condition preregistered in the condition description section as a condition for layer integration based on the specifications of the dedicated hardware, and selects and outputs the optimal layer-integrated model based on a predetermined index from among multiple layer-integrated model candidates with different layer-integration sections.
- a deep learning model is scanned in advance to generate the conditions used in optimization as additional conditional expressions, which are passed to the determination unit, so that the processing of the extraction unit and the determination unit can be completed in a single pass.
- the same components as those in the layer integration determination device 10 according to the first embodiment are denoted by the same reference numerals, and detailed descriptions thereof are omitted.
- the hardware configuration of the layer integration determination device according to the second embodiment is the same as that of the layer integration determination device 10 according to the first embodiment shown in FIG. 4, and its description is therefore omitted.
- FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device 210.
- the layer integration determination device 210 includes, as its functional configuration, a generation unit 21 including a labeling unit 22 and a calculation integration unit 24, an extraction unit 26, a determination unit 228, and an addition unit 232.
- a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 210.
- Each functional configuration is realized by the CPU 11 reading out a layer integration determination program stored in the ROM 12 or storage 14, expanding it in the RAM 13, and executing it.
- the adding unit 232 adds, to the conditions used by the determining unit 228, an optimization condition based on a predetermined number of layer integration sections. For example, the adding unit 232 adds, as a condition, a range of the amount of calculation for each layer integration section included in the layer-integrated model that equalizes the processing time across layer integration sections, using at least one of the amount of calculation of the deep learning model and the amount of data exchanged between the external memory and the AI chip.
- the adding unit 232 acquires the deep learning model (computation graph) input to the layer integration determination device 210 and scans the computation graph to estimate the amount of calculation of the entire deep learning model.
- the adding unit 232 estimates the amount of calculation from parameters of the deep learning model, including the sizes of the input and output feature maps, the kernel size used in the convolution operation, the number of channels, the numbers of multiplications in the convolution matrix operation and the activation function, the number of additions for bias addition, and so on. These parameters are held by each node of the computation graph, as described above.
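As a rough sketch of this estimation, the multiply count of a single convolution layer can be computed from the parameters held by its node; the function below is an illustrative assumption that counts only convolution multiplications, ignoring bias additions and activation-function costs.

```python
def conv_mults(out_h, out_w, out_ch, in_ch, kh, kw):
    """Multiplications in one convolution layer: each output element
    (out_h * out_w * out_ch of them) requires in_ch * kh * kw
    multiplications. Summing this over all convolution nodes gives a
    whole-model estimate of the amount of calculation."""
    return out_h * out_w * out_ch * in_ch * kh * kw
```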
- the adding unit 232 sets an upper limit and a lower limit on the amount of calculation per layer integration section from the estimated amount of calculation of the entire deep learning model. Specifically, the adding unit 232 calculates the amount of calculation per layer integration section by dividing the amount of calculation of the entire deep learning model by the expected number of layer integration sections. The expected number of layer integration sections may be given manually as a hyperparameter, or may be estimated mechanically. When it is estimated mechanically, for example, a layer-integrated model is generated without any additional conditional expression from the adding unit 232, and the number of layer integration sections included in the generated model is adopted. The adding unit 232 then sets the upper and lower limits based on the amount of calculation per layer integration section.
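A minimal sketch of this bound-setting step, assuming the upper and lower limits are a fixed fractional margin around the per-section average (the ±25% margin is an invented example; the patent does not specify how the limits are derived from the average):

```python
def section_bounds(total_mults, expected_sections, margin=0.25):
    """Divide the whole-model amount of calculation by the expected
    number of layer integration sections, then widen the per-section
    average by a margin to obtain the lower and upper limits
    (C_min, C_max) for each section."""
    per_section = total_mults / expected_sections
    return per_section * (1 - margin), per_section * (1 + margin)
```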
- the adding unit 232 passes the formula used to calculate the amount of calculation, and an additional conditional expression indicating the range of the amount of calculation per layer integration section, to the determining unit 228.
- the additional conditional expression is written, for example, as in the following formula (2):

  C_min ≤ Σ_{l = l_s}^{l_e} M_l ≤ C_max   …(2)

- here, the layer integration section runs from the l_s-th layer to the l_e-th layer, M_l denotes the number of multiplications of the l-th layer, C_min is the lower limit of the amount of calculation, and C_max is the upper limit of the amount of calculation.
- formula (2) is a conditional judgment expression for judging whether the sum of the numbers of multiplications of the convolution layers and activation function layers in the section lies within the range from the lower limit to the upper limit.
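Evaluating this conditional expression for a candidate section then amounts to a range check on the summed multiply counts; a hedged sketch (the function name and list-based input are illustrative):

```python
def satisfies_formula_2(layer_mults, c_min, c_max):
    """Condition of formula (2): the total number of multiplications of
    the convolution and activation-function layers in the candidate
    section must lie between the lower limit c_min and upper limit c_max."""
    return c_min <= sum(layer_mults) <= c_max
```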
- the conditions to be added are not limited to conditions related to the amount of calculation of the deep learning model; they may be conditions using the amount of data exchanged with the external memory, or combinations of these.
- formula (2) shows an example of an additional conditional expression in which both upper and lower limits are set, but an additional conditional expression in which only an upper limit is set may also be used.
- an upper limit on the amount of calculation per layer integration section may be set based on the computing power of the AI chip and the processing time of one stage of pipeline processing.
- the adding unit 232 may pass the additional conditional expression to the determination unit 228 by overwriting the conditional expression registered in the condition description unit 44 with the additional conditional expression.
- FIG. 11 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 210.
- the layer integration determination process is performed by the CPU 11 reading the layer integration determination program from the ROM 12 or the storage 14, expanding it into the RAM 13, and executing it.
- step S210 the CPU 11, as the adding unit 232, determines the expected number of layer integration sections, for example by acquiring a manually assigned hyperparameter or by estimating it mechanically.
- step S212 the CPU 11, as the adding unit 232, acquires the deep learning model (computation graph) input to the layer integration determination device 210, scans the computation graph, and estimates the amount of calculation for the entire deep learning model from the parameters of the deep learning model.
- step S214 the CPU 11, as the adding unit 232, calculates the amount of calculation per layer integration section by dividing the amount of calculation of the entire deep learning model by the expected number of layer integration sections. The CPU 11, as the adding unit 232, then sets a range of the amount of calculation per layer integration section (e.g., upper and lower limits) based on the calculated value and creates the additional conditional expression. Next, in step S216, the CPU 11, as the adding unit 232, passes the formula used to calculate the amount of calculation and the created additional conditional expression to the determining unit 228.
- step S10 the CPU 11, functioning as the generation unit 21, executes a calculation integration process.
- the calculation integration process is similar to the calculation integration process of the conventional method shown in FIG. 3.
- the layer-integrated model generation process is similar to the layer-integrated model candidate generation process shown in FIG. 9.
- the CPU 11, functioning as the determination unit 228, determines whether the subgraph satisfies the condition passed from the adding unit 232 together with the condition registered in the condition description unit 44.
- the model generated in step S176 is therefore not a layer-integrated model candidate but the final layer-integrated model.
- step S220 the determination unit 228 outputs the generated layer-integrated model, and the layer-integration determination process ends.
- the layer integration determination device generates a layer-integrated model by integrating layers within layer integration sections that satisfy not only the layer integration conditions based on the specifications of the AI chip registered in the condition description section, but also optimization conditions based on a predetermined number of layer integration sections, such as a range of the amount of calculation per layer integration section.
- the layer integration determination process executed by the CPU reading software (a program) in each of the above embodiments may be executed by various processors other than the CPU.
- processors in this case include PLDs (Programmable Logic Devices) such as FPGAs (Field-Programmable Gate Arrays), whose circuit configuration can be changed after manufacture, and dedicated electric circuits such as ASICs (Application Specific Integrated Circuits), which are processors with a circuit configuration designed exclusively to execute specific processing.
- the layer integration determination process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, multiple FPGAs, or a combination of a CPU and an FPGA).
- the hardware structure of these various processors is, more specifically, an electric circuit combining circuit elements such as semiconductor elements.
- the layer integration determination program has been described as being pre-stored (installed) in the storage, but this is not limiting.
- the program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory.
- the program may also be downloaded from an external device via a network.
- A layer integration determination device comprising: a memory; and at least one processor coupled to the memory, wherein the processor is configured to: integrate, among the operations corresponding to each layer of a deep learning model including a plurality of convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; extract, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and determine, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description section as a condition for layer integration based on the specifications of the dedicated hardware.
- A non-transitory storage medium storing a program executable by a computer to execute a layer integration determination process, the layer integration determination process including: integrating, among the operations corresponding to each layer of a deep learning model including a plurality of convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; extracting, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and determining, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description section as a condition for layer integration based on the specifications of the dedicated hardware.
Description
The disclosed technology relates to a layer integration determination device, a layer integration determination method, and a layer integration determination program.
Deep learning is a machine learning method that uses neural networks reproducing the mechanism of human nerve cells, and is applied to fields such as video processing and natural language processing. For example, in the field of video processing, an image recognition method has been proposed that uses a deep learning model called a convolutional neural network (CNN) to determine the class of an object in an image. Many other methods using deep learning models have also been proposed, such as object detection, which determines the position and class of an object in an image, and segmentation, which infers the class of an object for each pixel. In the field of natural language processing, methods using models equipped with an attention mechanism, such as the Transformer, are applied to tasks such as machine translation and summarization. Such deep learning models have achieved performance exceeding that of conventional machine learning models, and there are moves to utilize deep learning in various fields such as medicine and industry.
Factors behind the improved performance of these deep learning models include the increase in computer computing power and the development of cloud technology. For example, repurposing GPUs (Graphical Processing Units) has made large-scale parallel computation possible. In addition, the emergence of cloud services equipped with many GPUs, such as GCP (Google Cloud Platform) and AWS (Amazon Web Services), has made it easy to train large-scale deep learning models.
When a trained model is used to perform inference on data acquired by a terminal such as an in-vehicle camera or a smartphone, there are two approaches: executing the inference on the cloud, or executing it on the terminal that acquired the data. With the former, real-time performance is compromised by network delays, and sending data to the cloud via the Internet can pose security risks and privacy violations. For this reason, edge AI (Artificial Intelligence), which executes inference processing on the terminal that acquired the data, has attracted attention in recent years.
While edge AI enables real-time data processing with privacy protected, it has the problem that power and computing resources are difficult to secure. In particular, when inference is performed on mobile platforms such as drones and smartphones, weight and other restrictions make it difficult to mount power-hungry devices such as GPUs. Therefore, by mounting hardware specialized for inference processing, called an AI chip, on the terminal, the computing resources required for edge AI are secured while keeping power consumption low.
To use the limited computing resources efficiently, a technique called layer integration has also been introduced, which integrates and processes the operations of multiple layers included in a deep learning model (Non-Patent Document 1). In layer integration, the convolution operations of a deep learning model are not processed independently layer by layer; instead, multiple convolution layers are integrated and processed together. This enables processing using a small-capacity cache, which suppresses access to external memory, shortens processing time, and improves power efficiency. A technique for automating this integration of operations has also been proposed (Non-Patent Document 2).
Although conventional methods can automate the integration of primitive operations in a deep learning model, they do not consider layer integration of deep learning models that include multiple convolution layers. Therefore, when layer integration of such a model is performed by a conventional method, the target deep learning model must be analyzed in detail and the layer integration decision made individually for each model, which makes the decision time-consuming.
The disclosed technology was made in view of the above points, and aims to flexibly accommodate a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolution layers.
A first aspect of the present disclosure is a layer integration determination device including: a generation unit that integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; an extraction unit that extracts from the computation-integrated model a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
A second aspect of the present disclosure is a layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, in which: the generation unit integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; the extraction unit extracts from the computation-integrated model a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
A third aspect of the present disclosure is a layer integration determination program that causes a computer to function as each unit of the layer integration determination device described above.
According to the disclosed technology, it is possible to flexibly accommodate a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolution layers.
An example of an embodiment of the disclosed technology will be described below with reference to the drawings. In the drawings, identical or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.
<About conventional methods>
First, before describing each embodiment in detail, a conventional method will be described.
FIG. 1 shows the configuration of a computation integration device 1000 according to a conventional method that automates the integration of operations in a deep learning model, such as the method described in Non-Patent Document 2.
A deep learning model is input to the computation integration device 1000 in the form of a computation graph in which each layer of the model is represented by a node. The computation integration device 1000 scans the input computation graph, integrates layers corresponding to operations that can be integrated, and outputs a computation-integrated model. As shown in FIG. 1, the computation integration device 1000 functionally includes one description unit, called the correspondence description unit 1040, and two processing units, a labeling unit 1022 and a computation integration unit 1024.
Generally, when hardware executes the inference processing of a machine learning model, the operation of a convolution (Conv) layer as shown in FIG. 2 is integrated with the operations of the surrounding layers associated with it, to improve computational efficiency. In the example of FIG. 2, the associated surrounding layers are a padding (Pad) layer, a batch normalization (BN) layer, and an activation function (ReLU) layer.
Accordingly, the correspondence description unit 1040 registers, among the layers of a deep learning model, the operations that can be processed by the dedicated hardware that executes the inference processing of the deep learning model (hereinafter also called an "AI chip" as an example) and the combinations of operations that can be integrated. Operations that can be processed by the AI chip are operations for which a dedicated circuit needed to execute them is implemented on the AI chip. Combinations of operations that can be integrated are combinations of primitive operations that can be integrated, such as the combination of a convolution layer's operation and the operations of its associated surrounding layers described above.
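As an illustration of why a BN layer can be absorbed into the preceding convolution, its per-channel affine transform can be folded into the convolution weights and bias. This is a standard sketch of batch-norm folding under assumed array shapes, not code from the patent:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into conv weights/bias so that a
    convolution followed by BN becomes a single equivalent convolution.
    w: (out_ch, in_ch, kh, kw); b and all BN vectors: (out_ch,)."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    w_fused = w * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta       # fold mean/shift into bias
    return w_fused, b_fused
```

For a 1x1 convolution the check is a plain matrix product, which makes it easy to confirm that the fused layer reproduces conv-then-BN exactly.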
The labeling unit 1022 individually determines whether each layer's operation in the deep learning model can be processed on the AI chip, and labels the layers so that those to be processed on the AI chip are identifiable. Specifically, the labeling unit 1022 determines whether each operation matches an operation described in the correspondence description unit 1040 as processable by the AI chip. It attaches, to layers whose operations the AI chip can process, a label indicating processing on the AI chip, and to the other layers, a label indicating processing on general-purpose hardware (e.g., a CPU or GPU) used for controlling the AI chip and the like. The labeling unit 1022 passes the labeled model, in which each layer of the deep learning model (each node of the computation graph) has been labeled, to the computation integration unit 1024.
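The labeling pass can be sketched as a single scan over the graph nodes; the dictionary-based node representation and the set of supported ops below are illustrative assumptions, not the patent's data structures.

```python
# Ops assumed (for illustration) to have dedicated circuits on the AI chip.
AI_CHIP_OPS = {"Conv", "Pad", "BatchNorm", "ReLU"}

def label_model(nodes):
    """Tag every layer (graph node) with the device that will execute it:
    the AI chip if the op is supported, otherwise the general-purpose
    hardware used for controlling the AI chip."""
    for node in nodes:
        node["device"] = "ai_chip" if node["op"] in AI_CHIP_OPS else "cpu"
    return nodes
```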
In the labeled model, the computation integration unit 1024 combines into one layer each group of layers that corresponds to a combination of operations labeled for processing on the AI chip and that matches a combination of operations described in the correspondence description unit 1040. The computation integration unit 1024 thereby reconstructs the computation graph representing the deep learning model, and generates and outputs a computation-integrated model.
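The merging step amounts to replacing runs of operations that match a registered combination with a single fused node. A simplified greedy sketch over a linear op sequence (the registered patterns are invented examples; real graphs can branch):

```python
# Example registered combinations of integrable primitive operations.
FUSABLE = [("Pad", "Conv", "BatchNorm", "ReLU"), ("Conv", "ReLU")]

def merge_ops(ops):
    """Scan the op sequence and collapse each registered run into one
    fused node, preferring longer patterns, as the computation
    integration unit does when reconstructing the graph."""
    merged, i = [], 0
    patterns = sorted(FUSABLE, key=len, reverse=True)
    while i < len(ops):
        for pat in patterns:
            if tuple(ops[i:i + len(pat)]) == pat:
                merged.append("Fused(" + "+".join(pat) + ")")
                i += len(pat)
                break
        else:
            merged.append(ops[i])
            i += 1
    return merged
```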
図3に、従来手法の演算統合装置1000で実行される演算統合処理の流れを示すフローチャートを示す。 FIG. 3 shows a flowchart illustrating the flow of the computation integration process executed by the computation integration device 1000 using the conventional method.
演算統合装置1000に計算グラフで表現された深層学習モデルが入力されると、深層学習モデルを走査しながら、深層学習モデルに含まれる各層(計算グラフの各ノード)を処理対象として、ステップS1000のループ処理が実行される。具体的には、ステップS1002で、ラベリング部1022が、対応関係記述部1040に登録された、AIチップで処理可能な演算に基づいて、処理対象の層の演算がAIチップ上で処理可能か否かを判定する。処理可能な場合には、ステップS1004へ移行し、処理可能ではない場合には、ステップS1006へ移行する。 When a deep learning model represented by a computation graph is input to the computation integration device 1000, the deep learning model is scanned and the loop process of step S1000 is executed with each layer (each node of the computation graph) included in the deep learning model as the processing target. Specifically, in step S1002, the labeling unit 1022 determines whether or not the computation of the processing target layer can be processed on the AI chip based on the computations that can be processed by the AI chip and that are registered in the correspondence description unit 1040. If the computation can be processed, the process proceeds to step S1004, and if the computation cannot be processed, the process proceeds to step S1006.
ステップS1004では、ラベリング部1022が、処理対象の層に、AIチップで処理することを示すラベルを付す。一方、ステップS1006では、ラベリング部1022が、処理対象の層に、汎用的なハードウェアで処理することを示すラベルを付す。 In step S1004, the labeling unit 1022 attaches a label to the layer to be processed indicating that it will be processed by an AI chip. On the other hand, in step S1006, the labeling unit 1022 attaches a label to the layer to be processed indicating that it will be processed by general-purpose hardware.
次に、ステップS1008で、ラベリング部1022が、深層学習モデルの走査を終了したか、すなわち、深層学習モデルに含まれる全ての層についてラベリング処理が終了したか否かを判定する。走査が終了している場合には、ステップS1010へ移行し、走査が終了していない場合には、ステップS1000のループ処理を繰り返す。ステップS1010では、演算統合部1024が、AIチップで処理することを示すラベルが付された層が連続する区間をまとめて、計算グラフを再構成する。 Next, in step S1008, the labeling unit 1022 determines whether scanning of the deep learning model is complete, i.e., whether labeling processing is complete for all layers included in the deep learning model. If scanning is complete, the process proceeds to step S1010, and if scanning is not complete, the loop processing of step S1000 is repeated. In step S1010, the computation integration unit 1024 groups together sections in which consecutive layers are labeled to be processed by the AI chip, and reconstructs the computation graph.
そして、深層学習モデルを走査しながら、各区間を処理対象として、ステップS1012のループ処理が実行される。具体的には、ステップS1014で、演算統合部1024が、処理対象の区間内の各層の演算の組み合わせと、対応関係記述部1040に登録された統合可能なプリミティブな演算の組み合わせとのパターンマッチングを行う。演算統合部1024は、パターンマッチングにより、演算の組み合わせが一致するか否かにより、処理対象の区間内の各層を統合可能か否かを判定する。統合可能な場合には、ステップS1016へ移行し、統合できない場合には、ステップS1018へ移行する。 Then, while scanning the deep learning model, the loop process of step S1012 is executed with each section as the processing target. Specifically, in step S1014, the operation integration unit 1024 performs pattern matching between the combination of operations of each layer in the processing target section and the combination of primitive operations that can be integrated and are registered in the correspondence description unit 1040. The operation integration unit 1024 determines whether or not each layer in the processing target section can be integrated depending on whether or not the combination of operations matches through pattern matching. If integration is possible, the process proceeds to step S1016, and if integration is not possible, the process proceeds to step S1018.
ステップS1016では、パターンマッチングにより一致した演算の組み合わせに対応する層の組み合わせを1つの層に置き換えることにより統合し、計算グラフを再構成する。次に、ステップS1018で、演算統合部1024が、深層学習モデルの走査を終了したか、すなわち、深層学習モデルに含まれる全ての区間について、その区間内の層を統合するか否かの判定処理が終了したか否かを判定する。走査が終了している場合には、ステップS1020へ移行し、走査が終了していない場合には、統合可能な区間がなくなるまでステップS1012のループ処理を繰り返す。ステップS1020では、演算統合部1024が、上記ステップS1016で最終的に再構成した計算グラフを、演算統合済みモデルとして出力し、演算統合処理は終了する。 In step S1016, the layer combinations corresponding to the operation combinations that match through pattern matching are integrated by replacing them with one layer, and the computation graph is reconstructed. Next, in step S1018, the computation integration unit 1024 determines whether scanning of the deep learning model is complete, that is, whether the process of determining whether to integrate layers within all sections included in the deep learning model is complete. If scanning is complete, the process proceeds to step S1020, and if scanning is not complete, the loop process of step S1012 is repeated until there are no more sections that can be integrated. In step S1020, the computation integration unit 1024 outputs the computation graph finally reconstructed in step S1016 as a computationally integrated model, and the computation integration process ends.
As described above, the conventional method can automate the integration of primitive operations in a deep learning model, but it does not consider layer integration for deep learning models that contain multiple convolutional layers. When multiple convolutional layers are involved, the network structure differs from model to model. Therefore, to apply layer integration to deep learning models containing multiple convolutional layers with the conventional method, each target model had to be analyzed in detail and handled individually.
Specifically, with the conventional method, the combinations of convolutional layers to be integrated must be determined from the constraint on the maximum number of convolutional layers that the AI chip can integrate, in accordance with the chip's specifications. In addition, for each combination of convolutional layers to be integrated, it must be confirmed that the kernel sizes, input/output data sizes, and so on within the layer integration section fit within the cache capacity of the AI chip. The conventional method therefore has the problem that it cannot flexibly handle layer integration for a variety of deep learning models.
Accordingly, the following embodiments propose a method that, for an arbitrary input deep learning model, automates both the extraction of layer integration sections satisfying constraints based on the AI chip specifications and the determination of the optimal layer integration sections.
First Embodiment
FIG. 4 is a block diagram showing the hardware configuration of a layer integration determination device 10 according to the first embodiment. As shown in FIG. 4, the layer integration determination device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17. These components are connected to each other via a bus 19 so as to be able to communicate with each other.
The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls each of the above components and performs various arithmetic processes in accordance with the programs stored in the ROM 12 or the storage 14. In this embodiment, a layer integration determination program, described later, is stored in the ROM 12 or the storage 14.
The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is configured from a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.
The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used to perform various inputs. The display unit 16 is, for example, a liquid crystal display, and displays various types of information. The display unit 16 may employ a touch panel system and also function as the input unit 15.
The communication I/F 17 is an interface for communicating with other devices. For this communication, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used, for example.
Next, the functional configuration of the layer integration determination device 10 according to the first embodiment will be described. FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device 10. As shown in FIG. 5, the layer integration determination device 10 includes, as its functional configuration, a generation unit 21, an extraction unit 26, a determination unit 28, and an optimization unit 30. In addition, a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 10. Each functional component is realized by the CPU 11 reading the layer integration determination program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
A deep learning model to be subjected to layer integration, containing multiple convolutional layers, is input to the layer integration determination device 10 in the form of a computation graph in which each layer, such as a convolution (Conv) layer or an activation function (Activation) layer, is represented by a node. Each node holds parameter information about the layer corresponding to that node. The parameters are, for example, the sizes of the input and output feature maps, the kernel size and the number of channels used in the convolution operation, the number of multiplications in the convolution matrix operation and the activation function, and the number of additions in the bias addition.
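As an illustrative sketch only (not the disclosed implementation), one way to represent such a computation-graph node holding the per-layer parameters above is shown below; all field names are assumptions chosen for readability.

```python
class LayerNode:
    """One node of the computation graph, holding its layer's parameters."""
    def __init__(self, op, in_shape, out_shape,
                 kernel=None, in_ch=None, out_ch=None, mults=0, adds=0):
        self.op = op              # layer type, e.g. "conv" or "activation"
        self.in_shape = in_shape  # input feature-map size (H, W)
        self.out_shape = out_shape
        self.kernel = kernel      # kernel size (conv layers only)
        self.in_ch = in_ch        # number of input channels
        self.out_ch = out_ch      # number of output channels
        self.mults = mults        # multiplication count (matrix op / activation)
        self.adds = adds          # addition count (bias addition)
        self.succ = []            # edges to succeeding layer nodes

def connect(a, b):
    """Add a directed edge a -> b and return b for chaining."""
    a.succ.append(b)
    return b

# A minimal two-node graph: Conv -> Activation
conv = LayerNode("conv", (32, 32), (32, 32),
                 kernel=3, in_ch=16, out_ch=32, mults=3 * 3 * 16 * 32 * 32 * 32)
act = connect(conv, LayerNode("activation", (32, 32), (32, 32), mults=32 * 32 * 32))
```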
The correspondence description unit 40 is similar to the correspondence description unit 1040 of the conventional method described with reference to FIG. 1. That is, operations that can be processed by the AI chip and mergeable combinations of primitive operations are registered in the correspondence description unit 40.
The generation unit 21 integrates, among the operations corresponding to the layers of the input deep learning model, the operations corresponding to each convolutional layer that can be processed by the AI chip and the surrounding layers associated with that convolutional layer, thereby generating an operation-integrated model. Specifically, the generation unit 21 includes a labeling unit 22 and an operation integration unit 24.
The labeling unit 22 labels, among the layers of the input deep learning model, the layers whose operations match the AI-chip-processable operations preregistered in the correspondence description unit 40, so that those layers can be identified. The other, more specific details of the labeling unit 22 are the same as those of the labeling unit 1022 in FIG. 1, so a detailed description is omitted. The labeling unit 22 passes the labeled model (computation graph) to the operation integration unit 24.
Based on the labeled model passed from the labeling unit 22, the operation integration unit 24 identifies the combinations of operations corresponding to the layers labeled as AI-chip-processable. The operation integration unit 24 then reconstructs the computation graph by grouping into processing blocks those identified combinations that match the mergeable combinations of primitive operations preregistered in the correspondence description unit 40. In this way, the operation integration unit 24 generates an operation-integrated model. The other, more specific details of the operation integration unit 24 are the same as those of the operation integration unit 1024 in FIG. 1, so a detailed description is omitted. The operation integration unit 24 passes the generated operation-integrated model to the extraction unit 26.
Patterns of combinations of convolutional layers that can be layer-integrated based on the layer configuration of each convolutional layer are registered in the pattern description unit 42. Specifically, subgraph patterns of deep learning models that satisfy constraints taking into account the "number of convolutional layers that can be layer-integrated" and the "connections between the layers", both determined by the AI chip specifications, are comprehensively registered in advance in the pattern description unit 42.
FIG. 6 shows examples of subgraph patterns. FIG. 6(a) is a computation graph representing a mergeable combination of operation layers registered in the correspondence description unit 40 shown in FIG. 2. Treating this as one processing block (Conv Block), a pattern combining multiple Conv Blocks, such as that in FIG. 6(b), is registered in the pattern description unit 42. As shown in FIG. 6(b), a pattern may also include other operators such as an upsampling (Upsample) layer. As shown in FIG. 7(c), a pattern may also be more complex, including a layer such as a residual layer (Res) that combines a skip connection with an addition (Add) layer.
Conditions for layer integration based on the AI chip specifications are registered in the condition description unit 44. For example, a conditional expression for the cache usage on the AI chip required to load the kernels of the convolutional layers is defined in advance from the parameters of each node, as in expression (1) below.
Σ_{a=l_s}^{l_e} (k_a × k_a × iCH_a × oCH_a) ≤ C_cache … (1)
In expression (1), the layer integration section runs from the l_s-th layer to the l_e-th layer, k_a is the kernel size of the a-th convolutional layer, iCH_a is the number of input channels of the a-th layer, and oCH_a is the number of output channels of the a-th layer. C_cache on the right-hand side represents the cache capacity, a value determined according to the specifications of the AI chip.
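As an illustrative sketch only (not the disclosed implementation), the cache condition of expression (1) can be checked as follows: the kernel storage of every convolutional layer in a candidate section is summed and compared against the cache capacity. Field names and the cache size used in the example are assumptions.

```python
def kernel_cache_usage(section):
    """Sum k_a * k_a * iCH_a * oCH_a over the conv layers of the section."""
    return sum(n["k"] * n["k"] * n["in_ch"] * n["out_ch"]
               for n in section if n["op"] == "conv")

def satisfies_cache_condition(section, c_cache):
    """Expression (1): total kernel cache usage must not exceed capacity."""
    return kernel_cache_usage(section) <= c_cache

section = [
    {"op": "conv", "k": 3, "in_ch": 16, "out_ch": 32},
    {"op": "activation"},
    {"op": "conv", "k": 1, "in_ch": 32, "out_ch": 32},
]
# 3*3*16*32 + 1*1*32*32 = 4608 + 1024 = 5632 weight entries
```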
The extraction unit 26 extracts, as subgraphs, the portions of the operation-integrated model (computation graph) passed from the operation integration unit 24 that match patterns preregistered in the pattern description unit 42. Specifically, the extraction unit 26 scans the computation graph representing the operation-integrated model, matches it against the patterns registered in the pattern description unit 42, and extracts each matched section as a subgraph that is a candidate layer integration section. The extraction unit 26 passes the extracted subgraphs to the determination unit 28.
The determination unit 28 determines, among the subgraphs passed from the extraction unit 26, the sections corresponding to subgraphs that satisfy the conditions preregistered in the condition description unit 44 as layer integration sections. Specifically, the determination unit 28 extracts the parameters of each node in a subgraph and reads the conditional expressions registered in the condition description unit 44. The determination unit 28 then evaluates the read conditional expressions using the extracted parameters and determines whether the subgraph satisfies them. The determination unit 28 generates a layer-integrated model candidate by integrating the layers within each layer integration section that satisfies the conditions and reconstructing the operation-integrated model. The determination unit 28 passes the generated layer-integrated model candidate to the optimization unit 30.
The optimization unit 30 selects, based on an optimization index, the optimal candidate from among multiple layer-integrated model candidates, each with different layer integration sections, generated by executing the processes of the extraction unit 26 and the determination unit 28 multiple times, and outputs it as the layer-integrated model. The optimization index is arbitrary, but it is preferably an index based on a numerical value representing the processing time of each layer integration section, specifically, at least one of the computational cost of the deep learning model and the amount of data exchanged between external memory and the AI chip. For example, the optimization unit 30 computes, as the optimization index, the variance of quantities such as the total number of multiply-accumulate operations in each layer integration section, or the read/write times of the convolution kernels and the input/output features. The optimization unit 30 selects the layer-integrated model candidate with the smallest variance and outputs it as the final layer-integrated model. In this way, the candidate whose layer integration sections have the most uniform processing times is selected; when pipeline processing is configured, this suppresses processing delays caused by waiting for the preceding stage to finish, enabling efficient execution of inference processing.
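As an illustrative sketch only (not the disclosed implementation), the variance-based selection above can be expressed as follows, with each candidate reduced to a list of per-section costs (a proxy for processing time, e.g. multiply-accumulate counts); the numbers are assumptions.

```python
def variance(xs):
    """Population variance of a list of per-section cost values."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def select_best_candidate(candidates):
    """Pick the candidate whose per-section costs are most uniform."""
    return min(candidates, key=variance)

candidates = [
    [100, 900, 100],  # very uneven stages -> pipeline stalls
    [350, 400, 350],  # nearly uniform stages
]
best = select_best_candidate(candidates)
# best == [350, 400, 350]
```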
Next, the operation of the layer integration determination device 10 according to the first embodiment will be described. FIG. 8 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 10. The layer integration determination process is performed by the CPU 11 reading the layer integration determination program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. The layer integration determination process is an example of the layer integration determination method of the present disclosure.
In step S10, the CPU 11, acting as the generation unit 21, executes the operation integration process. The operation integration process is the same as the operation integration process of the conventional method shown in FIG. 3. Next, as the loop processing of step S12, the process is repeated a specified number of times. The specified number of repetitions is given externally as a hyperparameter.
Specifically, in step S14, the CPU 11, acting as the extraction unit 26, changes the order in which the patterns registered in the pattern description unit 42 are matched. The pattern order can be set arbitrarily; for example, the order may be shuffled randomly, or changed systematically by grouping the patterns by the number of convolutional layers they contain. This is done so that the process of step S16, described later, generates a different layer-integrated model candidate in each iteration.
Next, in step S16, the CPU 11 executes the layer-integrated model candidate generation process. Then, in step S18, the CPU 11, acting as the optimization unit 30, determines whether the number of iterations has reached the specified count. If it has, the process proceeds to step S20; if not, the loop processing of step S12 is repeated.
Here, the layer-integrated model candidate generation process executed in step S16 will be described with reference to FIG. 9.
The CPU 11 sets the patterns registered in the pattern description unit 42 as patterns to be processed, in the order changed in step S14, and executes the loop processing of step S160. Specifically, the CPU 11 scans the operation-integrated model and executes the loop processing of step S162.
In step S164, the CPU 11, acting as the extraction unit 26, determines whether any unintegrated sections remain, that is, sections other than the layer integration sections extracted based on patterns other than the pattern currently being processed. If unintegrated sections remain, the process proceeds to step S166; if not, it proceeds to step S176.
In step S166, the CPU 11, acting as the extraction unit 26, searches the unintegrated sections of the operation-integrated model (computation graph) for a subgraph matching the pattern being processed, and determines whether such a subgraph exists. If a matching subgraph exists, the process proceeds to step S168; if not, the loop processing of step S162 ends.
In step S168, the CPU 11, acting as the extraction unit 26, extracts the subgraph that matched the pattern being processed. Then, the CPU 11, acting as the determination unit 28, extracts the parameters of each node in the subgraph and reads the conditional expressions registered in the condition description unit 44. Furthermore, the CPU 11, acting as the determination unit 28, evaluates the read conditional expressions using the extracted parameters and determines whether the subgraph satisfies them. If the subgraph satisfies the conditions, the process proceeds to step S170; if not, the loop processing of step S162 ends.
In step S170, the CPU 11, acting as the extraction unit 26, determines whether scanning of the entire operation-integrated model is complete. If scanning is not complete, the loop processing of step S162 is repeated; if it is complete, the process proceeds to step S174. In step S174, the CPU 11, acting as the extraction unit 26, determines whether the matching process against the operation-integrated model has been completed for all patterns registered in the pattern description unit 42. If unprocessed patterns remain, the loop processing of step S160 is repeated for the unprocessed patterns; if all patterns have been processed, the process proceeds to step S176.
At step S176, either no unintegrated sections remain in the operation-integrated model, or the matching process has been completed for all patterns. The CPU 11, acting as the determination unit 28, generates a layer-integrated model candidate by integrating the layers within the layer integration sections extracted in step S170 and reconstructing the operation-integrated model. Note that any layer labeled for processing on the AI chip that is not included in any layer integration section is processed as a single layer, and is output as part of the layer-integrated model candidate together with the layer integration sections. The layer-integrated model candidate generation process then ends, and the process returns to the layer integration determination process (FIG. 8). By repeating the loop processing (step S12), including the candidate generation process of step S16, the specified number of times, that number of layer-integrated model candidates is generated.
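As an illustrative sketch only, and under the simplifying assumption that the model and patterns can be flattened to sequences of fused-block names, the iteration of steps S14 and S16 can be read as: shuffle the pattern order, then greedily match patterns left to right over the not-yet-integrated part of the model, so that different iterations yield different candidate partitions. All names here are assumptions.

```python
import random

def match_greedy(ops, patterns):
    """Partition `ops` into sections, trying patterns in the given order."""
    sections, i = [], 0
    while i < len(ops):
        for p in patterns:
            if tuple(ops[i:i + len(p)]) == p:
                sections.append(p)       # matched layer integration section
                i += len(p)
                break
        else:
            sections.append((ops[i],))   # unmatched layer, processed singly
            i += 1
    return sections

def generate_candidates(ops, patterns, n_iter, seed=0):
    """Step S12 loop: one candidate per iteration, pattern order reshuffled."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        order = patterns[:]
        rng.shuffle(order)               # step S14: change the matching order
        candidates.append(match_greedy(ops, order))
    return candidates

ops = ["conv_block", "conv_block", "conv_block"]
patterns = [("conv_block", "conv_block"), ("conv_block",)]
cands = generate_candidates(ops, patterns, n_iter=4)
```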
Next, in step S20, the CPU 11, acting as the optimization unit 30, executes the loop processing of step S20 for each generated layer-integrated model candidate. Specifically, in step S22, the CPU 11, acting as the optimization unit 30, computes, as the optimization index, the variance of an index representing the processing time of each layer integration section included in the candidate being processed.
Next, in step S24, the CPU 11, acting as the optimization unit 30, determines whether the optimization index has been computed for all layer-integrated model candidates. If unprocessed candidates remain, the loop processing of step S20 is repeated; if all have been processed, the process proceeds to step S26.
In step S26, the CPU 11, acting as the optimization unit 30, selects the optimal layer-integrated model from the candidates based on the optimization index. For example, the CPU 11 selects, as the final layer-integrated model, the candidate for which the variance of the index representing the processing time of each layer integration section is smallest, and outputs it. The layer integration determination process then ends.
As described above, the layer integration determination device according to the first embodiment operates as follows. From among the operations corresponding to the layers of a deep learning model containing multiple convolutional layers, input in the form of a computation graph in which each layer is represented by a node, it integrates the operations corresponding to each convolutional layer that can be processed by the dedicated hardware executing the inference processing of the deep learning model and the surrounding layers associated with that convolutional layer, thereby generating an operation-integrated model. From the operation-integrated model, it extracts subgraphs matching the patterns preregistered in the pattern description unit as combinations of convolutional layers that can be layer-integrated based on the layer configuration of each convolutional layer. It determines, as layer integration sections, the sections corresponding to subgraphs satisfying the conditions preregistered in the condition description unit as layer integration conditions based on the specifications of the dedicated hardware. It then selects, based on a predetermined index, the optimal layer-integrated model from among multiple layer-integrated model candidates with different layer integration sections, and outputs it. This makes it possible, with only a deep learning model expressed as a computation graph as input, to shorten the time required to determine layer integration for deep learning models containing multiple convolutional layers while flexibly handling a variety of deep learning models.
Second Embodiment
The second embodiment describes a configuration in which the deep learning model is scanned in advance to generate, as additional conditional expressions, the conditions used for optimization, and these are passed to the determination unit, so that the processing of the extraction unit and the determination unit needs to be performed only once. In the layer integration determination device according to the second embodiment, components that are the same as in the layer integration determination device 10 according to the first embodiment are given the same reference numerals, and detailed descriptions thereof are omitted. The hardware configuration of the layer integration determination device according to the second embodiment is the same as that of the layer integration determination device 10 according to the first embodiment shown in FIG. 4, so its description is also omitted.
The functional configuration of a layer integration determination device 210 according to the second embodiment will now be described. FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device 210. As shown in FIG. 10, the layer integration determination device 210 includes, as its functional configuration, a generation unit 21 including a labeling unit 22 and an operation integration unit 24, an extraction unit 26, a determination unit 228, and an addition unit 232. In addition, a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 210. Each functional component is realized by the CPU 11 reading the layer integration determination program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
The addition unit 232 adds, to the conditions used by the determination unit 228, an optimization condition based on a predetermined number of layer integration sections. For example, using at least one of the computational cost of the deep learning model and the amount of data exchanged between external memory and the AI chip, the addition unit 232 adds as a condition a range of computational cost per layer integration section chosen so that the processing times of the layer integration sections in the layer-integrated model become uniform.
More specifically, the addition unit 232 acquires the deep learning model (computation graph) input to the layer integration determination device 210 and scans the computation graph to estimate the computational cost of the entire deep learning model. The addition unit 232 estimates the computational cost from the parameters of the deep learning model, including the sizes of the input and output feature maps, the kernel size and number of channels used in the convolution operation, the number of multiplications in the convolution matrix operation and the activation function, and the number of additions in the bias addition. As described above, these parameters are held by each node of the computation graph.
From the estimated amount of computation of the entire model, the addition unit 232 sets an upper limit and a lower limit on the amount of computation per layer integration section. Specifically, the addition unit 232 calculates the amount of computation per layer integration section by dividing the amount of computation of the entire deep learning model by the expected number of layer integration sections. The expected number of layer integration sections may be given manually as a hyperparameter, or may be estimated mechanically. In the latter case, for example, a layer-integrated model is first generated without any additional conditional expression from the addition unit 232, and the number of layer integration sections included in the generated model is adopted. The addition unit 232 then sets the upper and lower limits based on the amount of computation per layer integration section.
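The division step can be sketched as follows. The text does not specify how far the upper and lower limits are widened around the per-section average, so the `tolerance` parameter here is an assumption:

```python
def section_budget(total_ops, num_sections, tolerance=0.2):
    """Divide the whole-model operation count evenly across the expected
    number of layer integration sections, then widen the result by a
    symmetric tolerance to obtain the lower and upper limits (C_min, C_max)."""
    per_section = total_ops / num_sections
    c_min = per_section * (1 - tolerance)
    c_max = per_section * (1 + tolerance)
    return c_min, c_max
```

A tighter tolerance pushes the sections toward equal processing times, at the cost of fewer candidate sections satisfying the condition.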
The addition unit 232 passes the formula used to calculate the amount of computation, together with an additional conditional expression indicating the range of the amount of computation per layer integration section, to the determination unit 228. The additional conditional expression is written, for example, as the following formula (2).
$$C_{\min} \le \sum_{\substack{a = l_s \\ \mathrm{type}_a \in \{\mathrm{conv},\,\mathrm{activation}\}}}^{l_e} m_a \le C_{\max} \qquad (2)$$

In formula (2), the layer integration section runs from the $l_s$-th layer to the $l_e$-th layer, $m_a$ is the number of multiplications of the $a$-th layer, which is a convolution layer or an activation function layer (type = conv, activation), $C_{\min}$ is the lower limit on the amount of computation, and $C_{\max}$ is the upper limit. Formula (2) is a conditional judgment expression that tests whether the sum of the multiplication counts of the convolution and activation function layers falls within the range from the lower limit to the upper limit.
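A direct check of the condition expressed by formula (2) might look like the following. The list-of-dicts layer representation is an illustrative assumption:

```python
def satisfies_condition(layers, l_s, l_e, c_min, c_max):
    """Formula (2): the sum of multiplication counts m_a over the convolution
    and activation layers in the candidate section [l_s, l_e] (inclusive)
    must fall within [C_min, C_max]."""
    total = sum(layer["m"] for layer in layers[l_s:l_e + 1]
                if layer["type"] in ("conv", "activation"))
    return c_min <= total <= c_max
```

Layers of other types inside the section (e.g. pooling) contribute nothing to the sum, exactly as in the formula.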
The added condition is not limited to a condition on the amount of computation of the deep learning model; it may be a condition using the amount of data exchanged with the external memory, a combination of these, or the like. Also, although formula (2) shows an example of an additional conditional expression that sets both an upper limit and a lower limit, an additional conditional expression that sets only an upper limit may be used. For example, the upper limit on the amount of computation per layer integration section may be set from the computing power of the AI chip and the processing time of one stage of pipeline processing.
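The upper-limit-only variant mentioned here could be sketched as below, assuming the chip's computing power is expressed in operations per second and the stage time in seconds (both units are assumptions):

```python
def upper_limit_from_chip(ops_per_second, stage_seconds):
    """Upper-limit-only bound: the most work one pipeline stage can finish
    in its allotted time on the AI chip."""
    return ops_per_second * stage_seconds
```

A section whose multiplication count exceeds this bound would stall the pipeline stage it is assigned to.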
When a conditional expression with the same calculation method as the created additional conditional expression is already registered in the condition description unit 44, the addition unit 232 may pass the additional conditional expression to the determination unit 228 by overwriting the registered conditional expression with it.
Like the determination unit 28 in the first embodiment, the determination unit 228 determines whether the subgraph extracted by the extraction unit 26 satisfies the conditions. However, the determination unit 228 of the second embodiment determines whether the subgraph satisfies not only the layer integration conditions based on the AI chip specifications registered in the condition description unit 44 but also the optimization condition, based on the number of layer integration sections, passed from the addition unit 232. The determination unit 228 integrates the layers within each layer integration section that satisfies the conditions, and generates and outputs a layer-integrated model obtained by reconstructing the computation-integrated model.
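Judging the registered conditions together with the added one amounts to a conjunction over all condition predicates. A minimal sketch, with hypothetical predicate signatures:

```python
def section_is_valid(subgraph, registered_conditions, additional_conditions):
    """Accept a candidate section only when it passes every condition:
    both those registered in the condition description unit (chip
    specifications) and those passed in from the addition unit."""
    return all(cond(subgraph)
               for cond in registered_conditions + additional_conditions)
```

Each condition is modeled as a callable taking the candidate subgraph and returning a boolean; the actual representation of conditions in the condition description unit is not specified in the text.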
Next, the operation of the layer integration determination device 210 according to the second embodiment will be described. FIG. 11 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 210. The layer integration determination process is performed by the CPU 11 reading out the layer integration determination program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
In step S210, the CPU 11, as the addition unit 232, determines the expected number of layer integration sections, for example by acquiring a manually given hyperparameter or by estimating it mechanically. Next, in step S212, the CPU 11, as the addition unit 232, acquires the deep learning model (computation graph) input to the layer integration determination device 210, scans the computation graph, and estimates the amount of computation of the entire deep learning model from the parameters of the deep learning model.
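The mechanical estimate in step S210 can be sketched as a dry run: generate a layer-integrated model with no additional condition and adopt its section count. The generator's signature and the `sections` attribute are hypothetical:

```python
def expected_section_count(graph, generate_layer_integrated_model):
    """Mechanical estimate of the expected number of layer integration
    sections: run the generation pipeline without any additional conditional
    expression and count the sections of the resulting model."""
    model = generate_layer_integrated_model(graph, extra_conditions=[])
    return len(model.sections)
```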
Next, in step S214, the CPU 11, as the addition unit 232, calculates the amount of computation per layer integration section by dividing the amount of computation of the entire deep learning model by the expected number of layer integration sections. The CPU 11, as the addition unit 232, then sets a range (for example, upper and lower limits) of the amount of computation per layer integration section based on the calculated amount, and creates the additional conditional expression. Next, in step S216, the CPU 11, as the addition unit 232, passes the formula used to calculate the amount of computation and the created additional conditional expression to the determination unit 228.
Next, in step S10, the CPU 11, as the generation unit 21, executes the computation integration process. The computation integration process is the same as the conventional computation integration process shown in FIG. 3. Next, in step S218, the CPU 11, as the extraction unit 26 and the determination unit 228, executes the layer-integrated model generation process. This process is the same as the layer-integrated model candidate generation process shown in FIG. 9, with two differences: when determining in step S168 whether the subgraph satisfies the conditions, the CPU 11, as the determination unit 228, determines whether the subgraph satisfies the condition passed from the addition unit 232 together with the conditions registered in the condition description unit 44; and the model generated in step S176 is treated as the final layer-integrated model rather than as a layer-integrated model candidate.
Next, in step S220, the determination unit 228 outputs the generated layer-integrated model, and the layer integration determination process ends.
As described above, the layer integration determination device according to the second embodiment generates a layer-integrated model by integrating the layers within each layer integration section that satisfies both the layer integration conditions based on the AI chip specifications registered in the condition description unit and an optimization condition based on a predetermined number of layer integration sections, such as a range of the amount of computation per layer integration section. This eliminates the iterative processing of the extraction unit and the determination unit required in the first embodiment, and therefore reduces the processing time of the layer integration determination process compared with the layer integration determination device according to the first embodiment.
The layer integration determination process, which in each of the above embodiments is executed by the CPU reading and executing software (a program), may instead be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed exclusively for executing specific processing. The layer integration determination process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
In each of the above embodiments, the layer integration determination program is described as being stored (installed) in the storage in advance, but this is not limiting. The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
The following supplementary notes are further disclosed with respect to each of the above embodiments.
(Additional Note 1)
A layer integration determination device comprising:
a memory; and
at least one processor coupled to the memory,
the processor being configured to:
integrate, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
extract, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
determine, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
(Additional Note 2)
A non-transitory storage medium storing a program executable by a computer to execute a layer integration determination process, the layer integration determination process comprising:
integrating, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
extracting, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
determining, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
10, 210 Layer integration determination device
11 CPU
12 ROM
13 RAM
14 Storage
15 Input unit
16 Display unit
17 Communication I/F
19 Bus
21 Generation unit
22 Labeling unit
24 Computation integration unit
26 Extraction unit
28, 228 Determination unit
232 Addition unit
30 Optimization unit
40 Correspondence description unit
42 Pattern description unit
44 Condition description unit
Claims (8)

A layer integration determination device comprising:
a generation unit that integrates, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
an extraction unit that extracts, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.

The layer integration determination device according to any one of claims 1 to 5, wherein the generation unit includes:
a labeling unit that identifiably labels operations of the input deep learning model that match operations registered in advance in a correspondence description unit as operations processable by the dedicated hardware; and
a computation integration unit that generates the computation-integrated model by grouping into processing blocks those combinations of the labeled operations processable by the dedicated hardware that match combinations of integrable primitive operations registered in advance in the correspondence description unit, and reconstructing the computation graph.

A layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, wherein:
the generation unit integrates, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
the extraction unit extracts, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/019959 WO2024247053A1 (en) | 2023-05-29 | 2023-05-29 | Layer integration determination device, method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024247053A1 true WO2024247053A1 (en) | 2024-12-05 |
Family
ID=93657189
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2023/019959 Pending WO2024247053A1 (en) | 2023-05-29 | 2023-05-29 | Layer integration determination device, method, and program |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024247053A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190303762A1 (en) * | 2018-03-30 | 2019-10-03 | Xilinx, Inc. | Methods of optimization of computational graphs of neural networks |
Non-Patent Citations (1)
| Title |
|---|
| XIAO, Q. ET AL.: "Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs", PROCEEDINGS OF THE 54TH ANNUAL DESIGN AUTOMATION CONFERENCE 2017, 2017, pages 1 - 6, XP055573913, [retrieved on 20230731], DOI: 10.1145/3061639.3062244 * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23939532; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025523697; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025523697; Country of ref document: JP |