WO2024247053A1 - Layer integration determination device, method, and program - Google Patents
- Publication number
- WO2024247053A1 (PCT/JP2023/019959)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- layer
- unit
- integration
- model
- layer integration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Definitions
- the disclosed technology relates to a layer integration determination device, a layer integration determination method, and a layer integration determination program.
- Deep learning is a machine learning method that uses a neural network that reproduces the mechanism of human nerve cells, and is applied to fields such as video processing and natural language processing.
- many other methods have been proposed using deep learning models, such as object detection, which determines the position and class of an object in an image, and segmentation, which infers the class of an object for each pixel.
- Methods using models equipped with an attention mechanism, such as the Transformer, are applied to tasks such as machine translation and summarization.
- Such deep learning models have achieved performance that exceeds that of conventional machine learning models, and there is a movement to utilize deep learning in various fields such as medicine and industry.
- Factors behind the improved performance of these deep learning models include the growth of computer computing power and the development of cloud technology. For example, large-scale parallel calculations have become possible by repurposing GPUs (Graphics Processing Units). In addition, the emergence of cloud services equipped with many GPUs, such as GCP (Google Cloud Platform) and AWS (Amazon Web Services), has made it easier to train large-scale deep learning models.
- Edge AI enables real-time data processing while protecting privacy, but one issue is the difficulty of securing power sources and computing resources.
- In Non-Patent Document 1, a technique called layer integration is introduced that integrates and processes the calculations of multiple layers included in a deep learning model.
- In layer integration, the convolution calculations of a deep learning model are not processed independently for each layer; instead, multiple convolution layers are integrated and processed together. This makes it possible to process the calculations using a small-capacity cache, thereby reducing access to external memory, shortening processing time, and improving power efficiency.
- a technique for automating this integration of calculations has also been proposed (Non-Patent Document 2).
- the disclosed technology has been developed in consideration of the above points, and aims to flexibly support a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolutional layers.
- a first aspect of the present disclosure is a layer integration determination device, which includes a generation unit that integrates operations corresponding to each of the convolutional layers and the surrounding layers associated with the convolutional layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model, among the operations corresponding to each layer of a deep learning model including multiple convolutional layers input in the form of a computation graph in which each layer is represented by a node, to generate a computationally integrated model; an extraction unit that extracts a subgraph from the computationally integrated model that matches a pattern registered in advance in a pattern description unit as a combination of convolutional layers that can be layer integrated based on the layer configuration of each of the convolutional layers; and a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
- the second aspect of the present disclosure is a layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, in which the generation unit integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; the extraction unit extracts a subgraph from the computation-integrated model that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
- the third aspect of the present disclosure is a layer integration determination program that causes a computer to function as each part of the layer integration determination device described above.
- the disclosed technology can flexibly accommodate a variety of deep learning models while reducing the time required to determine layer integration for deep learning models that include multiple convolutional layers.
- FIG. 1 is a block diagram showing the configuration of a conventional computation integration device.
- FIG. 2 is a diagram for explaining layers included in a deep learning model.
- FIG. 3 is a flowchart showing the flow of a conventional computation integration process.
- FIG. 4 is a block diagram showing the hardware configuration of a layer integration determination device according to the first and second embodiments.
- FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device according to the first embodiment.
- FIG. 6 is a diagram showing an example of a pattern registered in the pattern description unit.
- FIG. 7 is a diagram showing other examples of patterns registered in the pattern description unit.
- FIG. 8 is a flowchart showing the flow of the layer integration determination process according to the first embodiment.
- FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device according to the second embodiment.
- FIG. 11 is a flowchart showing the flow of the layer integration determination process according to the second embodiment.
- FIG. 1 shows the configuration of a computation integration device 1000 that uses a conventional method to automate the integration of computations in a deep learning model, such as the method described in Non-Patent Document 2.
- the deep learning model is input to the computation integration device 1000 in the form of a computation graph in which each layer of the deep learning model is represented by a node.
- the computation integration device 1000 scans the input computation graph, performs processing to integrate layers corresponding to computations that can be integrated, and outputs a computationally integrated model.
- the computation integration device 1000 functionally includes one description unit called the correspondence description unit 1040, and two processing units, a labeling unit 1022 and a computation integration unit 1024.
- As shown in FIG. 2, the calculations of the convolution (Conv) layer are integrated with the calculations of the surrounding layers associated with the convolution layer to improve calculation efficiency.
- the surrounding layers associated with the convolutional layer are the padding (Pad) layer, the batch normalization (BN) layer, and the activation function (ReLU) layer.
- In the correspondence description unit 1040, operations that can be processed by dedicated hardware (hereinafter also referred to as an "AI chip") that executes the inference processing of the deep learning model, and combinations of operations that can be integrated, are registered.
- Operations that can be processed by an AI chip are operations for which a dedicated circuit required to execute the processing of that operation is implemented on the AI chip.
- Combinations of operations that can be integrated are combinations of primitive operations that can be integrated, such as the combination of an operation of a convolution layer and the operations of each of the surrounding layers associated with it, as described above.
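As a concrete illustration of integrating a convolution with its associated surrounding layers, the batch normalization parameters can be folded into the convolution's weights so that Conv+BN+ReLU executes as one operation. This is a minimal sketch under assumed names and a 1x1-convolution restriction, not the patent's implementation:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return fused (w', b') such that BN(conv(x, w) + b) == conv(x, w') + b'."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel BN scale
    w_fused = w * scale[:, None, None, None]  # w: (oCH, iCH, kH, kW)
    b_fused = beta + (b - mean) * scale
    return w_fused, b_fused

def fused_conv_bn_relu_1x1(x, w, b):
    """1x1 convolution with folded BN, followed by ReLU; x: (iCH, H, W)."""
    y = np.einsum('oi,ihw->ohw', w[:, :, 0, 0], x) + b[:, None, None]
    return np.maximum(y, 0.0)                 # ReLU
```

With this folding, the intermediate feature maps between the BN and activation layers never need to be written to external memory, which is the cache-locality benefit layer integration targets.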
- The labeling unit 1022 individually determines whether the calculations of each layer of the deep learning model can be processed on the AI chip, and labels the layers so that the calculations to be processed on the AI chip can be identified. Specifically, the labeling unit 1022 determines whether each calculation matches the calculations described in the correspondence description unit 1040 as processable by the AI chip. The labeling unit 1022 attaches, to each layer whose calculations can be processed by the AI chip, a label indicating that the calculations will be processed by the AI chip, and attaches, to the other layers, a label indicating that they will be processed by general-purpose hardware (e.g., a CPU or GPU) that controls the AI chip. The labeling unit 1022 passes the labeled model, in which each layer of the deep learning model (each node of the computation graph) has been labeled, to the computation integration unit 1024.
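The labeling step can be sketched as follows, with a small set standing in for the operations registered in the correspondence description unit; the set contents and names are illustrative assumptions:

```python
# Operations assumed to have dedicated circuits on the AI chip (illustrative).
AI_CHIP_OPS = {"Pad", "Conv", "BN", "ReLU"}

def label_model(graph):
    """graph: list of op-name strings in execution order.

    Returns (op, label) pairs, where the label says whether the layer's
    computation runs on the AI chip or on general-purpose hardware.
    """
    return [(op, "ai_chip" if op in AI_CHIP_OPS else "general")
            for op in graph]
```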
- In the labeled model, the computation integration unit 1024 combines into one layer each group of layers that is labeled to be processed by the AI chip and whose combination of computations matches a combination of computations described in the correspondence description unit 1040.
- the computation integration unit 1024 thereby reconstructs a computation graph that represents the deep learning model, and generates and outputs a computation-integrated model.
- FIG. 3 shows a flowchart illustrating the flow of the computation integration process executed by the computation integration device 1000 using the conventional method.
- In step S1000, the labeling unit 1022 determines whether or not the computation of the processing target layer can be processed on the AI chip, based on the computations registered in the correspondence description unit 1040 as processable by the AI chip. If the computation can be processed, the process proceeds to step S1004; if it cannot, the process proceeds to step S1006.
- In step S1004, the labeling unit 1022 attaches a label to the processing target layer indicating that it will be processed by the AI chip.
- In step S1006, the labeling unit 1022 attaches a label to the processing target layer indicating that it will be processed by general-purpose hardware.
- In step S1008, the labeling unit 1022 determines whether scanning of the deep learning model is complete, i.e., whether labeling has finished for all layers included in the deep learning model. If scanning is complete, the process proceeds to step S1010; if not, the loop processing from step S1000 is repeated.
- In step S1010, the computation integration unit 1024 groups together sections of consecutive layers labeled to be processed by the AI chip, and reconstructs the computation graph.
- Next, the loop process of step S1012 is executed with each section as the processing target.
- Within this loop, the computation integration unit 1024 performs pattern matching between the combination of operations of the layers in the processing target section and the combinations of primitive operations registered in the correspondence description unit 1040 as integrable.
- The computation integration unit 1024 determines whether the layers in the processing target section can be integrated according to whether the combination of operations matches. If integration is possible, the process proceeds to step S1016; if not, the process proceeds to step S1018.
- In step S1016, the layer combinations corresponding to the operation combinations matched by pattern matching are integrated by replacing them with a single layer, and the computation graph is reconstructed.
- In step S1018, the computation integration unit 1024 determines whether scanning of the deep learning model is complete, that is, whether the integration determination has finished for all sections included in the deep learning model. If scanning is complete, the process proceeds to step S1020; if not, the loop process of step S1012 is repeated until no integrable sections remain.
- In step S1020, the computation integration unit 1024 outputs the computation graph finally reconstructed in step S1016 as the computation-integrated model, and the computation integration process ends.
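The grouping and integration steps above (S1010 through S1016) can be sketched as follows for a linear model; the integrable-combination set stands in for the correspondence description unit 1040, and all names are illustrative assumptions:

```python
# Combinations of primitive operations assumed integrable (illustrative).
INTEGRABLE = {("Pad", "Conv", "BN", "ReLU"), ("Conv", "BN", "ReLU")}

def group_sections(labeled):
    """Group consecutive 'ai_chip'-labeled layers into sections (S1010)."""
    sections, cur = [], []
    for op, label in labeled:
        if label == "ai_chip":
            cur.append(op)
        else:
            if cur:
                sections.append(cur)
                cur = []
            sections.append([op])   # general-purpose layer stays on its own
    if cur:
        sections.append(cur)
    return sections

def integrate(sections):
    """Replace each section matching an integrable combination (S1016)."""
    out = []
    for sec in sections:
        if tuple(sec) in INTEGRABLE:
            out.append("ConvBlock(" + "+".join(sec) + ")")  # fused layer
        else:
            out.extend(sec)
    return out
```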
- [First Embodiment] FIG. 4 is a block diagram showing the hardware configuration of the layer integration determination device 10 according to the first embodiment.
- the layer integration determination device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17.
- Each component is connected to each other via a bus 19 so as to be able to communicate with each other.
- the CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads out a program from the ROM 12 or storage 14, and executes the program using the RAM 13 as a working area. The CPU 11 controls each of the above components and performs various calculation processes according to the program stored in the ROM 12 or storage 14. In this embodiment, the layer integration determination program described below is stored in the ROM 12 or storage 14.
- ROM 12 stores various programs and data.
- RAM 13 temporarily stores programs or data as a working area.
- Storage 14 is made up of storage devices such as HDD (Hard Disk Drive) and SSD (Solid State Drive), and stores various programs and data including the operating system.
- the input unit 15 includes a pointing device such as a mouse and a keyboard, and is used to perform various input operations.
- the display unit 16 is, for example, a liquid crystal display, and displays various types of information.
- the display unit 16 may also function as the input unit 15 by employing a touch panel system.
- the communication I/F 17 is an interface for communicating with other devices.
- For the communication I/F 17, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used.
- FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device 10.
- the layer integration determination device 10 includes, as its functional configuration, a generation unit 21, an extraction unit 26, a determination unit 28, and an optimization unit 30.
- a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 10.
- Each functional configuration is realized by the CPU 11 reading out a layer integration determination program stored in the ROM 12 or storage 14, expanding it in the RAM 13, and executing it.
- the deep learning model to be processed by layer integration is input to the layer integration determination device 10 in the form of a computation graph in which each layer, such as a convolution (Conv) layer and an activation function (Activation) layer, is represented by a node.
- Each node holds parameter information related to the layer corresponding to the node.
- the parameters are, for example, the size of the input and output feature maps, the kernel size and number of channels used in the convolution operation, the number of multiplications of the convolution matrix operation and activation function, the number of additions for bias addition, etc.
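A minimal stand-in for such a node might look as follows; the field names and the multiply-accumulate estimate are assumptions for illustration, not the patent's data format:

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One node of the computation graph, holding its layer parameters."""
    op: str            # e.g. "Conv", "ReLU"
    in_shape: tuple    # input feature-map size (CH, H, W)
    out_shape: tuple   # output feature-map size (CH, H, W)
    kernel: int = 0    # kernel size k (0 for non-convolution layers)

    def macs(self):
        """Rough multiply-accumulate count of a convolution node."""
        if self.op != "Conv":
            return 0
        oCH, oH, oW = self.out_shape
        iCH = self.in_shape[0]
        return self.kernel * self.kernel * iCH * oCH * oH * oW
```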
- the correspondence description unit 40 is similar to the correspondence description unit 1040 of the conventional method described in FIG. 1. That is, the correspondence description unit 40 registers operations that can be processed by the AI chip and combinations of primitive operations that can be integrated.
- the generation unit 21 integrates the calculations corresponding to each layer of the input deep learning model, including the convolutional layer that can be processed by the AI chip and the surrounding layers associated with the convolutional layer, to generate a calculation-integrated model.
- the generation unit 21 includes a labeling unit 22 and a calculation integration unit 24.
- the labeling unit 22 identifiably labels, among the layers of the input deep learning model, layers of operations that match operations that can be processed by the AI chip and that are pre-registered in the correspondence description unit 40. Other specific details of the labeling unit 22 are similar to those of the labeling unit 1022 in FIG. 1, so detailed explanations will be omitted.
- the labeling unit 22 passes the labeled model (computation graph) to the computation integration unit 24.
- Based on the labeled model passed from the labeling unit 22, the computation integration unit 24 identifies the combinations of operations that correspond to layers labeled as processable by the AI chip. The computation integration unit 24 then groups, as processing blocks, those combinations that match the combinations of primitive operations preregistered in the correspondence description unit 40, and reconstructs the computation graph. In this way, the computation integration unit 24 generates a computation-integrated model. Other specific details of the computation integration unit 24 are similar to those of the computation integration unit 1024 in FIG. 1, so a detailed description is omitted. The computation integration unit 24 passes the generated computation-integrated model to the extraction unit 26.
- the pattern description unit 42 stores patterns of combinations of convolutional layers that can be merged based on the layer structure of each convolutional layer. Specifically, the pattern description unit 42 stores a comprehensive set of partial graph patterns of deep learning models that satisfy constraints that take into account the "number of convolutional layers that can be merged" and the "connections between layers" determined by the AI chip specifications.
- FIG. 6 shows an example of a subgraph pattern.
- FIG. 6(a) is a computation graph showing a combination of layers of integrable operations registered in the correspondence description unit 40 shown in FIG. 2. This is treated as one processing block (Conv Block), and a pattern combining multiple Conv Blocks, for example, as shown in FIG. 6(b), is registered in the pattern description unit 42.
- the pattern may also include other operators such as an upsampling layer.
- the pattern may also be a more complex pattern including a layer such as a residual layer (Res) that combines a skip connection and an addition layer.
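One hedged way to realize such a pattern description unit is to enumerate all linear chains of processing blocks up to a hardware-determined maximum length; the block types, the limit of three, and the omission of branching (skip-connection) patterns are assumptions for illustration:

```python
from itertools import product

MAX_CONVS = 3   # assumed AI-chip limit on integrable convolution layers

def enumerate_linear_patterns(block_types=("ConvBlock",), max_len=MAX_CONVS):
    """Return all linear chains of 1..max_len blocks as tuples of block names."""
    patterns = []
    for n in range(1, max_len + 1):
        patterns.extend(product(block_types, repeat=n))
    return patterns
```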
- The condition description unit 44 registers conditions for layer integration based on the AI chip specifications. For example, a conditional expression for the cache usage on the AI chip required to read the kernels of the convolution layers is specified from the parameters of each node, as in the following equation (1):

  Σ_{a=l_s}^{l_e} k_a² × iCH_a × oCH_a ≤ C_cache   … (1)

- Here, the layer integration section runs from the l_s-th layer to the l_e-th layer, k_a is the kernel size of the a-th convolution layer, iCH_a is the number of input channels of the a-th layer, and oCH_a is the number of output channels of the a-th layer. C_cache on the right-hand side represents the cache capacity, a value determined by the specifications of the AI chip.
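The cache usage condition can then be evaluated directly from each node's parameters; a sketch assuming the kernel cache usage of one convolution layer is k_a² × iCH_a × oCH_a weight values:

```python
def kernel_cache_usage(layers):
    """layers: list of (k, iCH, oCH) for each convolution in the section."""
    return sum(k * k * iCH * oCH for k, iCH, oCH in layers)

def satisfies_cache_condition(layers, c_cache):
    """True if the section's kernels fit in the AI chip's cache capacity."""
    return kernel_cache_usage(layers) <= c_cache
```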
- the extraction unit 26 extracts, as a subgraph, a portion of the computationally integrated model (computation graph) passed from the computation integration unit 24 that matches a pattern registered in advance in the pattern description unit 42. Specifically, the extraction unit 26 scans the computation graph indicating the computationally integrated model, matches it with the pattern registered in the pattern description unit 42, and extracts the section where the pattern matches as a subgraph that is a candidate for a layer integration section. The extraction unit 26 passes the extracted subgraph to the determination unit 28.
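For a linear computation graph, the extraction unit's scan-and-match step can be sketched as a sliding-window search for registered patterns; real computation graphs are DAGs, so this simplification, and all names, are assumptions:

```python
def extract_subgraphs(model, patterns):
    """Scan the block sequence and extract non-overlapping pattern matches.

    model: list of block names; patterns: iterable of tuples of block names
    (longer patterns should come first). Returns candidate sections [i, j).
    """
    matches, i = [], 0
    while i < len(model):
        for pat in patterns:
            if tuple(model[i:i + len(pat)]) == pat:
                matches.append((i, i + len(pat)))
                i += len(pat) - 1   # skip past the matched section
                break
        i += 1
    return matches
```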
- the determination unit 28 determines, from among the subgraphs passed from the extraction unit 26, a section corresponding to a subgraph that satisfies the conditions registered in advance in the condition description unit 44 as a layer integration section. Specifically, the determination unit 28 extracts parameters of each node in the subgraph, and reads out the conditional expression registered in the condition description unit 44. The determination unit 28 then uses the extracted parameters to calculate the read out conditional expression, and determines whether or not the subgraph satisfies the corresponding conditional expression. The determination unit 28 generates a layer-integrated model candidate by integrating each layer in the layer integration section that satisfies the condition and reconstructing the computationally integrated model. The determination unit 28 passes the generated layer-integrated model candidate to the optimization unit 30.
- the optimization unit 30 selects and outputs the optimal candidate as the layer-integrated model based on an optimization index from among multiple layer-integrated model candidates with different layer-integrated sections generated by executing the processes of the extraction unit 26 and the determination unit 28 multiple times.
- The optimization index is arbitrary, but it is preferable to use an index representing the processing time of each layer-integrated section, computed from at least one of the amount of calculation of the deep learning model and the amount of data exchanged between the external memory and the AI chip.
- For example, the optimization unit 30 calculates, as the optimization index, the total amount of product-sum calculations of each layer-integrated section and the variance of the read/write time of the convolution kernels and input/output feature maps.
- the optimization unit 30 selects and outputs the layer-integrated model candidate with the smallest variance as the final layer-integrated model. As a result, a candidate with the most uniform processing time for each layer-integrated section included in the layer-integrated model candidate is selected, and when assembling pipeline processing, processing delays due to waiting for the completion of the previous stage can be suppressed, enabling efficient execution of inference processing.
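The selection rule of the optimization unit can be sketched as picking the candidate whose per-section processing-time index has the smallest variance; the index values themselves are assumed inputs here:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def select_candidate(candidates):
    """candidates: dict name -> list of per-section processing-time indices.

    Returns the name of the candidate with the most uniform sections,
    i.e. the smallest variance, which suits pipeline execution.
    """
    return min(candidates, key=lambda name: variance(candidates[name]))
```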
- FIG. 8 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 10.
- The layer integration determination process is performed by the CPU 11 reading out the layer integration determination program from the ROM 12 or the storage 14, expanding it into the RAM 13, and executing it.
- The layer integration determination process is an example of the layer integration determination method of the present disclosure.
- step S10 the CPU 11, functioning as the generation unit 21, executes a computation integration process.
- the computation integration process is similar to the computation integration process of the conventional method shown in FIG. 3.
- In step S12, a loop is executed a specified number of times.
- the specified number of repetitions is specified externally as a hyperparameter.
- In step S14, the CPU 11, functioning as the extraction unit 26, changes the order in which the patterns registered in the pattern description unit 42 are matched.
- the order of the patterns can be set arbitrarily; for example, the order may be changed randomly, or the order may be changed regularly by grouping the patterns by the number of convolutional layers in the patterns.
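The reordering in step S14 can be sketched with the two strategies mentioned above; the mode names and the longest-first grouping are assumptions:

```python
import random

def reorder_patterns(patterns, mode="random", seed=None):
    """Reorder patterns before matching, randomly or by convolution count."""
    if mode == "random":
        rng = random.Random(seed)     # seeded for reproducibility
        shuffled = list(patterns)
        rng.shuffle(shuffled)
        return shuffled
    if mode == "by_conv_count":       # e.g. longest patterns matched first
        return sorted(patterns, key=len, reverse=True)
    raise ValueError(f"unknown mode: {mode}")
```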
- In step S16, the CPU 11 executes a layer-integrated model candidate generation process.
- In step S18, the CPU 11, functioning as the optimization unit 30, determines whether the repetition has reached the designated number of times. If it has, the process proceeds to step S20; if it has not, the loop process of step S12 is repeated.
- In the layer-integrated model candidate generation process, the CPU 11 sets each of the patterns registered in the pattern description unit 42 as the pattern to be processed, in the order changed in step S14, and executes the loop process of step S160. Specifically, the CPU 11 scans the computationally integrated model and executes the loop process of step S162.
- In step S164, the CPU 11, functioning as the extraction unit 26, determines whether any unintegrated sections remain, i.e., sections other than the layer integration sections extracted based on patterns other than the pattern to be processed. If any unintegrated sections remain, the process proceeds to step S166; if none remain, the process proceeds to step S176.
- In step S166, the CPU 11, functioning as the extraction unit 26, searches the unintegrated sections of the computationally integrated model (computation graph) for a subgraph that matches the pattern to be processed.
- The CPU 11, functioning as the extraction unit 26, determines whether a subgraph that matches the pattern to be processed exists. If such a subgraph exists, the process proceeds to step S168; if not, the loop process of step S162 ends.
- In step S168, the CPU 11, functioning as the extraction unit 26, extracts the subgraph that matches the pattern to be processed. Then, the CPU 11, functioning as the determination unit 28, extracts the parameters of each node in the subgraph and reads out the conditional expression registered in the condition description unit 44. Furthermore, the CPU 11, functioning as the determination unit 28, evaluates the conditional expression using the extracted parameters and determines whether the subgraph satisfies it. If the subgraph satisfies the condition, the process proceeds to step S170; if it does not, the loop process of step S162 ends.
- In step S170, the CPU 11, functioning as the extraction unit 26, determines whether scanning of the entire computationally integrated model has been completed. If scanning has not been completed, the loop process of step S162 is repeated; if it has been completed, the process proceeds to step S174.
- In step S174, the CPU 11, functioning as the extraction unit 26, determines whether the matching process against the computationally integrated model has been completed for all patterns registered in the pattern description unit 42. If unprocessed patterns remain, the loop process of step S160 is repeated for them; if all patterns have been processed, the process proceeds to step S176.
- Step S176 is reached when no unintegrated sections remain in the computationally integrated model, or when the matching process has been completed for all patterns. In step S176, the CPU 11, functioning as the determination unit 28, generates a layer-integrated model candidate by integrating the layers in each layer integration section extracted in step S170 and reconstructing the computationally integrated model. Note that layers labeled to be processed by the AI chip but not included in any layer integration section are treated as being processed as single layers, and are output together with the layer integration sections as part of the layer-integrated model candidate. The layer-integrated model candidate generation process then ends, and the process returns to the layer integration determination process (FIG. 8).
- the loop process (step S12) including the layer-integrated model candidate generation process in step S16 is repeated a specified number of times, generating a specified number of layer-integrated model candidates.
- In step S20, the CPU 11, functioning as the optimization unit 30, executes a loop with each generated layer-integrated model candidate as the processing target.
- In step S22, the CPU 11, functioning as the optimization unit 30, calculates, as the optimization index, the variance of an index representing the processing time of each layer-integrated section included in the layer-integrated model candidate to be processed.
- In step S24, the CPU 11, functioning as the optimization unit 30, determines whether the optimization index has been calculated for all layer-integrated model candidates. If any unprocessed candidates remain, the loop process of step S20 is repeated; if all have been processed, the process proceeds to step S26.
- In step S26, the CPU 11, functioning as the optimization unit 30, selects the optimal layer-integrated model from the layer-integrated model candidates based on the optimization index. For example, the CPU 11 selects and outputs, as the final layer-integrated model, the candidate with the smallest variance of the index representing the processing time of each layer-integrated section. The layer integration determination process then ends.
- the layer integration determination device integrates the operations corresponding to each of the convolution layers and the surrounding layers associated with the convolution layers that can be processed by the dedicated hardware that executes the inference processing of the deep learning model, among the operations corresponding to each layer of the deep learning model including multiple convolution layers input in the form of a computation graph in which each layer is represented by a node, to generate a computation-integrated model, extracts a subgraph from the computation-integrated model that matches a pattern preregistered in the pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers, determines a section corresponding to a subgraph that satisfies a condition preregistered in the condition description section as a condition for layer integration based on the specifications of the dedicated hardware, and selects and outputs the optimal layer-integrated model based on a predetermined index from among multiple layer-integrated model candidates with different layer-integration sections.
- a deep learning model is scanned in advance to generate the conditions used in optimization as additional conditional expressions, which are passed to the determination unit, so that the processing of the extraction unit and the determination unit can be completed in a single pass.
- the same components as those in the layer integration determination device 10 according to the first embodiment are denoted by the same reference numerals, and detailed descriptions thereof are omitted.
- the hardware configuration of the layer integration determination device according to the second embodiment is the same as that of the layer integration determination device 10 according to the first embodiment shown in FIG. 4, and its description is therefore omitted.
- FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device 210.
- the layer integration determination device 210 includes, as its functional configuration, a generation unit 21 including a labeling unit 22 and a calculation integration unit 24, an extraction unit 26, a determination unit 228, and an addition unit 232.
- a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 210.
- Each functional configuration is realized by the CPU 11 reading out a layer integration determination program stored in the ROM 12 or storage 14, expanding it in the RAM 13, and executing it.
- the adding unit 232 adds, to the conditions used by the determining unit 228, an optimization condition based on a predetermined number of layer integration sections. For example, the adding unit 232 adds, as a condition, a range of the amount of calculation for each layer integration section included in the layer-integrated model that equalizes the processing time across layer integration sections, using at least one of the amount of calculation of the deep learning model and the amount of data exchanged between the external memory and the AI chip.
- the adding unit 232 acquires the deep learning model (computation graph) input to the layer integration determination device 210 and scans the computation graph to estimate the amount of calculation of the entire deep learning model.
- the adding unit 232 estimates the amount of calculation from parameters of the deep learning model, including the sizes of the input and output feature maps, the kernel size used in the convolution operation, the number of channels, the numbers of multiplications in the convolution matrix operation and the activation function, the number of additions for bias addition, and so on. These parameters are held by each node of the computation graph, as described above.
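As a rough sketch of this estimation, the multiply count of a single convolution layer can be computed from the parameters held by its node; the function below is an illustrative assumption that counts only convolution multiplications, ignoring bias additions and activation-function costs.

```python
def conv_mults(out_h, out_w, out_ch, in_ch, kh, kw):
    """Multiplications in one convolution layer: each output element
    (out_h * out_w * out_ch of them) requires in_ch * kh * kw
    multiplications. Summing this over all convolution nodes gives a
    whole-model estimate of the amount of calculation."""
    return out_h * out_w * out_ch * in_ch * kh * kw
```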
- the adding unit 232 sets an upper limit and a lower limit on the amount of calculation per layer integration section from the estimated amount of calculation of the entire deep learning model. Specifically, the adding unit 232 calculates the amount of calculation per layer integration section by dividing the amount of calculation of the entire deep learning model by the expected number of layer integration sections. The expected number of layer integration sections may be given manually as a hyperparameter, or may be estimated mechanically. When it is estimated mechanically, for example, a layer-integrated model is generated without any additional conditional expression from the adding unit 232, and the number of layer integration sections included in the generated model is adopted. The adding unit 232 then sets the upper and lower limits based on the amount of calculation per layer integration section.
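A minimal sketch of this bound-setting step, assuming the upper and lower limits are a fixed fractional margin around the per-section average (the ±25% margin is an invented example; the patent does not specify how the limits are derived from the average):

```python
def section_bounds(total_mults, expected_sections, margin=0.25):
    """Divide the whole-model amount of calculation by the expected
    number of layer integration sections, then widen the per-section
    average by a margin to obtain the lower and upper limits
    (C_min, C_max) for each section."""
    per_section = total_mults / expected_sections
    return per_section * (1 - margin), per_section * (1 + margin)
```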
- the adding unit 232 passes the formula used to calculate the amount of calculation, and an additional conditional expression indicating the range of the amount of calculation per layer integration section, to the determining unit 228.
- the additional conditional expression is written, for example, as in the following formula (2):

  C_min ≤ Σ_{l = l_s}^{l_e} M_l ≤ C_max   …(2)

- here, the layer integration section runs from the l_s-th layer to the l_e-th layer, M_l denotes the number of multiplications of the l-th layer, C_min is the lower limit of the amount of calculation, and C_max is the upper limit of the amount of calculation.
- formula (2) is a conditional judgment expression for judging whether the sum of the numbers of multiplications of the convolution layers and activation function layers in the section lies within the range from the lower limit to the upper limit.
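Evaluating this conditional expression for a candidate section then amounts to a range check on the summed multiply counts; a hedged sketch (the function name and list-based input are illustrative):

```python
def satisfies_formula_2(layer_mults, c_min, c_max):
    """Condition of formula (2): the total number of multiplications of
    the convolution and activation-function layers in the candidate
    section must lie between the lower limit c_min and upper limit c_max."""
    return c_min <= sum(layer_mults) <= c_max
```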
- the conditions to be added are not limited to conditions related to the amount of calculation of the deep learning model; they may be conditions using the amount of data exchanged with the external memory, or combinations of these.
- formula (2) shows an example of an additional conditional expression in which both upper and lower limits are set, but an additional conditional expression in which only an upper limit is set may also be used.
- an upper limit on the amount of calculation per layer integration section may be set based on the computing power of the AI chip and the processing time of one stage of pipeline processing.
- the adding unit 232 may pass the additional conditional expression to the determination unit 228 by overwriting the conditional expression registered in the condition description unit 44 with the additional conditional expression.
- FIG. 11 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 210.
- the layer integration determination process is performed by the CPU 11 reading the layer integration determination program from the ROM 12 or the storage 14, expanding it into the RAM 13, and executing it.
- step S210 the CPU 11, as the adding unit 232, determines the expected number of layer integration sections, for example by acquiring a manually assigned hyperparameter or by estimating it mechanically.
- step S212 the CPU 11, as the adding unit 232, acquires the deep learning model (computation graph) input to the layer integration determination device 210, scans the computation graph, and estimates the amount of calculation for the entire deep learning model from the parameters of the deep learning model.
- step S214 the CPU 11, as the adding unit 232, calculates the amount of calculation per layer integration section by dividing the amount of calculation of the entire deep learning model by the expected number of layer integration sections. The CPU 11, as the adding unit 232, then sets a range of the amount of calculation per layer integration section (e.g., upper and lower limits) based on the calculated value and creates the additional conditional expression. Next, in step S216, the CPU 11, as the adding unit 232, passes the formula used to calculate the amount of calculation and the created additional conditional expression to the determining unit 228.
- step S10 the CPU 11, functioning as the generation unit 21, executes a calculation integration process.
- the calculation integration process is similar to the calculation integration process of the conventional method shown in FIG. 3.
- the layer-integrated model generation process is similar to the layer-integrated model candidate generation process shown in FIG. 9.
- the CPU 11, functioning as the determination unit 228, determines whether the subgraph satisfies the condition passed from the adding unit 232 together with the condition registered in the condition description unit 44.
- the model generated in step S176 is therefore not a layer-integrated model candidate but the final layer-integrated model.
- step S220 the determination unit 228 outputs the generated layer-integrated model, and the layer-integration determination process ends.
- the layer integration determination device generates a layer-integrated model by integrating layers within layer integration sections that satisfy not only the layer integration conditions based on the specifications of the AI chip registered in the condition description section, but also optimization conditions based on a predetermined number of layer integration sections, such as a range of the amount of calculation per layer integration section.
- the layer integration determination process executed by the CPU reading software (a program) in each of the above embodiments may be executed by various processors other than the CPU.
- processors in this case include PLDs (Programmable Logic Devices) such as FPGAs (Field-Programmable Gate Arrays), whose circuit configuration can be changed after manufacture, and dedicated electric circuits such as ASICs (Application Specific Integrated Circuits), which are processors with a circuit configuration designed exclusively to execute specific processing.
- the layer integration determination process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, multiple FPGAs, or a combination of a CPU and an FPGA).
- the hardware structure of these various processors is, more specifically, an electric circuit combining circuit elements such as semiconductor elements.
- the layer integration determination program has been described as being pre-stored (installed) in the storage, but this is not limiting.
- the program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory.
- the program may also be downloaded from an external device via a network.
- A layer integration determination device comprising: a memory; and at least one processor coupled to the memory, wherein the processor is configured to: integrate, among the operations corresponding to each layer of a deep learning model including a plurality of convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; extract, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and determine, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description section as a condition for layer integration based on the specifications of the dedicated hardware.
- A non-transitory storage medium storing a program executable by a computer to execute a layer integration determination process, the layer integration determination process including: integrating, among the operations corresponding to each layer of a deep learning model including a plurality of convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; extracting, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description section as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and determining, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description section as a condition for layer integration based on the specifications of the dedicated hardware.
Description
The disclosed technology relates to a layer integration determination device, a layer integration determination method, and a layer integration determination program.
Deep learning is a machine learning method that uses neural networks reproducing the mechanism of human nerve cells, and is applied to fields such as video processing and natural language processing. For example, in the field of video processing, an image recognition method has been proposed that uses a deep learning model called a convolutional neural network (CNN) to determine the class of an object in an image. Many other methods using deep learning models have also been proposed, such as object detection, which determines the position and class of an object in an image, and segmentation, which infers the class of an object for each pixel. In the field of natural language processing, methods using models equipped with an attention mechanism, such as the Transformer, are applied to tasks such as machine translation and summarization. Such deep learning models have achieved performance exceeding that of conventional machine learning models, and there are moves to utilize deep learning in various fields such as medicine and industry.
Factors behind the improved performance of these deep learning models include the increase in computer computing power and the development of cloud technology. For example, repurposing GPUs (Graphical Processing Units) has made large-scale parallel computation possible. In addition, the emergence of cloud services equipped with many GPUs, such as GCP (Google Cloud Platform) and AWS (Amazon Web Services), has made it easy to train large-scale deep learning models.
When a trained model is used to perform inference on data acquired by a terminal such as an in-vehicle camera or a smartphone, there are two approaches: executing the inference on the cloud, or executing it on the terminal that acquired the data. With the former, real-time performance is compromised by network delays, and sending data to the cloud via the Internet can pose security risks and privacy violations. For this reason, edge AI (Artificial Intelligence), which executes inference processing on the terminal that acquired the data, has attracted attention in recent years.
While edge AI enables real-time data processing with privacy protected, it has the problem that power and computing resources are difficult to secure. In particular, when inference is performed on mobile platforms such as drones and smartphones, weight and other restrictions make it difficult to mount power-hungry devices such as GPUs. Therefore, by mounting hardware specialized for inference processing, called an AI chip, on the terminal, the computing resources required for edge AI are secured while keeping power consumption low.
To use the limited computing resources efficiently, a technique called layer integration has also been introduced, which integrates and processes the operations of multiple layers included in a deep learning model (Non-Patent Document 1). In layer integration, the convolution operations of a deep learning model are not processed independently layer by layer; instead, multiple convolution layers are integrated and processed together. This enables processing using a small-capacity cache, which suppresses access to external memory, shortens processing time, and improves power efficiency. A technique for automating this integration of operations has also been proposed (Non-Patent Document 2).
Although conventional methods can automate the integration of primitive operations in a deep learning model, they do not consider layer integration of deep learning models that include multiple convolution layers. Therefore, when layer integration of such a model is performed by a conventional method, the target deep learning model must be analyzed in detail and the layer integration decision made individually for each model, which makes the decision time-consuming.
The disclosed technology was made in view of the above points, and aims to flexibly accommodate a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolution layers.
A first aspect of the present disclosure is a layer integration determination device including: a generation unit that integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; an extraction unit that extracts from the computation-integrated model a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
A second aspect of the present disclosure is a layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, in which: the generation unit integrates, among the operations corresponding to each layer of a deep learning model including multiple convolution layers and input in the form of a computation graph in which each layer is represented by a node, the operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes the inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model; the extraction unit extracts from the computation-integrated model a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on the specifications of the dedicated hardware.
A third aspect of the present disclosure is a layer integration determination program that causes a computer to function as each unit of the layer integration determination device described above.
According to the disclosed technology, it is possible to flexibly accommodate a variety of deep learning models while shortening the time required to determine layer integration for deep learning models that include multiple convolution layers.
An example of an embodiment of the disclosed technology will be described below with reference to the drawings. In the drawings, identical or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.
<About conventional methods>
First, before describing each embodiment in detail, a conventional method will be described.
FIG. 1 shows the configuration of a computation integration device 1000 according to a conventional method that automates the integration of operations in a deep learning model, such as the method described in Non-Patent Document 2.
A deep learning model is input to the computation integration device 1000 in the form of a computation graph in which each layer of the model is represented by a node. The computation integration device 1000 scans the input computation graph, integrates layers corresponding to operations that can be integrated, and outputs a computation-integrated model. As shown in FIG. 1, the computation integration device 1000 functionally includes one description unit, called the correspondence description unit 1040, and two processing units, a labeling unit 1022 and a computation integration unit 1024.
Generally, when hardware executes the inference processing of a machine learning model, the operation of a convolution (Conv) layer as shown in FIG. 2 is integrated with the operations of the surrounding layers associated with it, to improve computational efficiency. In the example of FIG. 2, the associated surrounding layers are a padding (Pad) layer, a batch normalization (BN) layer, and an activation function (ReLU) layer.
Accordingly, the correspondence description unit 1040 registers, among the layers of a deep learning model, the operations that can be processed by the dedicated hardware that executes the inference processing of the deep learning model (hereinafter also called an "AI chip" as an example) and the combinations of operations that can be integrated. Operations that can be processed by the AI chip are operations for which a dedicated circuit needed to execute them is implemented on the AI chip. Combinations of operations that can be integrated are combinations of primitive operations that can be integrated, such as the combination of a convolution layer's operation and the operations of its associated surrounding layers described above.
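As an illustration of why a BN layer can be absorbed into the preceding convolution, its per-channel affine transform can be folded into the convolution weights and bias. This is a standard sketch of batch-norm folding under assumed array shapes, not code from the patent:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into conv weights/bias so that a
    convolution followed by BN becomes a single equivalent convolution.
    w: (out_ch, in_ch, kh, kw); b and all BN vectors: (out_ch,)."""
    scale = gamma / np.sqrt(var + eps)        # per-output-channel scale
    w_fused = w * scale[:, None, None, None]  # scale each output filter
    b_fused = (b - mean) * scale + beta       # fold mean/shift into bias
    return w_fused, b_fused
```

For a 1x1 convolution the check is a plain matrix product, which makes it easy to confirm that the fused layer reproduces conv-then-BN exactly.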
The labeling unit 1022 individually determines whether each layer's operation in the deep learning model can be processed on the AI chip, and labels the layers so that those to be processed on the AI chip are identifiable. Specifically, the labeling unit 1022 determines whether each operation matches an operation described in the correspondence description unit 1040 as processable by the AI chip. It attaches, to layers whose operations the AI chip can process, a label indicating processing on the AI chip, and to the other layers, a label indicating processing on general-purpose hardware (e.g., a CPU or GPU) used for controlling the AI chip and the like. The labeling unit 1022 passes the labeled model, in which each layer of the deep learning model (each node of the computation graph) has been labeled, to the computation integration unit 1024.
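The labeling pass can be sketched as a single scan over the graph nodes; the dictionary-based node representation and the set of supported ops below are illustrative assumptions, not the patent's data structures.

```python
# Ops assumed (for illustration) to have dedicated circuits on the AI chip.
AI_CHIP_OPS = {"Conv", "Pad", "BatchNorm", "ReLU"}

def label_model(nodes):
    """Tag every layer (graph node) with the device that will execute it:
    the AI chip if the op is supported, otherwise the general-purpose
    hardware used for controlling the AI chip."""
    for node in nodes:
        node["device"] = "ai_chip" if node["op"] in AI_CHIP_OPS else "cpu"
    return nodes
```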
In the labeled model, the computation integration unit 1024 combines into one layer each group of layers that corresponds to a combination of operations labeled for processing on the AI chip and that matches a combination of operations described in the correspondence description unit 1040. The computation integration unit 1024 thereby reconstructs the computation graph representing the deep learning model, and generates and outputs a computation-integrated model.
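The merging step amounts to replacing runs of operations that match a registered combination with a single fused node. A simplified greedy sketch over a linear op sequence (the registered patterns are invented examples; real graphs can branch):

```python
# Example registered combinations of integrable primitive operations.
FUSABLE = [("Pad", "Conv", "BatchNorm", "ReLU"), ("Conv", "ReLU")]

def merge_ops(ops):
    """Scan the op sequence and collapse each registered run into one
    fused node, preferring longer patterns, as the computation
    integration unit does when reconstructing the graph."""
    merged, i = [], 0
    patterns = sorted(FUSABLE, key=len, reverse=True)
    while i < len(ops):
        for pat in patterns:
            if tuple(ops[i:i + len(pat)]) == pat:
                merged.append("Fused(" + "+".join(pat) + ")")
                i += len(pat)
                break
        else:
            merged.append(ops[i])
            i += 1
    return merged
```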
図3に、従来手法の演算統合装置1000で実行される演算統合処理の流れを示すフローチャートを示す。 FIG. 3 shows a flowchart illustrating the flow of the computation integration process executed by the computation integration device 1000 using the conventional method.
演算統合装置1000に計算グラフで表現された深層学習モデルが入力されると、深層学習モデルを走査しながら、深層学習モデルに含まれる各層(計算グラフの各ノード)を処理対象として、ステップS1000のループ処理が実行される。具体的には、ステップS1002で、ラベリング部1022が、対応関係記述部1040に登録された、AIチップで処理可能な演算に基づいて、処理対象の層の演算がAIチップ上で処理可能か否かを判定する。処理可能な場合には、ステップS1004へ移行し、処理可能ではない場合には、ステップS1006へ移行する。 When a deep learning model represented by a computation graph is input to the computation integration device 1000, the deep learning model is scanned and the loop process of step S1000 is executed with each layer (each node of the computation graph) included in the deep learning model as the processing target. Specifically, in step S1002, the labeling unit 1022 determines whether or not the computation of the processing target layer can be processed on the AI chip based on the computations that can be processed by the AI chip and that are registered in the correspondence description unit 1040. If the computation can be processed, the process proceeds to step S1004, and if the computation cannot be processed, the process proceeds to step S1006.
ステップS1004では、ラベリング部1022が、処理対象の層に、AIチップで処理することを示すラベルを付す。一方、ステップS1006では、ラベリング部1022が、処理対象の層に、汎用的なハードウェアで処理することを示すラベルを付す。 In step S1004, the labeling unit 1022 attaches a label to the layer to be processed indicating that it will be processed by an AI chip. On the other hand, in step S1006, the labeling unit 1022 attaches a label to the layer to be processed indicating that it will be processed by general-purpose hardware.
次に、ステップS1008で、ラベリング部1022が、深層学習モデルの走査を終了したか、すなわち、深層学習モデルに含まれる全ての層についてラベリング処理が終了したか否かを判定する。走査が終了している場合には、ステップS1010へ移行し、走査が終了していない場合には、ステップS1000のループ処理を繰り返す。ステップS1010では、演算統合部1024が、AIチップで処理することを示すラベルが付された層が連続する区間をまとめて、計算グラフを再構成する。 Next, in step S1008, the labeling unit 1022 determines whether scanning of the deep learning model is complete, i.e., whether labeling processing is complete for all layers included in the deep learning model. If scanning is complete, the process proceeds to step S1010, and if scanning is not complete, the loop processing of step S1000 is repeated. In step S1010, the computation integration unit 1024 groups together sections in which consecutive layers are labeled to be processed by the AI chip, and reconstructs the computation graph.
そして、深層学習モデルを走査しながら、各区間を処理対象として、ステップS1012のループ処理が実行される。具体的には、ステップS1014で、演算統合部1024が、処理対象の区間内の各層の演算の組み合わせと、対応関係記述部1040に登録された統合可能なプリミティブな演算の組み合わせとのパターンマッチングを行う。演算統合部1024は、パターンマッチングにより、演算の組み合わせが一致するか否かにより、処理対象の区間内の各層を統合可能か否かを判定する。統合可能な場合には、ステップS1016へ移行し、統合できない場合には、ステップS1018へ移行する。 Then, while scanning the deep learning model, the loop process of step S1012 is executed with each section as the processing target. Specifically, in step S1014, the operation integration unit 1024 performs pattern matching between the combination of operations of each layer in the processing target section and the combination of primitive operations that can be integrated and are registered in the correspondence description unit 1040. The operation integration unit 1024 determines whether or not each layer in the processing target section can be integrated depending on whether or not the combination of operations matches through pattern matching. If integration is possible, the process proceeds to step S1016, and if integration is not possible, the process proceeds to step S1018.
ステップS1016では、パターンマッチングにより一致した演算の組み合わせに対応する層の組み合わせを1つの層に置き換えることにより統合し、計算グラフを再構成する。次に、ステップS1018で、演算統合部1024が、深層学習モデルの走査を終了したか、すなわち、深層学習モデルに含まれる全ての区間について、その区間内の層を統合するか否かの判定処理が終了したか否かを判定する。走査が終了している場合には、ステップS1020へ移行し、走査が終了していない場合には、統合可能な区間がなくなるまでステップS1012のループ処理を繰り返す。ステップS1020では、演算統合部1024が、上記ステップS1016で最終的に再構成した計算グラフを、演算統合済みモデルとして出力し、演算統合処理は終了する。 In step S1016, the layer combinations corresponding to the operation combinations that match through pattern matching are integrated by replacing them with one layer, and the computation graph is reconstructed. Next, in step S1018, the computation integration unit 1024 determines whether scanning of the deep learning model is complete, that is, whether the process of determining whether to integrate layers within all sections included in the deep learning model is complete. If scanning is complete, the process proceeds to step S1020, and if scanning is not complete, the loop process of step S1012 is repeated until there are no more sections that can be integrated. In step S1020, the computation integration unit 1024 outputs the computation graph finally reconstructed in step S1016 as a computationally integrated model, and the computation integration process ends.
As described above, the conventional method can automate the integration of primitive operations in a deep learning model, but it does not consider layer integration for deep learning models that contain multiple convolutional layers. When multiple convolutional layers are involved, the network structure differs from model to model. Therefore, to apply layer integration to deep learning models containing multiple convolutional layers with the conventional method, each target model had to be analyzed in detail and handled individually.
Specifically, with the conventional method, the combinations of convolutional layers to be integrated must be determined from the constraint on the maximum number of convolutional layers that the AI chip can integrate, in accordance with the chip's specifications. In addition, for each combination of convolutional layers to be integrated, it must be confirmed that the kernel sizes, input/output data sizes, and so on within the layer integration section fit within the cache capacity of the AI chip. The conventional method therefore has the problem that it cannot flexibly handle layer integration for a variety of deep learning models.
Accordingly, the following embodiments propose a method that, for an arbitrary input deep learning model, automates both the extraction of layer integration sections satisfying constraints based on the AI chip specifications and the determination of the optimal layer integration sections.
First Embodiment
FIG. 4 is a block diagram showing the hardware configuration of a layer integration determination device 10 according to the first embodiment. As shown in FIG. 4, the layer integration determination device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17. These components are connected to each other via a bus 19 so as to be able to communicate with each other.
The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls each of the above components and performs various arithmetic processes in accordance with the programs stored in the ROM 12 or the storage 14. In this embodiment, a layer integration determination program, described later, is stored in the ROM 12 or the storage 14.
The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is configured from a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.
The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used to perform various inputs. The display unit 16 is, for example, a liquid crystal display, and displays various types of information. The display unit 16 may employ a touch panel system and also function as the input unit 15.
The communication I/F 17 is an interface for communicating with other devices. For this communication, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used, for example.
Next, the functional configuration of the layer integration determination device 10 according to the first embodiment will be described. FIG. 5 is a block diagram showing an example of the functional configuration of the layer integration determination device 10. As shown in FIG. 5, the layer integration determination device 10 includes, as its functional configuration, a generation unit 21, an extraction unit 26, a determination unit 28, and an optimization unit 30. In addition, a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 10. Each functional component is realized by the CPU 11 reading the layer integration determination program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
A deep learning model to be subjected to layer integration, containing multiple convolutional layers, is input to the layer integration determination device 10 in the form of a computation graph in which each layer, such as a convolution (Conv) layer or an activation function (Activation) layer, is represented by a node. Each node holds parameter information about the layer corresponding to that node. The parameters are, for example, the sizes of the input and output feature maps, the kernel size and the number of channels used in the convolution operation, the number of multiplications in the convolution matrix operation and the activation function, and the number of additions in the bias addition.
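As an illustrative sketch only (not the disclosed implementation), one way to represent such a computation-graph node holding the per-layer parameters above is shown below; all field names are assumptions chosen for readability.

```python
class LayerNode:
    """One node of the computation graph, holding its layer's parameters."""
    def __init__(self, op, in_shape, out_shape,
                 kernel=None, in_ch=None, out_ch=None, mults=0, adds=0):
        self.op = op              # layer type, e.g. "conv" or "activation"
        self.in_shape = in_shape  # input feature-map size (H, W)
        self.out_shape = out_shape
        self.kernel = kernel      # kernel size (conv layers only)
        self.in_ch = in_ch        # number of input channels
        self.out_ch = out_ch      # number of output channels
        self.mults = mults        # multiplication count (matrix op / activation)
        self.adds = adds          # addition count (bias addition)
        self.succ = []            # edges to succeeding layer nodes

def connect(a, b):
    """Add a directed edge a -> b and return b for chaining."""
    a.succ.append(b)
    return b

# A minimal two-node graph: Conv -> Activation
conv = LayerNode("conv", (32, 32), (32, 32),
                 kernel=3, in_ch=16, out_ch=32, mults=3 * 3 * 16 * 32 * 32 * 32)
act = connect(conv, LayerNode("activation", (32, 32), (32, 32), mults=32 * 32 * 32))
```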
The correspondence description unit 40 is similar to the correspondence description unit 1040 of the conventional method described with reference to FIG. 1. That is, operations that can be processed by the AI chip and mergeable combinations of primitive operations are registered in the correspondence description unit 40.
The generation unit 21 integrates, among the operations corresponding to the layers of the input deep learning model, the operations corresponding to each convolutional layer that can be processed by the AI chip and the surrounding layers associated with that convolutional layer, thereby generating an operation-integrated model. Specifically, the generation unit 21 includes a labeling unit 22 and an operation integration unit 24.
The labeling unit 22 labels, among the layers of the input deep learning model, the layers whose operations match the AI-chip-processable operations preregistered in the correspondence description unit 40, so that those layers can be identified. The other, more specific details of the labeling unit 22 are the same as those of the labeling unit 1022 in FIG. 1, so a detailed description is omitted. The labeling unit 22 passes the labeled model (computation graph) to the operation integration unit 24.
Based on the labeled model passed from the labeling unit 22, the operation integration unit 24 identifies the combinations of operations corresponding to the layers labeled as AI-chip-processable. The operation integration unit 24 then reconstructs the computation graph by grouping into processing blocks those identified combinations that match the mergeable combinations of primitive operations preregistered in the correspondence description unit 40. In this way, the operation integration unit 24 generates an operation-integrated model. The other, more specific details of the operation integration unit 24 are the same as those of the operation integration unit 1024 in FIG. 1, so a detailed description is omitted. The operation integration unit 24 passes the generated operation-integrated model to the extraction unit 26.
Patterns of combinations of convolutional layers that can be layer-integrated based on the layer configuration of each convolutional layer are registered in the pattern description unit 42. Specifically, subgraph patterns of deep learning models that satisfy constraints taking into account the "number of convolutional layers that can be layer-integrated" and the "connections between the layers", both determined by the AI chip specifications, are comprehensively registered in advance in the pattern description unit 42.
FIG. 6 shows examples of subgraph patterns. FIG. 6(a) is a computation graph representing a mergeable combination of operation layers registered in the correspondence description unit 40 shown in FIG. 2. Treating this as one processing block (Conv Block), a pattern combining multiple Conv Blocks, such as that in FIG. 6(b), is registered in the pattern description unit 42. As shown in FIG. 6(b), a pattern may also include other operators such as an upsampling (Upsample) layer. As shown in FIG. 7(c), a pattern may also be more complex, including a layer such as a residual layer (Res) that combines a skip connection with an addition (Add) layer.
Conditions for layer integration based on the AI chip specifications are registered in the condition description unit 44. For example, a conditional expression for the cache usage on the AI chip required to load the kernels of the convolutional layers is defined in advance from the parameters of each node, as in expression (1) below.
Σ_{a=l_s}^{l_e} (k_a × k_a × iCH_a × oCH_a) ≤ C_cache … (1)
In expression (1), the layer integration section runs from the l_s-th layer to the l_e-th layer, k_a is the kernel size of the a-th convolutional layer, iCH_a is the number of input channels of the a-th layer, and oCH_a is the number of output channels of the a-th layer. C_cache on the right-hand side represents the cache capacity, a value determined according to the specifications of the AI chip.
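As an illustrative sketch only (not the disclosed implementation), the cache condition of expression (1) can be checked as follows: the kernel storage of every convolutional layer in a candidate section is summed and compared against the cache capacity. Field names and the cache size used in the example are assumptions.

```python
def kernel_cache_usage(section):
    """Sum k_a * k_a * iCH_a * oCH_a over the conv layers of the section."""
    return sum(n["k"] * n["k"] * n["in_ch"] * n["out_ch"]
               for n in section if n["op"] == "conv")

def satisfies_cache_condition(section, c_cache):
    """Expression (1): total kernel cache usage must not exceed capacity."""
    return kernel_cache_usage(section) <= c_cache

section = [
    {"op": "conv", "k": 3, "in_ch": 16, "out_ch": 32},
    {"op": "activation"},
    {"op": "conv", "k": 1, "in_ch": 32, "out_ch": 32},
]
# 3*3*16*32 + 1*1*32*32 = 4608 + 1024 = 5632 weight entries
```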
The extraction unit 26 extracts, as subgraphs, the portions of the operation-integrated model (computation graph) passed from the operation integration unit 24 that match patterns preregistered in the pattern description unit 42. Specifically, the extraction unit 26 scans the computation graph representing the operation-integrated model, matches it against the patterns registered in the pattern description unit 42, and extracts each matched section as a subgraph that is a candidate layer integration section. The extraction unit 26 passes the extracted subgraphs to the determination unit 28.
The determination unit 28 determines, among the subgraphs passed from the extraction unit 26, the sections corresponding to subgraphs that satisfy the conditions preregistered in the condition description unit 44 as layer integration sections. Specifically, the determination unit 28 extracts the parameters of each node in a subgraph and reads the conditional expressions registered in the condition description unit 44. The determination unit 28 then evaluates the read conditional expressions using the extracted parameters and determines whether the subgraph satisfies them. The determination unit 28 generates a layer-integrated model candidate by integrating the layers within each layer integration section that satisfies the conditions and reconstructing the operation-integrated model. The determination unit 28 passes the generated layer-integrated model candidate to the optimization unit 30.
The optimization unit 30 selects, based on an optimization index, the optimal candidate from among multiple layer-integrated model candidates, each with different layer integration sections, generated by executing the processes of the extraction unit 26 and the determination unit 28 multiple times, and outputs it as the layer-integrated model. The optimization index is arbitrary, but it is preferably an index based on a numerical value representing the processing time of each layer integration section, specifically, at least one of the computational cost of the deep learning model and the amount of data exchanged between external memory and the AI chip. For example, the optimization unit 30 computes, as the optimization index, the variance of quantities such as the total number of multiply-accumulate operations in each layer integration section, or the read/write times of the convolution kernels and the input/output features. The optimization unit 30 selects the layer-integrated model candidate with the smallest variance and outputs it as the final layer-integrated model. In this way, the candidate whose layer integration sections have the most uniform processing times is selected; when pipeline processing is configured, this suppresses processing delays caused by waiting for the preceding stage to finish, enabling efficient execution of inference processing.
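As an illustrative sketch only (not the disclosed implementation), the variance-based selection above can be expressed as follows, with each candidate reduced to a list of per-section costs (a proxy for processing time, e.g. multiply-accumulate counts); the numbers are assumptions.

```python
def variance(xs):
    """Population variance of a list of per-section cost values."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def select_best_candidate(candidates):
    """Pick the candidate whose per-section costs are most uniform."""
    return min(candidates, key=variance)

candidates = [
    [100, 900, 100],  # very uneven stages -> pipeline stalls
    [350, 400, 350],  # nearly uniform stages
]
best = select_best_candidate(candidates)
# best == [350, 400, 350]
```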
Next, the operation of the layer integration determination device 10 according to the first embodiment will be described. FIG. 8 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 10. The layer integration determination process is performed by the CPU 11 reading the layer integration determination program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. The layer integration determination process is an example of the layer integration determination method of the present disclosure.
In step S10, the CPU 11, acting as the generation unit 21, executes the operation integration process. The operation integration process is the same as the operation integration process of the conventional method shown in FIG. 3. Next, as the loop processing of step S12, the process is repeated a specified number of times. The specified number of repetitions is given externally as a hyperparameter.
Specifically, in step S14, the CPU 11, acting as the extraction unit 26, changes the order in which the patterns registered in the pattern description unit 42 are matched. The pattern order can be set arbitrarily; for example, the order may be shuffled randomly, or changed systematically by grouping the patterns by the number of convolutional layers they contain. This is done so that the process of step S16, described later, generates a different layer-integrated model candidate in each iteration.
Next, in step S16, the CPU 11 executes the layer-integrated model candidate generation process. Then, in step S18, the CPU 11, acting as the optimization unit 30, determines whether the number of iterations has reached the specified count. If it has, the process proceeds to step S20; if not, the loop processing of step S12 is repeated.
Here, the layer-integrated model candidate generation process executed in step S16 will be described with reference to FIG. 9.
The CPU 11 sets the patterns registered in the pattern description unit 42 as patterns to be processed, in the order changed in step S14, and executes the loop processing of step S160. Specifically, the CPU 11 scans the operation-integrated model and executes the loop processing of step S162.
In step S164, the CPU 11, acting as the extraction unit 26, determines whether any unintegrated sections remain, that is, sections other than the layer integration sections extracted based on patterns other than the pattern currently being processed. If unintegrated sections remain, the process proceeds to step S166; if not, it proceeds to step S176.
In step S166, the CPU 11, acting as the extraction unit 26, searches the unintegrated sections of the operation-integrated model (computation graph) for a subgraph matching the pattern being processed, and determines whether such a subgraph exists. If a matching subgraph exists, the process proceeds to step S168; if not, the loop processing of step S162 ends.
In step S168, the CPU 11, acting as the extraction unit 26, extracts the subgraph that matched the pattern being processed. Then, the CPU 11, acting as the determination unit 28, extracts the parameters of each node in the subgraph and reads the conditional expressions registered in the condition description unit 44. Furthermore, the CPU 11, acting as the determination unit 28, evaluates the read conditional expressions using the extracted parameters and determines whether the subgraph satisfies them. If the subgraph satisfies the conditions, the process proceeds to step S170; if not, the loop processing of step S162 ends.
In step S170, the CPU 11, acting as the extraction unit 26, determines whether scanning of the entire operation-integrated model is complete. If scanning is not complete, the loop processing of step S162 is repeated; if it is complete, the process proceeds to step S174. In step S174, the CPU 11, acting as the extraction unit 26, determines whether the matching process against the operation-integrated model has been completed for all patterns registered in the pattern description unit 42. If unprocessed patterns remain, the loop processing of step S160 is repeated for the unprocessed patterns; if all patterns have been processed, the process proceeds to step S176.
At step S176, either no unintegrated sections remain in the operation-integrated model, or the matching process has been completed for all patterns. The CPU 11, acting as the determination unit 28, generates a layer-integrated model candidate by integrating the layers within the layer integration sections extracted in step S170 and reconstructing the operation-integrated model. Note that any layer labeled for processing on the AI chip that is not included in any layer integration section is processed as a single layer, and is output as part of the layer-integrated model candidate together with the layer integration sections. The layer-integrated model candidate generation process then ends, and the process returns to the layer integration determination process (FIG. 8). By repeating the loop processing (step S12), including the candidate generation process of step S16, the specified number of times, that number of layer-integrated model candidates is generated.
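As an illustrative sketch only, and under the simplifying assumption that the model and patterns can be flattened to sequences of fused-block names, the iteration of steps S14 and S16 can be read as: shuffle the pattern order, then greedily match patterns left to right over the not-yet-integrated part of the model, so that different iterations yield different candidate partitions. All names here are assumptions.

```python
import random

def match_greedy(ops, patterns):
    """Partition `ops` into sections, trying patterns in the given order."""
    sections, i = [], 0
    while i < len(ops):
        for p in patterns:
            if tuple(ops[i:i + len(p)]) == p:
                sections.append(p)       # matched layer integration section
                i += len(p)
                break
        else:
            sections.append((ops[i],))   # unmatched layer, processed singly
            i += 1
    return sections

def generate_candidates(ops, patterns, n_iter, seed=0):
    """Step S12 loop: one candidate per iteration, pattern order reshuffled."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        order = patterns[:]
        rng.shuffle(order)               # step S14: change the matching order
        candidates.append(match_greedy(ops, order))
    return candidates

ops = ["conv_block", "conv_block", "conv_block"]
patterns = [("conv_block", "conv_block"), ("conv_block",)]
cands = generate_candidates(ops, patterns, n_iter=4)
```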
Next, in step S20, the CPU 11, acting as the optimization unit 30, executes the loop processing of step S20 for each generated layer-integrated model candidate. Specifically, in step S22, the CPU 11, acting as the optimization unit 30, computes, as the optimization index, the variance of an index representing the processing time of each layer integration section included in the candidate being processed.
Next, in step S24, the CPU 11, acting as the optimization unit 30, determines whether the optimization index has been computed for all layer-integrated model candidates. If unprocessed candidates remain, the loop processing of step S20 is repeated; if all have been processed, the process proceeds to step S26.
In step S26, the CPU 11, acting as the optimization unit 30, selects the optimal layer-integrated model from the candidates based on the optimization index. For example, the CPU 11 selects, as the final layer-integrated model, the candidate for which the variance of the index representing the processing time of each layer integration section is smallest, and outputs it. The layer integration determination process then ends.
As described above, the layer integration determination device according to the first embodiment operates as follows. From among the operations corresponding to the layers of a deep learning model containing multiple convolutional layers, input in the form of a computation graph in which each layer is represented by a node, it integrates the operations corresponding to each convolutional layer that can be processed by the dedicated hardware executing the inference processing of the deep learning model and the surrounding layers associated with that convolutional layer, thereby generating an operation-integrated model. From the operation-integrated model, it extracts subgraphs matching the patterns preregistered in the pattern description unit as combinations of convolutional layers that can be layer-integrated based on the layer configuration of each convolutional layer. It determines, as layer integration sections, the sections corresponding to subgraphs satisfying the conditions preregistered in the condition description unit as layer integration conditions based on the specifications of the dedicated hardware. It then selects, based on a predetermined index, the optimal layer-integrated model from among multiple layer-integrated model candidates with different layer integration sections, and outputs it. This makes it possible, with only a deep learning model expressed as a computation graph as input, to shorten the time required to determine layer integration for deep learning models containing multiple convolutional layers while flexibly handling a variety of deep learning models.
Second Embodiment
The second embodiment describes a configuration in which the deep learning model is scanned in advance to generate, as additional conditional expressions, the conditions used for optimization, and these are passed to the determination unit, so that the processing of the extraction unit and the determination unit needs to be performed only once. In the layer integration determination device according to the second embodiment, components that are the same as in the layer integration determination device 10 according to the first embodiment are given the same reference numerals, and detailed descriptions thereof are omitted. The hardware configuration of the layer integration determination device according to the second embodiment is the same as that of the layer integration determination device 10 according to the first embodiment shown in FIG. 4, so its description is also omitted.
The functional configuration of a layer integration determination device 210 according to the second embodiment will now be described. FIG. 10 is a block diagram showing an example of the functional configuration of the layer integration determination device 210. As shown in FIG. 10, the layer integration determination device 210 includes, as its functional configuration, a generation unit 21 including a labeling unit 22 and an operation integration unit 24, an extraction unit 26, a determination unit 228, and an addition unit 232. In addition, a correspondence description unit 40, a pattern description unit 42, and a condition description unit 44 are provided in a predetermined storage area of the layer integration determination device 210. Each functional component is realized by the CPU 11 reading the layer integration determination program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
The addition unit 232 adds, to the conditions used by the determination unit 228, an optimization condition based on a predetermined number of layer integration sections. For example, using at least one of the computational cost of the deep learning model and the amount of data exchanged between external memory and the AI chip, the addition unit 232 adds as a condition a range of computational cost per layer integration section chosen so that the processing times of the layer integration sections in the layer-integrated model become uniform.
More specifically, the addition unit 232 acquires the deep learning model (computation graph) input to the layer integration determination device 210 and scans the computation graph to estimate the computational cost of the entire deep learning model. The addition unit 232 estimates the computational cost from the parameters of the deep learning model, including the sizes of the input and output feature maps, the kernel size and number of channels used in the convolution operation, the number of multiplications in the convolution matrix operation and the activation function, and the number of additions in the bias addition. As described above, these parameters are held by each node of the computation graph.
From the estimated amount of computation of the entire model, the addition unit 232 sets an upper limit and a lower limit on the amount of computation per layer integration section. Specifically, the addition unit 232 calculates the amount of computation per layer integration section by dividing the amount of computation of the entire deep learning model by the expected number of layer integration sections. The expected number of layer integration sections may be given manually as a hyperparameter, or may be estimated mechanically. In the latter case, for example, a layer-integrated model is first generated without any additional conditional expression from the addition unit 232, and the number of layer integration sections included in the generated model is adopted. The addition unit 232 then sets the upper and lower limits based on the amount of computation per layer integration section.
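The division step can be sketched as follows. The text does not specify how far the upper and lower limits are widened around the per-section average, so the `tolerance` parameter here is an assumption:

```python
def section_budget(total_ops, num_sections, tolerance=0.2):
    """Divide the whole-model operation count evenly across the expected
    number of layer integration sections, then widen the result by a
    symmetric tolerance to obtain the lower and upper limits (C_min, C_max)."""
    per_section = total_ops / num_sections
    c_min = per_section * (1 - tolerance)
    c_max = per_section * (1 + tolerance)
    return c_min, c_max
```

A tighter tolerance pushes the sections toward equal processing times, at the cost of fewer candidate sections satisfying the condition.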
The addition unit 232 passes the formula used to calculate the amount of computation, together with an additional conditional expression indicating the range of the amount of computation per layer integration section, to the determination unit 228. The additional conditional expression is written, for example, as the following formula (2).
$$C_{\min} \le \sum_{\substack{a = l_s \\ \mathrm{type}_a \in \{\mathrm{conv},\,\mathrm{activation}\}}}^{l_e} m_a \le C_{\max} \qquad (2)$$

In formula (2), the layer integration section runs from the $l_s$-th layer to the $l_e$-th layer, $m_a$ is the number of multiplications of the $a$-th layer, which is a convolution layer or an activation function layer (type = conv, activation), $C_{\min}$ is the lower limit on the amount of computation, and $C_{\max}$ is the upper limit. Formula (2) is a conditional judgment expression that tests whether the sum of the multiplication counts of the convolution and activation function layers falls within the range from the lower limit to the upper limit.
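A direct check of the condition expressed by formula (2) might look like the following. The list-of-dicts layer representation is an illustrative assumption:

```python
def satisfies_condition(layers, l_s, l_e, c_min, c_max):
    """Formula (2): the sum of multiplication counts m_a over the convolution
    and activation layers in the candidate section [l_s, l_e] (inclusive)
    must fall within [C_min, C_max]."""
    total = sum(layer["m"] for layer in layers[l_s:l_e + 1]
                if layer["type"] in ("conv", "activation"))
    return c_min <= total <= c_max
```

Layers of other types inside the section (e.g. pooling) contribute nothing to the sum, exactly as in the formula.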
The added condition is not limited to a condition on the amount of computation of the deep learning model; it may be a condition using the amount of data exchanged with the external memory, a combination of these, or the like. Also, although formula (2) shows an example of an additional conditional expression that sets both an upper limit and a lower limit, an additional conditional expression that sets only an upper limit may be used. For example, the upper limit on the amount of computation per layer integration section may be set from the computing power of the AI chip and the processing time of one stage of pipeline processing.
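The upper-limit-only variant mentioned here could be sketched as below, assuming the chip's computing power is expressed in operations per second and the stage time in seconds (both units are assumptions):

```python
def upper_limit_from_chip(ops_per_second, stage_seconds):
    """Upper-limit-only bound: the most work one pipeline stage can finish
    in its allotted time on the AI chip."""
    return ops_per_second * stage_seconds
```

A section whose multiplication count exceeds this bound would stall the pipeline stage it is assigned to.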
When a conditional expression with the same calculation method as the created additional conditional expression is already registered in the condition description unit 44, the addition unit 232 may pass the additional conditional expression to the determination unit 228 by overwriting the registered conditional expression with it.
Like the determination unit 28 in the first embodiment, the determination unit 228 determines whether the subgraph extracted by the extraction unit 26 satisfies the conditions. However, the determination unit 228 of the second embodiment determines whether the subgraph satisfies not only the layer integration conditions based on the AI chip specifications registered in the condition description unit 44 but also the optimization condition, based on the number of layer integration sections, passed from the addition unit 232. The determination unit 228 integrates the layers within each layer integration section that satisfies the conditions, and generates and outputs a layer-integrated model obtained by reconstructing the computation-integrated model.
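Judging the registered conditions together with the added one amounts to a conjunction over all condition predicates. A minimal sketch, with hypothetical predicate signatures:

```python
def section_is_valid(subgraph, registered_conditions, additional_conditions):
    """Accept a candidate section only when it passes every condition:
    both those registered in the condition description unit (chip
    specifications) and those passed in from the addition unit."""
    return all(cond(subgraph)
               for cond in registered_conditions + additional_conditions)
```

Each condition is modeled as a callable taking the candidate subgraph and returning a boolean; the actual representation of conditions in the condition description unit is not specified in the text.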
Next, the operation of the layer integration determination device 210 according to the second embodiment will be described. FIG. 11 is a flowchart showing the flow of the layer integration determination process performed by the layer integration determination device 210. The layer integration determination process is performed by the CPU 11 reading out the layer integration determination program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
In step S210, the CPU 11, as the addition unit 232, determines the expected number of layer integration sections, for example by acquiring a manually given hyperparameter or by estimating it mechanically. Next, in step S212, the CPU 11, as the addition unit 232, acquires the deep learning model (computation graph) input to the layer integration determination device 210, scans the computation graph, and estimates the amount of computation of the entire deep learning model from the parameters of the deep learning model.
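The mechanical estimate in step S210 can be sketched as a dry run: generate a layer-integrated model with no additional condition and adopt its section count. The generator's signature and the `sections` attribute are hypothetical:

```python
def expected_section_count(graph, generate_layer_integrated_model):
    """Mechanical estimate of the expected number of layer integration
    sections: run the generation pipeline without any additional conditional
    expression and count the sections of the resulting model."""
    model = generate_layer_integrated_model(graph, extra_conditions=[])
    return len(model.sections)
```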
Next, in step S214, the CPU 11, as the addition unit 232, calculates the amount of computation per layer integration section by dividing the amount of computation of the entire deep learning model by the expected number of layer integration sections. The CPU 11, as the addition unit 232, then sets a range (for example, upper and lower limits) of the amount of computation per layer integration section based on the calculated amount, and creates the additional conditional expression. Next, in step S216, the CPU 11, as the addition unit 232, passes the formula used to calculate the amount of computation and the created additional conditional expression to the determination unit 228.
Next, in step S10, the CPU 11, as the generation unit 21, executes the computation integration process. The computation integration process is the same as the conventional computation integration process shown in FIG. 3. Next, in step S218, the CPU 11, as the extraction unit 26 and the determination unit 228, executes the layer-integrated model generation process. This process is the same as the layer-integrated model candidate generation process shown in FIG. 9, with two differences: when determining in step S168 whether the subgraph satisfies the conditions, the CPU 11, as the determination unit 228, determines whether the subgraph satisfies the condition passed from the addition unit 232 together with the conditions registered in the condition description unit 44; and the model generated in step S176 is treated as the final layer-integrated model rather than as a layer-integrated model candidate.
Next, in step S220, the determination unit 228 outputs the generated layer-integrated model, and the layer integration determination process ends.
As described above, the layer integration determination device according to the second embodiment generates a layer-integrated model by integrating the layers within each layer integration section that satisfies both the layer integration conditions based on the AI chip specifications registered in the condition description unit and an optimization condition based on a predetermined number of layer integration sections, such as a range of the amount of computation per layer integration section. This eliminates the iterative processing of the extraction unit and the determination unit required in the first embodiment, and therefore reduces the processing time of the layer integration determination process compared with the layer integration determination device according to the first embodiment.
The layer integration determination process, which in each of the above embodiments is executed by the CPU reading and executing software (a program), may instead be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, such as an ASIC (Application Specific Integrated Circuit), which is a processor having a circuit configuration designed exclusively for executing specific processing. The layer integration determination process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
In each of the above embodiments, the layer integration determination program is described as being stored (installed) in the storage in advance, but this is not limiting. The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.
The following supplementary notes are further disclosed with respect to each of the above embodiments.
(Additional Note 1)
A layer integration determination device comprising:
a memory; and
at least one processor coupled to the memory,
the processor being configured to:
integrate, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
extract, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
determine, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
(Additional Note 2)
A non-transitory storage medium storing a program executable by a computer to execute a layer integration determination process, the layer integration determination process comprising:
integrating, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
extracting, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
determining, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
10, 210 Layer integration determination device
11 CPU
12 ROM
13 RAM
14 Storage
15 Input unit
16 Display unit
17 Communication I/F
19 Bus
21 Generation unit
22 Labeling unit
24 Computation integration unit
26 Extraction unit
28, 228 Determination unit
232 Addition unit
30 Optimization unit
40 Correspondence description unit
42 Pattern description unit
44 Condition description unit
Claims (8)

A layer integration determination device comprising:
a generation unit that integrates, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
an extraction unit that extracts, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
a determination unit that determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.

The layer integration determination device according to any one of claims 1 to 5, wherein the generation unit includes:
a labeling unit that identifiably labels operations of the input deep learning model that match operations registered in advance in a correspondence description unit as operations processable by the dedicated hardware; and
a computation integration unit that generates the computation-integrated model by grouping into processing blocks those combinations of the labeled operations processable by the dedicated hardware that match combinations of integrable primitive operations registered in advance in the correspondence description unit, and reconstructing the computation graph.

A layer integration determination method executed by a layer integration determination device including a generation unit, an extraction unit, and a determination unit, wherein:
the generation unit integrates, among operations corresponding to each layer of a deep learning model including a plurality of convolution layers, input in the form of a computation graph in which each layer is represented by a node, operations corresponding to each of the convolution layers that can be processed by dedicated hardware that executes inference processing of the deep learning model and the surrounding layers associated with the convolution layers, to generate a computation-integrated model;
the extraction unit extracts, from the computation-integrated model, a subgraph that matches a pattern registered in advance in a pattern description unit as a combination of convolution layers that can be layer-integrated based on the layer configuration of each of the convolution layers; and
the determination unit determines, as a layer integration section, a section corresponding to the subgraph that satisfies a condition registered in advance in a condition description unit as a condition for layer integration based on a specification of the dedicated hardware.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2023/019959 WO2024247053A1 (en) | 2023-05-29 | 2023-05-29 | Layer integration determination device, method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024247053A1 true WO2024247053A1 (en) | 2024-12-05 |
Family
ID=93657189
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2023/019959 Pending WO2024247053A1 (en) | 2023-05-29 | 2023-05-29 | Layer integration determination device, method, and program |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024247053A1 (en) |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190303762A1 (en) * | 2018-03-30 | 2019-10-03 | Xilinx, Inc. | Methods of optimization of computational graphs of neural networks |
Non-Patent Citations (1)
| Title |
|---|
| XIAO, Q. ET AL.: "Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs", PROCEEDINGS OF THE 54TH ANNUAL DESIGN AUTOMATION CONFERENCE 2017, 2017, pages 1 - 6, XP055573913, [retrieved on 20230731], DOI: 10.1145/3061639.3062244 * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23939532; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2025523697; Country of ref document: JP; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025523697; Country of ref document: JP |