JP3597341B2

JP3597341B2 - Globally accelerated learning method for neural network model and its device

Info

Publication number: JP3597341B2
Application number: JP14433197A
Authority: JP
Inventors: 慶広落合; 智鈴木
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1997-06-02
Filing date: 1997-06-02
Publication date: 2004-12-08
Anticipated expiration: 2017-06-02
Also published as: JPH10334070A

Description

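[0001]
[Technical Field of the Invention]
The present invention relates to a globally accelerated learning method for neural network models, and a device therefor, applicable to processing that uses neural networks, processing that uses nonlinear optimization methods, and the like. More specifically, it relates to a globally accelerated learning method and device that shorten learning time by using sequential learning in which weight oscillation is suppressed each time a learning datum is presented, and that, when searching for the optimal solution, switch to a random search so as to avoid local minima and reach the global minimum.
[0002]
[Prior Art]
First, an example of a neural network model to which the present invention can be applied is given, and a learning method for it is explained. A layered neural network model is used here as a typical example, but the invention can also be applied to models with other structures, such as neural network models with recurrent connections.
[0003]
As shown in FIG. 2, the layered neural network model is a layered network consisting of one input layer, multiple hidden layers, and one output layer; each layer is composed of units, weights, and biases. A unit receives as its input value the sum of the products of the output values of the units in the preceding layer (x_i (i = 1, 2, ..., L; L: number of units in the preceding layer)) and the weights (w_i; i: weight index), plus a bias (b_i; i: unit index). It applies a nonlinear transformation f(·) to this input value, outputs the resulting value y, and passes this output to the units of the next layer (FIG. 2, Equation (1)). Here, the input/output function of the input-layer units is linear, and the nonlinear function f(·) of the units in all other layers is the sigmoid function, a typical choice (Equation (1)), although other transfer functions may be used depending on the model.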
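The unit computation described above can be sketched as follows. This is a minimal illustration, assuming the standard logistic sigmoid for f(·); Equation (1) itself is not reproduced in this text, and the function names `sigmoid`, `unit_output`, and `layer_output` are hypothetical.

```python
import math

def sigmoid(u):
    """Logistic sigmoid, assumed here as the nonlinear function f."""
    return 1.0 / (1.0 + math.exp(-u))

def unit_output(x, w, b, f=sigmoid):
    """Output y = f(sum_i w_i * x_i + b) of one unit outside the input layer."""
    u = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(u)

def layer_output(x, weights, biases, f=sigmoid):
    """Outputs of all units in one layer; weights[j] holds the weights of unit j."""
    return [unit_output(x, w, b, f) for w, b in zip(weights, biases)]
```

Passing a different `f` corresponds to the remark that other transfer functions may be used depending on the model.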
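[0004]
[Equation 1]

[Inline image 1]

... consider a learning rule for estimating the weight values that minimize an evaluation criterion. Here, as a typical evaluation criterion, the residual sum of squares (Equation (2)) between the output values of the neural network model (O_j (j = 1, 2, ..., M); M: number of units in the output layer) and the teacher data, i.e., the target output values for learning (T_j (j = 1, 2, ..., M)), is used.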
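As a small illustration of this evaluation criterion, the sketch below computes the residual sum of squares for one pattern. Equation (2) is not reproduced in this text, so any scaling factor it may contain (for example 1/2) is unknown; the plain sum is an assumption, and the function name is hypothetical.

```python
def residual_sum_of_squares(outputs, targets):
    """Sum over output units j of (T_j - O_j)^2 for a single pattern."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    return sum((t - o) ** 2 for o, t in zip(outputs, targets))
```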
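[0005]
[Equation 2]

D. E. Rumelhart et al. proposed, as a learning method for neural network models, a method that adds a momentum (inertia) term to the weight update rule. However, when the evaluation function surface forms a valley, the weights oscillate across the valley during learning, so they are not updated in the direction that descends the valley toward the optimal solution, and convergence slows. To solve this problem, Ochiai et al. proposed the KickOut method, which suppresses weight oscillation (reference: Yoshihiro Ochiai, Naohiro Toda, Shiro Usui: "Acceleration of layered neural networks by suppressing weight oscillation: the KickOut method", Transactions of the Institute of Electrical Engineers of Japan, Vol. 113-C, No. 12, pp. 1154-1162 (1993)).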
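The oscillation problem can be seen in a minimal sketch of a gradient step with a momentum term. The update form (new change = -learning rate × gradient + momentum rate × previous change) is the commonly used one and is assumed here; the function name and the numbers in the usage note are illustrative only.

```python
def momentum_step(w, prev_change, grad, learning_rate, momentum):
    """One weight update with a momentum term; returns (new_w, change)."""
    change = [-learning_rate * g + momentum * d
              for g, d in zip(grad, prev_change)]
    new_w = [wi + ci for wi, ci in zip(w, change)]
    return new_w, change
```

For a valley such as E(w) = 0.5 * (w0^2 + 100 * w1^2), the gradient at w = [1, 1] is [1, 100]. With a learning rate of 0.019, one step moves w1 from 1 to about -0.9 while w0 barely changes: the weight jumps across the valley instead of descending along it.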
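[0006]
This learning method is shown below.
[0007]
[Equation 3]

[0008]
On the other hand, when local minima exist on the evaluation function surface, the search converges to a local minimum, so the global minimum cannot be found.
[0009]
As a learning rule that solves this, the dynamic tunneling algorithm has been proposed (reference: Y. Yao: "Dynamic tunneling algorithm for global optimization", IEEE Trans. on Sys. Man and Cybern., SMC-19-5, pp. 1222-1230 (1989)). When a local minimum is found by the extremum search, this algorithm constructs a penalty function centered on that local solution, forms an evaluation function (Equation (9)) from which the minimum has been eliminated, and searches for a new minimum on that basis. The method consists of two algorithms. Given the following optimization problem,
[Inline image 2]

[0010]
[Equation 5]

[0011]
[Inline image 3]

[0012]
[Equation 6]

[0013]
[Inline image 4]

[Equation 7]

[Inline image 5]

[0014]
[Equation 8]

As described above, the dynamic tunneling algorithm obtains the global optimum by repeatedly alternating the extremum search and the global search.
[0015]
However, the extremum search used in this global optimization method is the primitive steepest descent method, which converges very slowly. Obtaining the solution of the ordinary differential equation by numerical integration is also slow in terms of convergence speed.
[0016]
Moreover, in practice, the parameters k_j and k must be adjusted when performing the global search, but they can only be tuned empirically, and this tuning is very difficult.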
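[0017]
[Problems to Be Solved by the Invention]
The KickOut method is an extremum-search algorithm whose acceleration is effective for valley-shaped regions of the evaluation function surface. With the KickOut method alone, it is difficult to avoid local minima when they exist on the evaluation function surface.
[0018]
The dynamic tunneling algorithm searches for the global minimum, but it converges very slowly.
[0019]
Furthermore, the KickOut method uses batch learning, in which the weights are updated only once after all patterns to be learned have been presented, and the coefficient of the correction term that corrects weight oscillation can be obtained only after all learning patterns have been presented. Consequently, the time required for one iteration of the learning algorithm increases exponentially with the number of patterns, and the learning time becomes enormous. To improve this, sequential learning, in which the weights are updated each time one learning pattern is presented, must be introduced. However, simply converting the batch-type KickOut method to a sequential type does not yield an acceleration comparable to that of the KickOut method.
[0020]
The present invention has been made in view of the above. Its object is to provide a globally accelerated learning device for neural network models that can find the global minimum even when local minima exist on the evaluation function surface, and that accelerates the convergence of learning so that the global minimum is found in a short time.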
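[0021]
[Means for Solving the Problems]
To achieve the above object, the invention of claim 1 is a globally accelerated learning device for a neural network model, composed of an input unit having a file reading device and a sensor that measures time-series data, a processing unit, and an output unit, wherein the processing unit comprises:

variable initialization means, which receives from the input unit and initializes the initial weights of the neural network model, the initial learning rates, the learning-rate increase and decrease rates, the smoothing coefficient, the momentum rate, the initial value of the correction coefficient, the stopping criteria for the extremum search and the global search, the coefficients of the constraint condition, and the coefficients for global optimization;

data presentation means, which receives the input data to be learned from the input unit and stores it temporarily, presents all patterns of the input data to the neural network model, computes the output of the model for each input pattern, and, based on the evaluation criterion value for all patterns computed from these outputs and the teacher data, computes the gradient, i.e., the first partial derivative of the evaluation function with respect to the weights at the current iteration (the k-th iteration);

smoothed-derivative calculation means, which multiplies the k-th gradient obtained by the data presentation means by one minus the smoothing coefficient, adds to this the smoothed derivative of the previous iteration (the (k-1)-th iteration) multiplied by the smoothing coefficient, and thereby obtains the smoothed derivative at the k-th iteration;

learning-rate update means, which multiplies, element by element, the k-th gradient and the (k-1)-th smoothed derivative obtained by the smoothed-derivative calculation means, and updates each learning rate individually, increasing it by adding the learning-rate increase rate where the result is positive and decreasing it by multiplying by the learning-rate decrease rate where the result is negative;

weight update means, which computes, element by element, the correction obtained by multiplying the gradient from the data presentation means by the individual learning rate for each weight from the learning-rate update means, and adds to it, for each weight, the (k-1)-th weight correction multiplied by the momentum rate, to obtain the weight correction;

gradient-difference calculation means, which computes the gradient difference (the gradient difference at the k-th iteration) from the k-th and (k-1)-th gradients obtained by the data presentation means, and transfers the result to the correction-term addition judgment means;

correction-term addition judgment means, which computes the inner product of the k-th gradient difference obtained by the gradient-difference calculation means and the (k-1)-th gradient difference of the previous iteration; if the result is negative, it decides to add a correction term and transfers the weights to the correction amount calculation means, and if the result is positive, it decides not to add a correction and transfers the weights to the extremum-search convergence judgment means;

correction amount calculation means, which computes, for addition to the weight correction obtained by the weight update means, a correction amount equal to the k-th gradient difference multiplied by a scalar, namely the inner product of the k-th gradient difference and the k-th weight correction divided by the squared magnitude (norm) of the k-th gradient difference and by 2, and transfers the result to the weight correction means;

weight correction means, which updates the weights element by element based on the correction amount transferred from the correction amount calculation means, and transfers them to the extremum-search convergence judgment means;

global search means, which, taking the local solution found by the extremum search as the initial value, constructs a new evaluation function from which the extremum of the evaluation function at the local solution has been removed, and performs a global search by minimizing this evaluation function;

extremum-search convergence judgment means, which performs the extremum search using the weights transferred from the weight update means or the weight correction means; if the stopping criterion of the extremum search is satisfied, it ends the extremum search and has the global search means execute, and if it is not satisfied, it has the processing repeat from the data presentation means until the criterion is satisfied;

global-search convergence judgment means, which judges whether the result of the global search by the global search means satisfies the stopping criterion of global optimization and, if it does, ends the processing;

constraint-condition coefficient update means, which, when the global-search convergence judgment means has judged that the stopping criterion of global optimization is not satisfied and a solution with an evaluation function value smaller than that at the local solution has been found, adjusts the parameters of the constraint condition and then has the extremum search executed from the data presentation means onward; and

global-optimization coefficient adjustment means, which, when the global-search convergence judgment means has judged that the stopping criterion of global optimization is not satisfied and no solution with an evaluation function value smaller than that at the local solution can be found, adjusts the coefficients for global optimization, returns to the processing by the global search means and repeats the global search, and ends the processing once the search has been repeated the specified number of times.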
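The per-iteration quantities named in claim 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the patented implementation: the defining equations are not reproduced in this text, so the sign with which the correction amount is applied to the weights is not shown, the handling of a zero product or zero norm is a choice made here, and all function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def smoothed_derivative(grad, prev_smoothed, theta):
    """(1 - theta) * gradient_k + theta * smoothed_{k-1}, element by element."""
    return [(1.0 - theta) * g + theta * s for g, s in zip(grad, prev_smoothed)]

def update_learning_rates(rates, grad, prev_smoothed, increase, decrease):
    """Add `increase` where gradient_k * smoothed_{k-1} > 0, multiply by
    `decrease` where it is < 0; a zero product leaves the rate unchanged
    (an assumption, the claim does not cover that case)."""
    updated = []
    for eta, g, s in zip(rates, grad, prev_smoothed):
        product = g * s
        if product > 0:
            eta = eta + increase
        elif product < 0:
            eta = eta * decrease
        updated.append(eta)
    return updated

def needs_correction(grad_diff, prev_grad_diff):
    """A correction term is added when successive gradient differences
    have a negative inner product."""
    return dot(grad_diff, prev_grad_diff) < 0

def correction_amount(grad_diff, weight_change):
    """Gradient difference scaled by (grad_diff . weight_change) / (2 * |grad_diff|^2).
    Returns zeros when the gradient difference is zero (an assumption)."""
    norm_sq = dot(grad_diff, grad_diff)
    if norm_sq == 0.0:
        return [0.0] * len(grad_diff)
    scale = dot(grad_diff, weight_change) / (2.0 * norm_sq)
    return [scale * d for d in grad_diff]
```

Note that the judgment and the correction amount both use inner products over the whole weight vector, which is what the second embodiment later replaces with element-wise products.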
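[0028]
According to the invention of claim 1, the global minimum can be found even when local minima exist on the evaluation function surface. In addition, when a surface with a large condition number, such as a valley, is encountered while searching for a local minimum, the weight oscillation that arises in that region is suppressed, which accelerates convergence and shortens the time needed to find the global minimum.
[0029]
The invention of claim 2 is a globally accelerated learning device for a neural network model, composed of an input unit having a file reading device and a sensor that measures time-series data, a processing unit, and an output unit, wherein the processing unit comprises:

variable initialization means, which receives from the input unit and initializes the initial weights of the neural network model, the initial learning rates, the learning-rate increase and decrease rates, the smoothing coefficient, the momentum rate, the initial value of the correction coefficient, the stopping criteria for the extremum search and the global search, the coefficients of the constraint condition, and the coefficients for global optimization;

data presentation means, which receives the input data to be learned from the input unit and stores it temporarily, presents all patterns of the input data to the neural network model, computes the output of the model for each input pattern, and, based on the evaluation criterion value for all patterns computed from these outputs and the teacher data, computes the gradient, i.e., the first partial derivative of the evaluation function with respect to the weights at the current iteration (the k-th iteration);

smoothed-derivative calculation means, which multiplies the k-th gradient obtained by the data presentation means by one minus the smoothing coefficient, adds to this the smoothed derivative of the previous iteration (the (k-1)-th iteration) multiplied by the smoothing coefficient, and thereby obtains the smoothed derivative at the k-th iteration;

learning-rate update means, which multiplies, element by element, the k-th gradient and the (k-1)-th smoothed derivative obtained by the smoothed-derivative calculation means, and updates each learning rate individually, increasing it by adding the learning-rate increase rate where the result is positive and decreasing it by multiplying by the learning-rate decrease rate where the result is negative;

weight update means, which computes, element by element, the correction obtained by multiplying the gradient from the data presentation means by the individual learning rate for each weight from the learning-rate update means, and adds to it, for each weight, the (k-1)-th weight correction multiplied by the momentum rate, to obtain the weight correction;

gradient-difference calculation means, which computes the gradient difference (the gradient difference at the k-th iteration) from the k-th and (k-1)-th gradients obtained by the data presentation means, and transfers the result to the correction-term addition judgment means;

correction amount calculation means, which computes, element by element, the product of the k-th gradient and the (k-1)-th smoothed derivative; where the result is negative, it adds to the weight correction obtained by the weight update means, individually for each weight, a correction amount equal to the gradient difference at the current iteration multiplied by a fixed correction coefficient, and where the product of the gradient and the smoothed derivative obtained by the learning-rate update means is positive, it adds no correction;

correction coefficient increase/decrease means, which decreases the correction coefficient when the product of the k-th gradient and the (k-1)-th smoothed derivative is positive, and increases it when the product is negative;

weight correction means, which updates the weights element by element using the weight corrections obtained by the weight update means and the correction amount calculation means;

global search means, which, taking the local solution found by the extremum search as the initial value, constructs a new evaluation function from which the extremum of the evaluation function at the local solution has been removed, and performs a global search by minimizing this evaluation function;

extremum-search convergence judgment means, which, if the stopping criterion of the extremum search is satisfied, ends the extremum search and has the global search means execute, and, if it is not satisfied, returns to the processing by the data presentation means and repeats it until the stopping criterion of the extremum search is satisfied;

global-search convergence judgment means, which judges whether the result of the global search by the global search means satisfies the stopping criterion of global optimization and, if it does, ends the processing;

constraint-condition coefficient update means, which, when the global-search convergence judgment means has judged that the stopping criterion of global optimization is not satisfied and a solution with an evaluation function value smaller than that at the local solution has been found, adjusts the parameters of the constraint condition and then has the extremum search executed from the data presentation means onward; and

global-optimization coefficient adjustment means, which, when the global-search convergence judgment means has judged that the stopping criterion of global optimization is not satisfied and no solution with an evaluation function value smaller than that at the local solution can be found, adjusts the coefficients for global optimization, returns to the processing by the global search means and repeats the global search, and ends the processing once the search has been repeated the specified number of times.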
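The element-wise correction of claim 2 can be sketched as below. It is a minimal illustration, assuming that the correction is literally "gradient difference × correction coefficient" added to each weight change; the sign and size of the coefficient are settings the text does not give, and the function name is hypothetical.

```python
def elementwise_corrected_change(weight_change, grad, prev_smoothed,
                                 grad_diff, coefficient):
    """For each weight i: if gradient_k[i] * smoothed_{k-1}[i] < 0 (oscillation
    suspected), add coefficient * grad_diff[i] to the weight change;
    otherwise leave the weight change as computed by the weight update."""
    corrected = []
    for dw, g, s, dg in zip(weight_change, grad, prev_smoothed, grad_diff):
        if g * s < 0:
            dw = dw + coefficient * dg
        corrected.append(dw)
    return corrected
```

Because every decision uses only the i-th components, no inner product over the whole weight vector is needed, which is what makes this form suitable for updating after each presented pattern.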
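[0030]
According to the invention of claim 2, the global minimum can be found even when local minima exist on the evaluation function surface, and the weights and learning rates can be updated individually. Learning time can therefore be greatly shortened for tasks with many learning patterns and for sequential learning using time-series data and the like.
[0031]
Further, the invention of claim 3 is the invention of claim 2, wherein:

the smoothed-derivative calculation means has means for multiplying the (k-1)-th smoothed derivative by one minus the smoothing coefficient, adding to this the gradient multiplied by the smoothing coefficient, and computing the smoothed derivative at the k-th iteration individually for each element;

the learning-rate update means has means for multiplying, element by element, the k-th smoothed derivative and the (k-1)-th smoothed derivative obtained by the smoothed-derivative calculation means, and updating the learning rate individually for each element so that it is increased by adding the learning-rate increase rate where the result is positive and decreased by multiplying by the learning-rate decrease rate where the result is negative;

the correction amount calculation means has means for multiplying, element by element, the k-th smoothed derivative and the (k-1)-th smoothed derivative obtained by the smoothed-derivative calculation means, and, where the result is negative, obtaining for the weight correction obtained by the smoothed-derivative calculation means a correction amount equal to the gradient difference at the k-th iteration multiplied by a variable correction coefficient, or, where the product of the k-th and (k-1)-th smoothed derivatives is positive, computing no correction amount; and

the correction coefficient increase/decrease means has means for decreasing the correction coefficient when the product of the k-th smoothed derivative and the (k-1)-th smoothed derivative is positive, and increasing it when the product is negative.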
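The two changes that claim 3 makes to the criterion can be sketched as follows: the smoothing weights are applied the other way round, and the sign test uses two successive smoothed derivatives instead of the raw gradient. The function names are hypothetical and the sketch assumes nothing beyond the wording of the claim.

```python
def smoothed_derivative_claim3(grad, prev_smoothed, theta):
    """(1 - theta) * smoothed_{k-1} + theta * gradient_k, element by element."""
    return [(1.0 - theta) * s + theta * g for g, s in zip(grad, prev_smoothed)]

def oscillation_flags(smoothed, prev_smoothed):
    """True for each element whose k-th and (k-1)-th smoothed derivatives
    have a negative product, i.e. where a correction would be applied."""
    return [s * p < 0 for s, p in zip(smoothed, prev_smoothed)]
```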
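[0032]
According to the invention of claim 3, the global minimum can be found even when local minima exist on the evaluation function surface, and the instability of convergence during learning can be reduced.
[0033]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
[0034]
FIG. 1 is a block diagram showing the configuration of a learning device that implements the globally accelerated learning method for neural network models applied to the first through third embodiments of the present invention. The learning device shown in FIG. 1 consists of an input unit 100, a processing unit 200, and an output unit 300. The input unit 100 comprises a file reading device 101 for reading the variables used in learning, such as learning rates, various sensors 102 that measure time-series data, a TV camera 103, and so on. The processing unit 200 comprises a variable initialization unit 201, a data presentation unit 202, a learning-rate update unit 203, a weight update unit 204, a gradient-difference calculation unit 205, a correction-term addition judgment unit 206, a correction amount calculation unit 207, a weight correction unit 208, an extremum-search convergence judgment unit 209, a global search unit 210, a global-search convergence judgment unit 211, a constraint-condition coefficient update unit 212, and a global-optimization coefficient update unit 213.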
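[0035]
The globally accelerated learning method according to the first embodiment of the present invention combines the KickOut method with the tunneling algorithm. To solve the tunneling algorithm's problems of slow convergence and difficult tuning of the parameter k, the search proceeds as follows. First, to find a local minimum, an extremum search is performed using the KickOut method of Equations (3) to (6). Then, after a local minimum has been found by the extremum search, a global search based on the following procedure is performed to find a point with an even smaller evaluation function value.
[0036]
The evaluation function used in the global search is defined as follows.
[0037]
[Equation 9]

The KickOut method is also applied to this function in the global search.
[0038]
[Equation 10]

In addition, the parameters k_j and λ, which could only be tuned empirically in the dynamic tunneling algorithm, are adjusted as follows.
[0039]
(1) Set k_j, μ_0, α > 0, and 0 < β < 1.
[0040]
[Inline image 6]

[Equation 11]

When the condition (Equation (15)) is not satisfied and, through the global search, the evaluation func...
[Inline image 7]

... case, that is,
[Equation 12]

[Inline image 8]

... as the starting point, step (6) onward is executed. If no point satisfying Equation (16) is found, step (4) onward is executed.
[0041]
[Equation 13]

[0042]
[Inline image 9]

... otherwise, set μ_{k+1} = μ_k and k = k + 1, and return to step (2). However, if the stopping criterion of the global search is still not satisfied after the global search has been repeated the specified number of times, the procedure terminates.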
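[0043]
The globally accelerated learning method according to the second embodiment of the present invention is a sequential learning method with a correction scheme that corrects the weight values individually for each element when the weights oscillate. To judge weight oscillation, the correction term in this method is computed using the gradient together with the smoothed derivative, which represents the global slope of the evaluation function surface.
[0044]
To keep memory usage and computation as low as possible, the product of the k-th gradient and the (k-1)-th smoothed derivative is used as the criterion for updating the learning rates and for correcting the weights. Furthermore, by using a preset value rather than a computed value as the correction coefficient, the inner-product operations used in the first embodiment are eliminated, and weight oscillation can be corrected individually for each element.
[0045]
For the second embodiment, the learning rule corresponding to the KickOut method used in the first embodiment is shown below.
[0046]
[Equation 14]

Note that the above is the learning rule for the extremum search; in the global search, the learning rule with the following symbol substitutions is used.
[0047]
g_{k,i} → e_{k,i} ... (23)
y_{k,i} → z_{k,i} ... (24)
In the learning method of the third embodiment of the present invention, stable convergence requires that the information about the evaluation function surface used for the learning rates and for the criterion for adding the correction term, such as the gradient, be global information over all patterns. The product of the k-th and (k-1)-th smoothed derivatives is therefore used. Furthermore, varying the correction coefficient according to the state of learning enables appropriate correction.
[0048]
The learning rule used in the third embodiment is shown below.
[0049]
[Equation 15]

Note that the above is the learning rule for the extremum search; in the global search, the learning rule with the following symbol substitutions is used.
[0050]
g_{k,i} → e_{k,i} ... (29)
y_{k,i} → z_{k,i} ... (30)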
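Next, the operation is described with reference to the flowcharts shown in FIGS. 3 and 4.
[0051]
In FIG. 1, time-series data and the like measured with the sensors 102, the TV camera 103, etc. are input from the input unit 100, and the learning data are transferred to the data presentation unit 202 (step S11). The time-series data are thus temporarily stored in the data presentation unit 202. In addition, the values needed for learning, such as learning rates, are read from the file reading device 101 and transferred to the variable initialization unit 201 (step S13).
[0052]
The variable initialization unit 201 sets the initial values of the weights of the neural network model, the learning rates, and so on (step S15). The data presentation unit 202 presents the time-series data transferred from the sensors 102, the TV camera 103, etc. to the neural network model one pattern at a time, computes the output of the model by forward computation (step S17), and computes the given evaluation criterion value from this model output and the teacher data. It then performs the backward computation of the neural network model based on this evaluation criterion value and computes the gradient (step S19).
[0053]
The learning-rate update unit 203 updates the learning rates based on the sign of the gradient or of the smoothed derivative (step S21). The weight update unit 204 updates the weights using these learning rates, the gradient, and so on, and transfers the gradient etc. to the gradient-difference calculation unit 205 (step S23).
[0054]
The gradient-difference calculation unit 205 uses the gradients obtained in the data presentation unit 202 to compute the difference between the gradients at iterations k and k-1, and transfers this information to the correction-term addition judgment unit 206 (step S25).
[0055]
Based on the transferred gradients etc., the correction-term addition judgment unit 206 computes the product of the smoothed derivatives in the first embodiment, or the product of the gradient and the smoothed derivative in the second embodiment. If this value is negative, it transfers the weight values etc. to the correction amount calculation unit 207; if the value is positive, it transfers them to the convergence judgment unit 209 (step S27).
[0056]
Following the instruction of the correction-term addition judgment unit 206, the correction amount calculation unit 207 computes the weight correction amount from the transferred weights, gradient differences, etc., and transfers it to the weight correction unit 208 (step S29).
[0057]
The weight correction unit 208 corrects the weight values based on the transferred correction amount and transfers the corrected weights to the convergence judgment unit 209 (step S31).
[0058]
The extremum-search convergence judgment unit 209 uses the weights transferred from the weight correction unit 208 or the weight update unit 204 to judge whether the stopping criterion of the extremum search is satisfied (step S33).
[0059]
The global search unit 210 minimizes the newly defined evaluation function E(w), thereby searching for a point at which the evaluation function value E(w) is smaller (step S35).
[0060]
The global-search convergence judgment unit 211 judges whether the point found by the global search satisfies the stopping criterion of global optimization (step S37).
[0061]
The constraint-condition coefficient update unit 212 updates the coefficients of the constraint condition when the global search has found a point with a smaller evaluation function value (step S39).
[0062]
The global-optimization coefficient update unit 213 updates the coefficients for global optimization when the global search has found no point with a smaller evaluation function value (step S41). If the processing has not ended after the global search has been repeated the specified number of times, it is terminated.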
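The control flow of steps S17 to S41 can be sketched as an outer loop around two searches. This is an outline under stated assumptions, not the patented implementation: every callable is a hypothetical stand-in (the defining equations are not reproduced in this text), and the toy one-dimensional function with a simple probing "global search" is used only to make the sketch runnable.

```python
def globally_accelerated_learning(w, extremum_search, global_search,
                                  is_globally_optimal, update_constraint_coeffs,
                                  update_global_coeffs, max_global_searches):
    """Alternate extremum search and global search (steps S17 to S41)."""
    local_w, local_e = extremum_search(w)               # S17 to S33
    for _ in range(max_global_searches):
        cand_w, cand_e = global_search(local_w)         # S35
        if is_globally_optimal(cand_w, cand_e):         # S37
            return cand_w, cand_e
        if cand_e < local_e:                            # S39: better point found
            update_constraint_coeffs()
            local_w, local_e = extremum_search(cand_w)
        else:                                           # S41: nothing better
            update_global_coeffs()
    return local_w, local_e                             # iteration limit reached

# Toy usage: E(w) has a local minimum near w = 0.96 and a lower one near w = -1.04.
def toy_energy(w):
    return (w * w - 1.0) ** 2 + 0.3 * w

def toy_extremum_search(w, lr=0.01, steps=1000):
    """Plain gradient descent, standing in for the KickOut-style search."""
    for _ in range(steps):
        w -= lr * (4.0 * w * (w * w - 1.0) + 0.3)
    return w, toy_energy(w)

def toy_global_search(w):
    """Probe fixed offsets; a stand-in for minimizing the modified function."""
    for offset in (-0.5, -1.0, -1.5, -2.0, 0.5, 1.0, 1.5, 2.0):
        if toy_energy(w + offset) < toy_energy(w):
            return w + offset, toy_energy(w + offset)
    return w, toy_energy(w)

counts = {"constraint": 0, "global": 0}

def count_constraint():
    counts["constraint"] += 1

def count_global():
    counts["global"] += 1

final_w, final_e = globally_accelerated_learning(
    1.0, toy_extremum_search, toy_global_search,
    lambda w, e: False, count_constraint, count_global, 3)
```

In this toy run the first global search escapes the local minimum, and the two later global searches find nothing better, so the loop ends on the iteration limit.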
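[0063]
The learning devices of the above embodiments also include the cases where (1) all learning rates are set to the same value, and this value is either held fixed during learning (learning-rate increase rate set to 0 and learning-rate decrease rate set to 1) or made variable; (2) the smoothing coefficient is set to 0, i.e., only the gradient is used as the criterion for correcting the weights and updating the learning rates; and (3) the momentum rate is set to 0.
[0064]
The learning device of the second embodiment also includes the cases where (1) the correction coefficient is increased by adding a positive value to it or by multiplying it by a value greater than 1; (2) the correction coefficient is decreased by multiplying it by a value greater than 0 and less than 1, or by subtracting a positive value from it; (3) (1) and (2) are used together to increase and decrease the correction coefficient; and (4) the correction coefficient is not increased or decreased but held at a constant value.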
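The options listed for adjusting the correction coefficient can be captured by one small function. This is a sketch only: the parameter names are hypothetical, the rule "increase on a negative product, decrease on a positive product" is taken from the claims, and applying the multiplication before the addition or subtraction when both are given is an assumption.

```python
def adjust_correction_coefficient(coefficient, product,
                                  inc_add=0.0, inc_mul=1.0,
                                  dec_mul=1.0, dec_sub=0.0):
    """Increase the coefficient when `product` is negative (oscillation),
    decrease it when `product` is positive. The defaults leave the
    coefficient constant, which corresponds to option (4)."""
    if product < 0:
        return coefficient * inc_mul + inc_add   # option (1)
    if product > 0:
        return coefficient * dec_mul - dec_sub   # option (2)
    return coefficient
```

For example, `inc_add=0.25` with `dec_mul=0.5` mixes an additive increase with a multiplicative decrease, which is one way of combining options (1) and (2).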
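[0065]
[Effects of the Invention]
As described above, according to the present invention, the global minimum can be found for problems in which local minima exist on the evaluation function surface. In addition, where a valley structure exists on the evaluation function surface, convergence of the local search is accelerated, so the global minimum can be found in a short time.
[0066]
Also according to the present invention, the global minimum can be found even when local minima exist on the evaluation function surface, and learning of tasks in which the learning data are unbounded in number, such as time-series data, can be completed in a short time with little memory and computation.
[0067]
Furthermore, according to the present invention, the global minimum can be found even when local minima exist on the evaluation function surface, and learning of tasks in which the learning data are unbounded in number, such as time-series data, can be completed in a short time and with stable convergence.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of a learning device that implements the globally accelerated learning method for neural network models applied to the first through third embodiments of the present invention.
[FIG. 2] A diagram showing an example of a neural network model.
[FIG. 3] A flowchart showing part of the procedure of the globally accelerated learning method for neural network models shown in FIG. 1.
[FIG. 4] A flowchart showing the remaining part of the procedure of the globally accelerated learning method for neural network models shown in FIG. 1.
[Explanation of Reference Numerals]
100 Input unit
200 Processing unit
201 Variable initialization unit
202 Data presentation unit
203 Learning-rate update unit
204 Weight update unit
205 Gradient-difference calculation unit
206 Correction-term addition judgment unit
207 Correction amount calculation unit
208 Weight correction unit
209 Extremum-search convergence judgment unit
210 Global search unit
211 Global-search convergence judgment unit
212 Constraint-condition coefficient update unit
213 Global-optimization coefficient update unit
300 Output unit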
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method and a device for global acceleration learning of a neural network model applicable to processing using a neural network, processing using a non-linear optimization method, and the like. A neural circuit that can reduce the learning time by using sequential learning with weight oscillation suppressed, change to random number search when searching for the optimal solution, avoid local minima, and search for global minima The present invention relates to a global acceleration learning method for a network model and an apparatus therefor.
[0002]
[Prior art]
First, an example of a neural network model to which the present invention can be applied will be described, and a learning method for the model will be described. Here, a hierarchical neural network model is used as a typical example, but the present invention can be applied to a model having another structure such as a neural network model having a recursive connection.
[0003]
As shown in FIG. 2, the hierarchical neural network model is a layered network model including one input layer, a plurality of intermediate layers, and one output layer, and each layer includes a unit, a weight, and a bias. Is done. The unit is the output value (x _i (I = 1, 2,..., L, L: the number of units in the previous layer) and weight (w _i , I: the number of weights) and the bias (b _i , I: unit number) as an input value, output a value (y) obtained by applying a non-linear transformation (f (·)) to the input value, and transmit the output value to a unit in the next layer. (FIG. 2, Equation (1)). Here, the input / output conversion function of the unit of the input layer is linear, and the nonlinear conversion function f (·) of the unit of the layer other than the input layer uses a sigmoid function which is a typical example (Equation (1)). It is also conceivable to use another conversion function according to the model.
[0004]
(Equation 1)

[Outside 1]

Consider a learning rule for estimating a weight value that minimizes an evaluation criterion. Here, as a typical example of the evaluation criterion, the output value (O _j (J = 1, 2,..., M), M: unit number of the output layer) and teacher data (T _j (J = 1, 2,..., M)) is used.
[0005]
(Equation 2)

D. E. FIG. Have proposed a learning method in which an inertia term is added to a weight updating rule as a learning method for a neural network model. However, when the evaluation function surface has a valley shape, the weight oscillates in the direction crossing the valley in the learning process, so the weight is not updated in the direction of descending the valley where the optimal solution exists, and the convergence speed decreases. There is a problem of doing. To solve this problem, Ochiai et al. Proposed the Kick Out method for suppressing weight oscillations (Literature: Yoshihiro Ochiai, Naohiro Toda, Shiro Usui: "Acceleration of hierarchical neural networks for suppressing weight oscillations" -Kick Out Method- ", Transactions of the Institute of Electrical Engineers of Japan, Vol. 113-C, No. 12, pp. 1154-1162 (1993)).
[0006]
This learning method will be described below.
[0007]
(Equation 3)

[0008]
On the other hand, when a local minimum exists on the evaluation function surface, there is a problem that the global minimum cannot be searched because the search process converges to the local minimum.
[0009]
As a learning rule to solve this, when a local minimum is found in the extremum search, a penalty function centering on the local solution is constructed, and an evaluation function (Eq. (9)) that eliminates the local minimum is used. A dynamic tunneling algorithm that constructs and searches for a new minimum value based on the dynamic tunneling algorithm (literature: Y. Yao: "Dynamic tunneling algorithm for global optimization", IEEE Trans. On Sys. Man and Cyber. SMC-19p. .1222-1230 (1989)). This method consists of two algorithms. Given the following optimization problem,
[Outside 2]

[0010]
(Equation 5)

[0011]
[Outside 3]

[0012]
(Equation 6)

[0013]
[Outside 4]

(Equation 7)

[Outside 5]

[0014]
(Equation 8)

As described above, in the dynamic tunnel algorithm, a global optimal solution is obtained by repeatedly using an extremum search and a global search.
[0015]
However, the extremum search algorithm used in this global optimization method uses a primitive steepest descent method, convergence is very slow, and it is also difficult to find the solution of the ODE by numerical integration. There is a problem that the convergence speed is slow.
[0016]
In practice, k is a parameter that must be adjusted when performing a global search. _j , K, but this adjustment can only be made empirically and is very difficult.
[0017]
[Problems to be solved by the invention]
The Kick Out method is an extremum search type algorithm that exerts the effect of acceleration on the shape of the valley on the evaluation function surface. With the Kick Out method alone, a local minimum exists on the evaluation function surface. In this case, it is difficult to avoid this.
[0018]
The dynamic tunneling algorithm is an algorithm for searching for a global minimum, but has a problem that convergence is extremely slow.
[0019]
Further, the Kick Out method uses a collective learning method in which the weight is updated only once after all the patterns to be learned are presented, and the coefficient of the correction term for correcting the vibration of the weight indicates all the learning patterns. Can only be sought after. Therefore, the time required for one iteration of the learning algorithm increases exponentially with the increase in the number of patterns, and there is a problem that the learning time becomes enormous. In order to improve this, it is necessary to introduce sequential learning in which the weight is updated each time one learning pattern is presented, but even if the collective learning type Kick Out method is simply changed to the sequential learning type, There is a problem that the same acceleration effect as in the kick out method cannot be obtained.
[0020]
The present invention has been made in view of the above, and it is an object of the present invention to be able to search for a global minimum value even if a local minimum value exists on an evaluation function surface, and to reduce learning convergence. A neural network model that can search for global minima in a short time by accelerating speed Global accelerated learning device Is to provide.
[0021]
[Means for Solving the Problems]
To achieve the above object, the present invention according to claim 1 provides a global neural network model comprising an input unit having a file reading device and a sensor for measuring time-series data, a processing unit, and an output unit. The acceleration learning device, wherein the processing unit includes an initial value of a weight of the neural network model, an initial value of a learning rate, an increase / decrease rate of the learning rate, a smoothed differential coefficient, an inertia rate, an initial value of a correction coefficient, and an extreme value. A variable initialization means for transferring and initializing a stop reference value of value search and global search, a coefficient of a constraint condition, and a coefficient relating to global optimization from the input unit, and input data to be learned from the input unit. Transferred and temporarily stored, presents all the patterns of the input data to the neural network model, calculates the output value of the neural network model for each input pattern, and calculates the output value and the teacher data. All Data presenting means for calculating a gradient, which is the first partial derivative of an evaluation function relating to the weight at the current iteration point (k-th iteration), based on the evaluation reference value for the pattern, and a value obtained by subtracting the smoothing derivative from 1 Is multiplied by the gradient at the k-th iteration determined by the data presentation means, and a value obtained by multiplying this value by the smoothing derivative of the previous iteration (referred to as the (k-1) -th iteration) by the smoothing derivative is added. Means for calculating a smoothed differential value in the eye, and a gradient at the k-th iteration and a smoothed differential value at the (k-1) th iteration obtained by the smoothed differential value calculating means for each element. 
If the result of individual multiplication is a positive value, the learning rate is increased by adding the learning rate to the learning rate, the learning rate is increased, and each learning rate is individually updated.If the calculation result is a negative value, the learning rate is learned. Learning to reduce the learning rate by multiplying by the rate reduction rate and update each learning rate individually An update unit, a gradient obtained by the data presentation unit, and a correction amount multiplied by an individual learning rate corresponding to each weight obtained by the learning rate update unit are individually calculated for each element; weight updating means for individually adding a correction amount obtained by multiplying the weight correction amount of the (k-1) th iteration by the inertia rate to each weight to calculate the weight correction amount; A gradient difference calculating means for calculating a gradient difference (hereinafter referred to as a gradient difference at the k-th iteration) from the gradient and the gradient at the (k-1) th iteration, and transferring the result to a correction term addition determining means; Calculates the inner product of the difference between the gradient at the k-th iteration and the difference between the gradient at the (k-1) -th iteration before, and if the result is a negative value, determines that a correction term is to be added and calculates the correction amount. 
Transfer the weight to the means, and add a correction amount if the result of the inner product is a positive value A correction term addition determining means for transferring the weight to the extreme value search convergence determining means upon determining that the weight is not to be corrected; Is calculated by multiplying the result (scalar amount) obtained by dividing the inner product value of the correction amount of the above by the square of the magnitude (norm) of the gradient difference at the k-th iteration and 2 and the difference of the gradient at the k-th iteration, A correction amount calculating unit that transfers the result to the weight correcting unit; and a weight that is individually updated for each element based on the correction amount of the weight transferred from the correction amount calculating unit, and is transferred to the extreme value search convergence determining unit. Weight correction means to perform, and a new evaluation function from which the extremum of the evaluation function in the local solution is removed, using the local solution which is a point searched by the extremum search as an initial value, and minimizing the evaluation function A global search means for performing a global search by An extremum search is performed using the weight transferred from the weight updating means or the weight correction means, and when the stop criteria for the extremum search is satisfied, the extremum search is terminated and the processing by the global search means is executed. 
When the stop criterion for the extremum search is not satisfied, the extremum search convergence determining means for repeatedly executing the processing from the data presenting means until the stop criterion is satisfied, and the execution of the global search by the global search means A global search convergence determination means for determining whether a result satisfies a global optimization stop criterion, and terminating the process when the result satisfies the global optimization stop criterion; When it is determined by the means that the stopping criteria of the global optimization are not satisfied, if a solution having an evaluation function value smaller than the evaluation function value of the local solution can be searched, the parameter relating to the constraint condition is searched. After adjusting the data, when the constraint condition coefficient updating means for executing the extremum search after the processing by the data presenting means and the global convergence determining means determine that the stop criteria of the global optimization is not satisfied. If it is not possible to search for a solution having an evaluation function value smaller than the evaluation function value in the local solution, after adjusting the coefficients related to global optimization, the process returns to the global search means, and the global search is repeatedly executed. The main purpose is to have a global optimization coefficient adjusting means for terminating the process when the process is repeatedly executed a designated number of times.
[0028]

Claim

1 In the present invention described, even when a local minimum exists on the evaluation function surface, it is possible to search for a global minimum, and when searching for a local minimum, When a curved surface having a large number of conditions such as a valley exists, convergence can be accelerated by suppressing the oscillation of the weight generated in this region, and the time for searching for a global minimum can be reduced.
[0029]
According to a second aspect of the present invention, there is provided a global acceleration learning apparatus for a neural network model including an input unit having a file reading device and a sensor for measuring time-series data, a processing unit, and an output unit. The processing unit includes an initial value of the weight of the neural network model, an initial value of the learning rate, an increase / decrease rate of the learning rate, a smoothed differential coefficient, an inertia rate, an initial value of the correction coefficient, an extreme value search, and a global value. A variable initialization means for transferring and initializing a search stop reference value, a constraint condition coefficient, and a coefficient relating to global optimization from the input unit, and temporarily transferring input data to be learned from the input unit. And presents all the patterns of the input data to the neural network model, calculates the output value of the neural network model for each input pattern, and evaluates all the patterns calculated from this output value and the teacher data. Data presenting means for calculating a gradient, which is a first-order partial differential of an evaluation function relating to a weight at a current iteration point (k-th iteration) based on a quasi-value, and presenting a value obtained by subtracting a smoothing derivative from 1 The value obtained by multiplying the gradient at the k-th iteration obtained by the means and a value obtained by multiplying this value by the smoothing derivative one iteration before (referred to as the (k-1) -th iteration) by the smoothing differential coefficient is added, and smoothing at the k-th iteration is performed. 
Smoothing differential value calculating means for calculating and calculating a differential value, and the gradient at the k-th iteration and the smoothing differential value at the (k-1) -th iteration obtained by the smoothing differential value calculating means are individually multiplied for each element. If the calculation result is a positive value, the learning rate is added to the learning rate and the learning rate is increased to increase the learning rate, and each learning rate is individually updated. If the calculation result is a negative value, the learning rate is reduced to the learning rate. Learning rate updating means for reducing the learning rate by multiplying The amount of correction obtained by multiplying the gradient obtained by the data presentation means and the individual learning rate corresponding to each weight obtained by the learning rate updating means is calculated individually for each element, and this correction amount is k-1 repeated. Weight updating means for individually adding to each weight a correction amount obtained by multiplying the correction amount of the eye weight by the inertia ratio, and a gradient and k-th of the k-th iteration obtained by the data presentation means. A gradient difference is calculated from the gradient at the first iteration (referred to as a gradient difference at the k-th iteration), and the result is transferred to the correction term addition determining means; and k obtained by the gradient difference calculating means is calculated. The gradient at the iteration and the smoothed derivative at the (k-1) th iteration are calculated for each element, and if the result is a negative value, the correction amount of the weight obtained by the weight updating means is added to the current iteration point. The correction amount obtained by multiplying the gradient difference by a fixed value correction coefficient is individually weighted for each element. 
Or a correction amount calculating unit that does not add a correction amount when the inner product of the gradient and the smoothing derivative obtained by the learning rate updating unit is a positive value. When the result of multiplication by the smoothing derivative is a positive value, the correction coefficient is decreased by the correction coefficient, and when the result of the multiplication is a negative value, the correction coefficient is increased by the correction coefficient increasing / decreasing means, and the weight updating means and the correction amount calculating means are used. Weight correction means for individually updating the weight for each element using the weight correction amount, and a new method in which a local solution which is a point searched by the extreme value search is set as an initial value and an extreme value of an evaluation function in the local solution is removed. Global search means for performing a global search by minimizing the evaluation function, and, when a stop criterion for the extremum search is satisfied, terminating the extremum search and completing the global search. 
Execute processing by search means and stop extreme value search If the criterion is not satisfied, the process returns to the data presenting means, and the extreme value search convergence determining means for repeatedly executing the processing by the data presenting means until the stop criteria of the extreme value search is satisfied, and the global search means Global search convergence determining means for determining whether or not the execution result of the global search satisfies the stop criterion of global optimization, and terminating the process if the stop criterion of the global optimization is satisfied, When the global convergence determining means determines that the stop criteria of the global optimization is not satisfied, if a solution having an evaluation function value smaller than the evaluation function value in the local solution can be searched, a parameter relating to the constraint condition is used. After the adjustment, the constraint condition coefficient updating means for executing the extremum search after the processing by the data presenting means, and the global optimization is stopped by the global convergence determining means. When it is determined that the criterion is not satisfied, if a solution having an evaluation function value smaller than the evaluation function value in the local solution cannot be searched for, after adjusting the coefficient related to global optimization, the processing by the global search unit is performed. The main purpose is to have a global optimization coefficient adjusting means for repeatedly executing the global search and terminating the process when the specified number of times has been repeatedly executed.
[0030]
According to the invention of claim 2 described above, a global minimum can be found even when local minima exist on the evaluation function surface. Because the weights and the learning rates are updated individually, the learning time can also be greatly reduced when learning a task with many learning patterns or in sequential learning using time-series data or the like.
[0031]
Further, the invention according to claim 3 is the invention according to claim 2, wherein the smoothed differential value calculating means has means for multiplying the smoothed differential value at the (k-1)-th iteration by the value obtained by subtracting the smoothing differential coefficient from 1, adding to this the value obtained by multiplying the gradient by the smoothing differential coefficient, and thereby individually calculating the smoothed differential value at the k-th iteration for each element. The learning rate updating means has means for individually multiplying, for each element, the smoothed derivative at the k-th iteration by the smoothed derivative at the (k-1)-th iteration obtained by the smoothed differential value calculating means, increasing the learning rate by adding the learning rate increase rate to it when the result is a positive value, and decreasing the learning rate by multiplying it by the learning rate decrease rate when the result is a negative value, so that the learning rate is updated individually for each element. The correction amount calculating means has means for multiplying, for each element, the smoothed derivative at the k-th iteration by the smoothed derivative at the (k-1)-th iteration obtained by the smoothed differential value calculating means, and, when the result is a negative value, calculating a correction amount in which the gradient difference at the k-th iteration multiplied by the correction coefficient is added to the weight correction amount obtained by the smoothed differential value calculating means, or not calculating a correction amount when the result of multiplying the smoothed derivatives at the k-th and (k-1)-th iterations is a positive value. The correction coefficient increasing/decreasing means has means for decreasing the correction coefficient when the result of multiplying the smoothed derivative at the k-th iteration by the smoothed derivative at the (k-1)-th iteration is a positive value, and increasing the correction coefficient when the result is a negative value. This is the gist of the invention.
[0032]
According to the invention of claim 3 described above, a global minimum can be found even when local minima exist on the evaluation function surface, and the instability of convergence during learning can be reduced.
[0033]
[Best mode for carrying out the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0034]
FIG. 1 is a block diagram showing the configuration of a learning device that performs the global accelerated learning method for a neural network model applied to the first to third embodiments of the present invention. The learning device shown in FIG. 1 includes an input unit 100, a processing unit 200, and an output unit 300. The input unit 100 includes a file reading device 101 for reading various variables used for learning, such as the learning rate, a sensor 102 for measuring time-series data, a TV camera 103, and the like. The processing unit 200 includes a variable initialization unit 201, a data presentation unit 202, a learning rate update unit 203, a weight update unit 204, a gradient difference calculation unit 205, a correction term addition determination unit 206, a correction amount calculation unit 207, a weight correction unit 208, an extreme value search convergence determination unit 209, a global search unit 210, a global search convergence determination unit 211, a constraint condition coefficient update unit 212, and a global optimization coefficient update unit 213.
[0035]
The global accelerated learning method for a neural network model according to the first embodiment of the present invention is a learning method that fuses the KickOut method with the tunneling algorithm. To solve the problems that the tunneling algorithm converges slowly and that its parameter k is difficult to adjust, the search is performed according to the following procedure. First, to find a local minimum, an extreme value search is performed using the KickOut method of equations (3) to (6). After a local minimum has been found by the extreme value search, a global search is performed according to the following procedure to find a point whose evaluation function value is smaller than that of the local minimum.
[0036]
An evaluation function for performing a global search is defined as follows.
[0037]
(Equation 9)

The KickOut method is also applied to the global search.
[0038]
(Equation 10)

In addition, the parameters k_j and λ, which could only be adjusted empirically in the dynamic tunneling algorithm, are adjusted as follows.
[0039]
(1) k_j, μ_0, α > 0, 0 < β < 1.
[0040]
[Outside 6]

(Equation 11)

Without satisfying the condition (Equation (15)), an evaluation function in the extreme value solution is obtained by a global search.
[Outside 7]

If,
(Equation 12)

[Outside 8]

(6) The following is executed as a starting point. If no point satisfying the expression (16) is found, the following (4) is executed.
[0041]
(Equation 13)

[0042]
[Outside 9]

If outside, μ_{k+1} = μ_k and k = k + 1 are set, and the process returns to (2). However, if the global search stop criterion is still not satisfied after the global search has been repeated a specified number of times, the process ends.
[0043]
The global acceleration learning method according to the second embodiment of the present invention is a sequential learning method including a correction method for individually correcting the weight value for each element when the weight oscillates. In the method of calculating the correction term in the learning method, a smoothing derivative representing a global gradient on the evaluation function surface and a gradient are used to determine the oscillation of the weight.
[0044]
To keep the storage capacity and the amount of calculation as small as possible, the product of the gradient at the k-th iteration and the smoothed derivative at the (k-1)-th iteration is used as the criterion both for updating the learning rate and for correcting the weight. Further, by using a preset value instead of a calculated value as the correction coefficient, the inner product operation used in the first embodiment is eliminated, and the oscillation of the weight can be corrected individually for each element.
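As a rough illustration of the criterion described in this paragraph, the following Python sketch updates the smoothed derivative and the individual learning rate of each weight element. Equation 14 is not reproduced in this text, so the exact form is an assumption: the smoothed derivative is taken as (1 - theta) * g_k + theta * s_{k-1}, following the wording of the claims, and the names theta, kappa (additive increase) and phi (multiplicative decrease) as well as their default values are hypothetical.

```python
def update_learning_rates(grad, smooth_prev, lr, theta=0.7, kappa=0.01, phi=0.5):
    """One per-element step of the learning-rate criterion (illustrative only).

    grad        : gradient at iteration k, one value per weight
    smooth_prev : smoothed derivative at iteration k-1
    lr          : individual learning rates
    theta, kappa, phi : hypothetical names for the smoothing coefficient,
                        the learning rate increase and the decrease factor
    """
    new_lr, smooth, oscillating = [], [], []
    for g, s, eta in zip(grad, smooth_prev, lr):
        # Criterion: element-wise product, so no inner product is needed.
        product = g * s
        if product > 0:
            eta = eta + kappa      # same direction as before: speed up
        elif product < 0:
            eta = eta * phi        # sign flip: slow down
        new_lr.append(eta)
        # Assumed form of the smoothed derivative at iteration k.
        smooth.append((1.0 - theta) * g + theta * s)
        # A negative product is also the signal for correcting the weight.
        oscillating.append(product < 0)
    return new_lr, smooth, oscillating


new_lr, smooth, osc = update_learning_rates(
    grad=[1.0, -2.0, 0.0], smooth_prev=[0.5, 1.0, 1.0], lr=[0.1, 0.1, 0.1])
```

In this example only the second element changes sign relative to its smoothed derivative, so only its learning rate is reduced and only it is flagged for correction.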
[0045]
The learning rule of the second embodiment that corresponds to the KickOut method used in the first embodiment is shown below.
[0046]
[Equation 14]

However, the above is a learning rule in an extremum search, and a learning rule replaced with the following symbols is used in a global search.
[0047]
g _{k, i} → e _{k, i} … (23)
y _{k, i} → z _{k, i} … (24)
In the learning method according to the third embodiment of the present invention, in order to make the learning converge stably, global information over all patterns is used as the information about the evaluation function surface, such as the gradient, that serves as the criterion for updating the learning rate and for adding the correction term. For this purpose, the product of the smoothed derivatives at the k-th and (k-1)-th iterations is used. Further, appropriate correction is made possible by changing the correction coefficient according to the learning situation.
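The difference between the criteria of the second and third embodiments can be sketched with a small Python example for a single weight element. This is an illustration under assumptions, not the learning rule of Equation 15: the smoothed derivative is taken as (1 - theta) * s_{k-1} + theta * g_k, following the wording of claim 3, the initial smoothed derivative is assumed to be 0, and the function name and the value of theta are hypothetical.

```python
def compare_criteria(grads, theta=0.2):
    """Return oscillation flags per iteration for two criteria (one weight element).

    flags_grad   : gradient at k times smoothed derivative at k-1 is negative
    flags_smooth : smoothed derivative at k times smoothed derivative at k-1 is negative
    """
    s_prev = 0.0  # assumed initial smoothed derivative
    flags_grad, flags_smooth = [], []
    for g in grads:
        s = (1.0 - theta) * s_prev + theta * g   # assumed smoothing form
        flags_grad.append(g * s_prev < 0)        # second-embodiment style criterion
        flags_smooth.append(s * s_prev < 0)      # third-embodiment style criterion
        s_prev = s
    return flags_grad, flags_smooth


# Per-pattern gradients that fluctuate in sign around a positive average.
flags_grad, flags_smooth = compare_criteria([1.0, -0.5, 1.0, -0.5, 1.0])
```

For this sequence the gradient-based criterion flags the second and fourth iterations, while the criterion based on two smoothed derivatives flags none, which is one way to see why it reacts less to fluctuations of individual patterns.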
[0048]
The learning rules used in the third embodiment are shown below.
[0049]
[Equation 15]

However, the above is a learning rule in an extremum search, and a learning rule replaced with the following symbols is used in a global search.
[0050]
g _{k, i} → e _{k, i} … (29)
y _{k, i} → z _{k, i} … (30)
Next, the operation will be described with reference to the flowcharts shown in FIGS. 3 and 4.
[0051]
In FIG. 1, time-series data and the like measured by the sensor 102, the TV camera 103, and the like are input from the input unit 100 and transferred as learning data to the data presentation unit 202 (step S11), where the time-series data is temporarily stored. In addition, values required for learning, such as the learning rate, are read by the file reading device 101 and transferred to the variable initialization unit 201 (step S13).
[0052]
The variable initialization unit 201 sets initial values such as the weights of the neural network model and the learning rate (step S15). The data presentation unit 202 presents the time-series data transferred from the sensor 102, the TV camera 103, and the like to the neural network model one pattern at a time, calculates the output value of the neural network model by forward calculation (step S17), and calculates the evaluation reference value, given in advance, from the model output and the teacher data. A backward calculation of the neural network model is then performed based on the evaluation reference value to obtain the gradient (step S19).
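The forward and backward calculations of steps S17 and S19 can be sketched as follows. This is a minimal illustration, assuming a single layer of sigmoid units and the residual sum of squares as the evaluation reference (without a factor of 1/2); the actual model has several layers, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, biases, x):
    """Forward calculation: y_j = f(sum_i w_ji * x_i + b_j) for one layer."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def evaluation(outputs, targets):
    """Residual sum of squares between model outputs and teacher data."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def gradient(weights, biases, x, targets):
    """Backward calculation: partial derivatives of the evaluation value."""
    outputs = forward(weights, biases, x)
    # Derivative with respect to each unit's input value,
    # using f'(net) = o * (1 - o) for the sigmoid.
    deltas = [2.0 * (o - t) * o * (1.0 - o) for o, t in zip(outputs, targets)]
    grad_w = [[d * xi for xi in x] for d in deltas]
    grad_b = list(deltas)
    return grad_w, grad_b


W = [[0.1, -0.2], [0.3, 0.4]]
b = [0.0, 0.1]
x = [1.0, 2.0]
t = [1.0, 0.0]
grad_w, grad_b = gradient(W, b, x, t)
```

The resulting per-weight gradient is the quantity that the later steps use for the learning rate update, the weight update, and the gradient difference.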
[0053]
The learning rate update unit 203 updates the learning rate based on the sign of the gradient or of the smoothed derivative (step S21). The weight update unit 204 updates the weight using this learning rate and the gradient, and transfers the gradient to the gradient difference calculation unit 205 (step S23).
[0054]
The gradient difference calculation unit 205 calculates the difference between the gradients at the k-th and (k-1)-th iterations using the gradient obtained by the data presentation unit 202, and transfers this information and the like to the correction term addition determination unit 206 (step S25).
[0055]
Based on the transferred gradient and the like, the correction term addition determination unit 206 calculates the product of the smoothed derivatives in the first embodiment, or the product of the gradient and the smoothed derivative in the second embodiment. If the value is negative, the weight value and the like are transferred to the correction amount calculation unit 207; if the value is positive, they are transferred to the extreme value search convergence determination unit 209 (step S27).
[0056]
The correction amount calculation unit 207 calculates a weight correction amount based on the transferred weight, gradient difference, and the like in accordance with the instruction from the correction term addition determination unit 206, and transfers this to the weight correction unit 208 (step S29).
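The correction amount of step S29 can be sketched for the first embodiment from the wording of claim 1: the inner product of the gradient difference and the weight correction amount is divided by twice the squared norm of the gradient difference, and the result scales the gradient difference. The following Python sketch is illustrative only. The minus sign is an assumption (the text states the magnitude but not the sign), and the function name is hypothetical.

```python
def kickout_correction(grad, grad_prev, delta_w):
    """Correction vector built from the gradient difference (illustrative).

    grad, grad_prev : gradients at iterations k and k-1
    delta_w         : weight correction amount at iteration k
    """
    y = [g - gp for g, gp in zip(grad, grad_prev)]   # gradient difference
    norm_sq = sum(v * v for v in y)
    if norm_sq == 0.0:
        return [0.0] * len(y)                        # no difference, no correction
    scale = sum(v * d for v, d in zip(y, delta_w)) / (2.0 * norm_sq)
    # Assumed sign: move back against the oscillating component.
    return [-scale * v for v in y]


# One-dimensional valley E(w) = w^2 with gradient 2w: the weight jumped
# from w = -1 to w = +1, so the weight correction amount is 2.
correction = kickout_correction(grad=[2.0], grad_prev=[-2.0], delta_w=[2.0])
w_corrected = 1.0 + correction[0]
```

With the assumed sign, the corrected weight in this example lands on the midpoint of the two oscillating points, which is the bottom of the valley.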
[0057]
The weight correction unit 208 corrects the value of the weight based on the transferred weight correction amount, and transfers the corrected weight to the extreme value search convergence determination unit 209 (step S31).
[0058]
The extreme value search convergence determination unit 209 determines whether the stop criterion of the extreme value search is satisfied, using the weight transferred from the weight correction unit 208 or the weight update unit 204 (step S33).
[0059]
The global search unit 210 searches for a point where the evaluation function value E (w) becomes smaller by performing minimization on the newly set evaluation function E (w) (step S35).
[0060]
The global search convergence determination unit 211 determines whether a point found as a result of performing the global search satisfies a global optimization stop criterion (step S37).
[0061]
When a point having a smaller evaluation function value is found as a result of the global search, the constraint condition coefficient update unit 212 updates the constraint condition coefficient (step S39).
[0062]
When a point having a smaller evaluation function value is not found as a result of the global search, the global optimization coefficient update unit 213 updates the global optimization coefficient (step S41). If the stop criterion is still not satisfied after the global search has been repeated the specified number of times, the process ends.
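The control flow of steps S33 to S41 can be summarized in a short Python skeleton. This is a sketch under several assumptions: the extreme value search and the global search are passed in as callables, the coefficient update rules (halving and doubling) are placeholders, and the repetition limit is interpreted as a cap on unsuccessful global searches. The toy problem below uses a neighborhood scan as a stand-in for the global search; it is not the evaluation function of Equation 9.

```python
def global_accelerated_search(w0, evaluate, extremum_search, global_search,
                              is_global_optimum, max_failed_tries=5):
    """Alternate extreme value search and global search (illustrative skeleton)."""
    constraint_coeff, global_coeff = 1.0, 1.0              # placeholder coefficients
    w_local = extremum_search(w0, constraint_coeff)        # up to step S33
    failed_tries = 0
    while failed_tries < max_failed_tries:
        candidate = global_search(w_local, global_coeff)   # step S35
        if is_global_optimum(candidate):                   # step S37
            return candidate
        if evaluate(candidate) < evaluate(w_local):
            constraint_coeff *= 0.5                        # step S39 (assumed rule)
            w_local = extremum_search(candidate, constraint_coeff)
        else:
            global_coeff *= 2.0                            # step S41 (assumed rule)
            failed_tries += 1
    return w_local


# Toy problem on integer states 0..5 with a local minimum at 1 and at 3.
E = [5, 3, 4, 1, 2, 0]


def evaluate(w):
    return E[w]


def extremum_search(w, _constraint_coeff):
    """Descend to the nearest state with no better neighbour."""
    while True:
        neighbours = [n for n in (w - 1, w + 1) if 0 <= n < len(E)]
        best = min(neighbours, key=evaluate)
        if evaluate(best) >= evaluate(w):
            return w
        w = best


def global_search(w_local, global_coeff):
    """Stand-in: best state within a radius set by the global coefficient."""
    radius = int(global_coeff)
    reachable = [n for n in range(w_local - radius, w_local + radius + 1)
                 if 0 <= n < len(E)]
    return min(reachable, key=evaluate)


result = global_accelerated_search(0, evaluate, extremum_search, global_search,
                                   lambda w: evaluate(w) == 0)
```

In the toy run, the first global search from the local solution fails, the global coefficient is enlarged, a better point is then found, and the search continues from there until the stop criterion of global optimization is met.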
[0063]
The learning apparatus of each of the above embodiments also covers the cases where (1) all learning rates are set to the same value and kept fixed during learning (the learning rate increase rate is set to 0 and the learning rate decrease rate is set to 1), (2) the smoothing differential coefficient is set to 0, that is, only the gradient is used as the criterion for weight correction and for learning rate update, and (3) the inertia rate is set to 0.
[0064]
Further, the learning apparatus of the second embodiment covers the cases where (1) the correction coefficient is increased by adding a positive value to it or by multiplying it by a value larger than 1, (2) the correction coefficient is decreased by multiplying it by a value greater than 0 and less than 1 or by subtracting a positive value from it, (3) the correction coefficient is increased or decreased by using (1) and (2) together, and (4) the correction coefficient is held at a constant value without being increased or decreased.
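The variants listed above can be sketched as one small Python function. This is an illustration only: the function name, the mode names, and the step and factor values are hypothetical, and the "mixed" mode (add to increase, multiply to decrease) is just one possible reading of using (1) and (2) together.

```python
def update_correction_coefficient(coeff, product, mode="multiplicative",
                                  add_step=0.1, up_factor=1.5, down_factor=0.5):
    """Increase the coefficient when product < 0, decrease it when product > 0.

    product is the sign criterion (for example gradient times smoothed derivative).
    """
    if mode == "constant" or product == 0:
        return coeff                      # case (4): held at a constant value
    oscillating = product < 0
    if mode == "additive":                # add or subtract a positive value
        return coeff + add_step if oscillating else coeff - add_step
    if mode == "multiplicative":          # factor > 1 or factor between 0 and 1
        return coeff * up_factor if oscillating else coeff * down_factor
    if mode == "mixed":                   # add to increase, multiply to decrease
        return coeff + add_step if oscillating else coeff * down_factor
    raise ValueError(f"unknown mode: {mode}")


increased = update_correction_coefficient(1.0, -0.2)
decreased = update_correction_coefficient(1.0, 0.3)
```

Note that the additive decrease can drive the coefficient below zero after repeated use, so a practical version would probably clamp it; the text does not say how this is handled.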
[0065]
[Effects of the invention]
As described above, according to the present invention, a global minimum can be found for problems in which local minima exist on the evaluation function surface. In the local search, convergence can be accelerated when a valley structure exists on the evaluation function surface, so that a global minimum can be found in a short time.
[0066]
Further, according to the present invention, a global minimum can be found even when local minima exist on the evaluation function surface, and learning of tasks with an unbounded amount of learning data, such as time-series data, can be completed in a short time with a small storage area and a small amount of calculation.
[0067]
Furthermore, according to the present invention, a global minimum can be found even when local minima exist on the evaluation function surface, learning of tasks with an unbounded amount of learning data, such as time-series data, can be completed in a short time, and stable convergence becomes possible.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a learning device that performs a global accelerated learning method for a neural network model applied to first to third embodiments of the present invention.
FIG. 2 is a diagram illustrating an example of a neural network model.
FIG. 3 is a flowchart showing part of the procedure of the global accelerated learning method for a neural network model performed by the learning device shown in FIG. 1.
FIG. 4 is a flowchart showing the rest of the procedure of the global accelerated learning method for a neural network model performed by the learning device shown in FIG. 1.
[Explanation of symbols]
100 Input unit
200 Processing unit
201 Variable initialization unit
202 Data presentation unit
203 Learning rate update unit
204 Weight update unit
205 Gradient difference calculation unit
206 Correction term addition determination unit
207 Correction amount calculation unit
208 Weight correction unit
209 Extreme value search convergence determination unit
210 Global search unit
211 Global search convergence determination unit
212 Constraint condition coefficient update unit
213 Global optimization coefficient update unit
300 Output unit

Claims

A global accelerated learning device for a neural network model, comprising an input unit having a file reading device and a sensor for measuring time-series data, a processing unit, and an output unit,
wherein the processing unit includes:
variable initialization means to which the initial weight values of the neural network model, the initial learning rate, the learning rate increase and decrease rates, the smoothing differential coefficient, the inertia rate, the initial correction coefficient, the stop criterion values for the extreme value search and the global search, the constraint condition coefficient, and the coefficient for global optimization are transferred from the input unit, and which initializes them;
data presenting means in which the input data to be learned is transferred from the input unit and temporarily stored, and which presents all patterns of the input data to the neural network model, calculates the output value of the neural network model for each input pattern, and calculates the gradient, which is the first-order partial derivative of the evaluation function with respect to the weights at the current iteration (the k-th iteration), based on the evaluation reference values for all patterns calculated from the output values and the teacher data;
smoothed differential value calculating means for multiplying the gradient at the k-th iteration obtained by the data presenting means by the value obtained by subtracting the smoothing differential coefficient from 1, adding to this the value obtained by multiplying the smoothed differential value of one iteration earlier (the (k-1)-th iteration) by the smoothing differential coefficient, and thereby obtaining the smoothed differential value at the k-th iteration;
learning rate updating means for individually multiplying, for each element, the gradient at the k-th iteration by the smoothed differential value at the (k-1)-th iteration obtained by the smoothed differential value calculating means, and updating each learning rate individually by adding the learning rate increase rate to the learning rate when the result is a positive value, and by multiplying the learning rate by the learning rate decrease rate when the result is a negative value;
weight updating means for individually calculating, for each element, a correction amount by multiplying the gradient obtained by the data presenting means by the individual learning rate corresponding to each weight obtained by the learning rate updating means, individually adding to this correction amount, for each weight, the value obtained by multiplying the correction amount at the (k-1)-th iteration by the inertia rate, and thereby calculating the weight correction amount;
gradient difference calculating means for calculating the difference between the gradient at the k-th iteration and the gradient at the (k-1)-th iteration obtained by the data presenting means (hereinafter referred to as the gradient difference at the k-th iteration), and transferring the result to the correction term addition determining means;
correction term addition determining means for calculating the inner product of the gradient difference at the k-th iteration obtained by the gradient difference calculating means and the gradient difference of one iteration earlier (the (k-1)-th iteration), determining that a correction term is to be added and transferring the weight to the correction amount calculating means when the result is a negative value, and determining that no correction amount is to be added and transferring the weight to the extreme value search convergence determining means when the result is a positive value;
correction amount calculating means for calculating a correction amount by adding, to the weight correction amount obtained by the weight updating means, the gradient difference at the k-th iteration multiplied by the scalar obtained by dividing the inner product of the gradient difference at the k-th iteration and the weight correction amount at the k-th iteration by the square of the magnitude (norm) of the gradient difference at the k-th iteration and by 2, and transferring the result to the weight correction means;
weight correction means for updating the weight individually for each element based on the weight correction amount transferred from the correction amount calculating means, and transferring the weight to the extreme value search convergence determining means;
global search means for constructing, with the local solution found by the extreme value search as the initial value, a new evaluation function from which the extreme value of the evaluation function at the local solution has been removed, and performing a global search by minimizing this evaluation function;
extreme value search convergence determining means for performing the extreme value search using the weight transferred from the weight updating means or the weight correction means, terminating the extreme value search and executing the processing by the global search means when the stop criterion of the extreme value search is satisfied, and repeating the processing from the data presenting means until the stop criterion is satisfied when it is not;
global search convergence determining means for determining whether the result of the global search performed by the global search means satisfies the stop criterion of global optimization, and terminating the process when it does;
constraint condition coefficient updating means for, when the global search convergence determining means determines that the stop criterion of global optimization is not satisfied and a solution with an evaluation function value smaller than that at the local solution has been found, adjusting the parameter relating to the constraint condition and then executing the extreme value search from the processing by the data presenting means onward; and
global optimization coefficient adjusting means for, when the global search convergence determining means determines that the stop criterion of global optimization is not satisfied and no solution with an evaluation function value smaller than that at the local solution has been found, adjusting the coefficient relating to global optimization, returning to the processing by the global search means to repeat the global search, and terminating the process when the global search has been repeated the specified number of times.

A global accelerated learning device for a neural network model, comprising an input unit having a file reading device and a sensor for measuring time-series data, a processing unit, and an output unit,
wherein the processing unit includes:
variable initialization means to which the initial weight values of the neural network model, the initial learning rate, the learning rate increase and decrease rates, the smoothing differential coefficient, the inertia rate, the initial correction coefficient, the stop criterion values for the extreme value search and the global search, the constraint condition coefficient, and the coefficient for global optimization are transferred from the input unit, and which initializes them;
data presenting means in which the input data to be learned is transferred from the input unit and temporarily stored, and which presents all patterns of the input data to the neural network model, calculates the output value of the neural network model for each input pattern, and calculates the gradient, which is the first-order partial derivative of the evaluation function with respect to the weights at the current iteration (the k-th iteration), based on the evaluation reference values for all patterns calculated from the output values and the teacher data;
smoothed differential value calculating means for multiplying the gradient at the k-th iteration obtained by the data presenting means by the value obtained by subtracting the smoothing differential coefficient from 1, adding to this the value obtained by multiplying the smoothed differential value of one iteration earlier (the (k-1)-th iteration) by the smoothing differential coefficient, and thereby obtaining the smoothed differential value at the k-th iteration;
learning rate updating means for individually multiplying, for each element, the gradient at the k-th iteration by the smoothed differential value at the (k-1)-th iteration obtained by the smoothed differential value calculating means, and updating each learning rate individually by adding the learning rate increase rate to the learning rate when the result is a positive value, and by multiplying the learning rate by the learning rate decrease rate when the result is a negative value;
weight updating means for individually calculating, for each element, a correction amount by multiplying the gradient obtained by the data presenting means by the individual learning rate corresponding to each weight obtained by the learning rate updating means, individually adding to this correction amount, for each weight, the value obtained by multiplying the correction amount at the (k-1)-th iteration by the inertia rate, and thereby calculating the weight correction amount;
gradient difference calculating means for calculating the difference between the gradient at the k-th iteration and the gradient at the (k-1)-th iteration obtained by the data presenting means (hereinafter referred to as the gradient difference at the k-th iteration), and transferring the result to the correction term addition determining means;
correction amount calculating means for multiplying, for each element, the gradient at the k-th iteration by the smoothed derivative at the (k-1)-th iteration, and, when the result is a negative value, individually adding for each element, to the weight correction amount obtained by the weight updating means, a correction amount obtained by multiplying the gradient difference at the current iteration by a correction coefficient of fixed value, or adding no correction amount when the inner product of the gradient and the smoothed derivative obtained by the learning rate updating means is a positive value;
correction coefficient increasing/decreasing means for decreasing the correction coefficient when the result of multiplying the gradient at the k-th iteration by the smoothed derivative at the (k-1)-th iteration is a positive value, and increasing the correction coefficient when the result is a negative value;
weight correction means for individually updating the weight for each element using the weight correction amounts obtained by the weight updating means and the correction amount calculating means;
global search means for constructing, with the local solution found by the extreme value search as the initial value, a new evaluation function from which the extreme value of the evaluation function at the local solution has been removed, and performing a global search by minimizing this evaluation function;
extreme value search convergence determining means for terminating the extreme value search and executing the processing by the global search means when the stop criterion of the extreme value search is satisfied, and, when it is not satisfied, returning to the processing by the data presenting means and repeating that processing until the stop criterion of the extreme value search is satisfied;
global search convergence determining means for determining whether the result of the global search performed by the global search means satisfies the stop criterion of global optimization, and terminating the process when it does;
constraint condition coefficient updating means for, when the global search convergence determining means determines that the stop criterion of global optimization is not satisfied and a solution with an evaluation function value smaller than that at the local solution has been found, adjusting the parameter relating to the constraint condition and then executing the extreme value search from the processing by the data presenting means onward; and
global optimization coefficient adjusting means for, when the global search convergence determining means determines that the stop criterion of global optimization is not satisfied and no solution with an evaluation function value smaller than that at the local solution has been found, adjusting the coefficient relating to global optimization, returning to the processing by the global search means to repeat the global search, and terminating the process when the global search has been repeated the specified number of times.

The global accelerated learning device for a neural network model according to claim 2, wherein the smoothed differential value calculating means has means for multiplying the smoothed differential value at the (k-1)-th iteration by the value obtained by subtracting the smoothing differential coefficient from 1, adding to this the value obtained by multiplying the gradient by the smoothing differential coefficient, and individually calculating the smoothed differential value at the k-th iteration for each element,
the learning rate updating means has means for individually multiplying, for each element, the smoothed differential value at the k-th iteration by the smoothed differential value at the (k-1)-th iteration obtained by the smoothed differential value calculating means, increasing the learning rate by adding the learning rate increase rate to it when the result is a positive value, and decreasing the learning rate by multiplying it by the learning rate decrease rate when the result is a negative value, so that the learning rate is updated individually for each element,
the correction amount calculating means has means for multiplying, for each element, the smoothed derivative at the k-th iteration by the smoothed derivative at the (k-1)-th iteration obtained by the smoothed differential value calculating means, and, when the result is a negative value, calculating a correction amount in which the gradient difference at the k-th iteration multiplied by the variable correction coefficient is added to the weight correction amount obtained by the smoothed differential value calculating means, or not calculating a correction amount when the result of multiplying the smoothed derivatives at the k-th and (k-1)-th iterations is a positive value, and
the correction coefficient increasing/decreasing means has means for decreasing the correction coefficient when the result of multiplying the smoothed derivative at the k-th iteration by the smoothed derivative at the (k-1)-th iteration is a positive value, and increasing the correction coefficient when the result is a negative value.