WO2020235693A1 - Learning method, learning device, and learning program for ai agent that behaves like human - Google Patents
Learning method, learning device, and learning program for AI agent that behaves like human
- Publication number
- WO2020235693A1 (PCT/JP2020/020624)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- behavior
- learning
- agent
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
Definitions
- The present invention relates to a method, device, and program that, when training an agent AI (artificial intelligence), can take into account both the error with respect to a reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model.
- Reinforcement learning (hereinafter sometimes referred to as "RL"), and in particular deep reinforcement learning (DRL), is applied to algorithms such as the AI (artificial intelligence) used in electronic games, automated driving control of vehicles such as automobiles, and autonomous control of robots.
- Reinforcement learning is a method that repeats learning in units of episodes and adapts to the target environment using the rewards obtained through trial and error in that environment; learning is performed by optimizing the policy.
- Deep reinforcement learning uses the information-processing power of a convolutional neural network (CNN) to perform reinforcement learning based on high-dimensional inputs such as image data.
- CNN convolutional neural network
- In reinforcement learning and deep reinforcement learning, a problem is set in which an agent in an environment observes the current state and decides the action to take, and the agent receives a reward from the environment by selecting an action. While the interaction between the environment and the agent is repeated, learning is performed using the three variables of reward, action, and state.
- The agent acts in an environment having a state space and an action space.
- At each time step, the policy converts the state into a feature vector with a feature-extraction learner and outputs action probabilities for that feature vector with a value-function-calculation learner.
- After the agent takes an action, the environment outputs a scalar reward and the next state. The episode ends when actions have been repeated until the final state is reached.
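As a minimal illustration of this state-to-feature-vector-to-action-probability pipeline (a sketch, not taken from the patent; the network shape, layer sizes, and names below are assumptions), such a policy could look like the following:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Sketch of a policy: CNN feature extractor followed by an action-probability head."""
    def __init__(self, n_actions: int):
        super().__init__()
        # Feature-extraction learner: converts an 84x84 grayscale state into a feature vector.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Head that outputs action probabilities for the feature vector.
        self.head = nn.Linear(32 * 9 * 9, n_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        feat = self.features(state)                      # feature vector
        return torch.softmax(self.head(feat), dim=-1)    # action probabilities

# One interaction step: sample an action from the output distribution.
policy = Policy(n_actions=4)
state = torch.zeros(1, 1, 84, 84)             # dummy observation
action = torch.multinomial(policy(state), 1)  # stochastic action selection
```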
- An agent AI constructed by reinforcement learning or deep reinforcement learning can learn through trial and error while interacting with the environment and solve various problems, far exceeding human ability, so it is applied to algorithms such as game AI, autonomous control of robots, and automated driving (see, for example, Non-Patent Documents 1 to 5).
- However, since humans cannot predict its behavior, there is a major usability problem for practical applications such as a robot that works in collaboration with humans or an automated-driving AI that carries human passengers.
- In addition, an agent AI constructed by reinforcement learning or deep reinforcement learning shows high performance because it is trained to maximize returns, but when it is put into practical use, factors other than such performance indicators also need to be considered.
- For example, in a video game, if an NPC (Non Player Character), a character that the player cannot operate, is driven by a reinforcement learning agent AI, that agent AI may be so strong that the player cannot enjoy the game very much.
- When applied to automated driving, a reinforcement learning agent AI trained for high performance may accelerate or decelerate violently or turn suddenly, causing anxiety to nearby vehicles and pedestrians. It is therefore necessary to design a human-like agent AI.
- On the other hand, imitation learning (hereinafter sometimes referred to as "IL") is adopted to build an agent AI that imitates human behavior, mainly for the purpose of reproducing the behavior of human experts.
- In imitation learning, the policy followed by the expert is assumed to be the optimal policy, and learning is performed so that the policy of the agent AI approaches the behavior of the expert.
- The human expert's policy is provided in the form of a sequence of state-action pairs, and the agent AI's policy is then trained to infer, from an observed state, the action that the expert would be likely to take; by imitating the human expert, human-like behavior can be expected.
- However, since the learned policy is limited to the provided data, there is a problem that the performance of an imitation learning agent AI hardly ever exceeds the performance of the human expert (see, for example, Non-Patent Documents 6 to 8).
- In view of this situation, it is an object of the present invention to provide an agent learning method, device, and program that achieve both the highly efficient optimal behavior exceeding human ability that is acquired by reinforcement learning and the human-like behavior that is acquired by imitating the behavior of a human expert.
- In order to solve the above problems, the agent learning method of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it takes the optimal action in a human-like manner; it includes steps a) to f): an input step, a learning step, first and second loss error calculation steps, a fusion error calculation step, and an update step. Step a), the input step, inputs state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
- According to the agent learning method of the present invention, the error with respect to the reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model are fed back, so that an agent AI that behaves like a human and takes optimal actions can be constructed.
- The executor of the reinforcement learning model is, as in general reinforcement learning, set with a policy under which the agent in the environment observes the current state and decides the action to take; it is an executor obtained by repeatedly selecting actions from the policy and obtaining rewards so that the value of the value function increases, learning with feedback from the three variables of reward, action, and state.
- The environment is, for example, a set of image data of an electronic game screen, or a time series of image data from video captured by a camera mounted on a robot or a vehicle.
- The state data is data obtained by observing such an environment, for example individual game images, images of the vehicle's surroundings, and lane positions.
- The action data is, for example, game input-device operations and controller inputs (movement direction, jump, etc.) in the case of an electronic game, and steering-wheel, accelerator, and brake operations in the case of automated driving of an automobile.
- In this way, reinforcement learning works by a mechanism that learns with feedback from the three variables of reward, action, and state, whereas imitation learning works by a mechanism that uses a teacher agent and a student agent.
- The action data AF output by the fusion model is expressed as a probability distribution over actions.
- For example, it may be the probability distribution over each input-button operation of a controller, or the probability distribution over steering-wheel, accelerator, and brake operations.
- The case in which only a single optimal operation is output is also included, as a δ distribution.
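For instance (an illustrative example, not taken from the patent), such an action distribution over controller inputs could look like:

```python
# Discrete-action case: probability distribution over controller button operations.
action_distribution = {"left": 0.10, "right": 0.65, "jump": 0.20, "no_op": 0.05}

# Single optimal operation: a delta distribution, with all probability mass on one action.
delta_distribution = {"left": 0.0, "right": 1.0, "jump": 0.0, "no_op": 0.0}
```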
- The loss error means, for example, the error between the current output and the expected output, calculated using a loss function.
- In the input step, the state data SR and action data AR can correspond to any of the following cases, and the processing of the first and second loss error calculation steps differs accordingly.
- When the input is a recording of play data by a human expert, the first loss error calculation step calculates, using a loss function, the error between the action data AR in that play data and the action data AF.
- The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
- When the input is a recording of play data of an agent of the imitation learning model, the first loss error calculation step calculates, using a loss function, either the error between the action data AR in that play data and the action data AF, or the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation; the second loss error calculation step is the same as above.
- When the input is a recording of play data of an agent of the reinforcement learning model, the first loss error calculation step calculates, using a loss function, the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation; the second loss error calculation step is the same as above.
- When the input consists of state data SHE and action data AHE from a recording of play data by a human expert, together with state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model, the learning step inputs the state data SHE to output action data AF1 and inputs the state data SRL to output action data AF2.
- In that case, the first loss error calculation step calculates, using a loss function, the error between the action data AHE in the play data by the human expert and the action data AF1, and the second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
- When the input consists of state data SIL and action data AIL from a recording of play data of an agent of the imitation learning model, together with state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model, the learning step inputs the state data SIL to output action data AF1 and inputs the state data SRL to output action data AF2.
- In that case, the first loss error calculation step calculates, using a loss function, the error between the action data AF1 and the action data AIL in that play data soft-targeted by knowledge distillation, and the second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
- In the soft targeting by knowledge distillation, the output of the agent's action policy has a temperature parameter T, which supports both hard targeting and soft targeting.
- When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely.
- When the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
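A minimal sketch of this temperature-controlled targeting (the patent's exact processing function is not reproduced here; the softmax-style form below is an assumption):

```python
import numpy as np

def soft_target(logits: np.ndarray, T: float) -> np.ndarray:
    """T == 0 yields a hard (one-hot) target; T > 0 yields a soft target in which
    several actions keep a probability greater than 0."""
    if T == 0:
        hard = np.zeros_like(logits)
        hard[np.argmax(logits)] = 1.0   # action determined uniquely
        return hard
    z = logits / T                      # a larger T flattens the distribution
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])
print(soft_target(logits, T=0))    # [1. 0. 0.]               -> hard target
print(soft_target(logits, T=1.0))  # approx. [0.66 0.24 0.10]  -> soft target
```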
- Next, the agent learning device of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it behaves in a human-like manner; it comprises the following components A) to F): an input unit, a learning executor, a first loss error calculator, a second loss error calculator, a fusion error calculator, and an update unit.
- As in the agent learning method described above, the output of the agent's action policy in the learning device has a temperature parameter T and supports both hard targeting and soft targeting.
- When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely; when the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
- The agent learning program of the present invention is a program for causing a computer to execute all of the steps of the agent learning method of the present invention described above. It is also a program for causing a computer to function as the input unit, learning executor, first loss error calculator, second loss error calculator, fusion error calculator, and update unit of the agent learning device of the present invention described above.
- According to the present invention, an agent AI can be constructed that achieves both the highly efficient optimal behavior exceeding human ability acquired by reinforcement learning and the human-like behavior acquired by imitating the behavior of a human expert.
- Functional block diagram of the agent learning device of the present invention
- Functional block diagram of the agent learning device of Example 1
- Schematic flow chart of the agent learning method of Example 1
- Execution processing flow diagram of the fusion model of Example 1
- Calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 1
- Calculation processing flow diagram of the loss error function LRL with respect to optimal behavior in Example 1
- Calculation processing flow diagram of the fusion loss error function LMix in Example 1
- Learning processing flow diagram of the fusion model of Example 1
- Functional block diagram of the agent learning device of Example 2
- Calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 2
- Functional block diagram of the agent learning device of Example 3
- Functional block diagram of the agent learning device of Example 4
- Functional block diagram of the agent learning device of Example 5
- Schematic flow chart of the agent learning method of Example 5
- Execution processing flow diagrams of the fusion model of Example 5
- Calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 5
- Functional block diagram of the agent learning device of Example 6
- Calculation processing flow diagram of the loss error function LIL of Example 3
- Explanatory drawing of the soft targeting processor
- FIG. 1 shows a functional block diagram of the learning device of the agent of the present invention.
- The learning device 1 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database (DB) 6.
- The fusion model learning executor 2 acquires state data from the database 6, performs the execution processing of the fusion model, and outputs the obtained action data to the loss function calculators (3, 4).
- The loss function calculators (3, 4) receive state data or action data from the database 6, feed the received data together with the action data from the fusion model learning executor 2 into their respective loss error function calculators, and calculate the loss error function LHL and the loss error function LRL, respectively.
- The fusion model loss function calculator 5 receives the loss error LHL and the loss error LRL calculated by the loss function calculators (3, 4), and calculates the fusion loss error function LMix, which fuses the loss with respect to behavior more efficient than a human and the loss with respect to human-like behavior.
- The fusion model learning executor 2 acquires the fusion loss error function LMix from the fusion model loss function calculator 5 and performs the learning processing of the fusion model.
- In the following examples, the functional block diagram of the agent learning device of the present invention is described for each combination pattern of fusion model type, type of play data used for learning, and type of learning control.
- The combination patterns are summarized in Table 1.
- FIG. 2 shows a functional block diagram of the learning device of the agent of the first embodiment.
- The learning device 11 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 61. Play data by a human expert is stored in the database 61.
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- FIG. 3 shows a schematic flow chart of the learning method of the agent of the first embodiment.
- First, the fusion model learning executor 2 performs the execution processing of the fusion model (step S11).
- Next, the loss function calculator 3 performs the calculation processing of the loss error function LHE with respect to human-like behavior (step S12).
- Then, the loss function calculator 4 performs the calculation processing of the loss error function LRL with respect to optimal behavior (step S13).
- The fusion model loss function calculator 5 then calculates the fusion loss error function LMix (step S14).
- The fusion model learning executor 2 acquires the calculated fusion loss error function LMix and performs the learning processing of the fusion model (step S15).
- After that, the fusion model learning executor 2 executes the fusion model execution processing again (step S11).
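A compact sketch of one pass through this S11-S15 loop (the model interfaces, the choice of cross-entropy and KL losses, and the trade-off weight alpha below are illustrative assumptions, not the patent's specified implementation):

```python
import torch
import torch.nn.functional as F

def train_step(fusion_model, rl_model, batch, optimizer, alpha=0.5, T=1.0):
    """One S11-S15 iteration: run the fusion model, compute the two loss errors,
    fuse them into LMix, and update the fusion model's parameters."""
    state, expert_action = batch                  # SR and AR from the expert play data
    # S11: execution processing of the fusion model -> logits over actions (AF after softmax)
    a_f = fusion_model(state)
    # S12: loss error LHE against the human expert's recorded actions
    l_he = F.cross_entropy(a_f, expert_action)
    # S13: loss error LRL against the RL model's soft-targeted output
    with torch.no_grad():
        a_rl = F.softmax(rl_model(state) / T, dim=-1)   # soft targeting with temperature T
    l_rl = F.kl_div(F.log_softmax(a_f, dim=-1), a_rl, reduction="batchmean")
    # S14: fusion loss LMix from the trade-off coefficient alpha
    l_mix = alpha * l_he + (1.0 - alpha) * l_rl
    # S15: change the fusion model's parameters so that LMix becomes smaller
    optimizer.zero_grad()
    l_mix.backward()
    optimizer.step()
    return l_mix.item()
```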
- FIG. 4 shows an execution processing flow diagram of the fusion model of the first embodiment.
- The fusion model learning executor 2 acquires the state data SR from the database 61 (step S111).
- The state data SR is input into the fusion model π and the model is executed (step S112).
- The action data AF output by the fusion model is stored (step S113).
- FIG. 5 shows a calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 1.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the human action data AR from the database 61 (step S121). The loss function calculator 3 also acquires the action data AF of the fusion model (step S122). The first loss error calculator 30 then calculates the loss error function LHE from the acquired AR and AF (step S123).
- FIG. 6 shows a calculation processing flow diagram of the loss error function LRL with the optimum behavior of the first embodiment.
- In the loss function calculator 4 for optimal behavior, the reinforcement learning model executor 8 acquires the state data SR from the database 61 (step S131). The state data SR is input into the reinforcement learning model πRL, which is executed and outputs action data ARL (step S132).
- The action data ARL is input into the soft targeting processor 9, which processes it and outputs action data AST2 to the second loss error calculator 40 (step S133).
- The second loss error calculator 40 acquires the action data AF of the fusion model (step S134).
- The second loss error calculator 40 calculates the loss error function LRL from the acquired action data AST2 and action data AF (step S135).
- The soft targeting processor 9 is described with reference to FIG. 21.
- As described above, the output of the agent's action policy has a temperature parameter T and supports both hard targeting and soft targeting.
- When the temperature T is 0, the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely.
- When the temperature T is not 0 (T > 0), multiple actions have probabilities greater than 0, and the action is determined stochastically.
- Here, the case in which the soft targeting processor is used in the loss function calculator for optimal behavior is described as an example, as shown in FIG. 21.
- The reinforcement learning model executor receives the state data and outputs its output data G to the soft targeting processor, and the soft targeting processor in turn outputs its output data L to the loss error LRL calculator.
- The loss error LRL calculator calculates the loss error function from the action data output by the fusion model and the output data of the soft targeting processor.
- One pattern applies the temperature T and calculates the loss error function on the data L with T > 0 (referred to as "pattern B").
- There is also a pattern that calculates the loss error function directly on the data G (referred to as "pattern C2").
- In the soft targeting processor, the temperature T is used as a parameter of an exponential processing function, as shown in the graph in FIG. 21(2).
- Table 2 summarizes the above patterns, data names, and data formats.
- FIG. 7 shows a calculation processing flow diagram of the fusion loss error function LMix of Example 1.
- The fusion model loss function calculator 5 acquires the loss error functions LHE and LRL (step S141).
- The fusion model loss function calculator 5 calculates the fusion loss error function LMix from the trade-off coefficient α, LHE, and LRL (step S142).
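The formula for LMix is not reproduced in this text; a natural reading (an assumption, consistent with the trade-off parameter α ∈ (0,1) introduced in the fusion objective described later) is a convex combination of the two loss errors:

```latex
L_{\mathrm{Mix}} = \alpha \, L_{\mathrm{HE}} + (1 - \alpha) \, L_{\mathrm{RL}}, \qquad \alpha \in (0, 1)
```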
- FIG. 8 shows a learning processing flow diagram of the fusion model of the first embodiment.
- The fusion model learning executor 2 acquires the fusion loss error function LMix (step S151).
- The fusion model learning executor 2 updates the parameters of the fusion model so that the fusion loss error function LMix becomes smaller (step S152).
- FIG. 9 shows a functional block diagram of the learning device of the agent of the second embodiment.
- The learning device 12 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 62.
- The database 62 stores play data by the imitation learning model πIL provided in the imitation learning model executor 7.
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The agent learning device 12 of Example 2 differs from the learning device 11 of Example 1 in that the database used is the database 62, which stores play data by the imitation learning model, instead of the database 61, which stores play data by a human expert. Accordingly, in the agent learning method of Example 2, the process of calculating the loss error function LIL with respect to human-like behavior differs from Example 1.
- FIG. 10 shows a calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 2.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the action data AIL of the imitation learning model from the database 62 (step S211). The loss function calculator 3 also acquires the action data AF of the fusion model (step S212). The first loss error calculator 30 then calculates the loss error function LIL from the acquired AIL and AF (step S213).
- FIG. 11 shows a functional block diagram of the learning device of the agent of the third embodiment.
- The learning device 13 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 62.
- The database 62 stores play data by the imitation learning model πIL provided in the imitation learning model executor 7.
- The loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The loss function calculator 3 is provided not only with a first loss error calculator 30 but also with an imitation learning model executor 7 and a soft targeting processor 9.
- FIG. 20 shows a calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 3.
- The loss function calculator 3 acquires the state data SR from the database 62 (step S411). The state data SR is input into the imitation learning model πIL in the imitation learning model executor 7, which is executed and outputs action data AIL (step S412).
- The action data AIL is input into the soft targeting processor 9, which processes it and outputs action data AST1 (step S413).
- The first loss error calculator 30 acquires the action data AF of the fusion model (step S414).
- The first loss error calculator 30 calculates the loss error function LIL from the acquired action data AST1 and action data AF (step S415).
- FIG. 12 shows a functional block diagram of the learning device of the agent of the fourth embodiment.
- The learning device 14 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 63.
- The database 63 stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- The loss function calculator 3 is provided with an imitation learning model executor 7, a soft targeting processor 9, and a first loss error calculator 30, as in Example 3.
- The loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The learning device 14 of Example 4 differs from the learning device 13 of Example 3 in that the database used is the database 63, which stores play data by the reinforcement learning model, instead of the database 62, which stores play data by the imitation learning model; it is otherwise the same.
- FIG. 13 shows a functional block diagram of the learning device of the agent of the fifth embodiment.
- The learning device 15 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and databases (61, 63).
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a second loss error calculator 40.
- The learning device 15 of Example 5 differs from Examples 1 to 4 in that it uses two databases: the database 61, which stores play data by a human expert, and the database 63, which stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- FIG. 14 shows a schematic flow chart of the learning method of the agent of Example 5.
- First, the fusion model learning executor 2 performs execution processing 1 of the fusion model (step S31).
- Next, the fusion model learning executor 2 performs execution processing 2 of the fusion model (step S32).
- The loss function calculator 3 then performs the calculation processing of the loss error function LHE with respect to human-like behavior (step S33).
- Next, the calculation processing of the loss error function LRL with respect to optimal behavior is performed (step S34).
- Step S33 may be performed after step S31, and step S34 may be performed after step S32.
- Step S32 may also be performed before step S31.
- The fusion model loss function calculator 5 then calculates the fusion loss error function LMix (step S35).
- The fusion model learning executor 2 acquires the calculated fusion loss error function LMix and performs the learning processing of the fusion model (step S36).
- After that, the fusion model learning executor 2 performs execution processing 1 and 2 of the fusion model again (steps S31 and S32).
- FIG. 15 shows execution processing flow diagram 1 of the fusion model of Example 5.
- The fusion model learning executor 2 acquires the state data SR1 from the database 61 (DB1) (step S311).
- The state data SR1 is input into the fusion model π and the model is executed (step S312).
- The action data AF1 of the fusion model is stored (step S313).
- FIG. 16 shows execution processing flow diagram 2 of the fusion model of Example 5.
- The fusion model learning executor 2 acquires the state data SR2 from the database 63 (DB2) (step S321).
- The state data SR2 is input into the fusion model π and the model is executed (step S322).
- The action data AF2 of the fusion model is stored (step S323).
- FIG. 17 shows a calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 5.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the human action data AR from the database 61 (step S331). The loss function calculator 3 also acquires the action data AF1 of the fusion model (step S332). The first loss error calculator 30 then calculates the loss error function LHE from the acquired AR and AF1.
- FIG. 18 shows a calculation processing flow diagram of the loss error function LRL with the optimum behavior of the fifth embodiment.
- The loss function calculator 4 for optimal behavior acquires the action data ARL of the reinforcement learning model from the database 63 (DB2) (step S341). Separately, the second loss error calculator 40 acquires the action data AF2 of the fusion model (step S342). The second loss error calculator 40 then calculates the loss error function LRL from the acquired action data ARL and action data AF2 (step S343).
- The calculation processing of the fusion loss error function LMix in step S35 shown in FIG. 14 is the same as the calculation processing flow diagram of the fusion loss error function LMix of Example 1 shown in FIG. 7. The learning processing of the fusion model in step S36 shown in FIG. 14 is likewise performed according to the learning processing flow diagram of the fusion model of Example 1 shown in FIG. 8.
- FIG. 19 shows a functional block diagram of the learning device of the agent of the sixth embodiment.
- The learning device 16 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and databases (62, 63).
- The loss function calculator 3 is provided with an imitation learning model executor 7, a soft targeting processor 9, and a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The learning device 16 of Example 6 differs from Example 5, which uses the databases (61, 63), in that it uses two databases: the database 62 (DB1), which stores play data by the imitation learning model πIL provided in the imitation learning model executor 7, and the database 63 (DB2), which stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- The learning method that learns a human-like agent AI while maintaining the high performance of the reinforcement learning model consists of two processes: learning an agent AI that takes efficient optimal actions surpassing human performance, and learning an agent AI that selects actions like a human. These processes are addressed as a reinforcement learning task and an imitation learning task, respectively. In the present invention, the fusion model of reinforcement learning and imitation learning is therefore based on policy distillation in the case of a discrete action space and on adversarial imitation learning in the case of a continuous action space. If π* is the optimal policy based on the reinforcement learning model, πHE is the human (expert) policy, and α ∈ (0,1) is the parameter that determines the ratio of these two policies, the objective function takes the following form.
- Following prior research on imitation learning, the objective function of imitation learning is defined as a cross-entropy loss.
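The equations themselves are not included in this extract; assuming each term is the standard cross-entropy between the student policy π and the corresponding teacher policy (an assumption, not the patent's exact formula), the fused distillation objective would take roughly the following form:

```latex
\min_{\pi} \; L_{\mathrm{Mix}}(\pi)
  = \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\!\left[ H\!\left(\pi_{\mathrm{HE}}(\cdot \mid s),\, \pi(\cdot \mid s)\right) \right]
  + (1 - \alpha) \, \mathbb{E}_{s \sim \mathcal{D}}\!\left[ H\!\left(\pi^{*}(\cdot \mid s),\, \pi(\cdot \mid s)\right) \right]
```

where H(p, q) = -Σ_a p(a) log q(a) is the cross-entropy and D is the set of recorded states.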
- GAIL (Generative Adversarial Imitation Learning)
- The GAIL method requires trajectories sampled from the teacher model.
- The objective function that is maximized by the discriminator Dw and minimized by the student model in the GAIL method is as follows.
- Here, τ ~ π are the trajectories sampled from the student model π.
- In the present invention, trajectories τHE ~ πHE and τRL ~ πRL are sampled from each of the two experts.
- The fusion loss function can then be rewritten accordingly. Intuitively, the discriminator Dw is trained to recognize the fusion of the human (expert) policy and the reinforcement learning model's policy, and the student model π, trained to deceive this discriminator, is expected to approach the fusion policy and imitate the strengths of both experts.
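The GAIL objective referenced here is likewise not printed in this extract; its standard form, with the single expert term split between the two experts as described above (the split and the weighting by α are assumptions), would be approximately:

```latex
\min_{\pi} \max_{D_w} \;
  \alpha \, \mathbb{E}_{\tau_{\mathrm{HE}} \sim \pi_{\mathrm{HE}}}\!\left[ \log D_w(s, a) \right]
  + (1 - \alpha) \, \mathbb{E}_{\tau_{\mathrm{RL}} \sim \pi_{\mathrm{RL}}}\!\left[ \log D_w(s, a) \right]
  + \mathbb{E}_{\tau \sim \pi}\!\left[ \log\!\left(1 - D_w(s, a)\right) \right]
```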
- Example 1: Atari 2600 game (Gopher)
- The learning device of Example 1 was first applied to Gopher, an Atari 2600 game with a discrete action space.
- In this game, the player controls a farmer who moves left and right and fills holes so that the gopher emerging from underground onto the surface cannot take the carrots.
- A human (expert) and a trained learning model each provided 55,000 frames, of which 50,000 were used as the training set and 5,000 as the test set.
- An Adam optimizer with a learning rate of 10^-4 and a dropout rate of 0.5 was used to train the student model.
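A minimal sketch of this training configuration (only the learning rate of 10^-4 and the dropout rate of 0.5 come from the text; the placeholder model below, including its layer sizes and the assumed number of discrete actions, is not from the patent):

```python
import torch
import torch.nn as nn

# Placeholder student model; the actual architecture is not specified in this extract.
student = nn.Sequential(
    nn.Flatten(),
    nn.Linear(84 * 84, 256), nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout rate of 0.5, as stated in the text
    nn.Linear(256, 8),   # assumed number of discrete actions
)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)  # learning rate of 10^-4
```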
- Example 2: Torcs. Torcs (see Wymann, B., "The open racing car simulator", 2015) is one of the most commonly used simulators in autonomous driving research. The experiment using Torcs was based on the GymTorcs environment.
- The observation space of the agent AI consists of 65 continuous values in total, such as the distance from the car to the track edge, the distance to opponent cars, and the current speed and acceleration.
- The action space consists of two elements, "steering (left/right)" and "acceleration/deceleration", with possible values limited to the range [-1.0, 1.0].
- The reward function was the distance traveled, and the reinforcement learning model was trained based on OpenAI Baselines.
- Example 3: Apple game. The apple game is a game in which the player's avatar is moved to the position of an apple that appears. In terms of score, reinforcement learning achieved the highest result, followed by the learning method of the examples, and finally the human (Comparative Example 1) and its imitation learning. In terms of human-likeness, the human agent (Comparative Example 1) was the best. The learning method of the examples showed more human-like behavior than the agent of the reinforcement learning model (Comparative Example 3) while surpassing both the human (Comparative Example 1) and its imitation learning model (Comparative Example 2) in score, ranking second only to the human (Comparative Example 1) in human-likeness. From this result, it can be seen that the learning method of the examples balances human-like behavior and high performance in this game.
- The present invention is useful for learning agents in a wide range of fields, such as automated driving of automobiles and automatic control of industrial robot arms.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Feedback Control In General (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
The present invention relates to a method, device, and program that, when training an agent AI (artificial intelligence), can take into account both the error with respect to a reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model.
In recent years, reinforcement learning (hereinafter sometimes referred to as "RL"), and in particular deep reinforcement learning (DRL), has been applied to algorithms such as the AI (artificial intelligence) used in electronic games, automated driving control of vehicles such as automobiles, and autonomous control of robots. Reinforcement learning is a method that repeats learning in units of episodes and adapts to the target environment using the rewards obtained through trial and error in that environment; learning is performed by optimizing the policy. Deep reinforcement learning uses the information-processing power of a convolutional neural network (CNN) to perform reinforcement learning based on high-dimensional inputs such as image data.
In reinforcement learning and deep reinforcement learning, a problem is set in which an agent in an environment observes the current state and decides the action to take, and the agent receives a reward from the environment by selecting an action. While the interaction between the environment and the agent is repeated, learning is performed using the three variables of reward, action, and state.
In reinforcement learning and deep reinforcement learning, the agent acts in an environment having a state space and an action space. At each time step, the policy converts the state into a feature vector with a feature-extraction learner and outputs action probabilities for that feature vector with a value-function-calculation learner. After the agent takes an action, the environment outputs a scalar reward and the next state. The episode ends when actions have been repeated until the final state is reached.
An agent AI constructed by reinforcement learning or deep reinforcement learning can learn through trial and error while interacting with the environment and solve various problems, far exceeding human ability, so it is applied to algorithms such as game AI, autonomous control of robots, and automated driving (see, for example, Non-Patent Documents 1 to 5). However, since humans cannot predict its behavior, there is a major usability problem for practical applications such as a robot that works in collaboration with humans or an automated-driving AI that carries human passengers. In addition, an agent AI constructed by reinforcement learning or deep reinforcement learning shows high performance because it is trained to maximize returns, but when it is put into practical use, factors other than such performance indicators also need to be considered. For example, in a video game, if an NPC (Non Player Character), a character that the player cannot operate, is driven by a reinforcement learning agent AI, that agent AI may be so strong that the player cannot enjoy the game very much. When applied to automated driving, a reinforcement learning agent AI trained for high performance may accelerate or decelerate violently or turn suddenly, causing anxiety to nearby vehicles and pedestrians. It is therefore necessary to design a human-like agent AI.
On the other hand, imitation learning (hereinafter sometimes referred to as "IL") is adopted to build an agent AI that imitates human behavior, mainly for the purpose of reproducing the behavior of human experts. In imitation learning, the policy followed by the expert is assumed to be the optimal policy, and learning is performed so that the policy of the agent AI approaches the behavior of the expert. The human expert's policy is provided in the form of a sequence of state-action pairs, and the agent AI's policy is then trained to infer, from an observed state, the action that the expert would be likely to take; by imitating the human expert, human-like behavior can be expected. However, since the learned policy is limited to the provided data, there is a problem that the performance of an imitation learning agent AI hardly ever exceeds the performance of the human expert (see, for example, Non-Patent Documents 6 to 8).
In view of this situation, it is an object of the present invention to provide an agent learning method, device, and program that achieve both the highly efficient optimal behavior exceeding human ability that is acquired by reinforcement learning and the human-like behavior that is acquired by imitating the behavior of a human expert.
In order to solve the above problems, the agent learning method of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it takes the optimal action in a human-like manner; it includes the following steps a) to f).
a) An input step of inputting state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
b) A learning step of inputting the state data SR into the learning executor of a fusion model, which fuses a reinforcement learning model responsible for behavior that takes the optimal action with an imitation learning model responsible for human-like behavior, and causing it to output the action data AF of the fusion model.
c) A first loss error calculation step of calculating a first loss error between the action data AR and the action data AF.
d) A second loss error calculation step of calculating a second loss error between the action data AF and the action data ARL output on the basis of the state data SR by an executor of the reinforcement learning model or an optimal behavior algorithm.
e) A fusion error calculation step of calculating a fusion error based on the weight ratio of the first and second loss errors.
f) An update step of updating the parameters of the learning executor of the fusion model based on the fusion error.
According to the agent learning method of the present invention, the error with respect to the reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model are fed back, so that an agent AI that behaves like a human and takes optimal actions can be constructed.
Here, the executor of the reinforcement learning model is, as in general reinforcement learning, set with a policy under which the agent in the environment observes the current state and decides the action to take; it is an executor obtained by repeatedly selecting actions from the policy and obtaining rewards so that the value of the value function increases, learning with feedback from the three variables of reward, action, and state.
The environment is, for example, a set of image data of an electronic game screen, or a time series of image data from video captured by a camera mounted on a robot or a vehicle. The state data is data obtained by observing such an environment, for example individual game images, images of the vehicle's surroundings, and lane positions. The action data is, for example, game input-device operations and controller inputs (movement direction, jump, etc.) in the case of an electronic game, and steering-wheel, accelerator, and brake operations in the case of automated driving of an automobile.
In this way, reinforcement learning works by a mechanism that learns with feedback from the three variables of reward, action, and state, whereas imitation learning works by a mechanism that uses a teacher agent and a student agent.
The action data AF output by the fusion model is expressed as a probability distribution over actions, for example the probability distribution over each input-button operation of a controller, or the probability distribution over steering-wheel, accelerator, and brake operations. Here, the case in which only a single optimal operation is output is also included, as a δ distribution. More specifically, discrete actions are expressed by a probability distribution, whereas for continuous actions only the single optimal action is output.
The loss error means, for example, the error between the current output and the expected output, calculated using a loss function.
In the input step of the agent learning method of the present invention, the state data SR and action data AR correspond to one of the following cases 1) to 5), and the processing of the first and second loss error calculation steps differs accordingly.
1) When the input is a recording of play data by a human expert: the first loss error calculation step calculates, using a loss function, the error between the action data AR in the play data by the human expert and the action data AF. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
2) When the input is a recording of play data of an agent of the imitation learning model: the first loss error calculation step calculates, using a loss function, either the error between the action data AR in that play data and the action data AF, or the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
3) When the input is a recording of play data of an agent of the reinforcement learning model: the first loss error calculation step calculates, using a loss function, the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
4) When the input consists of state data SHE and action data AHE from a recording of play data by a human expert, and state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model: the learning step inputs the state data SHE to output action data AF1 and inputs the state data SRL to output action data AF2. The first loss error calculation step calculates, using a loss function, the error between the action data AHE in the play data by the human expert and the action data AF1. The second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
5) When the input consists of state data SIL and action data AIL from a recording of play data of an agent of the imitation learning model, and state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model: the learning step inputs the state data SIL to output action data AF1 and inputs the state data SRL to output action data AF2. The first loss error calculation step calculates, using a loss function, the error between the action data AF1 and the action data AIL in the play data of the agent of the imitation learning model soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
In the soft targeting by knowledge distillation in the agent learning method of the present invention, the output of the agent's action policy has a temperature parameter T and supports both hard targeting and soft targeting. When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely. On the other hand, when the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
Next, the agent learning device of the present invention will be described. The agent learning device of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it behaves in a human-like manner; it comprises the following A) to F).
A) An input unit that inputs state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
B) A learning executor that inputs the state data SR into a fusion model, which fuses a reinforcement learning model responsible for behavior that takes the optimal action with an imitation learning model responsible for human-like behavior, and outputs the action data AF of the fusion model.
C) A first loss error calculator that calculates a first loss error between the action data AR and the action data AF.
D) A second loss error calculator that calculates a second loss error between the action data AF and the action data ARL output on the basis of the state data SR by an executor of the reinforcement learning model or an optimal behavior algorithm.
E) A fusion error calculator that calculates a fusion error based on the weight ratio of the first and second loss errors.
F) An update unit that updates the parameters of the learning executor of the fusion model based on the fusion error.
本発明のエージェントの学習装置の入力部において、上述した本発明のエージェントの学習方法の入力ステップと同様に、上述の1)~5)の場合の状態データSRと行動データARに応じて、第1及び第2の損失誤差算出器の算出処理が異なる。
また、本発明のエージェントの学習装置における知識蒸留によるソフトターゲット化は、上述した本発明のエージェントの学習方法と同様に、エージェントの行動方策の出力が、熱度Tのパラメタを有し、ハードターゲット化とソフトターゲット化に対応し得るものであり、熱度Tが“0”の場合は、行動方策の出力は一つの行動のみが0より大きな確率を持つハードターゲットとなり、行動は一意に決定され、一方、熱度Tが“0”でない場合は、複数の行動が0より大きな確率をもち、行動は確率的に決定される。
At the input of the agent of the learning device of the present invention, similarly to the input step of learning how the agent of the present invention described above, according to the state data S R and behavioral data A R in the case of the above 1) to 5) , The calculation process of the first and second loss error calculators is different.
Further, in the soft targeting by knowledge distillation in the learning device of the agent of the present invention, the output of the action policy of the agent has a parameter of heat degree T and is hard-targeted as in the above-described learning method of the agent of the present invention. When the degree of heat T is "0", the output of the action policy becomes a hard target with a probability that only one action has a probability greater than 0, and the action is uniquely determined, while the action is uniquely determined. If the heat degree T is not "0", the plurality of actions have a probability greater than 0, and the actions are determined probabilistically.
The agent learning program of the present invention is a program for causing a computer to execute all the steps of the agent learning method of the present invention described above.
The agent learning program of the present invention is also a program for causing a computer to function as the input unit, the learning executor, the first loss error calculator, the second loss error calculator, the fusion error calculator, and the update unit of the agent learning device of the present invention described above.
According to the present invention, it is possible to build an agent AI that combines highly efficient optimum behavior exceeding human ability, acquired through reinforcement learning, with human-like behavior acquired by imitating the behavior of a human expert.
Hereinafter, an example embodiment of the present invention will be described in detail with reference to the drawings. The scope of the present invention is not limited to the following examples and illustrated examples, and many changes and modifications are possible.
FIG. 1 shows a functional block diagram of the agent learning device of the present invention. The learning device 1 comprises a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimum behavior, a fusion model loss function calculator 5, and a database (DB) 6. The fusion model learning executor 2 acquires state data from the database 6, executes the fusion model, and outputs the resulting behavior data to the loss function calculators (3, 4). The loss function calculators (3, 4) take state data or behavior data from the database 6 together with the behavior data received from the fusion model learning executor 2, feed them into their respective loss error calculators, and compute the loss error function L_HL and the loss error function L_RL, respectively. The fusion model loss function calculator 5 receives the loss errors L_HL and L_RL computed by the loss function calculators (3, 4) and computes the fused loss error function L_Mix, which combines the loss for behaving more efficiently than a human with the loss for behaving in a human-like manner. The fusion model learning executor 2 obtains the fused loss error function L_Mix from the fusion model loss function calculator 5 and performs the learning processing of the fusion model.
In the following examples, functional block diagrams of the agent learning device of the present invention are described for each combination pattern of fusion model type, type of play data used for learning, and type of learning control. The combination patterns are summarized in Table 1.
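The overall data flow of FIG. 1 can be illustrated with a short, hedged sketch. The specification discloses no source code; the PyTorch-style loop below is only an illustration of the described flow (database, fusion model, two loss calculators, fused loss, parameter update), and every name in it (FusionPolicy, rl_teacher, alpha, temperature) is an assumption made for this sketch rather than part of the disclosed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical fusion policy: a small network mapping states to action logits.
class FusionPolicy(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # action logits

def train_step(policy, optimizer, rl_teacher, states, expert_actions,
               alpha=0.5, temperature=0.1):
    """One update mirroring FIG. 1: compute L_HL and L_RL, fuse them, update."""
    logits = policy(states)                               # fusion model output A_F
    # L_HL: loss against the human(-like) behavior data A_R (hard targets).
    loss_hl = F.cross_entropy(logits, expert_actions)
    # L_RL: loss against the RL teacher's soft targets (knowledge distillation).
    with torch.no_grad():
        soft_targets = F.softmax(rl_teacher(states) / temperature, dim=-1)
    loss_rl = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    # L_Mix: weighted fusion of the two losses.
    loss_mix = alpha * loss_rl + (1.0 - alpha) * loss_hl
    optimizer.zero_grad()
    loss_mix.backward()
    optimizer.step()
    return loss_mix.item()
```

In this reading, the trade-off coefficient alpha plays the role of the weight ratio between the second loss (optimum behavior) and the first loss (human-like behavior).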
FIG. 2 shows a functional block diagram of the agent learning device of Example 1. As shown in FIG. 2, the learning device 11 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 61. The database 61 stores play data produced by a human expert. The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft-targeting processor 9, and a second loss error calculator 40.
An agent learning method using the learning device 11 will be described with reference to FIGS. 3 to 8. FIG. 3 shows a schematic flow diagram of the agent learning method of Example 1. As shown in FIG. 3, the fusion model learning executor 2 first executes the fusion model (step S11). Next, the loss function calculator 3 calculates the loss error function L_HE against human-like behavior (step S12). Separately, the loss error function L_RL against optimum behavior is calculated (step S13). Based on the loss error function L_HE against human-like behavior and the loss error function L_RL against optimum behavior, the fusion model loss function calculator 5 calculates the fused loss error function L_Mix (step S14). The fusion model learning executor 2 obtains the calculated fused loss error function L_Mix and performs the learning processing of the fusion model (step S15). If the fused loss error function L_Mix is greater than or equal to a predetermined value (step S16), the fusion model learning executor 2 executes the fusion model again (step S11).
Next, each process shown in FIG. 3 will be described. FIG. 4 shows a flow diagram of the fusion model execution processing of Example 1. As shown in FIG. 4, the fusion model learning executor 2 acquires state data S_R from the database 61 (step S111), inputs the state data S_R to the fusion model π and executes it (step S112), and stores the behavior data A_F of the fusion model (step S113).
FIG. 5 shows a flow diagram of the calculation processing of the loss error function L_HE against human-like behavior in Example 1. As shown in FIG. 5, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires human behavior data A_R from the database 61 (step S121). The loss function calculator 3 also acquires the behavior data A_F of the fusion model (step S122). The first loss error calculator 30 calculates the loss error function L_HE from the acquired A_R and A_F (step S123).
FIG. 6 shows a flow diagram of the calculation processing of the loss error function L_RL against optimum behavior in Example 1. As shown in FIG. 6, in the loss function calculator 4 for optimum behavior, the reinforcement learning model executor 8 acquires state data S_R from the database 61 (step S131), inputs the state data S_R to the reinforcement learning model π_RL, executes it, and outputs behavior data A_R (step S132). The behavior data A_R is input to the soft-targeting processor 9, which executes and outputs behavior data A_ST2 to the second loss error calculator 40 (step S133). Separately, the second loss error calculator 40 acquires the behavior data A_F of the fusion model (step S134). The second loss error calculator 40 calculates the loss error function L_RL from the acquired behavior data A_ST2 and behavior data A_F (step S135).
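The closed form of L_RL is not given at this point in the text; a common choice consistent with the soft-targeted output A_ST2 (an assumed reconstruction, not a quotation) is the cross entropy between the temperature-softened teacher distribution and the fusion model's policy:

```latex
L_{RL} = -\,\mathbb{E}_{s \sim \mathcal{D}}
\Bigl[\sum_{a}\pi^{(T)}_{RL}(a \mid s)\,\log \pi(a \mid s)\Bigr].
```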
Here, the soft-targeting processor 9 will be described with reference to FIG. 21. As mentioned above, soft-targeting by knowledge distillation in the soft-targeting processor 9 means that the output of the agent's behavior policy has a temperature parameter T and supports both hard targeting and soft targeting. When the temperature T = 0, the output of the behavior policy is a hard target in which only one action has a probability greater than 0, and the action is determined uniquely. When the temperature T > 0 (T is not 0), multiple actions have probabilities greater than 0, and the action is determined stochastically.
The soft-targeting processor will be described using as an example the case in which it is used in the loss function calculator for optimum behavior. As shown in FIG. 21(1), the reinforcement learning model executor receives the state data and outputs its output data G to the soft-targeting processor, which in turn outputs its output data L to the loss error L_RL calculator. The loss error L_RL calculator receives the behavior data output by the fusion model and the output data of the soft-targeting processor, and calculates the loss error function.
In this case, as shown in FIG. 21(3), there are broadly three patterns for the soft-targeting processor. In the first, the processing inside the soft-targeting processor handles the temperature T and calculates the loss error function separately for data m when T = 0 and for data L when T > 0 (this is referred to as "pattern A"). In the second, the processing handles the temperature T and calculates the loss error function for data L with T > 0 (this is referred to as "pattern B"). In the third, the processing handles the temperature T and calculates the loss error function for data G with T = 1 (this is referred to as "pattern C1") or for data G with T = 0 (this is referred to as "pattern C2"). Concretely, the soft-targeting processing with the temperature T uses a processing function such as the graph shown in FIG. 21(2), with the temperature T used as a parameter of an exponential function. The above patterns, data names, and data formats are summarized in Table 2.
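A minimal sketch of the T = 0 / T > 0 branching described above is given below; the function name and the NumPy formulation are illustrative assumptions, not code from the specification.

```python
import numpy as np

def soft_target(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn policy logits into a hard target (T == 0) or a soft target (T > 0).

    With T == 0 a one-hot distribution over the arg-max action is returned;
    with T > 0 a softmax with the temperature as the exponential-scaling
    parameter is returned, so several actions keep nonzero probabilities.
    """
    if temperature == 0.0:
        hard = np.zeros_like(logits)
        hard[np.argmax(logits)] = 1.0
        return hard
    z = logits / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: a low temperature sharpens the distribution, a high one flattens it.
print(soft_target(np.array([2.0, 1.0, 0.1]), 0.0))   # hard target
print(soft_target(np.array([2.0, 1.0, 0.1]), 0.1))   # nearly one-hot
print(soft_target(np.array([2.0, 1.0, 0.1]), 1.0))   # softer distribution
```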
FIG. 7 shows a flow diagram of the calculation processing of the fused loss error function L_Mix in Example 1. As shown in FIG. 7, the fusion model loss function calculator 5 acquires the loss error functions L_HE and L_RL (step S141) and calculates the fused loss error function L_Mix from the trade-off coefficient α, L_HE, and L_RL (step S142).
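The formula for L_Mix is not reproduced in the text at this point; given that α is described as the trade-off (weight-ratio) coefficient between the two losses, a natural reading, offered here only as an assumption, is the convex combination:

```latex
L_{\mathrm{Mix}} = \alpha\,L_{RL} + (1-\alpha)\,L_{HE},
\qquad \alpha \in (0,1).
```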
FIG. 8 shows a flow diagram of the learning processing of the fusion model in Example 1. As shown in FIG. 8, the fusion model learning executor 2 acquires the fused loss error function L_Mix (step S151) and changes the parameters of the fusion model so that the fused loss error function L_Mix becomes smaller (step S152).
FIG. 9 shows a functional block diagram of the agent learning device of Example 2. As shown in FIG. 9, the learning device 12 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 62. The database 62 stores play data generated by the imitation learning model π_IL provided in an imitation learning model executor 7. The loss function calculator 3 is provided with the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The agent learning device 12 of Example 2 differs from the learning device 11 of Example 1, which uses the database 61 storing play data produced by a human expert, in that the database used is the database 62 storing play data generated by the imitation learning model. Accordingly, in the agent learning method of Example 2, the calculation processing of the loss error function L_IL against human-like behavior differs from that of Example 1.
FIG. 10 shows a flow diagram of the calculation processing of the loss error function L_IL against human-like behavior in Example 2. As shown in FIG. 10, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires behavior data A_IL of the imitation learning model from the database 62 (step S211). The loss function calculator 3 also acquires the behavior data A_F of the fusion model (step S212). The first loss error calculator 30 calculates the loss error function L_IL from the acquired A_IL and A_F (step S213).
FIG. 11 shows a functional block diagram of the agent learning device of Example 3. As shown in FIG. 11, the learning device 13 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and the database 62. The database 62 stores play data generated by the imitation learning model π_IL provided in the imitation learning model executor 7. The loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
In the agent learning device 13 of Example 3, unlike Example 2, the loss function calculator 3 is provided not only with the first loss error calculator 30 but also with the imitation learning model executor 7 and the soft-targeting processor 9.
FIG. 20 shows a flow diagram of the calculation processing of the loss error function L_IL against human-like behavior in Example 3. As shown in FIG. 20, the loss function calculator 3 first acquires state data S_R from the database 62 (step S411). The state data S_R is input to the imitation learning model π_IL of the imitation learning model executor 7, which executes and outputs behavior data A_IL (step S412). The behavior data A_IL is input to the soft-targeting processor 9, which executes and outputs behavior data A_ST1 (step S413). Separately, the first loss error calculator 30 acquires the behavior data A_F of the fusion model (step S414). The first loss error calculator 30 calculates the loss error function L_IL from the acquired behavior data A_ST1 and behavior data A_F (step S415).
FIG. 12 shows a functional block diagram of the agent learning device of Example 4. As shown in FIG. 12, the learning device 14 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 63. The database 63 stores play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8. As in Example 3, the loss function calculator 3 is provided with the imitation learning model executor 7, the soft-targeting processor 9, and the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The agent learning device 14 of Example 4 differs from the learning device 13 of Example 3, which uses the database 62 storing play data generated by the imitation learning model, in that the database used is the database 63 storing play data generated by the reinforcement learning model; in all other respects it is the same.
FIG. 13 shows a functional block diagram of the agent learning device of Example 5. As shown in FIG. 13, the learning device 15 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and databases (61, 63). The loss function calculator 3 is provided with the first loss error calculator 30, and the loss function calculator 4 is provided with the second loss error calculator 40.
The learning device 15 of Example 5 differs from Examples 1 to 4 in that it uses two databases: the database 61 storing play data produced by a human expert, and the database 63 storing play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8.
The agent learning method using the learning device 15 will therefore be described with reference to FIGS. 14 to 18. FIG. 14 shows a schematic flow diagram of the agent learning method of Example 5. As shown in FIG. 14, the fusion model learning executor 2 first performs fusion model execution processing 1 (step S31). Separately, the fusion model learning executor 2 performs fusion model execution processing 2 (step S32). After fusion model execution processing 1 in step S31, the loss function calculator 3 calculates the loss error function L_HE against human-like behavior (step S33). After fusion model execution processing 2 in step S32, the loss error function L_RL against optimum behavior is calculated (step S34). Step S33 only has to be performed after step S31, and step S34 only has to be performed after step S32; for example, step S32 may be performed before step S31.
Based on the loss error function L_HE against human-like behavior and the loss error function L_RL against optimum behavior, the fusion model loss function calculator 5 calculates the fused loss error function L_Mix (step S35). The fusion model learning executor 2 obtains the calculated fused loss error function L_Mix and performs the learning processing of the fusion model (step S36). If the fused loss error function L_Mix is greater than or equal to a predetermined value (step S37), the fusion model learning executor 2 performs fusion model execution processing 1 and 2 again (steps S31 and S32).
Next, each process shown in FIG. 14 will be described. FIG. 15 shows flow diagram 1 of the fusion model execution processing of Example 5. As shown in FIG. 15, the fusion model learning executor 2 acquires state data S_R1 from the database 61 (DB1) (step S311), inputs the state data S_R1 to the fusion model π and executes it (step S312), and stores the behavior data A_F1 of the fusion model (step S313).
FIG. 16 shows flow diagram 2 of the fusion model execution processing of Example 5. As shown in FIG. 16, the fusion model learning executor 2 acquires state data S_R2 from the database 63 (DB2) (step S321), inputs the state data S_R2 to the fusion model π and executes it (step S322), and stores the behavior data A_F2 of the fusion model (step S323).
FIG. 17 shows a flow diagram of the calculation processing of the loss error function L_HE against human-like behavior in Example 5. As shown in FIG. 17, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires human behavior data A_R from the database 61 (step S331). The loss function calculator 3 also acquires the behavior data A_F1 of the fusion model (step S332). The first loss error calculator 30 calculates the loss error function L_HE from the acquired A_R and A_F1 (step S333).
FIG. 18 shows a flow diagram of the calculation processing of the loss error function L_RL against optimum behavior in Example 5. As shown in FIG. 18, the loss function calculator 4 for optimum behavior acquires behavior data A_RL of the reinforcement learning model from the database 63 (DB2) (step S341). Separately, the second loss error calculator 40 acquires the behavior data A_F2 of the fusion model (step S342). The second loss error calculator 40 calculates the loss error function L_RL from the acquired behavior data A_RL and behavior data A_F2 (step S343).
The calculation processing of the fused loss error function L_Mix in step S35 of FIG. 14 is the same as the calculation processing flow of the fused loss error function L_Mix of Example 1 shown in FIG. 7. Likewise, the learning processing of the fusion model in step S36 of FIG. 14 is the same as the learning processing flow of the fusion model of Example 1 shown in FIG. 8.
FIG. 19 shows a functional block diagram of the agent learning device of Example 6. As shown in FIG. 19, the learning device 16 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and databases (62, 63). The loss function calculator 3 is provided with the imitation learning model executor 7, the soft-targeting processor 9, and the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The learning device 16 of Example 6 differs from Example 5, which uses the databases (61, 63), in that it uses two databases: the database 62 (DB1) storing play data generated by the imitation learning model π_IL provided in the imitation learning model executor 7, and the database 63 (DB2) storing play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8.
(Performance evaluation of the fusion model)
As explained in the examples above, the learning method that learns a human-like agent AI while maintaining the high performance of the reinforcement learning model consists of two processes: a process that learns an agent AI taking efficient optimum actions surpassing human performance, and a process that learns an agent AI selecting actions in a human-like manner.
Each process has been addressed as a task of reinforcement learning and imitation learning, respectively. In the present invention, the fusion model of reinforcement learning and imitation learning is therefore based on policy distillation in the case of a discrete action space and on adversarial imitation learning in the case of a continuous action space.
Let π* be the optimal policy obtained by the reinforcement learning model, π_HE be the human (expert) policy, and α ∈ (0,1) be the parameter that determines the ratio of these two policies; the objective function then takes the form of the equation below.
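The equation referred to here appears only as an image in the original publication and is not reproduced in this text. A plausible reconstruction consistent with the surrounding description, and therefore an assumption rather than a quotation, is a weighted combination of discrepancies from the two teacher policies:

```latex
\min_{\pi}\; L(\pi) = \alpha\,D\bigl(\pi^{*}\,\|\,\pi\bigr)
+ (1-\alpha)\,D\bigl(\pi_{HE}\,\|\,\pi\bigr),
\qquad \alpha \in (0,1),
```

where D(·‖·) is a discrepancy such as the cross entropy between a teacher policy and the fusion policy π.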
In the case of a discrete action space, the objective function for imitation learning is defined, following prior work on imitation learning, as the cross-entropy loss below.
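The cross-entropy loss itself is not reproduced in this text; its standard form for behavioral cloning on expert state–action pairs (s, a) is given below as an assumption based on the cited prior work, not as a quotation:

```latex
L_{IL}(\pi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}_{HE}}\bigl[\log \pi(a \mid s)\bigr].
```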
The human (expert) policy π_HE is difficult to define as a mathematical model, so learning is performed on experimentally sampled data. Although no soft targets can be obtained from π_HE, policy distillation achieves better performance by computing a weighted average of hard and soft targets. Therefore, the data provided by the human (expert) is used as the hard target, and the output of the trained model's policy π^(T)_RL, adjusted by the temperature T, is used as the soft target. The resulting loss function is given below.
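The final loss function referred to here is likewise not reproduced; a reconstruction that combines the hard target a_HE from the expert data with the soft target π^(T)_RL from the trained model, weighted by α (an assumption, not a quotation), is:

```latex
L(\pi) = \alpha\Bigl(-\sum_{a}\pi^{(T)}_{RL}(a \mid s)\log \pi(a \mid s)\Bigr)
+ (1-\alpha)\bigl(-\log \pi(a_{HE} \mid s)\bigr).
```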
In the case of a continuous action space, the known GAIL (Generative Adversarial Imitation Learning) method is used as the imitation learning method (for the GAIL method, see Non-Patent Document 8). The GAIL method requires trajectories τ ~ π sampled from a teacher model π. The objective function maximized by the discriminator D_w and minimized by the student model in the GAIL method are given below.
Here, τ denotes trajectories τ ~ π sampled from the student model. To build the fusion model, the teacher model is made up of both the human (expert) and the reinforcement learning model, so trajectories τ_HE ~ π_HE and τ_RL ~ π_RL are sampled from each expert. The fusion loss function can then be replaced as follows. Intuitively, the discriminator D_w is trained to recognize the fused policy between the human (expert) policy and the reinforcement learning model policy, and the student model π, trained to fool this discriminator, is expected to approach the fused policy and to imitate the strengths of both experts.
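The GAIL objectives referred to here are not reproduced in this text. Following the standard GAIL formulation (so the sign conventions and the fused form below are assumptions, not quotations from the patent), the discriminator objective and its fused replacement can be written as:

```latex
\max_{w}\;\mathbb{E}_{\tau\sim\pi}\bigl[\log D_w(s,a)\bigr]
+\mathbb{E}_{\tau_E\sim\pi_E}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr],
```

```latex
\max_{w}\;\mathbb{E}_{\tau\sim\pi}\bigl[\log D_w(s,a)\bigr]
+\alpha\,\mathbb{E}_{\tau_{RL}\sim\pi_{RL}}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr]
+(1-\alpha)\,\mathbb{E}_{\tau_{HE}\sim\pi_{HE}}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr],
```

with the student policy π trained to minimize the same quantity, that is, to fool D_w.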
The experiments carried out will now be described.
(Experiment 1) Atari 2600 game (Gopher)
The learning device of Example 1 was first applied to Gopher, an Atari 2600 game with a discrete action space. In this game, the player acts as a farmer, moving left and right and filling holes so that the gopher coming up from underground cannot take the carrots. A human (expert) and a trained learning model each provided 55,000 frames, of which 50,000 were used as the training set and 5,000 as the test set. In particular, the Adam optimizer with a learning rate of 10^-4 and a dropout rate of 0.5 were used to train the student model. The fusion model was trained with the knowledge distillation temperature set to T = 0.1 and the trade-off coefficient set to α = 0.93.
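For concreteness, the hyper-parameters reported for Experiment 1 can be gathered into a short configuration sketch; the class and field names below are assumptions made for illustration, not code or identifiers from the specification.

```python
from dataclasses import dataclass

@dataclass
class GopherFusionConfig:
    # Hyper-parameters reported for Experiment 1 (Gopher, discrete actions).
    frames_per_source: int = 55_000       # frames from the expert and from the trained RL model
    train_frames: int = 50_000
    test_frames: int = 5_000
    learning_rate: float = 1e-4           # Adam optimizer
    dropout_rate: float = 0.5
    distillation_temperature: float = 0.1  # T
    trade_off_alpha: float = 0.93          # alpha

print(GopherFusionConfig())
```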
(Experiment 2) Torcs
Torcs (see Wymann, B., "The open racing car simulator", (2015)) is one of the simulators most commonly used in autonomous driving research. The experiments with Torcs were based on the GymTorcs environment. The observation space of the agent AI consists of 65 continuous values in total, such as the distance from the car to the track edge, the distance to opponent cars, and the current speed and acceleration. The action space consists of two elements, "steering (left/right)" and "acceleration/deceleration", whose values are limited to the range [-1.0, 1.0].
The reward function was the distance travelled, and the reinforcement learning model was trained based on OpenAI Baselines. Furthermore, so that situations in which human-likeness can be distinguished would arise, a stationary bot was placed in the Torcs simulator, and a human (expert) played 220 episodes of 60 seconds each, from which the data was collected. A plain imitation learning agent was trained on the human (expert) data using OpenAI's GAIL method, in the same way as the reinforcement learning model was trained. Finally, the learner update in the GAIL method was carried out, and the fusion model was trained with the trade-off coefficient α = 0.5 so that both experts would have equal influence.
(Experiment 3) Apple game
In the Apple game, the player collects apples that appear one at a time at random positions on the screen, and the score is the number of apples collected.
(3) Sensory evaluation of human-likeness
In addition to evaluating each model's performance, a double-blind sensory test was conducted to evaluate the human-likeness of the models. The test involved 26 judges in total: 23 men and 3 women, aged 27 to 59 with an average age of 44. None of the judges had had any contact with materials related to the content of the present invention before the survey. First, the rules of each game were explained to the judges, and a hands-on session was held for each game so that they could understand human-like behavior. In the survey, each judge was shown two videos per game (15 seconds for Gopher, 30 seconds for Torcs) and asked to judge whether the player was a human or an AI and to give the reason.
The experimental results will now be described.
(1) Atari 2600 game (Gopher)
In terms of performance, the reinforcement learning model (Comparative Example 3) scored highest, followed by the fusion model of the example, and finally the human (Comparative Example 1) and the imitation learning model (Comparative Example 2). Although the fusion model of the example prioritized the targets provided by the reinforcement learning model (Comparative Example 3) with α = 0.8, its score improved by only 3 points, leaving a large gap to the score of the reinforcement learning model alone. In the sensory test, the reinforcement learning model (Comparative Example 3) was judged to be not very human-like, whereas the fusion model of the example scored higher than the human (Comparative Example 1) and its imitation (Comparative Example 2) not only in game score but also in human-likeness.
This shows that the fusion model was able to learn both the tendency to pursue the reinforcement learning objective and the behavior of the human (expert). Surprisingly, the fusion model was judged to be more human-like than the human (expert). Analysis of the judges' comments to clarify the reason shows that frequent impressions were "few wasted movements", "the movements are precise, there is a programmed feel to them", and "it tries to fill the holes in order". It is therefore considered that high performance was not expected of the human (expert), especially by judges who do not play games often. The score results are summarized in Table 3.
(2) Torcs
For performance evaluation, the scores of the human (expert), the GAIL-based imitation of the human, the DDPG reinforcement learning model, and the fusion model were first compared. The score results are summarized in Table 4. The experiments showed that GAIL is good at imitating a trained reinforcement learning model or a deterministic bot, but that its efficiency in imitating a human (expert) is surprisingly low. This is presumed to be because the human (expert) policy is complex and difficult to handle with a basic neural network. The fusion model of the example, on the other hand, succeeded in imitating characteristic features such as the high speed of the reinforcement learning model (Comparative Example 3) and the cornering style of the human (Comparative Example 1). Moreover, it became able to drive the entire track, like the human (Comparative Example 1) and the reinforcement learning model (Comparative Example 3).
In ascending order of the proportion judged to be human, the reinforcement learning model (Comparative Example 3) came first and was judged not very human-like, for reasons such as "it drives too fast" and "it takes corners at high speed". Surprisingly, the human (Comparative Example 1) was also judged not to be human-like. The judges' comments on the same videos were diverse, but the imitation agent AI, which showed lower performance, was judged to be more human-like than the human (expert); as with Gopher, this suggests that the performance shown by the human (Comparative Example 1) was high. Finally, the fusion model was judged to be the most human-like while showing performance close to that of the reinforcement learning model (Comparative Example 3).
(Experiment 3) Apple game
The Apple game has a simple principle: the player moves their avatar to the position where an apple appears. Reinforcement learning achieved the highest score, followed by the learning method of the example, and finally the human (Comparative Example 1) and its imitation learning.
As far as human-likeness is concerned, the human agent (Comparative Example 1) was rated best. The learning method of the example exceeded the human (Comparative Example 1) and the agent of its imitation learning model (Comparative Example 2) in score while showing more human-like behavior than the agent of the reinforcement learning model (Comparative Example 3), with the human (Comparative Example 1) following. This result shows that the learning method of the example strikes a balance between human-like behavior and high performance in this game.
The present invention is useful for agent learning in a wide range of fields, such as autonomous driving of automobiles and automatic control of industrial robot arms.
1, 11-16 Learning device
2 Fusion model learning executor
3, 4 Loss function calculator
5 Fusion model loss function calculator
6, 61-63 Database
7 Imitation learning model executor
8 Reinforcement learning model executor
9 Soft-targeting processor
30 First loss error calculator
40 Second loss error calculator
Claims (12)
An agent learning method that fuses a learner realizing behavior in which an agent judges and takes the optimum action under a predetermined environment with a learner realizing human-like behavior, and optimizes the agent's behavior policy so that it takes the optimum action in a human-like manner, the method comprising:
an input step of inputting state data S_R and behavior data A_R recorded in at least one of play data produced by a human expert and play data of an agent created for a predetermined purpose;
a learning step of inputting the state data S_R to a learning executor of a fusion model of a reinforcement learning model, which governs the behavior of taking the optimum action, and an imitation learning model, which governs the human-like behavior, and causing it to output behavior data A_F of the fusion model;
a first loss error calculation step of calculating a first loss error between the behavior data A_R and the behavior data A_F;
a second loss error calculation step of calculating a second loss error between the behavior data A_F and behavior data A_RL output on the basis of the state data S_R by an executor of the reinforcement learning model or by an optimum behavior algorithm;
a fusion error calculation step of calculating a fusion error based on a weight ratio of the first and second loss errors; and
an update step of updating parameters of the learning executor of the fusion model based on the fusion error.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data produced by a human expert,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_R in the play data produced by the human expert and the behavior data A_F, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data of an agent of the imitation learning model,
the first loss error calculation step calculates, using a loss function, either the error between the behavior data A_R in the play data of the agent of the imitation learning model and the behavior data A_F, or the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data of an agent of the reinforcement learning model,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are state data S_HE and behavior data A_HE recorded in play data produced by a human expert together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning step inputs the state data S_HE to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_HE in the play data produced by the human expert and the behavior data A_F1, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are state data S_IL and behavior data A_IL recorded in play data of an agent of the imitation learning model together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning step inputs the state data S_IL to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_F1 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_IL in the play data of the agent of the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
An agent learning device that fuses a learner realizing behavior in which an agent judges and takes the optimum action under a predetermined environment with a learner realizing human-like behavior, and optimizes the agent's behavior policy so that it behaves in a human-like manner, the device comprising:
an input unit that inputs state data S_R and behavior data A_R recorded in at least one of play data produced by a human expert and play data of an agent created for a predetermined purpose;
a learning executor that inputs the state data S_R to a fusion model of a reinforcement learning model, which governs the behavior of taking the optimum action, and an imitation learning model, which governs the human-like behavior, and outputs behavior data A_F of the fusion model;
a first loss error calculator that calculates a first loss error between the behavior data A_R and the behavior data A_F;
a second loss error calculator that calculates a second loss error between the behavior data A_F and behavior data A_RL output on the basis of the state data S_R by an executor of the reinforcement learning model or by an optimum behavior algorithm;
a fusion error calculator that calculates a fusion error based on a weight ratio of the first and second loss errors; and
an update unit that updates parameters of the learning executor of the fusion model based on the fusion error.
The agent learning device according to claim 8, comprising, depending on the state data S_R and behavior data A_R at the input unit, the configuration for any one of the following calculation processes 1) to 5):
1) when the state data S_R and the behavior data A_R are a record of play data produced by a human expert,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_R in the play data produced by the human expert and the behavior data A_F, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
2) when the state data S_R and the behavior data A_R are a record of play data of an agent of the imitation learning model,
the first loss error calculator calculates, using a loss function, either the error between the behavior data A_R in the play data of the agent of the imitation learning model and the behavior data A_F, or the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
3) when the state data S_R and the behavior data A_R are a record of play data of an agent of the reinforcement learning model,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
4) when the state data S_R and the behavior data A_R are state data S_HE and behavior data A_HE recorded in play data produced by a human expert together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning executor inputs the state data S_HE to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_HE in the play data produced by the human expert and the behavior data A_F1, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model;
5) when the state data S_R and the behavior data A_R are state data S_IL and behavior data A_IL recorded in play data of an agent of the imitation learning model together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning executor inputs the state data S_IL to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_F1 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_IL in the play data of the agent of the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
An agent learning program for causing a computer to function as the input unit, the learning executor, the first loss error calculator, the second loss error calculator, the fusion error calculator, and the update unit in the agent learning device according to any one of claims 8 to 10.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2019-097222 | 2019-05-23 | ||
| JP2019097222A JP2020191022A (en) | 2019-05-23 | 2019-05-23 | AI agent learning methods, learning devices and learning programs that behave like humans |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020235693A1 true WO2020235693A1 (en) | 2020-11-26 |
Family
ID=73454686
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/020624 Ceased WO2020235693A1 (en) | 2019-05-23 | 2020-05-25 | Learning method, learning device, and learning program for ai agent that behaves like human |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP2020191022A (en) |
| WO (1) | WO2020235693A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210374612A1 (en) * | 2020-05-26 | 2021-12-02 | Nec Laboratories America, Inc. | Interpretable imitation learning via prototypical option discovery |
| CN115269565A (en) * | 2022-05-16 | 2022-11-01 | 南京大学 | Abnormal recommendation data detection method and system based on reinforcement learning |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE102020215324A1 (en) * | 2020-12-03 | 2022-06-09 | Robert Bosch Gesellschaft mit beschränkter Haftung | Selection of driving maneuvers for at least partially automated vehicles |
| JP7659260B2 (en) * | 2021-02-12 | 2025-04-09 | 古河電気工業株式会社 | EVALUATION APPARATUS, SYSTEM, EVALUATION METHOD, AND PROGRAM |
| CN113299084B (en) * | 2021-05-31 | 2022-04-12 | 大连理工大学 | Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning |
| WO2024116387A1 (en) * | 2022-12-01 | 2024-06-06 | 日本電信電話株式会社 | Information processing device, information processing method, and information processing program |
| CN119249366B (en) * | 2024-12-05 | 2025-02-11 | 中国海洋大学 | A multi-model fusion method for ocean waves based on reinforcement learning |
Non-Patent Citations (2)
| Title |
|---|
| LIAN, XINYU ET AL.: "A Human-Like Agent Based on a Hybrid of Reinforcement and Imitation Learning", IEICE TECHNICAL REPORT, vol. 118, no. 316, 15 November 2018 (2018-11-15), pages 45 - 50, ISSN: 0913-5685 * |
| MIYASHITA, SHOHEI ET AL.: "Developing Game AI Agent Behaving Like Human by Mixing Reinforcement Learning and Supervised Learning", PROCEEDINGS OF THE 18TH IEEE /ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, 26 June 2017 (2017-06-26), pages 489 - 494, XP033146395, ISBN: 978-1-5090-5504-3, DOI: 10.1109/SNPD.2017.8022767 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210374612A1 (en) * | 2020-05-26 | 2021-12-02 | Nec Laboratories America, Inc. | Interpretable imitation learning via prototypical option discovery |
| US12380360B2 (en) * | 2020-05-26 | 2025-08-05 | Nec Corporation | Interpretable imitation learning via prototypical option discovery for decision making |
| CN115269565A (en) * | 2022-05-16 | 2022-11-01 | 南京大学 | Abnormal recommendation data detection method and system based on reinforcement learning |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2020191022A (en) | 2020-11-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2020235693A1 (en) | Learning method, learning device, and learning program for ai agent that behaves like human | |
| Wurman et al. | Outracing champion Gran Turismo drivers with deep reinforcement learning | |
| Wu et al. | Prioritized experience-based reinforcement learning with human guidance for autonomous driving | |
| Justesen et al. | Deep learning for video game playing | |
| Wymann et al. | Torcs, the open racing car simulator | |
| Togelius et al. | Towards automatic personalised content creation for racing games | |
| Hämäläinen et al. | Online motion synthesis using sequential monte carlo | |
| WO2004032045A1 (en) | Idea model device, spontaneous feeling model device, method thereof, and program | |
| Perez et al. | Evolving a fuzzy controller for a car racing competition | |
| Togelius et al. | Computational intelligence in racing games | |
| Chaperot et al. | Improving artificial intelligence in a motocross game | |
| Butt et al. | The Development of Intelligent Agents: A Case-Based Reasoning Approach to Achieve Human-Like Peculiarities via Playback of Human Traces | |
| Muñoz et al. | Towards imitation of human driving style in car racing games | |
| Berthling-Hansen et al. | Automating behaviour tree generation for simulating troop movements (poster) | |
| Babadi et al. | Learning task-agnostic action spaces for movement optimization | |
| Kanervisto | Advances in deep learning for playing video games | |
| Ribeiro | Deep reinforcement learning for robot navigation systems | |
| Babadi et al. | Intelligent middle-level game control | |
| Cavadas et al. | Using provenance data and imitation learning to train human-like bots | |
| Perez et al. | Evolving a rule system controller for automatic driving in a car racing competition | |
| Gao et al. | Comparison of control methods based on imitation learning for autonomous driving | |
| Dwarakanath et al. | Learning to play collaborative-competitive games | |
| Law et al. | Hammers for Robots: Designing Tools for Reinforcement Learning Agents | |
| García et al. | Ensemble Approach to Adaptable Behavior Cloning for a Fighting Game AI | |
| Iqbal et al. | A goal-based movement model for continuous multi-agent tasks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20809465; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20809465; Country of ref document: EP; Kind code of ref document: A1 |