JP7634222B2 - Optimization device, optimization method, and optimization program

Info

Publication number: JP7634222B2
Application number: JP2021021962A
Authority: JP
Inventors: 高志竹川; 春輝高橋; 裕酒井; 朋樹深井
Original assignee: Okinawa Institute of Science and Technology Graduate University; Kogakuin University
Current assignee: Okinawa Institute of Science and Technology Graduate University; Kogakuin University
Priority date: 2021-02-15
Filing date: 2021-02-15
Publication date: 2025-02-21
Anticipated expiration: 2041-02-15
Also published as: JP2022124284A

Description

The present invention relates to an optimization device, an optimization method, and an optimization program.

Hidden Markov models, Bayesian estimation, reinforcement learning, and similar techniques are used as approaches to solving a given problem setting.

For example, Patent Document 1 discloses a hidden variable model estimation device that can perform model selection at high speed even when the number of model candidates grows exponentially with the number of hidden states and the types of observation probability. This device has a variational probability calculation unit that calculates the variational probability by maximizing a criterion value defined as a lower bound of the approximation obtained by applying a Laplace approximation to the marginalized log-likelihood function with respect to an estimator for the complete variables. The device also has a model estimation unit that estimates the optimal hidden variable model by estimating the type of observation probability and the parameters for each hidden state, and a convergence determination unit that determines whether the criterion value used by the variational probability calculation unit has converged.

Patent Document 2 discloses a system that selects actions to be performed by a reinforcement learning agent interacting with an environment. The system includes a step of processing in accordance with the current hidden state of a goal recurrent neural network (NN) so as to generate, for a time step, an initial goal vector in a goal space and to update the internal state of the goal recurrent NN.

Reinforcement learning is a technique in which the reward and the next state are determined for each combination of state and action.

Patent Document 1: Re-published PCT International Publication No. 2013/179579
Patent Document 2: Published Japanese Translation of PCT Application No. 2020-508524

Within the framework of reinforcement learning, a technique called Q-learning is the standard choice. Q-learning is defined for discrete states, but real-world problems involve an enormous number of observed states, so learning with ordinary Q-learning is often difficult.

In recent years, the Deep Q-Network (DQN), which combines Q-learning, in an extended form, with a multi-layer neural network, has been shown to be effective on a variety of tasks. A trained DQN performs very well, but its internal state is a black box, and it is unclear how it interprets the environment it is given. In addition, although performance after training is high, training requires many iterations, and little consideration is given to obtaining rewards effectively during training.

Algorithms based on Bayesian estimation that learn state transitions effectively while estimating the important hidden states from a huge number of observations are also widely known. However, because these methods treat reward prediction and state transitions separately, they analyze states unrelated to the reward in detail, and depending on the problem setting this wastes a large amount of memory and computation. Thompson sampling is known as a general-purpose technique that balances learning and reward acquisition in the framework called the multi-armed bandit problem, but it cannot be applied directly to complex problems.

The present invention has been made in view of the above circumstances, and its object is to provide an optimization device, an optimization method, and an optimization program that can improve computational efficiency.

In order to achieve the above object, the optimization device according to the present invention operates on a model that uses a system in which a state transition law, an observation law, and a reward law are defined, and that learns each law and obtains rewards by repeating an agent's actions. The device holds a unique hidden state, obtained by changing the hidden state into a predetermined form, and an estimate of the current state in the unique hidden state. The device includes a setting unit and an update unit. The setting unit defines the state transition law and the observation law such that the observation law uses the observation at time t as the condition of a conditional probability to obtain the unique hidden state, and the state transition law uses the unique hidden state at time t and the unique hidden state at time t+1 as the conditions of a conditional probability to obtain the agent's action. The update unit determines the agent's optimal action based on the Bellman equation, assuming the distribution of each probability parameter sampled on the basis of the laws and the estimate of the current state, and repeatedly updates the estimate of the current state and the distributions for the laws using the posterior distribution obtained by applying Bayes' theorem to the laws, a predetermined prior distribution, and observation information that includes the optimal action.

The optimization method according to the present invention causes a computer to execute the following processing on a model that uses a system in which a state transition law, an observation law, and a reward law are defined, and that learns each law and obtains rewards by repeating an agent's actions. A unique hidden state, obtained by changing the hidden state into a predetermined form, and an estimate of the current state in the unique hidden state are held. The state transition law and the observation law are defined such that the observation law uses the observation at time t as the condition of a conditional probability to obtain the unique hidden state, and the state transition law uses the unique hidden state at time t and the unique hidden state at time t+1 as the conditions of a conditional probability to obtain the agent's action. The agent's optimal action is determined based on the Bellman equation, assuming the distribution of each probability parameter sampled on the basis of the laws and the estimate of the current state. The estimate of the current state and the distributions for the laws are then updated repeatedly using the posterior distribution obtained by applying Bayes' theorem to the laws, a predetermined prior distribution, and observation information that includes the optimal action.

The optimization device, optimization method, and optimization program of the present invention have the effect of making it possible to improve computational efficiency.

FIG. 1 is a diagram showing an example of a transition diagram of a conventional method and a transition diagram of the method of this embodiment with respect to the estimation of states and laws.
FIG. 2 is a diagram showing the functional configuration of the optimization device according to the embodiment of the present invention.
FIG. 3 is a block diagram showing the hardware configuration of the optimization device.
FIG. 4 is a diagram showing the optimization processing routine of the optimization device according to the embodiment of the present invention.
FIG. 5 is a graph showing an example of experimental results for the method of this embodiment and other methods.
FIG. 6 is a table showing the number of hidden states at convergence in the experiment.

An embodiment of the present invention is described in detail below with reference to the drawings.

First, the principle underlying the embodiment of the present invention is explained.

FIG. 1 shows an example of a transition diagram of a conventional method and a transition diagram of the method of this embodiment with respect to the estimation of states and laws. As a basic principle, the estimation of states and laws in the conventional method is described first. The conventional method and the method of this embodiment share an internal model that uses a system (transition diagram) in which a state transition law, an observation law, and a reward law are defined, and that learns each law and obtains rewards by repeating the agent's actions. The upper part of FIG. 1 is the transition diagram for the estimation of states and laws in the conventional method. Suppose an environment is given in which an observation o_t is obtained at time t, and selecting an action a_t yields a reward r_t and the next observation o_{t+1}. The goal is to choose actions that make the long-term reward Σ_{τ=0}^{∞} γ^τ r_{t+τ}, with discount rate γ, as large as possible. It is further assumed that a hidden state lies behind the observation o_t, that the hidden state changes stochastically depending on the action, and that the reward is also determined stochastically.
Typically, with the hidden state denoted s_t, a system is assumed in which the state transition law p(s_{t+1} | s_t, a_t), the observation law p(o_t | s_t), and the reward law p(r_t | s_t, a_t) are defined. Hereinafter these are called the transition law, the observation law, and the reward law. These laws are unknown to the algorithm (agent) that pursues the goal, so the agent must learn the laws by repeating actions while obtaining high rewards at the same time.
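The long-term reward Σ_{τ=0}^{∞} γ^τ r_{t+τ} defined above can be evaluated numerically for a recorded reward sequence. The following is a minimal sketch in Python, assuming the infinite sum is truncated to a finite list of rewards; the function name `discounted_return` is hypothetical and does not appear in the source.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**tau * rewards[tau] over a finite reward sequence.

    Assumption: the infinite sum of the text is truncated at len(rewards).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma is expected to lie in [0, 1)")
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A smaller discount rate γ makes the agent weigh near-term rewards more heavily, which is why the choice of γ changes which action sequence counts as best.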

In the principle of the embodiment of the present invention, by contrast, the agent holds, as its internal model of the environment, a unique hidden state s'_t and an estimate q(s'_t) of the current state. Here, s'_t is assumed to be a simplification of s_t that focuses on the elements related to the reward. The existence of s_t and of its transition law is assumed, but s_t is not actually estimated. Given that some complex state and transition law exist, s'_t is what remains after states that are unnecessary from the standpoint of reward are excluded, without considering that state and law directly.

The method of this embodiment also has internal models of the transition law, the observation law, and the reward law, but it is characterized by using the forms p(s'_t | o_t) and p(a_t | s'_t, s'_{t+1}) for the observation law and the transition law, which differ from the actual laws. The reward law p(r_t | s'_t, a_t) has the same form as the actual law. With these forms, the transition diagram for the estimation of states and laws in this embodiment becomes the one shown in the lower part of FIG. 1. The agent holds the parameters of each law as probability distributions and learns them from the observation results. Specifically, for the probability parameters M'(s'_t, s'_{t+1}, a_t) = p(a_t | s'_t, s'_{t+1}), N'(s'_t, o_t) = p(o_t | s'_t), and L(s'_t, a_t, r_t) = p(r_t | s'_t, a_t), the parameter predictions q(M'), q(N'), and q(L) are set. q(M') and q(N') can easily be converted into q(M) and q(N), which correspond to the actual laws.

The parameters are explained here. For example, for p(r | s, L), if the possible values of r in state s1 are r1, r2, and r3, then L(s1, r1) represents the probability of obtaining r1, and L(s1, r1) + L(s1, r2) + L(s1, r3) = 1, so L can be expressed as a matrix. M is a tensor indexed by s, s', and a, and its sum over a is 1.
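The normalization constraints on L and M described above can be made concrete with small nested lists. The sketch below is illustrative only: the sizes (two hidden states, three reward values, two actions), the numeric weights, and the helper names `normalize` and `sums_to_one` are assumptions, not values from the source.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def sums_to_one(probs, tol=1e-9):
    """Check that a list of probabilities is normalized."""
    return abs(sum(probs) - 1.0) < tol


# L[s][r]: probability of reward value r (r1, r2, r3) in hidden state s.
# Each row sums to 1, e.g. L(s1, r1) + L(s1, r2) + L(s1, r3) = 1.
L = [normalize([2, 1, 1]), normalize([1, 1, 2])]

# M[s][s_next][a]: p(a | s, s_next). The sum over the action index a is 1.
M = [
    [normalize([3, 1]), normalize([1, 1])],
    [normalize([1, 3]), normalize([1, 1])],
]

print(all(sums_to_one(row) for row in L))
print(all(sums_to_one(M[s][s2]) for s in range(2) for s2 in range(2)))
```

Note that M is normalized over the action index rather than over the next state, which reflects the form p(a | s, s') used by this embodiment instead of the conventional p(s' | s, a).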

The agent operates according to the following procedure.
[1] Sample M, N, and L according to the probability distributions q(M), q(N), and q(L).
[2] Determine and output the optimal action a_t based on the Bellman equation, assuming that the sampled M, N, and L and the state estimate q(s'_t) are correct. As a result of outputting the optimal action a_t, r_t and o_{t+1} are obtained as new information.
[3] Apply Bayes' theorem to the laws p(a_t | s'_t, s'_{t+1}), p(o_t | s'_t), and p(r_t | s'_t, a_t), the prior distributions q(s'_t), q(M'), q(N), and q(L), and the observed information o_t, a_t, r_t, and o_{t+1}. The resulting posterior distribution replaces the current knowledge as the new q(s'_{t+1}), q(M'), q(N), and q(L).
[4] Return to [1] and repeat.
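Procedure [1] above draws concrete parameter values from the agent's current distributions. The source does not state the family of q(M), q(N), and q(L); the sketch below assumes each row of a categorical parameter table has a Dirichlet distribution, which is the usual conjugate choice, and samples it with the Python standard library. The names `sample_dirichlet` and `sample_table` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha).

    Uses the standard construction: independent Gamma(alpha_i, 1) draws
    divided by their sum. Assumption: q(.) is Dirichlet (not stated in the source).
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_table(alpha_table, rng):
    """Sample every row of a parameter table (e.g. L[s][r]) independently."""
    return [sample_dirichlet(row, rng) for row in alpha_table]


rng = random.Random(0)
# Concentration parameters for a 2-state, 3-reward table (illustrative values).
alpha_L = [[1.0, 1.0, 1.0], [5.0, 1.0, 1.0]]
sampled_L = sample_table(alpha_L, rng)
print(sampled_L)
```

Rows with small concentration parameters vary widely between draws, so poorly known parts of the model are the ones that get perturbed most from one iteration to the next.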

The Bellman equation gives the expected value of the long-term reward q when action a is taken, so the action with the largest q is simply selected. Applying Bayes' theorem means computing the posterior distribution p(s, s', L, M, N | o, o', r, r') from the laws p(a | s, s', M), p(s | o, L), and p(r | s, N) and the prior distribution p(L, M, N). An example of applying the posterior calculation is shown below.
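As a simplified illustration of the state part of this posterior (not the worked example of the source), the following Python sketch updates the belief over the unique hidden state while holding the parameters fixed at point values. The factorization used, q(s'_t) · p(r_t | s'_t, a_t) · p(a_t | s'_t, s'_{t+1}) · p(s'_{t+1} | o_{t+1}), is an assumption made for this sketch, and the function name `update_state_belief` is hypothetical. The index convention follows the definitions M'(s'_t, s'_{t+1}, a_t) and L(s'_t, a_t, r_t) given earlier.

```python
def update_state_belief(belief, next_given_obs, M, L, action, reward_index):
    """Return q(s'_{t+1}) from q(s'_t) and one observed (a_t, r_t, o_{t+1}).

    belief[s]          : q(s'_t = s)
    next_given_obs[s2] : p(s'_{t+1} = s2 | o_{t+1})
    M[s][s2][a]        : p(a_t = a | s'_t = s, s'_{t+1} = s2)
    L[s][a][r]         : p(r_t = r | s'_t = s, a_t = a)
    Assumption: parameters are fixed point values; the source updates
    q(M'), q(N), q(L) jointly, which this sketch does not do.
    """
    n = len(belief)
    # Unnormalized joint posterior over the pair (s'_t, s'_{t+1}).
    joint = [
        [
            belief[s] * L[s][action][reward_index]
            * M[s][s2][action] * next_given_obs[s2]
            for s2 in range(n)
        ]
        for s in range(n)
    ]
    total = sum(sum(row) for row in joint)
    if total == 0.0:
        raise ValueError("observed data has zero probability under the model")
    # Marginalize out s'_t and normalize.
    return [sum(joint[s][s2] for s in range(n)) / total for s2 in range(n)]


# Two hidden states, two actions, two reward values (illustrative numbers).
M = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]
L = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
print(update_state_belief([0.5, 0.5], [0.5, 0.5], M, L, action=0, reward_index=0))
```

In this example the reward and the observation are uninformative, so the new belief is driven entirely by which next state makes the chosen action more likely.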

In addition, in the method of this embodiment the posterior distribution is computed with variational Bayes until it converges at every step, whereas conventional implementations perform only one variational Bayes update step each time without iterating to convergence. The method of this embodiment therefore requires more computation, but a large improvement in online performance can be expected.

The features of the above procedure are as follows. Regarding procedure [1], applying Thompson sampling to state transitions has not been attempted in conventional methods. The mainstream conventional approach is to predict q with a neural network, in a form that merges [1] and [2]. Regarding procedure [2], computing q when M, N, and L are known is a standard technique, and the standard practice is to choose the action stochastically with a softmax over q. In contrast, because the method of this embodiment already samples in [1], it does not use softmax and simply selects the action with the largest q. Regarding procedure [3], the conventional method assumes the probability models p(s_{t+1} | s_t, a_t) and p(o_t | s_t), whereas the method of this embodiment is formulated on the assumption of p(a_t | s'_t, s'_{t+1}) and p(s'_t | o_t), which is a major difference. Furthermore, the conventional method assumes that the observation o_t follows a simple categorical distribution, whereas the method of this embodiment extends this to a direct product of categorical distributions.
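The contrast drawn above for procedure [2], stochastic softmax selection versus simply taking the action with the largest q, can be sketched as follows. The temperature parameter and the function names are assumptions for illustration; the source does not specify them.

```python
import math
import random


def softmax_probs(q, temperature=1.0):
    """Action probabilities proportional to exp(q / temperature)."""
    m = max(q)  # subtract the maximum for numerical stability
    exps = [math.exp((v - m) / temperature) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q, rng, temperature=1.0):
    """Conventional choice: sample an action from the softmax distribution."""
    return rng.choices(range(len(q)), weights=softmax_probs(q, temperature))[0]


def greedy_action(q):
    """Choice described for this embodiment: the action with the largest q."""
    return max(range(len(q)), key=q.__getitem__)


q_values_example = [0.1, 0.7, 0.2]
print(softmax_probs(q_values_example))
print(greedy_action(q_values_example))
```

With softmax the randomness is injected at action selection; in the procedure of this embodiment the randomness comes from the sampled parameters in [1], so the selection itself can be deterministic.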

In the above example, variational Bayes is used for the calculation, but variational Bayes is not essential to the procedure itself, and other calculation methods may be used.

<Configuration of the optimization device according to the embodiment of the present invention>
Next, the configuration of the optimization device according to the embodiment of the present invention is described.

FIG. 2 is a diagram showing the functional configuration of the optimization device 100 according to the embodiment of the present invention. As shown in FIG. 2, the optimization device 100 functionally comprises a setting unit 110, an update unit 112, and a storage unit 120.

The setting unit 110 sets the definitions of the state transition law and the observation law, and stores the settings in the storage unit 120. In the subsequent processing, the settings are read out as needed for each law.

As described in the principle above, the settings cause the device to hold the unique hidden state s'_t, obtained by changing the hidden state s_t into a predetermined form, and the estimate q(s'_t) of the current state in the unique hidden state. Under the settings, the observation law uses the observation o_t at time t as the condition of a conditional probability and yields the unique hidden state s'_t, that is, p(s'_t | o_t). Under the settings, the state transition law uses the unique hidden state s'_t at time t and the unique hidden state s'_{t+1} at time t+1 as the conditions of a conditional probability and yields the agent's action a_t, that is, p(a_t | s'_t, s'_{t+1}).

The update unit 112 repeatedly determines the optimal action a_t of the agent and updates the distributions. The update is repeated until a predetermined condition is satisfied. To determine the optimal action a_t, the update unit 112 first samples the parameters M, N, and L, each representing a probability, on the basis of each law, following procedure [1] of the agent's operation described above. Next, following procedure [2], it assumes the sampled parameters M, N, and L and the estimate q(s'_t) of the current state, and determines and outputs the optimal action a_t based on the Bellman equation. As a result of outputting a_t, r_t and o_{t+1} are obtained as new information. Following procedure [3], the posterior distribution is obtained by applying Bayes' theorem to the laws p(a_t | s'_t, s'_{t+1}), p(o_t | s'_t), and p(r_t | s'_t, a_t), the predetermined prior distributions q(s'_t), q(M'), q(N), and q(L), and the observation information o_t, a_t, r_t, and o_{t+1}, which includes the optimal action. By repeatedly updating the estimate of the current state and the distributions for each law, q(s'_{t+1}), q(M'), q(N), and q(L), with the obtained posterior distribution, the update unit 112 outputs the finally converged distributions.
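The action-selection part of the update unit, procedure [2], can be sketched as value iteration on one sampled model followed by a greedy choice. Several simplifications here are assumptions rather than details from the source: the sampled parameters are taken to be already converted to the conventional form T[s][a][s2] = p(s2 | s, a), the reward law is reduced to its expectation R[s][a], and the belief q(s'_t) is combined with the action values by simple weighting (a QMDP-style approximation). All function names are hypothetical.

```python
def q_values(T, R, gamma, sweeps=200):
    """Iterate the Bellman optimality equation for a sampled model.

    T[s][a][s2] : assumed conventional transition probabilities p(s2 | s, a)
    R[s][a]     : assumed expected immediate reward
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        V = [max(row) for row in Q]  # value of acting greedily afterwards
        Q = [
            [
                R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
                for a in range(n_a)
            ]
            for s in range(n_s)
        ]
    return Q


def best_action(Q, belief):
    """Pick the action with the largest belief-weighted value (QMDP-style)."""
    n_a = len(Q[0])
    expected = [
        sum(belief[s] * Q[s][a] for s in range(len(Q))) for a in range(n_a)
    ]
    return max(range(n_a), key=expected.__getitem__)


# Two states that never change; each state rewards a different action.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [1.0, 0.0]]
Q = q_values(T, R, gamma=0.5)
print(best_action(Q, belief=[0.9, 0.1]))
```

Because the model passed to `q_values` is a fresh sample each iteration, the greedy choice still explores: a different sample can make a different action look best.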

The storage unit 120 stores the settings for each law made by the setting unit 110, the intermediate calculation data of the update unit 112, and the calculation results.

FIG. 3 is a block diagram showing the hardware configuration of the optimization device 100.

As shown in FIG. 3, the optimization device 100 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are connected via a bus 19 so that they can communicate with one another.

The CPU 11 is a central processing unit that executes various programs and controls each component. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls the above components and performs various kinds of arithmetic processing in accordance with the program stored in the ROM 12 or the storage 14. In this embodiment, the optimization program is stored in the ROM 12 or the storage 14.

The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.

The input unit 15 includes a keyboard and a pointing device such as a mouse, and is used for various kinds of input.

The display unit 16 is, for example, a liquid crystal display, and displays various kinds of information. The display unit 16 may be a touch panel and also function as the input unit 15.

The communication interface 17 is an interface for communicating with other devices such as terminals, and uses a standard such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark).

<Operation of the optimization device according to the embodiment of the present invention>
Next, the operation of the optimization device 100 according to the embodiment of the present invention is described. The CPU 11, functioning as each unit of the optimization device 100, executes the optimization processing routine shown in FIG. 4.

In step S100, the CPU 11 sets the definitions of the state transition law and the observation law, and stores the settings in the storage unit 120.

In step S102, the CPU 11 samples the parameters M, N, and L in accordance with procedure [1] of the agent's operation.

In step S104, in accordance with procedure [2], the CPU 11 assumes the sampled parameters M, N, and L and the estimate q(s'_t) of the current state, and determines the optimal action a_t of the agent based on the Bellman equation.

In step S105, the CPU 11 outputs the optimal action a_t and, as a result, obtains r_t and o_{t+1} as new information.

In step S106, in accordance with procedure [3], the CPU 11 obtains the posterior distribution by applying Bayes' theorem to the laws, the predetermined prior distributions, and the observation information that includes the optimal action a_t.

In step S108, the CPU 11 determines whether the predetermined update condition is satisfied. If the condition is satisfied, the processing proceeds to step S110. If it is not satisfied, the processing returns to step S102 and is repeated (procedure [4]).

In step S110, the CPU 11 outputs the finally obtained distributions and ends the processing.
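The control flow of steps S102 to S110 can be summarized as a loop skeleton. The sketch below shows only the ordering of the steps; the callables `sample_params`, `choose_action`, `act`, `update_posterior`, and `should_stop`, and the `knowledge` object they share, are hypothetical placeholders supplied by the caller, not interfaces defined in the source.

```python
def run_optimization(knowledge, sample_params, choose_action, act,
                     update_posterior, should_stop):
    """Skeleton of the routine of FIG. 4 with injected step functions.

    Assumption: step S100 has already been carried out, so `knowledge`
    holds the configured laws, the priors, and the current state estimate.
    """
    while True:
        params = sample_params(knowledge)            # S102: procedure [1]
        action = choose_action(params, knowledge)    # S104: procedure [2]
        reward, observation = act(action)            # S105: new r_t, o_{t+1}
        knowledge = update_posterior(                # S106: procedure [3]
            knowledge, action, reward, observation
        )
        if should_stop(knowledge):                   # S108: end condition
            return knowledge                         # S110: output result


# Toy usage: `knowledge` is just an iteration counter that stops at 3.
result = run_optimization(
    knowledge=0,
    sample_params=lambda k: None,
    choose_action=lambda params, k: 0,
    act=lambda action: (1.0, 0),
    update_posterior=lambda k, a, r, o: k + 1,
    should_stop=lambda k: k >= 3,
)
print(result)
```

Passing the steps in as functions keeps the loop independent of how sampling, planning, and the Bayesian update are actually implemented.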

According to the embodiment of the present invention described above, computational efficiency can be improved. The technology has three main points. The first is that state estimation by a probabilistic model is integrated with the decision-making problem. The second is that the observation law and the transition law are given a unique form rather than the usual form. The third is the use of sampling in the procedure.

The first point, the integration of state estimation by a probabilistic model with the decision-making problem, is explained first. Various methods, such as the Kalman filter, have been proposed for models that estimate hidden states from observations. There is also a large body of research on decision-making problems in state transition environments, with Q-learning as a representative example. However, although the combined problem is important in practice, only limited work has addressed it. With respect to the first point, the technology of this embodiment is a method that integrates state estimation with the decision-making problem.

The second point, the unique form of the laws, is explained next. In this embodiment, the observation law and the transition law take the forms p(s'_t | o_t) and p(a_t | s'_t, s'_{t+1}). With this formulation, the observation o_t is treated in the probabilistic model as a given fact rather than as a value to be estimated. In the usual formulation, every observation o_t is treated as distinct regardless of the reward, and the method tries to estimate the entire true transition law. In the method of this embodiment, by contrast, it is sufficient to predict the reward rather than the observation o_t, so the transition law is reconstructed from only the hidden states s'_t needed to predict the reward, not from the true transition law. This reduces the number of hidden states and makes efficient learning possible. In human cognition, for example, only a small number of states are truly needed for decision-making compared with the enormous volume of visual and auditory observations. Using these laws, such a sophisticated mechanism can be realized effectively with a simple model.

The third point, the use of sampling in the procedure, is explained next. Given an uncertain internal model, an important question is whether to exploit, that is, to obtain as much reward as possible based on the current estimate, or to explore, that is, to gather information for future rewards. In this embodiment, exploitation is carried out by procedures [2] and [3] described in the principle, but because the uncertainty of the underlying estimate is taken into account by the sampling in procedure [1], the most effective exploration can be performed. What is novel is the gain in efficiency obtained by combining a state estimation model with decision-making by sampling.

As described above, the technology of the embodiment of the present invention makes it possible to improve computational efficiency. Specifically, for a model obtained by adding rewards to a hidden Markov model, in which the hidden state transitions depending on the state and the action and an observation is generated, it provides an algorithm that correctly estimates the state transitions in a small number of trials while keeping opportunity loss low, and whose results are highly explainable.

Opportunity loss is a loss associated with the agent's actions. For example, if taking action a1 would yield a reward r1 but the agent takes action a2 and obtains r2, the opportunity loss is r1 - r2. When information is incomplete in the initial state, the opportunity loss cannot be reduced to zero. Consequently, opportunity loss grows when the agent acts on incomplete information, or when it takes exploratory actions even though it already has sufficient information. What matters is that the average opportunity loss is small when the algorithm is run repeatedly over a long period; reducing opportunity loss and increasing the total reward obtained mean almost the same thing.
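The definition above can be written out as a small Python helper. The function names and the per-trial list representation are assumptions for illustration; the specification only defines the quantity r1 - r2.

```python
def opportunity_loss(best_reward, obtained_reward):
    """Loss for one trial: reward of the best action minus the reward obtained."""
    return best_reward - obtained_reward


def average_opportunity_loss(best_rewards, obtained_rewards):
    """Mean opportunity loss over a sequence of trials."""
    losses = [
        opportunity_loss(best, got)
        for best, got in zip(best_rewards, obtained_rewards)
    ]
    return sum(losses) / len(losses)
```

Summed over the trials, the total obtained reward equals the total best-case reward minus the total opportunity loss, which is why minimizing one is nearly the same as maximizing the other.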

[Experimental Results]
The experimental results for the method of this embodiment are described below. FIG. 5 is a graph showing an example of experimental results for the method of this embodiment and for other methods; it plots the cumulative reward against the number of trials for each estimation method. The proposed method of this embodiment is O_mul→S→R. The proposed method reaches the optimal reward earliest. FIG. 6 is a table showing the number of hidden states at convergence in the experiment. With the method of this embodiment, 8 reward-relevant hidden states are estimated from 32 kinds of observations. This is what enables fast learning.

A supplementary note on the laws of this embodiment follows. The laws shown in FIG. 1 are the "true laws," which are not disclosed to the agent, and under them observations are generated from hidden states. In probabilistic modeling it is common to try to recover the true laws, but in the problem setting assumed in this embodiment it is not always necessary to estimate every part of the complex true laws. In particular, modeling the part in which an observation o is obtained from a state s means learning a law that predicts which observation will be obtained next. In practice, however, the observation o is directly available, so it does not need to be predicted; it is enough to predict the hidden state s and the reward r. The method of this embodiment is the result of devising a formulation that deliberately avoids predicting o.

For example, suppose that under the true laws the four states {s0, s1, s2, s3} correspond one-to-one to the observations {o0, o1, o2, o3}, but the four states are equivalent from the viewpoint of reward. For the purpose of obtaining reward, every observation is then equivalent. A conventional method learns to distinguish that o0 is generated from state s0, that o2 is generated from state s1, and so on, and therefore requires a correspondingly large number of states. The method of this embodiment instead learns that observing o0, o1, o2, or o3 indicates one common state. The true laws are not necessarily tied to a single interpretation; they can equally be read as a single state from which observations are generated probabilistically. In that sense, s and s' correspond completely equivalently in this example.
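The four-observation example can be mimicked with a deliberately simplified Python sketch that merges observations whose rewards are identical. This is not the Bayesian estimation of the embodiment and it ignores transitions entirely; the per-observation "reward profile" (one reward per action) and the function name are assumptions introduced only to illustrate the idea of reward-equivalent observations sharing a state.

```python
def group_by_reward_profile(reward_profiles):
    """Assign one hidden-state index to all observations with the same rewards.

    reward_profiles maps an observation label to a sequence of rewards,
    one entry per action (a hypothetical representation).
    """
    state_of_profile = {}
    assignment = {}
    for obs, profile in reward_profiles.items():
        key = tuple(profile)
        if key not in state_of_profile:
            # First time this reward profile is seen: open a new hidden state.
            state_of_profile[key] = len(state_of_profile)
        assignment[obs] = state_of_profile[key]
    return assignment
```

With four observations o0 to o3 that all share the same profile, every observation is assigned hidden state 0, so a single state suffices where a model that explains the observations would need four.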

In general, an agent using a conventional method always constructs states and a transition model that can explain the observation o. In the method of this embodiment, by contrast, many interpretations are possible, and the state is reconstructed simply from the viewpoint of reward without being bound by observations that appear diverse and complex. Moreover, since p(o|s) can be obtained from p(s|o) and p(o) by Bayes' theorem, s and s' can be regarded as equivalent models.
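The Bayes inversion mentioned here, p(o|s) = p(s|o) p(o) / Σ_o p(s|o) p(o), can be sketched in Python as follows. The array layout and the function name are assumptions, and the sketch presumes every hidden state has nonzero marginal probability.

```python
import numpy as np


def invert_observation_law(p_s_given_o, p_o):
    """Recover p(o | s) from p(s | o) and p(o) via Bayes' theorem.

    p_s_given_o[o, s] = p(s | o), p_o[o] = p(o).
    Returns an array whose entry [o, s] is p(o | s); each column sums to 1.
    """
    joint = p_s_given_o * p_o[:, None]  # joint[o, s] = p(s | o) p(o)
    p_s = joint.sum(axis=0)             # marginal p(s), assumed nonzero
    return joint / p_s
```

For four equally likely observations that all map to one common hidden state, the result is p(o|s) = 0.25 for each observation, matching the reading that a single state generates the observations probabilistically.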

The technology of the embodiment of the present invention is applicable to a wide range of problem settings in which observations and rewards are given, and it learns faster than conventional methods, so a wide variety of applications are conceivable. In particular, it is expected to be used in robotics and to improve the performance of AI-based agent systems.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

100 Optimization device
110 Setting unit
112 Update unit
120 Storage unit

Claims

An optimization device for a model that uses a system in which each law is defined based on a state transition law, a law of the observed state, and a reward law, and in which an agent repeats actions to learn each of the laws and obtain a reward, the optimization device comprising:
a setting unit that holds a unique hidden state, obtained by modifying the hidden state in a predetermined manner, and an estimate of the current state in the unique hidden state, and that defines the state transition law and the law of the observed state such that
the law of the observed state uses the observation at time t as the condition of a conditional probability that gives the unique hidden state, and
the state transition law uses the unique hidden state at time t and the unique hidden state at time t+1 as the conditions of a conditional probability that gives the agent's action; and
an update unit that determines an optimal action for the agent based on a Bellman equation assuming a distribution of each parameter representing a probability of sampling based on each of the laws and an estimate of the current state, and
that repeats estimating the current state and updating the distribution using each of the laws, based on a posterior distribution obtained by applying Bayes' theorem to observation information including each of the laws, a predetermined prior distribution, and the optimal action.

The optimization device according to claim 1, wherein the update unit selects an action that maximizes the long-term reward q based on the Bellman equation by sampling each parameter in the agent's action procedure.

An optimization method in which a computer executes a process for a model that uses a system in which each law is defined based on a state transition law, a law of the observed state, and a reward law, and in which an agent repeats actions to learn each of the laws and obtain a reward, the process comprising:
holding a unique hidden state, obtained by modifying the hidden state in a predetermined manner, and an estimate of the current state in the unique hidden state;
defining the state transition law and the law of the observed state such that
the law of the observed state uses the observation at time t as the condition of a conditional probability that gives the unique hidden state, and
the state transition law uses the unique hidden state at time t and the unique hidden state at time t+1 as the conditions of a conditional probability that gives the agent's action;
determining an optimal action for the agent based on a Bellman equation assuming a distribution of each parameter representing a probability of sampling based on each of the laws and an estimate of the current state; and
repeating estimating the current state and updating the distribution using each of the laws, based on a posterior distribution obtained by applying Bayes' theorem to observation information including each of the laws, a predetermined prior distribution, and the optimal action.

A program for causing a computer to function as each part of the optimization device according to claim 1 or claim 2.