WO2024180656A1

WO2024180656A1 - Learning device, control device, control system, learning method, and storage medium

Info

Publication number: WO2024180656A1
Application number: PCT/JP2023/007289
Authority: WO
Inventors: 凜高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2023-02-28
Filing date: 2023-02-28
Publication date: 2024-09-06
Anticipated expiration: 2025-08-28
Also published as: JPWO2024180656A1

Abstract

In the present invention, a level set function learning unit learns a level set function that takes as input an operation parameter defining an operation goal and an operation environment of a robot, together with a control parameter of the robot, and outputs an evaluation value relating to the achievability of the operation goal based on the operation parameter and the control parameter. A high-level controller learning unit learns, on the basis of the level set function and the prediction accuracy of the control parameter, a high-level controller that determines a control parameter for causing the robot to realize a target operation based on the operation parameter.

Description

Learning device, control device, control system, learning method, and storage medium

This application relates to a learning device, a control device, a control system, a learning method, and a storage medium.

Patent Document 1 describes an information processing device that generates one or more virtual models of operating machines and virtually operates the operating machines in a virtual environment in which the generated virtual models are placed. When the parameters specify multiple types of operating machine, the information processing device runs a simulation for each type. The information processing device also uses the detection results of sensors in the virtual environment to operate the operating machine to be learned with arbitrary control content, and determines whether a preset operation result is obtained.

Patent Document 1: JP 2019-153246 A

The information processing device described in Patent Document 1 merely determines whether the operating machine to be learned obtains a preset operation result. Depending on the operating environment, it may be desirable to adjust the parameters used during learning. Because the operation result for the adjusted parameters cannot be known in advance, the device cannot determine whether the operation will succeed.

An object of the present application is to provide a learning device, a control device, a control system, a learning method, and a storage medium that solve the above problem.

According to a first aspect of the present application, a learning device includes: a level set function learning unit that learns a level set function that takes as input operation parameters defining an operation goal and an operation environment of a robot and control parameters of the robot, and outputs an evaluation value regarding the reachability of the operation goal based on the operation parameters and the control parameters; and a high-level controller learning unit that learns, based on the level set function and the prediction accuracy of the control parameters, a high-level controller that determines control parameters for causing the robot to realize a target operation based on the operation parameters.

According to a second aspect of the present application, in a learning method, a learning device executes: a level set function learning step of learning a level set function that takes as input operation parameters defining an operation goal and an operation environment of a robot and control parameters of the robot, and outputs an evaluation value regarding the reachability of the operation goal based on the operation parameters and the control parameters; and a high-level controller learning step of learning, based on the level set function and the prediction accuracy of the control parameters, a high-level controller that determines control parameters for causing the robot to realize a target operation based on the operation parameters.

According to a third aspect of the present application, a control device includes: a high-level control unit that determines control parameters for realizing a target operation based on operation parameters that define an operation goal and an operation environment of a robot; and an operation planning unit that uses a level set function to calculate an evaluation value indicating the feasibility of the target operation based on the operation parameters and the control parameters, uses the control parameters for operation control of the robot when it determines from the evaluation value that the target operation is feasible, and searches, based on the evaluation value, for control parameters that make the target operation feasible when it determines from the evaluation value that the target operation is not feasible.
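The decision logic of the planning unit in the third aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the sign convention (an evaluation value of 0 or less means feasible), the function names, the toy level set function, and the random-sampling search are all hypothetical choices made for the example.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def plan_control_parameters(
    level_set: Callable[[Vector, Vector], float],
    high_level_controller: Callable[[Vector], List[float]],
    operation_params: Vector,
    n_candidates: int = 200,
    search_radius: float = 1.0,
    seed: int = 0,
) -> Optional[List[float]]:
    """Return control parameters judged feasible, or None if the search fails."""
    proposal = high_level_controller(operation_params)
    # Assumed convention: an evaluation value <= 0 means the target operation is feasible.
    if level_set(operation_params, proposal) <= 0.0:
        return proposal

    # Not feasible: sample candidates around the proposal and keep the one
    # with the lowest evaluation value (a simple stand-in for the search step).
    rng = random.Random(seed)
    best, best_value = None, float("inf")
    for _ in range(n_candidates):
        candidate = [p + rng.uniform(-search_radius, search_radius) for p in proposal]
        value = level_set(operation_params, candidate)
        if value < best_value:
            best, best_value = candidate, value
    return best if best_value <= 0.0 else None


# Toy level set: the goal is reachable when the control parameter is within 0.5 of it.
def toy_level_set(operation_params: Vector, control_params: Vector) -> float:
    return abs(control_params[0] - operation_params[0]) - 0.5


def biased_controller(operation_params: Vector) -> List[float]:
    return [operation_params[0] + 0.8]  # deliberately inaccurate prediction


result = plan_control_parameters(toy_level_set, biased_controller, [2.0])
```

In this toy run the controller's proposal is rejected by the level set function, so the returned parameters come from the search step. An accurate controller would have its proposal returned unchanged.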

According to one aspect of the present application, whether the goal state can be reached can be determined even when the control parameters used to train the high-level controller are modified.

FIG. 1 is a diagram showing an example of the configuration of a control system according to the first embodiment.
FIG. 2 is a diagram showing an example of known task parameters.
FIG. 3 is a diagram showing an example of unknown task parameters.
FIG. 4 is a diagram showing an example of the hardware configuration of the learning device according to the first embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the robot controller according to the first embodiment.
FIG. 6 is a diagram illustrating the robot according to the first embodiment.
FIG. 7 is a diagram illustrating a system state represented in an abstract space.
FIG. 8 is a diagram showing an example of the configuration of a control system related to skill execution according to the first embodiment.
FIG. 9 is a diagram showing an example of the functional configuration of the learning device related to updating the skill database according to the first embodiment.
FIG. 10 is a diagram showing an example of the configuration of the skill learning unit according to the first embodiment.
FIG. 11 is a data flow diagram illustrating the data flow in the skill learning unit according to the first embodiment.
FIG. 12 is a flowchart illustrating the learning process according to the first embodiment.
FIG. 13 is a flowchart illustrating the operation plan according to the first embodiment.
FIG. 14 is a data flow diagram illustrating the data flow in the skill learning unit according to the second embodiment.
FIG. 15 is a flowchart illustrating the learning process according to the second embodiment.
FIG. 16 is a data flow diagram illustrating the data flow in the skill learning unit according to the third embodiment.
FIG. 17 is a schematic block diagram showing an example of the functional configuration of the system model learning unit according to the third embodiment.
FIG. 18 is a flowchart illustrating the learning process according to the third embodiment.
FIG. 19 is a schematic block diagram showing an example of the minimum configuration of a learning device according to an embodiment of the present application.
FIG. 20 is a schematic block diagram showing an example of the minimum configuration of a control device according to an embodiment of the present application.

Embodiments of the present application are described below with reference to the drawings. The following description is not intended to limit the scope of the claims. Not all combinations of the technical features described in the embodiments are necessarily essential to solving the problem; that is, even if part of a combination is omitted, the remainder may still lead to a solution. For convenience, a character consisting of an arbitrary symbol "x" placed above an arbitrary letter "A" may be written as "A^x" in this application. For example, the notation y^ may denote the character formed by placing the symbol ^ directly above the letter y.

<First Embodiment>
(1) System Configuration
An example of the system configuration of the control system 100 according to the first embodiment will be described.
FIG. 1 is a diagram showing an example of the configuration of the control system 100 according to the first embodiment. The control system 100 includes a learning device 1, a storage device 2, a robot controller 3, a measuring device 4, and a robot 5. The learning device 1 is connected to the storage device 2, wirelessly or by wire, so that various types of data can be input and output. The robot controller 3 is connected to each of the storage device 2, the measuring device 4, and the robot 5, wirelessly or by wire, so that various types of data can be input and output. The connection between the learning device 1 and the storage device 2, and the connections between the robot controller 3 and each of the storage device 2, the measuring device 4, and the robot 5, may be made directly or via a communication network.

The learning device 1 learns the actions of the robot 5 for executing a given task. A machine learning method such as self-supervised learning (SSL) is used to learn the actions. The learning device 1 also learns the set of system states in which the action to be learned can be executed.

In this application, the actions and system states to be learned by the learning device 1 are not limited. The learning device 1 can target any action and system information that is controllable and whose control can be learned. The processing to be controlled is not limited to actions that involve a change in position or form. For example, the actions of the robot 5 to be controlled may include acquiring measurement data using a sensor.

The system state refers to the state of the system to be controlled, which includes the robot 5 and the operating environment of the robot 5. The robot 5 and its operating environment are sometimes collectively referred to as the target system, or simply the system. For tasks that involve handling an object, such as a task of grasping an object, the object may also be included in the target system.

The state of the target system is sometimes called the system state, or simply the state. The system state at task completion that is specified for a task is sometimes called the goal state of the task, or simply the goal state. A set of goal states is sometimes called a goal state set. Reaching the goal state of a task is sometimes called accomplishing the task or succeeding in the task.
When a task is accomplished by executing a skill, the state at the end of skill execution corresponds to the goal state.
The system state at the start of a task is sometimes called the initial state of that task.

The learning device 1 performs learning for the skills of the robot 5. Each skill is a module made up of one or more specific actions of the robot 5. This application mainly assumes tasks that can each be achieved by executing a single skill, and describes, as an example, the case where the learning device 1 learns the skill for achieving such a task.

However, the robot controller 3 may be capable of executing a task that combines multiple skills. For example, for a task made up of multiple subtasks, the robot controller 3 may plan the execution of the task by combining the skills for executing the individual subtasks.

In learning a skill, the learning device 1 may learn the set of states in which the skill can be executed. The learning device 1 stores the information about the skill obtained by learning in the storage device 2, thereby forming a skill database. The information registered in the skill database is sometimes called a skill tuple. A skill tuple contains, in modularized form, the various pieces of information required to execute the actions that make up the skill. The learning device 1 generates a skill tuple based on the detailed system model information, the low-level controller information, and the target parameter information stored in the storage device 2.

The storage device 2 stores various information that can be referenced by the learning device 1 and the robot controller 3. The storage device 2 stores, for example, detailed system model information, low-level controller information, target parameter information, and a skill database. The storage device 2 does not necessarily have to be configured separately from the learning device 1 or the robot controller 3 and may be built into either of them. The storage device 2 may include an external storage device, such as a hard disk, that is directly connected to or built into the learning device 1 or the robot controller 3, or a storage medium such as a flash memory. The storage device 2 may be a server device capable of data communication with the learning device 1 and the robot controller 3. The storage device 2 may also include multiple storage media that are distributed within the control system 100.

Detailed system model information is information that represents a model of the target system in real space. In this application, the model of the target system in real space may also be referred to as a "detailed system model." The word "detailed" is used to distinguish it from the more abstract "abstract" system model.
The detailed system model information may be expressed using differential equations or difference equations that represent the detailed system model. The detailed system model information may also be configured as a simulator program that simulates the operation of the robot 5.

The low-level controller information is information about the low-level controller. The low-level controller generates a control input for controlling the actual operation of the robot 5 based on parameter values output from the high-level controller. For example, when the high-level controller generates a trajectory for the robot 5, the low-level controller generates a control input that makes the operation of the robot 5 follow that trajectory. The low-level controller may also control the operation of the robot 5 by PID (proportional-integral-derivative) servo control based on the parameters output from the high-level controller.
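The relationship between a trajectory supplied by a high-level controller and a PID-based low-level controller can be sketched as follows. This is only an illustration under assumptions: the class and function names, the gains, and the one-dimensional plant (whose velocity equals the control input) are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PID:
    """Textbook discrete PID controller (gains are illustrative)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        error = target - measured
        self.integral += error * dt                   # integral term accumulates error
        derivative = (error - self.prev_error) / dt   # finite-difference derivative
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track(trajectory: Iterable[float], pid: PID, dt: float = 0.01,
          position: float = 0.0) -> float:
    """Follow a sequence of target positions; returns the final position.

    Toy plant (assumption): the velocity of the joint equals the control input.
    """
    for target in trajectory:
        control_input = pid.step(target, position, dt)
        position += control_input * dt
    return position


# The "high-level" output here is simply a constant target held for 10000 steps.
pid = PID(kp=5.0, ki=0.5, kd=0.1)
final_position = track([1.0] * 10000, pid)
```

With these gains the toy plant settles close to the commanded position. A real robot would need gains tuned to its dynamics and would typically add output saturation.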

Target parameter information is provided for each skill to be learned by the learning device 1. The target parameter information includes, for example, initial state information, goal state/known task parameter information, unknown task parameter information, execution time information, and general constraint information. Here, the variable parts of a task are sometimes referred to as task parameters.

Among the task parameters, those that are expressed numerically are sometimes referred to as known task parameters. Examples of known task parameters include the size of an object in the task, such as the size of the object to be grasped when the task is to grasp an object, and the trajectory of the robot 5 for executing the task.
The known task parameters may also be treated as parameters of a skill.

FIG. 2 is a diagram showing an example of known task parameters. FIG. 2 shows a case where the robot 5 executes a task of grasping a cylindrical object. In this case, the radius and height of the cylinder are examples of known task parameters.

On the other hand, task parameters that are difficult to express numerically are sometimes called unknown task parameters. Examples of unknown task parameters include the shape of an object in the task, such as the shape of the object to be grasped when the task is to grasp an object, and the type of action of the robot 5 for executing the task, such as the skill required to execute the task.

FIG. 3 is a diagram showing an example of unknown task parameters. FIG. 3 shows a case where the robot 5 executes a task of grasping objects of various shapes. In this case, the shape of the object is an example of an unknown task parameter.

This application also assumes that the control system 100 handles quantified system states, so the goal state is expressed numerically. For example, for a task in which the robot 5 performs pick-and-place, the goal state is expressed as the coordinates of the object to be manipulated lying within a predetermined range.

The initial state information is information indicating the set of states in which the target skill can be executed. The state at the start of execution of a skill is also referred to as the initial state of that skill, or simply the initial state. A set of initial states is also referred to as an initial state set. The parameters of the initial state are also referred to as initial state parameters. For example, when the task of the robot 5 is pick-and-place, the initial hand position and posture correspond to the initial state parameters. The initial hand position corresponds to the position of the end effector of the robot 5 when performing an action. The initial posture corresponds to the overall shape of the robot 5 before performing an action. When the robot 5 includes multiple joints, the angle formed between the joints of each pair of two mutually connected joints is an element of the initial posture.
In this application, an initial state or an initial state parameter may be written as x_s or x_si, where "i" is a positive integer serving as an identification number for each initial state. In addition, the time at which skill execution starts may be taken as 0, and the initial state may be written as x_0.

The goal state/known task parameter information is information indicating the set of combinations of the values that the goal state, which is a state reachable by executing the target skill, can take and the values that the known task parameters, which are treated as dynamic parameters of the target skill, can take. For example, in a skill in which the robot 5 grasps an object, a target gripping position and posture may be included as elements of the goal state. The target gripping position refers to the position of the end effector that grasps the object to be grasped, and may be expressed as a relative target value with respect to the position of the object. The target posture refers to the target value of the shape of the robot 5 at the start of the gripping operation on the object. The goal state may include, as possible values, information on stable gripping conditions such as form closure and force closure.
In this application, a combination of a goal state and known task parameter values is referred to as a goal state/known task parameter value or a goal state parameter, and may be written as β_g or β_gi. Here, "i" is a positive integer serving as an identification number for each goal state/known task parameter value. When the task of the robot 5 is pick-and-place, the final target position of the end effector of the robot 5 and the position of the object to be manipulated correspond to the goal state parameters. In that case, the size of the object to be manipulated corresponds to a known task parameter.

By treating differences in the goal state of a task and differences in known task parameter values as parameters of a skill, tasks that differ in the goal state, in the known task parameter values, or in both can be executed as a single skill.

For example, when the learning device 1 performs processing related to skill learning using a predictor, the goal state and known task parameter values can be input to the predictor to obtain an output value that depends on them. The predictor is configured using a learning model (a model in machine learning) such as a neural network or a Gaussian process (GP).

Depending on the skill, known task parameters may not be set. In that case, the goal state/known task parameter information may be configured as the set of values that the goal state can take, and the goal state/known task parameter value β_g may be a value indicating the goal state.

The unknown task parameter information is information about unknown task parameters. For example, the unknown task parameter information may indicate a probability distribution of data about an unknown task parameter. When one skill has multiple unknown task parameters, the unknown task parameter information may indicate information about each of them.
The values corresponding to the goal state/known task parameters and the unknown task parameters may be fixed or variable.

In this application, an unknown task parameter value may be written as τ or τ_j, where "j" is a positive integer serving as an identification number for the unknown task parameter value.
Unknown task parameters are difficult to express numerically because their values are difficult to quantify systematically, but it is assumed that whether two unknown task parameter values are equal can be determined. For example, if an unknown task parameter represents the shape of an object, whether the unknown task parameter values are equal can be determined by comparing the shapes of two objects.
The control system 100 treats two tasks as the same task if their unknown task parameter values are the same, and as separate tasks if their unknown task parameter values differ. A task may therefore be denoted by τ or τ_j, and "j" can also be regarded as a positive integer serving as an identification number for each task.
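The rule that tasks are distinguished only by equality of their unknown task parameter values can be illustrated with a short sketch. The class name, the field names, and the use of string labels for shapes are assumptions made for the example; the application only requires that equality of unknown task parameter values be decidable.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass(frozen=True)
class TaskInstance:
    unknown_param: Hashable            # e.g. a shape label; only equality is required
    known_params: Tuple[float, ...]    # numeric values such as (radius, height)


def assign_task_ids(instances: List[TaskInstance]) -> List[int]:
    """Return the task index j of each instance.

    Instances with equal unknown task parameter values share the same j,
    regardless of their known (numeric) task parameter values.
    """
    ids: Dict[Hashable, int] = {}
    result = []
    for inst in instances:
        # A previously unseen unknown parameter value opens a new task index.
        j = ids.setdefault(inst.unknown_param, len(ids) + 1)
        result.append(j)
    return result


instances = [
    TaskInstance("cylinder", (0.03, 0.10)),
    TaskInstance("box", (0.05, 0.05)),
    TaskInstance("cylinder", (0.04, 0.12)),  # same shape, different size
]
task_ids = assign_task_ids(instances)
```

The two cylinders receive the same index even though their sizes differ, because size is a known task parameter handled inside a single skill.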

The execution time information is information about the time limit on skill execution. For example, the execution time information may indicate the execution time of a skill (the time required to execute the skill), an allowable value for the time from the start to the end of skill execution, or both.
The general constraint information is information indicating general constraint conditions, such as conditions on the range of motion of the robot 5, speed limits, and input limits.

The skill database is a database that holds a skill tuple for each skill. A skill tuple may include information about the high-level controller for executing the target skill, information about the low-level controller for executing the target skill, and information about the set of combinations of states in which the target skill can be executed (for example, the initial states of the skill) and goal state/known task parameter values. The set of states in which the target skill can be executed and goal state/known task parameter values is also referred to as the executable state set.
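One possible in-memory layout for skill tuples and the skill database is sketched below. The field names, the callable signatures, the toy "grasp" skill, and the convention that an evaluation value of 0 or less means "executable" are assumptions made for illustration and are not specified by the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


@dataclass
class SkillTuple:
    """One skill database entry (hypothetical field names)."""
    name: str
    # (state, goal state/known task parameters) -> control parameters
    high_level_controller: Callable[[Vector, Vector], List[float]]
    # (state, control parameters) -> control input
    low_level_controller: Callable[[Vector, Vector], List[float]]
    # (state, goal state/known task parameters) -> evaluation value
    level_set: Callable[[Vector, Vector], float]

    def is_executable(self, state: Vector, goal_params: Vector) -> bool:
        # Assumed convention: a value <= 0 lies inside the executable state set.
        return self.level_set(state, goal_params) <= 0.0


class SkillDatabase:
    def __init__(self) -> None:
        self._skills: Dict[str, SkillTuple] = {}

    def register(self, skill: SkillTuple) -> None:
        self._skills[skill.name] = skill

    def executable_skills(self, state: Vector, goal_params: Vector) -> List[str]:
        """Names of all registered skills executable from the given state."""
        return [name for name, skill in self._skills.items()
                if skill.is_executable(state, goal_params)]


db = SkillDatabase()
db.register(SkillTuple(
    name="grasp",
    high_level_controller=lambda x, beta: [beta[0]],       # aim the hand at the goal
    low_level_controller=lambda x, u: [u[0] - x[0]],       # proportional-style input
    level_set=lambda x, beta: abs(x[0] - beta[0]) - 1.0,   # executable within distance 1.0
))
```

Looking up `db.executable_skills([0.2], [1.0])` returns the toy skill, while a state far from the goal returns an empty list.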

　実行可能状態集合は、実際の空間を抽象化してなる抽象空間において定義されてもよい。実行可能状態集合は、例えば、ガウス過程回帰（Gaussian Process Regression；ＧＰＲ）や、レベルセット推定法（Level Set Estimation；ＬＳＥ）を用いて推定されたレベルセット関数、またはレベルセット関数の近似関数を用いて表すことができる。言い換えると、実行可能状態集合が、ある状態および目標状態/既知タスクパラメータ値の組み合わせを含んでいるか否かを、該ある状態および目標状態/既知タスクパラメータ値の組み合わせに対するガウス過程回帰の値（たとえば、平均値）や、該ある状態および目標状態/既知タスクパラメータ値の組み合わせに対する近似関数の値が、実行可能性について判定する制約条件を満たしているか否かによって判定することができる。
　以下では、実行可能状態集合を示す関数としてレベルセット関数を用いる場合を例に説明するが、これに限定されない。 The feasible state set may be defined in an abstract space obtained by abstracting an actual space. The feasible state set may be expressed by, for example, a level set function estimated by Gaussian Process Regression (GPR) or Level Set Estimation (LSE), or an approximation function of the level set function. In other words, whether the feasible state set includes a combination of a certain state and a target state/known task parameter value may be determined by whether a value (e.g., an average value) of the Gaussian process regression for the combination of the certain state and the target state/known task parameter value, or a value of an approximation function for the combination of the certain state and the target state/known task parameter value, satisfies a constraint condition for determining feasibility.
In the following, a case where a level set function is used as a function indicating a feasible state set will be described as an example, but the present invention is not limited to this.
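The membership test described above can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the kernel, the sample data, the function names, and the assumed constraint (mean value less than or equal to a threshold) are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gpr_mean(train_x, train_y, query_x, noise=1e-6):
    """Posterior mean of a zero-mean Gaussian process regression."""
    k_train = rbf_kernel(train_x, train_x) + noise * np.eye(len(train_x))
    k_query = rbf_kernel(query_x, train_x)
    return k_query @ np.linalg.solve(k_train, train_y)

def is_feasible(train_x, train_y, query, threshold=0.0):
    """Assumed constraint: feasible when the GPR mean is <= threshold."""
    mean = gpr_mean(train_x, train_y, np.atleast_2d(query))[0]
    return bool(mean <= threshold)

# Hypothetical data: each row is (abstract state, target parameter);
# negative labels mark combinations for which the skill succeeded.
train_x = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 2.0], [2.1, 1.9]])
train_y = np.array([-1.0, -1.0, 1.0, 1.0])
near_success = is_feasible(train_x, train_y, [0.05, 0.05])
near_failure = is_feasible(train_x, train_y, [2.05, 1.95])
```

An approximation function of the level set function could be substituted for `gpr_mean` without changing the membership test itself.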

　ロボットコントローラ３は、計測装置４から供給される計測信号およびスキルデータベース等に基づき、ロボット５の動作計画を策定する制御装置である。ロボットコントローラ３は、計画した動作をロボット５に実行させるための制御指令（制御入力）を生成し、ロボット５に当該制御指令を供給する。 The robot controller 3 is a control device that formulates an operation plan for the robot 5 based on the measurement signals supplied from the measuring device 4, a skill database, etc. The robot controller 3 generates control commands (control inputs) for causing the robot 5 to execute the planned operations, and supplies the control commands to the robot 5.

　例えば、ロボットコントローラ３は、ロボット５に実行させるタスクを、所定のタイムステップ（時間刻み）ごとのロボット５が受付可能なタスクを示すシーケンスに変換する。そして、ロボットコントローラ３は、生成したシーケンスで指示されるタスクの実行指令に相当する制御指令に基づき、ロボット５を制御する。制御指令は、ローレベル制御器が出力する制御入力に相当する。 For example, the robot controller 3 converts the tasks to be executed by the robot 5 into a sequence indicating tasks that the robot 5 can accept for each predetermined time step (time interval). The robot controller 3 then controls the robot 5 based on control commands equivalent to execution commands for the tasks instructed in the generated sequence. The control commands correspond to the control inputs output by the low-level controller.

　計測装置４は、１または複数のセンサを含んで構成され、ロボット５によりタスクが実行される作業空間内の状態を検出する。当該センサは、例えば、カメラ、測域センサ、ソナーまたはこれらの組み合わせである。計測装置４は、生成した計測信号をロボットコントローラ３に供給する。計測装置４には、作業空間内で移動する自走式または飛行式のセンサ（ドローンを含む）が含まれてもよい。また、計測装置４には、ロボット５自体に設けられたセンサ、および作業空間内の他の物体に設けられたセンサなどが含まれてもよい。また、計測装置４には、作業空間内の音を検出するセンサ、即ち、マイクロホンを含まれてもよい。このように、計測装置４は、作業空間内の状態を検出する種々のセンサであって、任意の場所に設けられたセンサを含んでもよい。 The measurement device 4 includes one or more sensors and detects the state within the workspace in which the robot 5 executes a task. The sensor is, for example, a camera, a range sensor, a sonar, or a combination of these. The measurement device 4 supplies the generated measurement signal to the robot controller 3. The measurement device 4 may include a self-propelled or flying sensor (including a drone) that moves within the workspace. The measurement device 4 may also include a sensor provided on the robot 5 itself, and a sensor provided on another object in the workspace. The measurement device 4 may also include a sensor that detects sound within the workspace, i.e., a microphone. In this way, the measurement device 4 may include various sensors that detect the state within the workspace and are provided at any location.

　ロボット５は、ロボットコントローラ３から供給される制御指令に基づき指示されたタスクに関する動作を実行する。ロボット５は、例えば、組み立て工場、食品工場などの各種工場、または、物流の現場などで動作するロボットであってもよい。ロボット５は、垂直多関節型ロボット、水平多関節型ロボット、または、その他の構造を有するロボットであってもよい。ロボット５は、ロボット５自体の状態を示す状態信号をロボットコントローラ３に供給してもよい。この状態信号は、ロボット５の全体または特定部位（例えば、関節など）の状態（例えば、位置、角度等）を検出するセンサの出力信号であってもよいし、ロボット５の動作の進捗状態を示す信号であってもよい。 The robot 5 executes operations related to the tasks instructed based on the control commands supplied from the robot controller 3. The robot 5 may be, for example, a robot that operates in various factories such as an assembly plant or a food factory, or at a logistics site. The robot 5 may be a vertical articulated robot, a horizontal articulated robot, or a robot having another structure. The robot 5 may supply a status signal indicating the status of the robot 5 itself to the robot controller 3. This status signal may be an output signal of a sensor that detects the status (e.g., position, angle, etc.) of the entire robot 5 or a specific part (e.g., a joint, etc.) or may be a signal indicating the progress of the operation of the robot 5.

　なお、図１に例示される制御システム１００の構成には、種々の変更がなされてもよい。例えば、ロボットコントローラ３とロボット５が、一体に構成されていてもよい。他の例では、学習装置１、記憶装置２およびロボットコントローラ３のうち少なくともいずれか２つが一体に構成されていてもよい。
　また、制御システム１００の制御対象はロボット５に限定されない。学習装置１が学習可能とする種々の制御対象を、制御システム１００の制御対象としてもよい。 Note that various modifications may be made to the configuration of the control system 100 illustrated in FIG. 1. For example, the robot controller 3 and the robot 5 may be integrated. In another example, at least two of the learning device 1, the storage device 2, and the robot controller 3 may be integrated.
Furthermore, the control target of the control system 100 is not limited to the robot 5. Various control targets that the learning device 1 can learn may be the control targets of the control system 100.

（２）ハードウェア構成
　次に、本実施形態に係る学習装置１のハードウェア構成例について説明する。
　図４は、本実施形態に係る学習装置１のハードウェア構成例を示す図である。学習装置１は、ハードウェアとして、プロセッサ１１と、メモリ１２と、インタフェース１３とを含む。プロセッサ１１、メモリ１２およびインタフェース１３は、データバス１０を介して各種のデータを入出力可能に接続されている。 (2) Hardware Configuration Next, an example of the hardware configuration of the learning device 1 according to this embodiment will be described.
FIG. 4 is a diagram showing an example of the hardware configuration of the learning device 1 according to this embodiment. The learning device 1 includes, as hardware, a processor 11, a memory 12, and an interface 13. The processor 11, the memory 12, and the interface 13 are connected via a data bus 10 so as to be able to input and output various types of data.

　プロセッサ１１は、メモリ１２に記憶されているプログラムを実行することにより、学習装置１の全体の制御を行うコントローラ（演算装置）として機能する。本願では、「プログラムを実行する」または「プログラムの実行」とは、プログラムに記述されている各種の指令で指示される処理を実行することを示すことがある。プロセッサ１１は、例えば、ＣＰＵ（Central Processing Unit）、ＧＰＵ（Graphics Processing Unit）、ＴＰＵ（Tensor Processing Unit）などのプロセッサである。プロセッサ１１は、１個に限られず、複数のプロセッサを含んで構成されていてもよい。プロセッサ１１は、学習装置１のコンピュータを構成する。 The processor 11 functions as a controller (computing device) that controls the entire learning device 1 by executing a program stored in the memory 12. In this application, "executing a program" or "execution of a program" may refer to executing the processing instructed by the various commands written in the program. The processor 11 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a TPU (Tensor Processing Unit). The processor 11 is not limited to one processor, and may be configured to include multiple processors. The processor 11 constitutes the computer of the learning device 1.

　メモリ１２は、学習装置１、または、その他のハードウェアにより参照される各種の情報を記憶する記憶媒体を備える。メモリ１２は、例えば、ＲＡＭ（Random Access Memory）、ＲＯＭ（Read Only Memory）、フラッシュメモリなどの各種の揮発性メモリおよび不揮発性メモリを含んで構成される。メモリ１２には、プロセッサ１１が実行する処理を指示する指令が記述されたプログラムが記憶される。メモリ１２に記憶する情報の一部は、学習装置１と通信可能な１または複数の外部記憶装置（例えば、記憶装置２）により記憶されてもよく、学習装置１のその他の部材に対して着脱自在な記憶媒体により記憶されていてもよい。 Memory 12 comprises a storage medium that stores various information referenced by learning device 1 or other hardware. Memory 12 is composed of various volatile and non-volatile memories, such as RAM (Random Access Memory), ROM (Read Only Memory), and flash memory. Memory 12 stores programs that contain commands that instruct the processing to be executed by processor 11. Some of the information stored in memory 12 may be stored in one or more external storage devices (e.g., storage device 2) that can communicate with learning device 1, or may be stored in storage media that are removable from other components of learning device 1.

　インタフェース１３は、学習装置１と他の機器とを各種のデータを入出力可能に接続するためのインタフェースを備える。インタフェース１３は、他の機器とデータを無線で送受信するためのネットワークアダプタなどのワイヤレスインタフェースであってもよく、他の装置とデータを有線で送受信するためのハードウェアインターフェースであってもよい。例えば、インタフェース１３は、タッチパネル、ボタン、キーボード、音声入力装置などのユーザの入力（外部入力）を受け付ける入力装置、ディスプレイ、プロジェクタ等の表示装置、スピーカなどの音出力装置等と接続してもよい。 The interface 13 includes an interface for connecting the learning device 1 to other devices so that various data can be input and output. The interface 13 may be a wireless interface such as a network adapter for wirelessly transmitting and receiving data to and from other devices, or a hardware interface for wired transmission and reception of data to and from other devices. For example, the interface 13 may be connected to an input device that accepts user input (external input) such as a touch panel, button, keyboard, or voice input device, a display device such as a display or projector, or a sound output device such as a speaker.

　なお、学習装置１のハードウェア構成は、図４に例示される構成に限定されない。例えば、学習装置１は、表示装置、入力装置または音出力装置の少なくともいずれかを内蔵してもよい。また、学習装置１は、記憶装置２を含んで構成されてもよい。 The hardware configuration of the learning device 1 is not limited to the configuration illustrated in FIG. 4. For example, the learning device 1 may incorporate at least one of a display device, an input device, and a sound output device. The learning device 1 may also be configured to include a storage device 2.

　図５は、本実施形態に係るロボットコントローラ３のハードウェア構成例を示す図である。ロボットコントローラ３は、ハードウェアとして、プロセッサ３１と、メモリ３２と、インタフェース３３とを含む。プロセッサ３１、メモリ３２およびインタフェース３３は、データバス３０を用いて、各種のデータを入出力可能に接続されている。 FIG. 5 is a diagram showing an example of the hardware configuration of the robot controller 3 according to this embodiment. The robot controller 3 includes, as hardware, a processor 31, a memory 32, and an interface 33. The processor 31, the memory 32, and the interface 33 are connected via a data bus 30 so that various types of data can be input and output.

　プロセッサ３１は、メモリ３２に記憶されているプログラムを実行することにより、ロボットコントローラ３の全体の制御を行うコントローラ（演算装置）として機能する。プロセッサ３１は、例えば、ＣＰＵ、ＧＰＵ、ＴＰＵなどのプロセッサである。プロセッサ３１は、１個のプロセッサに限られず、複数のプロセッサを含んで構成されてもよい。 The processor 31 functions as a controller (computing device) that performs overall control of the robot controller 3 by executing the programs stored in the memory 32. The processor 31 is, for example, a processor such as a CPU, a GPU, or a TPU. The processor 31 is not limited to a single processor, and may be configured to include multiple processors.

　メモリ３２は、１個または複数個の記憶媒体を含んで構成される。メモリ３２は、例えば、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの各種の揮発性メモリおよび不揮発性メモリを含む。また、メモリ３２には、プロセッサ３１が実行するプログラムが記憶される。なお、メモリ３２が記憶する情報の一部は、ロボットコントローラ３と通信可能な１または複数の外部記憶装置（例えば、記憶装置２）に記憶されてもよく、ロボットコントローラ３の他の部位に対して着脱自在な記憶媒体に記憶されていてもよい。 The memory 32 is configured to include one or more storage media. The memory 32 includes various types of volatile and non-volatile memory, such as RAM, ROM, and flash memory. The memory 32 also stores programs executed by the processor 31. Some of the information stored in the memory 32 may be stored in one or more external storage devices (e.g., storage device 2) capable of communicating with the robot controller 3, or may be stored in a storage medium that is detachable from other parts of the robot controller 3.

　インタフェース３３は、ロボットコントローラ３と他の装置とを各種のデータを入出力可能に接続するためのインタフェースである。インタフェース３３は、他の装置と無線でデータを送受信するためのネットワークアダプタなどのワイヤレスインタフェースを備えてもよいし、他の装置と有線でデータを送受信するためのハードウェアインターフェースを備えてもよい。 The interface 33 is an interface for connecting the robot controller 3 to other devices so that various data can be input and output. The interface 33 may include a wireless interface such as a network adapter for wirelessly transmitting and receiving data to and from other devices, or may include a hardware interface for wired transmission and reception of data to and from other devices.

　なお、ロボットコントローラ３のハードウェア構成は、図５に例示される構成に限定されない。例えば、ロボットコントローラ３が、表示装置、入力装置または音出力装置の少なくともいずれかを内蔵してもよい。また、ロボットコントローラ３が、記憶装置２を含んで構成されていてもよい。 The hardware configuration of the robot controller 3 is not limited to the configuration exemplified in FIG. 5. For example, the robot controller 3 may incorporate at least one of a display device, an input device, and a sound output device. The robot controller 3 may also be configured to include a storage device 2.

（３）抽象空間
　次に、本実施形態に係る抽象空間について説明する。抽象空間は、実空間とは別個の仮想空間であって、ロボットコントローラ３により、スキルタプルに基づいてロボット５の動作計画の策定に用いられる空間である。 (3) Abstract Space Next, the abstract space according to this embodiment will be described. The abstract space is a virtual space separate from the real space, and is used by the robot controller 3 to formulate an operation plan for the robot 5 based on the skill tuples.

　図６は、実空間において物体の把持を行うロボット（マニピュレータ）５と、把持対象物体６とを例示する。
　図７は、図６に例示されるシステム状態を抽象空間において表現する。 FIG. 6 illustrates a robot (manipulator) 5 that grasps an object in real space, and an object 6 to be grasped.
FIG. 7 represents the system state illustrated in FIG. 6 in an abstract space.

　一般的に、ピックアンドプレイスをタスクとするロボット５の動作計画を策定するには、ロボット５のエンドエフェクタ形状、把持対象物体６の幾何形状、ロボット５の把持位置・姿勢および把持対象物体６の物体特性等を考慮した厳密な計算が必要となる。これに対し、本実施形態では、ロボットコントローラ３は、ロボット５、把持対象物体６などの各物体の状態が抽象的に（簡略的に）表された抽象空間において動作計画を策定する。図７の例では、抽象空間では、ロボット５のエンドエフェクタに対応する抽象モデル５ｘと、把持対象物体６に対応する抽象モデル６ｘと、ロボット５による把持対象物体６の把持動作実行可能領域（破線枠６０参照）とが定義される。なお、抽象空間においても、上記のように、実行可能状態集合は、スキルを実行可能な、初期状態と目標状態/既知タスクパラメータ値との組み合わせの集合として示される。図７の例では、把持スキルを実行可能な、初期状態と目標状態/既知タスクパラメータ値との組み合わせの集合を、破線枠６０の把持動作実行可能領域として例示している。実行可能状態集合または目標状態／既知タスクパラメータ値は、ロボットの動作目標および動作環境を規定する動作パラメータに相当する。
　このように、抽象空間におけるロボットの状態により、エンドエフェクタの状態等が抽象的に表現される。また、操作対象物または環境物体に該当する各物体の状態についても、例えば、作業台などの基準物体の位置を基準とする座標系等において抽象的に表現されうる。 In general, formulating an operation plan for a robot 5 whose task is pick-and-place requires rigorous calculation that takes into consideration the shape of the end effector of the robot 5, the geometric shape of the object 6 to be grasped, the grasping position and posture of the robot 5, and the object characteristics of the object 6 to be grasped. In contrast, in this embodiment, the robot controller 3 formulates an operation plan in an abstract space in which the state of each object, such as the robot 5 and the object 6 to be grasped, is represented abstractly (in simplified form). In the example of FIG. 7, the abstract space defines an abstract model 5x corresponding to the end effector of the robot 5, an abstract model 6x corresponding to the object 6 to be grasped, and a region (see dashed frame 60) in which the robot 5 can grasp the object 6 to be grasped. Note that, in the abstract space as well, as described above, the executable state set is represented as a set of combinations of the initial state and the target state/known task parameter value in which the skill can be executed. In the example of FIG. 7, the set of combinations of initial states and target states/known task parameter values that allow the grasping skill to be executed is illustrated as the grasp operation executable region in the dashed frame 60. The executable state set or the target state/known task parameter value corresponds to operation parameters that define the robot's operation goal and operating environment.
In this way, the state of the end effector and the like are abstractly represented by the state of the robot in the abstract space. Also, the state of each object corresponding to the operation target or the environmental object can be abstractly represented in a coordinate system based on the position of a reference object such as a workbench.

　本実施形態におけるロボットコントローラ３は、スキルを利用し、実際のシステムを抽象化した抽象空間において動作計画を策定する。これにより、マルチステージタスクにおいても動作計画に要する計算コストを好適に抑制することができる。図７の例では、ロボットコントローラ３は、抽象空間において定義される把持可能領域（破線枠６０）において、把持を実行するためのスキルを実行する動作計画を策定し、策定した動作計画に基づきロボット５の制御指令を生成する。 In this embodiment, the robot controller 3 uses skills to formulate an action plan in an abstract space that abstracts the actual system. This makes it possible to effectively reduce the computational costs required for action planning even in multi-stage tasks. In the example of FIG. 7, the robot controller 3 formulates an action plan to execute skills for performing grasping in a graspable area (dashed frame 60) defined in the abstract space, and generates control commands for the robot 5 based on the formulated action plan.

　以後では、実空間におけるシステム（本願では、「実システム」と呼ぶことがある）の状態を「ｘ」、抽象空間におけるシステム（本願では、「抽象システム」と呼ぶことがある）の状態を「ｘ’」と表記して、これらを区別する場合がある。状態ｘ’は、ベクトル（本願では、「抽象状態ベクトル」と呼ぶことがある）として表される。例えば、ピックアンドプレイスなどのタスクについて、抽象状態ベクトルは、操作対象物の状態（例えば、位置、姿勢、速度、等）を表すベクトル、操作可能なロボット５のエンドエフェクタの状態を表すベクトル、環境物体の状態を表すベクトルを含む。状態ｘ’は、実システムにおける一部の要素の状態を抽象的に表した状態ベクトルとして定義される。
　同様に、実空間における目標状態／既知タスクパラメータ値を「β_ｇ」、抽象空間における目標状態／既知タスクパラメータ値を「β_ｇ’」と表記して、これらを区別する場合がある。 Hereinafter, the state of a system in real space (sometimes referred to as a "real system" in this application) may be denoted as "x" and the state of a system in abstract space (sometimes referred to as an "abstract system" in this application) may be denoted as "x'" to distinguish between them. The state x' is expressed as a vector (sometimes referred to as an "abstract state vector" in this application). For example, for a task such as pick-and-place, the abstract state vector includes a vector representing the state of an object to be manipulated (e.g., position, orientation, velocity, etc.), a vector representing the state of an end effector of a manipulable robot 5, and a vector representing the state of an environmental object. The state x' is defined as a state vector that abstractly represents the state of some elements in the real system.
Similarly, a goal state/known task parameter value in the real space may be expressed as "β_g", and a goal state/known task parameter value in the abstract space may be expressed as "β_g'" to distinguish between them.

（４）スキル実行に関する制御系
　次に、本実施形態に係るスキルの実行に関する制御系の構成例について説明する。図８は、本実施形態に係るスキル実行に関する制御系の構成例を示す図である。ロボットコントローラ３のプロセッサ３１は、機能的には、動作計画部３４と、ハイレベル制御部３５と、ローレベル制御部３６とを備える。また、システム５０は、実際のシステム（ロボット５を含む実システム）に相当する。
　本願では、ハイレベル制御部３５をハイレベル制御器とも称され、π_Ｈと表すことがある。ハイレベル制御部３５は、制御手段の例に該当する。ローレベル制御部３６をローレベル制御器とも称され、π_Ｌと表すことがある。
　ロボットコントローラ３は、ロボット５を制御する制御装置の例に該当する。 (4) Control System for Skill Execution Next, a configuration example of a control system for skill execution according to this embodiment will be described. Fig. 8 is a diagram showing a configuration example of a control system for skill execution according to this embodiment. The processor 31 of the robot controller 3 functionally includes a motion planning unit 34, a high-level control unit 35, and a low-level control unit 36. Furthermore, the system 50 corresponds to an actual system (an actual system including the robot 5).
In this application, the high-level control unit 35 is also called a high-level controller and may be represented as π_H. The high-level control unit 35 corresponds to an example of a control means. The low-level control unit 36 is also called a low-level controller and may be represented as π_L.
The robot controller 3 corresponds to an example of a control device that controls the robot 5.

　動作計画部３４は、抽象システムにおける状態ｘ’とスキルデータベースとに基づき、ロボット５の動作計画を策定する。動作計画部３４は、例えば、目標状態を時相論理に基づく論理式により表現する。動作計画部３４が、線形時相論理、ＭＴＬ（Metric Temporal Logic）、ＳＴＬ（Signal Temporal Logic）などの予め設定された時相論理を用いて論理式を表現するようにしてもよい。
　動作計画部３４は、生成した論理式をタイムステップごとのシーケンス（動作シーケンス）に変換する。この動作シーケンスには、例えば、各タイムステップにおいて使用されるスキルに関する情報が含まれる。 The motion planning unit 34 formulates a motion plan for the robot 5 based on the state x' in the abstract system and the skill database. The motion planning unit 34 expresses the goal state by a logical formula based on temporal logic, for example. The motion planning unit 34 may express the logical formula by using a preset temporal logic such as linear temporal logic, metric temporal logic (MTL), or signal temporal logic (STL).
The motion planning unit 34 converts the generated logical formula into a sequence (action sequence) for each time step. This action sequence includes, for example, information about the skills used in each time step.

　ハイレベル制御部３５は、動作計画部３４が生成した動作シーケンスから、タイムステップごとに実行すべきスキルを認識する。そして、ハイレベル制御部３５は、現在のタイムステップにおいて実行すべきスキルに対応するスキルタプルに含まれるハイレベル制御器「π_Ｈ」に基づき、ローレベル制御部３６への入力となるパラメータ「α」を生成する。 The high-level control unit 35 recognizes the skill to be executed at each time step from the action sequence generated by the motion planning unit 34. Then, based on the high-level controller "π_H" included in the skill tuple corresponding to the skill to be executed at the current time step, the high-level control unit 35 generates a parameter "α" to be input to the low-level control unit 36.

　ハイレベル制御部３５は、実行すべきスキルの実行開始時における抽象空間での状態「ｘ_ｓ’」および目標状態／既知タスクパラメータ値「β_ｇ’」の組み合わせが、そのスキルの実行可能状態集合「χ_０’」に属する場合に、式（１）に示されるように制御パラメータαを生成する。 The high-level control unit 35 generates a control parameter α as shown in equation (1) when the combination of the state "x_s'" in the abstract space at the start of execution of the skill to be executed and the target state/known task parameter value "β_g'" belongs to the executable state set "χ_0'" of that skill.

　スキルの実行開始時における初期状態は、例えば、抽象空間におけるシステム状態をもって表現される。
　ロボットコントローラ３には、レベルセット関数の近似関数ｇ＾を予め設定しておく。ロボットコントローラ３は、レベルセット関数の近似関数ｇ＾を用いて式（２）を満足するか否かにより、初期状態ｘ_ｓ’が実行可能状態集合χ_０’に属するか否かを判定することができる。 The initial state at the start of skill execution is represented, for example, by a system state in an abstract space.
The approximation function g^ of the level set function is preset in the robot controller 3. The robot controller 3 can determine whether the initial state x_s' belongs to the executable state set χ_0' according to whether equation (2) is satisfied using the approximation function g^ of the level set function.

　式（２）は、ある状態からの、スキルの実行可能性を判定する制約条件を表すとみなすことができる。あるいは、近似関数ｇ＾は、ある初期状態ｘ_ｓ’から、既知タスクパラメータ値の下で目標状態に到達できるかどうかを評価するためのモデルとみなすこともできる。
　近似関数ｇ＾は、後述するように、学習装置１が学習を行って得られる。本願では、近似関数ｇ＾自体をレベルセット関数と指すこともある。 Equation (2) can be regarded as a constraint condition for determining the feasibility of the skill from a certain state. Alternatively, the approximation function g^ can be regarded as a model for evaluating whether the goal state can be reached from a certain initial state x_s' under the known task parameter values.
As will be described later, the approximation function g^ is obtained by performing learning by the learning device 1. In this application, the approximation function g^ itself may be referred to as a level set function.
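Equations (1) and (2) are not reproduced in this text. A form consistent with the surrounding description is sketched below; the argument list of π_H and g^, and in particular the direction of the inequality in equation (2), are assumptions rather than the forms given in the original drawings.

```latex
\begin{align}
\alpha &= \pi_H(x_s', \beta_g') \tag{1} \\
\hat{g}(x_s', \beta_g') &\le 0 \tag{2}
\end{align}
```

Under this assumed convention, the executable state set χ_0' would be the set of combinations (x_s', β_g') for which the left-hand side of equation (2) is non-positive.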

　ロボットコントローラ３は、ローレベル制御部３６を用いてスキルを実行することで、式（３）に示すように、実行開始からＴ時間経過した時点における抽象システム状態ｘ’（Ｔ）が目標状態集合χ’_ｄに属するか否かを判定することができる。目標状態集合をχ’_ｄは、判定対象のスキルの実行後の抽象空間における目標状態の集合である。Ｔは、判定対象のスキルの実行時間を示す。 By executing a skill using the low-level control unit 36, the robot controller 3 can determine, as shown in equation (3), whether the abstract system state x'(T) at the point when the time T has elapsed from the start of execution belongs to the goal state set χ'_d. The goal state set χ'_d is the set of goal states in the abstract space after execution of the skill to be determined. T indicates the execution time of the skill to be determined.

　ローレベル制御部３６は、ハイレベル制御部３５が生成した制御パラメータαと、システム５０から得られる現在の実システムでの状態ｘおよび目標状態／既知タスクパラメータ値β_ｇとに基づき、入力「ｕ」を生成する。ローレベル制御部３６は、スキルタプルに含まれるローレベル制御器「π_Ｌ」に基づき、式（４）に示されるように入力ｕを制御指令として生成する。 The low-level control unit 36 generates an input "u" based on the control parameter α generated by the high-level control unit 35, and on the current state x in the real system and the target state/known task parameter value β_g obtained from the system 50. Based on the low-level controller "π_L" included in the skill tuple, the low-level control unit 36 generates the input u as a control command as shown in equation (4).

　なお、ローレベル制御器π_Ｌは、必ずしも式（４）により表記可能な入出力関係に限られず、異なる入出力関係を有してもよい。 It should be noted that the low-level controller π_L is not necessarily limited to an input/output relationship that can be expressed by equation (4), and may have a different input/output relationship.
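Equations (3) and (4) are likewise not reproduced in this text. Based only on the inputs and outputs named in the description, they might take the following form; the order of the arguments of π_L is an assumption.

```latex
\begin{align}
x'(T) &\in \chi'_d \tag{3} \\
u &= \pi_L(x, \beta_g, \alpha) \tag{4}
\end{align}
```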

　ローレベル制御部３６は、計測装置４から出力される計測信号（ロボット５から取得される信号が含まれてもよい）等に基づき所定の状態認識技術を用いて認識したロボット５および環境の状態ｘを認識する。
　システム５０のモデルは、ロボット５への入力ｕと、状態ｘに基づいて、状態ｘの時間変化ｘ^・と出力とする関数「ｆ」を表す状態方程式（式（５））により表される。式（５）において演算子「^・」は、時間についての微分、または、時間についての差分を表す。 The low-level control unit 36 recognizes the state x of the robot 5 and the environment using a predetermined state recognition technology based on the measurement signals output from the measurement device 4 (which may include signals obtained from the robot 5) and the like.
The model of the system 50 is expressed by a state equation (equation (5)) whose function "f" takes the input u to the robot 5 and the state x and outputs the time change x^· of the state x. In equation (5), the operator "^·" represents differentiation with respect to time or a difference with respect to time.
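The relationship between the high-level controller, the low-level controller, and the state equation can be illustrated with a minimal sketch. Everything below is hypothetical: the scalar state, the proportional control law, the single-integrator dynamics, and the identity mapping between the real and abstract states are assumptions chosen only to make the loop runnable. The state equation is used in its difference form with a fixed time step.

```python
def high_level_controller(x_abs_start, beta_abs):
    """Hypothetical pi_H: returns the control parameter alpha (here, a gain)."""
    distance = abs(beta_abs - x_abs_start)
    return 2.0 if distance > 1.0 else 1.0

def low_level_controller(x, beta, alpha):
    """Hypothetical pi_L: proportional control toward the target beta."""
    return alpha * (beta - x)

def system_dynamics(x, u):
    """Hypothetical f: a single integrator, so the state changes at rate u."""
    return u

def run_skill(x0, beta, duration=5.0, dt=0.01):
    """Run one skill: alpha is fixed at the start, u is recomputed each step."""
    # The abstraction is taken to be the identity here, so x' = x.
    alpha = high_level_controller(x0, beta)
    x = x0
    for _ in range(int(round(duration / dt))):
        u = low_level_controller(x, beta, alpha)
        x = x + dt * system_dynamics(x, u)  # difference form of the state equation
    return x

final_state = run_skill(x0=0.0, beta=3.0)
# Hypothetical goal state set: states within 0.05 of beta.
reached_goal = abs(final_state - 3.0) < 0.05
```

In this sketch the control parameter α is computed once per skill while the input u is recomputed at every time step, which mirrors the division of roles between the two controllers described above.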

（５）スキルデータベースの更新
　次に、本実施形態に係るスキルデータベースの更新について説明する。図９は、本実施形態に係るスキルデータベースの更新に関する学習装置１の機能構成例を示す図である。学習装置１のプロセッサ１１は、機能的には、抽象システムモデル設定部１４と、スキル学習部１５と、スキルタプル生成部１６とを備える。なお、図９では、各ブロックについて授受されるデータの種別が例示されているが、これに限定されない。 (5) Updating the Skill Database Next, updating of the skill database according to this embodiment will be described. Fig. 9 is a diagram showing an example of a functional configuration of the learning device 1 related to updating of the skill database according to this embodiment. The processor 11 of the learning device 1 functionally comprises an abstract system model setting unit 14, a skill learning unit 15, and a skill tuple generating unit 16. Note that Fig. 9 shows an example of the type of data exchanged for each block, but is not limited to this.

　抽象システムモデル設定部１４は、詳細システムモデル情報に基づき、抽象システムモデルを設定する。設定される抽象システムモデルは、詳細システムモデル情報により特定される詳細システムモデルを簡略化して得られるモデルである。詳細システムモデルは、図８のシステム５０を表すモデルである。 The abstract system model setting unit 14 sets an abstract system model based on the detailed system model information. The abstract system model that is set is a model obtained by simplifying the detailed system model identified by the detailed system model information. The detailed system model is a model that represents the system 50 in Figure 8.

　抽象システムモデルは、詳細システムモデルにおける詳細システム状態ｘを基に構成される抽象状態ベクトルｘ’に示される抽象システム状態を与えるモデルである。動作計画部３４は、抽象システムモデルを用いてロボット５の動作計画を策定する。
　抽象システムモデル設定部１４は、例えば、予め記憶装置２等に記憶されたアルゴリズムに基づき、詳細システムモデルから抽象システムモデルを導出する。 The abstract system model is a model that provides the abstract system state indicated by an abstract state vector x' constructed from the detailed system state x in the detailed system model. The motion planning unit 34 formulates a motion plan for the robot 5 using the abstract system model.
The abstract system model setting unit 14 derives the abstract system model from the detailed system model based on, for example, an algorithm stored in advance in the storage device 2 or the like.

　なお、抽象システムモデルに関する情報が予め記憶装置２等に記憶されていてもよい。この場合、抽象システムモデル設定部１４は、記憶装置２等から抽象システムモデルに関する情報を取得してもよい。抽象システムモデル設定部１４は、設定した抽象システムモデルに関する情報を、スキル学習部１５およびスキルタプル生成部１６に供給する。本願では、抽象システムモデル情報と詳細システムモデル情報をシステムモデル情報と総称することがある。 In addition, information regarding the abstract system model may be stored in advance in the storage device 2 or the like. In this case, the abstract system model setting unit 14 may acquire information regarding the abstract system model from the storage device 2 or the like. The abstract system model setting unit 14 supplies information regarding the set abstract system model to the skill learning unit 15 and the skill tuple generation unit 16. In this application, the abstract system model information and the detailed system model information may be collectively referred to as system model information.

　スキル学習部１５は、抽象システムモデル設定部１４が設定した抽象システムモデルと、記憶装置２に記憶される詳細システムモデル情報、ローレベル制御器情報および目標パラメータ情報とに基づき、スキル実行に係る制御の学習を行う。スキル学習部１５は、例えば、ハイレベル制御器π_Ｈから出力され、ローレベル制御器π_Ｌに提供される制御パラメータαを学習する。また、スキル学習部１５は、後述するように、レベルセット関数を学習し、制御パラメータαの学習のための訓練データを取得する。このとき、スキル学習部１５は、例えば、レベルセット関数の予測精度を評価する。 The skill learning unit 15 learns control related to skill execution based on the abstract system model set by the abstract system model setting unit 14 and the detailed system model information, low-level controller information, and target parameter information stored in the storage device 2. The skill learning unit 15 learns, for example, a control parameter α output from the high-level controller π _H and provided to the low-level controller π _L. In addition, the skill learning unit 15 learns a level set function and acquires training data for learning the control parameter α, as described later. At this time, the skill learning unit 15 evaluates, for example, the prediction accuracy of the level set function.

　スキルタプル生成部１６は、スキル学習部１５における学習により得られる実行可能状態集合χ_０’に関する情報と、ハイレベル制御器π_Ｈに関する情報と、抽象システムモデル設定部１４により設定される抽象システムモデルに関する情報と、ローレベル制御器情報と，目標パラメータ情報とを含み、これらを対応付けてなる組（タプル）をスキルタプルとして生成する。そして、スキルタプル生成部１６は、生成したスキルタプルを、スキルデータベースに登録する。スキルデータベースのデータは、ロボットコントローラ３がロボット５を制御するために用いられる。 The skill tuple generation unit 16 generates, as a skill tuple, a set (tuple) that includes and associates the following: information on the executable state set χ_0' obtained by learning in the skill learning unit 15, information on the high-level controller π_H, information on the abstract system model set by the abstract system model setting unit 14, the low-level controller information, and the target parameter information. The skill tuple generation unit 16 then registers the generated skill tuple in the skill database. The data in the skill database is used by the robot controller 3 to control the robot 5.
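One possible in-memory layout for a skill tuple and the skill database is sketched below. The class name, the field names, the dictionary-based database, and the rule that a level set value of zero or less means "executable" are all assumptions made for illustration; the description does not specify a data format.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass(frozen=True)
class SkillTuple:
    """One entry of the skill database; all field names are hypothetical."""
    high_level_controller: Callable[..., Any]  # pi_H
    low_level_controller: Callable[..., Any]   # pi_L
    level_set_function: Callable[..., float]   # represents the executable state set
    abstract_system_model: Any
    target_parameters: Any

skill_database: Dict[str, SkillTuple] = {}

def register_skill(name: str, skill: SkillTuple) -> None:
    """Register (or overwrite) the tuple for a named skill."""
    skill_database[name] = skill

def can_execute(name: str, x_abs: float, beta_abs: float) -> bool:
    """Assumed feasibility rule: a level set value <= 0 means executable."""
    return skill_database[name].level_set_function(x_abs, beta_abs) <= 0.0

register_skill(
    "grasp",
    SkillTuple(
        high_level_controller=lambda x, beta: beta - x,
        low_level_controller=lambda x, beta, alpha: alpha,
        level_set_function=lambda x, beta: abs(beta - x) - 1.0,
        abstract_system_model=None,
        target_parameters={"tolerance": 0.05},
    ),
)
```

With this layout, a controller could look up a skill by name and call `can_execute("grasp", x_abs, beta_abs)` before planning with it.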

　抽象システムモデル設定部１４、スキル学習部１５およびスキルタプル生成部１６の機能は、例えば、プロセッサ１１が所定のプログラムを実行することにより実現できる。また、実行対象とするプログラムを任意の不揮発性記憶媒体に予め記録しておき、プロセッサ１１は、そのプログラムをインストールしたうえで、実行することで、それらの機能を実現するようにしてもよい。なお、これらの機能の一部または全部は、ソフトウェアの実行による実現に限られず、ハードウェアまたはハードウェアとソフトウェアの組合せにより実現してもよい。また、これらの機能一部または全部は、例えば、ＦＰＧＡ（Field-Programmable Gate Array）またはマイクロコントローラ等、ユーザがプログラミング可能な集積回路を用いて実現されてもよい。その場合、これらの集積回路は、上記の機能を実現するためのプログラムを実行する。また、上記の機能の実現に用いられるハードウェアは、ＡＳＳＰ（Application Specific Standard Product）、ＡＳＩＣ（Application Specific Integrated Circuit）または量子コンピュータ制御チップなど、他種のハードウェアであってもよい。この点は、後述する他の実施形態においても同様である。さらに、これらの各機能は，例えば，クラウドコンピューティング技術などを用いて、複数のコンピュータを協働して実現されていてもよい。 The functions of the abstract system model setting unit 14, the skill learning unit 15, and the skill tuple generation unit 16 can be realized, for example, by the processor 11 executing a predetermined program. Alternatively, the program to be executed may be recorded in advance in any non-volatile storage medium, and the processor 11 may install and then execute the program to realize these functions. Note that some or all of these functions are not limited to being realized by executing software, and may be realized by hardware or by a combination of hardware and software. Some or all of these functions may also be realized using user-programmable integrated circuits, such as an FPGA (Field-Programmable Gate Array) or a microcontroller. In that case, these integrated circuits execute programs for realizing the above functions. The hardware used to realize the above functions may also be another type of hardware, such as an ASSP (Application Specific Standard Product), an ASIC (Application Specific Integrated Circuit), or a quantum computer control chip. The same applies to the other embodiments described later. Furthermore, each of these functions may be realized by multiple computers working together, for example, using cloud computing technology.

（６）スキル学習部の構成
　次に、本実施形態に係るスキル学習部１５の構成例について説明する。
　図１０は、本実施形態に係るスキル学習部１５の構成例を示す図である。スキル学習部１５は、機能的には、探索点集合設定部２１０と、データ取得部２２０と、学習設定部２３０を備える。 (6) Configuration of the Skill Learning Unit Next, an example of the configuration of the skill learning unit 15 according to this embodiment will be described.
FIG. 10 is a diagram showing an example of the configuration of the skill learning unit 15 according to this embodiment. The skill learning unit 15 functionally comprises a search point set setting unit 210, a data acquisition unit 220, and a learning setting unit 230.

　探索点集合設定部２１０は、探索点集合初期化部２１１と、探索点選択部２１２とを備える。
　データ取得部２２０は、システムモデル設定部２２１と、問題設定計算部２２２と、データ更新部２２３とを備える。
　学習設定部２３０は、レベルセット関数学習部２３１と、予測精度評価関数設定部２３２と、予測精度評価部２３３と、制御器学習用評価関数設定部２３４と、ハイレベル制御器学習部２４０と、を備える。 The search point set setting unit 210 includes a search point set initialization unit 211 and a search point selection unit 212.
The data acquisition unit 220 includes a system model setting unit 221, a problem setting calculation unit 222, and a data update unit 223.
The learning setting unit 230 includes a level set function learning unit 231, a prediction accuracy evaluation function setting unit 232, a prediction accuracy evaluation unit 233, a controller learning evaluation function setting unit 234, and a high-level controller learning unit 240.

　上記のように、スキル学習部１５は、訓練データ（Training Data）を生成し、生成した訓練データを用いてハイレベル制御器π_Ｈの学習を行う。また、スキル学習部１５は、レベルセット関数を学習する。
　探索点集合設定部２１０は、ハイレベル制御器π_Ｈの学習の対象とするタスク設定の候補として、初期状態ｘ_ｓと、目標状態／既知タスクパラメータ値β_ｇとの組み合わせからなる探索点を複数組取得する。探索点集合設定部２１０は、取得した複数の候補のうち、訓練データの取得対象とする探索点を選択する。訓練データは、ロボットコントローラ３によるロボット５の制御の学習のために用いられる。
　探索点集合設定部２１０は、探索点設定手段の例に該当する。 As described above, the skill learning unit 15 generates training data and uses the generated training data to learn the high-level controller π _H. In addition, the skill learning unit 15 learns a level set function.
The search point set setting unit 210 acquires a plurality of search points, each consisting of a combination of an initial state x_s and a target state/known task parameter value β_g, as candidates for the task settings to be learned by the high-level controller π_H. The search point set setting unit 210 selects, from the acquired candidates, a search point for which training data is to be acquired. The training data is used for learning the control of the robot 5 by the robot controller 3.
The search point set setting unit 210 corresponds to an example of a search point setting means.

The search point set initialization unit 211 sets a search point set. The search point set is a set that includes a plurality of search points as elements. Each search point is a candidate task setting targeted by the learning of the high-level controller π_H and of the level set function. More specifically, each search point corresponds to a combination of an initial state x_s and a target state/known task parameter value β_g, that is, an operation parameter that is a candidate for the executable state set. The search point set initialization unit 211 sets, for example, a predetermined number N of search points (N is a predetermined integer of 2 or more) so that they are randomly distributed over a value range predetermined for each task.
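
The random initialization described above can be sketched as follows. This is only an illustration under simplifying assumptions: the initial state x_s and the parameter β_g are treated as scalars, the sampling is uniform, and the function name `init_search_points` and the value ranges are hypothetical rather than taken from the application.

```python
import random


def init_search_points(x_s_range, beta_g_range, n, seed=None):
    """Draw n search points (x_s, beta_g) uniformly at random from the given ranges."""
    if n < 2:
        raise ValueError("N must be an integer of 2 or more")
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x_s = rng.uniform(*x_s_range)        # candidate initial state
        beta_g = rng.uniform(*beta_g_range)  # candidate target state / task parameter
        points.append((x_s, beta_g))
    return points


# Example: a search point set with N = 5 elements for one hypothetical task.
xi_check = init_search_points((-1.0, 1.0), (0.0, 2.0), n=5, seed=0)
```

In a real task the state and parameter would typically be vectors, and the ranges would come from the stored target parameter information.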

In this application, the search point set may be denoted Ξ_check, and an individual search point may be denoted (x_s, β_g) or ξ.
Determining the search point (x_s, β_g) specifies the task setting and thereby determines the operation of the robot 5. In other words, the search point (x_s, β_g) can also be regarded as a parameter indicating the operation of the robot 5 for each task.

The search point selection unit 212 selects, from the search point set Ξ_check, the search point ξ_i (i is an integer from 1 to N) for which training data is to be acquired next. The search point selection unit 212 outputs the selected search point ξ_i to the data acquisition unit 220 and the learning setting unit 230. For example, the search point selection unit 212 randomly selects one search point ξ_i from the plurality of search points constituting the search point set Ξ_check. After the level set function g^ has been learned at least once in the learning setting unit 230, the search point selection unit 212 may select the search point ξ_i that maximizes the evaluation value calculated using the prediction accuracy evaluation function J_g^(ξ_i, π_H(ξ_i)).
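
A minimal sketch of this selection rule is given below. The names `select_search_point` and `accuracy_fn` are hypothetical; `accuracy_fn` stands in for a wrapper that evaluates J_g^(ξ, π_H(ξ)) for a candidate ξ, and it is left as `None` until a level set function has been learned.

```python
import random


def select_search_point(candidates, accuracy_fn=None, rng=None):
    """Pick the next search point: at random at first, by maximum J afterwards."""
    if not candidates:
        raise ValueError("no unprocessed search points remain")
    if accuracy_fn is None:
        # No learned level set function yet: fall back to a random choice.
        rng = rng or random.Random()
        return rng.choice(candidates)
    # Otherwise choose the candidate with the largest evaluation value.
    return max(candidates, key=accuracy_fn)


candidates = [(0.0, 1.0), (0.4, 1.0), (0.9, 1.0)]
first = select_search_point(candidates, rng=random.Random(0))
# Hypothetical evaluation function that peaks at x_s = 0.5.
later = select_search_point(candidates, accuracy_fn=lambda xi: -abs(xi[0] - 0.5))
```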

The prediction accuracy evaluation function J_g^(ξ_i, π_H(ξ_i)) is a function indicating the estimation accuracy of the evaluation value obtained for a search point using the level set function g^. A larger evaluation value of the prediction accuracy evaluation function indicates higher estimation accuracy. The level set function g^ determines an evaluation value indicating the possibility of reaching the target state. A smaller evaluation value of the level set function g^ indicates a higher possibility of reaching the target state. In this case, the search point that gives that evaluation value of the level set function g^ is determined to be able to reach the target state.

The data acquisition unit 220 uses the search point ξ_i input from the search point set setting unit 210 to acquire training data for learning the high-level controller π_H, and also acquires training data for learning the level set function g^.
The system model setting unit 221 makes the settings relating to the system model that are needed to solve the optimal control problem, based on the search point ξ_i. The system model setting unit 221 sets, in the problem setting calculation unit 222, problem setting information indicating the preset system model information, constraint conditions, low-level controller, target time, and solution search function, together with the search point ξ_i. The system model information in this example includes detailed system model information and abstract system model information. The constraint conditions include constraint conditions relating to the task and constraint conditions relating to the operation of the robot.

The optimal control problem (OCP) is the problem of finding the control parameter α that minimizes the evaluation value of the level set function g(x(T), ξ), which indicates the possibility of reaching the target state χ_d within a predetermined target time T, as shown in equation (6). In this application, optimization includes the meaning of searching for a value that is as appropriate as possible and is not limited to determining an absolutely optimal value. That is, optimization in this application does not exclude cases in which the level set function or another evaluation value temporarily changes to a less appropriate value during processing, or cases in which an absolutely optimal value is not obtained as the processing result. Likewise, minimization includes the meaning of searching for a value that is as small as possible and is not limited to determining an absolute minimum. The model of the system 50, that is, the system model, is given by equation (5).

The control input u to the robot 5 is the control output of the low-level controller π_L(x(t), α, β_g), which is determined by the system state x(t) at time t (t is a real number from 0 to T), the control parameter α, and the target state/known task parameter value β_g. The control parameter α is output from the high-level controller π_H(ξ), as shown in equation (1). As described above, the pair of the initial state x_s at time t = 0 and the target state/known task parameter value β_g corresponds to ξ_i.

The problem setting calculation unit 222 sets a solution search problem representing task execution by the robot 5, based on the problem setting information set by the system model setting unit 221. The solution search problem is a problem of finding a solution that satisfies the constraint conditions indicated in the problem setting information.
More specifically, the problem setting calculation unit 222 sets the above optimal control problem under the constraint conditions indicated in the problem setting information.

The problem setting calculation unit 222 solves the optimal control problem thus set, determines the smallest evaluation value found for the level set function g as the optimal value g_i^*, and calculates the control parameter α_i^* that gives the optimal value g_i^*. The problem setting calculation unit 222 outputs, to the data update unit 223, optimal control solution information indicating the optimal value g_i^* and the control parameter α_i^* calculated for the search point ξ_i. The problem setting calculation unit 222 may further determine the system state x(t) under the system model set based on the control parameter α_i^*, and include the determined system state x(t) in the optimal control solution information.
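
To make the role of the optimal control problem concrete, the sketch below solves a toy instance by exhaustive search over a small grid of control parameters. Everything in it is an assumption made for illustration: the one-dimensional model dx/dt = u, the proportional low-level controller u = α(β_g - x), the level set function |x(T) - β_g| - tol, and the grid search standing in for the solution search function. Equations (5) and (6) of the application are not reproduced here.

```python
def simulate(x_s, beta_g, alpha, horizon=1.0, dt=0.01):
    """Integrate the toy model dx/dt = u with forward Euler up to the target time."""
    x = x_s
    for _ in range(round(horizon / dt)):
        u = alpha * (beta_g - x)  # hypothetical low-level controller pi_L(x, alpha, beta_g)
        x += dt * u
    return x


def level_set_value(x_final, beta_g, tol=0.05):
    """Toy g: zero or negative when the final state is within tol of the target."""
    return abs(x_final - beta_g) - tol


def solve_ocp(xi, alpha_grid):
    """Return (g_star, alpha_star), the smallest g found on the grid and its alpha."""
    x_s, beta_g = xi
    best = None
    for alpha in alpha_grid:
        g = level_set_value(simulate(x_s, beta_g, alpha), beta_g)
        if best is None or g < best[0]:
            best = (g, alpha)
    return best


# Search point: start at 0.0, target 1.0.
g_star, alpha_star = solve_ocp((0.0, 1.0), [0.5, 1.0, 2.0, 4.0, 8.0])
```

Consistent with the meaning of optimization stated earlier, the grid search returns the best value it finds, not a guaranteed global minimum.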

The data update unit 223 updates the training data D_opt for the high-level controller π_H and the level set function so that it includes the set of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^* indicated in the optimal control solution information input from the problem setting calculation unit 222. The training data D_opt forms a data set in which the pair of the optimal value g_i^* and the control parameter α_i^* is accumulated for each search point ξ_i, and is also referred to as the acquired data set.

The learning setting unit 230 learns the level set function g^ using the acquired data set D_opt, and sets the prediction accuracy evaluation function J_g^ and the controller learning evaluation function J_h based on the level set function g^ obtained by learning. The learning setting unit 230 uses the prediction accuracy evaluation function J_g^ to determine whether learning of the level set function needs to be continued. As described above, the level set function g^ is a function for calculating an evaluation value relating to the possibility of reaching the target state for a combination of the system state and the target state/known task parameter value. The prediction accuracy evaluation function J_g^ is a function for evaluating the prediction accuracy of the level set function g^ and, in turn, the prediction accuracy of the high-level controller π_H. The prediction accuracy is used to determine whether learning of the level set function g^ needs to be continued. The controller learning evaluation function J_h is a function used as the loss function in learning the high-level controller π_H.

The level set function learning unit 231 learns the level set function g^ using the acquired data set D_opt as training data. A model of the level set function g^, that is, its functional form, is set in advance in the level set function learning unit 231. In learning the level set function g^, the level set function learning unit 231 searches, for example, for the parameters of the level set function g^ that minimize a level set function learning evaluation function. As the level set function learning evaluation function, it is possible to use, for example, the sum of squares of the absolute values of the differences between the evaluation value g^(ξ_i, α_i) of the level set function g^ and the target value g_i for each search point ξ_i that is an element of the acquired data set D_opt, as shown in equation (7). In equation (7), E_{(ξ_i, g_i, α_i)∈D_opt}[...] denotes the expected value of the evaluation value ... over the acquired data set D_opt, whose elements are sets of a search point ξ_i, a target value g_i, and a control parameter α_i.
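
The squared-error criterion described above can be written as a short function. This is a sketch, not a reproduction of equation (7): it assumes that the expectation over D_opt is taken as a plain average, that each element of D_opt is a tuple (ξ_i, g_i, α_i), and that the control parameter is a scalar. The name `level_set_loss` is hypothetical.

```python
def level_set_loss(g_hat, dataset):
    """Average squared difference between g_hat(xi, alpha) and the target value g."""
    if not dataset:
        raise ValueError("D_opt is empty")
    total = 0.0
    for xi, g_target, alpha in dataset:
        total += (g_hat(xi, alpha) - g_target) ** 2
    return total / len(dataset)


# Hypothetical model and a two-element acquired data set.
g_hat = lambda xi, alpha: xi[0] + alpha
d_opt = [((1.0, 0.0), 2.0, 1.0), ((0.0, 0.0), 1.0, 0.0)]
loss = level_set_loss(g_hat, d_opt)  # (0**2 + 1**2) / 2
```

Learning the level set function then amounts to searching for the model parameters that make this value small, for example by gradient descent.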

Whether the target value g_i takes a negative value depends on whether the system state x(t) derived from the search point ξ_i is regarded as able to reach the target state. The level set function learning unit 231 sets the level set function g^ determined by learning in the prediction accuracy evaluation function setting unit 232 and the controller learning evaluation function setting unit 234.

Therefore, for each search point ξ_i that is an element of the acquired data set D_opt, the level set function learning unit 231 may determine whether the system state x(t) can reach the target state, and add the result of that determination to the acquired data set D_opt.
For example, the level set function learning unit 231 may add a smaller target value g_i for a search point ξ_i whose final system state x(T) at time T is closer to the target state. In addition, for search points ξ_i whose system state x(t) can reach the target state, the level set function learning unit 231 may add a smaller target value g_i for a search point ξ_i that reaches the target state earlier.

The prediction accuracy evaluation function setting unit 232 refers to the acquired data set D_opt and sets the prediction accuracy evaluation function J_g^ based on the level set function g^ set by the level set function learning unit 231. The prediction accuracy evaluation function setting unit 232 determines, for example, the variance σ_g^(ξ_i, α_i) of the evaluation value of the level set function g^ for each search point ξ_i as the evaluation value of the prediction accuracy evaluation function J_g^. The larger the variance σ_g^(ξ_i, α_i), the more the result of determining whether the target state can be reached may differ for each search point ξ_i, so the variance serves as an index indicating that the prediction accuracy of the level set function g^ for the reachability of the target state is low. In addition, because the system state is optimized in reliance on the level set function g^, the variance σ_g^(ξ_i, α_i) can also be used as an index of the prediction accuracy of the system state predicted by the high-level controller π_H, with a larger value indicating lower accuracy. The prediction accuracy evaluation function setting unit 232 sets the evaluation value of the set prediction accuracy evaluation function J_g^ in the prediction accuracy evaluation unit 233.
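
The text does not say how the mean and variance of the level set function are obtained. One common way to get such quantities, shown here purely as an assumption, is to keep several independently fitted models (an ensemble) and take the mean and variance of their predictions; a Gaussian process regressor would be another option. The name `predictive_stats` is hypothetical.

```python
from statistics import mean, pvariance


def predictive_stats(ensemble, xi, alpha):
    """Return (mean, variance) of the ensemble members' predictions at (xi, alpha)."""
    predictions = [g(xi, alpha) for g in ensemble]
    return mean(predictions), pvariance(predictions)


# Two hypothetical ensemble members that disagree about reachability.
ensemble = [lambda xi, alpha: 1.0, lambda xi, alpha: 3.0]
mu, sigma = predictive_stats(ensemble, (0.0, 1.0), 0.5)
```

A large variance at a search point indicates that the members disagree there, which is the situation in which acquiring further training data is most informative.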

The prediction accuracy evaluation unit 233 uses the evaluation value of the prediction accuracy evaluation function J_g^ set by the prediction accuracy evaluation function setting unit 232 to determine whether learning of the level set function g^ currently being learned needs to be continued. For example, the prediction accuracy evaluation unit 233 determines whether learning needs to be continued based on whether the evaluation value of the prediction accuracy evaluation function J_g^ is greater than a preset judgment threshold for the prediction accuracy evaluation function. The prediction accuracy evaluation unit 233 may additionally take the learning conditions of the level set function g^ into account. For example, the prediction accuracy evaluation unit 233 determines whether learning needs to be continued according to whether the amount of change from the level set function g^ obtained based on the previous search point ξ_{i-1} to the current level set function g^ is equal to or greater than a predetermined reference amount of change.
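
The two criteria can be combined as in the sketch below. The direction of the threshold test is an assumption: the sketch treats a value of J_g^ above the threshold (for instance, a large predictive variance) as a sign that more data is needed, and treats a change in g^ at or above the reference amount as a sign that learning has not yet settled. The function name and parameter names are hypothetical.

```python
def needs_more_learning(j_value, j_threshold, g_change=None, change_threshold=None):
    """Decide whether to raise the learning continuation flag."""
    if j_value > j_threshold:
        return True  # evaluation value still above the judgment threshold
    if g_change is not None and change_threshold is not None:
        # Optional learning condition: g_hat is still changing noticeably.
        return g_change >= change_threshold
    return False


flag_a = needs_more_learning(0.8, 0.5)
flag_b = needs_more_learning(0.2, 0.5, g_change=0.01, change_threshold=0.1)
```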

When it is determined that learning needs to be continued, the prediction accuracy evaluation unit 233 outputs a learning continuation flag indicating this to the search point set setting unit 210. When the learning continuation flag is input from the prediction accuracy evaluation unit 233, the search point selection unit 212 selects a new search point ξ_i from the acquired data set D_opt and outputs the selected search point ξ_i to the data acquisition unit 220. Learning of the level set function g^ is thereby continued based on the new search point ξ_i.

When it is determined that learning does not need to be continued, the prediction accuracy evaluation unit 233 does not output the learning continuation flag to the search point set setting unit 210; instead, it causes the level set function learning unit 231 to end learning of the level set function g^ and causes the high-level controller learning unit 240 to end learning of the high-level controller π_H.
Thereafter, the level set function learning unit 231 sets the level set function g^ as of the end of learning in the motion planning unit 34 of the robot controller 3. In addition, the high-level controller learning unit 240 sets the parameters of the high-level controller π_H as of the end of learning in the motion planning unit 34 and the high-level control unit 35.

The controller learning evaluation function setting unit 234 determines the controller learning evaluation function J_h based on the evaluation value of the level set function g^ set by the level set function learning unit 231 and the high-level controller π_H set by the high-level controller learning unit 240. The controller learning evaluation function setting unit 234 sets the determined controller learning evaluation function J_h in the high-level controller learning unit 240. For example, as shown in equation (8), the controller learning evaluation function setting unit 234 determines, as the controller learning evaluation function J_h, the weighted sum of the mean μ_g^ of the level set function g^ (predicted mean), the variance σ_g^ of the level set function g^ (predicted variance), and the sum of squared differences |π_H(ξ_i) - α_i|^2 between the control output of the high-level controller π_H and the control parameter α_i. In equation (8), k and λ denote the weight parameters of the variance σ_g^ and the sum of squared differences |π_H(ξ_i) - α_i|^2, respectively. The weight parameters k and λ are each a predetermined real value equal to or greater than 0.

As exemplified in equation (8), including the mean μ_g^ of the level set function g^ as a component of the controller learning evaluation function J_h causes the high-level controller π_H to be learned so that the target state can be reached. Including the variance σ_g^ of the level set function g^ as a component of the controller learning evaluation function J_h causes the high-level controller π_H to be learned so that the target state can be reached stably. Including the sum of squared differences |π_H(ξ_i) - α_i|^2 as a component of the controller learning evaluation function J_h causes the high-level controller π_H to be learned so that the control parameter α is more readily obtained as its control output.
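
The three-term weighted sum described above can be sketched as follows. Equation (8) itself is not reproduced in this section, so the exact form is an assumption: the mean and variance are evaluated at the controller's own output π_H(ξ_i), the terms are averaged over D_opt, and the control parameter is a scalar. The names `controller_loss`, `g_mean`, and `g_var` are hypothetical.

```python
def controller_loss(pi_h, g_mean, g_var, dataset, k=1.0, lam=1.0):
    """Average of mu + k * sigma + lam * (pi_H(xi) - alpha*)**2 over D_opt."""
    if k < 0 or lam < 0:
        raise ValueError("weights k and lambda must be 0 or greater")
    if not dataset:
        raise ValueError("D_opt is empty")
    total = 0.0
    for xi, _g_star, alpha_star in dataset:
        alpha = pi_h(xi)  # control parameter proposed by the high-level controller
        total += (
            g_mean(xi, alpha)                    # reach the target state
            + k * g_var(xi, alpha)               # reach it stably
            + lam * (alpha - alpha_star) ** 2    # stay close to the OCP solution
        )
    return total / len(dataset)


j_h = controller_loss(
    pi_h=lambda xi: 2.0,
    g_mean=lambda xi, alpha: -1.0,
    g_var=lambda xi, alpha: 0.5,
    dataset=[((0.0, 1.0), -0.2, 1.0)],
    k=2.0,
    lam=3.0,
)  # -1.0 + 2.0 * 0.5 + 3.0 * (2.0 - 1.0)**2
```

Setting k or λ to 0 switches off the corresponding term, which matches the statement that the weights may be 0.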

The high-level controller learning unit 240 refers to the acquired data set D_opt as training data and learns the high-level controller π_H using the controller learning evaluation function J_h. A model of the high-level controller π_H is set in advance in the high-level controller learning unit 240. In learning the high-level controller π_H, the high-level controller learning unit 240 searches, for example, for the parameters of the high-level controller π_H that minimize the controller learning evaluation function J_h. The high-level controller learning unit 240 sets the high-level controller π_H obtained by learning in the problem setting calculation unit 222 and the controller learning evaluation function setting unit 234.

Next, an example of the data flow in the skill learning unit 15 according to this embodiment will be described. Fig. 11 is a data flow diagram illustrating the data flow in the skill learning unit 15 according to this embodiment.
The search point set initialization unit 211 of the search point set setting unit 210 sets the search point set Ξ_check using the target parameter information stored in the storage device 2. For example, the search point set initialization unit 211 refers to the target parameter information, defines each possible combination of one initial state x_si and one target state/known task parameter value β_g as a search point ξ, and constructs the search point set Ξ_check containing N different search points ξ.

The search point selection unit 212 randomly selects one unprocessed search point ξ_i from the search point set Ξ_check constructed in the search point set setting unit 210. When the prediction accuracy evaluation function J_g^ has been set by the prediction accuracy evaluation function setting unit 232, the search point selection unit 212 selects, from among the unprocessed search points ξ_i, the search point ξ_i that maximizes the evaluation value of the prediction accuracy evaluation function J_g^. The search point selection unit 212 outputs the selected search point ξ_i to the data acquisition unit 220.

The system model setting unit 221 of the data acquisition unit 220 constructs the problem setting information using the system model information stored in the storage device 2, other setting information, and the search point ξ_i input from the search point set setting unit 210. The problem setting information includes the system model information, the constraint conditions, the low-level controller, the target time, the solution search function, and information on the search point ξ_i. The system model setting unit 221 sets the constructed problem setting information in the problem setting calculation unit 222.

The problem setting calculation unit 222 uses the solution search function to solve the optimal control problem, as the solution search problem for the search point ξ_i, under the system model, constraint conditions, and low-level controller indicated in the problem setting information set by the system model setting unit 221. Using the solution search function, the problem setting calculation unit 222 determines the minimum evaluation value as the optimal value g_i^* and calculates the control parameter α_i^* that gives the optimal value g_i^*. The problem setting calculation unit 222 outputs, to the data update unit 223, optimal control solution information including the pair of the optimal value g_i^* and the control parameter α_i^* for the search point ξ_i.

The data update unit 223 updates the acquired data set D_opt so that it includes the set of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^* indicated in the optimal control solution information input from the problem setting calculation unit 222.

The level set function learning unit 231 of the learning setting unit 230 refers to the acquired data set D_opt updated by the data update unit 223 as training data, and learns the level set function g^ indicated in the preset level set function information. The level set function learning unit 231 sets the level set function g^ obtained by learning in the prediction accuracy evaluation function setting unit 232 and the controller learning evaluation function setting unit 234.

The prediction accuracy evaluation function setting unit 232 refers to the acquired data set D_opt and determines the prediction accuracy evaluation function J_g^ based on the level set function g^ set by the level set function learning unit 231. The prediction accuracy evaluation function setting unit 232 sets the evaluation value of the determined prediction accuracy evaluation function J_g^ in the prediction accuracy evaluation unit 233.

The prediction accuracy evaluation unit 233 uses the evaluation value of the prediction accuracy evaluation function J_g^ set by the prediction accuracy evaluation function setting unit 232 to determine whether learning of the level set function g^ currently being learned needs to be continued. When it determines that learning needs to be continued, the prediction accuracy evaluation unit 233 outputs a learning continuation flag indicating this to the search point set setting unit 210.

The controller learning evaluation function setting unit 234 determines the controller learning evaluation function J_h based on the evaluation value of the level set function g^ set by the level set function learning unit 231 and the high-level controller set by the high-level controller learning unit 240. The controller learning evaluation function setting unit 234 sets the determined controller learning evaluation function J_h in the high-level controller learning unit 240.

The high-level controller learning unit 240 refers to the acquired data set D_opt as training data and learns the high-level controller π_H using the controller learning evaluation function J_h. The high-level controller learning unit 240 sets the high-level controller π_H obtained by learning in the problem setting calculation unit 222 and the controller learning evaluation function setting unit 234.
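
The data flow described above can be summarized as a single loop. The sketch below only fixes the order of operations; the solver, the two learning routines, and the evaluation function are passed in as callables, and all of their names are hypothetical. The stopping rule (stop once the largest remaining evaluation value is at or below a threshold) is one possible reading of the continuation decision, not a rule stated in the text.

```python
def skill_learning_loop(search_points, solve_ocp, fit_level_set, fit_controller,
                        accuracy_of, j_threshold):
    """Acquire data and learn until J drops to the threshold or no points remain."""
    remaining = list(search_points)
    d_opt = []                       # acquired data set D_opt
    g_hat = pi_h = None
    while remaining:
        if g_hat is None:
            xi = remaining[0]        # stand-in for the initial random pick
        else:
            xi = max(remaining, key=lambda p: accuracy_of(g_hat, pi_h, p))
        remaining.remove(xi)
        g_star, alpha_star = solve_ocp(xi)        # problem setting calculation
        d_opt.append((xi, g_star, alpha_star))    # data update
        g_hat = fit_level_set(d_opt)              # level set function learning
        pi_h = fit_controller(d_opt, g_hat)       # high-level controller learning
        j_max = max((accuracy_of(g_hat, pi_h, p) for p in remaining), default=0.0)
        if j_max <= j_threshold:                  # no continuation flag raised
            break
    return d_opt, g_hat, pi_h


# Stub components: the evaluation value shrinks as data accumulates.
points = [(0.1 * i, 1.0) for i in range(5)]
d_opt, g_hat, pi_h = skill_learning_loop(
    points,
    solve_ocp=lambda xi: (-0.1, xi[0]),
    fit_level_set=lambda data: len(data),
    fit_controller=lambda data, g: (lambda xi: 0.0),
    accuracy_of=lambda g, pi, p: 1.0 / g,
    j_threshold=0.4,
)
```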

(7) Processing Flow
Next, an example of the learning process performed by the skill learning unit 15 according to this embodiment will be described. Fig. 12 is a flowchart illustrating the learning process according to this embodiment.
(Step S101) The search point set initialization unit 211 of the search point set setting unit 210 sets the search point set Ξ_check using the target parameter information stored in the storage device 2. The process then proceeds to loop L11, and the data acquisition/learning process starts.

(Step S102) The search point selection unit 212 selects, from the search point set Ξ_check, the search point ξ_i for which data is to be acquired next.
(Step S103) The system model setting unit 221 sets problem setting information, including the system model information, in the problem setting calculation unit 222 based on the selected search point ξ_i.
(Step S104) The problem setting calculation unit 222 solves the optimal control problem based on the problem setting information and the high-level controller, and calculates the optimal value g_i^* of the evaluation value that uses the level set function as the evaluation function, together with the control parameter α_i^* that gives the optimal value g_i^*.
(Step S105) The data update unit 223 updates the acquired data set D_opt by adding the set of the search point ξ_i being processed, the optimal value g_i^*, and the control parameter α_i^*.

(Step S106) The level set function learning unit 231 learns the level set function g^ using the acquired data set D_opt.
(Step S107) The controller learning evaluation function setting unit 234 determines the controller learning evaluation function J_h based on the level set function g^ and the high-level controller π_H, and sets it in the high-level controller learning unit 240.
(Step S108) The high-level controller learning unit 240 refers to the acquired data set D_opt and learns the high-level controller π_H using the controller learning evaluation function J_h. The high-level controller learning unit 240 sets the learned high-level controller π_H in the problem setting calculation unit 222 and the controller learning evaluation function setting unit 234.

(Step S109) The prediction accuracy evaluation function setting unit 232 determines, based on the level set function ĝ, a prediction accuracy evaluation function J_ĝ for evaluating the prediction accuracy of the level set function ĝ and the high-level controller π_H. The prediction accuracy evaluation function setting unit 232 sets the evaluation value of the determined prediction accuracy evaluation function J_ĝ in the prediction accuracy evaluation unit 233 and the search point set setting unit 210. The evaluation value of the prediction accuracy evaluation function J_ĝ can be referred to in the processing of step S102 for the next search point ξ_{i+1}.

(Step S110) The prediction accuracy evaluation unit 233 uses the evaluation value of the prediction accuracy evaluation function J_ĝ and a predetermined learning condition to determine whether learning of the level set function ĝ needs to continue.
(Step S111) When it is determined that learning needs to continue (step S111: YES), the prediction accuracy evaluation unit 233 outputs a learning continuation flag to the search point set setting unit 210, and the process returns to step S102 to repeat loop L11.
The prediction accuracy evaluation unit 233 marks the current search point ξ_i as processed and increments the number of processed search points by 1. If the number of processed search points has not reached N, loop L11 is repeated for the next search point. When the number of processed search points reaches N, the process exits loop L11 and the data acquisition/learning processing ends.
When it is determined that learning need not continue (step S111: NO), the process immediately exits loop L11 and the data acquisition/learning processing ends. The process in FIG. 12 then ends. The level set function learning unit 231 sets the level set function ĝ in the motion planning unit 34 of the robot controller 3. The high-level controller learning unit 240 sets the parameters of the high-level controller π_H in the motion planning unit 34 and the high-level control unit 35.
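The loop of steps S101 to S111 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the application: the callables `solve`, `fit_g`, `fit_pi`, and `accuracy`, the tolerance `tol`, and the toy one-dimensional problem are all hypothetical stand-ins, and the stopping rule (stop once the accuracy measure is at or below `tol`) is only an assumed form of the predetermined learning condition.

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (xi_i, g_i^*, alpha_i^*)


def learn_skill(search_points, solve, fit_g, fit_pi, accuracy, tol):
    """Sketch of loop L11 (steps S102-S111); every callable is a hypothetical stand-in."""
    data: List[Sample] = []                      # acquired data set D_opt
    g_hat, pi_h = None, None
    for xi in search_points:                     # S102: at most N = len(search_points)
        g_star, alpha_star = solve(xi, pi_h)     # S103-S104: solve optimal control problem
        data.append((xi, g_star, alpha_star))    # S105: update D_opt
        g_hat = fit_g(data)                      # S106: learn level set function
        pi_h = fit_pi(data, g_hat)               # S107-S108: learn high-level controller
        if accuracy(g_hat, pi_h, data) <= tol:   # S109-S111: assumed stopping rule
            break
    return g_hat, pi_h, data


# Toy 1-D problem (assumption): the optimum is alpha* = 0.5 * xi with g* = |xi| - 2.
def toy_solve(xi, pi_h):
    return abs(xi) - 2.0, 0.5 * xi


def toy_fit_g(data):
    # Nearest-neighbour lookup over the stored samples, used as a trivial "model".
    def g_hat(xi, alpha):
        nearest = min(data, key=lambda s: (s[0] - xi) ** 2 + (s[2] - alpha) ** 2)
        return nearest[1]
    return g_hat


def toy_fit_pi(data, g_hat):
    # Least-squares slope through the origin: alpha ~= k * xi.
    k = sum(x * a for x, _, a in data) / sum(x * x for x, _, _ in data)
    return lambda xi: k * xi


def toy_accuracy(g_hat, pi_h, data):
    # Worst-case controller prediction error on the acquired data.
    return max(abs(pi_h(x) - a) for x, _, a in data)


g_hat, pi_h, d_opt = learn_skill([1.0, 2.0, 3.0], toy_solve, toy_fit_g,
                                 toy_fit_pi, toy_accuracy, tol=1e-9)
```

In this toy case the linear controller fits the first sample exactly, so the loop exits after a single search point. With a harder problem the loop would keep acquiring data until the accuracy criterion is met or the search points are exhausted.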

(8) Motion Planning
In planning the motion of the robot 5, the motion planning unit 34 according to this embodiment calculates, as an evaluation value ĝ^*, the value ĝ(ξ, π_H(ξ)) of the learned level set function ĝ for a search point ξ and the control parameter π_H(ξ) calculated from that search point. The control parameter π_H(ξ) is obtained as the output of the learned high-level controller π_H for the search point ξ. In general, the search point ξ of a newly set motion plan may differ from the search points ξ_i used in learning the level set function ĝ and the like, so whether the target state can be reached is not known in advance. The motion planning unit 34 therefore determines the reachability of the target state based on the calculated evaluation value. As described above, the search point ξ is a parameter set that includes, as elements, the initial state x_s of the task to be executed and the target state/known task parameter value β_g. The motion planning unit 34 can determine whether the target state can be reached according to whether the evaluation value ĝ^* is 0 or less.
The motion planning unit 34 may further determine whether the control parameter α (= π_H(ξ)) obtained from the search point ξ using the high-level controller π_H satisfies a predetermined constraint condition, and may determine that the target state can be reached when the evaluation value ĝ^* is 0 or less and the constraint condition is satisfied.
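The reachability test described above amounts to a simple predicate. The following Python sketch assumes a scalar search point and control parameter; the functions `toy_g_hat`, `toy_pi_h`, and `within_limit` are hypothetical examples, not the learned functions of the application.

```python
def is_reachable(xi, g_hat, pi_h, constraints=()):
    """True when g_hat(xi, pi_h(xi)) <= 0 and every constraint accepts alpha."""
    alpha = pi_h(xi)
    return g_hat(xi, alpha) <= 0.0 and all(ok(alpha) for ok in constraints)


# Hypothetical stand-ins for the learned level set function and controller.
def toy_g_hat(xi, alpha):
    return (alpha - xi) ** 2 - 0.25   # <= 0 when alpha is within 0.5 of xi


def toy_pi_h(xi):
    return xi + 0.1


def within_limit(alpha):
    return abs(alpha) <= 1.0          # assumed constraint on the control parameter


reachable_a = is_reachable(0.5, toy_g_hat, toy_pi_h, [within_limit])  # alpha = 0.6
reachable_b = is_reachable(1.0, toy_g_hat, toy_pi_h, [within_limit])  # alpha = 1.1
```

Here `reachable_b` is false only because the assumed constraint rejects the control parameter; the level set value alone would have accepted it.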

When the motion planning unit 34 determines that the target state can be reached, it outputs the control parameter π_H(ξ) obtained from the search point ξ to the low-level control unit 36 as the control parameter α. The motion of the robot 5 is thus controlled based on a search point ξ from which the target state has been determined to be reachable.

When it determines that the target state cannot be reached, the motion planning unit 34 defines the sum π_H(ξ)+Δα, obtained by adding an adjustment amount Δα to the control parameter π_H(ξ), as the adjusted control parameter α', and calculates the value ĝ(ξ, α') of the level set function ĝ for the search point ξ and the adjusted control parameter α' as the evaluation value ĝ^*. As exemplified in equation (9), the motion planning unit 34 searches for a control parameter α' for which the evaluation value ĝ^* is 0 or less, as a control parameter that makes the target state reachable.

The motion planning unit 34 may further determine whether the control parameter α' obtained by the search satisfies a predetermined constraint condition and decide accordingly whether to adopt it. When it decides not to adopt the control parameter α', it may reject that control parameter and search again for a new control parameter α' that makes the evaluation value ĝ^* 0 or less and also satisfies the constraint condition.
The motion planning unit 34 outputs the obtained control parameter α' to the low-level control unit 36. Thus, even when it is determined that the target state cannot be reached from the search point ξ, the system state can be made to reach the target state by adjusting the control parameter that is the optimal solution for the search point ξ.

Although the search point ξ does not give the target state and the control parameter α' is not an optimal solution for the search point ξ, a control parameter α' that makes the target state reachable is searched for. For example, suppose that, in the operating environment of the robot 5, another moving object enters the planned travel path estimated from the control parameter that is the optimal solution for a certain search point ξ. Under this assumption, the motion planning unit 34 updates the search point ξ to one whose target state includes, as an element, a position sufficiently far from the predicted path of the other moving object.

The motion planning unit 34 then executes the above processing for the newly set search point ξ. That is, the motion planning unit 34 determines whether the target state can be reached for the set search point ξ and the control parameter α that gives its optimal solution. When it determines that the target state can be reached, it derives the control parameter that gives the optimal solution for that search point ξ. When it determines that the target state cannot be reached, it adjusts the control parameter that gives the optimal solution and searches for a control parameter that makes the target state reachable. After the moving object has passed through the planned travel path based on the original search point ξ, the motion planning unit 34 may set the search point ξ including the original target state again and execute the above processing for it. Changing the search point ξ in this way avoids contact or collision with the moving object and can improve the likelihood of reaching the target state.

Next, an example of motion planning according to this embodiment will be described. FIG. 13 is a flowchart illustrating motion planning according to this embodiment.
(Step S201) The motion planning unit 34 sets a search point ξ when planning the motion of the robot 5. As described above, the search point ξ indicates the initial state, the target state, and the task parameters of the control system.
(Step S202) The motion planning unit 34 applies the learned high-level controller π_H to the search point ξ and obtains the control output π_H(ξ) of the high-level controller π_H as the control parameter α. The motion planning unit 34 calculates the evaluation value ĝ^* of the level set function ĝ(ξ, π_H(ξ)) for the search point ξ and the control output π_H(ξ).

(Step S203) The motion planning unit 34 determines whether the evaluation value ĝ^* is 0 or less and the motion plan for the search point ξ satisfies a predetermined constraint condition. When both conditions are satisfied (step S203: YES), the process proceeds to step S204. When they are not (step S203: NO), the process proceeds to step S205.

(Step S204) The motion planning unit 34 outputs the control parameter π_H(ξ) to the low-level control unit 36 as the control parameter α. The motion planning unit 34 can thus have the robot 5 controlled using the control parameter α output from the high-level controller as is.
(Step S205) The motion planning unit 34 searches for an adjustment value Δα such that the level set function ĝ(ξ, π_H(ξ)+Δα) is 0 or less and the constraint condition is still satisfied when the adjusted control parameter π_H(ξ)+Δα is used.
(Step S206) The motion planning unit 34 outputs the adjusted control parameter π_H(ξ)+Δα obtained by the search to the low-level control unit 36 as the control parameter α'. The motion planning unit 34 thereby has the robot 5 controlled using the adjusted control parameter α'. The processing in FIG. 13 then ends.
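Steps S202 to S206 can be sketched as follows for a scalar control parameter. The search strategy is an assumption made for illustration: the text does not fix how Δα is searched, so this sketch simply tries a finite list of hypothetical candidates in order of increasing magnitude and returns `None` when none of them works. The toy functions at the bottom are likewise hypothetical.

```python
def plan_control_parameter(xi, g_hat, pi_h, satisfies, deltas):
    """Sketch of steps S202-S206; returns the parameter for the low-level controller."""
    alpha = pi_h(xi)                                        # S202
    if g_hat(xi, alpha) <= 0.0 and satisfies(xi, alpha):    # S203
        return alpha                                        # S204: use pi_H(xi) as is
    for delta in sorted(deltas, key=abs):                   # S205: assumed search order
        adjusted = alpha + delta
        if g_hat(xi, adjusted) <= 0.0 and satisfies(xi, adjusted):
            return adjusted                                 # S206: adjusted alpha'
    return None                                             # assumption: no candidate found


# Hypothetical stand-ins: a deliberately poor controller and a simple level set function.
def toy_g_hat(xi, alpha):
    return abs(alpha - xi) - 0.5


def toy_pi_h(xi):
    return 0.0


def toy_satisfies(xi, alpha):
    return alpha <= 2.0


candidates = [-1.0, -0.5, 0.5, 1.0, 1.5]
direct = plan_control_parameter(0.25, toy_g_hat, toy_pi_h, toy_satisfies, candidates)
adjusted = plan_control_parameter(1.0, toy_g_hat, toy_pi_h, toy_satisfies, candidates)
failed = plan_control_parameter(5.0, toy_g_hat, toy_pi_h, toy_satisfies, candidates)
```

Trying small adjustments first keeps the adjusted parameter close to the controller output, which is one plausible design choice; a gradient-based or sampling-based search over Δα would fit the same interface.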

Second Embodiment
Next, a second embodiment of the present application will be described, focusing mainly on the differences from the first embodiment. Features in common with the first embodiment are given the same reference numerals, and their description applies here unless otherwise specified.
FIG. 14 is a data flow diagram illustrating the data flow in the skill learning unit 15 according to this embodiment.
The data acquisition unit 220 according to this embodiment includes a first data update unit 223-1 and a second data update unit 223-2 in place of the data update unit 223.

As described above, the problem setting calculation unit 222 solves the optimal control problem using the level set function as the evaluation function, determines the minimum evaluation value as the optimal value g_i^*, and calculates the control parameter α_i^* that gives the optimal value g_i^*. However, the problem setting calculation unit 222 according to this embodiment also saves pairs consisting of one or more evaluation values g_i calculated during the computation before the optimal value g_i^* is obtained (sometimes referred to in this application as non-optimal solutions) and the control parameter α_i that gives each evaluation value g_i. The problem setting calculation unit 222 outputs optimal control solution information, which includes the set of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^*, to the first data update unit 223-1 and the second data update unit 223-2. The problem setting calculation unit 222 outputs non-optimal control solution information, which includes the set of the search point ξ_i, a non-optimal value g_i obtained during the computation, and the control parameter α_i, to the first data update unit 223-1. The first data update unit 223-1 is therefore provided with both the optimal control solution information and the non-optimal control solution information. In contrast, the second data update unit 223-2 is provided with the optimal control solution information but not with the non-optimal control solution information.

The first data update unit 223-1 accumulates the sets of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^* indicated in the optimal control solution information input from the problem setting calculation unit 222, together with the sets of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i indicated in the non-optimal control solution information, to form a first data set D_g. The first data set D_g differs from the acquired data set D_opt described above in that it additionally includes the sets of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i. The first data set D_g is used in place of the acquired data set D_opt for learning the level set function ĝ in the level set function learning unit 231 and for setting the prediction accuracy evaluation function J_ĝ in the prediction accuracy evaluation function setting unit 232.
Because the level set function ĝ can be learned from a larger number of pairs of search points ξ_i and control parameters α_i, a level set function ĝ that more accurately captures the dependence on the control parameter α, which is an explanatory variable, can be obtained.

The second data update unit 223-2 accumulates the sets of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^* indicated in the optimal control solution information input from the problem setting calculation unit 222, to form a second data set D_h. The second data set D_h is similar to the acquired data set D_opt described above. The second data set D_h is used in place of the acquired data set D_opt for learning the high-level controller π_H.
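The split between the two data sets can be sketched as follows. This is an illustrative assumption about the data layout only: the class `SkillDatasets`, the method `update`, and the representation of the solver history as a list of `(g, alpha)` pairs are hypothetical, and the entry with the smallest evaluation value is taken as the optimal solution, in line with the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (xi, g, alpha)


@dataclass
class SkillDatasets:
    d_g: List[Sample] = field(default_factory=list)  # optimal and non-optimal solutions
    d_h: List[Sample] = field(default_factory=list)  # optimal solutions only

    def update(self, xi: float, trace: List[Tuple[float, float]]) -> None:
        """trace: (g, alpha) pairs visited by the solver for search point xi."""
        g_star, alpha_star = min(trace, key=lambda pair: pair[0])
        # First data set: every evaluated pair, including the optimum.
        self.d_g.extend((xi, g, alpha) for g, alpha in trace)
        # Second data set: only the optimal solution.
        self.d_h.append((xi, g_star, alpha_star))


datasets = SkillDatasets()
datasets.update(1.0, [(0.8, 0.0), (0.1, 0.3), (-0.2, 0.5)])
```

After this call, `datasets.d_g` holds three samples for the level set function, while `datasets.d_h` holds the single optimal sample used for the high-level controller.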

Next, an example of the learning process performed by the skill learning unit 15 according to this embodiment will be described. FIG. 15 is a flowchart illustrating the learning process according to this embodiment. The process in FIG. 15 includes steps S101 to S103, S114 to S117, S107, S118, and S109 to S111. For steps S101 to S103, S107, and S109 to S111, the description of FIG. 12 applies. In the process in FIG. 15, step S103 is followed by step S114.

(Step S114) The problem setting calculation unit 222 solves the optimal control problem based on the problem setting information and the high-level controller, and calculates the optimal value g_i^* of the evaluation value, with the level set function as the evaluation function, and the control parameter α_i^* that gives the optimal value g_i^*. The problem setting calculation unit 222 saves pairs consisting of a non-optimal solution g_i calculated during the computation before the optimal value g_i^* is obtained and the control parameter α_i that gives the non-optimal solution g_i.

(Step S115) The first data update unit 223-1 updates the first data set D_g by adding the set of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^*, and the set of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i.
(Step S116) The second data update unit 223-2 updates the second data set D_h by adding the set of the search point ξ_i, the optimal value g_i^*, and the control parameter α_i^*.

(Step S117) The level set function learning unit 231 learns the level set function ĝ using the first data set D_g.
The process then proceeds to step S107. After step S107, the process proceeds to step S118.
(Step S118) The high-level controller learning unit 240 refers to the second data set D_h and learns the high-level controller π_H using the controller learning evaluation function J_h. The high-level controller learning unit 240 sets the learned high-level controller π_H in the problem setting calculation unit 222 and the controller learning evaluation function setting unit 234. The process then proceeds to step S109.

Experience replay (ER) may be applied in updating one or both of the first data set D_g by the first data update unit 223-1 and the second data set D_h by the second data update unit 223-2, and in learning the level set function ĝ or the high-level controller π_H. Experience replay is a technique in which multiple sets of transition information are stored in a storage area called a replay buffer (RB), sets are selected at random one at a time from the stored transition information, and the selected sets are used for learning in sequence. One set of transition information includes information such as the states before and after one state transition, the condition of the transition, and the change in the evaluation value of the evaluation function caused by the transition. In generating the first data set D_g, a data augmentation technique such as hindsight experience replay may be applied using the non-optimal solutions. Hindsight experience replay is described in, for example, the following document.
M. Andrychowicz, et al.: “Hindsight Experience Replay”, Proc. of NIPS (2017)
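A replay buffer of the kind described above can be sketched in a few lines of Python. This sketch covers only plain uniform experience replay, not hindsight experience replay; the class name `ReplayBuffer`, its fixed capacity, and the dictionary layout of a transition are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)   # oldest entries are dropped when full
        self._rng = random.Random(seed)

    def add(self, transition):
        self._items.append(transition)

    def sample(self, batch_size):
        # Draw without replacement; never ask for more than is stored.
        count = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), count)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    # Hypothetical transition record: search point, control parameter, evaluation value.
    buffer.add({"xi": float(step), "alpha": 0.1 * step, "g": -1.0})
batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent transitions remain after five insertions, so every sampled entry comes from steps 2 to 4.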

In the above example, the second data set D_h does not include sets of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i, but this is not limiting. The second data set D_h may include sets of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i. In that case, however, the sets of the search point ξ_i, the non-optimal value g_i, and the control parameter α_i included in the second data set D_h are a subset of those included in the first data set D_g.

Third Embodiment
Next, a third embodiment of the present application will be described, focusing mainly on the differences from the first embodiment. Features in common with the first or second embodiment are given the same reference numerals, and their description applies here unless otherwise specified.
FIG. 16 is a data flow diagram illustrating the data flow in the skill learning unit 15 according to this embodiment. The skill learning unit 15 according to this embodiment includes a system model learning unit 250.

While the operation of the robot 5 is being controlled in the target system using a high-level controller that has been learned or is being learned, the system model learning unit 250 acquires data indicating the control result (referred to in this application as "control result data"). In this control, the high-level controller uses a search point that includes, as an element, target parameter information for the skill to be controlled. The control result data includes at least the search point at each time, the control parameter output from the high-level controller, and the system state resulting from the control, associated with one another. The control result data is associated with the target parameter information for the skill to be controlled. In learning the system model, the system model learning unit 250 can determine, based on the control result data, the parameters of a system model of the target system that takes the control parameter and the target parameters indicated by the search point as input and outputs the system state at each time. The system model learning unit 250 sets system model information indicating the parameters of the learned system model in the system model setting unit 221. The system model learning unit 250 outputs the acquired control result data to the data update unit 223.

The system model learning unit 250 may acquire the evaluation value of the level set function at each time from the system 50 as part of the control result, and may include the acquired evaluation value in the control result data in association with the control parameter, the search point, and the system state at that time. The evaluation value of the level set function is calculated in the course of solving the optimal control problem in the motion control of the robot 5. The data update unit 223 may include, in the acquired data set D_opt, control solution information that has the search point and the control parameter contained in the control result data as input and the evaluation value as output. In that case, the control solution information derived from the control result data is used for learning the level set function in the level set function learning unit 231 and the high-level controller in the high-level controller learning unit 240.

Next, an example of the functional configuration of the system model learning unit 250 according to this embodiment will be described. FIG. 17 is a schematic block diagram showing an example of the functional configuration of the system model learning unit 250 according to this embodiment.
The system model learning unit 250 includes a high-level controller evaluation unit 250a, a low-level controller for skill learning 250b, a data processing unit 250c, a data collection task management unit 250e, a transition data storage unit 250f, and a system model learning processing unit 250g.

The parameters of the high-level controller learned by the high-level controller learning unit 240 are set in the high-level controller evaluation unit 250a. Using the learned high-level controller, the high-level controller evaluation unit 250a calculates the control parameter based on a search point derived from the target parameter information set by the data collection task management unit 250e. The high-level controller evaluation unit 250a outputs the calculated control parameter to the low-level controller for skill learning 250b.

The low-level controller for skill learning 250b calculates a control output based on the system state collected from the system 50, the control parameter input from the high-level controller evaluation unit 250a, and the target state/known task parameter value indicated in the target parameter information, and outputs the calculated control output to the system 50 to be controlled.
The data processing unit 250c forms the control result data by associating the control parameter used in controlling the system 50, the control point acquired from the data collection task management unit 250e, the evaluation value of the level set function, and the system state acquired from the system 50. The data processing unit 250c stores the resulting control result data in a control result storage unit 250d and the transition data storage unit 250f.

The control result storage unit 250d aggregates the control result data for each task. The control result data aggregated and stored in the control result storage unit 250d is read by the data update unit 223. Each time new control result data is acquired, the control result storage unit 250d may output it to the data update unit 223.
The data collection task management unit 250e accumulates the control result data for each time, and the accumulated control result data forms transition data for each skill.
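One way to picture the control result data and the transition data built from it is the following Python sketch. The record layout and all field names are assumptions made for illustration; the text only states which quantities are associated with one another.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ControlResult:
    """One control result record for a single time step (hypothetical layout)."""
    time: int
    search_point: Tuple[float, ...]
    control_parameter: Tuple[float, ...]
    system_state: Tuple[float, ...]
    level_set_value: Optional[float] = None   # optional evaluation value


def to_transitions(records: List[ControlResult]):
    """Pair consecutive records: (state, control parameter, search point, next state)."""
    ordered = sorted(records, key=lambda r: r.time)
    return [
        (cur.system_state, cur.control_parameter, cur.search_point, nxt.system_state)
        for cur, nxt in zip(ordered, ordered[1:])
    ]


records = [
    ControlResult(1, (1.0,), (0.5,), (0.2,)),
    ControlResult(0, (1.0,), (0.5,), (0.0,)),
    ControlResult(2, (1.0,), (0.5,), (0.36,), level_set_value=-0.1),
]
transitions = to_transitions(records)
```

Sorting by time before pairing means the records may arrive in any order, and each transition has the shape needed to fit a model that predicts the next system state.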

The system model learning processing unit 250g uses the transition data accumulated in the transition data storage unit 250f to learn the system model indicated by preset system model information. For each skill, the system model learning processing unit 250g determines the parameters of a system model that takes the control parameter indicated in the transition data and the target parameter information indicated by the search point as input and estimates, as output, the system state of the target system at each time. The system model learning processing unit 250g sets system model information indicating the learned system model in the system model setting unit 221.

The system model learning processing unit 250g can use the system model exemplified by equation (5). If a sufficiently small unit time is set in this system model, the time derivative of the system state exemplified on the left side of equation (5) can be approximated by the amount of change from the system state at the current time to the system state one unit time later (i.e., at the next time). Therefore, the system model learning processing unit 250g can use, as the system model, a function that takes the system state at the current time, the control parameters, and the target parameter information as input and outputs the system state at the next time.
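The discretization described above can be illustrated with a minimal Python sketch. It assumes a forward-Euler update and a hypothetical toy dynamics function; the names `next_state` and `toy_dynamics`, the gain value, and the step size are illustrative assumptions, and equation (5) itself is not reproduced here.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
# Assumed signature: dynamics(state, control_params, target_params) -> time derivative of state.
Dynamics = Callable[[Vector, Vector, Vector], List[float]]


def next_state(f: Dynamics, x: Vector, u: Vector, target: Vector, dt: float) -> List[float]:
    """Approximate the state one unit time later as x + dt * dx/dt (forward Euler)."""
    dx = f(x, u, target)
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]


def toy_dynamics(x: Vector, u: Vector, target: Vector) -> List[float]:
    """Illustrative dynamics only: the state is pulled toward the target with gain u[0]."""
    return [u[0] * (ti - xi) for xi, ti in zip(x, target)]


# Roll the next-state function forward for three unit times.
state = [0.0]
for _ in range(3):
    state = next_state(toy_dynamics, state, u=[2.0], target=[1.0], dt=0.1)
```

A learned system model would take the place of `toy_dynamics`; the smaller the unit time `dt`, the closer the one-step change is to the continuous-time behavior.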

Furthermore, the learning of the system model by the system model learning unit 250 may be performed separately from the learning of the level set function and the high-level controller, or may be performed for each search point used in that learning. As described above, a search point includes target parameter information as an element. In the learning of the system model, search points that give the target parameter information used in the operation of the system 50 may be used instead of search points belonging to the preset search point set Ξ_check.

Next, an example of the learning process by the learning device 1 according to this embodiment will be described. FIG. 18 is a flowchart illustrating the learning process according to this embodiment. The processes of steps S301 and S302 are executed independently of the learning of the level set function and the high-level controller. The processes of steps S304 and S305 are executed for each search point ξ_i in association with the learning of the level set function and the high-level controller in step S303.

(Step S301) The system model learning unit 250 randomly acquires control result data from the system 50 operating in the real environment. Acquiring randomly includes, for example, acquiring during a randomly determined period, or randomly determining for each operation whether acquisition is required and acquiring when it is determined to be required. Either or both of the skill and the target parameters may differ from one operation to the next.
(Step S302) The system model learning unit 250 learns the system model for each skill based on the acquired control result data. The system model learning unit 250 sets system model information indicating the parameters of the learned system model in the system model setting unit 221. The system model learning unit 250 outputs the control result data acquired in the course of learning the system model to the data update unit 223. Thereafter, the process proceeds to loop L31, and the data acquisition/learning processing starts. Although the number of search points is N in the example of FIG. 18, the processing in loop L31 may instead be executed once for each operation of the system 50. At least one of the initial state, the task parameters, and the target parameter information that constitute a search point may differ for each operation.

(Step S303) The skill learning unit 15 learns the level set function and the high-level controller. The processing of this step may be the same as the processing of steps S102 to S111 of loop L11 illustrated in FIG. 12. However, when it is determined in step S111 that learning is not to be continued (step S111: NO), the prediction accuracy evaluation unit 233 proceeds to the processing of step S303 without changing the number of processed search points.

(Step S304) The system model learning unit 250 acquires control result data from the system 50 in which the operation of the robot 5 has been controlled in the real environment using the learned high-level controller.
(Step S305) The system model learning unit 250 learns the system model based on the acquired control result data. The system model learning unit 250 sets system model information indicating the parameters of the learned system model in the system model setting unit 221. The system model learning unit 250 outputs the control result data acquired in the course of learning the system model to the data update unit 223.
The prediction accuracy evaluation unit 233 marks the current search point ξ_i as processed and increments the number of processed search points by 1. When the number of processed search points has not reached N, the processing of loop L31 is repeated for the next search point. When the number of processed search points reaches N, the process exits loop L31 and the data acquisition/learning processing ends. The process of FIG. 18 then ends.
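The control flow of steps S301 to S305 and loop L31 can be sketched as follows. This is only a structural outline under stated assumptions: the function name `run_learning` and all callable parameters are hypothetical stand-ins for the units described above, and the stubs in the usage part merely record the order in which the steps run.

```python
from typing import Callable, List, Sequence


def run_learning(
    search_points: Sequence[object],
    collect_random_data: Callable[[], list],            # stands in for step S301
    fit_system_model: Callable[[list], None],           # stands in for steps S302 and S305
    learn_skill: Callable[[object], object],            # stands in for step S303
    run_controller: Callable[[object, object], list],   # stands in for step S304
) -> int:
    """Run the initial system-model fit, then loop L31 once per search point."""
    fit_system_model(collect_random_data())   # S301 -> S302
    processed = 0
    for xi in search_points:                  # loop L31
        controller = learn_skill(xi)          # S303: level set function and high-level controller
        result = run_controller(controller, xi)  # S304: control result data from the real system
        fit_system_model(result)              # S305: refit the system model
        processed += 1                        # mark the search point as processed
    return processed


# Usage with logging stubs (hypothetical; no real learning happens here).
log: List[str] = []
n_processed = run_learning(
    search_points=["xi_1", "xi_2"],
    collect_random_data=lambda: log.append("S301") or [],
    fit_system_model=lambda data: log.append("fit"),
    learn_skill=lambda xi: log.append("S303") or "pi_H",
    run_controller=lambda controller, xi: log.append("S304") or [],
)
```

The sketch fixes the number of iterations to the number of search points; as the text notes, the loop body may instead be executed once per operation of the system.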

The above description has focused mainly on the differences from the first embodiment. However, as with the skill learning unit 15 according to the second embodiment, the skill learning unit 15 according to this embodiment may include a first data update unit 223-1 and a second data update unit 223-2 in place of the data update unit 223. In that case, the first data set D_g generated by the first data update unit 223-1 is used for learning the level set function g^ in the level set function learning unit 231 and for setting the prediction accuracy evaluation function J_g^ in the prediction accuracy evaluation function setting unit 232. The second data set D_h generated by the second data update unit 223-2 is used for learning the high-level controller π_H. The processing of step S303 illustrated in FIG. 18 is then the same as the processing of steps S102, S103, S114 to S117, and S107 to S111 of loop L21 illustrated in FIG. 15.

The control result data generated by the system model learning unit 250 can be used for learning the level set function g^ and for setting the prediction accuracy evaluation function J_g^. In that case, the control result data generated by the system model learning unit 250 need not be used for learning the high-level controller π_H.

<Minimum configuration example>
Next, an example of a minimum configuration of the embodiment of the present application will be described.
FIG. 19 is a schematic block diagram showing an example of a minimum configuration of the learning device 1 according to the embodiment of the present application. The learning device 1 includes a level set function learning unit 231 and a high-level controller learning unit 240. The level set function learning unit 231 learns a level set function that takes as input motion parameters defining a motion goal and a motion environment of the robot together with control parameters of the robot, and outputs an evaluation value relating to the achievability of the motion goal based on the motion parameters and the control parameters. The high-level controller learning unit 240 learns, based on the level set function and the prediction accuracy of the control parameters, a high-level controller that determines control parameters for causing the robot to realize a target motion based on the motion parameters.

With this configuration, the high-level controller learning unit 240 learns a high-level controller for determining, based on the motion parameters used to control the robot, control parameters for realizing the target motion, and the level set function learning unit 231 learns a level set function for determining, based on the motion parameters and the control parameters, an evaluation value relating to the achievability of the motion goal. Therefore, the achievability of the motion goal for given motion parameters can be determined using the evaluation value that the learned level set function yields for those motion parameters. This makes motion planning for the robot with those motion parameters more efficient.

FIG. 20 is a schematic block diagram showing an example of the minimum configuration of a control device 3' according to an embodiment of the present application. The control device 3' includes a high-level control unit 35 and a motion planning unit 34. The high-level control unit 35 determines control parameters for realizing a target motion based on motion parameters that define the motion goal and motion environment of the robot. The motion planning unit 34 uses a level set function to calculate an evaluation value indicating the feasibility of the target motion based on the motion parameters and the control parameters. When the motion planning unit 34 determines from the evaluation value that the target motion is realizable, it uses the control parameters for motion control of the robot; when it determines from the evaluation value that the target motion is not realizable, it searches, based on the evaluation value, for control parameters that make the target motion realizable.

With this configuration, the high-level control unit 35 obtains control parameters based on the motion parameters, and the motion planning unit 34 evaluates the feasibility of the target motion for those motion parameters based on the evaluation value obtained from the motion parameters and the control parameters. When the target motion is determined to be realizable, the control parameters obtained by the high-level control unit 35 are used for motion control; when it is determined not to be realizable, the motion planning unit 34 searches, based on the evaluation value, for control parameters that make the target motion realizable. This facilitates motion planning so that the target motion for the motion parameters is realized wherever possible.
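The decision logic of the control device can be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the text: the target motion is treated as realizable when the evaluation value is at or below a threshold of 0, the search is a simple scan over a finite list of candidate control parameters, and all names (`plan_motion`, `toy_level_set`, `toy_controller`) are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Params = Sequence[float]


def plan_motion(
    motion_params: Params,
    high_level_controller: Callable[[Params], Params],
    level_set: Callable[[Params, Params], float],
    candidates: Iterable[Params],
    threshold: float = 0.0,
) -> Optional[Params]:
    """Return control parameters judged realizable, or None if the search fails.

    Assumption: realizable means level_set(motion_params, u) <= threshold.
    """
    u = high_level_controller(motion_params)
    if level_set(motion_params, u) <= threshold:
        return u  # the controller's proposal is used as is
    # Otherwise search the candidates for the lowest evaluation value.
    best = min(candidates, key=lambda c: level_set(motion_params, c), default=None)
    if best is not None and level_set(motion_params, best) <= threshold:
        return best
    return None


def toy_level_set(m: Params, u: Params) -> float:
    """Illustrative evaluation value: realizable when u[0] is within 0.5 of m[0]."""
    return abs(u[0] - m[0]) - 0.5


def toy_controller(m: Params) -> List[float]:
    """Illustrative controller that deliberately overshoots by 1.0."""
    return [m[0] + 1.0]


result = plan_motion([2.0], toy_controller, toy_level_set,
                     candidates=[[0.0], [1.5], [2.0]])
```

In this toy setting the controller's proposal is rejected and the search falls back to the candidate with the lowest evaluation value; a gradient-based search on the level set function would be another possible choice.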

As described above, an embodiment of the present application may be realized as a control system 100 including the learning device 1, the robot 5, and a high-level controller.
An embodiment of the present application may be realized as a program for causing a computer to function as the learning device 1, or as a computer-readable non-transitory storage medium storing the program. The computer may include a processor or other integrated circuit and the non-transitory storage medium, and may realize the functions of the learning device 1 by executing the processes specified by the instructions constituting the program.

The learning device 1 also includes a search point selection unit 212 that selects one search point, based on a prediction accuracy evaluation function indicating the prediction accuracy of the level set function, from a search point set including a plurality of predetermined search points. Each search point is a pair of motion parameters and control parameters.
With this configuration, the prediction accuracy of the level set function is evaluated for known pairs of motion parameters and control parameters, and the motion parameters and control parameters used to train the level set function are selected based on the evaluated prediction accuracy. By preferentially selecting motion parameters and control parameters for which the prediction accuracy is low, and for which the convergence speed in the learning process is therefore high, training of the level set function can be made more efficient.
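The selection rule can be illustrated with a short Python sketch. It assumes that a larger value of the prediction accuracy evaluation function means lower prediction accuracy (for example, a larger predictive uncertainty); that sign convention, the function name `select_search_point`, and the numeric values are assumptions for illustration only.

```python
from typing import Callable, Dict, Iterable, Tuple

# A search point is a pair (motion parameters, control parameters).
SearchPoint = Tuple[Tuple[float, ...], Tuple[float, ...]]


def select_search_point(
    candidates: Iterable[SearchPoint],
    accuracy_eval: Callable[[SearchPoint], float],
) -> SearchPoint:
    """Return the candidate with the largest evaluation value.

    Assumption: a larger accuracy_eval value means lower prediction accuracy,
    so the least accurately predicted point is explored first.
    """
    return max(candidates, key=accuracy_eval)


# Hypothetical evaluation values for three candidate search points.
scores: Dict[SearchPoint, float] = {
    ((0.0,), (0.1,)): 0.05,
    ((0.5,), (0.3,)): 0.40,
    ((1.0,), (0.2,)): 0.15,
}
chosen = select_search_point(scores.keys(), scores.__getitem__)
```

If the evaluation function were defined so that larger values mean higher accuracy, `max` would simply be replaced by `min`.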

The learning device 1 may also include a prediction accuracy evaluation unit 233 that uses the prediction accuracy evaluation function to determine whether learning of the level set function and the high-level controller needs to be continued.
With this configuration, whether learning needs to be continued is determined quantitatively based on the prediction accuracy evaluated during the learning process.

Furthermore, the level set function learning unit 231 may learn the level set function using first learning data, and the high-level controller learning unit 240 may learn the high-level controller using second learning data. The first learning data includes an optimal solution set and a non-optimal solution set. The optimal solution set has, as input, the motion parameters and the control parameters that give an optimal solution of the evaluation value and includes the optimal solution as output. The non-optimal solution set has, as input, motion parameters and control parameters that give a non-optimal solution different from the optimal solution and includes the non-optimal solution as output. The second learning data includes the optimal solution set.
With this configuration, the level set function is trained using first learning data that additionally includes a data set whose output is a non-optimal solution and whose input is the motion parameters and control parameters that give that non-optimal solution. A wider range of control parameters is therefore referenced in training the level set function, which regresses the solution of the optimal control problem. As a result, the high-level controller is trained so that stable control parameters are output even when the system state deviates from the optimal state.
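The relation between the two learning data sets can be sketched as follows. The record layout, the names `Sample` and `split_datasets`, and the choice to store the second data set as (motion parameters, control parameters) pairs are assumptions for illustration; the text only states which solution sets each data set includes.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    """One solved problem instance (hypothetical record layout)."""
    motion_params: Tuple[float, ...]
    control_params: Tuple[float, ...]
    value: float        # evaluation value reached with these parameters
    is_optimal: bool    # True if the value is the optimal solution


def split_datasets(samples: Sequence[Sample]):
    """Build the first data set (optimal and non-optimal) and the second (optimal only)."""
    # First learning data: every sample, input = (motion, control), output = value.
    d_g: List[Tuple[Tuple[float, ...], float]] = [
        (s.motion_params + s.control_params, s.value) for s in samples
    ]
    # Second learning data: optimal samples only, kept here as motion -> control pairs.
    d_h: List[Tuple[Tuple[float, ...], Tuple[float, ...]]] = [
        (s.motion_params, s.control_params) for s in samples if s.is_optimal
    ]
    return d_g, d_h


samples = [
    Sample((1.0,), (0.2,), 0.0, True),    # optimal solution
    Sample((1.0,), (0.9,), 0.7, False),   # non-optimal solution
]
d_g, d_h = split_datasets(samples)
```

Keeping the non-optimal samples only in the first data set lets the level set function see a wider range of control parameters while the controller is fitted to optimal solutions alone.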

The learning device 1 may include a system model learning unit 250 that learns a system model taking motion parameters and control parameters as input and outputting a system state. The system model learning unit 250 learns the system model using control result data that includes pairs of the control parameters and the system state obtained by controlling the operation of the robot with control parameters based on the motion parameters.
With this configuration, the system model is trained using the control result data so that it estimates the system state actually obtained for the motion parameters and the control parameters. By updating the control problem based on the trained system model, the high-level controller can be trained to adapt to the real system environment.
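As a minimal illustration of fitting a system model to control result data, the sketch below assumes a scalar linear model, state ≈ w_m · m + w_u · u, fitted by ordinary least squares. The linear form, the scalar parameters, the function name `fit_linear_system_model`, and the data values are all assumptions; the text does not specify the model class or the fitting method.

```python
from typing import Sequence, Tuple

# Each record: (motion parameter m, control parameter u, observed system state y).
Record = Tuple[float, float, float]


def fit_linear_system_model(data: Sequence[Record]) -> Tuple[float, float]:
    """Least-squares fit of y ≈ w_m * m + w_u * u via the 2x2 normal equations."""
    s_mm = sum(m * m for m, _, _ in data)
    s_mu = sum(m * u for m, u, _ in data)
    s_uu = sum(u * u for _, u, _ in data)
    s_my = sum(m * y for m, _, y in data)
    s_uy = sum(u * y for _, u, y in data)
    det = s_mm * s_uu - s_mu * s_mu
    if det == 0.0:
        raise ValueError("inputs are collinear; the model is not identifiable")
    w_m = (s_my * s_uu - s_mu * s_uy) / det
    w_u = (s_mm * s_uy - s_mu * s_my) / det
    return w_m, w_u


# Hypothetical control result data generated from y = 2*m + 3*u.
control_results = [(1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 1.0, 5.0), (2.0, 1.0, 7.0)]
w_m, w_u = fit_linear_system_model(control_results)
```

In practice the system state is a vector and the model may be nonlinear, so a general regression method would replace the closed-form solution used here.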

The high-level controller learning unit may learn the high-level controller with the motion parameters of the control result data as input and the control parameters as output.
With this configuration, the high-level controller, which takes as input the motion parameters actually used for control, is trained so as to yield the control parameters actually obtained through the control. The high-level controller is thereby trained to adapt to the real system environment.

Note that a program for executing all or part of the processing performed by the learning device 1, the robot controller 3, and the robot 5 may be recorded on a computer-readable recording medium, and the processing of each unit may be performed by causing a computer system to read and execute the program recorded temporarily or non-temporarily on this recording medium. The term "computer system" here includes an OS and hardware such as peripheral devices.
Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks built into computer systems. The above program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

The embodiment of the present application has been described above in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like within a scope that does not depart from the gist of the invention.

Embodiments of the present application can be realized as a learning device, a control device, a control system, a learning method, and a storage medium.

1...Learning device, 2...Storage device, 3...Robot controller, 3'...Control device, 4...Measuring device, 5...Robot, 11...Processor, 12...Memory, 13...Interface, 14...Abstract system model setting unit, 15...Skill learning unit, 16...Skill tuple generation unit, 31...Processor, 32...Memory, 33...Interface, 50...System, 100...Control system, 210...Search point set setting unit, 211...Search point set initialization unit, 212...Search point selection unit, 220...Data acquisition unit, 221...System model setting unit, 222...Problem setting calculation unit, 223 (223-1, 223-2)...Data update unit, 230...Learning setting unit, 231...Level set function learning unit, 232...Prediction accuracy evaluation function setting unit, 233...Prediction accuracy evaluation unit, 234...Evaluation function setting unit for controller learning, 240...High-level controller learning unit, 250...System model learning unit

Claims

A learning device comprising:
a level set function learning unit that learns a level set function that receives as input motion parameters defining a motion goal and a motion environment of a robot and control parameters of the robot, and outputs an evaluation value regarding the achievability of the motion goal based on the motion parameters and the control parameters; and
a high-level controller learning unit that learns, based on the level set function and prediction accuracy of the control parameters, a high-level controller that determines control parameters for causing the robot to realize a target motion based on the motion parameters.

The learning device according to claim 1, further comprising a search point selection unit that selects one search point from a search point set including a plurality of predetermined search points, based on a prediction accuracy evaluation function that indicates the prediction accuracy of the level set function,
wherein the search point is a pair of the motion parameters and the control parameters.

The learning device according to claim 2 , further comprising a prediction accuracy evaluation unit that uses the prediction accuracy evaluation function to determine the necessity for continuing learning of the level set function and the high-level controller.

The learning device according to claim 2, wherein
the level set function learning unit learns the level set function using first learning data,
the high-level controller learning unit learns the high-level controller using second learning data,
the first learning data includes an optimal solution set having, as input, the motion parameters and the control parameters which provide an optimal solution for the evaluation value and including the optimal solution as output, and a non-optimal solution set having, as input, the motion parameters and the control parameters which provide a non-optimal solution different from the optimal solution and including the non-optimal solution as output, and
the second learning data includes the optimal solution set.

The learning device according to claim 1, further comprising a system model learning unit that learns a system model having the motion parameters and the control parameters as inputs and a system state as an output, by using control result data including a pair of the control parameters and the system state obtained by controlling the operation of the robot using the control parameters based on the motion parameters.

The learning device according to claim 5, wherein the high-level controller learning unit learns the high-level controller using the motion parameters related to the control result data as input and the control parameters as output.

A control system comprising:
the robot;
the high-level controller; and
the learning device according to claim 1.

A computer-readable storage medium storing a program for causing a computer to function as the learning device according to claim 1.

A learning method for a learning device, wherein the learning device executes:
a level set function learning step of learning a level set function that receives as input motion parameters defining a motion goal and a motion environment of a robot and control parameters of the robot, and outputs an evaluation value regarding the achievability of the motion goal based on the motion parameters and the control parameters; and
a high-level controller learning step of learning, based on the level set function and prediction accuracy of the control parameters, a high-level controller that determines control parameters for causing the robot to realize a target motion based on the motion parameters.

A control device comprising:
a high-level control unit that determines control parameters for realizing a target motion based on motion parameters that define a motion goal and a motion environment of a robot; and
a motion planning unit that calculates, using a level set function, an evaluation value indicating the feasibility of the target motion based on the motion parameters and the control parameters, uses the control parameters for motion control of the robot when it is determined based on the evaluation value that the target motion is realizable, and searches, based on the evaluation value, for control parameters that make the target motion realizable when it is determined based on the evaluation value that the target motion is not realizable.