WO2020235693A1 - Learning method, learning device, and learning program for ai agent that behaves like human - Google Patents
Learning method, learning device, and learning program for AI agent that behaves like human
- Publication number
- WO2020235693A1 (PCT/JP2020/020624)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- behavior
- learning
- agent
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
Definitions
- The present invention relates to a method, device, and program that, when training an agent AI (artificial intelligence), can take into account both the error with respect to a reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model.
- Reinforcement learning (hereinafter sometimes referred to as "RL"), and in particular deep reinforcement learning (DRL), is applied to algorithms such as the AI (artificial intelligence) used in electronic games, automated driving control of vehicles such as automobiles, and autonomous control of robots.
- Reinforcement learning is a method that repeats learning in units of episodes and adapts to the target environment using the rewards obtained through trial and error in that environment; learning is performed by optimizing the policy.
- Deep reinforcement learning uses the information-processing power of a convolutional neural network (CNN) to perform reinforcement learning based on high-dimensional inputs such as image data.
- CNN convolutional neural network
- In reinforcement learning and deep reinforcement learning, a problem is set in which an agent in an environment observes the current state and decides the action to take, and the agent receives a reward from the environment by selecting an action. While the interaction between the environment and the agent is repeated, learning is performed using the three variables of reward, action, and state.
- The agent acts in an environment having a state space and an action space.
- At each time step, the policy converts the state into a feature vector with a feature-extraction learner and outputs action probabilities for that feature vector with a value-function-calculation learner.
- After the agent takes an action, the environment outputs a scalar reward and the next state. The episode ends when actions have been repeated until the final state is reached.
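As a minimal illustration of this state-to-feature-vector-to-action-probability pipeline (a sketch, not taken from the patent; the network shape, layer sizes, and names below are assumptions), such a policy could look like the following:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Sketch of a policy: CNN feature extractor followed by an action-probability head."""
    def __init__(self, n_actions: int):
        super().__init__()
        # Feature-extraction learner: converts an 84x84 grayscale state into a feature vector.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # Head that outputs action probabilities for the feature vector.
        self.head = nn.Linear(32 * 9 * 9, n_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        feat = self.features(state)                      # feature vector
        return torch.softmax(self.head(feat), dim=-1)    # action probabilities

# One interaction step: sample an action from the output distribution.
policy = Policy(n_actions=4)
state = torch.zeros(1, 1, 84, 84)             # dummy observation
action = torch.multinomial(policy(state), 1)  # stochastic action selection
```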
- An agent AI constructed by reinforcement learning or deep reinforcement learning can learn through trial and error while interacting with the environment and solve various problems, far exceeding human ability, so it is applied to algorithms such as game AI, autonomous control of robots, and automated driving (see, for example, Non-Patent Documents 1 to 5).
- However, since humans cannot predict its behavior, there is a major usability problem for practical applications such as a robot that works in collaboration with humans or an automated-driving AI that carries human passengers.
- In addition, an agent AI constructed by reinforcement learning or deep reinforcement learning shows high performance because it is trained to maximize returns, but when it is put into practical use, factors other than such performance indicators also need to be considered.
- For example, in a video game, if an NPC (Non Player Character), a character that the player cannot operate, is driven by a reinforcement learning agent AI, that agent AI may be so strong that the player cannot enjoy the game very much.
- When applied to automated driving, a reinforcement learning agent AI trained for high performance may accelerate or decelerate violently or turn suddenly, causing anxiety to nearby vehicles and pedestrians. It is therefore necessary to design a human-like agent AI.
- On the other hand, imitation learning (hereinafter sometimes referred to as "IL") is adopted to build an agent AI that imitates human behavior, mainly for the purpose of reproducing the behavior of human experts.
- In imitation learning, the policy followed by the expert is assumed to be the optimal policy, and learning is performed so that the policy of the agent AI approaches the behavior of the expert.
- The human expert's policy is provided in the form of a sequence of state-action pairs, and the agent AI's policy is then trained to infer, from an observed state, the action that the expert would be likely to take; by imitating the human expert, human-like behavior can be expected.
- However, since the learned policy is limited to the provided data, there is a problem that the performance of an imitation learning agent AI hardly ever exceeds the performance of the human expert (see, for example, Non-Patent Documents 6 to 8).
- In view of this situation, it is an object of the present invention to provide an agent learning method, device, and program that achieve both the highly efficient optimal behavior exceeding human ability that is acquired by reinforcement learning and the human-like behavior that is acquired by imitating the behavior of a human expert.
- In order to solve the above problems, the agent learning method of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it takes the optimal action in a human-like manner; it includes steps a) to f): an input step, a learning step, first and second loss error calculation steps, a fusion error calculation step, and an update step. Step a), the input step, inputs state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
- According to the agent learning method of the present invention, the error with respect to the reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model are fed back, so that an agent AI that behaves like a human and takes optimal actions can be constructed.
- The executor of the reinforcement learning model is, as in general reinforcement learning, set with a policy under which the agent in the environment observes the current state and decides the action to take; it is an executor obtained by repeatedly selecting actions from the policy and obtaining rewards so that the value of the value function increases, learning with feedback from the three variables of reward, action, and state.
- The environment is, for example, a set of image data of an electronic game screen, or a time series of image data from video captured by a camera mounted on a robot or a vehicle.
- The state data is data obtained by observing such an environment, for example individual game images, images of the vehicle's surroundings, and lane positions.
- The action data is, for example, game input-device operations and controller inputs (movement direction, jump, etc.) in the case of an electronic game, and steering-wheel, accelerator, and brake operations in the case of automated driving of an automobile.
- In this way, reinforcement learning works by a mechanism that learns with feedback from the three variables of reward, action, and state, whereas imitation learning works by a mechanism that uses a teacher agent and a student agent.
- The action data AF output by the fusion model is expressed as a probability distribution over actions.
- For example, it may be the probability distribution over each input-button operation of a controller, or the probability distribution over steering-wheel, accelerator, and brake operations.
- The case in which only a single optimal operation is output is also included, as a δ distribution.
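For instance (an illustrative example, not taken from the patent), such an action distribution over controller inputs could look like:

```python
# Discrete-action case: probability distribution over controller button operations.
action_distribution = {"left": 0.10, "right": 0.65, "jump": 0.20, "no_op": 0.05}

# Single optimal operation: a delta distribution, with all probability mass on one action.
delta_distribution = {"left": 0.0, "right": 1.0, "jump": 0.0, "no_op": 0.0}
```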
- The loss error means, for example, the error between the current output and the expected output, calculated using a loss function.
- In the input step, the state data SR and action data AR can correspond to any of the following cases, and the processing of the first and second loss error calculation steps differs accordingly.
- When the input is a recording of play data by a human expert, the first loss error calculation step calculates, using a loss function, the error between the action data AR in that play data and the action data AF.
- The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
- When the input is a recording of play data of an agent of the imitation learning model, the first loss error calculation step calculates, using a loss function, either the error between the action data AR in that play data and the action data AF, or the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation; the second loss error calculation step is the same as above.
- When the input is a recording of play data of an agent of the reinforcement learning model, the first loss error calculation step calculates, using a loss function, the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation; the second loss error calculation step is the same as above.
- When the input consists of state data SHE and action data AHE from a recording of play data by a human expert, together with state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model, the learning step inputs the state data SHE to output action data AF1 and inputs the state data SRL to output action data AF2.
- In that case, the first loss error calculation step calculates, using a loss function, the error between the action data AHE in the play data by the human expert and the action data AF1, and the second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
- When the input consists of state data SIL and action data AIL from a recording of play data of an agent of the imitation learning model, together with state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model, the learning step inputs the state data SIL to output action data AF1 and inputs the state data SRL to output action data AF2.
- In that case, the first loss error calculation step calculates, using a loss function, the error between the action data AF1 and the action data AIL in that play data soft-targeted by knowledge distillation, and the second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
- In the soft targeting by knowledge distillation, the output of the agent's action policy has a temperature parameter T, which supports both hard targeting and soft targeting.
- When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely.
- When the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
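A minimal sketch of this temperature-controlled targeting (the patent's exact processing function is not reproduced here; the softmax-style form below is an assumption):

```python
import numpy as np

def soft_target(logits: np.ndarray, T: float) -> np.ndarray:
    """T == 0 yields a hard (one-hot) target; T > 0 yields a soft target in which
    several actions keep a probability greater than 0."""
    if T == 0:
        hard = np.zeros_like(logits)
        hard[np.argmax(logits)] = 1.0   # action determined uniquely
        return hard
    z = logits / T                      # a larger T flattens the distribution
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.1])
print(soft_target(logits, T=0))    # [1. 0. 0.]               -> hard target
print(soft_target(logits, T=1.0))  # approx. [0.66 0.24 0.10]  -> soft target
```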
- Next, the agent learning device of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it behaves in a human-like manner; it comprises the following components A) to F): an input unit, a learning executor, a first loss error calculator, a second loss error calculator, a fusion error calculator, and an update unit.
- As in the agent learning method described above, the output of the agent's action policy in the learning device has a temperature parameter T and supports both hard targeting and soft targeting.
- When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely; when the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
- The agent learning program of the present invention is a program for causing a computer to execute all of the steps of the agent learning method of the present invention described above. It is also a program for causing a computer to function as the input unit, learning executor, first loss error calculator, second loss error calculator, fusion error calculator, and update unit of the agent learning device of the present invention described above.
- According to the present invention, an agent AI can be constructed that achieves both the highly efficient optimal behavior exceeding human ability acquired by reinforcement learning and the human-like behavior acquired by imitating the behavior of a human expert.
- Functional block diagram of the agent learning device of the present invention
- Functional block diagram of the agent learning device of Example 1
- Schematic flow chart of the agent learning method of Example 1
- Execution processing flow diagram of the fusion model of Example 1
- Calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 1
- Calculation processing flow diagram of the loss error function LRL with respect to optimal behavior in Example 1
- Calculation processing flow diagram of the fusion loss error function LMix in Example 1
- Learning processing flow diagram of the fusion model of Example 1
- Functional block diagram of the agent learning device of Example 2
- Calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 2
- Functional block diagram of the agent learning device of Example 3
- Functional block diagram of the agent learning device of Example 4
- Functional block diagram of the agent learning device of Example 5
- Schematic flow chart of the agent learning method of Example 5
- Execution processing flow diagrams of the fusion model of Example 5
- Calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 5
- Functional block diagram of the agent learning device of Example 6
- Calculation processing flow diagram of the loss error function LIL of Example 3
- Explanatory drawing of the soft targeting processor
- FIG. 1 shows a functional block diagram of the learning device of the agent of the present invention.
- The learning device 1 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database (DB) 6.
- The fusion model learning executor 2 acquires state data from the database 6, performs the execution processing of the fusion model, and outputs the obtained action data to the loss function calculators (3, 4).
- The loss function calculators (3, 4) receive state data or action data from the database 6, feed the received data together with the action data from the fusion model learning executor 2 into their respective loss error function calculators, and calculate the loss error function LHL and the loss error function LRL, respectively.
- The fusion model loss function calculator 5 receives the loss error LHL and the loss error LRL calculated by the loss function calculators (3, 4), and calculates the fusion loss error function LMix, which fuses the loss with respect to behavior more efficient than a human and the loss with respect to human-like behavior.
- The fusion model learning executor 2 acquires the fusion loss error function LMix from the fusion model loss function calculator 5 and performs the learning processing of the fusion model.
- In the following examples, the functional block diagram of the agent learning device of the present invention is described for each combination pattern of fusion model type, type of play data used for learning, and type of learning control.
- The combination patterns are summarized in Table 1.
- FIG. 2 shows a functional block diagram of the learning device of the agent of the first embodiment.
- The learning device 11 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 61. Play data by a human expert is stored in the database 61.
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- FIG. 3 shows a schematic flow chart of the learning method of the agent of the first embodiment.
- First, the fusion model learning executor 2 performs the execution processing of the fusion model (step S11).
- Next, the loss function calculator 3 performs the calculation processing of the loss error function LHE with respect to human-like behavior (step S12).
- Then, the loss function calculator 4 performs the calculation processing of the loss error function LRL with respect to optimal behavior (step S13).
- The fusion model loss function calculator 5 then calculates the fusion loss error function LMix (step S14).
- The fusion model learning executor 2 acquires the calculated fusion loss error function LMix and performs the learning processing of the fusion model (step S15).
- After that, the fusion model learning executor 2 executes the fusion model execution processing again (step S11).
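A compact sketch of one pass through this S11-S15 loop (the model interfaces, the choice of cross-entropy and KL losses, and the trade-off weight alpha below are illustrative assumptions, not the patent's specified implementation):

```python
import torch
import torch.nn.functional as F

def train_step(fusion_model, rl_model, batch, optimizer, alpha=0.5, T=1.0):
    """One S11-S15 iteration: run the fusion model, compute the two loss errors,
    fuse them into LMix, and update the fusion model's parameters."""
    state, expert_action = batch                  # SR and AR from the expert play data
    # S11: execution processing of the fusion model -> logits over actions (AF after softmax)
    a_f = fusion_model(state)
    # S12: loss error LHE against the human expert's recorded actions
    l_he = F.cross_entropy(a_f, expert_action)
    # S13: loss error LRL against the RL model's soft-targeted output
    with torch.no_grad():
        a_rl = F.softmax(rl_model(state) / T, dim=-1)   # soft targeting with temperature T
    l_rl = F.kl_div(F.log_softmax(a_f, dim=-1), a_rl, reduction="batchmean")
    # S14: fusion loss LMix from the trade-off coefficient alpha
    l_mix = alpha * l_he + (1.0 - alpha) * l_rl
    # S15: change the fusion model's parameters so that LMix becomes smaller
    optimizer.zero_grad()
    l_mix.backward()
    optimizer.step()
    return l_mix.item()
```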
- FIG. 4 shows an execution processing flow diagram of the fusion model of the first embodiment.
- The fusion model learning executor 2 acquires the state data SR from the database 61 (step S111).
- The state data SR is input into the fusion model π and the model is executed (step S112).
- The action data AF output by the fusion model is stored (step S113).
- FIG. 5 shows a calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 1.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the human action data AR from the database 61 (step S121). The loss function calculator 3 also acquires the action data AF of the fusion model (step S122). The first loss error calculator 30 then calculates the loss error function LHE from the acquired AR and AF (step S123).
- FIG. 6 shows a calculation processing flow diagram of the loss error function LRL with the optimum behavior of the first embodiment.
- In the loss function calculator 4 for optimal behavior, the reinforcement learning model executor 8 acquires the state data SR from the database 61 (step S131). The state data SR is input into the reinforcement learning model πRL, which is executed and outputs action data ARL (step S132).
- The action data ARL is input into the soft targeting processor 9, which processes it and outputs action data AST2 to the second loss error calculator 40 (step S133).
- The second loss error calculator 40 acquires the action data AF of the fusion model (step S134).
- The second loss error calculator 40 calculates the loss error function LRL from the acquired action data AST2 and action data AF (step S135).
- The soft targeting processor 9 is described with reference to FIG. 21.
- As described above, the output of the agent's action policy has a temperature parameter T and supports both hard targeting and soft targeting.
- When the temperature T is 0, the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely.
- When the temperature T is not 0 (T > 0), multiple actions have probabilities greater than 0, and the action is determined stochastically.
- Here, the case in which the soft targeting processor is used in the loss function calculator for optimal behavior is described as an example, as shown in FIG. 21.
- The reinforcement learning model executor receives the state data and outputs its output data G to the soft targeting processor, and the soft targeting processor in turn outputs its output data L to the loss error LRL calculator.
- The loss error LRL calculator calculates the loss error function from the action data output by the fusion model and the output data of the soft targeting processor.
- One pattern applies the temperature T and calculates the loss error function on the data L with T > 0 (referred to as "pattern B").
- There is also a pattern that calculates the loss error function directly on the data G (referred to as "pattern C2").
- In the soft targeting processor, the temperature T is used as a parameter of an exponential processing function, as shown in the graph in FIG. 21(2).
- Table 2 summarizes the above patterns, data names, and data formats.
- FIG. 7 shows a calculation processing flow diagram of the fusion loss error function LMix of Example 1.
- The fusion model loss function calculator 5 acquires the loss error functions LHE and LRL (step S141).
- The fusion model loss function calculator 5 calculates the fusion loss error function LMix from the trade-off coefficient α, LHE, and LRL (step S142).
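The formula for LMix is not reproduced in this text; a natural reading (an assumption, consistent with the trade-off parameter α ∈ (0,1) introduced in the fusion objective described later) is a convex combination of the two loss errors:

```latex
L_{\mathrm{Mix}} = \alpha \, L_{\mathrm{HE}} + (1 - \alpha) \, L_{\mathrm{RL}}, \qquad \alpha \in (0, 1)
```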
- FIG. 8 shows a learning processing flow diagram of the fusion model of the first embodiment.
- The fusion model learning executor 2 acquires the fusion loss error function LMix (step S151).
- The fusion model learning executor 2 updates the parameters of the fusion model so that the fusion loss error function LMix becomes smaller (step S152).
- FIG. 9 shows a functional block diagram of the learning device of the agent of the second embodiment.
- The learning device 12 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 62.
- The database 62 stores play data by the imitation learning model πIL provided in the imitation learning model executor 7.
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The agent learning device 12 of Example 2 differs from the learning device 11 of Example 1 in that the database used is the database 62, which stores play data by the imitation learning model, instead of the database 61, which stores play data by a human expert. Accordingly, in the agent learning method of Example 2, the process of calculating the loss error function LIL with respect to human-like behavior differs from Example 1.
- FIG. 10 shows a calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 2.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the action data AIL of the imitation learning model from the database 62 (step S211). The loss function calculator 3 also acquires the action data AF of the fusion model (step S212). The first loss error calculator 30 then calculates the loss error function LIL from the acquired AIL and AF (step S213).
- FIG. 11 shows a functional block diagram of the learning device of the agent of the third embodiment.
- The learning device 13 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 62.
- The database 62 stores play data by the imitation learning model πIL provided in the imitation learning model executor 7.
- The loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The loss function calculator 3 is provided not only with a first loss error calculator 30 but also with an imitation learning model executor 7 and a soft targeting processor 9.
- FIG. 20 shows a calculation processing flow diagram of the loss error function LIL with respect to human-like behavior in Example 3.
- The loss function calculator 3 acquires the state data SR from the database 62 (step S411). The state data SR is input into the imitation learning model πIL in the imitation learning model executor 7, which is executed and outputs action data AIL (step S412).
- The action data AIL is input into the soft targeting processor 9, which processes it and outputs action data AST1 (step S413).
- The first loss error calculator 30 acquires the action data AF of the fusion model (step S414).
- The first loss error calculator 30 calculates the loss error function LIL from the acquired action data AST1 and action data AF (step S415).
- FIG. 12 shows a functional block diagram of the learning device of the agent of the fourth embodiment.
- The learning device 14 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and a database 63.
- The database 63 stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- The loss function calculator 3 is provided with an imitation learning model executor 7, a soft targeting processor 9, and a first loss error calculator 30, as in Example 3.
- The loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The learning device 14 of Example 4 differs from the learning device 13 of Example 3 in that the database used is the database 63, which stores play data by the reinforcement learning model, instead of the database 62, which stores play data by the imitation learning model; it is otherwise the same.
- FIG. 13 shows a functional block diagram of the learning device of the agent of the fifth embodiment.
- The learning device 15 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and databases (61, 63).
- The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a second loss error calculator 40.
- The learning device 15 of Example 5 differs from Examples 1 to 4 in that it uses two databases: the database 61, which stores play data by a human expert, and the database 63, which stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- FIG. 14 shows a schematic flow chart of the learning method of the agent of Example 5.
- First, the fusion model learning executor 2 performs execution processing 1 of the fusion model (step S31).
- Next, the fusion model learning executor 2 performs execution processing 2 of the fusion model (step S32).
- The loss function calculator 3 then performs the calculation processing of the loss error function LHE with respect to human-like behavior (step S33).
- Next, the calculation processing of the loss error function LRL with respect to optimal behavior is performed (step S34).
- Step S33 may be performed after step S31, and step S34 may be performed after step S32.
- Step S32 may also be performed before step S31.
- The fusion model loss function calculator 5 then calculates the fusion loss error function LMix (step S35).
- The fusion model learning executor 2 acquires the calculated fusion loss error function LMix and performs the learning processing of the fusion model (step S36).
- After that, the fusion model learning executor 2 performs execution processing 1 and 2 of the fusion model again (steps S31 and S32).
- FIG. 15 shows execution processing flow diagram 1 of the fusion model of Example 5.
- The fusion model learning executor 2 acquires the state data SR1 from the database 61 (DB1) (step S311).
- The state data SR1 is input into the fusion model π and the model is executed (step S312).
- The action data AF1 of the fusion model is stored (step S313).
- FIG. 16 shows execution processing flow diagram 2 of the fusion model of Example 5.
- The fusion model learning executor 2 acquires the state data SR2 from the database 63 (DB2) (step S321).
- The state data SR2 is input into the fusion model π and the model is executed (step S322).
- The action data AF2 of the fusion model is stored (step S323).
- FIG. 17 shows a calculation processing flow diagram of the loss error function LHE with respect to human-like behavior in Example 5.
- The first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires the human action data AR from the database 61 (step S331). The loss function calculator 3 also acquires the action data AF1 of the fusion model (step S332). The first loss error calculator 30 then calculates the loss error function LHE from the acquired AR and AF1.
- FIG. 18 shows a calculation processing flow diagram of the loss error function LRL with the optimum behavior of the fifth embodiment.
- The loss function calculator 4 for optimal behavior acquires the action data ARL of the reinforcement learning model from the database 63 (DB2) (step S341). Separately, the second loss error calculator 40 acquires the action data AF2 of the fusion model (step S342). The second loss error calculator 40 then calculates the loss error function LRL from the acquired action data ARL and action data AF2 (step S343).
- The calculation processing of the fusion loss error function LMix in step S35 shown in FIG. 14 is the same as the calculation processing flow diagram of the fusion loss error function LMix of Example 1 shown in FIG. 7. The learning processing of the fusion model in step S36 shown in FIG. 14 is likewise performed according to the learning processing flow diagram of the fusion model of Example 1 shown in FIG. 8.
- FIG. 19 shows a functional block diagram of the learning device of the agent of the sixth embodiment.
- The learning device 16 is composed of a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimal behavior, a fusion model loss function calculator 5, and databases (62, 63).
- The loss function calculator 3 is provided with an imitation learning model executor 7, a soft targeting processor 9, and a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft targeting processor 9, and a second loss error calculator 40.
- The learning device 16 of Example 6 differs from Example 5, which uses the databases (61, 63), in that it uses two databases: the database 62 (DB1), which stores play data by the imitation learning model πIL provided in the imitation learning model executor 7, and the database 63 (DB2), which stores play data by the reinforcement learning model πRL provided in the reinforcement learning model executor 8.
- The learning method that learns a human-like agent AI while maintaining the high performance of the reinforcement learning model consists of two processes: learning an agent AI that takes efficient optimal actions surpassing human performance, and learning an agent AI that selects actions like a human. These processes are addressed as a reinforcement learning task and an imitation learning task, respectively. In the present invention, the fusion model of reinforcement learning and imitation learning is therefore based on policy distillation in the case of a discrete action space and on adversarial imitation learning in the case of a continuous action space. If π* is the optimal policy based on the reinforcement learning model, πHE is the human (expert) policy, and α ∈ (0,1) is the parameter that determines the ratio of these two policies, the objective function takes the following form.
- Following prior research on imitation learning, the objective function of imitation learning is defined as a cross-entropy loss.
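The equations themselves are not included in this extract; assuming each term is the standard cross-entropy between the student policy π and the corresponding teacher policy (an assumption, not the patent's exact formula), the fused distillation objective would take roughly the following form:

```latex
\min_{\pi} \; L_{\mathrm{Mix}}(\pi)
  = \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\!\left[ H\!\left(\pi_{\mathrm{HE}}(\cdot \mid s),\, \pi(\cdot \mid s)\right) \right]
  + (1 - \alpha) \, \mathbb{E}_{s \sim \mathcal{D}}\!\left[ H\!\left(\pi^{*}(\cdot \mid s),\, \pi(\cdot \mid s)\right) \right]
```

where H(p, q) = -Σ_a p(a) log q(a) is the cross-entropy and D is the set of recorded states.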
- GAIL (Generative Adversarial Imitation Learning)
- The GAIL method requires trajectories sampled from the teacher model.
- The objective function that is maximized by the discriminator Dw and minimized by the student model in the GAIL method is as follows.
- Here, τ ~ π are the trajectories sampled from the student model π.
- In the present invention, trajectories τHE ~ πHE and τRL ~ πRL are sampled from each of the two experts.
- The fusion loss function can then be rewritten accordingly. Intuitively, the discriminator Dw is trained to recognize the fusion of the human (expert) policy and the reinforcement learning model's policy, and the student model π, trained to deceive this discriminator, is expected to approach the fusion policy and imitate the strengths of both experts.
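The GAIL objective referenced here is likewise not printed in this extract; its standard form, with the single expert term split between the two experts as described above (the split and the weighting by α are assumptions), would be approximately:

```latex
\min_{\pi} \max_{D_w} \;
  \alpha \, \mathbb{E}_{\tau_{\mathrm{HE}} \sim \pi_{\mathrm{HE}}}\!\left[ \log D_w(s, a) \right]
  + (1 - \alpha) \, \mathbb{E}_{\tau_{\mathrm{RL}} \sim \pi_{\mathrm{RL}}}\!\left[ \log D_w(s, a) \right]
  + \mathbb{E}_{\tau \sim \pi}\!\left[ \log\!\left(1 - D_w(s, a)\right) \right]
```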
- Example 1: Atari 2600 game (Gopher)
- The learning device of Example 1 was first applied to Gopher, an Atari 2600 game with a discrete action space.
- In this game, the player controls a farmer who moves left and right and fills holes so that the gopher emerging from underground onto the surface cannot take the carrots.
- A human (expert) and a trained learning model each provided 55,000 frames, of which 50,000 were used as the training set and 5,000 as the test set.
- An Adam optimizer with a learning rate of 10^-4 and a dropout rate of 0.5 was used to train the student model.
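A minimal sketch of this training configuration (only the learning rate of 10^-4 and the dropout rate of 0.5 come from the text; the placeholder model below, including its layer sizes and the assumed number of discrete actions, is not from the patent):

```python
import torch
import torch.nn as nn

# Placeholder student model; the actual architecture is not specified in this extract.
student = nn.Sequential(
    nn.Flatten(),
    nn.Linear(84 * 84, 256), nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout rate of 0.5, as stated in the text
    nn.Linear(256, 8),   # assumed number of discrete actions
)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)  # learning rate of 10^-4
```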
- Example 2: Torcs. Torcs (see Wymann, B., "The open racing car simulator", 2015) is one of the most commonly used simulators in autonomous driving research. The experiment using Torcs was based on the GymTorcs environment.
- The observation space of the agent AI consists of 65 continuous values in total, such as the distance from the car to the track edge, the distance to opponent cars, and the current speed and acceleration.
- The action space consists of two elements, "steering (left/right)" and "acceleration/deceleration", with possible values limited to the range [-1.0, 1.0].
- The reward function was the distance traveled, and the reinforcement learning model was trained based on OpenAI Baselines.
- Example 3: Apple game. The apple game is a game in which the player's avatar is moved to the position of an apple that appears. In terms of score, reinforcement learning achieved the highest result, followed by the learning method of the examples, and finally the human (Comparative Example 1) and its imitation learning. In terms of human-likeness, the human agent (Comparative Example 1) was the best. The learning method of the examples showed more human-like behavior than the agent of the reinforcement learning model (Comparative Example 3) while surpassing both the human (Comparative Example 1) and its imitation learning model (Comparative Example 2) in score, ranking second only to the human (Comparative Example 1) in human-likeness. From this result, it can be seen that the learning method of the examples balances human-like behavior and high performance in this game.
- The present invention is useful for learning agents in a wide range of fields, such as automated driving of automobiles and automatic control of industrial robot arms.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Business, Economics & Management (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Feedback Control In General (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
The present invention relates to a method, device, and program that, when training an agent AI (artificial intelligence), can take into account both the error with respect to a reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model.
In recent years, reinforcement learning (hereinafter sometimes referred to as "RL"), and in particular deep reinforcement learning (DRL), has been applied to algorithms such as the AI (artificial intelligence) used in electronic games, automated driving control of vehicles such as automobiles, and autonomous control of robots. Reinforcement learning is a method that repeats learning in units of episodes and adapts to the target environment using the rewards obtained through trial and error in that environment; learning is performed by optimizing the policy. Deep reinforcement learning uses the information-processing power of a convolutional neural network (CNN) to perform reinforcement learning based on high-dimensional inputs such as image data.
In reinforcement learning and deep reinforcement learning, a problem is set in which an agent in an environment observes the current state and decides the action to take, and the agent receives a reward from the environment by selecting an action. While the interaction between the environment and the agent is repeated, learning is performed using the three variables of reward, action, and state.
In reinforcement learning and deep reinforcement learning, the agent acts in an environment having a state space and an action space. At each time step, the policy converts the state into a feature vector with a feature-extraction learner and outputs action probabilities for that feature vector with a value-function-calculation learner. After the agent takes an action, the environment outputs a scalar reward and the next state. The episode ends when actions have been repeated until the final state is reached.
An agent AI constructed by reinforcement learning or deep reinforcement learning can learn through trial and error while interacting with the environment and solve various problems, far exceeding human ability, so it is applied to algorithms such as game AI, autonomous control of robots, and automated driving (see, for example, Non-Patent Documents 1 to 5). However, since humans cannot predict its behavior, there is a major usability problem for practical applications such as a robot that works in collaboration with humans or an automated-driving AI that carries human passengers. In addition, an agent AI constructed by reinforcement learning or deep reinforcement learning shows high performance because it is trained to maximize returns, but when it is put into practical use, factors other than such performance indicators also need to be considered. For example, in a video game, if an NPC (Non Player Character), a character that the player cannot operate, is driven by a reinforcement learning agent AI, that agent AI may be so strong that the player cannot enjoy the game very much. When applied to automated driving, a reinforcement learning agent AI trained for high performance may accelerate or decelerate violently or turn suddenly, causing anxiety to nearby vehicles and pedestrians. It is therefore necessary to design a human-like agent AI.
On the other hand, imitation learning (hereinafter sometimes referred to as "IL") is adopted to build an agent AI that imitates human behavior, mainly for the purpose of reproducing the behavior of human experts. In imitation learning, the policy followed by the expert is assumed to be the optimal policy, and learning is performed so that the policy of the agent AI approaches the behavior of the expert. The human expert's policy is provided in the form of a sequence of state-action pairs, and the agent AI's policy is then trained to infer, from an observed state, the action that the expert would be likely to take; by imitating the human expert, human-like behavior can be expected. However, since the learned policy is limited to the provided data, there is a problem that the performance of an imitation learning agent AI hardly ever exceeds the performance of the human expert (see, for example, Non-Patent Documents 6 to 8).
In view of this situation, it is an object of the present invention to provide an agent learning method, device, and program that achieve both the highly efficient optimal behavior exceeding human ability that is acquired by reinforcement learning and the human-like behavior that is acquired by imitating the behavior of a human expert.
In order to solve the above problems, the agent learning method of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it takes the optimal action in a human-like manner; it includes the following steps a) to f).
a) An input step of inputting state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
b) A learning step of inputting the state data SR into the learning executor of a fusion model, which fuses a reinforcement learning model responsible for behavior that takes the optimal action with an imitation learning model responsible for human-like behavior, and causing it to output the action data AF of the fusion model.
c) A first loss error calculation step of calculating a first loss error between the action data AR and the action data AF.
d) A second loss error calculation step of calculating a second loss error between the action data AF and the action data ARL output on the basis of the state data SR by an executor of the reinforcement learning model or an optimal behavior algorithm.
e) A fusion error calculation step of calculating a fusion error based on the weight ratio of the first and second loss errors.
f) An update step of updating the parameters of the learning executor of the fusion model based on the fusion error.
According to the agent learning method of the present invention, the error with respect to the reinforcement learning model and the error with respect to the behavior of a human expert or an imitation learning model are fed back, so that an agent AI that behaves like a human and takes optimal actions can be constructed.
Here, the executor of the reinforcement learning model is, as in general reinforcement learning, set with a policy under which the agent in the environment observes the current state and decides the action to take; it is an executor obtained by repeatedly selecting actions from the policy and obtaining rewards so that the value of the value function increases, learning with feedback from the three variables of reward, action, and state.
The environment is, for example, a set of image data of an electronic game screen, or a time series of image data from video captured by a camera mounted on a robot or a vehicle. The state data is data obtained by observing such an environment, for example individual game images, images of the vehicle's surroundings, and lane positions. The action data is, for example, game input-device operations and controller inputs (movement direction, jump, etc.) in the case of an electronic game, and steering-wheel, accelerator, and brake operations in the case of automated driving of an automobile.
In this way, reinforcement learning works by a mechanism that learns with feedback from the three variables of reward, action, and state, whereas imitation learning works by a mechanism that uses a teacher agent and a student agent.
The action data AF output by the fusion model is expressed as a probability distribution over actions, for example the probability distribution over each input-button operation of a controller, or the probability distribution over steering-wheel, accelerator, and brake operations. Here, the case in which only a single optimal operation is output is also included, as a δ distribution. More specifically, discrete actions are expressed by a probability distribution, whereas for continuous actions only the single optimal action is output.
The loss error means, for example, the error between the current output and the expected output, calculated using a loss function.
In the input step of the agent learning method of the present invention, the state data SR and action data AR correspond to one of the following cases 1) to 5), and the processing of the first and second loss error calculation steps differs accordingly.
1) When the input is a recording of play data by a human expert: the first loss error calculation step calculates, using a loss function, the error between the action data AR in the play data by the human expert and the action data AF. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
2) When the input is a recording of play data of an agent of the imitation learning model: the first loss error calculation step calculates, using a loss function, either the error between the action data AR in that play data and the action data AF, or the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
3) When the input is a recording of play data of an agent of the reinforcement learning model: the first loss error calculation step calculates, using a loss function, the error between the action data AF and the action data AIL that is output by inputting the state data SR into the imitation learning model and then soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF and the action data ARL that is output by inputting the state data SR into the reinforcement learning model and then soft-targeted by knowledge distillation.
4) When the input consists of state data SHE and action data AHE from a recording of play data by a human expert, and state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model: the learning step inputs the state data SHE to output action data AF1 and inputs the state data SRL to output action data AF2. The first loss error calculation step calculates, using a loss function, the error between the action data AHE in the play data by the human expert and the action data AF1. The second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
5) When the input consists of state data SIL and action data AIL from a recording of play data of an agent of the imitation learning model, and state data SRL and action data ARL from a recording of play data of an agent of the reinforcement learning model: the learning step inputs the state data SIL to output action data AF1 and inputs the state data SRL to output action data AF2. The first loss error calculation step calculates, using a loss function, the error between the action data AF1 and the action data AIL in the play data of the agent of the imitation learning model soft-targeted by knowledge distillation. The second loss error calculation step calculates, using a loss function, the error between the action data AF2 and the action data ARL that is output by inputting the state data SRL into the reinforcement learning model and then soft-targeted by knowledge distillation.
In the soft targeting by knowledge distillation in the agent learning method of the present invention, the output of the agent's action policy has a temperature parameter T and supports both hard targeting and soft targeting. When the temperature T is "0", the output of the action policy becomes a hard target in which only one action has a probability greater than 0, and the action is determined uniquely. On the other hand, when the temperature T is not "0", multiple actions have probabilities greater than 0, and the action is determined stochastically.
Next, the agent learning device of the present invention will be described. The agent learning device of the present invention fuses a learner that realizes behavior in which an agent judges and takes the optimal action under a predetermined environment with a learner that realizes human-like behavior, and optimizes the agent's action policy so that it behaves in a human-like manner; it comprises the following A) to F).
A) An input unit that inputs state data SR and action data AR from a recording of at least one of play data by a human expert and play data of an agent created for a predetermined purpose.
B) A learning executor that inputs the state data SR into a fusion model, which fuses a reinforcement learning model responsible for behavior that takes the optimal action with an imitation learning model responsible for human-like behavior, and outputs the action data AF of the fusion model.
C) A first loss error calculator that calculates a first loss error between the action data AR and the action data AF.
D) A second loss error calculator that calculates a second loss error between the action data AF and the action data ARL output on the basis of the state data SR by an executor of the reinforcement learning model or an optimal behavior algorithm.
E) A fusion error calculator that calculates a fusion error based on the weight ratio of the first and second loss errors.
F) An update unit that updates the parameters of the learning executor of the fusion model based on the fusion error.
本発明のエージェントの学習装置の入力部において、上述した本発明のエージェントの学習方法の入力ステップと同様に、上述の1)~5)の場合の状態データSRと行動データARに応じて、第1及び第2の損失誤差算出器の算出処理が異なる。
また、本発明のエージェントの学習装置における知識蒸留によるソフトターゲット化は、上述した本発明のエージェントの学習方法と同様に、エージェントの行動方策の出力が、熱度Tのパラメタを有し、ハードターゲット化とソフトターゲット化に対応し得るものであり、熱度Tが“0”の場合は、行動方策の出力は一つの行動のみが0より大きな確率を持つハードターゲットとなり、行動は一意に決定され、一方、熱度Tが“0”でない場合は、複数の行動が0より大きな確率をもち、行動は確率的に決定される。
At the input of the agent of the learning device of the present invention, similarly to the input step of learning how the agent of the present invention described above, according to the state data S R and behavioral data A R in the case of the above 1) to 5) , The calculation process of the first and second loss error calculators is different.
Further, in the soft targeting by knowledge distillation in the learning device of the agent of the present invention, the output of the action policy of the agent has a parameter of heat degree T and is hard-targeted as in the above-described learning method of the agent of the present invention. When the degree of heat T is "0", the output of the action policy becomes a hard target with a probability that only one action has a probability greater than 0, and the action is uniquely determined, while the action is uniquely determined. If the heat degree T is not "0", the plurality of actions have a probability greater than 0, and the actions are determined probabilistically.
The agent learning program of the present invention is a program for causing a computer to execute all the steps of the agent learning method of the present invention described above.
The agent learning program of the present invention is also a program for causing a computer to function as the input unit, the learning executor, the first loss error calculator, the second loss error calculator, the fusion error calculator, and the update unit of the agent learning device of the present invention described above.
According to the present invention, it is possible to build an agent AI that combines highly efficient optimum behavior exceeding human ability, acquired through reinforcement learning, with human-like behavior acquired by imitating the behavior of a human expert.
Hereinafter, an example embodiment of the present invention will be described in detail with reference to the drawings. The scope of the present invention is not limited to the following examples and illustrated examples, and many changes and modifications are possible.
FIG. 1 shows a functional block diagram of the agent learning device of the present invention. The learning device 1 comprises a fusion model learning executor 2, a loss function calculator 3 for human-like behavior, a loss function calculator 4 for optimum behavior, a fusion model loss function calculator 5, and a database (DB) 6. The fusion model learning executor 2 acquires state data from the database 6, executes the fusion model, and outputs the resulting behavior data to the loss function calculators (3, 4). The loss function calculators (3, 4) take state data or behavior data from the database 6 together with the behavior data received from the fusion model learning executor 2, feed them into their respective loss error calculators, and compute the loss error function L_HL and the loss error function L_RL, respectively. The fusion model loss function calculator 5 receives the loss errors L_HL and L_RL computed by the loss function calculators (3, 4) and computes the fused loss error function L_Mix, which combines the loss for behaving more efficiently than a human with the loss for behaving in a human-like manner. The fusion model learning executor 2 obtains the fused loss error function L_Mix from the fusion model loss function calculator 5 and performs the learning processing of the fusion model.
In the following examples, functional block diagrams of the agent learning device of the present invention are described for each combination pattern of fusion model type, type of play data used for learning, and type of learning control. The combination patterns are summarized in Table 1.
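The overall data flow of FIG. 1 can be illustrated with a short, hedged sketch. The specification discloses no source code; the PyTorch-style loop below is only an illustration of the described flow (database, fusion model, two loss calculators, fused loss, parameter update), and every name in it (FusionPolicy, rl_teacher, alpha, temperature) is an assumption made for this sketch rather than part of the disclosed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical fusion policy: a small network mapping states to action logits.
class FusionPolicy(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # action logits

def train_step(policy, optimizer, rl_teacher, states, expert_actions,
               alpha=0.5, temperature=0.1):
    """One update mirroring FIG. 1: compute L_HL and L_RL, fuse them, update."""
    logits = policy(states)                               # fusion model output A_F
    # L_HL: loss against the human(-like) behavior data A_R (hard targets).
    loss_hl = F.cross_entropy(logits, expert_actions)
    # L_RL: loss against the RL teacher's soft targets (knowledge distillation).
    with torch.no_grad():
        soft_targets = F.softmax(rl_teacher(states) / temperature, dim=-1)
    loss_rl = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    # L_Mix: weighted fusion of the two losses.
    loss_mix = alpha * loss_rl + (1.0 - alpha) * loss_hl
    optimizer.zero_grad()
    loss_mix.backward()
    optimizer.step()
    return loss_mix.item()
```

In this reading, the trade-off coefficient alpha plays the role of the weight ratio between the second loss (optimum behavior) and the first loss (human-like behavior).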
FIG. 2 shows a functional block diagram of the agent learning device of Example 1. As shown in FIG. 2, the learning device 11 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 61. The database 61 stores play data produced by a human expert. The loss function calculator 3 is provided with a first loss error calculator 30, and the loss function calculator 4 is provided with a reinforcement learning model executor 8, a soft-targeting processor 9, and a second loss error calculator 40.
An agent learning method using the learning device 11 will be described with reference to FIGS. 3 to 8. FIG. 3 shows a schematic flow diagram of the agent learning method of Example 1. As shown in FIG. 3, the fusion model learning executor 2 first executes the fusion model (step S11). Next, the loss function calculator 3 calculates the loss error function L_HE against human-like behavior (step S12). Separately, the loss error function L_RL against optimum behavior is calculated (step S13). Based on the loss error function L_HE against human-like behavior and the loss error function L_RL against optimum behavior, the fusion model loss function calculator 5 calculates the fused loss error function L_Mix (step S14). The fusion model learning executor 2 obtains the calculated fused loss error function L_Mix and performs the learning processing of the fusion model (step S15). If the fused loss error function L_Mix is greater than or equal to a predetermined value (step S16), the fusion model learning executor 2 executes the fusion model again (step S11).
Next, each process shown in FIG. 3 will be described. FIG. 4 shows a flow diagram of the fusion model execution processing of Example 1. As shown in FIG. 4, the fusion model learning executor 2 acquires state data S_R from the database 61 (step S111), inputs the state data S_R to the fusion model π and executes it (step S112), and stores the behavior data A_F of the fusion model (step S113).
FIG. 5 shows a flow diagram of the calculation processing of the loss error function L_HE against human-like behavior in Example 1. As shown in FIG. 5, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires human behavior data A_R from the database 61 (step S121). The loss function calculator 3 also acquires the behavior data A_F of the fusion model (step S122). The first loss error calculator 30 calculates the loss error function L_HE from the acquired A_R and A_F (step S123).
FIG. 6 shows a flow diagram of the calculation processing of the loss error function L_RL against optimum behavior in Example 1. As shown in FIG. 6, in the loss function calculator 4 for optimum behavior, the reinforcement learning model executor 8 acquires state data S_R from the database 61 (step S131), inputs the state data S_R to the reinforcement learning model π_RL, executes it, and outputs behavior data A_R (step S132). The behavior data A_R is input to the soft-targeting processor 9, which executes and outputs behavior data A_ST2 to the second loss error calculator 40 (step S133). Separately, the second loss error calculator 40 acquires the behavior data A_F of the fusion model (step S134). The second loss error calculator 40 calculates the loss error function L_RL from the acquired behavior data A_ST2 and behavior data A_F (step S135).
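The closed form of L_RL is not given at this point in the text; a common choice consistent with the soft-targeted output A_ST2 (an assumed reconstruction, not a quotation) is the cross entropy between the temperature-softened teacher distribution and the fusion model's policy:

```latex
L_{RL} = -\,\mathbb{E}_{s \sim \mathcal{D}}
\Bigl[\sum_{a}\pi^{(T)}_{RL}(a \mid s)\,\log \pi(a \mid s)\Bigr].
```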
Here, the soft-targeting processor 9 will be described with reference to FIG. 21. As mentioned above, soft-targeting by knowledge distillation in the soft-targeting processor 9 means that the output of the agent's behavior policy has a temperature parameter T and supports both hard targeting and soft targeting. When the temperature T = 0, the output of the behavior policy is a hard target in which only one action has a probability greater than 0, and the action is determined uniquely. When the temperature T > 0 (T is not 0), multiple actions have probabilities greater than 0, and the action is determined stochastically.
The soft-targeting processor will be described using as an example the case in which it is used in the loss function calculator for optimum behavior. As shown in FIG. 21(1), the reinforcement learning model executor receives the state data and outputs its output data G to the soft-targeting processor, which in turn outputs its output data L to the loss error L_RL calculator. The loss error L_RL calculator receives the behavior data output by the fusion model and the output data of the soft-targeting processor, and calculates the loss error function.
In this case, as shown in FIG. 21(3), there are broadly three patterns for the soft-targeting processor. In the first, the processing inside the soft-targeting processor handles the temperature T and calculates the loss error function separately for data m when T = 0 and for data L when T > 0 (this is referred to as "pattern A"). In the second, the processing handles the temperature T and calculates the loss error function for data L with T > 0 (this is referred to as "pattern B"). In the third, the processing handles the temperature T and calculates the loss error function for data G with T = 1 (this is referred to as "pattern C1") or for data G with T = 0 (this is referred to as "pattern C2"). Concretely, the soft-targeting processing with the temperature T uses a processing function such as the graph shown in FIG. 21(2), with the temperature T used as a parameter of an exponential function. The above patterns, data names, and data formats are summarized in Table 2.
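A minimal sketch of the T = 0 / T > 0 branching described above is given below; the function name and the NumPy formulation are illustrative assumptions, not code from the specification.

```python
import numpy as np

def soft_target(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn policy logits into a hard target (T == 0) or a soft target (T > 0).

    With T == 0 a one-hot distribution over the arg-max action is returned;
    with T > 0 a softmax with the temperature as the exponential-scaling
    parameter is returned, so several actions keep nonzero probabilities.
    """
    if temperature == 0.0:
        hard = np.zeros_like(logits)
        hard[np.argmax(logits)] = 1.0
        return hard
    z = logits / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: a low temperature sharpens the distribution, a high one flattens it.
print(soft_target(np.array([2.0, 1.0, 0.1]), 0.0))   # hard target
print(soft_target(np.array([2.0, 1.0, 0.1]), 0.1))   # nearly one-hot
print(soft_target(np.array([2.0, 1.0, 0.1]), 1.0))   # softer distribution
```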
FIG. 7 shows a flow diagram of the calculation processing of the fused loss error function L_Mix in Example 1. As shown in FIG. 7, the fusion model loss function calculator 5 acquires the loss error functions L_HE and L_RL (step S141) and calculates the fused loss error function L_Mix from the trade-off coefficient α, L_HE, and L_RL (step S142).
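The formula for L_Mix is not reproduced in the text at this point; given that α is described as the trade-off (weight-ratio) coefficient between the two losses, a natural reading, offered here only as an assumption, is the convex combination:

```latex
L_{\mathrm{Mix}} = \alpha\,L_{RL} + (1-\alpha)\,L_{HE},
\qquad \alpha \in (0,1).
```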
FIG. 8 shows a flow diagram of the learning processing of the fusion model in Example 1. As shown in FIG. 8, the fusion model learning executor 2 acquires the fused loss error function L_Mix (step S151) and changes the parameters of the fusion model so that the fused loss error function L_Mix becomes smaller (step S152).
FIG. 9 shows a functional block diagram of the agent learning device of Example 2. As shown in FIG. 9, the learning device 12 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 62. The database 62 stores play data generated by the imitation learning model π_IL provided in an imitation learning model executor 7. The loss function calculator 3 is provided with the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The agent learning device 12 of Example 2 differs from the learning device 11 of Example 1, which uses the database 61 storing play data produced by a human expert, in that the database used is the database 62 storing play data generated by the imitation learning model. Accordingly, in the agent learning method of Example 2, the calculation processing of the loss error function L_IL against human-like behavior differs from that of Example 1.
FIG. 10 shows a flow diagram of the calculation processing of the loss error function L_IL against human-like behavior in Example 2. As shown in FIG. 10, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires behavior data A_IL of the imitation learning model from the database 62 (step S211). The loss function calculator 3 also acquires the behavior data A_F of the fusion model (step S212). The first loss error calculator 30 calculates the loss error function L_IL from the acquired A_IL and A_F (step S213).
FIG. 11 shows a functional block diagram of the agent learning device of Example 3. As shown in FIG. 11, the learning device 13 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and the database 62. The database 62 stores play data generated by the imitation learning model π_IL provided in the imitation learning model executor 7. The loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
In the agent learning device 13 of Example 3, unlike Example 2, the loss function calculator 3 is provided not only with the first loss error calculator 30 but also with the imitation learning model executor 7 and the soft-targeting processor 9.
FIG. 20 shows a flow diagram of the calculation processing of the loss error function L_IL against human-like behavior in Example 3. As shown in FIG. 20, the loss function calculator 3 first acquires state data S_R from the database 62 (step S411). The state data S_R is input to the imitation learning model π_IL of the imitation learning model executor 7, which executes and outputs behavior data A_IL (step S412). The behavior data A_IL is input to the soft-targeting processor 9, which executes and outputs behavior data A_ST1 (step S413). Separately, the first loss error calculator 30 acquires the behavior data A_F of the fusion model (step S414). The first loss error calculator 30 calculates the loss error function L_IL from the acquired behavior data A_ST1 and behavior data A_F (step S415).
FIG. 12 shows a functional block diagram of the agent learning device of Example 4. As shown in FIG. 12, the learning device 14 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and a database 63. The database 63 stores play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8. As in Example 3, the loss function calculator 3 is provided with the imitation learning model executor 7, the soft-targeting processor 9, and the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The agent learning device 14 of Example 4 differs from the learning device 13 of Example 3, which uses the database 62 storing play data generated by the imitation learning model, in that the database used is the database 63 storing play data generated by the reinforcement learning model; in all other respects it is the same.
FIG. 13 shows a functional block diagram of the agent learning device of Example 5. As shown in FIG. 13, the learning device 15 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and databases (61, 63). The loss function calculator 3 is provided with the first loss error calculator 30, and the loss function calculator 4 is provided with the second loss error calculator 40.
The learning device 15 of Example 5 differs from Examples 1 to 4 in that it uses two databases: the database 61 storing play data produced by a human expert, and the database 63 storing play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8.
The agent learning method using the learning device 15 will therefore be described with reference to FIGS. 14 to 18. FIG. 14 shows a schematic flow diagram of the agent learning method of Example 5. As shown in FIG. 14, the fusion model learning executor 2 first performs fusion model execution processing 1 (step S31). Separately, the fusion model learning executor 2 performs fusion model execution processing 2 (step S32). After fusion model execution processing 1 in step S31, the loss function calculator 3 calculates the loss error function L_HE against human-like behavior (step S33). After fusion model execution processing 2 in step S32, the loss error function L_RL against optimum behavior is calculated (step S34). Step S33 only has to be performed after step S31, and step S34 only has to be performed after step S32; for example, step S32 may be performed before step S31.
Based on the loss error function L_HE against human-like behavior and the loss error function L_RL against optimum behavior, the fusion model loss function calculator 5 calculates the fused loss error function L_Mix (step S35). The fusion model learning executor 2 obtains the calculated fused loss error function L_Mix and performs the learning processing of the fusion model (step S36). If the fused loss error function L_Mix is greater than or equal to a predetermined value (step S37), the fusion model learning executor 2 performs fusion model execution processing 1 and 2 again (steps S31 and S32).
Next, each process shown in FIG. 14 will be described. FIG. 15 shows flow diagram 1 of the fusion model execution processing of Example 5. As shown in FIG. 15, the fusion model learning executor 2 acquires state data S_R1 from the database 61 (DB1) (step S311), inputs the state data S_R1 to the fusion model π and executes it (step S312), and stores the behavior data A_F1 of the fusion model (step S313).
FIG. 16 shows flow diagram 2 of the fusion model execution processing of Example 5. As shown in FIG. 16, the fusion model learning executor 2 acquires state data S_R2 from the database 63 (DB2) (step S321), inputs the state data S_R2 to the fusion model π and executes it (step S322), and stores the behavior data A_F2 of the fusion model (step S323).
FIG. 17 shows a flow diagram of the calculation processing of the loss error function L_HE against human-like behavior in Example 5. As shown in FIG. 17, the first loss error calculator 30 in the loss function calculator 3 for human-like behavior acquires human behavior data A_R from the database 61 (step S331). The loss function calculator 3 also acquires the behavior data A_F1 of the fusion model (step S332). The first loss error calculator 30 calculates the loss error function L_HE from the acquired A_R and A_F1 (step S333).
FIG. 18 shows a flow diagram of the calculation processing of the loss error function L_RL against optimum behavior in Example 5. As shown in FIG. 18, the loss function calculator 4 for optimum behavior acquires behavior data A_RL of the reinforcement learning model from the database 63 (DB2) (step S341). Separately, the second loss error calculator 40 acquires the behavior data A_F2 of the fusion model (step S342). The second loss error calculator 40 calculates the loss error function L_RL from the acquired behavior data A_RL and behavior data A_F2 (step S343).
The calculation processing of the fused loss error function L_Mix in step S35 of FIG. 14 is the same as the calculation processing flow of the fused loss error function L_Mix of Example 1 shown in FIG. 7. Likewise, the learning processing of the fusion model in step S36 of FIG. 14 is the same as the learning processing flow of the fusion model of Example 1 shown in FIG. 8.
FIG. 19 shows a functional block diagram of the agent learning device of Example 6. As shown in FIG. 19, the learning device 16 comprises the fusion model learning executor 2, the loss function calculator 3 for human-like behavior, the loss function calculator 4 for optimum behavior, the fusion model loss function calculator 5, and databases (62, 63). The loss function calculator 3 is provided with the imitation learning model executor 7, the soft-targeting processor 9, and the first loss error calculator 30, and the loss function calculator 4 is provided with the reinforcement learning model executor 8, the soft-targeting processor 9, and the second loss error calculator 40.
The learning device 16 of Example 6 differs from Example 5, which uses the databases (61, 63), in that it uses two databases: the database 62 (DB1) storing play data generated by the imitation learning model π_IL provided in the imitation learning model executor 7, and the database 63 (DB2) storing play data generated by the reinforcement learning model π_RL provided in the reinforcement learning model executor 8.
(Performance evaluation of the fusion model)
As explained in the examples above, the learning method that learns a human-like agent AI while maintaining the high performance of the reinforcement learning model consists of two processes: a process that learns an agent AI taking efficient optimum actions surpassing human performance, and a process that learns an agent AI selecting actions in a human-like manner.
Each process has been addressed as a task of reinforcement learning and imitation learning, respectively. In the present invention, the fusion model of reinforcement learning and imitation learning is therefore based on policy distillation in the case of a discrete action space and on adversarial imitation learning in the case of a continuous action space.
Let π* be the optimal policy obtained by the reinforcement learning model, π_HE be the human (expert) policy, and α ∈ (0,1) be the parameter that determines the ratio of these two policies; the objective function then takes the form of the equation below.
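The equation referred to here appears only as an image in the original publication and is not reproduced in this text. A plausible reconstruction consistent with the surrounding description, and therefore an assumption rather than a quotation, is a weighted combination of discrepancies from the two teacher policies:

```latex
\min_{\pi}\; L(\pi) = \alpha\,D\bigl(\pi^{*}\,\|\,\pi\bigr)
+ (1-\alpha)\,D\bigl(\pi_{HE}\,\|\,\pi\bigr),
\qquad \alpha \in (0,1),
```

where D(·‖·) is a discrepancy such as the cross entropy between a teacher policy and the fusion policy π.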
In the case of a discrete action space, the objective function for imitation learning is defined, following prior work on imitation learning, as the cross-entropy loss below.
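The cross-entropy loss itself is not reproduced in this text; its standard form for behavioral cloning on expert state–action pairs (s, a) is given below as an assumption based on the cited prior work, not as a quotation:

```latex
L_{IL}(\pi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}_{HE}}\bigl[\log \pi(a \mid s)\bigr].
```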
The human (expert) policy π_HE is difficult to define as a mathematical model, so learning is performed on experimentally sampled data. Although no soft targets can be obtained from π_HE, policy distillation achieves better performance by computing a weighted average of hard and soft targets. Therefore, the data provided by the human (expert) is used as the hard target, and the output of the trained model's policy π^(T)_RL, adjusted by the temperature T, is used as the soft target. The resulting loss function is given below.
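The final loss function referred to here is likewise not reproduced; a reconstruction that combines the hard target a_HE from the expert data with the soft target π^(T)_RL from the trained model, weighted by α (an assumption, not a quotation), is:

```latex
L(\pi) = \alpha\Bigl(-\sum_{a}\pi^{(T)}_{RL}(a \mid s)\log \pi(a \mid s)\Bigr)
+ (1-\alpha)\bigl(-\log \pi(a_{HE} \mid s)\bigr).
```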
In the case of a continuous action space, the known GAIL (Generative Adversarial Imitation Learning) method is used as the imitation learning method (for the GAIL method, see Non-Patent Document 8). The GAIL method requires trajectories τ ~ π sampled from a teacher model π. The objective function maximized by the discriminator D_w and minimized by the student model in the GAIL method are given below.
Here, τ denotes trajectories τ ~ π sampled from the student model. To build the fusion model, the teacher model is made up of both the human (expert) and the reinforcement learning model, so trajectories τ_HE ~ π_HE and τ_RL ~ π_RL are sampled from each expert. The fusion loss function can then be replaced as follows. Intuitively, the discriminator D_w is trained to recognize the fused policy between the human (expert) policy and the reinforcement learning model policy, and the student model π, trained to fool this discriminator, is expected to approach the fused policy and to imitate the strengths of both experts.
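The GAIL objectives referred to here are not reproduced in this text. Following the standard GAIL formulation (so the sign conventions and the fused form below are assumptions, not quotations from the patent), the discriminator objective and its fused replacement can be written as:

```latex
\max_{w}\;\mathbb{E}_{\tau\sim\pi}\bigl[\log D_w(s,a)\bigr]
+\mathbb{E}_{\tau_E\sim\pi_E}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr],
```

```latex
\max_{w}\;\mathbb{E}_{\tau\sim\pi}\bigl[\log D_w(s,a)\bigr]
+\alpha\,\mathbb{E}_{\tau_{RL}\sim\pi_{RL}}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr]
+(1-\alpha)\,\mathbb{E}_{\tau_{HE}\sim\pi_{HE}}\bigl[\log\bigl(1-D_w(s,a)\bigr)\bigr],
```

with the student policy π trained to minimize the same quantity, that is, to fool D_w.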
The experiments carried out will now be described.
(Experiment 1) Atari 2600 game (Gopher)
The learning device of Example 1 was first applied to Gopher, an Atari 2600 game with a discrete action space. In this game, the player acts as a farmer, moving left and right and filling holes so that the gopher coming up from underground cannot take the carrots. A human (expert) and a trained learning model each provided 55,000 frames, of which 50,000 were used as the training set and 5,000 as the test set. In particular, the Adam optimizer with a learning rate of 10^-4 and a dropout rate of 0.5 were used to train the student model. The fusion model was trained with the knowledge distillation temperature set to T = 0.1 and the trade-off coefficient set to α = 0.93.
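For concreteness, the hyper-parameters reported for Experiment 1 can be gathered into a short configuration sketch; the class and field names below are assumptions made for illustration, not code or identifiers from the specification.

```python
from dataclasses import dataclass

@dataclass
class GopherFusionConfig:
    # Hyper-parameters reported for Experiment 1 (Gopher, discrete actions).
    frames_per_source: int = 55_000       # frames from the expert and from the trained RL model
    train_frames: int = 50_000
    test_frames: int = 5_000
    learning_rate: float = 1e-4           # Adam optimizer
    dropout_rate: float = 0.5
    distillation_temperature: float = 0.1  # T
    trade_off_alpha: float = 0.93          # alpha

print(GopherFusionConfig())
```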
(Experiment 2) Torcs
Torcs (see Wymann, B., "The open racing car simulator", (2015)) is one of the simulators most commonly used in autonomous driving research. The experiments with Torcs were based on the GymTorcs environment. The observation space of the agent AI consists of 65 continuous values in total, such as the distance from the car to the track edge, the distance to opponent cars, and the current speed and acceleration. The action space consists of two elements, "steering (left/right)" and "acceleration/deceleration", whose values are limited to the range [-1.0, 1.0].
The reward function was the distance travelled, and the reinforcement learning model was trained based on OpenAI Baselines. Furthermore, so that situations in which human-likeness can be distinguished would arise, a stationary bot was placed in the Torcs simulator, and a human (expert) played 220 episodes of 60 seconds each, from which the data was collected. A plain imitation learning agent was trained on the human (expert) data using OpenAI's GAIL method, in the same way as the reinforcement learning model was trained. Finally, the learner update in the GAIL method was carried out, and the fusion model was trained with the trade-off coefficient α = 0.5 so that both experts would have equal influence.
(Experiment 3) Apple game
In the Apple game, the player collects apples that appear one at a time at random positions on the screen, and the score is the number of apples collected.
(3) Sensory evaluation of human-likeness
In addition to evaluating each model's performance, a double-blind sensory test was conducted to evaluate the human-likeness of the models. The test involved 26 judges in total: 23 men and 3 women, aged 27 to 59 with an average age of 44. None of the judges had had any contact with materials related to the content of the present invention before the survey. First, the rules of each game were explained to the judges, and a hands-on session was held for each game so that they could understand human-like behavior. In the survey, each judge was shown two videos per game (15 seconds for Gopher, 30 seconds for Torcs) and asked to judge whether the player was a human or an AI and to give the reason.
The experimental results will now be described.
(1) Atari 2600 game (Gopher)
In terms of performance, the reinforcement learning model (Comparative Example 3) scored highest, followed by the fusion model of the example, and finally the human (Comparative Example 1) and the imitation learning model (Comparative Example 2). Although the fusion model of the example prioritized the targets provided by the reinforcement learning model (Comparative Example 3) with α = 0.8, its score improved by only 3 points, leaving a large gap to the score of the reinforcement learning model alone. In the sensory test, the reinforcement learning model (Comparative Example 3) was judged to be not very human-like, whereas the fusion model of the example scored higher than the human (Comparative Example 1) and its imitation (Comparative Example 2) not only in game score but also in human-likeness.
This shows that the fusion model was able to learn both the tendency to pursue the reinforcement learning objective and the behavior of the human (expert). Surprisingly, the fusion model was judged to be more human-like than the human (expert). Analysis of the judges' comments to clarify the reason shows that frequent impressions were "few wasted movements", "the movements are precise, there is a programmed feel to them", and "it tries to fill the holes in order". It is therefore considered that high performance was not expected of the human (expert), especially by judges who do not play games often. The score results are summarized in Table 3.
(2) Torcs
For performance evaluation, the scores of the human (expert), the GAIL-based imitation of the human, the DDPG reinforcement learning model, and the fusion model were first compared. The score results are summarized in Table 4. The experiments showed that GAIL is good at imitating a trained reinforcement learning model or a deterministic bot, but that its efficiency in imitating a human (expert) is surprisingly low. This is presumed to be because the human (expert) policy is complex and difficult to handle with a basic neural network. The fusion model of the example, on the other hand, succeeded in imitating characteristic features such as the high speed of the reinforcement learning model (Comparative Example 3) and the cornering style of the human (Comparative Example 1). Moreover, it became able to drive the entire track, like the human (Comparative Example 1) and the reinforcement learning model (Comparative Example 3).
In ascending order of the proportion judged to be human, the reinforcement learning model (Comparative Example 3) came first and was judged not very human-like, for reasons such as "it drives too fast" and "it takes corners at high speed". Surprisingly, the human (Comparative Example 1) was also judged not to be human-like. The judges' comments on the same videos were diverse, but the imitation agent AI, which showed lower performance, was judged to be more human-like than the human (expert); as with Gopher, this suggests that the performance shown by the human (Comparative Example 1) was high. Finally, the fusion model was judged to be the most human-like while showing performance close to that of the reinforcement learning model (Comparative Example 3).
(Experiment 3) Apple game
The Apple game has a simple principle: the player moves their avatar to the position where an apple appears. Reinforcement learning achieved the highest score, followed by the learning method of the example, and finally the human (Comparative Example 1) and its imitation learning.
As far as human-likeness is concerned, the human agent (Comparative Example 1) was rated best. The learning method of the example exceeded the human (Comparative Example 1) and the agent of its imitation learning model (Comparative Example 2) in score while showing more human-like behavior than the agent of the reinforcement learning model (Comparative Example 3), with the human (Comparative Example 1) following. This result shows that the learning method of the example strikes a balance between human-like behavior and high performance in this game.
The present invention is useful for agent learning in a wide range of fields, such as autonomous driving of automobiles and automatic control of industrial robot arms.
1, 11-16 Learning device
2 Fusion model learning executor
3, 4 Loss function calculator
5 Fusion model loss function calculator
6, 61-63 Database
7 Imitation learning model executor
8 Reinforcement learning model executor
9 Soft-targeting processor
30 First loss error calculator
40 Second loss error calculator
Claims (12)
An agent learning method that fuses a learner realizing behavior in which an agent judges and takes the optimum action under a predetermined environment with a learner realizing human-like behavior, and optimizes the agent's behavior policy so that it takes the optimum action in a human-like manner, the method comprising:
an input step of inputting state data S_R and behavior data A_R recorded in at least one of play data produced by a human expert and play data of an agent created for a predetermined purpose;
a learning step of inputting the state data S_R to a learning executor of a fusion model of a reinforcement learning model, which governs the behavior of taking the optimum action, and an imitation learning model, which governs the human-like behavior, and causing it to output behavior data A_F of the fusion model;
a first loss error calculation step of calculating a first loss error between the behavior data A_R and the behavior data A_F;
a second loss error calculation step of calculating a second loss error between the behavior data A_F and behavior data A_RL output on the basis of the state data S_R by an executor of the reinforcement learning model or by an optimum behavior algorithm;
a fusion error calculation step of calculating a fusion error based on a weight ratio of the first and second loss errors; and
an update step of updating parameters of the learning executor of the fusion model based on the fusion error.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data produced by a human expert,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_R in the play data produced by the human expert and the behavior data A_F, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data of an agent of the imitation learning model,
the first loss error calculation step calculates, using a loss function, either the error between the behavior data A_R in the play data of the agent of the imitation learning model and the behavior data A_F, or the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are a record of play data of an agent of the reinforcement learning model,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are state data S_HE and behavior data A_HE recorded in play data produced by a human expert together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning step inputs the state data S_HE to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_HE in the play data produced by the human expert and the behavior data A_F1, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
The agent learning method according to claim 1, wherein, when, in the input step, the state data S_R and the behavior data A_R are state data S_IL and behavior data A_IL recorded in play data of an agent of the imitation learning model together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning step inputs the state data S_IL to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculation step calculates, using a loss function, the error between the behavior data A_F1 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_IL in the play data of the agent of the imitation learning model, and
the second loss error calculation step calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
An agent learning device that fuses a learner realizing behavior in which an agent judges and takes the optimum action under a predetermined environment with a learner realizing human-like behavior, and optimizes the agent's behavior policy so that it behaves in a human-like manner, the device comprising:
an input unit that inputs state data S_R and behavior data A_R recorded in at least one of play data produced by a human expert and play data of an agent created for a predetermined purpose;
a learning executor that inputs the state data S_R to a fusion model of a reinforcement learning model, which governs the behavior of taking the optimum action, and an imitation learning model, which governs the human-like behavior, and outputs behavior data A_F of the fusion model;
a first loss error calculator that calculates a first loss error between the behavior data A_R and the behavior data A_F;
a second loss error calculator that calculates a second loss error between the behavior data A_F and behavior data A_RL output on the basis of the state data S_R by an executor of the reinforcement learning model or by an optimum behavior algorithm;
a fusion error calculator that calculates a fusion error based on a weight ratio of the first and second loss errors; and
an update unit that updates parameters of the learning executor of the fusion model based on the fusion error.
The agent learning device according to claim 8, comprising, depending on the state data S_R and behavior data A_R at the input unit, the configuration for any one of the following calculation processes 1) to 5):
1) when the state data S_R and the behavior data A_R are a record of play data produced by a human expert,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_R in the play data produced by the human expert and the behavior data A_F, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
2) when the state data S_R and the behavior data A_R are a record of play data of an agent of the imitation learning model,
the first loss error calculator calculates, using a loss function, either the error between the behavior data A_R in the play data of the agent of the imitation learning model and the behavior data A_F, or the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
3) when the state data S_R and the behavior data A_R are a record of play data of an agent of the reinforcement learning model,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, behavior data A_IL output when the state data S_R is input to the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_R is input to the reinforcement learning model;
4) when the state data S_R and the behavior data A_R are state data S_HE and behavior data A_HE recorded in play data produced by a human expert together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning executor inputs the state data S_HE to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_HE in the play data produced by the human expert and the behavior data A_F1, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model;
5) when the state data S_R and the behavior data A_R are state data S_IL and behavior data A_IL recorded in play data of an agent of the imitation learning model together with state data S_RL and behavior data A_RL recorded in play data of an agent of the reinforcement learning model,
the learning executor inputs the state data S_IL to output behavior data A_F1 and inputs the state data S_RL to output behavior data A_F2,
the first loss error calculator calculates, using a loss function, the error between the behavior data A_F1 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_IL in the play data of the agent of the imitation learning model, and
the second loss error calculator calculates, using a loss function, the error between the behavior data A_F2 and behavior data obtained by soft-targeting, through knowledge distillation, the behavior data A_RL output when the state data S_RL is input to the reinforcement learning model.
An agent learning program for causing a computer to function as the input unit, the learning executor, the first loss error calculator, the second loss error calculator, the fusion error calculator, and the update unit in the agent learning device according to any one of claims 8 to 10.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2019-097222 | 2019-05-23 | ||
| JP2019097222A JP2020191022A (en) | 2019-05-23 | 2019-05-23 | AI agent learning methods, learning devices and learning programs that behave like humans |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020235693A1 true WO2020235693A1 (en) | 2020-11-26 |
Family
ID=73454686
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/020624 Ceased WO2020235693A1 (en) | 2019-05-23 | 2020-05-25 | Learning method, learning device, and learning program for ai agent that behaves like human |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP2020191022A (en) |
| WO (1) | WO2020235693A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210374612A1 (en) * | 2020-05-26 | 2021-12-02 | Nec Laboratories America, Inc. | Interpretable imitation learning via prototypical option discovery |
| CN115269565A (en) * | 2022-05-16 | 2022-11-01 | 南京大学 | Abnormal recommendation data detection method and system based on reinforcement learning |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE102020215324A1 (en) * | 2020-12-03 | 2022-06-09 | Robert Bosch Gesellschaft mit beschränkter Haftung | Selection of driving maneuvers for at least partially automated vehicles |
| JP7659260B2 (en) * | 2021-02-12 | 2025-04-09 | 古河電気工業株式会社 | EVALUATION APPARATUS, SYSTEM, EVALUATION METHOD, AND PROGRAM |
| CN113299084B (en) * | 2021-05-31 | 2022-04-12 | 大连理工大学 | Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning |
| WO2024116387A1 (en) * | 2022-12-01 | 2024-06-06 | 日本電信電話株式会社 | Information processing device, information processing method, and information processing program |
| CN119249366B (en) * | 2024-12-05 | 2025-02-11 | 中国海洋大学 | A multi-model fusion method for ocean waves based on reinforcement learning |
Non-Patent Citations (2)
| Title |
|---|
| LIAN, XINYU ET AL.: "A Human-Like Agent Based on a Hybrid of Reinforcement and Imitation Learning", IEICE TECHNICAL REPORT, vol. 118, no. 316, 15 November 2018 (2018-11-15), pages 45 - 50, ISSN: 0913-5685 * |
| MIYASHITA, SHOHEI ET AL.: "Developing Game AI Agent Behaving Like Human by Mixing Reinforcement Learning and Supervised Learning", PROCEEDINGS OF THE 18TH IEEE /ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, 26 June 2017 (2017-06-26), pages 489 - 494, XP033146395, ISBN: 978-1-5090-5504-3, DOI: 10.1109/SNPD.2017.8022767 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210374612A1 (en) * | 2020-05-26 | 2021-12-02 | Nec Laboratories America, Inc. | Interpretable imitation learning via prototypical option discovery |
| US12380360B2 (en) * | 2020-05-26 | 2025-08-05 | Nec Corporation | Interpretable imitation learning via prototypical option discovery for decision making |
| CN115269565A (en) * | 2022-05-16 | 2022-11-01 | 南京大学 | Abnormal recommendation data detection method and system based on reinforcement learning |
Also Published As
| Publication number | Publication date |
|---|---|
| JP2020191022A (en) | 2020-11-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2020235693A1 (en) | Learning method, learning device, and learning program for ai agent that behaves like human | |
| Wurman et al. | Outracing champion Gran Turismo drivers with deep reinforcement learning | |
| Wu et al. | Prioritized experience-based reinforcement learning with human guidance for autonomous driving | |
| Justesen et al. | Deep learning for video game playing | |
| Wymann et al. | Torcs, the open racing car simulator | |
| Togelius et al. | Towards automatic personalised content creation for racing games | |
| Hämäläinen et al. | Online motion synthesis using sequential monte carlo | |
| WO2004032045A1 (en) | Idea model device, spontaneous feeling model device, method thereof, and program | |
| Perez et al. | Evolving a fuzzy controller for a car racing competition | |
| Togelius et al. | Computational intelligence in racing games | |
| Chaperot et al. | Improving artificial intelligence in a motocross game | |
| Butt et al. | The Development of Intelligent Agents: A Case-Based Reasoning Approach to Achieve Human-Like Peculiarities via Playback of Human Traces | |
| Muñoz et al. | Towards imitation of human driving style in car racing games | |
| Berthling-Hansen et al. | Automating behaviour tree generation for simulating troop movements (poster) | |
| Babadi et al. | Learning task-agnostic action spaces for movement optimization | |
| Kanervisto | Advances in deep learning for playing video games | |
| Ribeiro | Deep reinforcement learning for robot navigation systems | |
| Babadi et al. | Intelligent middle-level game control | |
| Cavadas et al. | Using provenance data and imitation learning to train human-like bots | |
| Perez et al. | Evolving a rule system controller for automatic driving in a car racing competition | |
| Gao et al. | Comparison of control methods based on imitation learning for autonomous driving | |
| Dwarakanath et al. | Learning to play collaborative-competitive games | |
| Law et al. | Hammers for Robots: Designing Tools for Reinforcement Learning Agents | |
| García et al. | Ensemble Approach to Adaptable Behavior Cloning for a Fighting Game AI | |
| Iqbal et al. | A goal-based movement model for continuous multi-agent tasks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20809465; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20809465; Country of ref document: EP; Kind code of ref document: A1 |