WO2024050712A1 - Method and apparatus for guided offline reinforcement learning - Google Patents
- Publication number
- WO2024050712A1 (application PCT/CN2022/117516)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- policy
- offline
- reinforcement learning
- guiding
- computer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
Definitions
- the present disclosure relates generally to the technical field of artificial intelligence, and more particularly, to offline reinforcement learning (RL) with expert guidance.
- Reinforcement learning is an important area of machine learning, which aims to solve the problem of how agents ought to take actions in different states of an environment so as to maximize some kind of cumulative reward.
- reinforcement learning comprises online RL and offline RL.
- Online RL learns a policy by interacting with the environment.
- Offline RL learns a policy by optimizing the policy based only on an offline dataset, without any interaction with the environment. Since offline RL learns a policy on a previously collected dataset, it usually suffers from a distributional shift problem, due to the gap between the state-action distributions of the offline dataset and of the current policy's interactions with the test environment. Specifically, after being optimized on the offline dataset, the agent might encounter unvisited states or misestimate state-action values during interactions with the online environment, leading to poor performance.
- prior solutions adopt a single trade-off between two conflicting objectives for offline RL, i.e., a policy improvement objective, which aims to optimize the policy according to current value functions, and a policy constraint objective, which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive.
- a policy improvement objective which aims to optimize the policy according to current value functions
- a policy constraint objective which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive.
- prior solutions either add an explicit policy constraint term to the policy improvement equation, or confine the policy implicitly by revising update rules of value functions.
- these solutions generally concentrate only on the global characteristics of the dataset, but ignore the individual features of each sample. Typically, they make only a single trade-off for all the data in a mini-batch, or even in the whole offline dataset. Such "one-size-fits-all" trade-offs might not be able to achieve a perfect balance for each sample, and thus probably limit the performance potential.
- a method for offline reinforcement learning comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- an apparatus for offline reinforcement learning may comprise a memory and at least one processor coupled to the memory.
- the at least one processor may be configured to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- a computer readable medium storing computer code for offline reinforcement learning.
- the computer code when executed by a processor, may cause the processor to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- a computer program product for offline reinforcement learning may comprise processor executable computer code for obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.
- FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
- FIG. 3 illustrates a flow chart of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
- FIG. 4 illustrates a block diagram of an apparatus for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
- FIGs. 5A and 5B illustrate performance comparisons on different numbers of expert samples in accordance with one aspect of the present disclosure.
- FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.
- Reinforcement learning mainly comprises agents, environments, states, actions, and rewards.
- the agent 110 may perform an action a in the environment 120 based on a policy. Then, the environment 120 may transition to a new state s_{t+1}, and a reward r is given to the agent 110 as feedback from the environment 120. Subsequently, the agent 110 may perform new actions according to the rewards and the new states of the environment 120, in a cycle.
- the objective of the policy of the agent 110 is to maximize the accumulated rewards. This process shows how the agent and the environment interact through states, actions, and rewards.
- agents can determine what actions they should take when they are in different states of the environment to get the maximum reward.
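- As a minimal illustration of the interaction loop of FIG. 1, the sketch below assumes a gym-style environment with reset/step methods and a policy callable; the names (run_episode, env, policy) are illustrative and not taken from the present disclosure.

```python
# Minimal sketch of the agent-environment loop of FIG. 1, assuming a gym-style
# environment (older 4-tuple step API) and a policy callable; names are illustrative.
def run_episode(env, policy, gamma=0.99):
    state = env.reset()                    # initial state s_0 drawn from d_0(s_0)
    total_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)             # agent selects action a from its policy
        state, reward, done, _ = env.step(action)  # environment returns s_{t+1} and reward r
        total_return += discount * reward  # accumulate the discounted reward
        discount *= gamma                  # apply the discount factor gamma
    return total_return
```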
- the agent may be a robot or particularly a brain of the robot, and the environment may be the environment of the place where the robot works.
- the environment may have various states, such as, obstacles, slopes, hollows, etc.
- the robot may take different actions in different states based on the learned policy in its brain, for example in order to walk on the road. If the robot takes correct actions and bypasses the obstacles, it may get a higher total reward.
- Reinforcement learning is usually expressed as a Markov decision process (MDP), denoted as a tuple (S, A, P, d_0, R, γ), where S is the state space, A is the action space, P(s_{t+1} | s_t, a) stands for the environment's state transition probability, s_t and s_{t+1} belong to the state space, a belongs to the action space, the policy of the agent may be based on such a probability, d_0(s_0) denotes a distribution of the initial state s_0, R(s_t, a, s_{t+1}) defines a reward function, and γ ∈ (0, 1] is a discount factor.
- offline RL aims to optimize the policy using only an offline dataset D = {(s_k, a_k, s'_k, r_k)}, where a_k is the action taken in state s_k of the environment, s'_k is the state transitioned to due to a_k, and r_k is the corresponding reward.
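- A minimal sketch of such an offline dataset of (s_k, a_k, s'_k, r_k) transitions with uniform mini-batch sampling is given below (NumPy; the class and field names are illustrative assumptions, not part of the present disclosure).

```python
import numpy as np

class OfflineDataset:
    """Fixed dataset D = {(s_k, a_k, s'_k, r_k)} collected beforehand; no environment access."""
    def __init__(self, states, actions, next_states, rewards):
        self.states, self.actions = np.asarray(states), np.asarray(actions)
        self.next_states, self.rewards = np.asarray(next_states), np.asarray(rewards)

    def sample(self, batch_size, rng=np.random):
        # Uniformly sample a mini-batch of transitions for offline training.
        idx = rng.randint(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.next_states[idx], self.rewards[idx])
```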
- This characteristic of offline RL brings convenience to applications in many fields where online interactions are expensive or dangerous, such as, robot control, autonomous driving, and health care.
- the goal of most offline RL solutions comprises two conflicting objectives, either explicitly or implicitly: (1) policy improvement, which aims to optimize the policy according to current value functions; and (2) policy constraint, which keeps the policy around the behavior policy or the offline dataset's distribution.
- Offline RL has to make a trade-off between these two objectives: if concentrating on the policy improvement term too much, the policy probably steps into an unfamiliar area and generates bad actions due to distributional shift; otherwise, focusing excessively on the policy constraint term might lead to the policy only imitating behaviors in the offline dataset and possibly lacking generalization ability towards out-of-distribution data.
- the prior offline RL solutions may include: TD3+BC (Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021); its variant SAC+BC, which applies SAC (Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018) to the TD3+BC framework; and CQL (Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning).
- Q(s, a) is a state-action value function estimating the expected sum of discounted rewards after taking action a at state s.
- L_pi(·) and L_pc(·) stand for the policy improvement and policy constraint terms, respectively.
- F(·) is a trade-off function between L_pi(·) and L_pc(·).
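- As an illustration of this structure, the sketch below shows a TD3+BC-style instance of F(L_pi, L_pc): the policy improvement term is the estimated value Q(s, π_φ(s)) and the policy constraint term is the behavior-cloning error (π_φ(s) − a)², combined with a single fixed trade-off weight as in prior solutions. The per-batch Q normalization used by TD3+BC is simplified to a constant here, and all names are illustrative assumptions (PyTorch).

```python
import torch

def policy_objective(policy, critic, states, actions, alpha=2.5):
    """F(L_pi, L_pc) with one fixed trade-off weight for the whole batch (prior-art style)."""
    pi_actions = policy(states)
    # Policy improvement term: push the policy toward actions with high estimated Q-values.
    l_pi = critic(states, pi_actions).mean()
    # Policy constraint term: keep the policy's actions close to the dataset actions.
    l_pc = ((pi_actions - actions) ** 2).mean()
    # A single "one-size-fits-all" trade-off applied to every sample; to be maximized.
    return alpha * l_pi - l_pc
```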
- An ideal improvement-constraint balance for offline RL is to concentrate more on policy constraint for the samples resembling expert behaviors, but to stress policy improvement more for the data similar to random behaviors. Furthermore, many online RL methods have shown that expert demonstrations, even in a small quantity, are beneficial to policy performance, but current offline RL methods do not take full advantage of the expert data.
- an offline RL method is provided which may determine an adaptive trade-off between policy improvement and policy constraint for different samples, with the guidance of only a small amount of expert data.
- FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning (GORL) in accordance with one aspect of the present disclosure.
- An offline dataset 210 may contain an enormous amount of data, and may be used for training an RL network 220.
- the RL network 220 may provide a policy for an agent.
- a guiding dataset 230 may consist of a small amount of expert data sampled from expert demonstrations or behaviors, and may be used for training a guiding network 240.
- the guiding network 240 may be used to guide the policy’s training of the RL network 220.
- the GORL method may alternate between updating the guiding network on the guiding dataset in a MAML (Model-Agnostic Meta-Learning) -like way and training the RL agent on the offline dataset with the guidance of the updated guiding network.
- the GORL method is a plug-in approach, and may evaluate the relative importance of policy improvement and policy constraint for each datum adaptively and end-to-end.
- the GORL method points out a theoretically guaranteed optimization direction for the agent, and may be easy to implement on most offline RL solutions.
- the GORL method may achieve significant performance improvement on a number of state-of-the-art offline RL solutions with the guidance of only a few expert data.
- the offline dataset may be denoted as D = {(s_k, a_k, s'_k, r_k)}, k = 1, ..., N, and the guiding dataset may be denoted as D(e) = {(s_k, a_k, s'_k, r_k)}, k = 1, ..., M, where M ≪ N.
- D is a large offline dataset containing sub-optimal or even random policies' trajectories, and D(e) is a guiding dataset with a small quantity of optimal data, such as data collected by expert or nearly expert policies.
- a training objective of the GORL method may be formulated as:
- L_pc(·) stands for a policy constraint term, e.g., (π_φ(s_k) − a_k)² in TD3+BC.
- the guiding network with parameters w takes a policy constraint term L_pc as input, and outputs a constraint degree.
- constraint degrees generated by the guiding network may vary with different state-action pairs (s_k, a_k).
- the guiding network may take the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective.
- the constraint degree varies for different samples in the offline dataset and the guiding dataset.
- the guiding network may output a higher relative importance of the policy constraint objective as compared to the policy improvement objective for samples corresponding (i.e., similar) to expert behaviors in the offline dataset, and may output a higher relative importance of the policy improvement objective as compared to the policy constraint objective for samples corresponding (i.e., similar) to random behaviors in the offline dataset.
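- A minimal sketch of such a guiding network is given below: a small multilayer perceptron that maps each sample's policy constraint value to a non-negative, per-sample constraint degree. The architecture, sizes, and the Softplus output are assumptions for illustration; the present disclosure does not prescribe them (PyTorch).

```python
import torch
import torch.nn as nn

class GuidingNetwork(nn.Module):
    """Maps a per-sample policy constraint term L_pc to a non-negative constraint degree."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),  # keep the constraint degree non-negative
        )

    def forward(self, l_pc):
        # l_pc: tensor of shape (batch,) holding each sample's constraint term.
        return self.net(l_pc.unsqueeze(-1)).squeeze(-1)
```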
- FIG. 3 illustrates a flow chart of a method 300 for training a guided offline reinforcement learning network in accordance with one aspect of the present disclosure.
- the method 300 may be implemented by a computer.
- the computer may be any computing device, such as a cloud server, a distributed computing entity, etc.
- the method 300 may comprise obtaining an offline reinforcement learning network.
- the offline reinforcement learning network may provide a policy for an agent to take an action at a state of an environment.
- the offline reinforcement learning network is used for robot control, autonomous driving, or health care.
- an agent, such as the robot or the brain of the robot, may take corresponding actions (such as walking) at different states of the environment (such as road conditions), based on the policy in the brain of the robot provided by a learned RL network.
- the central control unit of a car may drive the car, e.g., turn left or brake (i.e., actions), under different traffic conditions (i.e., environment), based on a policy learned through RL.
- an agent, such as an automated robotic arm for surgery, can make decisions about, for example, the type of treatment, drug dose, or review time (i.e., actions) at a certain point in time, according to the current health status and previous treatment history of a patient (i.e., environment), based on a policy learned through RL.
- the offline reinforcement learning network may be obtained by initializing an offline reinforcement learning network with random policy parameters.
- an offline reinforcement learning network trained based on the prior or even the present offline RL methods may be obtained and used as the initial offline reinforcement learning network for further optimization.
- the offline reinforcement learning network may be trained based on TD3+BC, SAC+BC, IQL, CQL, etc.
- method 300 may also comprise any other well-known steps for training an offline RL network.
- method 300 may also comprise updating the state-action value function (such as the Q function in Equation 1 or 2) of the offline RL network with the offline dataset.
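- As an illustration of such a value-function update on a mini-batch of offline data, the sketch below uses a standard one-step TD target; target-network and terminal-state handling of specific methods (e.g., TD3) are omitted, and the names are assumptions.

```python
import torch

def update_critic(critic, target_critic, policy, batch, optimizer, gamma=0.99):
    """One TD-style update of the state-action value function on an offline mini-batch."""
    states, actions, next_states, rewards = batch
    with torch.no_grad():
        # Bootstrapped TD target r + gamma * Q'(s', pi(s')); terminal masking omitted for brevity.
        target_q = rewards + gamma * target_critic(next_states, policy(next_states))
    td_error = critic(states, actions) - target_q
    loss = (td_error ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```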
- method 300 may comprise generating a guiding network on a guiding dataset.
- the guiding network may evaluate a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network.
- the generating a guiding network in block 320 may comprise initializing a guiding network with random guiding parameters, or updating the guiding parameters of the guiding network on the guiding dataset based on updated policy parameters.
- method 300 may comprise updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- the updating step may comprise updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance evaluated with the updated guiding parameters.
- steps in block 320 and block 330, as well as one or more other steps of method 300, may be performed alternately and cyclically a number of times to better balance the policy improvement and policy constraint terms and thereby optimize the policy of the offline RL network.
- n_π and ε_π are the mini-batch size and learning rate, respectively, for the policy π_φ.
- method 300 may comprise updating the guiding parameters of the guiding network on the guiding dataset, i.e., D(e), based on the updated policy parameters, as follows:
- n_w is the mini-batch size and ε_w is the step size for the guiding network.
- the policy's parameters used here are the updated parameters from Equation 3, which depend on w.
- method 300 may comprise updating the policy π_φ of the offline reinforcement learning network on an offline dataset by moving the policy's parameters φ toward the direction of maximizing the policy objective in Equation 2, as follows:
- In Equation 5, the guiding network controls the relative update steps of the policy improvement and policy constraint gradients for each data pair (s_k, a_k) in the mini-batches.
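- A simplified sketch of this alternation is given below: a virtual policy update under the current guiding weights, a guiding-network update evaluated with the virtually updated policy on the guiding (expert) mini-batch, and then the actual policy update under the refreshed guidance. The guiding loss here is a simple behavior-cloning error on the expert data, and the objective shapes follow a TD3+BC-style form; these are illustrative assumptions and not the exact Equations 3-5 of the present disclosure (PyTorch 2.x, torch.func).

```python
import torch
from torch.func import functional_call

def gorl_step(policy, critic, guide, policy_opt, guide_opt,
              offline_batch, guide_batch, inner_lr=3e-4):
    """One MAML-like GORL alternation: virtual policy step -> guiding step -> real policy step."""
    s, a, _, _ = offline_batch        # mini-batch from the large offline dataset D
    gs, ga, _, _ = guide_batch        # mini-batch from the small expert guiding dataset D(e)

    # 1) Virtual policy update (cf. Equation 3), kept differentiable w.r.t. the guiding parameters w.
    params = dict(policy.named_parameters())
    pi_a = functional_call(policy, params, (s,))
    l_pc = ((pi_a - a) ** 2).mean(dim=-1)                 # per-sample constraint term
    degrees = guide(l_pc)                                 # per-sample constraint degrees
    inner_loss = -(critic(s, pi_a).mean() - (degrees * l_pc).mean())
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    updated = {k: v - inner_lr * g for (k, v), g in zip(params.items(), grads)}

    # 2) Guiding-network update (cf. Equation 4): evaluate the virtually updated policy on expert data.
    guide_loss = ((functional_call(policy, updated, (gs,)) - ga) ** 2).mean()
    guide_opt.zero_grad()
    guide_loss.backward()                                 # gradients flow into w through `updated`
    guide_opt.step()

    # 3) Real policy update (cf. Equation 5) under the refreshed per-sample guidance.
    pi_a = policy(s)
    l_pc = ((pi_a - a) ** 2).mean(dim=-1)
    with torch.no_grad():
        degrees = guide(l_pc)
    policy_loss = -(critic(s, pi_a).mean() - (degrees * l_pc).mean())
    policy_opt.zero_grad()                                # also clears grads left over from step 2
    policy_loss.backward()
    policy_opt.step()
```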
- the presently disclosed GORL plug-in approach may be applied to various prior or future offline RL methods, including TD3+BC, SAC+BC, IQL, CQL, etc.
- a constraint term may be explicit in some methods (e.g., TD3+BC and SAC+BC) , while much more implicit in other methods (e.g., CQL and IQL) .
- for TD3+BC, the procedures in Algorithm 1 may be followed, with L_pc(a_k, π_φ) substituted with (π_φ(s_k) − a_k)².
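- A short sketch of the resulting TD3+BC-style policy loss with per-sample constraint degrees in place of the single fixed BC weight is given below; the λ scaling follows the TD3+BC convention, and the names are illustrative assumptions.

```python
import torch

def td3bc_gorl_policy_loss(policy, critic, guide, states, actions, alpha=2.5):
    """TD3+BC-style policy loss with per-sample constraint degrees from a guiding network."""
    pi_a = policy(states)
    q = critic(states, pi_a)
    lam = alpha / q.abs().mean().detach()            # TD3+BC-style scaling of the Q term
    l_pc = ((pi_a - actions) ** 2).mean(dim=-1)      # per-sample BC term (pi_phi(s_k) - a_k)^2
    degrees = guide(l_pc.detach())                   # per-sample constraint degrees
    return -(lam * q.mean() - (degrees * l_pc).mean())  # negated so it can be minimized
```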
- the pseudo-code of TD3+BC with GORL is presented in Algorithm 2 as below.
- Table 1 shows the hyperparameters of TD3+BC with GORL on Gym locomotion and Adroit robotic manipulation tasks in the D4RL benchmark dataset.
- Table 2 below shows the hyperparameters of SAC+BC with GORL on the locomotion/adroit dataset.
- the policy update objective in IQL may be:
- Equation 7 may be reformulated into:
- Equation 8 assigns a different scalar β_k to each data pair (s_k, a_k).
- β_k is the exponent applied to exp(Q(s_k, a_k) − V(s_k)).
- Equation 8 may be further changed into:
- GORL may be implemented on IQL by following Algorithm 1 with the new objective (Equation 9), where the guiding network is used to generate β_k from its input.
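- Based only on the description above (a per-sample exponent β_k applied to exp(Q(s_k, a_k) − V(s_k))), the sketch below shows what such an advantage-weighted policy loss could look like; the choice of input to the guiding network, the clipping constant, and the policy's log_prob interface are assumptions, and the exact Equations 8-9 are not reproduced.

```python
import torch

def iql_gorl_policy_loss(policy, q_net, v_net, guide, states, actions):
    """Advantage-weighted policy loss with a per-sample exponent beta_k (illustrative)."""
    with torch.no_grad():
        # Advantage estimate Q(s_k, a_k) - V(s_k); assumes the critics return shape (batch, 1).
        adv = (q_net(states, actions) - v_net(states)).squeeze(-1)
    beta_k = guide(adv)                                  # per-sample exponent from the guiding network
    weights = torch.exp(beta_k * adv).clamp(max=100.0)   # (exp(Q - V))^{beta_k}, clipped for stability
    log_prob = policy.log_prob(states, actions)          # log pi_phi(a_k | s_k), shape (batch,)
    return -(weights * log_prob).mean()
```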
- the pseudo-code of IQL with GORL is presented in Algorithm 4 as below.
- Table 3 below shows the hyperparameters of IQL with GORL on the locomotion/adroit dataset.
- the policy update objective in CQL may be:
- Q(s, a) is a conservative approximation of the state-action value.
- the policy constraint objective is implicitly contained in the conservative Q-learning: a more conservative Q-value represents a stronger policy constraint.
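- For reference, the sketch below shows a CQL-style conservative regularizer that realizes this implicit constraint: Q-values of candidate (potentially out-of-distribution) actions are pushed down through a log-sum-exp term while Q-values of dataset actions are pushed up. This is a simplified sampled approximation with assumed shapes and names, not the exact objective of the present disclosure.

```python
import torch

def conservative_q_penalty(q_net, states, dataset_actions, sampled_actions, alpha=1.0):
    """CQL-style regularizer: logsumexp over candidate actions minus Q on dataset actions."""
    # sampled_actions: (batch, num_candidates, action_dim) candidate actions per state.
    b, n, _ = sampled_actions.shape
    s_rep = states.unsqueeze(1).expand(b, n, states.shape[-1]).reshape(b * n, -1)
    q_cand = q_net(s_rep, sampled_actions.reshape(b * n, -1)).reshape(b, n)
    # A larger penalty weight alpha yields a more conservative Q, i.e., a stronger implicit constraint.
    penalty = torch.logsumexp(q_cand, dim=1) - q_net(states, dataset_actions).squeeze(-1)
    return alpha * penalty.mean()
```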
- GORL can be implemented following Algorithm 1 with the new policy update objective below:
- Table 4 below shows the hyperparameters of CQL with GORL on the locomotion/adroit dataset.
- FIG. 4 illustrates a block diagram of an apparatus 400 in accordance with one aspect of the present disclosure.
- the apparatus 400 may comprise a memory 410 and at least one processor 420.
- the apparatus 400 may be used for training an offline reinforcement learning network.
- the processor 420 may be coupled to the memory 410 and configured to perform the method 300 described above with reference to FIG. 3.
- the apparatus 400 may be used for a trained offline reinforcement learning network.
- the processor 420 may be coupled to the memory 410 and configured to implement an offline reinforcement learning network trained by performing the method 300.
- the processor 420 may be a general-purpose processor, an artificial intelligence processor, or may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- the memory 410 may store the input data, output data, data generated by processor 420, and/or instructions executed by processor 420.
- the Guided Offline Reinforcement Learning (GORL) in this disclosure is a general training framework compatible with most offline RL methods.
- the GORL method may learn a sample-adaptive intensity of policy constraint under the guidance of only a few high-quality data (i.e., expert data) .
- GORL may exert a weak constraint to “random-like” samples in the offline dataset, and may exert a strong constraint to “expert-like” samples in the offline dataset.
- each sample may be assigned a different weight and the weights vary through training.
- the agent when fed with relatively high-quality samples (i.e., the samples similar to expert behaviors) , the agent may be inclined to imitation learning; otherwise, when encountering low-quality samples (i.e., the samples similar to random behaviors) , it may choose to slightly diverge from these samples’ distribution.
- Such adaptive weights seek to achieve the full potential of every sample, leading to higher performance compared with the fixed weight.
- Equation 4 can be reformulated as:
- In Equation 12, a larger value of the corresponding guiding term would encourage the guiding network to output a larger constraint degree for the corresponding policy loss.
- Equation 13 is an inner product between the guiding gradient average and the policy's per-sample gradient. Therefore, it would assign larger weights to those samples whose gradients are close to the guiding gradient average.
- the benefits are two-fold: (1) the policy would align its update directions closer to the guiding gradient average, whose reliability is theoretically guaranteed; (2) the policy could also enjoy plenty of information about the environment provided by the large amount of data in the offline dataset, information which is scarce in the guiding dataset due to its small data quantity.
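- To make the inner-product interpretation concrete, the sketch below computes per-sample policy gradients and weights each offline sample by the inner product between its gradient and the average gradient obtained on the guiding (expert) mini-batch. The explicit loop and the raw (unnormalized) inner products are simplifications for clarity; they are not the exact weighting of Equation 13.

```python
import torch

def gradient_alignment_weights(policy, per_sample_losses, guiding_loss):
    """Weight each offline sample by <average expert gradient, its own gradient>."""
    params = [p for p in policy.parameters() if p.requires_grad]
    # Average gradient on the guiding (expert) mini-batch.
    g_guide = torch.autograd.grad(guiding_loss, params, retain_graph=True)
    weights = []
    for loss_k in per_sample_losses:                     # one scalar loss per offline sample
        g_k = torch.autograd.grad(loss_k, params, retain_graph=True)
        # Inner product between the guiding gradient average and this sample's gradient.
        dot = sum((g1 * g2).sum() for g1, g2 in zip(g_guide, g_k))
        weights.append(dot)
    return torch.stack(weights)                          # larger value = better alignment
```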
- the guiding gradient obtained on n expert guiding data may be denoted formally as follows:
- the sample index follows the uniform distribution on {1, 2, ..., n}. It can be proved that, as n increases, the guiding gradient on n expert guiding data converges in probability to the optimal guiding gradient, at a rate that improves with n.
- the guiding gradient average in Equation 13 will therefore be approximately equal to the optimal gradient, and thus provides reliable guidance for the offline RL algorithms.
- FIG. 5A illustrates performance comparisons between the mixed scheme (such as, vanilla) and the guided scheme on different numbers of expert samples.
- the horizontal axis is the number of expert samples
- the vertical axis is percent difference between these two schemes
- the grey bars correspond to the guided scheme (denoted as D(e)→D)
- the black bars correspond to the mixed scheme (denoted as D(e)+D).
- with a quite small quantity of expert data, e.g., a hundred or several hundred samples (the offline dataset's size is typically 1 million), the guided scheme consistently outperforms the mixed scheme.
- FIG. 5B shows a result of a policy trained on expert-only dataset (denoted as “D (e) ” ) with different dataset scales.
- the horizontal axis is the number of expert samples, and the vertical axis is the normalized score. It's obvious that the policy's scores remain quite low until the expert sample number reaches 10^4, which demonstrates that a large amount of training data is necessary for offline RL.
- a computer program product for training an offline reinforcement learning network may comprise processor executable computer code for performing the method 300 described above with reference to FIG. 3.
- a computer program product for an offline reinforcement learning network may comprise processor executable computer code when executed by a processor causing the processor to implement the offline reinforcement learning network trained by performing the method 300.
- a computer readable medium may store computer code for training an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to perform the method 300 described above with reference to FIG. 3.
- a computer readable medium may store computer code for an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to implement the offline reinforcement learning network trained by performing the method 300.
- Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Any connection may be properly termed as a computer-readable medium. Other embodiments and implementations are within the scope of the disclosure.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Feedback Control In General (AREA)
- Machine Translation (AREA)
Abstract
Description
Claims (17)
- A computer-implemented method for training an offline reinforcement learning network, comprising: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
- The computer-implemented method of claim 1, wherein the offline reinforcement learning network is used for robot control, autonomous driving, or health care.
- The computer-implemented method of claim 1, wherein the guiding network takes the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective, the constraint degree varies for different samples in the offline dataset and the guiding dataset.
- The computer-implemented method of claim 1, wherein the guiding network outputs a higher relative importance of the policy constraint objective as compared to the policy improvement objective for high-quality samples in the offline dataset, and outputs a higher relative importance of the policy improvement objective as compared to the policy constraint objective for low-quality samples in the offline dataset.
- The computer-implemented method of claim 1, wherein the guiding dataset includes hundreds of high-quality samples collected from expert behaviors.
- The computer-implemented method of claim 1, wherein the obtaining an offline reinforcement learning network comprises: initializing an offline reinforcement learning network with random policy parameters.
- The computer-implemented method of claim 6, further comprising: updating a value function of the offline reinforcement learning network on a mini-batch of offline data sampled from the offline dataset.
- The computer-implemented method of claim 7, wherein the generating a guiding network comprises: initializing a guiding network with random guiding parameters.
- The computer-implemented method of claim 8, wherein the updating policy parameters of the offline reinforcement learning network comprises: updating the policy parameters with a gradient descent step on the mini-batch of offline data based on the relative importance output by the guiding network with the random guiding parameters.
- The computer-implemented method of claim 9, wherein the generating a guiding network further comprises: updating guiding parameters of the guiding network on a mini-batch of guiding data sampled from the guiding dataset based on the updated policy parameters.
- The computer-implemented method of claim 10, wherein the updating policy parameters of the offline reinforcement learning network further comprises: updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance output by the guiding network with the updated guiding parameters.
- An apparatus for training an offline reinforcement learning network, comprising: a memory; and at least one processor coupled to the memory and configured to perform the computer-implemented method of one of claims 1-11.
- A computer readable medium, storing computer code for training an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to perform the computer-implemented method of one of claims 1-11.
- A computer program product for training an offline reinforcement learning network, comprising: processor executable computer code for performing the computer-implemented method of one of claims 1-11.
- An apparatus for an offline reinforcement learning network, comprising: a memory; and at least one processor coupled to the memory and configured to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
- A computer readable medium, storing computer code for an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
- A computer program product for an offline reinforcement learning network, comprising: processor executable computer code for implementing the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/117516 WO2024050712A1 (en) | 2022-09-07 | 2022-09-07 | Method and apparatus for guided offline reinforcement learning |
| CN202280099600.0A CN119895440A (en) | 2022-09-07 | 2022-09-07 | Method and apparatus for directed offline reinforcement learning |
| DE112022007008.0T DE112022007008T5 (en) | 2022-09-07 | 2022-09-07 | METHOD AND DEVICE FOR GUIDED OFFLINE REINFORCEMENT LEARNING |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2022/117516 WO2024050712A1 (en) | 2022-09-07 | 2022-09-07 | Method and apparatus for guided offline reinforcement learning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024050712A1 (en) | 2024-03-14 |
Family
ID=90192675
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/117516 Ceased WO2024050712A1 (en) | 2022-09-07 | 2022-09-07 | Method and apparatus for guided offline reinforcement learning |
Country Status (3)
| Country | Link |
|---|---|
| CN (1) | CN119895440A (en) |
| DE (1) | DE112022007008T5 (en) |
| WO (1) | WO2024050712A1 (en) |
Citations (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190228309A1 (en) * | 2018-01-25 | 2019-07-25 | The Research Foundation For The State University Of New York | Framework and methods of diverse exploration for fast and safe policy improvement |
| US20190354859A1 (en) * | 2018-05-18 | 2019-11-21 | Deepmind Technologies Limited | Meta-gradient updates for training return functions for reinforcement learning systems |
| US20200119556A1 (en) * | 2018-10-11 | 2020-04-16 | Di Shi | Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency |
| US20200151562A1 (en) * | 2017-06-28 | 2020-05-14 | Deepmind Technologies Limited | Training action selection neural networks using apprenticeship |
| US20200302323A1 (en) * | 2019-03-20 | 2020-09-24 | Sony Corporation | Reinforcement learning through a double actor critic algorithm |
| US20210034970A1 (en) * | 2018-02-05 | 2021-02-04 | Deepmind Technologies Limited | Distributed training using actor-critic reinforcement learning with off-policy correction factors |
| US20210367424A1 (en) * | 2020-05-19 | 2021-11-25 | Ruisheng Diao | Multi-Objective Real-time Power Flow Control Method Using Soft Actor-Critic |
| WO2022023386A1 (en) * | 2020-07-28 | 2022-02-03 | Deepmind Technologies Limited | Off-line learning for robot control using a reward prediction model |
| WO2022028926A1 (en) * | 2020-08-07 | 2022-02-10 | Telefonaktiebolaget Lm Ericsson (Publ) | Offline simulation-to-reality transfer for reinforcement learning |
| WO2022045425A1 (en) * | 2020-08-26 | 2022-03-03 | 주식회사 우아한형제들 | Inverse reinforcement learning-based delivery means detection apparatus and method |
| WO2022167079A1 (en) * | 2021-02-04 | 2022-08-11 | Huawei Technologies Co., Ltd. | An apparatus and method for training a parametric policy |
- 2022-09-07: WO PCT/CN2022/117516 patent/WO2024050712A1/en not_active Ceased
- 2022-09-07: DE DE112022007008.0T patent/DE112022007008T5/en active Pending
- 2022-09-07: CN CN202280099600.0A patent/CN119895440A/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN119895440A (en) | 2025-04-25 |
| DE112022007008T5 (en) | 2025-01-30 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Ross et al. | Reinforcement and imitation learning via interactive no-regret learning | |
| Pecka et al. | Safe exploration techniques for reinforcement learning–an overview | |
| CN113780576B (en) | Collaborative multi-agent reinforcement learning method based on reward self-adaptive distribution | |
| CN113191500A (en) | Decentralized off-line multi-agent reinforcement learning method and execution system | |
| He et al. | Rediffuser: Reliable decision-making using a diffuser with confidence estimation | |
| Bowen et al. | Finite-time theory for momentum Q-learning | |
| Huang et al. | Svqn: Sequential variational soft q-learning networks | |
| WO2024050712A1 (en) | Method and apparatus for guided offline reinforcement learning | |
| Sun et al. | Deterministic and discriminative imitation (d2-imitation): revisiting adversarial imitation for sample efficiency | |
| Zhang et al. | Balancing exploration and exploitation in hierarchical reinforcement learning via latent landmark graphs | |
| Kumar et al. | Neural/fuzzy self learning Lyapunov control for non linear systems | |
| Wang et al. | Are Expressive Models Truly Necessary for Offline RL? | |
| Xiao et al. | Potential-based advice for stochastic policy learning | |
| Liu et al. | Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning | |
| Huang et al. | Parameter adaptation within co-adaptive learning classifier systems | |
| Li et al. | Robust Reinforcement Learning via Progressive Task Sequence. | |
| Valensi et al. | Tree search-based policy optimization under stochastic execution delay | |
| Bi et al. | A Comparative Study of Deterministic and Stochastic Policies for Q-learning | |
| Cao et al. | Hierarchical reinforcement learning for kinematic control tasks with parameterized action spaces | |
| Alaa et al. | Curriculum learning for deep reinforcement learning in swarm robotic navigation task | |
| Lu et al. | Demonstration Guided Multi-Objective Reinforcement Learning | |
| CN111950691A (en) | A Reinforcement Learning Policy Learning Method Based on Latent Action Representation Space | |
| Hlavatý et al. | Development of Advanced Control Strategy Based on Soft Actor-Critic Algorithm | |
| CN113065693B (en) | A Traffic Flow Prediction Method Based on Radial Basis Neural Network | |
| Ma et al. | Cultural algorithm based on particle swarm optimization for function optimization |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22957685; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 112022007008; Country of ref document: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 202280099600.0; Country of ref document: CN |
| | WWP | Wipo information: published in national office | Ref document number: 202280099600.0; Country of ref document: CN |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22957685; Country of ref document: EP; Kind code of ref document: A1 |