
WO2024050712A1 - Method and apparatus for guided offline reinforcement learning - Google Patents

Method and apparatus for guided offline reinforcement learning Download PDF

Info

Publication number
WO2024050712A1
Authority
WO
WIPO (PCT)
Prior art keywords
policy
offline
reinforcement learning
guiding
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2022/117516
Other languages
French (fr)
Inventor
Gao HUANG
Qisen YANG
Shenzhi WANG
Qihang ZHANG
Wenjie Shi
Haigang ZHOU
Shiji Song
Xiaonan LU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Robert Bosch GmbH
Original Assignee
Tsinghua University
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University, Robert Bosch GmbH filed Critical Tsinghua University
Priority to PCT/CN2022/117516 priority Critical patent/WO2024050712A1/en
Priority to CN202280099600.0A priority patent/CN119895440A/en
Priority to DE112022007008.0T priority patent/DE112022007008T5/en
Publication of WO2024050712A1 publication Critical patent/WO2024050712A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning

Definitions

  • the present disclosure relates generally to artificial intelligence technical field, and more particularly, to offline reinforcement learning (RL) with expert guidance.
  • RL offline reinforcement learning
  • Reinforcement learning is an important area of machine learning, which aims to solve problems of how agents ought to take actions in different states of an environment so as to maximize some kinds of cumulative reward.
  • reinforcement learning comprises online RL and offline RL.
  • Online RL learns a policy by interacting with the environment
  • offline RL learns a policy by optimizing it based only on an offline dataset, without any interaction with the environment. Since offline RL learns a policy on a previously collected dataset, it usually suffers from a distributional shift problem, due to the gap between the state-action distributions of the offline dataset and of the current policy's interactions with the test environment. Specifically, after being optimized on the offline dataset, the agent might encounter unvisited states or misestimate state-action values during interactions with the online environment, leading to poor performance.
  • prior solutions adopt a single trade-off between two conflicting objectives for offline RL, i.e., a policy improvement objective, which aims to optimize the policy according to current value functions, and a policy constraint objective, which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive.
  • a policy improvement objective which aims to optimize the policy according to current value functions
  • a policy constraint objective which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive.
  • prior solutions either add an explicit policy constraint term to the policy improvement equation, or confine the policy implicitly by revising update rules of value functions.
  • these solutions generally concentrate only on the global characteristics of the dataset, but ignore the individual feature of each sample. Typically, they make only a single trade-off for all the data in a mini-batch or even in the whole offline dataset. Such “one-size-fits-all” trade-offs might not be able to achieve a perfect balance for each sample, and thus probably limit the potential of performance.
  • a method for offline reinforcement learning comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  • an apparatus for offline reinforcement learning may comprise a memory and at least one processor coupled to the memory.
  • the at least one processor may be configured to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  • a computer readable medium storing computer code for offline reinforcement learning.
  • the computer code when executed by a processor, may cause the processor to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  • a computer program product for offline reinforcement learning may comprise processor executable computer code for obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  • FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.
  • FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
  • FIG. 3 illustrates a flow chart of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
  • FIG. 4 illustrates a block diagram of an apparatus for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
  • FIGs. 5A and 5B illustrate performance comparisons on different numbers of expert samples in accordance with one aspect of the present disclosure.
  • FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.
  • Reinforcement learning mainly comprises agents, environments, states, actions, and rewards.
  • the agent 110 may perform an action a_t on the environment 120 based on a policy. Then, the environment 120 may transition to a new state s_{t+1}, and a reward r is given to the agent 110 as feedback from the environment 120. Subsequently, the agent 110 may perform new actions according to the rewards and the new states of the environment 120, and this cycle repeats.
  • the objective of the policy of the agent 110 is to maximize the accumulated rewards. This process shows how the agent and the environment interact through states, actions, and rewards.
  • agents can determine what actions they should take when they are in different states of the environment to get the maximum reward.
  • the agent may be a robot or particularly a brain of the robot, and the environment may be the environment of the place where the robot works.
  • the environment may have various states, such as, obstacles, slopes, hollows, etc.
  • the robot may take different actions in different states based on the learned policy in its brain, for example, in order to keep walking along the road. If the robot takes correct actions and bypasses the obstacles, it may get higher total rewards.
  • Reinforcement learning is usually expressed as a Markov decision process (MDP) denoted as a tuple (S, A, P, d_0, R, γ), where S is the state space, A is the action space, P(s_{t+1} | s_t, a) stands for the environment's state transition probability, s_t and s_{t+1} belong to the state space, a belongs to the action space, the policy of the agent may be based on such a probability, d_0(s_0) denotes a distribution of the initial state s_0, R(s_t, a, s_{t+1}) defines a reward function, and γ ∈ (0, 1] is a discount factor.
  • MDP Markov decision process
  • offline RL aims to optimize the policy using only an offline dataset D = {(s_k, a_k, s'_k, r_k)}_{k=1}^N, where a_k is the action taken in state s_k of the environment, s'_k is the transitioned state due to a_k, and r_k is the corresponding reward.
  • This characteristic of offline RL brings convenience to applications in many fields where online interactions are expensive or dangerous, such as, robot control, autonomous driving, and health care.
  • the goal of most of them comprises two conflicting objectives, either explicitly or implicitly: (1) policy improvement, which is aimed to optimize the policy according to current value functions; and (2) policy constraint, which keeps the policy around the behavior policy or offline dataset's distribution.
  • Offline RL has to make a trade-off between these two objectives: if concentrating on the policy improvement term too much, the policy probably steps into an unfamiliar area and generates bad actions due to distributional shift; otherwise, focusing excessively on the policy constraint term might lead to the policy only imitating behaviors in the offline dataset and possibly lacking generalization ability towards out-of-distribution data.
  • the prior offline RL solutions may include: TD3+BC (Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021); its variant SAC+BC, which applies SAC (Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018) to the TD3+BC framework; CQL (Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020); and IQL (Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021).
  • Q(s, a) is a state-action value function estimating the expected sum of discounted rewards after taking action a at state s.
  • L_pi(·) and L_pc(·) stand for the policy improvement and policy constraint terms, respectively.
  • F(·) is a trade-off function between L_pi(·) and L_pc(·).
  • An ideal improvement-constraint balance for offline RL is to concentrate more on policy constraint for the samples resembling expert behaviors, but stress more on policy improvement for the data similar to random behaviors. Furthermore, it is proved by many online RL methods that expert demonstrations, even in a small quantity, will be beneficial to the policy performance, but current offline RL methods do not take full advantage of the expert data.
  • an offline RL method which may determine an adaptive trade-off between policy improvement and policy constraint for different samples with the guidance of only a few expert data is provided.
  • FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning (GORL) in accordance with one aspect of the present disclosure.
  • An offline dataset 210 may contain an enormous amount of data, and may be used for training an RL network 220.
  • the RL network 220 may provide a policy for an agent.
  • a guiding dataset 230 may consist of a few expert data sampled from expert demonstrations or behaviors, and may be used for training a guiding network 240.
  • the guiding network 240 may be used to guide the policy’s training of the RL network 220.
  • the GORL method may alternate between updating the guiding network on the guiding dataset in a MAML (Model-Agnostic Meta-Learning) -like way and training the RL agent on the offline dataset with the guidance of the updated guiding network.
  • MAML Model-Agnostic Meta-Learning
  • the GORL method is a plug-in approach, and may evaluate the relative importance of policy improvement and policy constraint for each datum adaptively and end-to-end.
  • the GORL method points out a theoretically guaranteed optimization direction for the agent, and may be easy to implement on most of offline RL solutions.
  • the GORL method may achieve significant performance improvement on a number of state-of-the-art offline RL solutions with the guidance of only a few expert data.
  • the offline dataset may be denoted as D = {(s_k, a_k, s'_k, r_k)}_{k=1}^N and the guiding dataset may be denoted as D^(e) = {(s^(e)_j, a^(e)_j, s'^(e)_j, r^(e)_j)}_{j=1}^M, where M << N
  • D is a large offline dataset containing sub-optimal or even random policies' trajectories, and D^(e) is a guiding dataset with a small quantity of optimal data, such as data collected by expert or nearly expert policies.
  • a training objective of the GORL method may be formulated as:
  • L_pc(·) stands for a policy constraint term, e.g., (π_θ(s_k) - a_k)^2 in TD3+BC.
  • the guiding network G_w with parameters w takes a policy constraint term L_pc as input, and outputs a constraint degree.
  • constraint degrees generated by the guiding network may vary with different state-action pairs (s_k, a_k).
  • the guiding network may take the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective.
  • the constraint degree varies for different samples in the offline dataset and the guiding dataset.
  • the guiding network may output a higher relative importance of the policy constraint objective as compared to the policy improvement objective for samples corresponding (i.e., similar) to expert behaviors in the offline dataset, and may output a higher relative importance of the policy improvement objective as compared to the policy constraint objective for samples corresponding (i.e., similar) to random behaviors in the offline dataset.
  • FIG. 3 illustrates a flow chart of a method 300 for training a guided offline reinforcement learning network in accordance with one aspect of the present disclosure.
  • the method 300 may be implemented by a computer.
  • the computer may be any computing devices, such as, cloud server, distributed computing entities, etc.
  • the method 300 may comprise obtaining an offline reinforcement learning network.
  • the offline reinforcement learning network may provide a policy for an agent to take an action at a state of an environment.
  • the offline reinforcement learning network is used for robot control, autonomous driving, or health care.
  • an agent such as, the robot or the brain of the robot
  • may take corresponding actions such as, walking at different states of the environment (such as, road conditions) based on the policy in the brain of the robot provided by a learned RL network.
  • the central control unit of a car may drive the car, such as, turn left or brake (i.e., actions), at different traffic conditions (i.e., environment) based on a policy learned through RL.
  • an agent such as, an automated robotic arm for surgery
  • can make decisions about such as the type of treatment, drug dose, or review time (i.e., actions) at a certain point in time according to current health status and previous treatment history of a patient (i.e., environment) based on a policy learned through RL.
  • the offline reinforcement learning network may be obtained by initializing an offline reinforcement learning network with random policy parameters.
  • an offline reinforcement learning network trained based on the prior or even the present offline RL methods may be obtained and used as the initial offline reinforcement learning network for further optimization.
  • the offline reinforcement learning network may be trained based on TD3+BC, SAC+BC, IQL, CQL, etc.
  • method 300 may also comprise any other well-known steps for training an offline RL network.
  • method 300 may also comprise updating the state-action value function (such as, the Q function in equations 1 or 2) of the offline RL network with offline dataset.
  • method 300 may comprise generating a guiding network on a guiding dataset.
  • the guiding network may evaluate a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network.
  • the generating a guiding network in block 320 may comprise initializing a guiding network with random guiding parameters, or updating the guiding parameters of the guiding network on the guiding dataset based on updated policy parameters.
  • method 300 may comprise updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  • the updating step may comprise updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance evaluated with the updated guiding parameters.
  • steps in block 320 and block 330, as well as one or more other steps of method 300, may be performed alternately and repeatedly for a number of iterations to better balance the policy improvement and policy constraint terms and thus optimize the policy of the offline RL network.
  • n_θ and α_θ are the mini-batch size and learning rate, respectively, for the policy π_θ.
  • method 300 may comprise updating the guiding parameters w of the guiding network G_w on the guiding dataset D^(e), based on the updated policy parameters θ^(t+1)(w) (Equation 4).
  • n_w is the mini-batch size and α_w is the step size for the guiding network G_w.
  • the policy's parameters θ^(t+1)(w) are the updated parameters from Equation 3, which depend on w.
  • method 300 may comprise updating the policy π_θ of the offline reinforcement learning network on an offline dataset by moving the policy's parameters θ toward the direction of maximizing the policy objective in Equation 2 (Equation 5).
  • in Equation 5, the guiding network G_w controls the relative update steps of the policy improvement and policy constraint gradients for each data pair (s_k, a_k) in the mini-batches.
  • the present disclosed GORL plug-in approach may be applied to various prior or future offline RL methods, including TD3+BC, SAC+BC, IQL, CQL, etc.
  • offline RL methods including TD3+BC, SAC+BC, IQL, CQL, etc.
  • a constraint term may be explicit in some methods (e.g., TD3+BC and SAC+BC) , while much more implicit in other methods (e.g., CQL and IQL) .
  • to implement GORL on TD3+BC, the procedures in Algorithm 1 may be followed with L_pc(a_k, π_θ) substituted with (π_θ(s_k) - a_k)^2.
  • the pseudo-code of TD3+BC with GORL is presented in Algorithm 2 as below.
  • Table 1 shows the hyperparameters of TD3+BC with GORL on Gym locomotion and Adroit robotic manipulation tasks in the D4RL benchmark dataset.
  • the Table 2 below shows the hyperparameters of SAC+BC with GORL on locomotion/adroit dataset.
  • the policy update objective in IQL may be:
  • Equation 7 may be reformulated into:
  • Equation 8 assigns a different scalar β_k for each data pair (s_k, a_k).
  • β_k is the exponent of exp(Q(s_k, a_k) - V(s_k)).
  • Equation 8 may be further changed into Equation 9.
  • GORL may be implemented on IQL based on Algorithm 1, changed with the new objective (Equation 9); the guiding network G_w is used to generate β_k.
  • the pseudo-code of IQL with GORL is presented in Algorithm 4 as below.
  • Table 3 below shows the hyperparameters of IQL with GORL on the locomotion/adroit dataset.
  • the policy update objective in CQL may be:
  • Q (s, a) is a conservative approximation of the state-action value.
  • the policy constraint objective is implicitly contained in the conservative Q-learning: the more conservative the Q-value, the stronger the policy constraint.
  • GORL can be implemented following Algorithm 1 with the new policy update objective below:
  • the Table 4 below shows the hyperparameters of CQL with GORL on locomotion/adroit dataset.
  • FIG. 4 illustrates a block diagram of an apparatus 400 in accordance with one aspect of the present disclosure.
  • the apparatus 400 may comprise a memory 410 and at least one processor 420.
  • the apparatus 400 may be used for training an offline reinforcement learning network.
  • the processor 420 may be coupled to the memory 410 and configured to perform the method 300 described above with reference to FIG. 3.
  • the apparatus 400 may be used for a trained offline reinforcement learning network.
  • the processor 420 may be coupled to the memory 410 and configured to implement an offline reinforcement learning network trained by performing the method 300.
  • the processor 420 may be a general-purpose processor, an artificial intelligence processor, or may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • the memory 410 may store the input data, output data, data generated by processor 420, and/or instructions executed by processor 420.
  • the Guided Offline Reinforcement Learning (GORL) in this disclosure is a general training framework compatible with most offline RL methods.
  • the GORL method may learn a sample-adaptive intensity of policy constraint under the guidance of only a few high-quality data (i.e., expert data) .
  • GORL may exert a weak constraint to “random-like” samples in the offline dataset, and may exert a strong constraint to “expert-like” samples in the offline dataset.
  • each sample may be assigned a different weight and the weights vary through training.
  • when fed with relatively high-quality samples (i.e., samples similar to expert behaviors), the agent may be inclined toward imitation learning; otherwise, when encountering low-quality samples (i.e., samples similar to random behaviors), it may choose to slightly diverge from these samples' distribution.
  • Such adaptive weights seek to achieve the full potential of every sample, leading to higher performance compared with the fixed weight.
  • Equation 4 can be reformulated as:
  • in Equation 12, a larger weight c_k would encourage the guiding network to output a larger constraint degree for the corresponding policy's loss L_{pc,k}
  • in Equation 13, c_k is an inner product between the guiding gradient average and the policy's gradient ∇_θ L_{pc,k}; therefore, Equation 12 would assign larger weights to those samples whose gradients are close to the guiding gradient average.
  • the benefits are two-fold: (1) the policy would align its update directions closer to the guiding gradient average, whose reliability is guaranteed theoretically; (2) the policy could also enjoy the plentiful information about the environment provided by the large amount of data in the offline dataset, which is scarce in D^(e) due to its small data quantity.
  • formally, the guiding gradient obtained on n expert guiding data may be denoted as the average of the n per-sample guiding gradients.
  • the sample index is drawn from the uniform distribution on {1, 2, ..., n}. It can be proved that, as n increases, the guiding gradient on n expert guiding data converges in probability to the optimal guiding gradient.
  • the guiding gradient average in Equation 13 will therefore approximate the optimal gradient, and provides reliable guidance for the offline RL algorithms.
  • FIG. 5A illustrates performance comparisons between the mixed scheme (such as, vanilla) and the guided scheme on different numbers of expert samples.
  • the horizontal axis is the number of expert samples
  • the vertical axis is percent difference between these two schemes
  • the grey bars correspond to the guided scheme (denoted as D^(e)→D)
  • the black bars correspond to the mixed scheme (denoted as D^(e)+D).
  • with a quite small quantity of expert data, e.g., a hundred or several hundred samples (the offline dataset's size is typically 1 million), the guided scheme consistently outperforms the mixed scheme.
  • FIG. 5B shows a result of a policy trained on expert-only dataset (denoted as “D (e) ” ) with different dataset scales.
  • the horizontal axis is the number of expert samples, and the vertical axis is the normalized score. It is obvious that the policy's scores remain quite low until the expert sample number reaches 10^4, which demonstrates that a large amount of training data is necessary for offline RL.
  • a computer program product for training an offline reinforcement learning network may comprise processor executable computer code for performing the method 300 described above with reference to FIG. 3.
  • a computer program product for an offline reinforcement learning network may comprise processor executable computer code when executed by a processor causing the processor to implement the offline reinforcement learning network trained by performing the method 300.
  • a computer readable medium may store computer code for training an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to perform the method 300 described above with reference to FIG. 3.
  • a computer readable medium may store computer code for an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to implement the offline reinforcement learning network trained by performing the method 300.
  • Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Any connection may be properly termed as a computer-readable medium. Other embodiments and implementations are within the scope of the disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)
  • Machine Translation (AREA)

Abstract

A method for training an offline reinforcement learning network is disclosed. The method comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

Description

METHOD AND APPARATUS FOR GUIDED OFFLINE REINFORCEMENT LEARNING FIELD
The present disclosure relates generally to artificial intelligence technical field, and more particularly, to offline reinforcement learning (RL) with expert guidance.
BACKGROUND
Reinforcement learning (RL) is an important area of machine learning, which aims to solve problems of how agents ought to take actions in different states of an environment so as to maximize some kind of cumulative reward. Generally, reinforcement learning comprises online RL and offline RL. Online RL learns a policy by interacting with the environment, while offline RL learns a policy by optimizing it based only on an offline dataset, without any interaction with the environment. Since offline RL learns a policy on a previously collected dataset, it usually suffers from a distributional shift problem, due to the gap between the state-action distributions of the offline dataset and of the current policy's interactions with the test environment. Specifically, after being optimized on the offline dataset, the agent might encounter unvisited states or misestimate state-action values during interactions with the online environment, leading to poor performance.
To mitigate this problem, prior solutions adopt a single trade-off between two conflicting objectives for offline RL, i.e., a policy improvement objective, which aims to optimize the policy according to current value functions, and a policy constraint objective, which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive. Thus, prior solutions either add an explicit policy constraint term to the policy improvement equation, or confine the policy implicitly by revising update rules of value functions. However, these solutions generally concentrate only on the global characteristics of the dataset, but ignore the individual feature of each sample. Typically, they make only a single trade-off for all the data in a mini-batch or even in the whole offline dataset. Such “one-size-fits-all” trade-offs might not be able to achieve a perfect balance for each sample, and thus probably limit the potential of performance.
Therefore, there exists a need to provide an improved solution for offline reinforcement learning.
SUMMARY
The following presents a simplified summary of one or more aspects according to the present disclosure in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In an aspect of the disclosure, a method for offline reinforcement learning is disclosed. The method comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
In another aspect of the disclosure, an apparatus for offline reinforcement learning is disclosed. The apparatus may comprise a memory and at least one processor coupled to the memory. The at least one processor may be configured to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
In another aspect of the disclosure, a computer readable medium storing computer code for offline reinforcement learning is disclosed. The computer code, when executed by a processor, may cause the processor to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative  importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
In another aspect of the disclosure, a computer program product for offline reinforcement learning is disclosed. The computer program product may comprise processor executable computer code for obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
Other aspects or variations of the disclosure will become apparent by consideration of the following detailed description and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The following figures depict various embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the methods and structures disclosed herein may be implemented without departing from the spirit and principles of the disclosure described herein.
FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.
FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
FIG. 3 illustrates a flow chart of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
FIG. 4 illustrates a block diagram of an apparatus for guided offline reinforcement learning in accordance with one aspect of the present disclosure.
FIGs. 5A and 5B illustrate performance comparisons on different numbers of expert samples in accordance with one aspect of the present disclosure.
DETAILED DESCRIPTION
Before any embodiments of the present disclosure are explained in detail, it is to be understood that the disclosure is not limited in its application to the details of construction and the arrangement of features set forth in the following description. The disclosure is capable of other embodiments and of being practiced or of being carried out in various ways.
FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure. Reinforcement learning mainly comprises agents, environments, states, actions, and rewards. As shown in FIG. 1, for the current state s_t of the environment 120, the agent 110 may perform an action a_t on the environment 120 based on a policy. Then, the environment 120 may transition to a new state s_{t+1}, and a reward r is given to the agent 110 as feedback from the environment 120. Subsequently, the agent 110 may perform new actions according to the rewards and the new states of the environment 120, and this cycle repeats. The objective of the policy of the agent 110 is to maximize the accumulated rewards. This process shows how the agent and the environment interact through states, actions, and rewards.
Through reinforcement learning, agents can determine what actions they should take when they are in different states of the environment to get the maximum reward. For example, in robot control applications, the agent may be a robot, or particularly the brain of the robot, and the environment may be the environment of the place where the robot works. The environment may have various states, such as obstacles, slopes, hollows, etc. The robot may take different actions in different states based on the learned policy in its brain, for example, in order to keep walking along the road. If the robot takes correct actions and bypasses the obstacles, it may get higher total rewards.
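To make the interaction loop concrete, the following is a minimal, self-contained Python sketch of the state-action-reward cycle described above, using a toy corridor environment and a hand-written policy. The class and function names, the reward values, and the corridor layout are illustrative assumptions and do not appear in the patent.

```python
class CorridorEnv:
    """Toy environment for illustration only: a robot walks along a corridor with one obstacle cell."""
    def __init__(self, length=5, obstacle=2):
        self.length, self.obstacle, self.state = length, obstacle, 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 1 = step forward, 0 = stay put
        self.state = min(self.length, self.state + action)
        reward = -1.0 if self.state == self.obstacle else 1.0  # hitting the obstacle cell is penalized
        done = self.state == self.length                       # reaching the end terminates the episode
        return self.state, reward, done

def policy(state):
    # a learned policy would choose the action that maximizes the accumulated reward
    return 1

env = CorridorEnv()
s, ret, done = env.reset(), 0.0, False
while not done:                               # the interaction loop: s_t -> a_t -> (r_t, s_{t+1})
    a = policy(s)
    s, r, done = env.step(a)
    ret += r
print("accumulated reward:", ret)
```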
Reinforcement learning is usually expressed as a Markov decision process (MDP) denoted as a tuple $(\mathcal{S}, \mathcal{A}, P, d_0, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, P(s_{t+1} | s_t, a) stands for the environment's state transition probability, s_t and s_{t+1} belong to the state space, a belongs to the action space, the policy of the agent may be based on such a probability, d_0(s_0) denotes a distribution of the initial state s_0, R(s_t, a, s_{t+1}) defines a reward function, and γ ∈ (0, 1] is a discount factor.
Unlike online RL, which learns a policy by interacting with the environment, offline RL aims to optimize the policy using only an offline dataset $\mathcal{D} = \{(s_k, a_k, s'_k, r_k)\}_{k=1}^{N}$, where a_k is the action taken in state s_k of the environment, s'_k is the transitioned state due to a_k, and r_k is the corresponding reward. This characteristic of offline RL brings convenience to applications in many fields where online interactions are expensive or dangerous, such as robot control, autonomous driving, and health care. Although there exist various offline RL solutions with different training losses, the goal of most of them comprises two conflicting objectives, either explicitly or implicitly: (1) policy improvement, which is aimed to optimize the policy according to current value functions; and (2) policy constraint, which keeps the policy around the behavior policy or the offline dataset's distribution. Offline RL has to make a trade-off between these two objectives: if it concentrates too much on the policy improvement term, the policy probably steps into an unfamiliar area and generates bad actions due to distributional shift; conversely, focusing excessively on the policy constraint term might lead to the policy only imitating behaviors in the offline dataset and possibly lacking generalization ability towards out-of-distribution data.
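As an illustration of what "learning from only an offline dataset" means in practice, the following hedged Python sketch shows one plausible way to represent the logged transitions (s_k, a_k, s'_k, r_k) and to draw mini-batches from them without ever querying the environment. The Transition container and its field names are assumptions made for this sketch only.

```python
import random
from typing import List, NamedTuple

class Transition(NamedTuple):
    s: tuple       # state s_k
    a: tuple       # action a_k taken in s_k
    s_next: tuple  # transitioned state s'_k
    r: float       # reward r_k

# A (tiny) offline dataset D = {(s_k, a_k, s'_k, r_k)}, e.g., logged by some behavior policy.
offline_dataset: List[Transition] = [
    Transition(s=(0.1, 0.2), a=(0.5,), s_next=(0.2, 0.1), r=1.0),
    Transition(s=(0.3, 0.0), a=(-0.4,), s_next=(0.1, 0.3), r=0.0),
    Transition(s=(0.0, 0.4), a=(0.1,), s_next=(0.1, 0.4), r=0.5),
]

def sample_minibatch(dataset: List[Transition], n: int) -> List[Transition]:
    """Offline RL never queries the environment; it only resamples previously logged transitions."""
    return random.choices(dataset, k=n)

batch = sample_minibatch(offline_dataset, n=2)
```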
For example, the prior offline RL solutions may include: TD3+BC (Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021); its variant SAC+BC, which applies SAC (Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018) to the TD3+BC framework; CQL (Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020); and IQL (Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021). The policy optimization objectives of these solutions can be unified as:
$$\max_{\theta}\ \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[F\big(L_{pi}(s,\pi_\theta),\ L_{pc}(a,\pi_\theta);\ d_c\big)\Big] \qquad (1)$$
where π_θ is a policy with trainable parameters θ, and Q(s, a) is a state-action value function estimating the expected sum of discounted rewards after taking action a at state s. Furthermore, L_pi(·) and L_pc(·) stand for the policy improvement and policy constraint terms, and F(·) is a trade-off function between L_pi(·) and L_pc(·). The quantity d_c ∈ ℝ⁺ is a constraint degree: a larger d_c would encourage stronger policy constraint, and therefore the policy becomes more conservative; otherwise the policy would stress more on the policy improvement term, and thus tends to be more aggressive.
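The fixed trade-off in Equation 1 can be illustrated with a short sketch. The following Python/PyTorch function is a hedged, TD3+BC-flavoured instance of F(·) in which a single constraint degree d_c is shared by every sample in the batch; the function name, the toy critic, and the default value of d_c are assumptions made here, not values taken from the cited methods.

```python
import torch

def fixed_tradeoff_policy_loss(policy, q_fn, s, a, d_c=1.0):
    """One shared constraint degree d_c for every sample: the 'one-size-fits-all' trade-off."""
    l_pi = q_fn(s, policy(s)).mean()                  # policy improvement: prefer actions with high Q
    l_pc = ((policy(s) - a) ** 2).sum(-1).mean()      # policy constraint: stay close to dataset actions
    return -(l_pi - d_c * l_pc)                       # minimize the negative of F(L_pi, L_pc; d_c)

# toy usage with a linear policy and a stand-in critic
policy = torch.nn.Linear(4, 2)
q_fn = lambda s, a: -(a ** 2).sum(-1)
s, a = torch.randn(32, 4), torch.randn(32, 2)
fixed_tradeoff_policy_loss(policy, q_fn, s, a).backward()
```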
An ideal improvement-constraint balance for offline RL is to concentrate more on policy constraint for the samples resembling expert behaviors, but stress more on policy improvement for the data similar to random behaviors. Furthermore, it has been shown by many online RL methods that expert demonstrations, even in a small quantity, are beneficial to the policy performance, but current offline RL methods do not take full advantage of the expert data.
In the present disclosure, an offline RL method which may determine an adaptive trade-off between policy improvement and policy constraint for different samples with the guidance of only a few expert data is provided.
FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning (GORL) in accordance with one aspect of the present disclosure. An offline dataset 210 may contain an enormous amount of data, and may be used for training an RL network 220. The RL network 220 may provide a policy for an agent. A guiding dataset 230 may consist of a few expert data sampled from expert demonstrations or behaviors, and may be used for training a guiding network 240. The guiding network 240 may be used to guide the training of the policy of the RL network 220. The GORL method may alternate between updating the guiding network on the guiding dataset in a MAML (Model-Agnostic Meta-Learning)-like way and training the RL agent on the offline dataset with the guidance of the updated guiding network.
The GORL method is a plug-in approach, and may evaluate the relative importance of policy improvement and policy constraint for each datum adaptively and end-to-end. The GORL method points out a theoretically guaranteed optimization direction for the agent, and may be easy to implement on most of offline RL solutions. The GORL method may achieve significant performance improvement on a number of state-of-the-art offline RL solutions with the guidance of only a few expert data.
In the GORL method, the offline dataset may be denoted as $\mathcal{D} = \{(s_k, a_k, s'_k, r_k)\}_{k=1}^{N}$ and the guiding dataset may be denoted as $\mathcal{D}^{(e)} = \{(s^{(e)}_j, a^{(e)}_j, s'^{(e)}_j, r^{(e)}_j)\}_{j=1}^{M}$, where M << N. For instance, $\mathcal{D}$ is a large offline dataset containing sub-optimal or even random policies' trajectories, and $\mathcal{D}^{(e)}$ is a guiding dataset with a small quantity of optimal data, such as data collected by expert or nearly expert policies.
For simplicity, similarly to TD3+BC, a training objective of the GORL method may be formulated as:
$$\max_{\theta}\ \mathbb{E}_{k\sim\mathrm{unif}\{1,N\}}\Big[Q\big(s_k,\pi_\theta(s_k)\big)-G_w\big(L_{pc}(a_k,\pi_\theta)\big)\,L_{pc}(a_k,\pi_\theta)\Big] \qquad (2)$$
where unif{1, N} denotes a uniform distribution on {1, 2, …, N}, and L_pc(·) stands for a policy constraint term, e.g., (π_θ(s_k) - a_k)^2 in TD3+BC. The guiding network G_w with parameters w takes a policy constraint term L_pc as input, and outputs a constraint degree. Please note that, unlike the typical framework of existing offline solutions (shown in Equation 1), which assigns a fixed constraint degree (e.g., d_c) for all the data, in the GORL method the constraint degrees generated by the guiding network G_w may vary with different state-action pairs (s_k, a_k). In Equation (2), the guiding network takes the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective. The constraint degree varies for different samples in the offline dataset and the guiding dataset. For example, the guiding network may output a higher relative importance of the policy constraint objective as compared to the policy improvement objective for samples corresponding (i.e., similar) to expert behaviors in the offline dataset, and may output a higher relative importance of the policy improvement objective as compared to the policy constraint objective for samples corresponding (i.e., similar) to random behaviors in the offline dataset.
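The contrast with Equation 1 is that the constraint degree is now produced per sample by the guiding network. The sketch below, under the same toy assumptions as before (PyTorch, a deterministic linear policy, a TD3+BC-style constraint term), shows one plausible reading of Equation 2; the architecture of the guiding network and the decision to detach its input are illustrative choices, not details taken from the filing.

```python
import torch

# Hypothetical guiding network G_w: maps a per-sample constraint term L_pc to a constraint degree.
guiding_net = torch.nn.Sequential(
    torch.nn.Linear(1, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1), torch.nn.Softplus(),      # keep the constraint degree non-negative
)

def gorl_policy_objective(policy, q_fn, guiding_net, s, a):
    """Per-sample trade-off: each (s_k, a_k) gets its own constraint degree G_w(L_pc)."""
    l_pc = ((policy(s) - a) ** 2).sum(-1, keepdim=True)   # TD3+BC-style constraint term, one value per sample
    d_c = guiding_net(l_pc.detach())                      # constraint degree varies across samples
    l_pi = q_fn(s, policy(s)).unsqueeze(-1)               # policy improvement term per sample
    return (l_pi - d_c * l_pc).mean()                     # Equation 2: objective to be maximized

policy = torch.nn.Linear(4, 2)
q_fn = lambda s, a: -(a ** 2).sum(-1)
s, a = torch.randn(32, 4), torch.randn(32, 2)
objective = gorl_policy_objective(policy, q_fn, guiding_net, s, a)
(-objective).backward()                                   # ascend the objective by descending its negative
```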
FIG. 3 illustrates a flow chart of a method 300 for training a guided offline reinforcement learning network in accordance with one aspect of the present disclosure. The method 300 may be implemented by a computer. The computer may be any computing device, such as a cloud server, distributed computing entities, etc. In block 310, the method 300 may comprise obtaining an offline reinforcement learning network. The offline reinforcement learning network may provide a policy for an agent to take an action at a state of an environment. The offline reinforcement learning network may be used for robot control, autonomous driving, or health care. In robot control applications, an agent (such as the robot or the brain of the robot) may take corresponding actions (such as walking) at different states of the environment (such as road conditions) based on the policy in the brain of the robot provided by a learned RL network. Similarly, in autonomous driving applications, the central control unit of a car (i.e., an agent) may drive the car, e.g., turn left or brake (i.e., actions), at different traffic conditions (i.e., environment) based on a policy learned through RL. In health care applications, an agent (such as an automated robotic arm for surgery) can make decisions such as the type of treatment, drug dose, or review time (i.e., actions) at a certain point in time according to the current health status and previous treatment history of a patient (i.e., environment) based on a policy learned through RL.
In one embodiment, in block 310, the offline reinforcement learning network  may be obtained by initializing an offline reinforcement learning network with random policy parameters. In another embodiment, an offline reinforcement learning network trained based on the prior or even the present offline RL methods may be obtained and used as the initial offline reinforcement learning network for further optimization. For example, the offline reinforcement learning network may be trained based on TD3+BC, SAC+BC, IQL, CQL, etc. Although not shown in FIG. 3, method 300 may also comprise any other well-known steps for training an offline RL network. For example, method 300 may also comprise updating the state-action value function (such as, the Q function in equations 1 or 2) of the offline RL network with offline dataset.
In block 320, method 300 may comprise generating a guiding network on a guiding dataset. The guiding network may evaluate a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network. The generating a guiding network in block 320 may comprise initializing a guiding network with random guiding parameters, or updating the guiding parameters of the guiding network on the guiding dataset based on updated policy parameters. In block 330, method 300 may comprise updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance. The updating step may comprise updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance evaluated with the updated guiding parameters.
The steps in block 320 and block 330, as well as one or more other steps of method 300 (if needed), may be performed alternately and repeatedly for a number of iterations to better balance the policy improvement and policy constraint terms and thus optimize the policy of the offline RL network.
In one embodiment, during the initial iteration, in block 320, method 300 may comprise initializing a guiding network G_w with random guiding parameters w^(t) (t = 1), and in block 330, method 300 may comprise updating the policy parameters θ with a gradient descent step on the offline dataset $\mathcal{D}$ as follows, based on the initial G_{w^(t)}:
$$\theta^{(t+1)}(w)=\theta^{(t)}-\alpha_\theta\,\frac{1}{n_\theta}\sum_{k=1}^{n_\theta}\nabla_\theta\Big[G_{w^{(t)}}\big(L_{pc}(a_k,\pi_\theta)\big)\,L_{pc}(a_k,\pi_\theta)-Q\big(s_k,\pi_\theta(s_k)\big)\Big]\Big|_{\theta=\theta^{(t)}} \qquad (3)$$
where n_θ and α_θ are the mini-batch size and learning rate, respectively, for the policy π_θ.
Then, in the subsequent iterations, in block 320, method 300 may comprise updating the guiding parameters w of the guiding network G_w on the guiding dataset $\mathcal{D}^{(e)}$, i.e., $\{(s^{(e)}_j, a^{(e)}_j, s'^{(e)}_j, r^{(e)}_j)\}_{j=1}^{M}$, based on the updated θ^(t+1)(w), as follows:
$$w^{(t+1)}=w^{(t)}-\alpha_w\,\frac{1}{n_w}\sum_{j=1}^{n_w}\nabla_w\,L_{pc}\big(a^{(e)}_j,\pi_{\theta^{(t+1)}(w)}\big)\Big|_{w=w^{(t)}} \qquad (4)$$
where n_w is the mini-batch size and α_w is the step size for the guiding network G_w. In particular, the policy's parameters θ^(t+1)(w) are the updated parameters from Equation 3, which depend on w.
In block 330, method 300 may comprise updating the policy π_θ of the offline reinforcement learning network on the offline dataset by moving the policy's parameters θ toward the direction of maximizing the policy objective in Equation 2, as follows:
$$\theta^{(t+1)}=\theta^{(t)}+\alpha_\theta\,\frac{1}{n_\theta}\sum_{k=1}^{n_\theta}\nabla_\theta\Big[Q\big(s_k,\pi_\theta(s_k)\big)-G_{w^{(t+1)}}\big(L_{pc}(a_k,\pi_\theta)\big)\,L_{pc}(a_k,\pi_\theta)\Big]\Big|_{\theta=\theta^{(t)}} \qquad (5)$$
where the w^(t+1) in Equation 5 may be different from the w^(t) in Equation 3. It can be clearly observed in Equation 5 that the guiding network G_{w^(t+1)} controls the relative update steps of the policy improvement and policy constraint gradients for each data pair (s_k, a_k) in the mini-batches.
The pseudo-code of the disclosed plug-in framework, i.e., guided offline reinforcement learning (GORL) , is presented in Algorithm 1 below.
[Algorithm 1: pseudo-code of the GORL framework — presented as an image in the original publication.]
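Since Algorithm 1 itself is only available as an image, the following self-contained PyTorch sketch illustrates one iteration of the alternation that Equations 3-5 describe: a differentiable "virtual" policy step, a MAML-like guiding-network update on the expert mini-batch, and the actual guided policy step. The linear policy, the stand-in critic, the learning rates, and the network sizes are all toy assumptions; this is a sketch, not the patent's reference implementation.

```python
import torch

torch.manual_seed(0)
state_dim, act_dim = 4, 2
alpha_theta, alpha_w = 1e-2, 1e-3

policy = torch.nn.Linear(state_dim, act_dim)            # toy deterministic policy pi_theta
guiding_net = torch.nn.Sequential(                      # G_w: per-sample L_pc -> constraint degree
    torch.nn.Linear(1, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1), torch.nn.Softplus())

def q_fn(s, a):                                         # stand-in critic Q(s, a)
    return -(a ** 2).sum(-1)

def batch_objective(params, s, a):
    """Mean of Q(s, pi(s)) - G_w(L_pc) * L_pc over a mini-batch, with explicit policy parameters."""
    weight, bias = params
    pred = s @ weight.t() + bias
    l_pc = ((pred - a) ** 2).sum(-1, keepdim=True)
    d_c = guiding_net(l_pc.detach())
    return (q_fn(s, pred).unsqueeze(-1) - d_c * l_pc).mean()

# one mini-batch from the offline dataset D and one from the guiding dataset D^(e)
s_off, a_off = torch.randn(64, state_dim), torch.randn(64, act_dim)
s_exp, a_exp = torch.randn(8, state_dim), torch.randn(8, act_dim)
theta = list(policy.parameters())

# Equation 3: virtual policy update theta'(w), kept differentiable with respect to w
g_theta = torch.autograd.grad(batch_objective(theta, s_off, a_off), theta, create_graph=True)
theta_prime = [p + alpha_theta * g for p, g in zip(theta, g_theta)]

# Equation 4: update w by descending the constraint loss of pi_{theta'(w)} on the expert guiding data
w_p, b_p = theta_prime
guiding_loss = ((s_exp @ w_p.t() + b_p - a_exp) ** 2).sum(-1).mean()
g_w = torch.autograd.grad(guiding_loss, list(guiding_net.parameters()))
with torch.no_grad():
    for p, g in zip(guiding_net.parameters(), g_w):
        p -= alpha_w * g

# Equation 5: the actual policy update, now guided by the freshly updated G_w
g_theta = torch.autograd.grad(batch_objective(theta, s_off, a_off), theta)
with torch.no_grad():
    for p, g in zip(theta, g_theta):
        p += alpha_theta * g                            # move toward maximizing Equation 2
```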
The presently disclosed GORL plug-in approach may be applied to various prior or future offline RL methods, including TD3+BC, SAC+BC, IQL, CQL, etc. To implement GORL on an offline RL method, one of the most important tasks is to identify the corresponding policy constraint term. Such a constraint term may be explicit in some methods (e.g., TD3+BC and SAC+BC), while much more implicit in other methods (e.g., CQL and IQL).
In one embodiment of implementing GORL on TD3+BC, the procedures in Algorithm 1 may be followed with L_pc(a_k, π_θ) substituted with (π_θ(s_k) - a_k)^2. The pseudo-code of TD3+BC with GORL is presented in Algorithm 2 as below.
[Algorithm 2: pseudo-code of TD3+BC with GORL — presented as an image in the original publication.]
Table 1 below shows the hyperparameters of TD3+BC with GORL on the Gym locomotion and Adroit robotic manipulation tasks in the D4RL benchmark dataset.
[Table 1: hyperparameters of TD3+BC with GORL — presented as an image in the original publication.]
In another embodiment, GORL is implemented on SAC+BC, a natural extension of TD3+BC that replaces TD3 with SAC; the policy optimization objective is as below:
$$\max_{\theta}\ \mathbb{E}_{k\sim\mathrm{unif}\{1,N\}}\Big[Q\big(s_k,\tilde a_k\big)-\alpha\,\log\pi_\theta\big(\tilde a_k\mid s_k\big)-G_w\big(L_{pc}(a_k,\pi_\theta)\big)\,L_{pc}(a_k,\pi_\theta)\Big] \qquad (6)$$
where α is a constant, $\tilde a_k$ is an action sampled from π_θ(·∣s_k) by the reparameterization trick, and π_θ($\tilde a_k$∣s_k) denotes the probability of π_θ choosing $\tilde a_k$ at state s_k. The Q-function optimization of SAC+BC is the same as in SAC. By adding the entropy-maximization term to Equations 3, 4 and 5, GORL can be applied to SAC+BC with the procedures in Algorithm 1. The pseudo-code of SAC+BC with GORL is presented in Algorithm 3 as below.
[Algorithm 3: pseudo-code of SAC+BC with GORL — presented as an image in the original publication.]
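For SAC+BC, the main change relative to the TD3+BC sketch is the reparameterized action sample and the entropy term of Equation 6. The function below is a hedged per-sample objective under the assumption of a Gaussian policy head; the entropy coefficient, the clamping bounds, and the choice to apply the BC term to the policy mean are illustrative and not taken from the filing.

```python
import torch

def sac_bc_gorl_objective(policy, q_fn, guiding_net, s, a, ent_coef=0.2):
    """Per-sample objective in the spirit of Equation 6, assuming a Gaussian policy head
    that outputs mean and log-std; names and the entropy coefficient are illustrative."""
    mean, log_std = policy(s).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
    a_tilde = dist.rsample()                                  # reparameterized action from pi_theta(.|s_k)
    log_prob = dist.log_prob(a_tilde).sum(-1, keepdim=True)   # log pi_theta(a_tilde | s_k)
    l_pc = ((mean - a) ** 2).sum(-1, keepdim=True)            # BC-style constraint on the dataset action
    d_c = guiding_net(l_pc.detach())                          # sample-adaptive constraint degree G_w(L_pc)
    l_pi = q_fn(s, a_tilde).unsqueeze(-1) - ent_coef * log_prob
    return (l_pi - d_c * l_pc).mean()                         # objective to be maximized

policy = torch.nn.Linear(4, 2 * 2)                            # outputs mean and log-std for a 2-D action
q_fn = lambda s, a: -(a ** 2).sum(-1)
guiding_net = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(),
                                  torch.nn.Linear(16, 1), torch.nn.Softplus())
s, a = torch.randn(32, 4), torch.randn(32, 2)
objective = sac_bc_gorl_objective(policy, q_fn, guiding_net, s, a)
```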
Table 2 below shows the hyperparameters of SAC+BC with GORL on the locomotion/adroit dataset.
[Table 2: hyperparameters of SAC+BC with GORL — presented as an image in the original publication.]
In another embodiment of implementing GORL on IQL, the policy update objective in IQL may be:
$$\max_{\theta}\ \mathbb{E}_{(s,a_k)\sim\mathcal{D}}\Big[\exp\big(\beta\,(Q(s,a_k)-V(s))\big)\,\log\pi_\theta(a_k\mid s)\Big] \qquad (7)$$
where V(s) is an approximator of the state value at state s. The reason behind Equation 7 is that, if some action a_k is in advantage, i.e., Q(s, a_k) - V(s) > 0, the term exp(β(Q(s, a_k) - V(s))) will be larger than its expectation over the actions at state s. Therefore, after updating with Equation 7, π_θ is more likely to choose a_k rather than other actions. It can be seen that the scalar β, together with Q(s, a) - V(s), controls to what extent π_θ accepts action a at state s. Based on the analysis above, Equation 7 may be reformulated into:
$$\max_{\theta}\ \mathbb{E}_{k\sim\mathrm{unif}\{1,N\}}\Big[\exp\big(\beta_k\,(Q(s_k,a_k)-V(s_k))\big)\,\log\pi_\theta(a_k\mid s_k)\Big] \qquad (8)$$
Compared with Equation 7, Equation 8 assigns a different scalar β_k for each data pair (s_k, a_k). However, note that exp(β_k (Q(s_k, a_k) - V(s_k))) = (exp(Q(s_k, a_k) - V(s_k)))^{β_k}. It is difficult to find an optimal β_k end-to-end because β_k is the exponent of exp(Q(s_k, a_k) - V(s_k)).
To make the optimization of β_k possible, Equation 8 may be further changed into:
$$\max_{\theta}\ \mathbb{E}_{k\sim\mathrm{unif}\{1,N\}}\Big[\beta_k\,\exp\big(Q(s_k,a_k)-V(s_k)\big)\,\log\pi_\theta(a_k\mid s_k)\Big] \qquad (9)$$
where β_k is multiplied by exp(Q(s_k, a_k) - V(s_k)), which is much easier to optimize.
GORL may be implemented on IQL based on Algorithm 1, changed with the new objective (Equation 9). The guiding network G_w is used to generate β_k by taking the corresponding per-sample quantity (presented as an image in the original publication) as input. The pseudo-code of IQL with GORL is presented in Algorithm 4 as below.
[Algorithm 4: pseudo-code of IQL with GORL — presented as an image in the original publication.]
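For IQL, the per-sample scalar β_k of Equation 9 can be produced by a small network in place of the single global β. The sketch below is one plausible arrangement; because the exact input of the guiding network for IQL is not reproduced here, the per-sample advantage Q(s_k, a_k) − V(s_k) is fed to it purely as an assumption, and the clipping constant is likewise illustrative.

```python
import torch

def policy_log_prob(policy, s, a):
    """log pi_theta(a | s) for a Gaussian policy head (illustrative parameterization)."""
    mean, log_std = policy(s).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
    return dist.log_prob(a).sum(-1, keepdim=True)

def iql_gorl_policy_loss(policy, q_fn, v_fn, beta_net, s, a):
    """AWR-style loss with a per-sample scalar beta_k as in Equation 9; beta_net stands in for
    the guiding network, and feeding it the advantage Q - V is an assumption of this sketch."""
    with torch.no_grad():
        adv = (q_fn(s, a) - v_fn(s)).unsqueeze(-1)        # Q(s_k, a_k) - V(s_k)
        weight = adv.exp().clamp(max=100.0)               # exp(Q - V), clipped for numerical stability
    beta_k = beta_net(adv)                                # per-sample beta_k instead of one global beta
    return -(beta_k * weight * policy_log_prob(policy, s, a)).mean()   # minimize the negative of Eq. 9

policy = torch.nn.Linear(3, 2 * 1)                        # mean and log-std of a 1-D action
q_fn = lambda s, a: -(a ** 2).sum(-1)
v_fn = lambda s: torch.zeros(s.shape[0])
beta_net = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.ReLU(),
                               torch.nn.Linear(8, 1), torch.nn.Softplus())
s, a = torch.randn(32, 3), torch.randn(32, 1)
iql_gorl_policy_loss(policy, q_fn, v_fn, beta_net, s, a).backward()
```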
Table 3 below shows the hyperparameters of IQL with GORL on the locomotion/adroit dataset.
[Table 3: hyperparameters of IQL with GORL — presented as an image in the original publication.]
In another embodiment of implementing GORL on CQL, the policy update objective in CQL may be:
Figure PCTCN2022117516-appb-000057
where Q (s, a) is a conservative approximation of the state-action value. The policy constraint objective is implicitly contained during the conservative Q-learning. The more conservative Q-value represents the stronger policy constraint. In this case, GORL can be implemented following Algorithm 1 with the new policy update objective below:
[Equation: new policy update objective for CQL with GORL]
The pseudo-code of CQL with GORL is presented in Algorithm 5 as below.
[Algorithm 5: pseudo-code of CQL with GORL]
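To make the implicit constraint concrete, the sketch below shows a simplified version of the conservative penalty commonly used to train the Q-function in CQL, which pushes Q-values down on actions sampled from the current policy and up on dataset actions. This is an illustration of conservative Q-learning in general, not the exact objective of the present disclosure; the sampling scheme and the coefficient cql_alpha are assumptions for the sketch.

```python
import torch

def cql_conservative_penalty(q_net, policy, states, actions,
                             num_samples=10, cql_alpha=5.0):
    """Simplified conservative penalty: lower Q on policy actions, raise Q on data actions.

    policy.sample(states) -> (sampled_action, log_prob) for the current policy.
    """
    q_policy = []
    for _ in range(num_samples):
        a_pi, _ = policy.sample(states)                     # actions from the current policy
        q_policy.append(q_net(states, a_pi))                # Q(s, a_pi), shape (B, 1)
    q_policy = torch.stack(q_policy, dim=0)                 # (num_samples, B, 1)
    # log-sum-exp over sampled actions approximates a soft maximum of Q at each state
    q_ood = torch.logsumexp(q_policy, dim=0)                # (B, 1)
    q_data = q_net(states, actions)                         # Q on dataset actions
    # A larger penalty weight yields a more conservative Q and a stronger implicit constraint
    return cql_alpha * (q_ood - q_data).mean()
```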
The Table 4 below shows the hyperparameters of CQL with GORL on the locomotion/Adroit datasets.
[Table 4: hyperparameters of CQL with GORL on the locomotion/Adroit datasets]
FIG. 4 illustrates a block diagram of an apparatus 400 in accordance with one  aspect of the present disclosure. The apparatus 400 may comprise a memory 410 and at least one processor 420. In one embodiment, the apparatus 400 may be used for training an offline reinforcement learning network. The processor 420 may be coupled to the memory 410 and configured to perform the method 300 described above with reference to FIG. 3. In another embodiment, the apparatus 400 may be used for a trained offline reinforcement learning network. The processor 420 may be coupled to the memory 410 and configured to implement an offline reinforcement learning network trained by performing the method 300. The processor 420 may be a general-purpose processor, an artificial intelligence processor, or may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. The memory 410 may store the input data, output data, data generated by processor 420, and/or instructions executed by processor 420.
The Guided Offline Reinforcement Learning (GORL) in this disclosure is a general training framework compatible with most offline RL methods. The GORL method may learn a sample-adaptive intensity of policy constraint under the guidance of only a few high-quality samples (i.e., expert data). Specifically, GORL may exert a weak constraint on “random-like” samples in the offline dataset and a strong constraint on “expert-like” samples in the offline dataset. During the guided learning, each sample may be assigned a different weight, and the weights vary through training. When fed with relatively high-quality samples (i.e., samples similar to expert behaviors), the agent may be inclined to imitation learning; otherwise, when encountering low-quality samples (i.e., samples similar to random behaviors), it may choose to slightly diverge from these samples’ distribution. Such adaptive weights seek to realize the full potential of every sample, leading to higher performance than a fixed weight.
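As one concrete, purely illustrative realization of such a guiding network, the sketch below uses a small multilayer perceptron that maps a per-sample policy-constraint loss to a non-negative constraint degree. The architecture, the hidden size, and the Softplus output are assumptions made for the example; the disclosure does not prescribe a specific network.

```python
import torch
import torch.nn as nn

class GuidingNetwork(nn.Module):
    """Maps a per-sample policy-constraint loss to a non-negative, sample-adaptive weight."""

    def __init__(self, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),   # keep the constraint degree non-negative
        )

    def forward(self, pc_loss):
        # pc_loss: (B, 1) per-sample constraint losses; returns (B, 1) constraint degrees
        return self.net(pc_loss)
```

A module of this kind could serve as the guide component assumed in the sketches above.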
The rationality of GORL’s update mechanism and the near-optimality of the guidance from GORL will be described below.
By the chain rule, Equation 4 can be reformulated as:

[Equation 12: chain-rule reformulation of Equation 4]

where the per-sample coefficient appearing in Equation 12 is defined as:

[Equation 13: the inner product between the guiding gradient average and the policy's per-sample gradient]

Here, the policy's loss is evaluated on samples from the offline dataset D, and the guiding loss is evaluated on samples from the guiding dataset D(e). It can be observed that, in Equation 12, a larger value of the coefficient defined in Equation 13 encourages the guiding network to output a larger constraint degree for the corresponding policy's loss. Further note that, in Equation 13, this coefficient is an inner product between the guiding gradient average and the policy's per-sample gradient. Therefore, the update would assign larger weights to those samples whose gradients are close to the guiding gradient average. The benefits are two-fold: (1) the policy aligns its update directions with the guiding gradient average, whose reliability is guaranteed theoretically; and (2) the policy can still exploit the abundant information about the environment provided by the large amount of data in the offline dataset D, which is scarce in the guiding dataset D(e) due to its small data quantity.
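The following self-contained NumPy sketch illustrates this inner-product effect on synthetic data: with a linear policy and the squared constraint loss, offline samples whose actions resemble the expert's tend to receive a larger inner product with the guiding gradient average than random-like samples. The linear policy, the synthetic data, and all dimensions are invented for the illustration and are not taken from the present disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim = 4, 2
theta = 0.1 * rng.normal(size=(act_dim, state_dim))          # current linear policy pi(s) = theta @ s

def pc_grad(theta, s, a):
    """Gradient of the constraint loss ||pi_theta(s) - a||^2 with respect to theta."""
    return 2.0 * np.outer(theta @ s - a, s)                   # shape (act_dim, state_dim)

# Synthetic "expert" behavior and data: a few guiding samples, a large offline batch
expert = rng.normal(size=(act_dim, state_dim))
guide_s = rng.normal(size=(8, state_dim))
guide_a = guide_s @ expert.T                                  # expert actions on guiding states
offline_s = rng.normal(size=(256, state_dim))
expert_like = rng.random(256) < 0.5                           # half expert-like, half random-like
offline_a = np.where(expert_like[:, None],
                     offline_s @ expert.T,                    # expert-like actions
                     rng.normal(size=(256, act_dim)))         # random-like actions

# Guiding gradient average over the small expert guiding set
g_bar = np.mean([pc_grad(theta, s, a) for s, a in zip(guide_s, guide_a)], axis=0)

# Per-sample inner product between each policy-constraint gradient and the guiding average
scores = np.array([np.sum(pc_grad(theta, s, a) * g_bar)
                   for s, a in zip(offline_s, offline_a)])

print("mean inner product, expert-like samples:", scores[expert_like].mean())
print("mean inner product, random-like samples:", scores[~expert_like].mean())
```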
To demonstrate that the guiding gradient average in Equation 13 is qualified to guide the offline training process, consider the guiding gradient average obtained on n expert guiding samples, i.e., the mean of the per-sample guiding gradients with the sample index drawn from unif{1, n}, the uniform distribution on {1, 2, …, n}. If the number of guiding samples tends to infinity, the guiding gradient average reaches its optimal form, namely the expectation of the guiding gradient under the expert data distribution. It can be proved that, as n increases, the guiding gradient average on n expert guiding samples converges to this optimal guiding gradient in probability, with the approximation error shrinking as n grows. In other words, when the guiding dataset has sufficient expert data, the guiding gradient average in Equation 13 approximates the optimal gradient and therefore provides reliable guidance for the offline RL algorithms.
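One standard way to see a convergence guarantee of this kind, under the additional assumptions that the expert guiding samples are drawn i.i.d. and that the per-sample guiding gradients have finite variance, is a Chebyshev-type bound on the sample mean. The statement below is an illustrative sketch with notation introduced here (the n-sample guiding gradient average, its infinite-sample limit, and the per-sample gradient covariance), not the exact result of the present disclosure.

```latex
% \hat{g}_n : guiding gradient average over n expert samples
% g^{*}     : its infinite-sample (optimal) limit
% \Sigma    : covariance of a single per-sample guiding gradient
\Pr\left( \left\| \hat{g}_n - g^{*} \right\| \ge \epsilon \right)
  \;\le\; \frac{\operatorname{tr}\Sigma}{n\,\epsilon^{2}}
  \qquad\Longrightarrow\qquad
  \hat{g}_n - g^{*} = O_p\!\left(n^{-1/2}\right).
```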
Compared with a vanilla scheme that simply mixes the expert demonstrations into the offline dataset, the guided training better utilizes the limited high-quality data. FIG. 5A illustrates performance comparisons between the mixed (vanilla) scheme and the guided scheme for different numbers of expert samples. In FIG. 5A, the horizontal axis is the number of expert samples, the vertical axis is the percent difference between the two schemes, the grey bars correspond to the guided scheme (denoted as D(e)→D), and the black bars correspond to the mixed scheme (denoted as D(e)+D). As shown in FIG. 5A, a quite small quantity of expert data, e.g., one hundred or several hundred samples (while the offline dataset’s size is typically 1 million), may be sufficient for the guiding dataset D(e) to generate a good enough guiding gradient average in Equation 13. When the amount of expert data is small, the guided scheme consistently outperforms the mixed scheme.
FIG. 5B shows the results of a policy trained on the expert-only dataset (denoted as “D(e)”) at different dataset scales. In FIG. 5B, the horizontal axis is the number of expert samples, and the vertical axis is the normalized score. It is evident that the policy’s scores remain quite low until the expert sample number reaches 10^4, which demonstrates that a large amount of training data is necessary for offline RL.
From FIGs. 5A and 5B, it can be seen that the limited expert data by itself cannot produce a satisfactory agent, due to the insufficiency of training samples, while the GORL method may generate reliable guidance for offline RL with only a few expert samples.
The various operations, modules, and networks described in connection with the disclosure herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. According to an embodiment of the disclosure, a computer program product for training an offline reinforcement learning network may comprise processor executable computer code for performing the method 300 described above with reference to FIG. 3. According to an embodiment of the disclosure, a computer program product for an offline reinforcement learning network may comprise processor executable computer code which, when executed by a processor, causes the processor to implement the offline reinforcement learning network trained by performing the method 300. According to another embodiment of the disclosure, a computer readable medium may store computer code for training an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to perform the method 300 described above with reference to FIG. 3. According to another embodiment of the disclosure, a computer readable medium may store computer code for an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to implement the offline reinforcement learning network trained by performing the method 300. Computer-readable media include both non-transitory computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Any connection may be properly termed a computer-readable medium. Other embodiments and implementations are within the scope of the disclosure.
The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the various embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the various embodiments. Thus, the claims are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

Claims (17)

  1. A computer-implemented method for training an offline reinforcement learning network, comprising:
    obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment;
    generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and
    updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
  2. The computer-implemented method of claim 1, wherein the offline reinforcement learning network is used for robot control, autonomous driving, or health care.
  3. The computer-implemented method of claim 1, wherein the guiding network takes the policy constraint objective as input and outputs a constraint degree for the policy constraint objective, and wherein the constraint degree varies for different samples in the offline dataset and the guiding dataset.
  4. The computer-implemented method of claim 1, wherein the guiding network outputs a higher relative importance of the policy constraint objective as compared to the policy improvement objective for high-quality samples in the offline dataset, and outputs a higher relative importance of the policy improvement objective as compared to the policy constraint objective for low-quality samples in the offline dataset.
  5. The computer-implemented method of claim 1, wherein the guiding dataset includes hundreds of high-quality samples collected from expert behaviors.
  6. The computer-implemented method of claim 1, wherein the obtaining an offline reinforcement learning network comprises:
    initializing an offline reinforcement learning network with random policy parameters.
  7. The computer-implemented method of claim 6, further comprising:
    updating a value function of the offline reinforcement learning network on a mini-batch of offline data sampled from the offline dataset.
  8. The computer-implemented method of claim 7, wherein the generating a guiding network comprises:
    initializing a guiding network with random guiding parameters.
  9. The computer-implemented method of claim 8, wherein the updating policy parameters of the offline reinforcement learning network comprises:
    updating the policy parameters with a gradient descent step on the mini-batch of offline data based on the relative importance output by the guiding network with the random guiding parameters.
  10. The computer-implemented method of claim 9, wherein the generating a guiding network further comprises:
    updating guiding parameters of the guiding network on a mini-batch of guiding data sampled from the guiding dataset based on the updated policy parameters.
  11. The computer-implemented method of claim 10, wherein the updating policy parameters of the offline reinforcement learning network further comprises:
    updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance output by the guiding network with the updated guiding parameters.
  12. An apparatus for training an offline reinforcement learning network, comprising:
    a memory; and
    at least one processor coupled to the memory and configured to perform the computer-implemented method of one of claims 1-11.
  13. A computer readable medium, storing computer code for training an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to perform the computer-implemented method of one of claims 1-11.
  14. A computer program product for training an offline reinforcement learning network, comprising: processor executable computer code for performing the computer-implemented method of one of claims 1-11.
  15. An apparatus for an offline reinforcement learning network, comprising:
    a memory; and
    at least one processor coupled to the memory and configured to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
  16. A computer readable medium, storing computer code for an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
  17. A computer program product for an offline reinforcement learning network, comprising: processor executable computer code for implementing the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
PCT/CN2022/117516 2022-09-07 2022-09-07 Method and apparatus for guided offline reinforcement learning Ceased WO2024050712A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2022/117516 WO2024050712A1 (en) 2022-09-07 2022-09-07 Method and apparatus for guided offline reinforcement learning
CN202280099600.0A CN119895440A (en) 2022-09-07 2022-09-07 Method and apparatus for directed offline reinforcement learning
DE112022007008.0T DE112022007008T5 (en) 2022-09-07 2022-09-07 METHOD AND DEVICE FOR GUIDED OFFLINE REINFORCEMENT LEARNING

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/117516 WO2024050712A1 (en) 2022-09-07 2022-09-07 Method and apparatus for guided offline reinforcement learning

Publications (1)

Publication Number Publication Date
WO2024050712A1 true WO2024050712A1 (en) 2024-03-14

Family

ID=90192675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/117516 Ceased WO2024050712A1 (en) 2022-09-07 2022-09-07 Method and apparatus for guided offline reinforcement learning

Country Status (3)

Country Link
CN (1) CN119895440A (en)
DE (1) DE112022007008T5 (en)
WO (1) WO2024050712A1 (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151562A1 (en) * 2017-06-28 2020-05-14 Deepmind Technologies Limited Training action selection neural networks using apprenticeship
US20190228309A1 (en) * 2018-01-25 2019-07-25 The Research Foundation For The State University Of New York Framework and methods of diverse exploration for fast and safe policy improvement
US20210034970A1 (en) * 2018-02-05 2021-02-04 Deepmind Technologies Limited Distributed training using actor-critic reinforcement learning with off-policy correction factors
US20190354859A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Meta-gradient updates for training return functions for reinforcement learning systems
US20200119556A1 (en) * 2018-10-11 2020-04-16 Di Shi Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency
US20200302323A1 (en) * 2019-03-20 2020-09-24 Sony Corporation Reinforcement learning through a double actor critic algorithm
US20210367424A1 (en) * 2020-05-19 2021-11-25 Ruisheng Diao Multi-Objective Real-time Power Flow Control Method Using Soft Actor-Critic
WO2022023386A1 (en) * 2020-07-28 2022-02-03 Deepmind Technologies Limited Off-line learning for robot control using a reward prediction model
WO2022028926A1 (en) * 2020-08-07 2022-02-10 Telefonaktiebolaget Lm Ericsson (Publ) Offline simulation-to-reality transfer for reinforcement learning
WO2022045425A1 (en) * 2020-08-26 2022-03-03 주식회사 우아한형제들 Inverse reinforcement learning-based delivery means detection apparatus and method
WO2022167079A1 (en) * 2021-02-04 2022-08-11 Huawei Technologies Co., Ltd. An apparatus and method for training a parametric policy

Also Published As

Publication number Publication date
CN119895440A (en) 2025-04-25
DE112022007008T5 (en) 2025-01-30

Similar Documents

Publication Publication Date Title
Ross et al. Reinforcement and imitation learning via interactive no-regret learning
Pecka et al. Safe exploration techniques for reinforcement learning–an overview
CN113780576B (en) Collaborative multi-agent reinforcement learning method based on reward self-adaptive distribution
CN113191500A (en) Decentralized off-line multi-agent reinforcement learning method and execution system
He et al. Rediffuser: Reliable decision-making using a diffuser with confidence estimation
Bowen et al. Finite-time theory for momentum Q-learning
Huang et al. Svqn: Sequential variational soft q-learning networks
WO2024050712A1 (en) Method and apparatus for guided offline reinforcement learning
Sun et al. Deterministic and discriminative imitation (d2-imitation): revisiting adversarial imitation for sample efficiency
Zhang et al. Balancing exploration and exploitation in hierarchical reinforcement learning via latent landmark graphs
Kumar et al. Neural/fuzzy self learning Lyapunov control for non linear systems
Wang et al. Are Expressive Models Truly Necessary for Offline RL?
Xiao et al. Potential-based advice for stochastic policy learning
Liu et al. Judgmentally adjusted Q-values based on Q-ensemble for offline reinforcement learning
Huang et al. Parameter adaptation within co-adaptive learning classifier systems
Li et al. Robust Reinforcement Learning via Progressive Task Sequence.
Valensi et al. Tree search-based policy optimization under stochastic execution delay
Bi et al. A Comparative Study of Deterministic and Stochastic Policies for Q-learning
Cao et al. Hierarchical reinforcement learning for kinematic control tasks with parameterized action spaces
Alaa et al. Curriculum learning for deep reinforcement learning in swarm robotic navigation task
Lu et al. Demonstration Guided Multi-Objective Reinforcement Learning
CN111950691A (en) A Reinforcement Learning Policy Learning Method Based on Latent Action Representation Space
Hlavatý et al. Development of Advanced Control Strategy Based on Soft Actor-Critic Algorithm
CN113065693B (en) A Traffic Flow Prediction Method Based on Radial Basis Neural Network
Ma et al. Cultural algorithm based on particle swarm optimization for function optimization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22957685

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 112022007008

Country of ref document: DE

WWE Wipo information: entry into national phase

Ref document number: 202280099600.0

Country of ref document: CN

WWP Wipo information: published in national office

Ref document number: 202280099600.0

Country of ref document: CN

122 Ep: pct application non-entry in european phase

Ref document number: 22957685

Country of ref document: EP

Kind code of ref document: A1