WO2024050712A1

WO2024050712A1 - Method and apparatus for guided offline reinforcement learning

Info

Publication number: WO2024050712A1
Application number: PCT/CN2022/117516
Authority: WO
Inventors: Gao HUANG; Qisen YANG; Shenzhi WANG; Qihang ZHANG; Wenjie Shi; Haigang ZHOU; Shiji Song; Xiaonan LU
Original assignee: Tsinghua University; Robert Bosch GmbH
Current assignee: Tsinghua University; Robert Bosch GmbH
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2024-03-14
Anticipated expiration: 2025-03-07
Also published as: CN119895440A; DE112022007008T5

Abstract

A method for training an offline reinforcement learning network is disclosed. The method comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

Description

METHOD AND APPARATUS FOR GUIDED OFFLINE REINFORCEMENT LEARNING

FIELD

The present disclosure relates generally to artificial intelligence technical field, and more particularly, to offline reinforcement learning (RL) with expert guidance.

BACKGROUND

Reinforcement learning (RL) is an important area of machine learning, which aims to solve problems of how agents ought to take actions in different states of an environment so as to maximize some kinds of cumulative reward. Generally, reinforcement learning comprises online RL and offline RL. Online RL learns a policy by interacting with the environment, while offline RL learns a policy by optimizing the policy based only an offline dataset without any interaction with the environment. Since the offline RL learns a policy on a previously collected dataset, it usually suffers from a distributional shift problem, due to the gap between state-action distributions of the offline dataset and the current policy’s interactions with the test environment. Specifically, after optimized on the offline dataset, the agent might encounter unvisited states or misestimate state-action values during the interactions with the online environment, leading to a poor performance.

To mitigate this problem, prior solutions adopt a single trade-off between two conflicting objectives for offline RL, i.e., a policy improvement objective, which aims to optimize the policy according to current value functions, and a policy constraint objective, which keeps the policy’s behavior around the offline dataset to avoid the policy being too aggressive. Thus, prior solutions either add an explicit policy constraint term to the policy improvement equation, or confine the policy implicitly by revising update rules of value functions. However, these solutions generally concentrate only on the global characteristics of the dataset, but ignore the individual feature of each sample. Typically, they make only a single trade-off for all the data in a mini-batch or even in the whole offline dataset. Such “one-size-fits-all” trade-offs might not be able to achieve a perfect balance for each sample, and thus probably limit the potential of performance.

Therefore, there exists a need to provide an improved solution for offline reinforcement learning.

SUMMARY

The following presents a simplified summary of one or more aspects according to the present disclosure in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.

In an aspect of the disclosure, a method for offline reinforcement learning is disclosed. The method comprises: obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

In another aspect of the disclosure, an apparatus for offline reinforcement learning is disclosed. The apparatus may comprise a memory and at least one processor coupled to the memory. The at least one processor may be configured to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

In another aspect of the disclosure, a computer readable medium storing computer code for offline reinforcement learning is disclosed. The computer code, when executed by a processor, may cause the processor to obtain an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generate a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and update policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

In another aspect of the disclosure, a computer program product for offline reinforcement learning is disclosed. The computer program product may comprise processor executable computer code for obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment; generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.

Other aspects or variations of the disclosure will become apparent by consideration of the following detailed description and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The following figures depict various embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the methods and structures disclosed herein may be implemented without departing from the spirit and principles of the disclosure described herein.

FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure.

FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.

FIG. 3 illustrates a flow chart of a method for guided offline reinforcement learning in accordance with one aspect of the present disclosure.

FIG. 4 illustrates a block diagram of an apparatus for guided offline reinforcement learning in accordance with one aspect of the present disclosure.

FIGs. 5A and 5B illustrate performance comparisons on different numbers of expert samples in accordance with one aspect of the present disclosure.

DETAILED DESCRIPTION

Before any embodiments of the present disclosure are explained in detail, it is to be understood that the disclosure is not limited in its application to the details of construction and the arrangement of features set forth in the following description. The disclosure is capable of other embodiments and of being practiced or of being carried out in various ways.

FIG. 1 illustrates a basic structure of a reinforcement learning model in accordance with one aspect of the present disclosure. Reinforcement learning mainly comprises agents, environments, states, actions, and rewards. As shown in FIG. 1, for the current state s _t of the environment 120, the agent 110 may perform an action a to the environment 120 based on a policy. Then, the environment 120 may transition to a new state s _t+1, and a reward r is given to the agent 110 as a feedback from the environment 120. Subsequently, the agent 110 may perform new actions according to the rewards and the new states of environment 120 circularly. The objective of the policy of the agent 110 is to maximize the accumulated rewards. This process shows how the agent and the environment interact through states, actions, and rewards.

Through reinforcement learning, agents can determine what actions they should take when they are in different states of the environment to get the maximum reward. For example, in robot control applications, the agent may be a robot or particularly a brain of the robot, and the environment may be the environment of the place where the robot works. The environment may have various states, such as, obstacles, slopes, hollows, etc. The robot may take different actions in different states based on the learned policy in its brain, such as in order to walk on the road. If the robot takes correct actions and bypasses the obstacles, it may get higher total rewards.

Reinforcement learning is usually expressed as a Markov decision process (MDP) denoted as a tuple

where

is the state space,

is the action space, P (s _t+1∣s _t, a) stands for the environment's state transition probability, s _t and s _t+1 belong to the state space, a belongs to the action space, the policy of the agent may be based on such a probability, d ₀ (s ₀) denotes a distribution of the initial state s ₀, R (s _t, a, s _t+1) defines a reward function, and γ∈ (0, 1] is a discount factor.

Unlike online RL which learns a policy by interacting with the environment, offline RL aims to optimize the policy by only an offline dataset

where a _k is the action taken in state s _k of the environment, s′ _k is the transitioned state due to a _k, and r _k is the corresponding reward. This characteristic of offline RL brings convenience to applications in many fields where online interactions are expensive or dangerous, such as, robot control, autonomous driving, and health care. Although there exist various offline RL solutions with different training losses, the goal of most of them comprises two conflicting objectives, either explicitly or implicitly: (1) policy improvement, which is aimed to optimize the policy according to current value functions; and (2) policy constraint, which keeps the policy around the behavior policy or offline dataset's distribution. Offline RL has to make a trade-off between these two objectives: if concentrating on the policy improvement term too much, the policy probably steps into an unfamiliar area and generates bad actions due to distributional shift; otherwise, focusing excessively on the policy constraint term might lead to the policy only imitating behaviors in the offline dataset and possibly lacking generalization ability towards out-of-distribution data.

For example, the prior offline RL solutions may include: TD3+BC (Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021) ; its variant SAC+BC, which applying SAC (Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018) to TD3+BC framework; CQL (Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33: 1179–1191, 2020) ; and IQL (Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv: 2110.06169, 2021) . The policy optimization objectives of these solutions can be unified as:

where

is a policy with trainable parameters θ,

is a state-action value function estimating the expected sum of discounted rewards after taking action a at state s. Furthermore, L _pi (·) and L _pc (·) stand for a policy improvement and policy constraint terms, and F (·) is a trade-off function between L _pi(·) and L _pc (·) .

is a constraint degree: larger d _c would encourage stronger policy constraint, and therefore the policy becomes more conservative; otherwise the policy would stress more on the policy improvement term, and thus tends to be more aggressive.

An ideal improvement-constraint balance for offline RL is to concentrate more on policy constraint for the samples resembling expert behaviors, but stress more on policy improvement for the data similar to random behaviors. Furthermore, it is proved by many online RL methods that expert demonstrations, even in a small quantity, will be beneficial to the policy performance, but current offline RL methods do not take full advantage of the expert data.

In the present disclosure, an offline RL method which may determine an adaptive trade-off between policy improvement and policy constraint for different samples with the guidance of only a few expert data is provided.

FIG. 2 illustrates a block diagram of a method for guided offline reinforcement learning (GORL) in accordance with one aspect of the present disclosure. An offline dataset 210 may contain an enormous amount of data, and may be used for training an RL network 220. The RL network 220 may provide a policy for an agent. A guiding dataset 230 may consist of a few expert data resembled from expert demonstrations or behaviors, and may be used for training a guiding network 240. The guiding network 240 may be used to guide the policy’s training of the RL network 220. The GORL method may alternate between updating the guiding network on the guiding dataset in a MAML (Model-Agnostic Meta-Learning) -like way and training the RL agent on the offline dataset with the guidance of the updated guiding network.

The GORL method is a plug-in approach, and may evaluate the relative importance of policy improvement and policy constraint for each datum adaptively and end-to-end. The GORL method points out a theoretically guaranteed optimization direction for the agent, and may be easy to implement on most of offline RL solutions. The GORL method may achieve significant performance improvement on a number of state-of-the-art offline RL solutions with the guidance of only a few expert data.

In the GORL method, the offline dataset may be denoted as

and the guiding dataset may be denoted as

where M＜＜N , and

For instance,

is a large offline dataset containing sub-optimal or even random policies' trajectories, and

is a guiding dataset with a small quantity of optimal data such as collected by expert or nearly expert policies.

For simplicity, similarly to TD3+BC, a training objective of the GORL method may be formulated as:

where unif {1, N} denotes a uniform distribution on {1, 2, …, N} , and L _pc (·) stands for a policy constraint term, e.g., (π _θ (s _k) -a _k) ² in TD3+BC. The guiding network

with parameters w takes a policy constraint term L _pc as input, and outputs a constraint degree. Please note that unlike the typical framework of existing offline solutions (shown in Equation 1) which assigns a fixed constraint degree (e.g., d _c) for all the data, in the GORL method, constraint degrees generated by the guiding network

may vary with different state-action pairs (s _k, a _k) . In Equation (2) , the guiding network may take the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective. The constraint degree varies for different samples in the offline dataset and the guiding dataset. For example, the guiding network may output a higher relative importance of the policy constraint objective as compared to the policy improvement objective for samples corresponding (i.e., similar) to expert behaviors in the offline dataset, and may output a higher relative importance of the policy improvement objective as compared to the policy constraint objective for samples corresponding (i.e., similar) to random behaviors in the offline dataset.

FIG. 3 illustrates a flow chart of a method 300 for training a guided offline reinforcement learning network in accordance with one aspect of the present disclosure. The method 300 may be implemented by a computer. The computer may be any computing devices, such as, cloud server, distributed computing entities, etc. In block 310, the method 300 may comprise obtaining an offline reinforcement learning network. The offline reinforcement learning network may provide a policy for an agent to take an action at a state of an environment. The offline reinforcement learning network is used for robot control, autonomous driving, or health care. In robot control applications, an agent (such as, the robot or the brain of the robot) may take corresponding actions (such as, walking) at different states of the environment (such as, road conditions) based on the policy in the brain of the robot provided by a learned RL network. Similarly, in autonomous driving applications, the central control unit of a car (i.e., an agent) may drive the car, such as, turn left or break (i. e, actions) at different traffic conditions (i.e., environment) based on a policy learned through RL. In health care applications, an agent (such as, an automated robotic arm for surgery) can make decisions about such as the type of treatment, drug dose, or review time (i.e., actions) at a certain point in time according to current health status and previous treatment history of a patient (i.e., environment) based on a policy learned through RL.

In one embodiment, in block 310, the offline reinforcement learning network may be obtained by initializing an offline reinforcement learning network with random policy parameters. In another embodiment, an offline reinforcement learning network trained based on the prior or even the present offline RL methods may be obtained and used as the initial offline reinforcement learning network for further optimization. For example, the offline reinforcement learning network may be trained based on TD3+BC, SAC+BC, IQL, CQL, etc. Although not shown in FIG. 3, method 300 may also comprise any other well-known steps for training an offline RL network. For example, method 300 may also comprise updating the state-action value function (such as, the Q function in equations 1 or 2) of the offline RL network with offline dataset.

In block 320, method 300 may comprise generating a guiding network on a guiding dataset. The guiding network may evaluate a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network. The generating a guiding network in block 320 may comprise initializing a guiding network with random guiding parameters, or updating the guiding parameters of the guiding network on the guiding dataset based on updated policy parameters. In block 330, method 300 may comprise updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance. The updating step may comprise updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance evaluated with the updated guiding parameters.

The steps in block 320 and block 330 as well as other one or more steps of method 300 (if needed) may be performed alternatively and circularly for a number of times to better balance the policy improvement and policy constraint terms and then optimize the policy of the offline RL network.

In one embodiment, during the initial circular steps, in block 320, method 300 may comprise initializing a guiding network with random guiding parameters, and in block 330, method 300 may comprise updating the policy parameters θ with a gradient descent step on the offline dataset

as follows based on the initial

(t=1) :

where n _θ and α _θ are the mini-batch size and learning rate respectively for policy π _θ.

Then, in the next circular steps, in block 320, method 300 may comprise updating the guiding parameters of the guiding network

on the guiding dataset

i.e.,

based on the updated

as follows:

where n _w is the mini-batch size and α _w is the step size for guiding network

In particular, the policy's parameters

are the updated parameters in Equation 3 related to w.

In block 330, method 300 may comprise updating the policy π _θ of the offline reinforcement learning network on an offline dataset by moving the policy's parameters θ toward the direction of maximizing the policy objective in Equation 2 as follows:

where the w ^(t+1) in Equation 5 may be different from the w ^(t) in Equation 3. It can be clearly observed in Equation 5 that the guiding network

controls the relative update steps of policy improvement and policy constraint gradient for each data pair (s _k, a _k) in the mini-batches.

The pseudo-code of the disclosed plug-in framework, i.e., guided offline reinforcement learning (GORL) , is presented in Algorithm 1 below.

The present disclosed GORL plug-in approach may be applied to various prior or future offline RL methods, including TD3+BC, SAC+BC, IQL, CQL, etc. To implement GORL on offline RL methods, one of the most important works is to find out the corresponding policy constraint term. Such a constraint term may be explicit in some methods (e.g., TD3+BC and SAC+BC) , while much more implicit in other methods (e.g., CQL and IQL) .

In one embodiment of implementing GORL on TD3+BC, the procedures in Algorithm 1 may be followed with L _pc (a _k, π _θ) substituted with (π _θ (s _k) -a _k) ². The pseudo-code of TD3+BC with GORL is presented in Algorithm 2 as below.

The Table 1 below shows the hyperparameters of TD3+BC with GORL on Gym locomotion and Adroit robotic manipulation tasks in the D4RL benchmark dataset.

Table-1

In another embodiment of implementing GORL on SAC+BC which is a natural extension of TD3+BC, replacing TD3 with SAC as policy optimization objective is as below:

where

is a constant,

is an action sampled from π _θ (·∣s _k) by reparameterization trick, and

denotes the probability of π _θ choosing

at state s _k. The Q-function optimization of SAC+BC is the same as SAC. By adding the maximizing entropy term to Equations 3, 4, 5, GORL can be applied to SAC+BC with the procedures in Algorithm 1. The pseudo-code of SAC+BC with GORL is presented in Algorithm 3 as below.

The Table 2 below shows the hyperparameters of SAC+BC with GORL on locomotion/adroit dataset.

Table-2

In another embodiment of implementing GORL on IQL, the policy update objective in IQL may be:

where V (s) is an approximator of

The reason behind Equation 7 is that, if some action a _k is in advantage, i.e.,

the term exp (β (Q (s, a _k) -V (s) ) ) will be larger than its expectation

Therefore, after updating with Equation 7, π _θ is more likely to choose a _k rather than other actions. It can be seen that scalar β, together with Q (s, a) -V (s) , controls to what extent π _θ accepts action a at state s. Based on the analysis above, Equation 7 may be reformulated into:

Compared with Equation 7, Equation 8 assigns a different scalar β _k for each data pair (s _k, a _k) . However, note that

It's difficult to find an optimal β _k end-to-end because β _k is the exponent of (exp (Q (s _k, a _k) -V (s _k) ) ) .

To make the optimization of β _k possible, Equation 8 may be further changed into:

where β _k is multiplied by exp (Q (s _k, a _k) -V (s _k) ) , which is much easier to optimize.

GORL may be implemented on IQL based on Algorithm 1 changed with the new objective (Equation 9) .

is used to generate β _k by taking

as input. The pseudo-code of IQL with GORL is presented in Algorithm 4 as below.

The Table 3 below shows the hyperparameters of SAC+BC with GORL on locomotion/adroit dataset.

Table-3

In another embodiment of implementing GORL on CQL, the policy update objective in CQL may be:

where Q (s, a) is a conservative approximation of the state-action value. The policy constraint objective is implicitly contained during the conservative Q-learning. The more conservative Q-value represents the stronger policy constraint. In this case, GORL can be implemented following Algorithm 1 with the new policy update objective below:

The pseudo-code of CQL with GORL is presented in Algorithm 5 as below.

The Table 4 below shows the hyperparameters of CQL with GORL on locomotion/adroit dataset.

Table-4

FIG. 4 illustrates a block diagram of an apparatus 400 in accordance with one aspect of the present disclosure. The apparatus 400 may comprise a memory 410 and at least one processor 420. In one embodiment, the apparatus 400 may be used for training an offline reinforcement learning network. The processor 420 may be coupled to the memory 410 and configured to perform the method 300 described above with reference to FIG. 3. In another embodiment, the apparatus 400 may be used for a trained offline reinforcement learning network. The processor 420 may be coupled to the memory 410 and configured to implement an offline reinforcement learning network trained by performing the method 300. The processor 420 may be a general-purpose processor, an artificial intelligence processor, or may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. The memory 410 may store the input data, output data, data generated by processor 420, and/or instructions executed by processor 420.

The Guided Offline Reinforcement Learning (GORL) in this disclosure is a general training framework compatible with most offline RL methods. The GORL method may learn a sample-adaptive intensity of policy constraint under the guidance of only a few high-quality data (i.e., expert data) . Specifically, GORL may exert a weak constraint to “random-like” samples in the offline dataset, and may exert a strong constraint to “expert-like” samples in the offline dataset. During the guided learning, each sample may be assigned a different weight and the weights vary through training. Specifically, when fed with relatively high-quality samples (i.e., the samples similar to expert behaviors) , the agent may be inclined to imitation learning; otherwise, when encountering low-quality samples (i.e., the samples similar to random behaviors) , it may choose to slightly diverge from these samples’ distribution. Such adaptive weights seek to achieve the full potential of every sample, leading to higher performance compared with the fixed weight.

The rationality of GORL’s update mechanism and the near-optimality of the guidance from GORL will be described below.

By the chain rule, Equation 4 can be reformulated as:

where

Here, the policy's loss

with

and the guiding loss

with

It can be observed that in Equation 12, larger

would encourage the guiding network

to output a larger constraint degree for the corresponding policy's loss

Further note that in Equation 13,

is an inner product between the guiding gradient average and policy's gradient

Therefore,

would assign larger weights for those

whose gradients are close to the guiding gradient average. The benefits are two-fold: (1) the policy would align its update directions closer to the guiding gradient average whose reliance is guaranteed theoretically; (2) the policy could also enjoy plenty information about the environment provided by a large amount of data in the offline dataset

which is scarce in

due to its small data quantity.

To demonstrate that the guiding gradient average

in Equation 13 is qualified for guiding the offline training processes, the guiding gradient obtained on n expert guiding data may be denoted as

Formally,

If the number of guiding data tends to infinity, the guiding gradient average will reach its optimal form:

where unif {1, n} is the uniform distribution on {1, 2, …, n} . It can be proved that when n increases, the guiding gradient on n expert guiding data

will converge to the optimal guiding gradient

in probability at a rate

In other words, when the guiding dataset

has sufficient expert data, the guiding gradient average in Equation 13 will be approximate to the optimal gradient, and therefore provides reliable guidance for the offline RL algorithms.

Compared with a vanilla scheme which simply mixes expert demonstrations with the offline dataset, the guided training better utilizes the limited high-quality data. FIG. 5A illustrates performance comparisons between the mixed scheme (such as, vanilla) and the guided scheme on different numbers of expert samples. In FIG. 5A, the horizontal axis is the number of expert samples, the vertical axis is percent difference between these two schemes, the grey bars corresponds to the guided scheme (denoted as D (e) →D) , and the black bars corresponds to the mixed scheme (denoted as D (e) +D) . As shown in FIG. 5A, a quite small quantity of expert data, e.g., a hundred or several hundred (the offline dataset’s size is typically 1 million) , may be sufficient for the guiding dataset

to generate a good enough guiding gradient average in Equation 13. When the amount of expert data is small, the guided scheme constantly outperforms the mixed scheme.

FIG. 5B shows a result of a policy trained on expert-only dataset (denoted as “D (e) ” ) with different dataset scales. In FIG. 5B, the horizontal axis is the number of expert samples, and the vertical axis is the normalized score. It’s obvious that the policy’s scores remain quite low until the expert sample number reaches 10 ⁴, which demonstrates that a large amount of training data is necessary for offline RL.

From FIGs. 5A and 5B, it can be seen that the limited expert data itself cannot produce a satisfying agent, due to insufficiency of training samples; and the GORL method may generate reliable guidance for offline RL with only a few expert samples.

The various operations, modules, and networks described in connection with the disclosure herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. According an embodiment of the disclosure, a computer program product for training an offline reinforcement learning network may comprise processor executable computer code for performing the method 300 described above with reference to FIG. 3. According an embodiment of the disclosure, a computer program product for an offline reinforcement learning network may comprise processor executable computer code when executed by a processor causing the processor to implement the offline reinforcement learning network trained by performing the method 300. According to another embodiment of the disclosure, a computer readable medium may store computer code for training an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to perform the method 300 described above with reference to FIG. 3. According to another embodiment of the disclosure, a computer readable medium may store computer code for an offline reinforcement learning network, the computer code when executed by a processor may cause the processor to implement the offline reinforcement learning network trained by performing the method 300. Computer-readable media includes both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. Any connection may be properly termed as a computer-readable medium. Other embodiments and implementations are within the scope of the disclosure.

The preceding description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the various embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the various embodiments. Thus, the claims are not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the following claims and the principles and novel features disclosed herein.

Claims

A computer-implemented method for training an offline reinforcement learning network, comprising:

obtaining an offline reinforcement learning network, wherein the offline reinforcement learning network provides a policy for an agent to take an action at a state of an environment;

generating a guiding network on a guiding dataset, wherein the guiding network outputs a relative importance of a policy improvement objective and a policy constraint objective for optimizing the offline reinforcement learning network; and

updating policy parameters of the offline reinforcement learning network on an offline dataset by a policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance.
The computer-implemented method of claim 1, wherein the offline reinforcement learning network is used for robot control, autonomous driving, or health care.
The computer-implemented method of claim 1, wherein the guiding network takes the policy constraint objective as input, and outputs a constraint degree for the policy constraint objective, the constraint degree varies for different samples in the offline dataset and the guiding dataset.
The computer-implemented method of claim 1, wherein the guiding network outputs a higher relative importance of the policy constraint objective as compared to the policy improvement objective for high-quality samples in the offline dataset, and outputs a higher relative importance of the policy improvement objective as compared to the policy constraint objective for low-quality samples in the offline dataset.
The computer-implemented method of claim 1, wherein the guiding dataset includes hundreds of high-quality samples collected from expert behaviors.
The computer-implemented method of claim 1, wherein the obtaining an offline reinforcement learning network comprises:

initializing an offline reinforcement learning network with random policy parameters.
The computer-implemented method of claim 6, further comprising:

updating a value function of the offline reinforcement learning network on a mini-batch of offline data sampled from the offline dataset.
The computer-implemented method of claim 7, wherein the generating a guiding network comprises:

initializing a guiding network with random guiding parameters.
The computer-implemented method of claim 8, wherein the updating policy parameters of the offline reinforcement learning network comprises:

updating the policy parameters with a gradient descent step on the mini-batch of offline data based on the relative importance output by the guiding network with the random guiding parameters.
The computer-implemented method of claim 9, wherein the generating a guiding network further comprises:

updating guiding parameters of the guiding network on a mini-batch of guiding data sampled from the guiding dataset based on the updated policy parameters.
The computer-implemented method of claim 10, wherein the updating policy parameters of the offline reinforcement learning network further comprises:

updating the policy parameters toward a direction of maximizing the policy objective as a function of the policy improvement objective and the policy constraint objective based on the relative importance output by the guiding network with the updated guiding parameters.
An apparatus for training an offline reinforcement learning network, comprising:

a memory; and

at least one processor coupled to the memory and configured to perform the computer-implemented method of one of claims 1-11.
A computer readable medium, storing computer code for training an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to perform the computer-implemented method of one of claims 1-11.
A computer program product for training an offline reinforcement learning network, comprising: processor executable computer code for performing the computer-implemented method of one of claims 1-11.
An apparatus for an offline reinforcement learning network, comprising:

a memory; and

at least one processor coupled to the memory and configured to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
A computer readable medium, storing computer code for an offline reinforcement learning network, the computer code when executed by a processor, causing the processor to implement the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.
A computer program product for an offline reinforcement learning network, comprising: processor executable computer code for implementing the offline reinforcement learning network trained by performing the computer-implemented method of one of claims 1-11.