
WO2022028926A1 - Offline simulation-to-reality transfer for reinforcement learning - Google Patents


Info

Publication number
WO2022028926A1
Authority
WO
WIPO (PCT)
Prior art keywords
real
model
world
simulation
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2021/070717
Other languages
French (fr)
Inventor
Filippo VANNELLA
Ezeddin AL HAKIM
Mayank GULATI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB filed Critical Telefonaktiebolaget LM Ericsson AB
Publication of WO2022028926A1 publication Critical patent/WO2022028926A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • HELECTRICITY
    • H01ELECTRIC ELEMENTS
    • H01QANTENNAS, i.e. RADIO AERIALS
    • H01Q3/00Arrangements for changing or varying the orientation or the shape of the directional pattern of the waves radiated from an antenna or antenna system
    • H01Q3/005Arrangements for changing or varying the orientation or the shape of the directional pattern of the waves radiated from an antenna or antenna system using remotely controlled antenna positioning or scanning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22Traffic simulation tools or models
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/24Cell structures
    • H04W16/28Cell structures using beam steering

Definitions

  • the present disclosure relates generally to computer-implemented methods for offline simulation-to-reality (Sim-to-Real) transfer for reinforcement learning, and related methods and apparatuses.
  • Sim-to-Real simulation-to-reality
  • an agent interacts with a stochastic environment and observes reward feedback as a result of taking an action in a given state.
  • An agent e.g., a model, including one or more processors, or a model communicatively connected to one or more processors
  • Reward feedback for the action can be fed back to the agent as a measure of success or failure of the agent’s action in a given state.
  • the agent learns a policy π, that is, a control law mapping states to probabilities over actions.
  • the agent finds itself in state s_t, takes action a_t by following the policy π, transitions to the next state s_t+1, and in turn receives reward sample r_t as feedback from such interaction with the environment.
  • Reward feedback can be a numerical value received by the agent from the environment as a response to the agent’s action.
  • Reward feedback can be, for example, a positive numerical value for a positive action or a negative numerical value for a negative action.
  • the numerical value can be determined by a reward function, e.g., reward feedback for a first state can be specified to have a numerical value of 1 , reward feedback for a second (or more) states can be specified to have a numerical value of 0, etc.
  • MDP Markov Decision Process
  • a main objective of an RL agent is to learn an optimal policy ⁇ * with a goal of maximizing the expected cumulative rewards. As described below, some RL algorithm approaches used to try to achieve this goal are now discussed.
  • a Q-Learning approach is a value function based algorithm where the state-action value function (Q-function), denoted by Q_π(s,a), quantifies the desirability of taking action a in state s and thereafter following policy π.
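As a concrete illustration of the value-based approach, the following is a minimal tabular Q-learning sketch. The environment interface (reset(), step(), an actions list) and the hyperparameters are generic placeholders assumed for the example, not part of the disclosure.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Minimal tabular Q-learning; `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and a list `env.actions`."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

    def greedy(state):
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the discrete action set
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```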
  • a Policy Gradient (PG) approach is a policy-based algorithm that aims to optimize the policy directly through optimization methods such as gradient-based methods.
  • In a PG approach, the policy π_θ(a|s) is generally parameterized with respect to a parameter vector θ ∈ R^d.
  • the value of the objective function J(θ) = E[Σ_t γ^t r_t], where the expectation is taken under π_θ, depends on the policy, and various algorithms can be applied to optimize θ on the maximization objective.
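The policy-gradient idea can be sketched as follows for a linear softmax policy trained with REINFORCE-style updates; the environment interface, feature dimensions, and hyperparameters are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, n_actions, dim, episodes=1000, lr=0.01, gamma=0.99):
    """Policy-gradient sketch for a linear softmax policy pi_theta(a|s) = softmax(theta^T s).
    `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done)."""
    theta = np.zeros((dim, n_actions))
    for _ in range(episodes):
        states, actions, rewards, done = [], [], [], False
        s = env.reset()
        while not done:
            p = softmax(s @ theta)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # discounted returns G_t, computed backwards over the episode
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # gradient ascent on J(theta): grad log pi(a_t|s_t) * G_t
        for s_t, a_t, G_t in zip(states, actions, returns):
            p = softmax(s_t @ theta)
            grad_log = -np.outer(s_t, p)   # d/dtheta_b log pi = s * (1[b=a] - pi(b|s))
            grad_log[:, a_t] += s_t
            theta += lr * G_t * grad_log
    return theta
```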
  • An Actor-Critic is a class of RL algorithms that combines policy-based and value-based methods.
  • the critic evaluates the policy using value function which in turn can provide a better estimate for the actor to update in the gradient in the direction governed by the critic.
  • operations of a method for offline simulation-to-reality transfer for reinforcement learning includes training a simulation model portion of a hybrid policy model.
  • the simulation model portion includes a set of simulation model parameters θ_s trained in a simulated environment.
  • the hybrid policy model includes a parametric artificial intelligence model having a parameter θ.
  • the parameter θ includes the simulation model portion including the set of simulation model parameters θ_s trained in the simulated environment and a real-world model portion including a set of real-world parameters θ_r trained on a dataset.
  • the method further includes sharing a shared subset of the set of simulation model parameters, θ_shared, with the real-world model portion of the hybrid policy model.
  • the method further includes training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θ_shared with an output of the hybrid policy model from the real-world model portion including real-world parameters θ_r at a training step.
  • further operations include accessing the dataset.
  • the dataset includes recorded observations from the real-world environment.
  • the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy.
  • the accessing includes collecting the dataset by one or more of the logging policy or policies and a pre-collected real-world dataset.
  • further operations include processing the dataset with a conversion of the dataset to a format for use in the training.
  • the conversion includes one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
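As one possible reading of the propensity-score step above, the following sketch fits a logistic-regression model of the logging policy π_0(a|s) on the logged state-action pairs, derives per-sample propensities and inverse propensity scores, and splits the processed data into training and test sets. The array shapes and integer-coded actions are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def compute_propensities(states, actions):
    """Estimate the logging policy pi_0(a|s) with logistic regression and return,
    per logged sample, the propensity pi_0(a_i|s_i) and its inverse (IPS weight).
    `states` is an (N, d) array, `actions` an (N,) array of integer-coded actions."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(states, actions)
    probs = clf.predict_proba(states)                    # shape (N, n_actions)
    cols = np.searchsorted(clf.classes_, actions)        # column of each logged action
    propensity = probs[np.arange(len(actions)), cols]
    ips_weight = 1.0 / np.clip(propensity, 1e-6, None)   # inverse propensity scores
    return propensity, ips_weight

def split_dataset(states, actions, rewards, propensity, test_size=0.2, seed=0):
    """Split the processed dataset into training and test sets, as described above."""
    return train_test_split(states, actions, rewards, propensity,
                            test_size=test_size, random_state=seed)
```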
  • further operations include, subsequent to training the simulation model portion and the real-world model portion, evaluating the hybrid policy model on the test set by an offline off-policy evaluation technique.
  • Potential advantages of disclosed embodiments include knowledge sharing between the two environments’ representations that can result in a hybrid policy trained in simulation that is closer to reality. For example, the outcome of such training may create a policy having increased reliability and robustness towards discrepancies between the real-world and the simulation environment. Furthermore, a balance may be struck between exploration and exploitation during training by taking advantage of simulation to promote exploration while enhancing exploitation during real-world interaction. Additionally, such training can include offline off-policy samples to have a policy closer to reality through the injection of real-world samples in the training; and training of a policy more safely without interaction with the real environment.
  • Figure 1 is a diagram illustrating a Remote Electrical Tilt angle between a beam of an antenna pattern and a horizontal plane of an antenna in a telecommunication network in accordance with various embodiments of the present disclosure.
  • Figure 2 is a block diagram of an architecture for offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments of the present disclosure.
  • Figure 3 is a block diagram and data flow diagram for training an offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments.
  • Figure 4 is a block diagram and data flow diagram of a hybrid policy model comprising an artificial neural network in accordance with some embodiments.
  • Figure 5 is a block diagram of operational modules and related circuits and controllers of an offline Sim-to-Real transfer system for reinforcement learning in accordance with some embodiments.
  • Figure 6 is a flow chart of operations that may be performed by the offline Sim-to- Real transfer system for reinforcement learning in accordance with some embodiments.
  • a limitation of training an RL model in a real-world environment is the concern regarding performance disruption in the real-world environment caused by uncontrolled exploration; training in a simulated environment avoids this concern. Challenges exist, however, in building a reliable simulation model due to, e.g., modeling errors, uncontrollable stochastic effects of the real-world environment, etc.
  • the discrepancy between a simulation model and the real-world environment is referred to as the simulation-to-reality (Sim-to-Real) gap. The Sim-to-Real gap hinders real-world deployment of RL policies trained in simulated environments.
  • the method includes a simulation model portion for training a set of simulation model parameters in a simulated environment and a real-world model portion for training a set of real-world parameters on a dataset from the real-world environment.
  • the method includes sharing of a shared subset of the simulation model parameters with the real-world model portion trained on the dataset.
  • a dichotomy of RL methods is the distinction between on-policy and off-policy methods.
  • the former aims at learning the same policy with which the interaction with the environment happens, while the latter aims at learning a policy (target policy π) that is different from the policy used to interact with the environment and collect trajectory samples (behavior policy π_0).
  • the behavior policy π_0 is the policy that is used to interact with the environment and thus collect trajectory samples.
  • when the agent (e.g., a parametric artificial intelligence model) is required to base its learning strategy solely on a fixed batch of data which cannot be expanded further, the learning can be referred to as Batch RL or Offline RL.
  • Importance Sampling is a technique known to those of skill in the art that can be used in offline off-policy learning.
  • ISR Importance Sampling Ratio
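A minimal sketch of the importance sampling ratio and the resulting off-policy value estimate follows, assuming that per-sample action probabilities under the target and logging policies are available (a contextual-bandit style estimator rather than a full trajectory-level one).

```python
import numpy as np

def importance_sampling_value(target_probs, logging_probs, rewards):
    """Offline off-policy value estimate of a target policy pi from data logged
    under a behavior/logging policy pi_0, using per-sample importance ratios
    rho_i = pi(a_i|s_i) / pi_0(a_i|s_i)."""
    rho = np.asarray(target_probs) / np.asarray(logging_probs)  # ISR per sample
    return float(np.mean(rho * np.asarray(rewards)))            # IS value estimate
```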
  • Sim-to-Real gap will now be discussed further. Simulators can provide an effective infrastructure to train RL policies without concerns about performance disruption caused by uncontrolled exploration, which represents one of the main limitations in training RL algorithms in real-world environments. However, it can be challenging to build reliable simulation models due to modeling errors (e.g., inherent modeling errors) or uncontrollable stochastic effects of the real-world environment.
  • DA Domain Adaptation
  • in domain adaptation, knowledge is transferred from a source domain (e.g., the simulated environment) to a target domain (e.g., the real-world environment).
  • DR Domain Randomization
  • Peng describes using DR by randomizing the dynamics of a simulator to try to develop robust policies capable of adapting to real-world dynamics.
  • a multi-goal formulation is described where different goals g ∈ G are presented for every episode, which results in a goal-conditioned policy function π(a | s, g).
  • Pinto describes leveraging learning from a simulator by using DR to try to achieve robustness in performance of policies trained in simulation and deployed to the real world.
  • in Pinto, DR is applied to visual aspects during rendering from the simulator to try to accomplish Sim-to-Real transfer, without training with real-world data, using an actor-critic setup.
  • while Peng and Pinto describe using off-policy learning techniques that utilize segments of a history of experiences in the form of an experience replay, neither approach includes a complete offline learning approach in which there is no longer interaction with the environment. Peng and Pinto each involve online learning both in simulation as well as in the real world. As a consequence, concerns exist about performance disruption caused by uncontrolled exploration in the real-world environment.
  • Rusu describes employing an ANN architecture (progressive net) to transfer learning through shared connections from simulation to real-world experiments.
  • the model is trained in a simulation environment.
  • another model is deployed to train from the real world with the help of learnable joint connections to enhance knowledge sharing.
  • the approach used in Rusu includes active interaction with the real-world environments, so its application is confined to only online learning use cases. As a consequence, concerns exist about performance disruption caused by uncontrolled exploration in the real-world environment for the online learning use cases.
  • FIG 1 is a diagram illustrating a Remote Electrical Tilt (RET) angle between a beam of an antenna pattern and a horizontal plane of an antenna in a telecommunication network (such as a cellular network) in accordance with various embodiments of the present disclosure.
  • RET refers to controlling an electrical antenna tilt angle remotely, where the RET angle is defined as an angle between a beam (e.g., a main beam) of an antenna pattern and a horizontal plane (see the angle illustrated in Figure 1).
  • SON Self-Organizing Networks
  • 3GPP 3rd Generation Partnership Project
  • rule-based or control-based heuristic algorithms have been employed to control the tilt angle of each antenna at/near a base station of a telecommunication network (see e.g., V. Buenestado, M. Toril, S. Luna-Ramirez, J. M. Ruiz-Avilés and A. Mendo, "Self-tuning of Remote Electrical Tilts Based on Call Traces for Coverage and Capacity Optimization in LTE,” IEEE Transactions on Vehicular Technology, vol. 66, no. 5, pp. 4315-4326, 2017 (“Buenestado”) and A. Engels, M. Reyer, X. Xu, R. Mathar, J.
  • Additional approaches include RL techniques to try to solve the RET optimization problem. See e.g., S. Fan, H. Tian and C. Sengul, “Self-optimization of coverage and capacity based on a fuzzy neural network with cooperative reinforcement learning,” EURASIP Journal on Wireless Communications and Networking, no. 1, p. 57, 2014 (“Fan”); E. Balevi and J. Andrews, “Online Antenna Tuning in Heterogeneous Cellular Networks With Deep Reinforcement Learning,” IEEE Transactions on Cognitive Communications and Networking, pp. 1-1, 2019 (“Balevi”); F. Vannella, J. Jeong and A.
  • RL algorithm approaches may have shown improved performance compared to traditional rule-based or optimization-based methods, including for fast adaptation to abrupt changes in the environment.
  • in Fan and Balevi, the use of RL is limited to simulation environments.
  • Vannella, as well as approaches described in a non-published internal reference implementation, only considers off-policy learning from offline data and does not consider a combination of simulated and real-world environments.
  • RL applications for telecommunication industrial use-cases consider training a RL agent in the simulation environment (e.g. RET optimization described in Fan and Balevi), which may result in serious limitations to the deployability of RL algorithms in the real network.
  • various embodiments of the present disclosure provide a method for transferring an RL policy from a simulated environment to a real environment (Sim-to-Real transfer) through offline learning.
  • the method includes a simulated environment and offline data.
  • the method may compensate for the shortcomings of both simulation learning and offline off-policy learning.
  • simulation may enable achieving better generalization with respect to conventional offline learning as it complements offline learning with learning through unseen simulated trajectories; while the offline learning may enable closing the Sim-to-Real gap by exposing the agent to real-world samples.
  • the method of various embodiments of the present disclosure includes training a hybrid policy model.
  • the hybrid policy model includes a portion trained in simulation and a portion trained on offline real-world data. As a consequence, samples from the real-world data and the simulated environment can be merged by modifying conventional RL training procedures through knowledge sharing between representations of the two environments.
  • Potential advantages provided by training the hybrid policy model that combines learning from a simulation and an offline dataset from a real-world environment may include that the outcome of such training may create a policy having increased reliability and robustness towards discrepancies between the real-world and the simulation environment.
  • Additional potential advantages may include striking a balance between exploration and exploitation during training by taking advantage of simulation to promote exploration while enhancing exploitation during real-world interaction. This aspect of the method may be important in real-world applications in which the performance of a system under control cannot be degraded arbitrarily.
  • Further potential advantages may include improvement of real-world sample efficiency, e.g., reducing the number of real-world data samples required to learn an optimal policy with the help of simulation used to generalize the offline learning to unseen trajectories.
  • the method of various embodiments can include offline off-policy samples to have a policy closer to the reality through the injection of real-world samples in the training.
  • Additional potential advantage may include allowing training of a policy more safely without interaction with the real environment. This may be particularly important in RL applications where the interaction with the real-world can disrupt the performance of the real system because of uncontrolled exploration. Furthermore, the method may enhance both simulation training and offline training by integrating the two.
  • FIG. 2 is a block diagram of an architecture 200 for offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments of the present disclosure.
  • Figure 2 includes the following components: a simulation environment 202, an offline real-world dataset 204, a hybrid policy model 206, an online simulation training 208, sim and real knowledge sharing 210, and offline real-world dataset training 212. While the components of Figure 2 are depicted as single boxes, in practice, the components may be combined in a single component or may comprise different physical components that make up a single illustrated component or a combination of one or more components.
  • Simulation environment 202 includes an “imperfect” simulation model of a true real-world environment. It is noted that if a simulation model is perfect, there is no need for a Sim-to-Real method. Thus, an imperfect simulation model refers to a simulation model having a Sim-to-Real gap. Simulation environment 202 can be related to the particular use-case in which the architecture is deployed (e.g. network simulator, robotic simulator, etc.). In some embodiments, simulation environment 202 enables DR by changing simulation parameters to simulate randomized environment conditions.
  • Offline real-world dataset 204 (also referred to herein as a dataset, an offline dataset, and/or a real-world dataset) includes a real-world dataset of recorded observations from the real-world environment collected according to a logging policy.
  • a logging policy is a policy that was previously operating in a real-world system and has collected data of its interaction, or an ad-hoc logging policy optimized to explore the state-action space.
  • the dataset 204 collected according to the logging policy includes different accumulated observations in the form of state, action, and rewards by following the actions governed by the logging policy.
  • Hybrid policy model 206 includes a parametric artificial intelligence model (e.g., an artificial neural network (ANN), a Regression Model, a Decision Tree, etc.) with parameter θ.
  • Hybrid policy model 206 can enable integration of training elements from the simulation environment and real-world data.
  • Parameter θ includes any parameter of a parametric artificial intelligence model, including without limitation, a learning rate, an outcome value, a reward value, an exploration parameter, a discount factor, etc.
  • Online simulation training 208 trains the simulation parameters θ_s of the hybrid policy model in the simulation environment.
  • DR can be used to train the hybrid policy across randomized parameter settings or adjusted by taking feedback from real-world data/training through sim and real knowledge sharing 210 and offline real-world dataset training 212 (DA technique).
  • Online simulation training 208 outputs trained simulation weights, e.g., denoted by θ*_s.
  • Sim to real knowledge sharing 210 includes sharing based on freezing and sharing a subset of the simulation model parameters. As referred to herein, sharing refers to sharing a full or a partial subset of the simulation model parameters, i.e., θ_shared ⊆ θ*_s, with the real-world model portion.
  • Freezing refers to keeping the subset of the set of simulation model parameters (i.e., θ_shared) fixed so that gradient updates are not executed on these weights during training.
  • the real-world training parameters, e.g., θ_r and θ_shared, are concatenated so that their outputs from the simulation and real-world portions can be combined at each layer.
  • in some embodiments, the sharing is based on fine-tuning the subset of the set of simulation model parameters (i.e., θ_shared), e.g., to increase the performance.
  • Offline real-world dataset training 212 can train the real-world parameters, e.g., θ_r, of the hybrid policy with off-policy offline learning with the help of IS techniques during training. Training 212 can use learning from the simulator using the shared parameters, e.g., θ_shared, by merging outputs from the shared parameters and the real-world parameters, e.g., θ_shared and θ_r, at each training step. Merging outputs is discussed further herein with reference to Sim to real knowledge sharing and concatenation with respect to Figure 4.
  • the output of the real-world training 212 includes a trained real-world parameter, e.g., denoted by θ*_r.
  • Figure 3 is a block diagram and data flow diagram for training an offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments.
  • the method 300 includes collecting 302 a real-world offline dataset.
  • the real-world dataset can be collected by following one or more logging policies.
  • method 300 can start from a pre-collected real-world offline dataset.
  • Method 300 can further include preprocessing 304 the real-world offline dataset.
  • Preprocessing 304 can include, but is not limited to, preparing the real-world dataset for processing such as, for example, by converting the real-world dataset to a format that is suitable for training.
  • preprocessing 304 includes data preparation techniques, data cleaning, instances partitioning, feature tuning, feature extraction, feature construction, etc.
  • inverse propensity scores are computed through logistic regression as described, for example, in non- published internal reference implementation.
  • the dataset is split into training and test sets.
  • method 300 can further include creating 306 a hybrid policy model (e.g., hybrid policy model 206).
  • a hybrid policy model includes a parametric artificial intelligence model (e.g., an ANN, Regression Model, Decision Tree, etc.) with parameter θ.
  • the two portions, i.e., the portion trained in the real world θ_r (also referred to herein as the “real-world model portion”) and the portion trained in the simulation θ_s (also referred to herein as the “simulation model portion”), can be trained using two separate loss functions using the same RL framework formulation and algorithm (e.g., Deep Q-Network (DQN), Actor-Critic, etc.).
  • the structure of the two portions may differ from one another.
  • the simulation model portion has more trainable parameters as compared to the real-world model portion, e.g., as reflected in the layer sizes illustrated in Figure 4.
  • method 300 can further include training 308 the hybrid policy model in the simulated environment with DR.
  • the set of simulation parameters θ_s of the hybrid policy model is trained in the simulated environment with different randomized settings by using the DR approach.
  • DR can also be performed by extracting information from the offline real-world dataset, e.g., measuring stochasticity from the offline real-world dataset and injecting the measurement in the form of noise in the simulator, as sketched below.
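A minimal domain randomization sketch along these lines follows; the simulator configuration keys and ranges are illustrative assumptions, and the optional noise level uses the empirical spread of logged rewards as one way of injecting measured real-world stochasticity.

```python
import numpy as np

def randomize_simulator(sim_config, offline_rewards=None, rng=np.random.default_rng()):
    """Return a randomized copy of a simulator configuration for one training
    episode. Parameter names and ranges are illustrative; optionally, stochasticity
    measured from the offline real-world dataset is injected as observation noise."""
    cfg = dict(sim_config)
    cfg["traffic_load"] = rng.uniform(0.2, 1.0)        # randomized user traffic (assumed key)
    cfg["pathloss_exponent"] = rng.uniform(2.0, 4.0)   # randomized propagation (assumed key)
    if offline_rewards is not None:
        # use the empirical spread of logged rewards as a noise level (DA-style feedback)
        cfg["observation_noise_std"] = float(np.std(offline_rewards))
    return cfg
```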
  • Method 300 can further include training 310 the hybrid policy model with the real- world dataset using propensities.
  • the set of real-world parameters θ_r of the hybrid policy model are trained on the real-world offline dataset utilizing the shared pre-trained parameters θ_shared from simulation training.
  • gradient updates of the training parameters θ_r use the ISR ρ_t in the form of computed propensities (e.g., from an operation of preprocessing 304) for each data sample using an update rule, as sketched below.
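One plausible form of such a propensity-weighted update, sketched in TensorFlow under the assumption that the hybrid model outputs action probabilities and that the shared simulation weights are frozen (so only θ_r receives gradient updates); the function and argument names are placeholders.

```python
import tensorflow as tf

def offline_update_step(hybrid_model, optimizer, states, actions, rewards, propensities):
    """One propensity-weighted gradient step on the real-world parameters theta_r.
    Objective: maximize mean(rho * r) with rho = pi_theta(a_i|s_i) / pi_0(a_i|s_i),
    where pi_0(a_i|s_i) is the logged propensity of the logging policy.
    Assumes hybrid_model maps states to action probabilities and that the shared
    simulation weights theta_shared are frozen (non-trainable)."""
    actions = tf.cast(actions, tf.int32)
    with tf.GradientTape() as tape:
        probs = hybrid_model(states, training=True)                  # shape (N, n_actions)
        pi_a = tf.gather(probs, actions, axis=1, batch_dims=1)       # pi_theta(a_i|s_i)
        rho = pi_a / tf.cast(propensities, pi_a.dtype)               # importance ratios rho_t
        loss = -tf.reduce_mean(rho * tf.cast(rewards, pi_a.dtype))   # negative IPS objective
    grads = tape.gradient(loss, hybrid_model.trainable_variables)    # only theta_r is trainable
    optimizer.apply_gradients(zip(grads, hybrid_model.trainable_variables))
    return loss
```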
  • method 300 can further include evaluating 312 the hybrid policy model using propensities.
  • the hybrid policy model is then evaluated on the test dataset by offline off-policy evaluation techniques (e.g. IS).
  • Exemplary embodiments of the method of the present disclosure for RET optimization and power optimization will now be described. Such exemplary embodiments are non- limiting use cases of the method of the present disclosure, and other use cases of the method may be used.
  • a RL framework for an offline Sim-to-Real method for RET optimization includes, without limitation, states, actions, and rewards.
  • a state(s) includes a set of measured key performance indicators (KPIs) collected at a cell, cluster, or network level depending on the implementation.
  • the state contains recorded KPIs at the cell level for every cell and the hybrid policy model is trained across different cells independently.
  • An action(s) includes a discrete tilt variation from a current vertical antenna tilt angle.
  • the action includes up-tilting, down-tilting, or no-change action a_t ∈ {−δ, 0, δ}.
  • a reward(s) includes a function of the observed KPIs that describe a quality of service (QoS) perceived by user equipment(s) (UE(s)).
  • QoS quality of service
  • UE user equipment
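For illustration, the RET state, action, and reward described above can be encoded as follows; the KPI names, the tilt step δ, and the reward weights are assumptions chosen for the sketch, not values taken from the disclosure.

```python
from dataclasses import dataclass

# Discrete tilt actions a_t in {-delta, 0, +delta}; delta is an assumed step size in degrees.
TILT_DELTA_DEG = 1.0
ACTIONS = (-TILT_DELTA_DEG, 0.0, +TILT_DELTA_DEG)   # down-tilt, no change, up-tilt

@dataclass
class CellState:
    """Per-cell state: a set of measured KPIs (names here are illustrative)."""
    coverage: float
    capacity: float
    quality: float

def reward(kpis: CellState, w_cov=0.4, w_cap=0.4, w_qual=0.2) -> float:
    """Reward as a function of observed KPIs describing the QoS perceived by UEs;
    the weighted-sum form and the weights are assumptions for illustration."""
    return w_cov * kpis.coverage + w_cap * kpis.capacity + w_qual * kpis.quality
```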
  • simulation environment 202 includes a Fourth Generation (4G) cellular network simulator that includes a model of the real network environment.
  • 4G Fourth Generation
  • the simulation state, action, and reward are consistent with the above discussion. While various embodiments herein are discussed in the non-limiting context of a 4G cellular network, the embodiments are not so limited. Instead, other cellular networks may be used, including without limitation, a new radio (NR) network.
  • NR new radio
  • a real-world dataset for an exemplary embodiment of an offline Sim-to-Real method for RET optimization includes, without limitation, an offline dataset, e.g., D_π0, generated according to an expert policy π_0(a_i|s_i).
  • an agent performs different actions a_i in an environment having a set of states s_i, based on its expert policy π_0.
  • an offline learning approach is used (e.g. non-published internal reference implementation), in which a policy is learned completely offline.
  • the hybrid policy model for offline Sim-to-Real transfer learning for RET optimization is a parametric ANN that is trained on data samples coming from both the real-world and the simulated network environment. Training of the hybrid policy model may use DQN, PG, Actor-Critic, etc. algorithms, etc.
  • training in the simulator using a simulated 4G cellular network includes training the hybrid policy model containing a set of simulation parameters θ_s in the simulator using randomized environmental settings following a DR approach until convergence.
  • a subset of trained simulation parameters, θ_shared, are shared with the hybrid policy model for ensemble (e.g., merged) learning with real-world training.
  • the real-world portion of the hybrid policy model is trained by combining the outputs from θ_shared and θ_r at each training step.
  • the hybrid policy π_θ* is evaluated by testing it on a test dataset and comparing its performance with a baseline policy π_0 based on a metric defined in terms of the reward function.
  • Downlink power control is a problem known and studied by those of skill in the art.
  • An aim of downlink power control is to control radio downlink transmission power to try to maximize a set of desired key performance indicators (KPIs), such as coverage, capacity and signal quality.
  • KPI key performance indicators
  • the power control can be formulated as an RL agent at a cell, cluster, or network level, depending on the implementation.
  • a RL framework for an offline Sim-to-Real method for power optimization in a telecommunication network includes, without limitation, states, actions, and rewards.
  • the RL framework at a cell level includes states, an action(s), rewards, and a simulation environment.
  • a state(s) includes a set of measured KPIs collected at a cell level, such as an amount of cell power, e.g., an average Reference Signal Received Power (RSRP) to provide implicit user locations, and average interference.
  • RSRP Reference Signal Received Power
  • An action(s) includes a discrete power adjustment from a current downlink power at cell level, etc., e.g., power-up, power-down and no change.
  • a reward(s) includes a function of different KPIs, e.g., coverage, capacity and signal quality, etc.
  • a simulation environment includes a 4G cellular network simulator that simulates downlink power control.
  • the simulation takes cell power as input and outputs a set of cells measured KPIs.
  • a RL framework for an offline Sim-to-Real method for power optimization in a telecommunication network includes, without limitation, a simulation environment 202, an offline real-world dataset 204, a hybrid policy model 206, an online simulation training 208, sim and real knowledge sharing 210, and offline real-world dataset training 212.
  • simulation environment 202 includes a 4G cellular network simulator that includes a model of the real network environment.
  • simulation state, action, and reward are consistent with the above discussion.
  • the hybrid policy model for offline Sim-to-Real transfer learning for power optimization is a parametric ANN that is trained on data samples coming from both the real-world and the simulated network environment. Training of the hybrid policy model may use DQN, PG, Actor-Critic, etc. algorithms, etc.
  • training in the simulator using a simulated 4G cellular network includes training the hybrid policy model containing a set of simulation parameters θ_s in the simulator using randomized environmental settings following a DR approach until convergence.
  • FIG 4 is a block diagram and data flow diagram of an exemplary embodiment of a hybrid policy model 400 (e.g., hybrid policy model 206) comprising a parametric artificial neural network having a parameter θ in accordance with some embodiments.
  • hybrid policy model 400 includes a simulation model portion 402 including a set of simulation model parameters θ_s and a real-world model portion 404 including a set of real-world parameters θ_r.
  • a subset of the set of simulation model parameters θ_s, denoted θ_shared 406, is shared with real-world model portion 404 of hybrid policy model 400.
  • simulation model portion 402 includes layers and sizes as follows and as illustrated in Figure 4:
  • real-world model portion 404 includes layers and sizes as follows and as illustrated in Figure 4:
  • Simulation model portion 402 includes input layer 408 having a plurality of input nodes, a sequence of neural network hidden layers 410, 412, 414 each include a plurality of weight nodes (e.g., 100, 50, and 20 nodes, respectively), and the output layer 416 includes one or more output nodes (e.g., 3 output nodes).
  • the input layer 408 includes 4 nodes.
  • the input data is provided to different ones of the input nodes.
  • the sequence of neural network hidden layers 410, 412, 414 includes weight nodes dense, dense_1, and dense_2, respectively.
  • the output layer 416 includes three output nodes dense_3.
  • Real-world model portion 404 includes input layer 408 having a plurality of input nodes, a sequence of neural network hidden layers 420, 428, 422, and 430 each including a plurality of weight nodes (e.g., 10, 5, 10, and 5 nodes, respectively), and the output layer 434 includes one or more output nodes (e.g., 3 output nodes).
  • the input layer 408 includes 4 nodes.
  • the input data is provided to different ones of the input nodes.
  • the sequence of neural network hidden layers 420, 428, 422, and 430 includes respective pluralities of weight nodes.
  • the output layer 434 includes three output nodes dense_8.
  • simulated model portion 402 of hybrid policy model 400 is trained.
  • Input features (e.g., KPIs) are provided to the input nodes of input layer 408.
  • the simulated model portion 402 processes the inputs to the input nodes through neural network hidden layers 412, 414 which combine the inputs, as will be described below, to provide outputs for combining at output node 416.
  • the output node provides an output value responsive to processing through the input nodes of the neural network a stream of input features.
  • simulated model portion 402 generates an action from output node 416 and performs feedback training of the node weights of the input layer 408, and the hidden neural network layers 412, 414.
  • Input features are provided to different ones of the input nodes of input layer 408.
  • a first one of the sequence of neural network hidden layers 412 includes weight nodes N1L1 (where "1L1" refers to a first weight node on layer one) to NXL1 (where X is any plural integer).
  • a last one ("Z") of the sequence of neural network hidden layers 414 includes weight nodes N1LZ (where Z is any plural integer) to NYLZ (where Y is any plural integer).
  • the output layer 416 includes an output node.
  • the simulated model portion 402 operates the input nodes of the input layer 408 to each receive different input features.
  • Each of the input nodes multiplies an input feature metric (e.g., a KPI) by a weight that is assigned to the input node to generate a weighted metric value.
  • the input node then provides the weighted metric value to combine nodes of the first one of the sequence of the hidden layers 412.
  • the interconnected structure between the input nodes 408, the weight nodes of the neural network hidden layers 412, 414, and the output nodes 416 may cause the characteristics of each inputted feature to influence the output (e.g., an action) generated for all of the other inputted features that are simultaneously processed.
  • real-word model portion 404 generates an action from output node 434 and performs feedback training of the node weights of the input layer 408, and the hidden neural network layers 420, 422, 428, and 430.
  • Input features are provided to different ones of the input nodes of input layer 408.
  • the first one of the sequence of neural network hidden layers 422 includes weight nodes N1L1 (where "1L1" refers to a first weight node on layer one) to NXL1 (where X is any plural integer).
  • a last one ("Z") of the sequence of neural network hidden layers 414 includes weight nodes N1LZ (where Z is any plural integer) to NYLZ (where Y is any plural integer).
  • the output layer 434 includes an output node.
  • the real-world model portion 404 operates the input nodes of the input layer 408 to each receive different input features.
  • Each of the input nodes multiplies an input feature metric (e.g., a KPI) by a weight that is assigned to the input node to generate a weighted metric value.
  • the input node then provides the weighted metric value to combine nodes of the first one of the sequence of the hidden layers 412.
  • Sim to real knowledge sharing includes sharing a subset 406 (θ_shared) of the trained simulation model parameters θ*_s with real-world model portion 404 (e.g., based on freezing and sharing subset 406 of the trained simulation model portion 402).
  • a full or a partial subset of the simulation model parameters can be shared with the real-world model portion 404, i.e., θ_shared ⊆ θ*_s, as shown with lambda 418 and lambda_1 426.
  • Freezing refers to keeping the subset of the set of simulation model parameters (i.e., θ_shared) fixed so that gradient updates are not executed on these weights during training.
  • the real-world training parameters, e.g., θ_r and θ_shared, are concatenated in a way that their outputs (from the sim and real portions) at each layer can be combined by summing them (e.g., add 424 and add_1 432) when hybrid policy model 400 is provided with input from the real-world dataset.
  • in some embodiments, the sharing is based on fine-tuning the subset of the set of simulation model parameters (i.e., θ_shared), e.g., to increase the performance.
  • Offline real-world dataset training can train the real-world parameters, e.g., θ_r, of the real-world model portion of hybrid policy model 400 with off-policy offline learning during training.
  • training can use learning from the simulator using the shared parameters 406, e.g., θ_shared, by merging outputs from the shared parameters and the real-world parameters, e.g., θ_shared and θ_r, at each training step (e.g., add 424 and add_1 432).
  • the output of the real-world training from output node 434 includes a trained real-world parameter, e.g., denoted by θ*_r.
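A Keras-style sketch of one possible realization of the Figure 4 hybrid policy model follows. The simulation branch widths (100/50/20) and the layer names dense, dense_1, dense_2, dense_3, lambda, lambda_1, add, add_1, and dense_8 mirror the description above; the real-world branch is simplified to two hidden layers, and the softmax outputs and the Lambda slices used to match layer widths for the Add merges are assumptions, since the exact merge dimensions are not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_hybrid_policy_model(n_inputs=4, n_actions=3):
    """Sketch of the Figure 4 hybrid policy model. The simulation branch (theta_s)
    is trained first and then frozen; a smaller real-world branch (theta_r) is
    trained on the offline dataset, with frozen simulation activations merged in
    via Add layers (sim-to-real knowledge sharing)."""
    inp = layers.Input(shape=(n_inputs,), name="kpi_input")

    # Simulation model portion (theta_s), trained online in the simulator.
    h1 = layers.Dense(100, activation="relu", name="dense")(inp)
    h2 = layers.Dense(50, activation="relu", name="dense_1")(h1)
    h3 = layers.Dense(20, activation="relu", name="dense_2")(h2)
    sim_out = layers.Dense(n_actions, activation="softmax", name="dense_3")(h3)
    sim_model = Model(inp, sim_out, name="simulation_portion")

    # Real-world model portion (theta_r), merged with shared frozen sim activations.
    r1 = layers.Dense(10, activation="relu", name="dense_real_1")(inp)
    s1 = layers.Lambda(lambda t: t[:, :10], name="lambda")(h1)    # theta_shared (slice is an assumption)
    m1 = layers.Add(name="add")([r1, s1])
    r2 = layers.Dense(5, activation="relu", name="dense_real_2")(m1)
    s2 = layers.Lambda(lambda t: t[:, :5], name="lambda_1")(h2)
    m2 = layers.Add(name="add_1")([r2, s2])
    real_out = layers.Dense(n_actions, activation="softmax", name="dense_8")(m2)
    hybrid_model = Model(inp, real_out, name="hybrid_policy_model")
    return sim_model, hybrid_model

def freeze_shared_simulation_weights(sim_model):
    """Freeze theta_s after simulation training so that gradient updates during
    offline real-world training only touch the real-world parameters theta_r."""
    for layer in sim_model.layers:
        layer.trainable = False
```

In a typical use of this sketch, the simulation branch would first be trained online (e.g., with DR), then frozen via freeze_shared_simulation_weights, after which the hybrid model would be trained on the offline dataset with a propensity-weighted update such as the one sketched earlier.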
  • the parametric ANN of Figure 4 is an example that has been provided for ease of illustration and explanation of one embodiment.
  • Other embodiments may include other parametric artificial intelligence models, or other structures of a parametric ANN including other predictions and any non-zero number of input layers having any non-zero number of input nodes, any non-zero number of neural network layers having a plural number of weight nodes, and any non-zero number of output layers having any non-zero number of output nodes.
  • the number of input nodes can be selected based on the number of measured key performance metrics (e.g., KPIs) that are to be simultaneously processed, and the number of output nodes can be similarly selected based on the number of prediction values that are to be simultaneously generated therefrom.
  • KPIs measured key performance metrics
  • a Sim-to-Real architecture can be implemented in a distributed implementation.
  • in some embodiments, n agents (e.g., parametric artificial intelligence models) can be distributed across n workers, e.g., in a RET optimization, there can be one worker node per base station.
  • hybrid policy model training can be employed in an accelerated fashion by combining local gradient updates calculated on separate workers (e.g., worker nodes) to perform global model parameter updates synchronously or asynchronously.
  • various model parallelism and data parallelism techniques can be harnessed to expedite the training process.
  • the pre-trained hybrid policy model can be re-tuned/configured based on a revised dataset from a local telecommunications network base station or cluster of neighboring base stations.
  • the process can be carried out using federated learning, e.g., an approach of distributed learning through multiple decentralized agents that does not require exchange of a dataset itself among the agents and, as a consequence, preserves privacy in, e.g., multi-vendor scenarios.
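A minimal FedAvg-style aggregation sketch consistent with the federated variant described above: only model weights (e.g., the locally trained real-world parameters θ_r) are exchanged, never the local datasets. The data layout assumes Keras-style get_weights() lists and is an assumption made for illustration.

```python
import numpy as np

def federated_average(worker_weight_lists, worker_sample_counts):
    """FedAvg-style aggregation: each worker trains its local copy of the
    real-world parameters theta_r on its own (private) dataset and shares only
    the resulting weights; the coordinator returns their sample-weighted average.
    `worker_weight_lists` is a list (one entry per worker) of lists of numpy
    arrays, e.g., as returned by a Keras model.get_weights()."""
    total = float(sum(worker_sample_counts))
    averaged = []
    for layer_idx in range(len(worker_weight_lists[0])):
        layer_sum = sum(w[layer_idx] * (n / total)
                        for w, n in zip(worker_weight_lists, worker_sample_counts))
        averaged.append(layer_sum)
    return averaged
```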
  • Various embodiments of the present disclosure include a complete architecture that may tackle the Sim-to-Real gap problem in RL concerning a mismatch between simulation and real-world environment models which can hinder the real-world deployment of RL policies trained in simulated environments.
  • in some embodiments, an RL agent (e.g., a parametric artificial intelligence model) undergoes a combined training in the simulation environment with parameters coming directly from the real-world environment, altering the training process to include real-world distributed samples in the simulation.
  • Figure 5 illustrates an offline Sim-to-Real transfer system 500 in accordance with various embodiments.
  • Offline Sim-to-Real transfer system 500 includes a hybrid policy model 206, a computer 510, and a data repository 530.
  • Implementation of the Sim-to-Real RL system includes, without limitation, implementation in a cloud-based network (e.g., in a server), a node of a network (e.g., a base station of a cellular network or a core node for a cellular network), a device communicatively connected to a network (e.g., the Internet or a cellular network), etc.
  • Sim-to-Real transfer system 500 While the components of Sim-to-Real transfer system 500 are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, a Sim-to-Real transfer system may include multiple different physical components that make up a single illustrated component (e.g., memory 516 may comprise multiple separate hard drives as well as multiple RAM modules), and the multiple components can be implemented in a distributed implementation across multiple locations or networks.
  • the computer 510 includes at least one memory 516 (“memory”) storing program code 518, a network interface 514, and at least one processor 512 (“processor”) that executes the program code 518 to perform operations described herein.
  • the computer 510 is coupled to the data repository 530 and the hybrid policy model 206.
  • the offline Sim-to-Real transfer system 500 can be connected to a cellular network and can acquire KPIs for controlling a RET angle of an antenna in the cellular network, for controlling a downlink power control of a radio downlink transmission power in the cellular network, etc.
  • processor 512 can be connected via the network interface 514 to communicate with the cellular network and the data repository 530.
  • the processor 512 may include one or more data processing circuits, such as a general purpose and/or special purpose processor (e.g., microprocessor and/or digital signal processor) that may be collocated or distributed across one or more networks.
  • the processor 512 may include one or more instruction processor cores.
  • the processor 512 is configured to execute computer program code 518 in the memory 516, described below as a non-transitory computer readable medium, to perform at least some of the operations described herein as being performed by any one or more elements of the offline Sim-to-Real transfer system 500.
  • modules may be stored in memory 516 of Figure 5, and these modules may provide instructions so that when the instructions of a module are executed by respective computer processing circuitry 512, processing circuitry 512 performs respective operations of the flow chart.
  • Each of the operations described in Figure 6 can be combined and/or omitted in any combination with each other, and it is contemplated that all such combinations fall within the spirit and scope of this disclosure.
  • a method for offline simulation-to-reality transfer for reinforcement learning.
  • the method includes training 605 a simulation model portion of a hybrid policy model.
  • the simulation model portion includes a set of simulation model parameters θ_s trained in a simulated environment.
  • the hybrid policy model includes a parametric artificial intelligence model having a parameter θ.
  • the parameter θ includes the simulation model portion including the set of simulation model parameters θ_s trained in the simulated environment and a real-world model portion including a set of real-world parameters θ_r trained on a dataset.
  • the method further includes sharing a shared subset of the set of simulation model parameters, θ_shared, with the real-world model portion of the hybrid policy model.
  • the method further includes training 609 the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θ_shared with an output of the hybrid policy model from the real-world model portion including real-world parameters θ_r at a training step.
  • the training 605 the simulation model portion includes using the simulated environment to train the set of simulation model parameters ⁇ s through online simulation training.
  • the simulated environment enables one or more of a plurality of domain randomization techniques by changing one or more simulation conditions of the simulated environment to simulate randomized environment conditions and a domain adaptation adjustment by providing feedback from the dataset.
  • the method further includes accessing 601 the dataset.
  • the dataset includes recorded observations from the real-world environment, the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy, and the accessing includes collecting the dataset by one or more of the logging policy and a pre-collected real-world dataset.
  • the method further includes preprocessing 603 the dataset with a conversion of the dataset to a format for use in the training, and the conversion includes one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
  • the sharing includes freezing at least some of the simulation model parameters θ_s to obtain the shared subset of the simulation model parameters θ_shared.
  • the sharing comprises adjusting at least some of the simulation model parameters θ_s to obtain the shared subset of the simulation model parameters θ_shared.
  • the set of real-world parameters θ_r and the shared subset of the set of simulation model parameters θ_shared are concatenated.
  • the sharing 607 includes summing an output from the simulation model portion and an output from the real-world model portion based on the concatenation. The summing is performed when the hybrid policy model receives input from the dataset.
  • the training 609 of the real-world model portion trains the set of real-world parameters θ_r of the hybrid policy model with an importance sampling technique.
  • the method further includes, subsequent to training the simulation model portion and the real-world model portion, evaluating 611 the hybrid policy model on the test set by an offline off-policy evaluation technique.
  • the environment includes a real-world cellular network
  • the hybrid policy model includes a parametric artificial intelligence model for controlling a remote electrical tilt, RET, angle of an antenna in the real-world cellular network.
  • the dataset includes a dataset generated according to a logging policy from the real-world cellular network including recorded observations from the real-world cellular network and the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy.
  • the state includes a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level.
  • the action includes a discrete tilt variation from a current RET angle of the antenna, and the reward feedback includes a function of the plurality of key performance indicators that describe a quality of service of a communication device.
  • the environment includes a real-world cellular network
  • the hybrid policy model includes a parametric artificial intelligence model for controlling a downlink power control of a radio downlink transmission power in the real-world cellular network.
  • the dataset includes a dataset generated according to a logging policy from the real-world cellular network including recorded observations from the real-world cellular network and the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy.
  • the state includes a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level.
  • the action includes a discrete power adjustment from a current downlink power
  • the reward feedback includes a function of one or more of the plurality of key performance indicators.
  • Coupled may include wirelessly coupled, connected, or responsive.
  • the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.
  • Well-known functions or constructions may not be described in detail for brevity and/or clarity.
  • the term “and/or” includes any and all combinations of one or more of the associated listed items.
  • It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus, a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts.
  • the same reference numerals or the same reference designators denote the same or similar elements throughout the specification.
  • the terms “comprise”, “comprising”, “comprises”, “include”, “including”, “includes”, “have”, “has”, “having”, or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof.
  • the common abbreviation “e.g.”, which derives from the Latin phrase “exempli gratia” may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item.
  • the common abbreviation “i.e.”, which derives from the Latin phrase “id est,” may be used to specify a particular item from a more general recitation.
  • Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits.
  • These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).
  • Example embodiments are discussed below. Reference numbers/letters are provided in parenthesis by way of example/illustration without limiting example embodiments to particular elements indicated by reference numbers/letters.
  • Embodiment 1 A computer-implemented method for offline simulation-to-reality transfer for reinforcement learning.
  • the method comprises training (605) a simulation model portion of a hybrid policy model.
  • the simulation model portion comprises a set of simulation model parameters ⁇ s trained in a simulated environment.
  • the hybrid policy model comprises a parametric artificial intelligence model having a parameter ⁇ .
  • the parameter ⁇ comprises the simulation model portion including the set of simulation model parameters ⁇ S trained in the simulated environment and a real- world model portion comprising a set of real-world parameters ⁇ r trained on a dataset.
  • the method further includes sharing (607) a shared subset of the set of simulation model parameters, ⁇ shared , with the real-world model portion of the hybrid policy model.
  • the method further includes training (609) the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters ⁇ shared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters ⁇ r at a training step.
  • Embodiment 2 The method of Embodiment 1, wherein the training (605) of the simulation model portion comprises using the simulated environment to train the set of simulation model parameters θs through online simulation training.
  • Embodiment 3 The method of Embodiment 2, wherein the online simulation training comprises an output of trained simulation weights ⁇ * s .
  • Embodiment 4 The method of any of Embodiments 2 to 3, wherein the simulated environment enables one or more of a plurality of domain randomization techniques by changing one or more simulation conditions of the simulated environment to simulate randomized environment conditions and a domain adaptation adjustment by providing feedback from the dataset.
  • Embodiment 5 The method of any of Embodiments 1 to 4, further comprising accessing (601) the dataset.
  • the dataset comprises recorded observations from the real-world environment, and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy, and the accessing comprises collecting the dataset by one or more of the logging policy and a pre-collected real-world dataset.
  • Embodiment 6. The method of any of Embodiments 1 to 5, further comprising preprocessing (603) the dataset with a conversion of the dataset to a format for use in the training.
  • the conversion comprises one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
  • Embodiment 7 The method of any of Embodiments 2 to 6, wherein the sharing comprises freezing at least some of the simulation model parameters ⁇ s to obtain the shared subset of the simulation model parameters ⁇ shared .
  • Embodiment 8 The method of any of Embodiments 2 to 6, wherein the sharing comprises adjusting at least some of the simulation model parameters ⁇ s to obtain the shared subset of the simulation model parameters ⁇ shared .
  • Embodiment 9. The method of any of Embodiments 1 to 8, wherein the set of real-world parameters θr and the shared subset of the set of simulation model parameters θshared are concatenated, and wherein the sharing (607) comprises summing an output from the simulation model portion and an output from the real-world model portion based on the concatenation, wherein the summing is performed when the hybrid policy model receives input from the dataset.
  • Embodiment 10 The method of any of Embodiments 1 to 9, wherein the training (609) of the real-world model portion trains the set of real-world parameters ⁇ r of the hybrid policy model with an importance sampling technique.
  • Embodiment 11 The method of any of Embodiments 6 to 10, further comprising, subsequent to training the simulation model portion and the real-world model portion, evaluating (611) the hybrid policy model on the test set by an offline off-policy evaluation technique.
  • Embodiment 12 The method of any of Embodiments 1 to 11, wherein the environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a remote electrical tilt, RET, angle of an antenna in the real-world cellular network.
  • Embodiment 13 The method of Embodiment 12, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete tilt variation from a current RET angle of the antenna, and wherein the reward feedback comprises a function of the plurality of key performance indicators that describe a quality of service of a communication device.
  • Embodiment 14 The method of any of Embodiments 1 to 11, wherein the environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a downlink power control of a radio downlink transmission power in the real-world cellular network.
  • Embodiment 15 The method of Embodiment 14, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete power adjustment from a current downlink power, and wherein the reward feedback comprises a function of one or more of the plurality of key performance indicators.
  • Embodiment 16 An offline simulation-to-reality transfer system (500) for reinforcement learning, the offline simulation-to-reality transfer system comprising: at least one processor (512); at least one memory (516) connected to the at least one processor (512) and storing program code that is executed by the at least one processor to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment.
  • the hybrid policy model comprises a parametric artificial intelligence model having a parameter ⁇ .
  • the parameter ⁇ comprises the simulation model portion including the set of simulation model parameters ⁇ s trained in the simulated environment and a real-world model portion comprising a set of real-world parameters ⁇ r trained on a dataset.
  • the operations further include sharing a shared subset of the set of simulation model parameters, ⁇ shared , with the real-world model portion of the hybrid policy model.
  • the operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters ⁇ shared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters ⁇ r at a training step.
  • Embodiment 17 The offline simulation-to-reality transfer system (500) of Embodiment 16, wherein the at least one memory (516) is connected to the at least one processor (512) and stores program code that is executed by the at least one processor to perform operations according to any of Embodiments 2 to 15.
  • Embodiment 18 An offline simulation-to-reality transfer system (500) for reinforcement learning, the offline simulation-to-reality transfer system adapted to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters ⁇ s trained in a simulated environment.
  • the hybrid policy model comprises a parametric artificial intelligence model having a parameter ⁇ .
  • the parameter ⁇ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters ⁇ r trained on a dataset.
  • the operations further include sharing a shared subset of the set of simulation model parameters, ⁇ shared , with the real-world model portion of the hybrid policy model.
  • the operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
  • Embodiment 19 The offline simulation-to-reality transfer system (500) of Embodiment 18, adapted to perform operations according to any of Embodiments 2 to 15.
  • Embodiment 20 A computer program comprising program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters ⁇ s trained in a simulated environment.
  • the hybrid policy model comprises a parametric artificial intelligence model having a parameter ⁇ .
  • the parameter ⁇ comprises the simulation model portion including the set of simulation model parameters ⁇ s trained in the simulated environment and a real-world model portion comprising a set of real-world parameters ⁇ r trained on a dataset.
  • the operations further include sharing a shared subset of the set of simulation model parameters, ⁇ shared , with the real-world model portion of the hybrid policy model.
  • the operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters ⁇ shared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters ⁇ r at a training step.
  • Embodiment 21 The computer program of Embodiment 20, whereby execution of the program code causes the offline simulation-to-reality transfer system (500) to perform operations according to any of Embodiments 2 to 15.
  • Embodiment 22 A computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment.
  • the hybrid policy model comprises a parametric artificial intelligence model having a parameter ⁇ .
  • the parameter ⁇ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters ⁇ r trained on a dataset.
  • the operations further include sharing a shared subset of the set of simulation model parameters, ⁇ shared , with the real-world model portion of the hybrid policy model.
  • the operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
  • Embodiment 23 The computer program product of Embodiment 22, whereby execution of the program code causes the offline simulation-to-reality transfer system (500) to perform operations according to any of Embodiments 2 to 15.
  • BS Base Stations
  • UE User Equipment

Abstract

A computer-implemented method for offline simulation-to-reality transfer for reinforcement learning is provided. The method includes training a simulation model portion of a hybrid policy model. The hybrid policy model includes a parametric artificial intelligence model. The parametric model includes the simulation model portion including a set of simulation model parameters trained in a simulated environment and a real-world model portion including a set of real-world parameters trained on a dataset. The method further includes sharing a shared subset of the set of simulation model parameters with the real-world model portion. The method further includes training the real-world model portion on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters with an output of the hybrid policy model from the real-world model portion including real-world parameters at a training step.

Description

OFFLINE SIMULATION-TO-REALITY TRANSFER FOR REINFORCEMENT LEARNING
[0001] The present disclosure relates generally to computer-implemented methods for offline simulation-to-reality (Sim-to-Real) transfer for reinforcement learning, and related methods and apparatuses.
BACKGROUND
[0002] In the context of Reinforcement Learning (RL), an agent interacts with a stochastic environment and observes reward feedback as a result of taking an action in a given state. An agent (e.g., a model, including one or more processors, or a model communicatively connected to one or more processors) takes actions in the environment. Reward feedback for the action can be fed back to the agent as a measure of success or failure of the agent’s action in a given state. Through this interaction, the agent learns a policy π, that is, a control law mapping states to probabilities over actions. At any time step t, the agent finds itself in state s_t, takes action a_t by following the policy π, transitions to the next state s_{t+1}, and in turn receives reward sample r_t as feedback from such interaction with the environment. Reward feedback can be a numerical value received by the agent from the environment as a response to the agent’s action. Reward feedback can be, for example, a positive numerical value for a positive action or a negative numerical value for a negative action. The numerical value can be determined by a reward function, e.g., reward feedback for a first state can be specified to have a numerical value of 1, reward feedback for a second (or more) states can be specified to have a numerical value of 0, etc. This interaction happens in a cycle where the agent tries to evaluate and improve its policy π until it eventually comes up with an optimal policy π*. A RL problem can be defined based on a Markov Decision Process (MDP), which is described by a five-element tuple M = (S, A, p, r, γ), where: S is the set of possible states; A is the set of possible actions; p: S × A × S → [0,1] is the transition probability for going from state s to s' when selecting action a; r(s, a) is the reward function, received for being in a state s and taking action a; and γ ∈ [0,1] is the discount factor, accounting for delayed rewards.
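By way of a non-limiting illustration, the following Python sketch shows the agent-environment interaction cycle described above. The toy environment, its two states, its transition probabilities, and its reward values are illustrative placeholders and are not part of the present disclosure.

```python
import random

# Minimal sketch of the agent-environment loop: the agent observes a state,
# takes an action under a policy, and receives a reward and the next state.
class ToyEnvironment:
    def __init__(self):
        self.states = [0, 1]
        self.actions = [0, 1]
        self.state = 0

    def step(self, action):
        # Transition probability p(s' | s, a): switch state with probability
        # 0.8 when action == 1, otherwise stay in the current state.
        if action == 1 and random.random() < 0.8:
            next_state = 1 - self.state
        else:
            next_state = self.state
        # Reward function r(s, a): +1 for reaching state 1, 0 otherwise.
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward

def random_policy(state, actions):
    # pi(a | s): here simply a uniform distribution over actions.
    return random.choice(actions)

env = ToyEnvironment()
gamma = 0.9                      # discount factor
cumulative_reward = 0.0
state = env.state
for t in range(10):
    action = random_policy(state, env.actions)   # a_t ~ pi(. | s_t)
    next_state, reward = env.step(action)        # observe s_{t+1}, r_t
    cumulative_reward += (gamma ** t) * reward
    state = next_state
print("discounted return:", cumulative_reward)
```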
[0003] A main objective of an RL agent is to learn an optimal policy π* with a goal of maximizing the expected cumulative rewards. As described below, some RL algorithm approaches used to try to achieve this goal are now discussed.
[0004] A Q-Learning approach is a value-function-based algorithm where the state-action value function (Q-function), denoted by Q^π(s, a), quantifies the desirability of taking an action a in state s and thereafter following policy π. Q-learning aims at learning an optimal policy by maximizing the Q-function: π* = argmax_π Q^π(s, a).
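By way of a non-limiting illustration, the following sketch shows a tabular Q-learning update. The environment interface (env.reset, env.step, env.actions) and the hyper-parameters are assumptions made for the example only.

```python
import random
from collections import defaultdict

# Illustrative tabular Q-learning with an epsilon-greedy behaviour policy.
def q_learning(env, num_episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: {a: 0.0 for a in env.actions})
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection derived from the current Q estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # Q-learning target bootstraps with the greedy (max) action in s'
            best_next = max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    # Greedy policy pi*(s) = argmax_a Q(s, a)
    return {s: max(acts, key=acts.get) for s, acts in Q.items()}
```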
[0005] A Policy Gradient (PG) approach is a policy-based algorithm that aims to optimize the policy directly through optimization methods such as gradient-based methods. The policy π_θ(a | s) is generally parameterized with respect to θ ∈ R^d. The value of the objective function depends on the policy, and various algorithms can be applied to optimize θ on the maximization objective: θ* = argmax_θ J(θ), where J(θ) = E_{π_θ}[Σ_t γ^t r(s_t, a_t)].
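By way of a non-limiting illustration, the following sketch shows a REINFORCE-style policy-gradient update for a tabular softmax policy. The state/action sizes, learning rate, and trajectory format are assumptions made for the example; they are not taken from the disclosure.

```python
import numpy as np

# Sketch of a policy-gradient (REINFORCE) update for a softmax policy over
# discrete states and actions.
n_states, n_actions, gamma, lr = 4, 3, 0.9, 0.01
theta = np.zeros((n_states, n_actions))          # policy parameters theta

def policy(state):
    prefs = theta[state]
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()                   # pi_theta(a | s)

def reinforce_update(trajectory):
    """trajectory: list of (state, action, reward) tuples from one episode."""
    G = 0.0
    returns = []
    for _, _, r in reversed(trajectory):         # discounted returns-to-go
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(trajectory, returns):
        probs = policy(s)
        grad_log = -probs                        # d log pi(a|s) / d theta[s, :]
        grad_log[a] += 1.0
        theta[s] += lr * G_t * grad_log          # ascend the objective J(theta)

# usage: reinforce_update([(0, 1, 1.0), (2, 0, 0.0)])
```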
[0006] An Actor-Critic (AC) approach is a class of RL algorithms that combines policy-based and value-based methods. The critic evaluates the policy using a value function, which in turn provides a better estimate for the actor to update the policy parameters in the gradient direction governed by the critic.
SUMMARY
[0007] In various embodiments, operations of a method for offline simulation-to-reality transfer for reinforcement learning are provided. The method includes training a simulation model portion of a hybrid policy model. The simulation model portion includes a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model includes a parametric artificial intelligence model having a parameter θ. The parameter θ includes the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion including a set of real-world parameters θr trained on a dataset. The method further includes sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The method further includes training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion including real-world parameters θr at a training step.
[0008] In some embodiments, further operations include accessing the dataset. The dataset includes recorded observations from the real-world environment. The recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy. The accessing includes collecting the dataset by one or more of the logging policy or policies and a pre-collected real-world dataset.
[0009] In some embodiments, further operations include processing the dataset with a conversion of the dataset to a format for use in the training. The conversion includes one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
[0010] In some embodiments, further operations include, subsequent to training the simulation model portion and the real-world model portion, evaluating the hybrid policy model on the test set by an offline off-policy evaluation technique.
[0011] Corresponding embodiments of inventive concepts for a simulation-to-reality system, computer program products, and computer programs are also provided.
[0012] Potential advantages of disclosed embodiments include knowledge sharing between the two environments’ representations that can result in a hybrid policy trained in simulation that is closer to reality. For example, the outcome of such training may create a policy having increased reliability and robustness towards discrepancies between the real-world and the simulation environment. Furthermore, a balance may be struck between exploration and exploitation during training by taking advantage of simulation to promote exploration while enhancing exploitation during real-world interaction. Additionally, such training can include offline off-policy samples to have a policy closer to reality through the injection of real-world samples in the training; and training of a policy more safely without interaction with the real environment.
BRIEF DESCRIPTION OF DRAWINGS
[0013] The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this application, illustrate certain non- limiting embodiments of inventive concepts. In the drawings:
[0014] Figure 1 is a diagram illustrating a Remote Electrical Tilt angle between a beam of an antenna pattern and a horizontal plane of an antenna in a telecommunication network in accordance with various embodiments of the present disclosure;
[0015] Figure 2 is a block diagram of an architecture for offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments of the present disclosure;
[0016] Figure 3 is a block diagram and data flow diagram for training an offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments;
[0017] Figure 4 is a block diagram and data flow diagram of a hybrid policy model comprising an artificial neural network in accordance with some embodiments;
[0018] Figure 5 is a block diagram of operational modules and related circuits and controllers of an offline Sim-to-Real transfer system for reinforcement learning in accordance with some embodiments; and
[0019] Figure 6 is a flow chart of operations that may be performed by the offline Sim-to- Real transfer system for reinforcement learning in accordance with some embodiments.
DETAILED DESCRIPTION
[0020] Inventive concepts will now be described more fully hereinafter with reference to the accompanying drawings, in which examples of embodiments of inventive concepts are shown. Inventive concepts may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of present inventive concepts to those skilled in the art. It should also be noted that these embodiments are not mutually exclusive. Components from one embodiment may be tacitly assumed to be present/used in another embodiment.
[0021] The following description presents various embodiments of the disclosed subject matter. These embodiments are presented as teaching examples and are not to be construed as limiting the scope of the disclosed subject matter. For example, certain details of the described embodiments may be modified, omitted, or expanded upon without departing from the scope of the described subject matter.
[0022] A limitation with training a RL model in a real-world environment includes concerns regarding performance disruption in the real-world environment caused by uncontrolled exploration. Challenges exist, however, in building a reliable simulation model due to, e.g., modeling errors, uncontrollable stochastic effects of the real-world environment, etc. The discrepancy between a simulation model and the real-world environment is referred to as a simulator-to-reality (Sim-to-Real) gap. The Sim-to-Real gap hinders real-world deployment of RL policies trained in simulated environments.
[0023] Disclosed is a computer-implemented method for offline Sim-to-Real transfer for RL. The method includes a simulation model portion for training of a set of simulation model parameters trained in a simulated environment and a real-world portion for training of a set of real-world parameters on a dataset from the real-world environment. The method includes sharing of a shared subset of the simulation model parameters with the real-world model portion trained on the dataset. As a consequence, potential advantages of disclosed embodiments include knowledge sharing between the two environments’ representations that can result in a hybrid policy trained in simulation that is closer to reality.
[0024] A dichotomy of RL methods is the distinction between on-policy and off-policy methods. The former aims at learning the same policy that is used to interact with the environment, while the latter aims at learning a policy (target policy π) that is different from the policy that is used to interact with the environment (behavior policy π0) and thus collect trajectory samples. When, in addition to learning from samples from the behavior policy, the agent (e.g., a parametric artificial intelligence model) is required to base its learning strategy solely on a fixed batch of data which cannot be expanded further, the learning can be referred to as Batch RL or Offline RL.
[0025] Importance Sampling (IS) is a technique known to those of skill in the art that can be used in offline off-policy learning. IS includes reweighting returns using an Importance Sampling Ratio (ISR), defined for a trajectory starting from state s_t and ending in state s_T as ρ_{t:T} = Π_{k=t}^{T} π(a_k | s_k) / π_0(a_k | s_k). Such a ratio can be used to compensate for a distribution mismatch between a logging policy π_0 and a learning policy π, and the policy estimator based on the IS is an unbiased estimator of the value of a policy.
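By way of a non-limiting illustration, the following sketch computes the importance sampling ratio and an IS-weighted value estimate. The trajectory format (lists of state, action, reward tuples) and the two policy functions are assumptions for the example.

```python
import numpy as np

# Sketch of the importance sampling ratio and an IS-weighted return estimate.
def importance_sampling_ratio(trajectory, target_policy, logging_policy):
    """rho_{t:T} = prod_k pi(a_k | s_k) / pi_0(a_k | s_k) over the trajectory."""
    ratio = 1.0
    for state, action, _ in trajectory:
        ratio *= target_policy(action, state) / logging_policy(action, state)
    return ratio

def is_value_estimate(trajectories, target_policy, logging_policy, gamma=0.9):
    """Off-policy estimate of the target policy's value from logged trajectories."""
    estimates = []
    for traj in trajectories:
        rho = importance_sampling_ratio(traj, target_policy, logging_policy)
        ret = sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj))
        estimates.append(rho * ret)
    return float(np.mean(estimates))
```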
[0026] Sim-to-Real gap will now be discussed further. Simulators can provide an effective infrastructure to train RL policies without concerns about performance disruption caused by uncontrolled exploration, which represents one of the main limitations in training RL algorithms in real- world environments. However, it can be challenging to build reliable simulation models due to modeling errors (e.g., inherent modeling errors) or uncontrollable stochastic effects of the real-world environment.
[0027] Some approaches have been explored to bridge the Sim-to-Real gap. See e.g., X. B. Peng, M. Andrychowicz, W. Zaremba and P. Abbeel, "Sim-to-Real Transfer of Robotic Control with Dynamics Randomization," IEEE International Conference on Robotics and Automation ICRA, pp. 3803-3810, 2018 (“Peng”); L. Pinto, M. Andrychowicz, P. Welinder, W. Zaremba and P. Abbeel, "Asymmetric Actor Critic for Image-Based Robot Learning," Proceedings of Robotics: Science and Systems, 2018 (“Pinto”); A. A. Rusu, N. Heess, M. Vec’erik, T. Rothörl, R. Pascanu and R. Hadsell, "Sim-to-Real Robot Learning from Pixels with Progressive Nets," Proceedings of the 1st Annual Conference on Robot Learning, vol. 78, pp. 262-270, 2017 (“Rusu”).
[0028] Two approaches are: (1) Domain Adaptation (DA): starting from a model trained on a task in a given domain (source domain) to adapt to another domain (target domain). In the context of Sim-to-Real, DA includes updating the simulation parameters to make them closer to the real-world observations; and (2) Domain Randomization (DR): an ample number of random variations are generated in the simulation environments and the model is trained across these randomized settings. This may make the trained model robust to changes of conditions in the real-world environment.
[0029] For example, Peng describes using DR by randomizing the dynamics of a simulator to try to develop robust policies capable of adapting to real-world dynamics. A multi-goal formulation is described where different goals g ∈ G are presented for every episode, which results in a policy function π(a | s, g) and a reward function r(s_t, a_t, g) with an extra input.
[0030] In another approach, Pinto describes leveraging learning from a simulator by using DR to try to achieve robustness in performance of policies trained in simulation and deployed to the real world. In particular, DR is applied to visual aspects during rendering from the simulator to try to accomplish Sim-to-Real transfer without training with real-world data, using an actor-critic setup.
[0031] Although Peng and Pinto describe using off-policy learning techniques where they utilize segments of a history of experiences in the form of an experience replay, neither approach includes a complete offline learning approach in which there is no longer interaction with the environment. Peng and Pinto each involve online learning both in simulation as well as in the real world. As a consequence, concerns exist about performance disruption caused by uncontrolled exploration in the real-world environment.
[0032] In another approach, N. C. Rabinowitz, G. Desjardins, A.-A. Rusu, K. Kavukcuoglu, R. T. Hadsell, J. Kirkpatrick and H. J. Soyer, "Progressive Neural Networks", U.S. Published Patent Application, International Publication Number WO2017/200597 A1, 23 November 2017 (“Rabinowitz”), describes a technique called “Progressive Networks” (progressive net) for transferring knowledge to a sequence of machine learning (ML) tasks. Rabinowitz includes the use of Artificial Neural Networks (ANNs) where the ANNs used to train on subsequent tasks utilize outputs coming from preceding layers that were trained on prior tasks. Rabinowitz, however, does not address applicability to Sim-to-Real RL tasks. As a consequence, Sim-to-Real transfer learning is not described in Rabinowitz.
[0033] Potential applicability of Rabinowitz to Sim-to-Real is explored in A. A. Rusu, N. Heess, M. Vecerik, T. Rothörl, R. Pascanu and R. Hadsell, "Sim-to-Real Robot Learning from Pixels with Progressive Nets," Proceedings of the 1st Annual Conference on Robot Learning, vol. 78, pp. 262-270, 2017 (“Rusu”). Rusu describes employing an ANN architecture (progressive net) to transfer learning through shared connections from simulation to real-world experiments. First, the model is trained in a simulation environment. Subsequently, another model is deployed to train from the real world with the help of learnable joint connections to enhance knowledge sharing. However, the approach used in Rusu includes active interaction with the real-world environments, so its application is confined to online learning use cases. As a consequence, concerns exist about performance disruption caused by uncontrolled exploration in the real-world environment for the online learning use cases.
[0034] Potential problems for RL for remote electrical tilt (RET) optimization of an antenna at/near a base station for a telecommunication network will now be discussed. Figure 1 is a diagram illustrating a Remote Electrical Tilt (RET) angle between a beam of an antenna pattern and a horizontal plane of an antenna in a telecommunication network (such as a cellular network) in accordance with various embodiments of the present disclosure. RET refers to controlling an electrical antenna tilt angle remotely, where the RET angle is defined as an angle between a beam (e.g., a main beam) of an antenna pattern and a horizontal plane (see θ in Figure 1). Self-Organizing Networks (SON) is a telecommunication network management framework introduced by the 3rd Generation Partnership Program (3GPP) aimed at optimizing, configuring, troubleshooting, and healing cellular networks autonomously. A SON framework may provide an approach for RET optimization.
[0035] Traditionally, in some approaches, rule-based or control-based heuristic algorithms have been employed to control the tilt angle of each antenna at/near a base station of a telecommunication network (see e.g., V. Buenestado, M. Toril, S. Luna-Ramirez, J. M. Ruiz-Avilés and A. Mendo, "Self-tuning of Remote Electrical Tilts Based on Call Traces for Coverage and Capacity Optimization in LTE," IEEE Transactions on Vehicular Technology, vol. 66, no. 5, pp. 4315- 4326, 2017 (“Buenestado”) and A. Engels, M. Reyer, X. Xu, R. Mathar, J. Zhang and H. Zhuang, "Autonomous Self-Optimization of Coverage and Capacity in LTE Cellular Networks," IEEE Transactions on Vehicular Technology, vol. 62, no. 5, pp. 1989-2004, 2013 (“Engels”).
[0036] Additional approaches include RL techniques to try to solve the RET optimization problem. See e.g., S. Fan, H. Tian and C. Sengul, "Self-optimization of coverage and capacity based on a fuzzy neural network with cooperative reinforcement learning," EURASIP Journal on Wireless Communications and Networking, no. 1, p. 57, 2014 (“Fan”); E. Balevi and J. Andrews, "Online Antenna Tuning in Heterogeneous Cellular Networks With Deep Reinforcement Learning," IEEE Transactions on Cognitive Communications and Networking, pp. 1-1, 2019 (“Balevi”); F. Vannella, J. Jeong and A. Proutiere, "Off-policy Learning for Remote Electrical Tilt Optimization," 2020 (“Vannella”). RL algorithm approaches may have shown improved performance compared to traditional rule-based or optimization-based methods, including for fast adaptation to abrupt changes in the environment. However, in Fan and Balevi, the use of RL is limited to simulation environments. Moreover, the approach of Vannella, as well as approaches described in non-published internal reference implementation, only consider off-policy learning from offline data and do not consider a combination of simulated and real-world environment.
[0037] The following explanation of potential problems with some approaches is a present realization as part of the present disclosure and is not to be construed as previously known by others. Approaches that have been explored to bridge the Sim-to-Real gap may suffer from the following disadvantages:
[0038] Potential problems with approaches to bridge the Sim-to-Real gap in simulation- based RL: Due to uncontrolled RL exploration which makes real-world training risky and thus infeasible, RL algorithms usually train policies in a simulated environment. However, simulators are subject to modeling errors or random environment effects not captured by the simulator. This may make the policy trained in simulation non-robust to discrepancies between real environment and simulator and thus unreliable when deployed.
[0039] For example, some RL applications for telecommunication industrial use-cases consider training a RL agent in the simulation environment (e.g. RET optimization described in Fan and Balevi), which may result in serious limitations to the deployability of RL algorithms in the real network.
[0040] Potential problems with approaches using infeasible online experiments for Sim-to- Real techniques: Many Sim-to-Real RL techniques envision a possibility of executing real-world experiments (see e.g., Pinto, Rusu, and Rabinowitz). In real-world use-cases, executing real-world experiments may not be possible since experiments may not be possible in many real-world applications (e.g., RET optimization) due to a high risk of uncontrolled online interaction.
[0041] Potential problems with approaches using offline off-policy learning: An offline RL agent learns from an offline dataset, gathered by a behavior policy without active interactions with the real-world. As a consequence, offline techniques are strongly limited by the amount and quality of offline data from which the policy is learned. As a consequence of this limitation, it may become difficult to derive an optimal or improved policy by solely using the offline data.
[0042] Various embodiments of the present disclosure may provide solutions to these and other potential problems. In various embodiments of the present disclosure, a method is provided for transferring a RL policy from a simulated environment to a real environment (Sim-to-Real transfer) through offline learning. The method includes a simulated environment and offline data. As a consequence, the method may compensate for the shortcomings of both simulation learning and offline off-policy learning.
[0043] In various embodiments, simulation may enable achieving better generalization with respect to conventional offline learning as it complements offline learning with learning through unseen simulated trajectories; while the offline learning may enable closing the Sim-to-Real gap by exposing the agent to real-world samples.
[0044] The method of various embodiments of the present disclosure includes training a hybrid policy model. The hybrid policy model includes a portion trained in simulation and a portion trained on offline real-world data. As a consequence, samples from the real-world data and the simulated environment can be merged by modifying conventional RL training procedures through knowledge sharing between representations of the two environments.
[0045] Potential advantages provided by training the hybrid policy model that combines learning from a simulation and an offline dataset from a real-world environment may include that the outcome of such training may create a policy having increased reliability and robustness towards discrepancies between the real-world and the simulation environment.
[0046] Additional potential advantages may include striking a balance between exploration and exploitation during training by taking advantage of simulation to promote exploration while enhancing exploitation during real-world interaction. This aspect of the method may be important in real-world applications in which the performance of a system under control cannot be degraded arbitrarily.
[0047] Further potential advantages may include improvement of real-world sample efficiency, e.g., reducing the number of real-world data samples required to learn an optimal policy with the help of simulation used to generalize the offline learning to unseen trajectories. The method of various embodiments can include offline off-policy samples to have a policy closer to the reality through the injection of real-world samples in the training.
[0048] Additional potential advantages may include allowing a policy to be trained more safely without interaction with the real environment. This may be particularly important in RL applications where the interaction with the real world can disrupt the performance of the real system because of uncontrolled exploration. Furthermore, the method may enhance both simulation training and offline training by integrating the two.
[0049] An architecture for RL Sim-to-Real transfer for RL learning will now be described. Figure 2 is a block diagram of an architecture 200 for offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments of the present disclosure.
[0050] Figure 2 includes the following components: a simulation environment 202, an offline real-world dataset 204, a hybrid policy model 206, an online simulation training 208, sim and real knowledge sharing 210, and offline real-world dataset training 212. While the components of Figure 2 are depicted as single boxes, in practice, the components may be combined in a single component or may comprise different physical components that make up a single illustrated component or a combination of one or more components.
[0051] Simulation environment 202 includes an “imperfect” simulation model of a true real-world environment. It is noted that if a simulation model is perfect, there is no need for a Sim-to-Real method. Thus, an imperfect simulation model refers to a simulation model having a Sim-to-Real gap. Simulation environment 202 can be related to the particular use-case in which the architecture is deployed (e.g. network simulator, robotic simulator, etc.). In some embodiments, simulation environment 202 enables DR by changing simulation parameters to simulate randomized environment conditions.
[0052] Offline real-world dataset 204 (also referred to herein as a dataset, an offline dataset, and/or a real-world dataset) includes a real-world dataset of recorded observations from the real-world environment collected according to a logging policy. As referred to herein, a logging policy is a policy that was previously operating in a real-world system and has collected data of its interaction, or an ad-hoc logging policy optimized to explore the state-action space. The dataset 204 collected according to the logging policy includes different accumulated observations in the form of state, action, and rewards by following the actions governed by the logging policy.
[0053] Hybrid policy model 206 includes a parametric artificial intelligence model (e.g., an artificial neural network (ANN), a Regression Model, a Decision Tree, etc.) with parameter θ. The model parameter vector includes a portion that is trained in the real world θr and a portion trained in the simulation θs, in other words θ = θr ∪ θs. Hybrid policy model 206 can enable integration of training elements from the simulation environment and real-world data.
[0054] Embodiments discussed herein are explained in the non-limiting context of parameter θ. Parameter θ includes any parameter of a parametric artificial intelligence model, including without limitation, a learning rate, an outcome value, a reward value, an exploration parameter, a discount factor, etc.
[0055] Online simulation training 208 trains the simulation parameters θs of the hybrid policy model in the simulation environment. In some embodiments, DR can be used to train the hybrid policy across randomized parameter settings, or the parameters can be adjusted by taking feedback from real-world data/training through sim and real knowledge sharing 210 and offline real-world dataset training 212 (DA technique). Online simulation training 208 outputs trained simulation weights, e.g. denoted by θ*s.
[0056] Sim-to-real knowledge sharing 210 includes sharing based on freezing and sharing a subset of the simulation model parameters. As referred to herein, sharing refers to sharing a full or a partial subset of the simulation model parameters, i.e. θshared ⊆ θ*s, with the real-world model portion, e.g. θr. Freezing means that the shared subset of the simulation model parameters (i.e., θshared) is kept fixed and gradient updates are not executed on these weights during training. In some embodiments, the real-world training parameters, e.g. θr and θshared, are concatenated in a way that their outputs (from the sim and real portions) at each layer can be combined by summing them after hybrid policy model 206 is provided with input from the real-world dataset 204. In additional or alternative embodiments, the sharing is based on fine-tuning the subset of the set of simulation model parameters (i.e., θshared), e.g., to increase the performance.
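By way of a non-limiting illustration, the following numpy sketch shows the sharing step for a single layer: a subset of trained simulation weights is frozen and its output is merged by summation with the output of the trainable real-world weights. The layer sizes and the random initialization are illustrative assumptions.

```python
import numpy as np

# Sketch of freezing theta_shared (a copy of trained simulation weights) and
# merging its layer output with the trainable real-world layer by summation.
rng = np.random.default_rng(0)

theta_s_star = {"W": rng.normal(size=(4, 20)), "b": np.zeros(20)}        # trained in simulation
theta_shared = {k: v.copy() for k, v in theta_s_star.items()}            # frozen: never updated
theta_r = {"W": rng.normal(scale=0.01, size=(4, 20)), "b": np.zeros(20)} # trainable on real data

def hybrid_layer(x):
    # Output of the frozen simulation branch ...
    sim_out = x @ theta_shared["W"] + theta_shared["b"]
    # ... summed with the output of the trainable real-world branch.
    real_out = x @ theta_r["W"] + theta_r["b"]
    return sim_out + real_out

x = rng.normal(size=(1, 4))       # e.g. a 4-dimensional KPI state vector
print(hybrid_layer(x).shape)      # (1, 20)
```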
[0057] Offline real-world dataset training 212 can train the real-world parameters, e.g. θr, of the hybrid policy with off-policy offline learning with the help of IS techniques during training. Training 212 can use learning from the simulator using the shared parameters, e.g. θshared, by merging outputs from the shared parameters and the real-world parameters, e.g. θshared and θr, at each training step. Merging outputs is discussed further herein with reference to sim-to-real knowledge sharing and concatenation with respect to Figure 4. The output of the real-world training 212 includes a trained real-world parameter, e.g. denoted by θ*r. In some embodiments, an inference process is executed by the portion of the hybrid policy model 206 containing both the shared subset of simulation model parameters and the trained real-world parameters, e.g. θshared and θ*r, of hybrid policy model 206, e.g. denoted by θ* = θshared ∪ θ*r.
[0058] Figure 3 is a block diagram and data flow diagram for training an offline Sim-to-Real transfer for reinforcement learning in accordance with some embodiments.
[0059] Referring to Figure 3, the method 300 includes collecting 302 a real-world offline dataset. The real-world dataset can be collected by following one or more logging policies. In additional or alternative embodiments, method 300 can start from a pre-collected real-world offline dataset.
[0060] Method 300 can further include preprocessing 304 the real-world offline dataset. Preprocessing 304 can include, but is not limited to, preparing the real-world dataset for processing such as, for example, by converting the real-world dataset to a format that is suitable for training. In some embodiments, preprocessing 304 includes data preparation techniques, data cleaning, instances partitioning, feature tuning, feature extraction, feature construction, etc. In some embodiments, to execute offline off-policy learning on the real-world dataset through an IS technique, inverse propensity scores are computed through logistic regression as described, for example, in non-published internal reference implementation. In some embodiments, the dataset is split into training and test sets.
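By way of a non-limiting illustration, the following sketch estimates logging-policy propensities with logistic regression and splits the dataset into training and test sets using scikit-learn. The synthetic state, action, and reward arrays are placeholders for the recorded observations and are not taken from the disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch of preprocessing 304: estimate pi_0(a | s) via multinomial logistic
# regression, derive inverse propensity scores, and split train/test.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))         # recorded KPI states (placeholder)
actions = rng.integers(0, 3, size=1000)     # logged discrete actions (placeholder)
rewards = rng.normal(size=1000)             # logged reward feedback (placeholder)

clf = LogisticRegression(max_iter=1000).fit(states, actions)
propensities = clf.predict_proba(states)[np.arange(len(actions)), actions]
inv_propensities = 1.0 / np.clip(propensities, 1e-3, None)   # inverse propensity scores

(train_s, test_s, train_a, test_a,
 train_r, test_r, train_ip, test_ip) = train_test_split(
    states, actions, rewards, inv_propensities, test_size=0.2, random_state=0)
```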
[0061] Still referring to Figure 3, method 300 can further include creating 306 a hybrid policy model (e.g., hybrid policy model 206). As discussed above, a hybrid policy model includes a parametric artificial intelligence model (e.g., an ANN, Regression Model, Decision Tree, etc.) with parameter θ. The model parameter vector includes a portion that is trained in the real world, θr, and a portion trained in the simulation, θs; in other words, θ = θr ∪ θs. In some embodiments, the two portions (i.e., the portion trained in the real world, θr, also referred to herein as the “real-world model portion”, and the portion trained in the simulation, θs, also referred to herein as the “simulation model portion”) can be trained using two separate loss functions using the same RL framework formulation and algorithm (e.g. Deep Q-Network (DQN), Actor-Critic, etc.). In some embodiments, the structure of the two portions may differ from one another. In some embodiments, the simulation model portion has more trainable parameters than the real-world model portion, e.g. |θs| > |θr|, as online simulation training is exposed to a variety of environmental settings and tries to optimize across the various settings.
[0062] Still referring to Figure 3, method 300 can further include training 308 the hybrid policy model in the simulated environment with DR. In some embodiments, the set of simulation parameters θs of the hybrid policy model is trained in the simulated environment with different randomized settings by using the DR approach. DR can also be performed by extracting information from the offline real-world dataset, e.g. measuring stochasticity from the offline real-world dataset and injecting the measurement in the form of noise in the simulator.
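By way of a non-limiting illustration, the following sketch shows a domain-randomization training loop in which simulator conditions are re-sampled per episode. The simulator and agent interfaces, and the randomized quantities (traffic load, noise level), are assumptions made only for the example.

```python
import random

# Sketch of training 308 with domain randomization: simulator parameters are
# re-sampled for each training episode so the policy sees varied conditions.
def randomize_simulator(sim):
    sim.traffic_load = random.uniform(0.1, 1.0)   # randomized environment condition
    sim.noise_std = random.uniform(0.0, 0.5)      # e.g. stochasticity measured from real data
    return sim

def train_in_simulation(sim, agent, episodes=1000):
    for _ in range(episodes):
        sim = randomize_simulator(sim)            # new randomized setting per episode
        state = sim.reset()
        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, done = sim.step(action)
            agent.update(state, action, reward, next_state)   # online simulation training
            state = next_state
    return agent
```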
[0063] Method 300 can further include training 310 the hybrid policy model with the real-world dataset using propensities. In some embodiments, the set of real-world parameters θr of the hybrid policy model is trained on the real-world offline dataset utilizing the shared pre-trained parameters θshared from simulation training. In some embodiments of offline off-policy learning, gradient updates of the training parameters θr use the ISR ρt in the form of computed propensities (e.g., from an operation of preprocessing 304) for each data sample using an update rule. In some embodiments, the update rule can be computed as:
θ_{r,t+1} = θ_{r,t} + α ρ_t δ_t ∇_{θr} V(s_t; θ_{r,t}), where α is the learning rate, δ_t = r_t + γ V(s_{t+1}; θ_{r,t}) − V(s_t; θ_{r,t}) is the temporal difference value error with discount factor γ, ∇_{θr} V(s_t; θ_{r,t}) is the gradient of the value function with respect to θ_{r,t}, and ρ_t corresponds to the importance sampling ratio at time step t.
[0064] Still referring to Figure 3, method 300 can further include evaluating 312 the hybrid policy model using propensities. In some embodiments, after training the hybrid policy model in both simulation and the real-world environment, the hybrid policy model is then evaluated on the test dataset by offline off-policy evaluation techniques (e.g. IS).
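By way of a non-limiting illustration, the following sketch implements an importance-sampling-weighted temporal-difference update of the real-world parameters θr as in the update rule of paragraph [0063]. The linear value function, the feature dimension, and the batch format are simplifying assumptions.

```python
import numpy as np

# Sketch of the IS-weighted TD update of theta_r on offline data.
def offline_update(theta_r, batch, alpha=0.01, gamma=0.9):
    """batch: iterable of (state, reward, next_state, rho_t) tuples."""
    for s, r, s_next, rho in batch:
        v_s = theta_r @ s                        # V(s_t; theta_r)
        v_next = theta_r @ s_next                # V(s_{t+1}; theta_r)
        delta = r + gamma * v_next - v_s         # temporal-difference error
        grad = s                                 # gradient of a linear value function
        theta_r = theta_r + alpha * rho * delta * grad
    return theta_r

theta_r = np.zeros(4)
batch = [(np.ones(4), 1.0, np.zeros(4), 0.8)]    # one illustrative sample
theta_r = offline_update(theta_r, batch)
```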
[0065] Exemplary embodiments of the method of the present disclosure for RET optimization and power optimization will now be described. Such exemplary embodiments are non-limiting use cases of the method of the present disclosure, and other use cases of the method may be used.
[0066] An exemplary embodiment of an offline Sim-to-Real method for RET optimization is now discussed.
[0067] In some embodiments, a RL framework for an offline Sim-to-Real method for RET optimization includes, without limitation, states, actions, and rewards.
[0068] A state(s) includes a set of measured key performance indicators (KPIs) collected at a cell, cluster, or network level depending on the implementation. In some embodiments, the state contains recorded KPIs at the cell level for every cell and the hybrid policy model is trained across different cells independently. In some embodiments, the state at time t may include the following vector s_t = [cDOF(t), qDOF(t), ψ(t)] ∈ [0,1] × [0,1] × [ψ_min, ψ_max], where ψ(t) is the current vertical tilt angle, and cDOF(t), qDOF(t) are a measure of the coverage and capacity in the cell.
[0069] An action(s) includes a discrete tilt variation from a current vertical antenna tilt angle. In some embodiments, the action includes an up-tilt, down-tilt, or no-change action a_t ∈ {−α, 0, α}. In some embodiments, the tilt at the next step is deterministically computed as: ψ(t + 1) = ψ(t) + a_t.
[0070] A reward(s) includes a function of the observed KPIs that describe a quality of service (QoS) perceived by user equipment(s) (UE(s)). The term "user equipment" or “UE” is used in a non-limiting manner, can refer to any type of communication device, and may be interchangeable and replaced with the term "communication device". In some embodiments, a function of cDOF(t) and qDOF(t) is considered which addresses the coverage and capacity optimization (CCO) trade-off: r_t = log(1 + cDOF(t)² + qDOF(t)²).
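By way of a non-limiting illustration, the following sketch encodes the RET state, action, and reward definitions above. The tilt step size, tilt limits, and KPI values are illustrative placeholders.

```python
import numpy as np

# Sketch of the RET state/action/reward of paragraphs [0068]-[0070].
ALPHA = 1.0                                   # tilt step size in degrees (assumption)
TILT_MIN, TILT_MAX = 0.0, 15.0                # assumed tilt range

def make_state(c_dof, q_dof, tilt):
    # s_t = [cDOF(t), qDOF(t), psi(t)]
    return np.array([c_dof, q_dof, tilt])

def apply_action(tilt, action):
    # a_t in {-alpha, 0, +alpha}: down-tilt, no change, up-tilt
    return float(np.clip(tilt + action, TILT_MIN, TILT_MAX))

def reward(c_dof, q_dof):
    # r_t = log(1 + cDOF(t)^2 + qDOF(t)^2), the coverage/capacity trade-off
    return float(np.log(1.0 + c_dof ** 2 + q_dof ** 2))

state = make_state(c_dof=0.6, q_dof=0.4, tilt=6.0)
next_tilt = apply_action(state[2], action=ALPHA)
print(next_tilt, reward(0.6, 0.4))
```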
[0071] Still referring to an exemplary embodiment of an offline Sim-to-Real method for RET optimization, simulation environment 202 includes a Fourth Generation (4G) cellular network simulator that includes a model of the real network environment. In some embodiments, the simulation state, action, and reward are consistent with the above discussion. While various embodiments herein are discussed in the non-limiting context of a 4G cellular network, the embodiments are not so limited. Instead, other cellular networks may be used, including without limitation, a new radio (NR) network.
[0072] A real-world dataset for an exemplary embodiment of an offline Sim-to-Real method for RET optimization includes, without limitation, an offline dataset, e.g. Dπ0, generated according to an expert policy π0(a_i | s_i) coming from a 4G network. In other words, an agent performs different actions a_i in an environment having a set of states s_i based on its expert policy π0. In some embodiments, for the real-world interface, an offline learning approach is used (e.g. non-published internal reference implementation), in which a policy is learned completely offline.
[0073] In some embodiments, the hybrid policy model for offline Sim-to-Real transfer learning for RET optimization is a parametric ANN that is trained on data samples coming from both the real-world and the simulated network environment. Training of the hybrid policy model may use DQN, PG, Actor-Critic, etc. algorithms, etc.
[0074] While embodiments discussed herein are explained in the non-limiting context of a hybrid policy model comprising an ANN, the embodiments are not so limited. Instead, other parametric artificial intelligence models may be used, e.g. a Regression Model, a Decision Tree, etc.
[0075] In some embodiments, training in the simulator using a simulated 4G cellular network includes training the hybrid policy model containing a set of simulation parameters θs in the simulator using randomized environmental settings following a DR approach until convergence.
[0076] In some embodiments, after training in the simulator, a subset of trained simulation parameters, θshared , are shared with the hybrid policy model for ensemble (e.g., merged) learning with real-world training.
[0077] In some embodiments, subsequently, the real-world portion of the hybrid policy model is trained by combining the outputs from θshared and θr at each training step.
[0078] In some embodiments, the hybrid policy π_θ* is evaluated by testing it on a test dataset and comparing its performance with a baseline policy π0 based on a metric defined in terms of the reward function.
[0079] An exemplary embodiment of an offline Sim-to-Real method for power optimization is now discussed.
[0080] Downlink power control is a problem known and studied by those of skill in the art. An aim of downlink power control is to control radio downlink transmission power to try to maximize a set of desired key performance indicators (KPIs), such as coverage, capacity and signal quality. The power control can be formulated as an RL agent at a cell, cluster, or network level, depending on the implementation.
[0081] In some embodiments, a RL framework for an offline Sim-to-Real method for power optimization in a telecommunication network includes, without limitation, states, actions, and rewards.
[0082] In some embodiments, the RL framework at a cell level includes states, an action(s), rewards, and a simulation environment.
[0083] A state(s) includes a set of measured KPIs collected at a cell level, such as an amount of cell power, an average Reference Signal Received Power (RSRP) to provide implicit user locations, and average interference.
[0084] An action(s) includes a discrete power adjustment from a current downlink power at cell level, etc., e.g., power-up, power-down and no change.
[0085] A reward(s) includes a function of different KPIs, e.g., coverage, capacity and signal quality, etc.
[0086] In some embodiments, a simulation environment includes a 4G cellular network simulator that simulates downlink power control. In some embodiments, the simulation takes cell power as input and outputs a set of cells measured KPIs.
[0087] Referring to Figure 2, in some embodiments, a RL framework for an offline Sim-to- Real method for power optimization in a telecommunication network includes, without limitation, a simulation environment 202, an offline real-world dataset 204, a hybrid policy model 206, an online simulation training 208, sim and real knowledge sharing 210, and offline real-world dataset training 212.
[0088] Still referring to an exemplary embodiment of an offline Sim-to-Real method for power optimization, simulation environment 202 includes a 4G cellular network simulator that includes a model of the real network environment. In some embodiments, the simulation state, action, and reward are consistent with the above discussion.
[0089] In some embodiments, the hybrid policy model for offline Sim-to-Real transfer learning for power optimization is a parametric ANN that is trained on data samples coming from both the real-world and the simulated network environment. Training of the hybrid policy model may use DQN, PG, Actor-Critic, etc. algorithms, etc.
[0090] In some embodiments, training in the simulator using a simulated 4G cellular network includes training the hybrid policy model containing a set of simulation parameters θS in the simulator using randomized environmental settings following a DR approach until convergence.
[0091] Figure 4 is a block diagram and data flow diagram of an exemplary embodiment of a hybrid policy model 400 (e.g., hybrid policy model 206) comprising a parametric artificial neural network having a parameter θ in accordance with some embodiments.
[0092] Referring to Figure 4, hybrid policy model 400 includes a simulation model portion 402 including a set of simulation model parameters θs and a real-world model portion 404 including a set of real-world parameters θr. A subset of the set of simulation model parameters θs, θshared 406, is shared with real-world model portion 404 of hybrid policy model 400.
[0093] In one embodiment, simulation model portion 402 includes layers and sizes as follows and as illustrated in Figure 4:
[Table: simulation model portion 402 layer sizes. Input layer 408: 4 nodes; hidden layer dense (410): 100 nodes; hidden layer dense_1 (412): 50 nodes; hidden layer dense_2 (414): 20 nodes; output layer dense_3 (416): 3 nodes.]
[0094] In this embodiment, real-world model portion 404 includes layers and sizes as follows and as illustrated in Figure 4:
[Table: real-world model portion 404 layer sizes. Input layer 408: 4 nodes; hidden layers 420, 428, 422, 430: 10, 5, 10, and 5 nodes, respectively; output layer dense_8 (434): 3 nodes.]
[0095] Simulation model portion 402 includes input layer 408 having a plurality of input nodes, a sequence of neural network hidden layers 410, 412, 414 each including a plurality of weight nodes (e.g., 100, 50, and 20 nodes, respectively), and an output layer 416 including one or more output nodes (e.g., 3 output nodes). In the particular non-limiting example of Figure 4, the input layer 408 includes 4 nodes. The input data is provided to different ones of the input nodes. A first one of the sequence of neural network hidden layers 410, 412, 414 includes weight nodes dense, dense_1, and dense_2, respectively. The output layer 416 includes three output nodes dense_3.
[0096] Real-world model portion 404 includes input layer 408 having a plurality of input nodes, a sequence of neural network hidden layers 420, 428, 422, and 430 each including a plurality of weight nodes (e.g., 10, 5, 10, and 5 nodes, respectively), and an output layer 434 including one or more output nodes (e.g., 3 output nodes). In the particular non-limiting example of Figure 4, the input layer 408 includes 4 nodes. The input data is provided to different ones of the input nodes. A first one of the sequence of neural network hidden layers 420, 428, 422, and 430 includes respective weight nodes. The output layer 434 includes three output nodes dense_8.
[0097] During operation, simulated model portion 402 of hybrid policy model 400 is trained. Input features (e.g., KPIs) are provided to input nodes of the simulated model portion 402. The simulated model portion 402 processes the inputs to the input nodes through neural network hidden layers 412, 414, which combine the inputs, as will be described below, to provide outputs for combining at output node 416. The output node provides an output value responsive to processing a stream of input features through the input nodes of the neural network.
[0098] Still referring to Figure 4, simulated model portion 402 generates an action from output node 416 and performs feedback training of the node weights of the input layer 408, and the hidden neural network layers 412, 414.
[0099] Input features (e.g., KPIs) are provided to different ones of the input nodes of input layer 408. A first one of the sequence of neural network hidden layers 412 includes weight nodes N1L1 (where "1L1" refers to a first weight node on layer one) to NXL1 (where X is any plural integer). A last one ("Z") of the sequence of neural network hidden layers 414 includes weight nodes N1LZ (where Z is any plural integer) to NYLZ (where Y is any plural integer). The output layer 416 includes an output node.
[00100] The simulated model portion 402 operates the input nodes of the input layer 408 to each receive different input features. Each of the input nodes multiplies an input feature metric (e.g., a KPI) by a weight that is assigned to the input node to generate a weighted metric value. The input node then provides the weighted metric value to combine nodes of the first one of the sequence of the hidden layers 412.
[00101] During operation, the interconnected structure between the input nodes 408, the weight nodes of the neural network hidden layers 412, 414, and the output nodes 416 may cause the characteristics of each inputted feature to influence the output (e.g., an action) generated for all of the other inputted features that are simultaneously processed. [00102] Still referring to Figure 4, real-world model portion 404 generates an action from output node 434 and performs feedback training of the node weights of the input layer 408, and the hidden neural network layers 420, 422, 428, and 430.
[00103] Input features (e.g., KPIs) are provided to different ones of the input nodes of input layer 408. The first one of the sequence of neural network hidden layers 422 includes weight nodes N1L1 (where "1L1" refers to a first weight node on layer one) to NXL1 (where X is any plural integer). A last one ("Z") of the sequence of neural network hidden layers 430 includes weight nodes N1LZ (where Z is any plural integer) to NYLZ (where Y is any plural integer). The output layer 434 includes an output node.
[00104] The real-world model portion 404 operates the input nodes of the input layer 408 to each receive different input features. Each of the input nodes multiplies an input feature metric (e.g., a KPI) by a weight that is assigned to the input node to generate a weighted metric value. The input node then provides the weighted metric value to combine nodes of the first one of the sequence of the hidden layers 422.
[00105] Still referring to Figure 4, Sim-to-Real knowledge sharing includes sharing a subset 406 (θshared) of the trained simulation model parameters θ*s with real-world model portion 404 (e.g., based on freezing and sharing subset 406 of the trained simulation model portion 402). A full or a partial subset of the simulation model parameters can be shared with the real-world model portion 404, i.e., θshared ⊆ θ*s, as shown with lambda 418 and lambda_1 426. Freezing means that the subset of the set of simulation model parameters (i.e., θshared) is kept fixed and gradient updates are not executed on these weights during training.
[00106] Still referring to Figure 4, in some embodiments, the real-world training parameters, e.g., θr and θshared, are concatenated in a way that their outputs (from the sim and real portions) at each layer can be combined by summing them (e.g., add 424 and add_1 432) when hybrid policy model 400 is provided with input from the real-world dataset. In additional or alternative embodiments, the sharing is based on fine-tuning the subset of the set of simulation model parameters (i.e., θshared), e.g., to increase the performance.
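The freezing and per-layer summation described above can be sketched, under assumptions, as the PyTorch module below: a frozen branch holds the shared simulation parameters θshared, a trainable branch holds the real-world parameters θr, and the two branches are merged by addition at each hidden layer (cf. add 424 and add_1 432). The lateral adapter layers and exact sizes are illustrative choices rather than the structure of Figure 4.

```python
import torch
import torch.nn as nn

class HybridPolicy(nn.Module):
    """Sketch of the hybrid policy: a frozen branch carries the shared
    simulation parameters θshared and a trainable branch carries the
    real-world parameters θr; their hidden outputs are summed layer by
    layer (cf. add 424 and add_1 432). Sizes and adapters are assumptions."""

    def __init__(self, n_inputs=4, n_actions=3):
        super().__init__()
        # Shared simulation layers (θshared), to be frozen for real-world training.
        self.sim_h1 = nn.Linear(n_inputs, 100)
        self.sim_h2 = nn.Linear(100, 50)
        # Lateral adapters projecting sim activations onto the real-branch width
        # (loosely playing the role of lambda 418 / lambda_1 426).
        self.lam_1 = nn.Linear(100, 10)
        self.lam_2 = nn.Linear(50, 10)
        # Real-world layers (θr).
        self.real_h1 = nn.Linear(n_inputs, 10)
        self.real_h2 = nn.Linear(10, 10)
        self.real_out = nn.Linear(10, n_actions)

    def freeze_shared(self):
        # Freezing: no gradient updates are executed on θshared.
        for p in list(self.sim_h1.parameters()) + list(self.sim_h2.parameters()):
            p.requires_grad_(False)

    def forward(self, x):
        s1 = torch.relu(self.sim_h1(x))
        s2 = torch.relu(self.sim_h2(s1))
        r1 = torch.relu(self.real_h1(x) + self.lam_1(s1))   # merge at first layer
        r2 = torch.relu(self.real_h2(r1) + self.lam_2(s2))  # merge at second layer
        return self.real_out(r2)                            # action preferences

```

In this sketch, only the parameters left with requires_grad set to True (the real-world branch and the adapters) receive gradient updates during offline real-world training.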
[00107] Offline real-world dataset training can train the real-world parameters, e.g., θr of the real-world model portion of hybrid policy model 400, with off-policy offline learning during training. As shown in Figure 4, training can use learning from the simulator using the shared parameters 406, e.g., θshared, by merging outputs from the shared parameters and the real-world parameters, e.g., θshared and θr, at each training step (e.g., add 424 and add_1 432). The output of the real-world training from output node 434 includes the trained real-world parameters, e.g., denoted by θ*r. [00108] In some embodiments, an inference process is executed by the portion of the hybrid policy model 400 containing both the shared subset of simulation model parameters 406 and the real-world parameters of real-world portion 404 of hybrid policy model 400, e.g., denoted by θ* = θshared ∪ θ*r.
[00109] The parametric ANN of Figure 4 is an example that has been provided for ease of illustration and explanation of one embodiment. Other embodiments may include other parametric artificial intelligence models, or other structures of a parametric ANN including other predictions and any non-zero number of input layers having any non-zero number of input nodes, any non-zero number of neural network layers having a plural number of weight nodes, and any non-zero number of output layers having any non-zero number of output nodes. The number of input nodes can be selected based on the number of measured key performance metrics (e.g., KPIs) that are to be simultaneously processed, and the number of output nodes can be similarly selected based on the number of prediction values that are to be simultaneously generated therefrom.
[00110] In various embodiments, a Sim-to-Real architecture can be implemented in a distributed implementation. In some embodiments, n agents (e.g., parametric artificial intelligence models) are distributed across different workers (e.g. in a RET optimization, there can be one worker node per base station). In some embodiments, hybrid policy model training can be employed in an accelerated fashion by combining local gradient updates calculated on separate workers (e.g., worker nodes) to perform global model parameter updates synchronously or asynchronously. In some embodiments, various model parallelism and data parallelism techniques can be harnessed to expedite the training process.
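A synchronous variant of this distributed scheme can be sketched as follows, assuming each worker returns its local per-parameter gradients as NumPy arrays; the data layout and learning rate are assumptions.

```python
import numpy as np

def global_update(params, worker_grads, lr=1e-3):
    """Synchronous global step: each worker (e.g., one per base station) returns
    its local per-parameter gradients as NumPy arrays; the server averages them
    and applies a single update to the global parameters. Layout is illustrative."""
    averaged = [np.mean([grads[i] for grads in worker_grads], axis=0)
                for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, averaged)]
```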
[00111] In some embodiments, after a deployment stage, the pre-trained hybrid policy model can be re-tuned/configured based on a revised dataset from a local telecommunications network base station or cluster of neighboring base stations. For example, in some embodiments, the process can be carried out using federated learning, e.g., an approach of distributed learning through multiple decentralized agents that does not require exchange of the dataset itself among the agents and, as a consequence, preserves privacy in, e.g., multi-vendor scenarios.
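One possible federated re-tuning round consistent with this approach is sketched below, assuming each base station (or cluster) returns only its locally updated weights together with its sample count, so no dataset leaves the local site. The aggregation rule shown is standard federated averaging and is illustrative only.

```python
def federated_round(local_weights, sample_counts):
    """One federated-averaging round: each site re-tunes the pre-trained hybrid
    policy locally and returns only its updated weights plus its sample count;
    the server aggregates weights proportionally to the sample counts, so no
    dataset is exchanged. Data layout (list of per-layer arrays) is assumed."""
    total = float(sum(sample_counts))
    n_layers = len(local_weights[0])
    return [sum((n / total) * local_weights[k][layer]
                for k, n in enumerate(sample_counts))
            for layer in range(n_layers)]
```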
[00112] Various embodiments of the present disclosure include a complete architecture that may tackle the Sim-to-Real gap problem in RL concerning a mismatch between simulation and real-world environment models which can hinder the real-world deployment of RL policies trained in simulated environments.
[00113] In various embodiments, unlike conventional RL training methods, an RL agent (e.g., a parametric artificial intelligence model) is trained in a simulation by including parameters for learning real-world features and embedding them in the simulation. As a consequence, in various embodiments, a combined training is performed in the simulation environment with parameters coming directly from the real-world environment, altering the training process to include real-world distributed samples in the simulation. Figure 5 illustrates an offline Sim-to-Real transfer system 500 in accordance with various embodiments. Offline Sim-to-Real transfer system 500 includes a hybrid policy model 206, a computer 510, and a data repository 530. Implementation of the Sim-to-Real RL system includes, without limitation, implementation in a cloud-based network (e.g., in a server), a node of a network (e.g., a base station of a cellular network or a core node for a cellular network), a device communicatively connected to a network (e.g., the Internet or a cellular network), etc. While the components of Sim-to-Real transfer system 500 are depicted as single boxes located within a larger box, or nested within multiple boxes, in practice, a Sim-to-Real transfer system may include multiple different physical components that make up a single illustrated component (e.g., memory 516 may comprise multiple separate hard drives as well as multiple RAM modules), and the multiple components can be implemented in a distributed implementation across multiple locations or networks.
[00114] The computer 510 includes at least one memory 516 ("memory") storing program code 518, a network interface 514, and at least one processor 512 ("processor") that executes the program code 518 to perform operations described herein. The computer 510 is coupled to the data repository 530 and the hybrid policy model 206. In some embodiments, the offline Sim-to-Real transfer system 500 can be connected to a cellular network and can acquire KPIs for controlling a RET angle of an antenna in the cellular network, for controlling a downlink power control of a radio downlink transmission power in the cellular network, etc. In some embodiments, processor 512 can be connected via the network interface 514 to communicate with the cellular network and the data repository 530.
[00115] The processor 512 may include one or more data processing circuits, such as a general purpose and/or special purpose processor (e.g., microprocessor and/or digital signal processor) that may be collocated or distributed across one or more networks. The processor 512 may include one or more instruction processor cores. The processor 512 is configured to execute computer program code 518 in the memory 516, described below as a non-transitory computer readable medium, to perform at least some of the operations described herein as being performed by any one or more elements of the offline Sim-to-Real transfer system 500.
[00116] Now that the operations of the various components have been described, operations specific to the computer 510 of offline Sim-to-Real transfer system 500 (implemented using the structure of the block diagram of Figure 5) for performing offline Sim-to-Real transfer for RL will now be discussed with reference to the flow chart of Figure 6 according to various embodiments of the present disclosure. For example, modules may be stored in memory 516 of Figure 5, and these modules may provide instructions so that when the instructions of a module are executed by respective computer processing circuitry 512, processing circuitry 512 performs respective operations of the flow chart. Each of the operations described in Figure 6 can be combined and/or omitted in any combination with each other, and it is contemplated that all such combinations fall within the spirit and scope of this disclosure.
[00117] Referring to Figure 6, a method is provided for offline simulation-to-reality transfer for reinforcement learning. The method includes training 605 a simulation model portion of a hybrid policy model. The simulation model portion includes a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model includes a parametric artificial intelligence model having a parameter θ. The parameter θ includes the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion including a set of real-world parameters θr trained on a dataset. The method further includes sharing 607 a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The method further includes training 609 the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion including real-world parameters θr at a training step.
[00118] In some embodiments, the training 605 the simulation model portion includes using the simulated environment to train the set of simulation model parameters θs through online simulation training.
[00119] In some embodiments, the simulated environment enables one or more of a plurality of domain randomization techniques by changing one or more simulation conditions of the simulated environment to simulate randomized environment conditions and a domain adaptation adjustment by providing feedback from the dataset.
[00120] In some embodiments, the method further includes accessing 601 the dataset. The dataset includes recorded observations from the real-world environment, the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy, and the accessing includes collecting the dataset by one or more of the logging policy and a pre-collected real-world dataset. [00121] In some embodiments, the method further includes preprocessing 603 the dataset with a conversion of the dataset to a format for use in the training, and the conversion includes one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
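As an illustration of these preprocessing options, the sketch below fits a logistic regression model of the logging policy to compute inverse propensity scores and splits the result into training and test sets. It assumes actions are label-encoded integers and uses scikit-learn, which is an implementation choice rather than part of the disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def preprocess(states, actions, rewards, test_size=0.2, seed=0):
    """Fit a logistic-regression model of the logging policy to estimate
    propensities p(a|s), attach inverse propensity scores to every record,
    and split into training and test sets. Actions are assumed to be
    label-encoded integers 0..K-1; array names and shapes are assumptions."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(states, actions)                       # model of the logging policy
    probs = clf.predict_proba(states)              # estimated p(a|s) per action
    propensity = probs[np.arange(len(actions)), actions]
    ips = 1.0 / np.clip(propensity, 1e-3, 1.0)     # inverse propensity scores
    data = np.column_stack([states, actions, rewards, ips])
    return train_test_split(data, test_size=test_size, random_state=seed)
```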
[00122] In some embodiments, the sharing includes freezing at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared.
[00123] In some embodiments, the sharing comprises adjusting at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared.
[00124] In some embodiments, the set of real-world parameters θr and the shared subset of the set of simulation model parameters θshared are concatenated. The sharing 607 includes summing an output from the simulation model portion and an output from the real-world model portion based on the concatenation. The summing is performed when the hybrid policy model receives input from the dataset.
[00125] In some embodiments, the training 609 of the real-world model portion trains the set of real-world parameters θr of the hybrid policy model with an importance sampling technique.
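One way to realize such an importance sampling technique is the off-policy update sketched below, which weights a policy-gradient style loss by the clipped ratio between the hybrid policy and the logging policy. The batch layout, clipping constant, and the assumption that the optimizer holds only the trainable real-world parameters θr are illustrative.

```python
import torch

def offline_update(model, optimizer, batch, clip=10.0):
    """One off-policy training step on logged real-world data: a policy-gradient
    style loss is weighted by the (clipped) importance ratio between the hybrid
    policy and the logging policy. Batch layout (float states, int64 actions,
    float rewards and logging propensities) is an assumption; the optimizer is
    assumed to hold only the trainable real-world parameters θr."""
    states, actions, rewards, logging_probs = batch
    log_probs = torch.log_softmax(model(states), dim=-1)
    chosen = log_probs.gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        ratio = (chosen.exp() / logging_probs).clamp(max=clip)  # importance weights
    loss = -(ratio * rewards * chosen).mean()   # IPS-weighted objective
    optimizer.zero_grad()
    loss.backward()                             # θshared stays frozen
    optimizer.step()
    return loss.item()
```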
[00126] In some embodiments, the method further includes, subsequent to training the simulation model portion and the real-world model portion, evaluating 611 the hybrid policy model on the test set by an offline off-policy evaluation technique.
[00127] In some embodiments, the environment includes a real-world cellular network, and the hybrid policy model includes a parametric artificial intelligence model for controlling a remote electrical tilt, RET, angle of an antenna in the real-world cellular network.
[00128] In some embodiments, the dataset includes a dataset generated according to a logging policy from the real-world cellular network including recorded observations from the real-world cellular network and the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy. The state includes a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level. The action includes a discrete tilt variation from a current RET angle of the antenna, and the reward feedback includes a function of the plurality of key performance indicators that describe a quality of service of a communication device. [00129] In some embodiments, the environment includes a real-world cellular network, and the hybrid policy model includes a parametric artificial intelligence model for controlling a downlink power control of a radio downlink transmission power in the real-world cellular network.
[00130] In some embodiments, the dataset includes a dataset generated according to a logging policy from the real-world cellular network including recorded observations from the real-world cellular network and the recorded observations include a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy. The state includes a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level. The action includes a discrete power adjustment from a current downlink power, and the reward feedback includes a function of one or more of the plurality of key performance indicators.
[00131] Various operations from the flow chart of Figure 6 may be optional with respect to some embodiments of an offline simulation-to-reality transfer system and related methods. For example, operations of blocks 601, 603, and 611 of Figure 6 may be optional.
[00132] In the above description of various embodiments of the present disclosure, it is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of present inventive concepts. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which present inventive concepts belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[00133] When an element is referred to as being "connected", "coupled", "responsive", or variants thereof to another element, it can be directly connected, coupled, or responsive to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly connected", "directly coupled", "directly responsive", or variants thereof to another element, there are no intervening elements present. Like numbers refer to like elements throughout.
Furthermore, "coupled", "connected", "responsive", or variants thereof as used herein may include wirelessly coupled, connected, or responsive. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Well- known functions or constructions may not be described in detail for brevity and/or clarity. The term "and/or" includes any and all combinations of one or more of the associated listed items. [00134] It will be understood that although the terms first, second, third, etc. may be used herein to describe various elements/operations, these elements/operations should not be limited by these terms. These terms are only used to distinguish one element/operation from another element/operation. Thus, a first element/operation in some embodiments could be termed a second element/operation in other embodiments without departing from the teachings of present inventive concepts. The same reference numerals or the same reference designators denote the same or similar elements throughout the specification.
[00135] As used herein, the terms "comprise", "comprising", "comprises", "include", "including", "includes", "have", "has", "having", or variants thereof are open-ended, and include one or more stated features, integers, elements, steps, components or functions but do not preclude the presence or addition of one or more other features, integers, elements, steps, components, functions or groups thereof. Furthermore, as used herein, the common abbreviation "e.g.", which derives from the Latin phrase "exempli gratia," may be used to introduce or specify a general example or examples of a previously mentioned item, and is not intended to be limiting of such item. The common abbreviation "i.e.", which derives from the Latin phrase "id est," may be used to specify a particular item from a more general recitation.
[00136] Example embodiments are described herein with reference to block diagrams and/or flowchart illustrations of computer-implemented methods, apparatus (systems and/or devices) and/or computer program products. It is understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions that are performed by one or more computer circuits. These computer program instructions may be provided to a processor circuit of a general purpose computer circuit, special purpose computer circuit, and/or other programmable data processing circuit to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, transform and control transistors, values stored in memory locations, and other hardware components within such circuitry to implement the functions/acts specified in the block diagrams and/or flowchart block or blocks, and thereby create means (functionality) and/or structure for implementing the functions/acts specified in the block diagrams and/or flowchart block(s).
[00137] These computer program instructions may also be stored in a tangible computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the functions/acts specified in the block diagrams and/or flowchart block or blocks. Accordingly, embodiments of present inventive concepts may be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.) that runs on a processor such as a digital signal processor, which may collectively be referred to as "circuitry," "a module" or variants thereof.
[00138] It should also be noted that in some alternate implementations, the functions/acts noted in the blocks may occur out of the order noted in the flowcharts. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Moreover, the functionality of a given block of the flowcharts and/or block diagrams may be separated into multiple blocks and/or the functionality of two or more blocks of the flowcharts and/or block diagrams may be at least partially integrated. Finally, other blocks may be added/inserted between the blocks that are illustrated, and/or blocks/operations may be omitted without departing from the scope of inventive concepts. Moreover, although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.
[00139] Many variations and modifications can be made to the embodiments without substantially departing from the principles of the present inventive concepts. All such variations and modifications are intended to be included herein within the scope of present inventive concepts. Accordingly, the above disclosed subject matter is to be considered illustrative, and not restrictive, and the examples of embodiments are intended to cover all such modifications, enhancements, and other embodiments, which fall within the spirit and scope of present inventive concepts. Thus, to the maximum extent allowed by law, the scope of present inventive concepts is to be determined by the broadest permissible interpretation of the present disclosure including the examples of embodiments and their equivalents, and shall not be restricted or limited by the foregoing detailed description.
[00140] Example embodiments are discussed below. Reference numbers/letters are provided in parenthesis by way of example/illustration without limiting example embodiments to particular elements indicated by reference numbers/letters.
[00141] Embodiment 1. A computer-implemented method for offline simulation-to-reality transfer for reinforcement learning. The method comprises training (605) a simulation model portion of a hybrid policy model. The simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model comprises a parametric artificial intelligence model having a parameter θ. The parameter θ comprises the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion comprising a set of real-world parameters θr trained on a dataset. The method further includes sharing (607) a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The method further includes training (609) the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
[00142] Embodiment 2. The method of Embodiment 1, wherein the training (605) the simulation model portion comprises using the simulated environment to train the set of simulation model parameters θs through online simulation training.
[00143] Embodiment 3. The method of Embodiment 2, wherein the online simulation training comprises an output of trained simulation weights θ*s.
[00144] Embodiment 4. The method of any of Embodiments 2 to 3, wherein the simulated environment enables one or more of a plurality of domain randomization techniques by changing one or more simulation conditions of the simulated environment to simulate randomized environment conditions and a domain adaptation adjustment by providing feedback from the dataset.
[00145] Embodiment 5. The method of any of Embodiments 1 to 4, further comprising accessing (601) the dataset. The dataset comprises recorded observations from the real-world environment, and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy, and the accessing comprises collecting the dataset by one or more of the logging policy and a pre-collected real-world dataset. [00146] Embodiment 6. The method of any of Embodiments 1 to 5, further comprising preprocessing (603) the dataset with a conversion of the dataset to a format for use in the training.
The conversion comprises one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
[00147] Embodiment 7. The method of any of Embodiments 2 to 6, wherein the sharing comprises freezing at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared.
[00148] Embodiment 8. The method of any of Embodiments 2 to 6, wherein the sharing comprises adjusting at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared. [00149] Embodiment 9. The method of any of Embodiments 1 to 8, wherein the set of real-world parameters θr and the shared subset of the set of simulation model parameters θshared are concatenated, and wherein the sharing (607) comprises summing an output from the simulation model portion and an output from the real-world model portion based on the concatenation, wherein the summing is performed when the hybrid policy model receives input from the dataset.
[00150] Embodiment 10. The method of any of Embodiments 1 to 9, wherein the training (609) of the real-world model portion trains the set of real-world parameters θr of the hybrid policy model with an importance sampling technique.
[00151] Embodiment 11. The method of any of Embodiments 6 to 10, further comprising, subsequent to training the simulation model portion and the real-world model portion, evaluating (611) the hybrid policy model on the test set by an offline off-policy evaluation technique. [00152] Embodiment 12. The method of any of Embodiments 1 to 11, wherein the environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a remote electrical tilt, RET, angle of an antenna in the real-world cellular network.
[00153] Embodiment 13. The method of Embodiment 12, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete tilt variation from a current RET angle of the antenna, and wherein the reward feedback comprises a function of the plurality of key performance indicators that describe a quality of service of a communication device.
[00154] Embodiment 14. The method of any of Embodiments 1 to 11 , wherein the environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a downlink power control of a radio downlink transmission power in the real-world cellular network.
[00155] Embodiment 15. The method of Embodiment 14, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete power adjustment from a current downlink power, and wherein the reward feedback comprises a function of one or more of the plurality of key performance indicators.
[00156] Embodiment 16. An offline simulation-to-reality transfer system (500) for reinforcement learning, the offline simulation-to-reality transfer system comprising: at least one processor (512); at least one memory (516) connected to the at least one processor (512) and storing program code that is executed by the at least one processor to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model comprises a parametric artificial intelligence model having a parameter θ. The parameter θ comprises the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion comprising a set of real-world parameters θr trained on a dataset. The operations further include sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
[00157] Embodiment 17. The offline simulation-to-reality transfer system (500) of Embodiment 16, wherein the at least one memory (516) is connected to the at least one processor (512) and stores program code that is executed by the at least one processor to perform operations according to any of Embodiments 2 to 15.
[00158] Embodiment 18. An offline simulation-to-reality transfer system (500) for reinforcement learning, the offline simulation-to-reality transfer system adapted to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model comprises a parametric artificial intelligence model having a parameter θ. The parameter θ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters θr trained on a dataset. The operations further include sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
[00159] Embodiment 19. The offline simulation-to-reality transfer system (500) of Embodiment 18 adapted to perform operations according to any of Embodiments 2 to 15.
[00160] Embodiment 20. A computer program comprising program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model comprises a parametric artificial intelligence model having a parameter θ. The parameter θ comprises the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion comprising a set of real-world parameters θr trained on a dataset. The operations further include sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
[00161] Embodiment 21. The computer program of Embodiment 20, whereby execution of the program code causes the offline simulation-to-reality transfer system (500) to perform operations according to any of Embodiments 2 to 15.
[00162] Embodiment 22. A computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment. The hybrid policy model comprises a parametric artificial intelligence model having a parameter θ. The parameter θ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters θr trained on a dataset. The operations further include sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model. The operations further include training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
[00163] Embodiment 23. The computer program product of Embodiment 22, whereby execution of the program code causes the offline simulation-to-reality transfer system (500) to perform operations according to any of Embodiments 2 to 15.
[00164] Explanations are provided below for various abbreviations/acronyms used in the present disclosure.
Abbreviation Explanation
RL: Reinforcement Learning
PG: Policy Gradient
AC: Actor-Critic
3GPP: Third Generation Partnership Project
ANN: Artificial Neural Network
SON: Self-Organizing Network
DA: Domain Adaptation
DR: Domain Randomization
CCO: Coverage Capacity Optimization
KPI: Key Performance Indicator
RET: Remote Electrical Tilt
DOF: Degree of Freedom
LTE: Long Term Evolution
BS: Base Stations
UE: User Equipment
QoS: Quality of Service
4G: Fourth Generation


CLAIMS:
1. A computer-implemented method for offline simulation-to-reality transfer for reinforcement learning, the method comprising: training (605) a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment; wherein the hybrid policy model comprises a parametric artificial intelligence model having a parameter θ, wherein the parameter θ comprises the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion comprising a set of real-world parameters θr trained on a dataset; sharing (607) a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model; and training (609) the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
2. The method of Claim 1, wherein the training (605) the simulation model portion comprises using the simulated environment to train the set of simulation model parameters θs through online simulation training.
3. The method of Claim 2, wherein the online simulation training comprises an output of trained simulation weights θ*s.
4. The method of any of Claims 1 to 3, wherein the simulated environment enables one or more of a plurality of domain randomization techniques by changing one or more simulation conditions of the simulated environment to simulate randomized environment conditions and a domain adaptation adjustment by providing feedback from the dataset.
5. The method of any of Claims 1 to 4, further comprising: accessing (601) the dataset, wherein the dataset comprises recorded observations from the real-world environment, and wherein the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by a logging policy, and wherein the accessing comprises collecting the dataset by one or more of the logging policy and a pre-collected real-world dataset.
6. The method of any of Claims 1 to 5, further comprising: processing (603) the dataset with a conversion of the dataset to a format for use in the training (605), wherein the conversion comprises one or more of a data preparation technique, a data cleaning, an instances partitioning, a feature tuning, a feature extraction, a feature construction, a computation of inverse propensity scores through logistic regression, and a split of the dataset into a training set and a test set.
7. The method of any of Claims 2 to 6, wherein the sharing (607) comprises freezing at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared.
8. The method of any of Claims 2 to 6, wherein the sharing (607) comprises adjusting at least some of the simulation model parameters θs to obtain the shared subset of the simulation model parameters θshared.
9. The method of any of Claims 1 to 8, wherein the set of real-world parameters θr and the shared subset of the set of simulation model parameters θshared are concatenated, and wherein the sharing (607) comprises summing an output from the simulation model portion and an output from the real-world model portion based on the concatenation, wherein the summing is performed when the hybrid policy model receives input from the dataset.
10. The method of any of Claims 1 to 9, wherein the training (609) of the real-world model portion trains the set of real-world parameters θr of the hybrid policy model with an importance sampling technique.
11. The method of any of Claims 6 to 10, further comprising: subsequent to training the simulation model portion and the real-world model portion, evaluating (611) the hybrid policy model on the test set by an offline off-policy evaluation technique.
12. The method of any of Claims 1 to 11 , wherein the real-world environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a remote electrical tilt, RET, angle of an antenna in the real-world cellular network.
13. The method of Claim 12, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete tilt variation from a current RET angle of the antenna, and wherein the reward feedback comprises a function of the plurality of key performance indicators that describe a quality of service of a communication device.
14. The method of any of Claims 1 to 11 , wherein the real-world environment comprises a real-world cellular network, wherein the hybrid policy model comprises a parametric artificial intelligence model for controlling a downlink power control of a radio downlink transmission power in the real-world cellular network.
15. The method of Claim 13, wherein the dataset comprises a dataset generated according to a logging policy from the real-world cellular network comprising recorded observations from the real-world cellular network and the recorded observations comprise a plurality of sets of a state, an action, and a reward feedback by following the action governed by the logging policy, wherein the state comprises a plurality of key performance indicators, KPIs, collected from one of a cell, a cluster, or a network level, wherein the action comprises a discrete power adjustment from a current downlink power, and wherein the reward feedback comprises a function of one or more the plurality of key performance indicators.
16. The method of any of Claims 1 to 15, wherein the computer-implemented method for offline simulation-to-reality transfer for reinforcement learning is performed by one of a base station, a network node, and a cloud-based node.
17. An offline simulation-to-reality transfer system (500) for reinforcement learning, the offline simulation-to-reality transfer system adapted to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment; wherein the hybrid policy model comprises a parametric artificial intelligence model having a parameter θ, wherein the parameter θ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters θr trained on a dataset; sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model; and training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
18. The offline simulation-to-reality transfer system (500) of Claim 17 adapted to perform operations according to any of Claims 2 to 16.
19. A computer program comprising program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment; wherein the hybrid policy model comprises a parametric artificial intelligence model having a parameter θ, wherein the parameter θ comprises the simulation model portion including the set of simulation model parameters θs trained in the simulated environment and a real-world model portion comprising a set of real-world parameters θr trained on a dataset; sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model; and training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
20. A computer program product comprising a non-transitory storage medium including program code to be executed by processing circuitry (512) of an offline simulation-to-reality transfer system (500), whereby execution of the program code causes the offline simulation-to-reality transfer system to perform operations comprising: training a simulation model portion of a hybrid policy model, wherein the simulation model portion comprises a set of simulation model parameters θs trained in a simulated environment; wherein the hybrid policy model comprises a parametric artificial intelligence model having a parameter θ, wherein the parameter θ comprises the simulation model portion and a real-world model portion comprising a set of real-world parameters θr trained on a dataset; sharing a shared subset of the set of simulation model parameters, θshared, with the real-world model portion of the hybrid policy model; and training the real-world model portion of the hybrid policy model on the dataset from a real-world environment based on merging an output of the hybrid policy model from the shared subset of the simulation model parameters θshared with an output of the hybrid policy model from the real-world model portion comprising real-world parameters θr at a training step.
21. The offline simulation-to-reality transfer system (500) of any of Claims 17 to 20, wherein the offline simulation-to-reality transfer system is one of a base station, a network node, and a cloud-based node.