WO2020244735A1

WO2020244735A1 - Method and apparatus for setting a configurable system

Info

Publication number: WO2020244735A1
Application number: PCT/EP2019/064380
Authority: WO
Inventors: Lorenzo MAGGI
Original assignee: Nokia Solutions and Networks Oy
Current assignee: Nokia Solutions and Networks Oy
Priority date: 2019-06-03
Filing date: 2019-06-03
Publication date: 2020-12-10
Anticipated expiration: 2021-12-03

Abstract

A method of reinforcement learning for setting a configurable system, e.g. an antenna of the 5G type. The method may comprise in each of a plurality of computing nodes repeatedly doing the steps of selecting (304) an action suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node, wherein the subsets of the set of configuration states assigned to each of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states.

Description

METHOD AND APPARATUS FOR SETTING A CONFIGURABLE SYSTEM

Field

The disclosure relates to methods and apparatuses for setting a configurable system, and in particular to methods and apparatuses that employ a reinforcement learning technique for setting a configurable system.

Background

Growing complexity in the hyper-connected society is a major challenge for configuring wireless communication devices. Indeed, such wireless communication devices present an exponential number of possible configurations dependent on various factors such as the number of users and the type of their equipment. Traditional approaches using machine learning to setting wireless communication devices will be increasingly defied due to the evolving wireless communication devices and are insufficient to provide optimal parametrization of such devices to ensure optimal network coverage and quality of transmission for a growing number of users and more complex users' equipment.

Summary

In some example embodiments the disclosure provides a method of reinforcement learning for setting a configurable system, e.g. an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G).

The method may comprise in each of a plurality of computing nodes repeatedly doing the steps of determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state. The current configuration state and the next configuration state may belong to a set of configuration states which can be visited by the configurable system and the current configuration state may belong to a subset of the set of configuration states assigned to the computing node. The set of configuration states assigned to the plurality of the computing nodes may be non- overlapping subsets which compose a partition of the set of configuration states.

In an embodiment, the step of determining the record may comprise:

- selecting (304) the action suitable for causing the configurable system to transition from a current configuration state to the next configuration state,

- emulating the next configuration state corresponding to a response of the configurable system to the selected action in the current configuration state.

In an embodiment, each computing node comprises a fast-learning neural network. The method may further comprise in each computing node, the steps of:

- selecting a records sample from the stored records ,

- improving a target value function for one or more actions and one or more next configuration states contained in the records sample. The target value function may represent a desirability of the configurable system as a function of a next configuration state and an action.

The method may further comprise in response to determining that a next configuration state contained in the records sample is outside the subset assigned to the computing node, retrieving a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned.

The method may further comprise the step of updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node.

The method may further comprise in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.

In some example embodiments, such method may comprise one or more of the features below.

The set of configuration states may be partitioned into subsets of the set of configuration states using various method. In an example embodiment, the set of configuration states may be partitioned into box clusters using box clustering, each box cluster being a subset of the set of configuration states. In an alternative embodiment, the set of configuration states may be partitioned into subsets of configuration states using spectral clustering techniques or random clustering.

The action may be randomly selected, e.g. when the computing node decides to take an exploration action. The action may be selected to maximize the target value function in a current configuration state, e.g. when the computing node decides to take an exploitation action.

In an example embodiment, the records may be stored in a buffer of the computing node and the records sample may be randomly selected from the records stored in the buffer of the computing node.

In an example embodiment, in each computing node, the retrieving of the current value function from a neighboring computing node may comprise communicating with said neighboring computing node and copying the current value function of said neighboring computing node locally in the computing node.

In another example embodiment, in each computing node, the retrieving of the current value function from a neighboring computing node may comprise periodically communicating with said neighboring computing node and storing at each period the current value function of said neighboring computing node with a timestamp in a local memory of the computing node. The current value function of said neighboring computing node may be retrieved from the local memory in response to determining that a next configuration state of the records sample is outside the subset assigned to said computing node. In an example embodiment, in each computing node, the target value function may be approximated by a target neural network having target parameters.

The configurable system may be any wireless communication system. In some embodiment, the configurable system may be an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment. The antenna may be configured to emit a plurality of beams, each beam having a direction and an intensity.

In an example embodiment, the set of configuration states may comprise possible beams having different directions and intensities and each configuration states comprises a predefined number of beams. The action may comprise selecting a beam of the possible beams and adding it to the current configuration state.

In some embodiment, the configurable system may be a resource-controlled system e.g. data processing resources in a data processing system, and the set of configuration state may comprise a resource variation parameter such as RAM capacity, CPU processing power, CPU bandwidth, number of processor-cores, number of virtual machines, disk I/O bandwidth and network I/O bandwidth.

In example embodiments, the disclosure also provides a computer program comprising executable code that causes a computer to perform the steps of such method when executed.

In some example embodiments, the disclosure also provides a computing node configured for setting a configurable system, the computing node comprising a fast-learning neural network and means for repeatedly performing:

determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is an action suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node , wherein the computing node belongs to a plurality of computing nodes and wherein a plurality of non-overlapping subsets of the set of configuration states are assigned to the plurality of the computing nodes and compose partitions of the set of configuration states,

and wherein the computing node further comprises means for performing: selecting a records sample from the stored records,

improving a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration states and an action,

in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned,

updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,

in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the target value function.

Brief description of the drawings

These and other aspects of the invention will be apparent from and elucidated with reference to example embodiments described hereinafter, by way of example, with reference to the drawings. Figure 1 is a functional representation of a configurable system interacting with a configuration agent that may execute methods according to embodiment of the invention in order to optimize the settings of the configurable system.

Figure 2 is a functional representation of a plurality of computing nodes that may emulate the configuration agent of Fig. 1, in accordance with an embodiment.

Figure 3 is an iterative process that may be performed by a computing node to fill a buffer of the computing node.

Figure 4 is an iterative process that may be performed by a computing node for updating a target function value of the computing node of Fig. 3.

Figure 5 is a functional representation of an example of a configurable system.

Figure 6 is a functional representation of partitioning a set of configuration states of the configurable system of figure 5.

Figure 7 is a functional representation of computing nodes implementing the process of figures 3 and 4 for setting the configurable system of figure 5.

Detailed description of the embodiments

One of the challenges reinforcement learning approaches for determining the best parametrization of a configurable system faces concerns the exponential number of possible parameters. The present disclosure relates to methods that distribute the computational burden across several computing nodes and avoid failure events.

An embodiment provides a method of reinforcement learning and a computing node for setting a configurable system.

As depicted in Fig. 1, a configurable system 100 is being interacted with by a configuration agent 101 at discrete time steps t. At each time step t, the configuration agent 101 monitors a current configuration state S_t of the configurable system 100 that represents at a time step t, a current value of a set of metrics m_q of the configurable system 100; q being and integer between 1 and the number of possible metrics. The configuration agent 101 also takes, at each time step t, an action A_t that causes the configurable system 100 to transition from the current configuration state S_t to a next configuration state S_t+1. Upon taking the action A_t, the configuration agent receives a reward r_t from the configurable system 101 that is a variable depending on both the action A_t and the current configuration state S_t.

At each time step t, the configurable system 100 can exhibit any current configuration state S_t among a set S of configuration states.

In order to setting the configurable setting 100, a globally optimal strategy p ^*(s) of behaviour of the configuration agent 101 is determined by the method according to an embodiment. The configuration agent 101 using the globally optimal strategy determines an action to cause the configurable system 100 to transition to an optimal configuration state S_* ensuring the best parametrization possible of the configurable system 100.

As depicted in Fig. 2, the method comprises similar steps to be performed by each of a plurality of computing nodes 102,, where / is an integer comprised between 1 and n, such as n is the number of the plurality of computing nodes 102,. Each computing node 102, is configured to determine the globally optimal strategy p ^* (s) for the configuration agent 101 defined to maximize a globally optimal value function of said computing node 102,. The globally optimal value function

represents a desirability of the configurable system as a function of one or more configuration states s and one or more actions a. In particular, the globally optima l strategy p ^*(s) satisfies:

Each computing node 102,, returns the globally optimal strategy p ^*(s) to the configuration agent 101 so the configuration agent 101 can select an action suitable for transitioning the configurable system 100 to the globally optimal configuration state S_*.

Each computing node 102, comprises local computing resources such as a processor and a local memory. The computing resources 110 are configured for implementing an emulator 110, a fast-learning neural network 106 and a target neural network 104.

The computing node 102, is configured to learn a current value function Q_q-_(i) (s,a) of said computing node 102, by updating a fast-learning neural network 106. In particular, the target value function Q_q-_(i)(s, a) is approximated using the target neural network 104 having parameters q-(i) that are determined by updating the fast-learning neural network 106, the parameters q-(i) being weights of the target neural network 104.

The emulator 110 is configured to emulate the configurable system 100 that the configuration agent 101 interacts with by determining a succession of configuration states S₁, S₂, etc. In each configuration state S_t, the emulator 110 accepts as input an action A_t that simulates the behavior of the configuration agent 101 and it produces the corresponding reward r_t received from the configurable system 100. The emulator 110 also determines the next configuration state S_t+1 to which the configurable system 100 transitions upon being interacted with by the configuration agent 101 performing the action A_t.

Each computing node 102, is also configured to communicate with at least a neighboring computing node 102y through a communication channel 108.

Each computing node 102, is being assigned with a subset 5(0 of the set 5 of configuration states. The assembly of the subsets 5(0 assigned to each of the computing nodes compose the set 5 of configuration states and the subsets 5(0 are non-overlapping. In other word, none of the plurality of computing nodes shares a current configuration state S_t with another computing node. The subsets 5(0 satisfy the following equations:

where i,j refer to two distinct computing nodes 102,, 102y.

Each computing nodes 102, comprises, in its local memory, a table mapping each current configuration state S_t to the subset S(k) it belongs to, where 102_k refers to the computing node 102, or the neighboring computing node 102j. The table may be a hash function h such as:

Each computing nodes 102, is configured to perform the steps depicted in Fig. 3, in accordance with an embodiment of the method.

In step 302, the method comprises for each computing nodes 102, from the plurality of computing nodes, selecting an initial current configuration state S₀ comprised in the subset S(i) being assigned to the computing node 102,. The initial current configuration state S₀ may be randomly selected from the subset S(i) of the set S of configuration states.

The method also comprises in each computing nodes 102_/ m times iterations of the following steps:

Step 304: selecting an action A_t,

Step 306: emulating a next configuration state S_t+1 and a reward r_t for a current configuration state S_t and the action A_t, and

Step 308: storing a record comprising (S_t, A_t, r_t, S_t+1) in a buffer B(i) of the computing node 102,.

The above iteration may be terminated in response to encountering a current configuration state S_t exiting the subset S(i). In other words, a current configuration state S_t belonging to a subset S(j) of a neighboring computing node 102 j; i.e. S_t S(i).

In an example embodiment of the method, the action A_t in step 304 may be selected randomly or to maximize the target value function Q_q-_(i) (s,a) of the computing node 102, in a current configuration state S_t.

In an example embodiment of the method, the action A_t is selected as follows:

where Î is a predefined probability.

In an example embodiment, the step 306 is performed by the emulator 110 of computing node 102,. In an example embodiment, after performing in the computing node 102, m times the steps 304 to 308, the buffer B(i) comprises the records (S_t, A_t, r_t, S_t+1) for t comprised between 0 and m.

As depicted in Fig. 4; the method further comprises in each computing nodes 102, after the steps described above, a step 402 of sampling a records sample B'(i) from the buffer B(i). The method further comprises a step 404 of determining a computing node 102^ where S_t+1 is comprised in the subset S(/c) assigned to the corresponding computing node k, for all S_t+1 comprised in the records sample B'(i).

In an example embodiment, the computing node 102^ is determined using the table, in particular the hash function h, mapping each current configuration state S_t to the subset S(k) it belongs to.

In response to determining at step 404 that a configuration state S_t+1 is comprised in a subset S(k) belonging to a neighboring computing node 102_j, i.e. k=j, the method comprises a step 406 of retrieving the current value function Q_q-_(j) (s,a) of neighboring computing node 102_j for (A_t, S_t+1) of the records sample B'(i).

In an example embodiment, the step 406 of retrieving the target value function Q_q-_(j) (s,a) is performed by:

communicating with the neighboring computing node 102_j,

copying the value of the target neural network 104 parameters _q-_(j) of the neighboring computing node 102_j locally in the computing node 102 and

computing the target value function Q_q-_(j) (s,a) for (A_t, S_t+1) using the target neural network 104 of the computing node 102, and the retrieved value of parameters _q-_(j) .

Alternatively, the step 406 of retrieving the target value function Q_q-_(j) (s,a) is performed by:

communicating with the neighboring computing node 102_j, and

copying in the computing node 102 , the target value function Q_q-_(j) (s,a) for

(A_t, S_t+1) computed by the neighboring computing node 102_j. In another example embodiment, the method comprises in each computing node 102 periodically communicating with neighboring computing nodes 102_j a nd storing at each period the value of the target neural network 104 parameters q^_(j) of each neighboring computing node 102_j with a timestamp in the local memory of the computing node 102 . In this case, the step 406 of retrieving the target value function Q_q-_(j) (s,a) performed by retrieving the value of the parameters q^_(j) from the local memory in response to determining that the timestamp of said value of the parameters q^_(j) satisfies a recency criterion. Otherwise, the value of the target neural network 104 parameters q^_(j) is directly retrieved from the neighboring computing node 102_j. Then, the step 406 comprises computing the target value function Qe-_(j) (s, a) using the target neural network 104 of the computing node 102, and the value of the parameters q^_(j) . This embodiment allows to sparsify the communication among neighboring computing nodes and to reduce communication costs. Moreover, the limitation of the retrieval frequency, may in some case, improve the convergence time of learning the optimal function value Q_q-_(i) (s,a) and reduce the amount of traffic exchanged among processors, thus decreasing the risk of congestion.

In response to determining at step 404 that a configuration state S_t+1 is comprised in a subset S(k) belonging to the computing node 102 , i.e. k=i, the method further comprises a step 408 of computing the current value function Q

_q-_(i) (s,a) of the computing node 102 for (A_t, S_t+1) of the records sample

The method further comprises, in each computing node 102 , a step 410 of updating the target value function Q_q-_(i) (s,a). The step 410 comprises a step of updating the fast-learning neural network 106 based on the computed target value function Q_q-_(i) (s,a) of step 408 and/or the computed target value function Q_q-_(i) (s,a) o The step 410 further comprises determining an optimal

Q_q-_(i) (s,a) for the fast-learning neural network 106, where q^* (i) are parameters of the fast-learning neural network 106. In particular, the step 410 comprises resolving the equation:

where the computing node 102_k is either to the computing node 102,, i.e. k = i, or the neighbouring computing node 102_j,. i.e. k = j;

Q_q-_(i) (s,a) is a variable value function approximated by neural network parameters 0(i).

s' is a next configuration state S_t+1 comprised in the sample records

The step 410 further comprises updating the target value function Q_q-_(i) (s,a) as the optimal value function Q_{q *(i)} (s,a), in particular updating the target neural network 104 parameters q^_(i) as the optimal parameters q ^*(i) and reiterating the method from step 302.

The step 410 also comprises a step of terminating the method and returning to the configuration agent 101 the optimal value function Q_{q *(i)} (s,a), in particular the optimal parameters q ^*(i), in response to detecting that the current value function Q_q-(i) (s,a) converges to the optimal value function Q_{q *(i)} (s,a) according to a predefined convergence criterion. In particular, the computing node 102, returns the optimal value function Q_{q *(i)} (s,a) as the globally optimal value function the configuration agent 101 for the configuration states belonging to the subset the subset S(i) assigned to the computing node 102 . Thus, the configuration agent 101 upon receiving the globally optimal value function Q_{q *(i)} (s,a) computes the globally optimal strategy p ^*(s) for the configuration states belonging to the subset 5(0 assigned to the computing node 102 .

In an example embodiment, terminating the method is performed in response to satisfying the equation:

where d is predefined small parameter.

In an example embodiment, the method comprises a step of randomly initializing the target value function Q_q-(i) (s,a), in particular a step of randomly initializing the parameters q^_(i) of the target neural network 104. The method of Fig.3 and Fig.4 provides the advantages of:

- distributing the reinforcement learning across several computing node enabling to converge faster to the globally optimal value function,

- partitioning the configuration states set into non-overlapping subsets to reduce the number of possible configurations states to visit by each computing node and reducing the computational time,

- avoiding a central node for communicating with the plurality of computing nodes thus enhancing the robustness to failure.

In an example embodiment of the method, the set S of configuration states is partitioned into subsets S(i) using spectral clustering such that the configuration states s within a subset S(i) are highly connected and the configuration states s between different subset S(i) are sparsely connected. This allows to reduce the need of communication among different computing nodes 102, since it creates a sparse computing nodes' topology.

In an example embodiment of the method, the set S of configuration states is partitioned into subsets S(i) using random clustering by assigning configuration states s uniformly at random to the subsets S(i). In this case, the computing nodes' topology is more connected compared to that obtained by spectral clustering, but its diameter is reduced. Thanks to this, the updated value of the target value function updated by each computing node 102 is propagated quickly enough to all the other computing nodes.

In a preferred example embodiment of the method, the set S of configuration states is partitioned into box clusters using box clustering, each box cluster being a subset S(i) of configuration states assigned to a computing node 102,. In this example, for the configurable system 100, each configuration state S_t describes the value of the set of metrics m_q that are of interest for the configuration agent 101 defined as follow:

where Q is the number of possible metrics. Each metric m_j can take on values within the interval / such as:

The interval is divided in h sub-intervals as

A box cluster B is the Cartesian product of sub-intervals across all metrics such as:

In this example, the set S of configuration states is partitioned by assigning all states within a box cluster B to a single subset S(i), for all boxes

{B(h,

Fig. 5 depicts a functional representation of an example of a configurable system 500.

The configurable system 500 is a tandem of processors having two queues 502₁ et 502₂ in which a service rate m₁, m₂ of the queues 502₁ et 502₂ respectively is configurable. Each queue 502i, 502₂ has a number of jobs waiting respectively b₁ , b₂ representing a buffer 506, respectively a buffer 508 of queue 502₁, respectively queue 502₂. A current configuration state S_t of the configurable system 500 is defined as S_t = b₁, b₂ )_. Thus, a set S of configuration states depends on the possible lengths b₁, b₂ .

A globally optimal configuration state S_* of the configurable system 500 minimizes a linear combination of an increasing function of the queue length and of the service rate. This allows to reduce the response time of the configurable system 500, while keeping its running cost low. In order to set up the configurable system 500, the method of Fig. 3 and 4 can be implemented by determining an action to cause the configurable system 500 to transition to the globally optimal configuration state S_*, by determining the globally optimal value for the plurality of computing nodes 102. The action may be performed by a configuration agent interacting with the configurable system 500 and receiving the globally optimal value function from the plurality of computing nodes 102. The configuration agent takes, at each new job arrival, an action A_t consisting of the tuning of the queue service rates m₁, m₂ respectively of queues 502₁ and 502₂. A reward r_t for the configuration agent in a current configuration state S_t = (b₁, b₂ ) can be defined as follows:

where C is a positive pre-defined parameter.

An action A_t of the configuration agent causes the configurable system 500 to transition to three possible next configuration states S_t+1 as follows:

In other words, depending on whether a new job arrives in the configurable system 500, a job is served by queue 502i and routed to queue 502₂, or else a job is served by queue 502₂ and leaves the system. The configurable system 500 can transition from a current configuration state S_t to a next configuration state S_t+1 following direction 602.

The set S of configuration state may be partitioned into non-overlapping subsets S(i) and assigned to each computing node 102, as follows. As depicted in Fig. 6, is the maximum buffer length of buffers b₁ and b₂ . Thus, the configuration state of the configurable system 500 is defined by two metrics m ₁ = b₁ and m₂ = b₂ , each metric can take on a value between 0 and .

The set S of configuration states is partitioned using box clustering as follows. The interval of all buffer occupancy levels for queues 502_/ as l = 1,2 is

divided into h sub-intervals Then, a

box is the Cartesian product between two sub-intervals X

, which also defines a subsets S(i).

Each subset S(i) is assigned to a computing node 102, to perform the steps of Fig. 3. In particular, the emulator 110 of each computing node 102, computes, in step 306, the reward r_t according to equation (1) and the next configuration state S_t+1 according to equation (2), given the current configuration state S_t belonging to the subset S(i) and the action A_t selected in step 304.

The target value function of each computing node 102, can be computed according to steps of Fig. 4. According to the above partitioning, the step 406 of Fig. 4 is performed due to selecting an action of step 304 of Fig. 3 causing the configurable system 500 to transition to a next configuration state following direction 604. Therefore, the communication pattern between pairs of computing nodes 102, and 102j is determined by the partitioning of the set S of the configuration states. In fact, after assigning a subset S(i ) to each computing node 102 , two neighboring computing nodes 102 and 102_j needs to communicate whenever there exists a state transition following direction 604 between two computing nodes.

Figure 7 depicts the topology of communication of the computing nodes 102, resulting from the partitioning explained above, i.e. a 2-dimensional grid with unidirectional communication over the diagonal 702. Each computing node 102, performs step 406 of Fig. 4 for retrieving the target value function from five neighboring computing nodes (N,S,E,W,NW).

In another example embodiment, the configurable system is an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment. The antenna is configured to emit a plurality of beams, each beam having a direction and an intensity. The method of Fig. 3 and Fig. 4 may be implemented for setting the antenna by determining an action to cause the antenna to transition to an optimal configuration state.

A current configuration state of the antenna is defined by the number of beams and the action comprises selecting a beam of possible beams and adding it to the current configuration state. For example, each beam is characterized by a direction and an intensity of emission of said antenna.

The method improves setting the antennas of the suitable for wireless communication according to standard protocols of the fifth generation (5G) by: - improving the quality of the computed grids in terms of achieved throughput, given a fixed time budget for computations

- providing a solution of similar quality to the state of the art within a smaller amount of time.

While this disclosure includes references to illustrative embodiments, this specification is not intended to be construed in a limiting sense. Various modifications of the described embodiments, as well as other embodiments within the scope of the disclosure, which are apparent to persons skilled in the art to which the disclosure pertains are deemed to lie within the principle and scope of the disclosure, e.g., as expressed in the following claims.

The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.

Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term "implementation."

The described embodiments are to be considered in all respects as only illustrative and not restrictive. In particular, the scope of the disclosure is indicated by the appended claims rather than by the description and figures herein. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

A person of ordinary skill in the art would readily recognize that steps of various above-described methods can be performed by programmed computers. Herein, some embodiments are intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions where said instructions perform some or all of the steps of methods described herein. The program storage devices may be, e.g., digital memories, magnetic storage media such as a magnetic disks or tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of methods described herein.

The functions of the various elements shown in the figures, including any functional blocks labeled as "processors" and/or "controllers," may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

Claims

1. A method of reinforcement learning for setting a configurable system (100), wherein the method comprises in each of a plurality of computing nodes (102 ) repeatedly doing the steps of:

determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system (100), wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node (102 ), wherein the subsets of the set of configuration states assigned to the plurality of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states,

and wherein each computing node (102 ) comprises a fast-learning neural network (106) and the method further comprises, in each computing node (102 ), the steps of: selecting (402) a records sample from the stored records ,

improving (408) a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,

in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving (406) a target value function for said next configuration state from a neighboring computing node (102j) to which a subset of the set of configuration states comprising said next configuration state is assigned,

updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node, in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent (101) suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.

2. Method according to claim 1, wherein the set of configuration states is partitioned into box clusters using box clustering, each box cluster being a subset of the set of configuration states.

3. Method according to claim 1 or 2, wherein the step of determining the record comprises:

selecting (304) the action suitable for causing the configurable system to transition from a current configuration state to the next configuration state, and

emulating (306) the next configuration state corresponding to a response of the configurable system to the selected action in the current configuration state.

4. Method according to claim 3, wherein the action is randomly selected.

5. Method according to claim 3, wherein the action is selected to maximize the target value function in a current configuration state.

6. Method according to anyone of claims 1 to 5, wherein the records are stored in a buffer of the computing node and wherein the records sample is randomly selected from the records stored in the buffer of the computing node (102 ).

7. Method according to anyone of claims 1 to 6, wherein, in each computing node, the retrieving (406) of the target value function from a neighboring computing node comprises communicating with said neighboring computing node (102,) and copying the target value function of said neighboring computing node (102y) locally in the computing node (102 ).

8. Method according to anyone of claims 1 to 6, wherein, in each computing node, the retrieving of the target value function from a neighboring computing node comprises periodically communicating with said neighboring computing node and storing at each period the target value function of said neighboring computing node with a timestamp in a local memory of the computing node,

wherein the target value function is retrieved from said local memory in response to determining that a next configuration state of the records sample is outside the subset assigned to said computing node.

9. Method according to anyone of claims 1 to 8, wherein, in each computing node, the target value function is approximated by a target neural network (104) having target parameters.

10. Method according to anyone of claims 1 to 9, wherein the configurable system is an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment, said antenna being configured to emit a plurality of beams, each beam having a direction and an intensity.

11. Method according to claim 10, wherein the set of configuration states comprises possible beams having different directions and intensities and each configuration states comprises a predefined number of beams and the action comprises selecting a beam of the possible beams and adding it to the current configuration state.

12. A computing node (102 ) configured for setting a configurable system, the computing node (102 ) comprising a fast-learning neural network (106) and means for repeatedly performing:

determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node (102 ), wherein the computing node belongs to a plurality of computing nodes and wherein a plurality of non-overlapping subsets of the set of configuration states are assigned to the plurality of the computing nodes and compose partitions of the set of configuration states,

and wherein the computing node (102 ) further comprises means for performing: selecting (402) a records sample from the stored records , improving (408) a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration states and an action,

in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving (406) a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned,

in response to detecting that the target value function converges to the optimal value function as a globally optimal value function according to a predefined convergence criterion, returning the optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.

13. A computer program comprising executable code for causing a computing node repeatedly performing:

determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node , wherein the subsets of the set of configuration states assigned to of the plurality of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states,

wherein each computing node comprises a fast-learning neural network and the computer program further comprises executable code for causing said computing node performing:

selecting a records sample from the stored records ,

improving a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,

in response to determining that a next configuration state of the records sample is outside the subset of said computing node, retrieving a target value function of said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned, updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,

in response to detecting that the target value function converges to the optimal value function as a globally optimal value function according to a predefined convergence criterion, returning the optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the target value function.