
WO2020244735A1 - Method and apparatus for setting a configurable system - Google Patents

Method and apparatus for setting a configurable system

Info

Publication number
WO2020244735A1
WO2020244735A1 · PCT/EP2019/064380 · EP2019064380W
Authority
WO
WIPO (PCT)
Prior art keywords
value function
computing node
configuration state
configuration
target value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2019/064380
Other languages
French (fr)
Inventor
Lorenzo MAGGI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Solutions and Networks Oy
Original Assignee
Nokia Solutions and Networks Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Solutions and Networks Oy filed Critical Nokia Solutions and Networks Oy
Priority to PCT/EP2019/064380 priority Critical patent/WO2020244735A1/en
Publication of WO2020244735A1 publication Critical patent/WO2020244735A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0823Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/085Retrieval of network configuration; Tracking network configuration history

Definitions

  • the disclosure relates to methods and apparatuses for setting a configurable system, and in particular to methods and apparatuses that employ a reinforcement learning technique for setting a configurable system.
  • wireless communication devices present an exponential number of possible configurations dependent on various factors such as the number of users and the type of their equipment.
  • Traditional machine-learning approaches to setting wireless communication devices will be increasingly challenged as such devices evolve, and are insufficient to provide the optimal parametrization needed to ensure optimal network coverage and quality of transmission for a growing number of users and increasingly complex user equipment.
  • the disclosure provides a method of reinforcement learning for setting a configurable system, e.g. an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G).
  • the method may comprise in each of a plurality of computing nodes repeatedly doing the steps of determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state.
  • the current configuration state and the next configuration state may belong to a set of configuration states which can be visited by the configurable system and the current configuration state may belong to a subset of the set of configuration states assigned to the computing node.
  • the subsets of the set of configuration states assigned to the plurality of computing nodes may be non-overlapping subsets which compose a partition of the set of configuration states.
  • the step of determining the record may comprise:
  • each computing node comprises a fast-learning neural network.
  • the method may further comprise in each computing node, the steps of:
  • the target value function may represent a desirability of the configurable system as a function of a next configuration state and an action.
  • the method may further comprise in response to determining that a next configuration state contained in the records sample is outside the subset assigned to the computing node, retrieving a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned.
  • the method may further comprise the step of updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node.
  • the method may further comprise in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.
  • such method may comprise one or more of the features below.
  • the set of configuration states may be partitioned into subsets of the set of configuration states using various methods.
  • the set of configuration states may be partitioned into box clusters using box clustering, each box cluster being a subset of the set of configuration states.
  • the set of configuration states may be partitioned into subsets of configuration states using spectral clustering techniques or random clustering.
  • the action may be randomly selected, e.g. when the computing node decides to take an exploration action.
  • the action may be selected to maximize the target value function in a current configuration state, e.g. when the computing node decides to take an exploitation action.
  • the records may be stored in a buffer of the computing node and the records sample may be randomly selected from the records stored in the buffer of the computing node.
  • the retrieving of the current value function from a neighboring computing node may comprise communicating with said neighboring computing node and copying the current value function of said neighboring computing node locally in the computing node.
  • the retrieving of the current value function from a neighboring computing node may comprise periodically communicating with said neighboring computing node and storing at each period the current value function of said neighboring computing node with a timestamp in a local memory of the computing node.
  • the current value function of said neighboring computing node may be retrieved from the local memory in response to determining that a next configuration state of the records sample is outside the subset assigned to said computing node.
  • the target value function may be approximated by a target neural network having target parameters.
  • the configurable system may be any wireless communication system.
  • the configurable system may be an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment.
  • the antenna may be configured to emit a plurality of beams, each beam having a direction and an intensity.
  • the set of configuration states may comprise possible beams having different directions and intensities, and each configuration state comprises a predefined number of beams.
  • the action may comprise selecting a beam of the possible beams and adding it to the current configuration state.
  • the configurable system may be a resource-controlled system, e.g. data processing resources in a data processing system.
  • the set of configuration states may comprise a resource variation parameter such as RAM capacity, CPU processing power, CPU bandwidth, number of processor cores, number of virtual machines, disk I/O bandwidth and network I/O bandwidth.
  • the disclosure also provides a computer program comprising executable code that causes a computer to perform the steps of such method when executed.
  • the disclosure also provides a computing node configured for setting a configurable system, the computing node comprising a fast-learning neural network and means for repeatedly performing:
  • the computing node further comprises means for performing: selecting a records sample from the stored records,
  • wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,
  • updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,
  • and, in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the target value function.
  • Figure 1 is a functional representation of a configurable system interacting with a configuration agent that may execute methods according to embodiments of the invention in order to optimize the settings of the configurable system.
  • Figure 2 is a functional representation of a plurality of computing nodes that may emulate the configuration agent of Fig. 1, in accordance with an embodiment.
  • Figure 3 is an iterative process that may be performed by a computing node to fill a buffer of the computing node.
  • Figure 4 is an iterative process that may be performed by a computing node for updating a target function value of the computing node of Fig. 3.
  • Figure 5 is a functional representation of an example of a configurable system.
  • Figure 6 is a functional representation of partitioning a set of configuration states of the configurable system of figure 5.
  • Figure 7 is a functional representation of computing nodes implementing the process of figures 3 and 4 for setting the configurable system of figure 5.
  • the present disclosure relates to methods that distribute the computational burden across several computing nodes and avoid failure events.
  • An embodiment provides a method of reinforcement learning and a computing node for setting a configurable system.
  • a configurable system 100 is being interacted with by a configuration agent 101 at discrete time steps t.
  • the configuration agent 101 monitors a current configuration state S t of the configurable system 100 that represents, at a time step t, a current value of a set of metrics m q of the configurable system 100; q being an integer between 1 and the number of possible metrics.
  • the configuration agent 101 also takes, at each time step t, an action A t that causes the configurable system 100 to transition from the current configuration state S t to a next configuration state S t+1 .
  • the configuration agent receives a reward r t from the configurable system 100 that is a variable depending on both the action A t and the current configuration state S t .
  • the configurable system 100 can exhibit any current configuration state S t among a set S of configuration states.
  • a globally optimal strategy p * (s) of behaviour of the configuration agent 101 is determined by the method according to an embodiment.
  • the configuration agent 101 using the globally optimal strategy determines an action to cause the configurable system 100 to transition to an optimal configuration state S * ensuring the best parametrization possible of the configurable system 100.
  • the method comprises similar steps to be performed by each of a plurality of computing nodes 102 i, where i is an integer between 1 and n, n being the number of computing nodes 102.
  • Each computing node 102 i is configured to determine the globally optimal strategy p * (s) for the configuration agent 101, defined to maximize a globally optimal value function of said computing node 102 i.
  • the globally optimal value function
  • Each computing node 102 returns the globally optimal strategy p * (s) to the configuration agent 101 so the configuration agent 101 can select an action suitable for transitioning the configurable system 100 to the globally optimal configuration state S * .
  • Each computing node 102 comprises local computing resources such as a processor and a local memory.
  • the local computing resources are configured for implementing an emulator 110, a fast-learning neural network 106 and a target neural network 104.
  • the computing node 102 i is configured to learn a current value function Q θ⁻(i) (s,a) of said computing node 102 i by updating a fast-learning neural network 106.
  • the target value function Q θ⁻(i) (s,a) is approximated using the target neural network 104 having parameters θ⁻(i) that are determined by updating the fast-learning neural network 106, the parameters θ⁻(i) being the weights of the target neural network 104.
  • the emulator 110 is configured to emulate the configurable system 100 that the configuration agent 101 interacts with by determining a succession of configuration states S 1 , S 2 , etc. In each configuration state S t , the emulator 110 accepts as input an action A t that simulates the behavior of the configuration agent 101 and it produces the corresponding reward r t received from the configurable system 100. The emulator 110 also determines the next configuration state S t+1 to which the configurable system 100 transitions upon being interacted with by the configuration agent 101 performing the action A t .
  • Each computing node 102 i is also configured to communicate with at least a neighboring computing node 102 j through a communication channel 108.
  • Each computing node 102 i is assigned a subset S(i) of the set S of configuration states.
  • the assembly of the subsets S(i) assigned to each of the computing nodes composes the set S of configuration states and the subsets S(i) are non-overlapping.
  • none of the plurality of computing nodes shares a current configuration state S t with another computing node.
  • the subsets S(i) satisfy the following equations: S(i) ∩ S(j) = ∅ and the union of the subsets S(i), for i = 1..n, equals S,
  • where i, j refer to two distinct computing nodes 102 i, 102 j.
  • Each computing node 102 i comprises, in its local memory, a table mapping each current configuration state S t to the subset S(k) it belongs to, where 102 k refers to the computing node 102 i or a neighboring computing node 102 j.
  • the table may be a hash function h such that h(S t) = k when S t belongs to S(k).
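The table mapping configuration states to the subsets they belong to can be realized, for instance, as a plain dictionary; the sketch below is only illustrative, assuming integer-encoded states and a fixed three-way partition (the names `PARTITION`, `h` and `owner_node` are not from the patent).

```python
# Illustrative partition of the set S of configuration states into
# non-overlapping subsets S(k), one per computing node (an assumption).
PARTITION = {
    0: {0, 1, 2},   # subset S(0), assigned to computing node 0
    1: {3, 4, 5},   # subset S(1), assigned to computing node 1
    2: {6, 7, 8},   # subset S(2), assigned to computing node 2
}

# Flatten the partition into a hash table h: state -> subset index,
# so a node can locate the owner of any configuration state in O(1).
h = {s: k for k, subset in PARTITION.items() for s in subset}

def owner_node(state):
    """Return the index k of the computing node whose subset S(k) contains state."""
    return h[state]
```

Because the subsets are non-overlapping, each state maps to exactly one node, which is what makes the O(1) dictionary lookup well-defined.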
  • Each computing node 102 i is configured to perform the steps depicted in Fig. 3, in accordance with an embodiment of the method.
  • the method comprises, for each computing node 102 i of the plurality of computing nodes, selecting an initial current configuration state S 0 comprised in the subset S(i) assigned to the computing node 102 i.
  • the initial current configuration state S 0 may be randomly selected from the subset S(i) of the set S of configuration states.
  • the method also comprises, in each computing node 102 i, m iterations of the following steps:
  • Step 304 selecting an action A t ,
  • Step 306 emulating a next configuration state S t+1 and a reward r t for a current configuration state S t and the action A t , and
  • Step 308 storing a record comprising (S t , A t , r t , S t+1 ) in a buffer B(i) of the computing node 102 i.
  • the above iteration may be terminated in response to encountering a current configuration state S t exiting the subset S(i).
  • the action A t in step 304 may be selected randomly or to maximize the target value function Q θ⁻(i) (s,a) of the computing node 102 i in a current configuration state S t .
  • the action A t is selected using an ε-greedy rule: with probability ε, where ε is a predefined probability, a random exploration action is selected; otherwise, the action maximizing the target value function Q θ⁻(i) (S t , a) is selected.
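The iteration of steps 302–308 can be sketched as follows; the state encoding, the toy emulator and the function names are illustrative assumptions, not the patent's definitions.

```python
import random

def epsilon_greedy(state, actions, q, eps=0.1):
    """Step 304: with probability eps explore (random action),
    otherwise exploit (action maximizing the target value function q)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q(state, a))

def fill_buffer(s0, subset, actions, q, emulate, m=100, eps=0.1):
    """Steps 302-308: iterate from the initial state s0, storing records
    (S_t, A_t, r_t, S_t+1) in the buffer B(i); terminate early when the
    current state exits the subset S(i) assigned to the node."""
    buffer, s = [], s0
    for _ in range(m):
        if s not in subset:               # state exited S(i): stop iterating
            break
        a = epsilon_greedy(s, actions, q, eps)
        r, s_next = emulate(s, a)         # step 306: emulator returns reward, next state
        buffer.append((s, a, r, s_next))  # step 308: store the record
        s = s_next
    return buffer
```

The early-exit check mirrors the termination condition described above: the iteration stops once the current configuration state leaves the subset assigned to the node.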
  • the step 306 is performed by the emulator 110 of the computing node 102 i.
  • the buffer B(i) comprises the records (S t , A t , r t , S t+1 ) for t comprised between 0 and m.
  • the method further comprises, in each computing node 102 i, after the steps described above, a step 402 of sampling a records sample B'(i) from the buffer B(i).
  • the method further comprises a step 404 of determining, for all S t+1 comprised in the records sample B'(i), the computing node 102 k such that S t+1 is comprised in the subset S(k) assigned to the computing node 102 k.
  • the computing node 102 k is determined using the table, in particular the hash function h, mapping each current configuration state S t to the subset S(k) it belongs to.
  • the method comprises a step 406 of retrieving the current value function Q θ⁻(j) (s,a) of the neighboring computing node 102 j for (A t , S t+1 ) of the records sample B'(i).
  • the step 406 of retrieving the target value function Q θ⁻(j) (s,a) is performed by:
  • the method comprises, in each computing node 102 i, periodically communicating with neighboring computing nodes 102 j and storing, at each period, the value of the target neural network 104 parameters θ⁻(j) of each neighboring computing node 102 j with a timestamp in the local memory of the computing node 102 i.
  • the step 406 of retrieving the target value function Q θ⁻(j) (s,a) is performed by retrieving the value of the parameters θ⁻(j) from the local memory in response to determining that the timestamp of said value of the parameters θ⁻(j) satisfies a recency criterion. Otherwise, the value of the target neural network 104 parameters θ⁻(j) is directly retrieved from the neighboring computing node 102 j. Then, the step 406 comprises computing the target value function Q θ⁻(j) (s,a) using the target neural network 104 of the computing node 102 i and the value of the parameters θ⁻(j).
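The timestamped caching of neighbor parameters with a recency check can be sketched as below; the cache structure, the age-based form of the recency criterion and the `fetch_remote` callback are assumptions made for illustration.

```python
import time

class NeighborCache:
    """Stores the target-network parameters of each neighboring node
    together with the timestamp of their last retrieval."""
    def __init__(self, max_age_s=5.0):
        self.max_age_s = max_age_s      # recency criterion (assumed form: max age)
        self._store = {}                # node id -> (timestamp, parameters)

    def put(self, node_id, params, now=None):
        self._store[node_id] = (now if now is not None else time.time(), params)

    def get(self, node_id, fetch_remote, now=None):
        """Return cached parameters if the timestamp satisfies the recency
        criterion; otherwise retrieve them directly from the neighboring
        node via fetch_remote and refresh the cache."""
        now = now if now is not None else time.time()
        entry = self._store.get(node_id)
        if entry is not None and now - entry[0] <= self.max_age_s:
            return entry[1]             # fresh enough: serve from local memory
        params = fetch_remote(node_id)  # stale or missing: direct retrieval
        self.put(node_id, params, now)
        return params
```

Serving stale-but-recent parameters from local memory is what sparsifies the communication described in the next bullet: a neighbor is contacted only when the cached copy fails the recency test.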
  • This embodiment sparsifies the communication among neighboring computing nodes and reduces communication costs.
  • limiting the retrieval frequency may, in some cases, improve the convergence time of learning the optimal value function Q θ*(i) (s,a) and reduce the amount of traffic exchanged among processors, thus decreasing the risk of congestion.
  • the method further comprises a step 408 of computing the current value function Q θ⁻(i) (s,a) of the computing node 102 i for (A t , S t+1 ) of the records sample B'(i).
  • the method further comprises, in each computing node 102 i, a step 410 of updating the target value function Q θ⁻(i) (s,a).
  • the step 410 comprises a step of updating the fast-learning neural network 106 based on the current value function Q θ⁻(i) (s,a) computed in step 408 and/or the target value function Q θ⁻(j) (s,a) retrieved in step 406.
  • the step 410 further comprises determining an optimal value function Q θ*(i) (s,a) for the fast-learning neural network 106, where θ*(i) are parameters of the fast-learning neural network 106.
  • the step 410 comprises resolving an equation of the form: θ*(i) = argmin over θ(i) of the sum, over the records (S t , A t , r t , S t+1 ) of B'(i), of [ r t + γ · max over a' of Q θ⁻(k) (S t+1 , a') − Q θ(i) (S t , A t ) ]², where k = h(S t+1 ) and γ is a discount factor,
  • where Q θ(i) (s,a) is a variable value function approximated by neural network parameters θ(i).
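The target side of such an update can be sketched as follows; the tabular value functions, the action set and the discount factor γ are illustrative assumptions, and the neighbor lookup `owner` stands in for the hash function h described above.

```python
def compute_targets(sample, actions, q_of_node, owner, gamma=0.9):
    """For each record (s, a, r, s_next), compute the Q-learning target
    r + gamma * max_a' Q_theta-(k)(s_next, a'), where k = owner(s_next)
    is the node whose subset contains s_next; the value function is
    retrieved from a neighbor when k differs from the local node."""
    targets = []
    for s, a, r, s_next in sample:
        q = q_of_node[owner(s_next)]   # local or neighboring target value function
        targets.append(r + gamma * max(q(s_next, ap) for ap in actions))
    return targets
```

The fast-learning network is then regressed toward these targets, which is the standard way the minimization over θ(i) is carried out in practice.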
  • the step 410 further comprises updating the target value function Q θ⁻(i) (s,a) as the optimal value function Q θ*(i) (s,a), in particular updating the target neural network 104 parameters θ⁻(i) as the optimal parameters θ*(i), and reiterating the method from step 302.
  • the step 410 also comprises a step of terminating the method and returning to the configuration agent 101 the optimal value function Q θ*(i) (s,a), in particular the optimal parameters θ*(i), in response to detecting that the current value function Q θ⁻(i) (s,a) converges to the optimal value function Q θ*(i) (s,a) according to a predefined convergence criterion.
  • the computing node 102 i returns the optimal value function Q θ*(i) (s,a) as the globally optimal value function to the configuration agent 101 for the configuration states belonging to the subset S(i) assigned to the computing node 102 i.
  • the configuration agent 101, upon receiving the globally optimal value function Q θ*(i) (s,a), computes the globally optimal strategy p * (s) for the configuration states belonging to the subset S(i) assigned to the computing node 102 i.
  • terminating the method is performed in response to satisfying the equation:
  • the method comprises a step of randomly initializing the target value function Q θ⁻(i) (s,a), in particular a step of randomly initializing the parameters θ⁻(i) of the target neural network 104.
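The convergence criterion itself is not reproduced in this text; one plausible form, assumed here purely for illustration, compares the target parameters θ⁻(i) with the optimal parameters θ*(i) under a small tolerance.

```python
def has_converged(theta_target, theta_opt, tol=1e-3):
    """Assumed convergence test: every component of the target
    parameters is within tol of the corresponding optimal parameter."""
    return all(abs(t - o) <= tol for t, o in zip(theta_target, theta_opt))
```

Under this form, the node keeps iterating (copying θ*(i) into θ⁻(i)) until successive parameter vectors stop moving, at which point the optimal value function is returned to the configuration agent.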
  • the method of Fig.3 and Fig.4 provides the advantages of:
  • the set S of configuration states is partitioned into subsets S(i) using spectral clustering such that the configuration states s within a subset S(i) are highly connected and the configuration states s between different subsets S(i) are sparsely connected. This reduces the need for communication among different computing nodes 102 i, since it creates a sparse computing-node topology.
  • the set S of configuration states is partitioned into subsets S(i) using random clustering by assigning configuration states s uniformly at random to the subsets S(i).
  • the computing nodes' topology is more connected compared to that obtained by spectral clustering, but its diameter is smaller. Thanks to this, the target value function updated by each computing node 102 i is propagated quickly enough to all the other computing nodes.
  • the set S of configuration states is partitioned into box clusters using box clustering, each box cluster being a subset S(i) of configuration states assigned to a computing node 102 i.
  • each configuration state S t describes the value of the set of metrics m q that are of interest for the configuration agent 101, defined as follows:
  • each metric m j can take on values within an interval I such that:
  • the interval I is divided into h sub-intervals, and
  • a box cluster B is the Cartesian product of sub-intervals across all metrics, such that:
  • the set S of configuration states is partitioned by assigning all states within a box cluster B to a single subset S(i), for all box clusters B.
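Box clustering as described above can be sketched as follows: each metric's interval is cut into h equal sub-intervals, and a state is assigned to the box cluster given by the tuple of sub-interval indices its metric values fall into. The interval bounds and the value of h below are illustrative assumptions.

```python
def box_index(value, lo, hi, h):
    """Index (0..h-1) of the sub-interval that value falls into,
    after dividing [lo, hi] into h equal sub-intervals."""
    idx = int((value - lo) / (hi - lo) * h)
    return min(idx, h - 1)              # clamp the upper bound into the last box

def box_cluster(state, bounds, h):
    """A state is a tuple of metric values; its box cluster B is the
    tuple of per-metric sub-interval indices (one Cartesian-product cell)."""
    return tuple(box_index(v, lo, hi, h) for v, (lo, hi) in zip(state, bounds))
```

All states sharing the same box-cluster tuple are then assigned to the same subset S(i), and hence to the same computing node.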
  • Fig. 5 depicts a functional representation of an example of a configurable system 500.
  • the configurable system 500 is a tandem of processors having two queues 502 1 and 502 2 , in which the service rates m 1 , m 2 of the queues 502 1 and 502 2 , respectively, are configurable.
  • Each queue 502 1 , 502 2 has a number of waiting jobs b 1 , b 2 , held respectively in a buffer 506 of queue 502 1 and a buffer 508 of queue 502 2 .
  • the set S of configuration states depends on the possible queue lengths b 1 , b 2 .
  • a globally optimal configuration state S * of the configurable system 500 minimizes a linear combination of an increasing function of the queue length and of the service rate. This reduces the response time of the configurable system 500 while keeping its running cost low.
  • the method of Fig. 3 and 4 can be implemented by determining an action to cause the configurable system 500 to transition to the globally optimal configuration state S * , by determining the globally optimal value for the plurality of computing nodes 102.
  • the action may be performed by a configuration agent interacting with the configurable system 500 and receiving the globally optimal value function from the plurality of computing nodes 102.
  • the configuration agent takes, at each new job arrival, an action A t consisting of the tuning of the queue service rates m 1 , m 2 respectively of queues 502 1 and 502 2 .
  • C is a positive pre-defined parameter
  • An action A t of the configuration agent causes the configurable system 500 to transition to three possible next configuration states S t+1 as follows:
  • the configurable system 500 can transition from a current configuration state S t to a next configuration state S t+1 following direction 602.
  • the set S of configuration states is partitioned using box clustering as follows.
  • Each subset S(i) is assigned to a computing node 102, to perform the steps of Fig. 3.
  • the emulator 110 of each computing node 102 computes, in step 306, the reward r t according to equation (1) and the next configuration state S t+1 according to equation (2), given the current configuration state S t belonging to the subset S(i) and the action A t selected in step 304.
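Since equations (1) and (2) are not reproduced in this text, the emulator below is only a plausible sketch of the tandem-queue dynamics: the reward penalizes a linear combination of the queue lengths and the service rates (through the positive parameter C), and a transition event moves a job into queue 1, from queue 1 to queue 2, or out of queue 2. All specifics are assumptions.

```python
def reward(b1, b2, mu1, mu2, C=1.0):
    """Assumed form of the reward: the negative of a linear combination
    of the queue lengths and of the service rates, so that maximizing
    the reward keeps queues short while keeping the running cost low."""
    return -((b1 + b2) + C * (mu1 + mu2))

def transition(b1, b2, event):
    """Assumed transition events for the tandem of queues:
    'arrival'  -> a new job joins queue 1,
    'service1' -> a job moves from queue 1 to queue 2,
    'service2' -> a job leaves queue 2."""
    if event == "arrival":
        return b1 + 1, b2
    if event == "service1" and b1 > 0:
        return b1 - 1, b2 + 1
    if event == "service2" and b2 > 0:
        return b1, b2 - 1
    return b1, b2                       # no change when the served queue is empty
```

An emulator of this form is what each computing node runs in step 306 to produce the reward and next configuration state for the records buffer.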
  • the target value function of each computing node 102 can be computed according to steps of Fig. 4.
  • the step 406 of Fig. 4 is performed when selecting an action in step 304 of Fig. 3 causes the configurable system 500 to transition to a next configuration state following direction 604. Therefore, the communication pattern between pairs of computing nodes 102 i and 102 j is determined by the partitioning of the set S of the configuration states. In fact, after assigning a subset S(i) to each computing node 102 i, two neighboring computing nodes 102 i and 102 j need to communicate whenever there exists a state transition following direction 604 between their subsets.
  • Figure 7 depicts the topology of communication of the computing nodes 102 i resulting from the partitioning explained above, i.e. a 2-dimensional grid with unidirectional communication over the diagonal 702.
  • Each computing node 102 performs step 406 of Fig. 4 for retrieving the target value function from five neighboring computing nodes (N,S,E,W,NW).
  • the configurable system is an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment.
  • the antenna is configured to emit a plurality of beams, each beam having a direction and an intensity.
  • the method of Fig. 3 and Fig. 4 may be implemented for setting the antenna by determining an action to cause the antenna to transition to an optimal configuration state.
  • a current configuration state of the antenna is defined by the number of beams and the action comprises selecting a beam of possible beams and adding it to the current configuration state.
  • each beam is characterized by a direction and an intensity of emission of said antenna.
  • the method improves setting antennas suitable for wireless communication according to standard protocols of the fifth generation (5G) by: - improving the quality of the computed grids in terms of achieved throughput, given a fixed time budget for computations
  • The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
  • The embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine- or computer-readable and encode machine-executable or computer-executable programs of instructions, where said instructions perform some or all of the steps of methods described herein.
  • the program storage devices may be, e.g., digital memories, magnetic storage media such as a magnetic disks or tapes, hard drives, or optically readable digital data storage media.
  • the embodiments are also intended to cover computers programmed to perform said steps of methods described herein.
  • processors may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
  • the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
  • The terms "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage.
  • Other hardware, conventional and/or custom, may also be included.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A method of reinforcement learning for setting a configurable system, e.g. an antenna of the 5G type. The method may comprise in each of a plurality of computing nodes repeatedly doing the steps of selecting (304) an action suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node, wherein the subsets of the set of configuration states assigned to each of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states.

Description

METHOD AND APPARATUS FOR SETTING A CONFIGURABLE SYSTEM
Field
The disclosure relates to methods and apparatuses for setting a configurable system, and in particular to methods and apparatuses that employ a reinforcement learning technique for setting a configurable system.
Background
Growing complexity in the hyper-connected society is a major challenge for configuring wireless communication devices. Indeed, such wireless communication devices present an exponential number of possible configurations dependent on various factors such as the number of users and the type of their equipment. Traditional machine-learning approaches to setting wireless communication devices will be increasingly challenged as such devices evolve, and are insufficient to provide the optimal parametrization needed to ensure optimal network coverage and quality of transmission for a growing number of users and increasingly complex user equipment.
Summary
In some example embodiments the disclosure provides a method of reinforcement learning for setting a configurable system, e.g. an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G).
The method may comprise, in each of a plurality of computing nodes, repeatedly performing the steps of determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from the current configuration state to the next configuration state. The current configuration state and the next configuration state may belong to a set of configuration states which can be visited by the configurable system and the current configuration state may belong to a subset of the set of configuration states assigned to the computing node. The subsets of the set of configuration states assigned to the plurality of computing nodes may be non-overlapping subsets which compose a partition of the set of configuration states.
In an embodiment, the step of determining the record may comprise:
- selecting (304) the action suitable for causing the configurable system to transition from a current configuration state to the next configuration state,
- emulating the next configuration state corresponding to a response of the configurable system to the selected action in the current configuration state.
In an embodiment, each computing node comprises a fast-learning neural network. The method may further comprise in each computing node, the steps of:
- selecting a records sample from the stored records,
- improving a target value function for one or more actions and one or more next configuration states contained in the records sample. The target value function may represent a desirability of the configurable system as a function of a next configuration state and an action.
The method may further comprise in response to determining that a next configuration state contained in the records sample is outside the subset assigned to the computing node, retrieving a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned.
The method may further comprise the step of updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node.
The method may further comprise in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.
In some example embodiments, such method may comprise one or more of the features below.
The set of configuration states may be partitioned into subsets of the set of configuration states using various methods. In an example embodiment, the set of configuration states may be partitioned into box clusters using box clustering, each box cluster being a subset of the set of configuration states. In an alternative embodiment, the set of configuration states may be partitioned into subsets of configuration states using spectral clustering techniques or random clustering.
The action may be randomly selected, e.g. when the computing node decides to take an exploration action. The action may be selected to maximize the target value function in a current configuration state, e.g. when the computing node decides to take an exploitation action.
In an example embodiment, the records may be stored in a buffer of the computing node and the records sample may be randomly selected from the records stored in the buffer of the computing node.
In an example embodiment, in each computing node, the retrieving of the current value function from a neighboring computing node may comprise communicating with said neighboring computing node and copying the current value function of said neighboring computing node locally in the computing node.
In another example embodiment, in each computing node, the retrieving of the current value function from a neighboring computing node may comprise periodically communicating with said neighboring computing node and storing at each period the current value function of said neighboring computing node with a timestamp in a local memory of the computing node. The current value function of said neighboring computing node may be retrieved from the local memory in response to determining that a next configuration state of the records sample is outside the subset assigned to said computing node.

In an example embodiment, in each computing node, the target value function may be approximated by a target neural network having target parameters.
The configurable system may be any wireless communication system. In some embodiments, the configurable system may be an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment. The antenna may be configured to emit a plurality of beams, each beam having a direction and an intensity.
In an example embodiment, the set of configuration states may comprise possible beams having different directions and intensities and each configuration state comprises a predefined number of beams. The action may comprise selecting a beam of the possible beams and adding it to the current configuration state.
In some embodiments, the configurable system may be a resource-controlled system, e.g. data processing resources in a data processing system, and the set of configuration states may comprise a resource variation parameter such as RAM capacity, CPU processing power, CPU bandwidth, number of processor cores, number of virtual machines, disk I/O bandwidth and network I/O bandwidth.
In example embodiments, the disclosure also provides a computer program comprising executable code that causes a computer to perform the steps of such method when executed.
In some example embodiments, the disclosure also provides a computing node configured for setting a configurable system, the computing node comprising a fast-learning neural network and means for repeatedly performing:
determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is an action suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node, wherein the computing node belongs to a plurality of computing nodes and wherein a plurality of non-overlapping subsets of the set of configuration states are assigned to the plurality of the computing nodes and compose a partition of the set of configuration states,
and wherein the computing node further comprises means for performing: selecting a records sample from the stored records,
improving a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,
in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned,
updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,
in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the target value function.
Brief description of the drawings
These and other aspects of the invention will be apparent from and elucidated with reference to example embodiments described hereinafter, by way of example, with reference to the drawings.

Figure 1 is a functional representation of a configurable system interacting with a configuration agent that may execute methods according to embodiments of the invention in order to optimize the settings of the configurable system.
Figure 2 is a functional representation of a plurality of computing nodes that may emulate the configuration agent of Fig. 1, in accordance with an embodiment.
Figure 3 is an iterative process that may be performed by a computing node to fill a buffer of the computing node.
Figure 4 is an iterative process that may be performed by a computing node for updating a target function value of the computing node of Fig. 3.
Figure 5 is a functional representation of an example of a configurable system.
Figure 6 is a functional representation of partitioning a set of configuration states of the configurable system of figure 5.
Figure 7 is a functional representation of computing nodes implementing the process of figures 3 and 4 for setting the configurable system of figure 5.
Detailed description of the embodiments
One of the challenges that reinforcement learning approaches face when determining the best parametrization of a configurable system concerns the exponential number of possible parameters. The present disclosure relates to methods that distribute the computational burden across several computing nodes and avoid failure events.
An embodiment provides a method of reinforcement learning and a computing node for setting a configurable system.
As depicted in Fig. 1, a configurable system 100 is interacted with by a configuration agent 101 at discrete time steps t. At each time step t, the configuration agent 101 monitors a current configuration state St of the configurable system 100 that represents, at the time step t, current values of a set of metrics mq of the configurable system 100, q being an integer between 1 and the number of possible metrics. The configuration agent 101 also takes, at each time step t, an action At that causes the configurable system 100 to transition from the current configuration state St to a next configuration state St+1. Upon taking the action At, the configuration agent receives a reward rt from the configurable system 100 that is a variable depending on both the action At and the current configuration state St.
At each time step t, the configurable system 100 can exhibit any current configuration state St among a set S of configuration states.
In order to set the configurable system 100, a globally optimal strategy π*(s) of behaviour of the configuration agent 101 is determined by the method according to an embodiment. The configuration agent 101, using the globally optimal strategy, determines an action to cause the configurable system 100 to transition to an optimal configuration state S* ensuring the best parametrization possible of the configurable system 100.
As depicted in Fig. 2, the method comprises similar steps to be performed by each of a plurality of computing nodes 102ᵢ, where i is an integer comprised between 1 and n, n being the number of computing nodes 102ᵢ. Each computing node 102ᵢ is configured to determine the globally optimal strategy π*(s) for the configuration agent 101, defined so as to maximize a globally optimal value function of said computing node 102ᵢ. The globally optimal value function

Q*(s, a)

represents a desirability of the configurable system as a function of one or more configuration states s and one or more actions a. In particular, the globally optimal strategy π*(s) satisfies:

π*(s) = argmaxₐ Q*(s, a).

Each computing node 102ᵢ returns the globally optimal strategy π*(s) to the configuration agent 101 so that the configuration agent 101 can select an action suitable for transitioning the configurable system 100 to the globally optimal configuration state S*.
Each computing node 102ᵢ comprises local computing resources such as a processor and a local memory. These computing resources are configured for implementing an emulator 110, a fast-learning neural network 106 and a target neural network 104.

The computing node 102ᵢ is configured to learn a target value function Qθ⁻(i)(s, a) of said computing node 102ᵢ by updating the fast-learning neural network 106. In particular, the target value function Qθ⁻(i)(s, a) is approximated using the target neural network 104 having parameters θ⁻(i) that are determined by updating the fast-learning neural network 106, the parameters θ⁻(i) being weights of the target neural network 104.
The emulator 110 is configured to emulate the configurable system 100 that the configuration agent 101 interacts with by determining a succession of configuration states S1, S2, etc. In each configuration state St, the emulator 110 accepts as input an action At that simulates the behavior of the configuration agent 101 and it produces the corresponding reward rt received from the configurable system 100. The emulator 110 also determines the next configuration state St+1 to which the configurable system 100 transitions upon being interacted with by the configuration agent 101 performing the action At.
Each computing node 102ᵢ is also configured to communicate with at least a neighboring computing node 102ⱼ through a communication channel 108.
Each computing node 102ᵢ is assigned a subset S(i) of the set S of configuration states. The union of the subsets S(i) assigned to the computing nodes composes the set S of configuration states and the subsets S(i) are non-overlapping. In other words, none of the plurality of computing nodes shares a current configuration state St with another computing node. The subsets S(i) satisfy the following equations:

S(1) ∪ S(2) ∪ … ∪ S(n) = S and S(i) ∩ S(j) = ∅,

where i, j refer to two distinct computing nodes 102ᵢ, 102ⱼ.

Each computing node 102ᵢ comprises, in its local memory, a table mapping each current configuration state St to the subset S(k) it belongs to, where 102ₖ refers to the computing node 102ᵢ or the neighboring computing node 102ⱼ. The table may be a hash function h such that:

h: S → {1, …, n}, h(s) = k if and only if s ∈ S(k).
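By way of non-limiting illustration, the partitioning of the set S into non-overlapping subsets and the table h may be sketched in Python as follows; random clustering is used here and all function and variable names are illustrative, not part of the embodiments above:

```python
import random

def partition_states(states, n_nodes, seed=0):
    """Randomly partition the state set into n_nodes non-overlapping
    subsets S(1), ..., S(n) whose union is the whole set."""
    rng = random.Random(seed)
    shuffled = list(states)
    rng.shuffle(shuffled)
    subsets = [set() for _ in range(n_nodes)]
    for k, s in enumerate(shuffled):
        subsets[k % n_nodes].add(s)
    return subsets

def make_state_to_node(subsets):
    """Build the table h mapping each state to the index of the subset
    (and hence the computing node) it belongs to."""
    h = {}
    for i, subset in enumerate(subsets):
        for s in subset:
            h[s] = i
    return h

states = [(b1, b2) for b1 in range(4) for b2 in range(4)]
subsets = partition_states(states, n_nodes=4)
h = make_state_to_node(subsets)
```

Any other clustering method (spectral or box clustering) may replace the random assignment while keeping the same mapping table h.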
Each computing node 102ᵢ is configured to perform the steps depicted in Fig. 3, in accordance with an embodiment of the method.

In step 302, the method comprises, for each computing node 102ᵢ from the plurality of computing nodes, selecting an initial current configuration state S0 comprised in the subset S(i) assigned to the computing node 102ᵢ. The initial current configuration state S0 may be randomly selected from the subset S(i) of the set S of configuration states.

The method also comprises, in each computing node 102ᵢ, m iterations of the following steps:
Step 304: selecting an action At,
Step 306: emulating a next configuration state St+1 and a reward rt for a current configuration state St and the action At, and
Step 308: storing a record comprising (St, At, rt, St+1) in a buffer B(i) of the computing node 102ᵢ.
The above iteration may be terminated in response to encountering a current configuration state St exiting the subset S(i), in other words a current configuration state St belonging to a subset S(j) of a neighboring computing node 102ⱼ, i.e. St ∉ S(i).
In an example embodiment of the method, the action At in step 304 may be selected randomly or to maximize the target value function Qθ⁻(i)(s, a) of the computing node 102ᵢ in a current configuration state St.

In an example embodiment, the action At is selected as follows:

At = argmaxₐ Qθ⁻(i)(St, a) with probability 1 − ε, and At is a randomly selected action with probability ε,

where ε is a predefined probability.
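The ε-greedy selection of step 304 may, for example, be sketched as follows; the tabular dictionary standing in for the target value function is an illustrative simplification:

```python
import random

def select_action(q_values, state, actions, epsilon, rng=random):
    """Epsilon-greedy selection: with probability epsilon take a random
    exploration action, otherwise exploit by taking the action that
    maximizes the current value estimate for the given state."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

# Toy tabular value estimate standing in for the target value function.
q = {(0, "left"): 0.2, (0, "right"): 0.9}
greedy = select_action(q, 0, ["left", "right"], epsilon=0.0)   # exploits
explore = select_action(q, 0, ["left", "right"], epsilon=1.0)  # explores
```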
In an example embodiment, the step 306 is performed by the emulator 110 of the computing node 102ᵢ. In an example embodiment, after performing m times the steps 304 to 308 in the computing node 102ᵢ, the buffer B(i) comprises the records (St, At, rt, St+1) for t comprised between 0 and m.
As depicted in Fig. 4, the method further comprises, in each computing node 102ᵢ after the steps described above, a step 402 of sampling a records sample B′(i) from the buffer B(i). The method further comprises a step 404 of determining, for all St+1 comprised in the records sample B′(i), the computing node 102ₖ such that St+1 is comprised in the subset S(k) assigned to the corresponding computing node 102ₖ.

In an example embodiment, the computing node 102ₖ is determined using the table, in particular the hash function h, mapping each current configuration state St to the subset S(k) it belongs to.
In response to determining at step 404 that a configuration state St+1 is comprised in a subset S(k) belonging to a neighboring computing node 102ⱼ, i.e. k = j, the method comprises a step 406 of retrieving the target value function Qθ⁻(j)(s, a) of the neighboring computing node 102ⱼ for (At, St+1) of the records sample B′(i).
In an example embodiment, the step 406 of retrieving the target value function Qθ⁻(j)(s, a) is performed by:

communicating with the neighboring computing node 102ⱼ,

copying the values of the target neural network 104 parameters θ⁻(j) of the neighboring computing node 102ⱼ locally in the computing node 102ᵢ, and

computing the target value function Qθ⁻(j)(s, a) for (At, St+1) using the target neural network 104 of the computing node 102ᵢ and the retrieved values of parameters θ⁻(j).
Alternatively, the step 406 of retrieving the target value function Qθ⁻(j)(s, a) is performed by:

communicating with the neighboring computing node 102ⱼ, and

copying in the computing node 102ᵢ the target value function Qθ⁻(j)(s, a) for (At, St+1) computed by the neighboring computing node 102ⱼ.

In another example embodiment, the method comprises, in each computing node 102ᵢ, periodically communicating with the neighboring computing nodes 102ⱼ and storing at each period the values of the target neural network 104 parameters θ⁻(j) of each neighboring computing node 102ⱼ with a timestamp in the local memory of the computing node 102ᵢ. In this case, the step 406 of retrieving the target value function Qθ⁻(j)(s, a) is performed by retrieving the values of the parameters θ⁻(j) from the local memory in response to determining that the timestamp of said values satisfies a recency criterion. Otherwise, the values of the target neural network 104 parameters θ⁻(j) are directly retrieved from the neighboring computing node 102ⱼ. Then, the step 406 comprises computing the target value function Qθ⁻(j)(s, a) using the target neural network 104 of the computing node 102ᵢ and the values of the parameters θ⁻(j). This embodiment sparsifies the communication among neighboring computing nodes and reduces communication costs. Moreover, limiting the retrieval frequency may in some cases improve the convergence time of learning the optimal value function and reduce the amount of traffic exchanged among processors, thus decreasing the risk of congestion.
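The timestamped local cache of neighboring parameters may be sketched as follows; the class name, the recency criterion max_age and the fetch callback are illustrative assumptions of this sketch:

```python
import time

class NeighborParameterCache:
    """Locally cache the target-network parameters of neighboring nodes,
    refreshing a neighbor's entry only when its stored timestamp fails
    the recency criterion (older than max_age seconds)."""

    def __init__(self, fetch_fn, max_age):
        self.fetch_fn = fetch_fn   # fetch_fn(j) -> parameters of node j
        self.max_age = max_age
        self.cache = {}            # j -> (timestamp, parameters)

    def get(self, j, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(j)
        if entry is not None and now - entry[0] <= self.max_age:
            return entry[1]        # recent enough: use the local copy
        params = self.fetch_fn(j)  # otherwise fetch from the neighbor
        self.cache[j] = (now, params)
        return params

calls = []
def fetch(j):
    calls.append(j)
    return {"theta": j}

cache = NeighborParameterCache(fetch, max_age=10.0)
p1 = cache.get(2, now=0.0)    # miss: fetches from neighbor 2
p2 = cache.get(2, now=5.0)    # hit: served from the local cache
p3 = cache.get(2, now=20.0)   # stale: fetches again
```

In this sketch only the first and third calls reach the neighbor, illustrating the sparsified communication pattern.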
In response to determining at step 404 that a configuration state St+1 is comprised in the subset S(k) belonging to the computing node 102ᵢ, i.e. k = i, the method further comprises a step 408 of computing the target value function Qθ⁻(i)(s, a) of the computing node 102ᵢ for (At, St+1) of the records sample B′(i).
The method further comprises, in each computing node 102ᵢ, a step 410 of updating the target value function Qθ⁻(i)(s, a). The step 410 comprises a step of updating the fast-learning neural network 106 based on the target value function Qθ⁻(i)(s, a) computed at step 408 and/or the target value function Qθ⁻(j)(s, a) retrieved at step 406. The step 410 further comprises determining an optimal value function Qθ*(i)(s, a) for the fast-learning neural network 106, where θ*(i) are parameters of the fast-learning neural network 106. In particular, the step 410 comprises resolving the equation:

θ*(i) = argmin over θ(i) of Σ over (s, a, r, s′) ∈ B′(i) of [ r + γ · maxₐ′ Qθ⁻(k)(s′, a′) − Qθ(i)(s, a) ]²,

where the computing node 102ₖ is either the computing node 102ᵢ, i.e. k = i, or the neighboring computing node 102ⱼ, i.e. k = j; γ is a discount factor; Qθ(i)(s, a) is a variable value function approximated by neural network parameters θ(i); and s′ is a next configuration state St+1 comprised in the records sample B′(i).
The step 410 further comprises updating the target value function Qθ⁻(i)(s, a) as the optimal value function Qθ*(i)(s, a), in particular updating the target neural network 104 parameters θ⁻(i) as the optimal parameters θ*(i), and reiterating the method from step 302.
The step 410 also comprises a step of terminating the method and returning to the configuration agent 101 the optimal value function Qθ*(i)(s, a), in particular the optimal parameters θ*(i), in response to detecting that the target value function Qθ⁻(i)(s, a) converges to the optimal value function Qθ*(i)(s, a) according to a predefined convergence criterion. In particular, the computing node 102ᵢ returns the optimal value function Qθ*(i)(s, a) as the globally optimal value function to the configuration agent 101 for the configuration states belonging to the subset S(i) assigned to the computing node 102ᵢ. Thus, the configuration agent 101, upon receiving the globally optimal value function Qθ*(i)(s, a), computes the globally optimal strategy π*(s) for the configuration states belonging to the subset S(i) assigned to the computing node 102ᵢ.
In an example embodiment, terminating the method is performed in response to satisfying the equation:

| Qθ*(i)(s, a) − Qθ⁻(i)(s, a) | ≤ δ for all (s, a),

where δ is a predefined small parameter.
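For illustration only, the target construction of step 410 and the convergence test may be sketched in a tabular form as follows; the discount factor gamma, the learning rate and the dictionary-based value tables replace the neural networks 104 and 106 of the embodiment and are assumptions of this sketch:

```python
def node_update(q, records, state_to_node, node_i, q_of_node, actions,
                gamma=0.9, lr=0.5):
    """One fitting pass for node i: for each record (s, a, r, s') build
    the target r + gamma * max_a' Q_k(s', a'), where Q_k is the local
    estimate when s' belongs to S(i) and the neighbor's retrieved
    estimate otherwise, then move Q_i(s, a) toward that target."""
    max_shift = 0.0
    for (s, a, r, s_next) in records:
        k = state_to_node[s_next]                  # node owning s'
        q_k = q if k == node_i else q_of_node[k]   # local vs. retrieved
        target = r + gamma * max(q_k.get((s_next, b), 0.0) for b in actions)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + lr * (target - old)
        max_shift = max(max_shift, abs(q[(s, a)] - old))
    return max_shift

state_to_node = {0: 0, 1: 1}               # state 1 belongs to a neighbor
neighbor_q = {(1, "stay"): 1.0}            # neighbor's value estimate
q = {}
records = [(0, "go", 0.5, 1)]
shift = node_update(q, records, state_to_node, node_i=0,
                    q_of_node={1: neighbor_q}, actions=["go", "stay"])
delta = 1e-6
converged = shift <= delta                 # predefined convergence test
```

The returned shift plays the role of the left-hand side of the convergence criterion above: iteration stops once it falls below δ.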
In an example embodiment, the method comprises a step of randomly initializing the target value function Qθ⁻(i)(s, a), in particular a step of randomly initializing the parameters θ⁻(i) of the target neural network 104.

The method of Fig. 3 and Fig. 4 provides the advantages of:

- distributing the reinforcement learning across several computing nodes, enabling faster convergence to the globally optimal value function,
- partitioning the set of configuration states into non-overlapping subsets, reducing the number of possible configuration states to be visited by each computing node and thus the computational time,
- avoiding a central node for communicating with the plurality of computing nodes, thus enhancing the robustness to failure.
In an example embodiment of the method, the set S of configuration states is partitioned into subsets S(i) using spectral clustering such that the configuration states s within a subset S(i) are highly connected and the configuration states s between different subsets S(i) are sparsely connected. This reduces the need for communication among different computing nodes 102ᵢ, since it creates a sparse computing nodes' topology.
In an example embodiment of the method, the set S of configuration states is partitioned into subsets S(i) using random clustering by assigning configuration states s uniformly at random to the subsets S(i). In this case, the computing nodes' topology is more connected than that obtained by spectral clustering, but its diameter is reduced. Thanks to this, the value of the target value function updated by each computing node 102ᵢ is propagated quickly enough to all the other computing nodes.
In a preferred example embodiment of the method, the set S of configuration states is partitioned into box clusters using box clustering, each box cluster being a subset S(i) of configuration states assigned to a computing node 102ᵢ. In this example, for the configurable system 100, each configuration state St describes the values of the set of metrics mq that are of interest for the configuration agent 101, defined as follows:

St = (m1, m2, …, mQ),

where Q is the number of possible metrics. Each metric mj can take on values within an interval Ij such that:

mj ∈ Ij = [mj,min, mj,max].

The interval Ij is divided in h sub-intervals as:

Ij = Ij(1) ∪ Ij(2) ∪ … ∪ Ij(h).

A box cluster B is the Cartesian product of sub-intervals across all metrics such as:

B(h1, …, hQ) = I1(h1) × I2(h2) × … × IQ(hQ).

In this example, the set S of configuration states is partitioned by assigning all states within a box cluster B to a single subset S(i), for all boxes {B(h1, …, hQ)}.
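The mapping of a configuration state to its box cluster may be sketched as follows; equal-width sub-intervals and the function name are simplifying assumptions of this sketch:

```python
def box_index(state, bounds, h):
    """Map a configuration state (m_1, ..., m_Q) to the tuple
    (h_1, ..., h_Q) of sub-interval indices identifying its box
    cluster; bounds[j] is the (low, high) range of metric j, split
    into h equal-width sub-intervals."""
    idx = []
    for m, (low, high) in zip(state, bounds):
        width = (high - low) / h
        k = min(int((m - low) / width), h - 1)  # clamp the upper edge
        idx.append(k)
    return tuple(idx)

# Two metrics, each ranging over [0, 8], split into h = 4 sub-intervals.
bounds = [(0.0, 8.0), (0.0, 8.0)]
box_a = box_index((1.0, 5.0), bounds, h=4)
box_b = box_index((8.0, 0.0), bounds, h=4)  # upper edge clamps to last box
```

All states sharing the same index tuple form one subset S(i) assigned to one computing node.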
Fig. 5 depicts a functional representation of an example of a configurable system 500.
The configurable system 500 is a tandem of processors having two queues 502₁ and 502₂, in which the service rates μ1, μ2 of the queues 502₁ and 502₂ respectively are configurable. Each queue 502₁, 502₂ has a number of waiting jobs b1, b2 representing a buffer 506, respectively a buffer 508, of queue 502₁, respectively queue 502₂. A current configuration state St of the configurable system 500 is defined as St = (b1, b2). Thus, the set S of configuration states depends on the possible lengths b1, b2.
A globally optimal configuration state S* of the configurable system 500 minimizes a linear combination of an increasing function of the queue lengths and of the service rates. This allows to reduce the response time of the configurable system 500, while keeping its running cost low. In order to set up the configurable system 500, the method of Fig. 3 and 4 can be implemented by determining the globally optimal value function for the plurality of computing nodes 102ᵢ and determining an action to cause the configurable system 500 to transition to the globally optimal configuration state S*. The action may be performed by a configuration agent interacting with the configurable system 500 and receiving the globally optimal value function from the plurality of computing nodes 102ᵢ. The configuration agent takes, at each new job arrival, an action At consisting of the tuning of the queue service rates μ1, μ2 respectively of queues 502₁ and 502₂. A reward rt for the configuration agent in a current configuration state St = (b1, b2) can be defined as follows:

rt = −(b1 + b2) − C · (μ1 + μ2), (1)

where C is a positive pre-defined parameter.
An action At of the configuration agent causes the configurable system 500 to transition to one of three possible next configuration states St+1 as follows:

St+1 ∈ { (b1 + 1, b2), (b1 − 1, b2 + 1), (b1, b2 − 1) }. (2)

In other words, depending on whether a new job arrives in the configurable system 500, a job is served by queue 502₁ and routed to queue 502₂, or else a job is served by queue 502₂ and leaves the system. The configurable system 500 can transition from a current configuration state St to a next configuration state St+1 following direction 602.
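An emulator of the tandem queue may be sketched as follows; the uniform choice among feasible events and the exact cost shape are simplifying assumptions of this sketch, a full emulator would weight the events by the arrival and service rates:

```python
import random

def step(state, action, b_max, C=0.1, rng=random):
    """Emulate one event of the two-queue tandem: a new job arrives at
    queue 1, or queue 1 serves a job that joins queue 2, or queue 2
    serves a job that leaves the system.  The reward penalizes both the
    queue occupancy and the chosen service rates."""
    b1, b2 = state
    mu1, mu2 = action                      # service rates chosen by the agent
    events = []
    if b1 < b_max:
        events.append((b1 + 1, b2))        # arrival at queue 1
    if b1 > 0 and b2 < b_max:
        events.append((b1 - 1, b2 + 1))    # queue 1 serves, job joins queue 2
    if b2 > 0:
        events.append((b1, b2 - 1))        # queue 2 serves, job leaves
    next_state = rng.choice(events) if events else state
    reward = -(b1 + b2) - C * (mu1 + mu2)  # assumed cost shape, cf. (1)
    return next_state, reward

s, r = step((2, 0), action=(1.0, 1.0), b_max=5, rng=random.Random(0))
```

Such a step function plays the role of the emulator 110 in step 306, producing the next configuration state and the reward for a selected action.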
The set S of configuration states may be partitioned into non-overlapping subsets S(i) and assigned to each computing node 102ᵢ as follows. As depicted in Fig. 6, bmax is the maximum buffer length of buffers b1 and b2. Thus, the configuration state of the configurable system 500 is defined by two metrics m1 = b1 and m2 = b2, each metric being able to take on a value between 0 and bmax.

The set S of configuration states is partitioned using box clustering as follows. The interval [0, bmax] of all buffer occupancy levels for the queues 502₁ and 502₂ is divided into h sub-intervals I(1), …, I(h). Then, a box is the Cartesian product between two sub-intervals I(h1) × I(h2), which also defines a subset S(i).
Each subset S(i) is assigned to a computing node 102ᵢ to perform the steps of Fig. 3. In particular, the emulator 110 of each computing node 102ᵢ computes, in step 306, the reward rt according to equation (1) and the next configuration state St+1 according to equation (2), given the current configuration state St belonging to the subset S(i) and the action At selected in step 304.
The target value function of each computing node 102ᵢ can be computed according to the steps of Fig. 4. According to the above partitioning, the step 406 of Fig. 4 is performed whenever the action selected in step 304 of Fig. 3 causes the configurable system 500 to transition to a next configuration state following direction 604. Therefore, the communication pattern between pairs of computing nodes 102ᵢ and 102ⱼ is determined by the partitioning of the set S of the configuration states. In fact, after assigning a subset S(i) to each computing node 102ᵢ, two neighboring computing nodes 102ᵢ and 102ⱼ need to communicate whenever there exists a state transition following direction 604 between the two computing nodes.
Figure 7 depicts the topology of communication of the computing nodes 102ᵢ resulting from the partitioning explained above, i.e. a 2-dimensional grid with unidirectional communication over the diagonal 702. Each computing node 102ᵢ performs step 406 of Fig. 4 to retrieve the target value function from five neighboring computing nodes (N, S, E, W, NW).
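The five-neighbor communication pattern of the 2-dimensional grid may be sketched as follows; the grid coordinates assigned to the computing nodes are an illustrative assumption:

```python
def grid_neighbors(i, j, n):
    """Neighbors of the computing node at grid position (i, j) on an
    n x n grid: the four cardinal directions plus the north-west
    diagonal, matching unidirectional communication over the
    diagonal."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1),  # N, S, W, E
                  (i - 1, j - 1)]                                  # NW
    return [(a, b) for (a, b) in candidates if 0 <= a < n and 0 <= b < n]

interior = grid_neighbors(2, 2, n=5)  # interior node: all five neighbors
corner = grid_neighbors(0, 0, n=5)    # corner node: only two neighbors
```

Nodes on the border of the grid simply have fewer neighbors to retrieve target value functions from.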
In another example embodiment, the configurable system is an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment. The antenna is configured to emit a plurality of beams, each beam having a direction and an intensity. The method of Fig. 3 and Fig. 4 may be implemented for setting the antenna by determining an action to cause the antenna to transition to an optimal configuration state.
A current configuration state of the antenna is defined by the number of beams and the action comprises selecting a beam of possible beams and adding it to the current configuration state. For example, each beam is characterized by a direction and an intensity of emission of said antenna.
The method improves the setting of antennas suitable for wireless communication according to standard protocols of the fifth generation (5G) by:

- improving the quality of the computed grids in terms of achieved throughput, given a fixed time budget for computations,
- providing a solution of similar quality to the state of the art within a smaller amount of time.
While this disclosure includes references to illustrative embodiments, this specification is not intended to be construed in a limiting sense. Various modifications of the described embodiments, as well as other embodiments within the scope of the disclosure, which are apparent to persons skilled in the art to which the disclosure pertains are deemed to lie within the principle and scope of the disclosure, e.g., as expressed in the following claims.
The use of figure numbers and/or figure reference labels in the claims is intended to identify one or more possible embodiments of the claimed subject matter in order to facilitate the interpretation of the claims. Such use is not to be construed as necessarily limiting the scope of those claims to the embodiments shown in the corresponding figures.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term "implementation."
The described embodiments are to be considered in all respects as only illustrative and not restrictive. In particular, the scope of the disclosure is indicated by the appended claims rather than by the description and figures herein. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
A person of ordinary skill in the art would readily recognize that steps of various above-described methods can be performed by programmed computers. Herein, some embodiments are intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions, where said instructions perform some or all of the steps of methods described herein. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks or tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of methods described herein.
The functions of the various elements shown in the figures, including any functional blocks labeled as "processors" and/or "controllers," may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

Claims

1. A method of reinforcement learning for setting a configurable system (100), wherein the method comprises, in each of a plurality of computing nodes (102i), repeatedly performing the steps of:
determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system (100), wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node (102i), wherein the subsets of the set of configuration states assigned to the plurality of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states,
and wherein each computing node (102i) comprises a fast-learning neural network (106) and the method further comprises, in each computing node (102i), the steps of:
selecting (402) a records sample from the stored records,
improving (408) a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,
in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving (406) a target value function for said next configuration state from a neighboring computing node (102j) to which a subset of the set of configuration states comprising said next configuration state is assigned,
updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,
in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent (101) suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.
2. Method according to claim 1, wherein the set of configuration states is partitioned into box clusters using box clustering, each box cluster being a subset of the set of configuration states.
3. Method according to claim 1 or 2, wherein the step of determining the record comprises:
selecting (304) the action suitable for causing the configurable system to transition from a current configuration state to the next configuration state, and
emulating (306) the next configuration state corresponding to a response of the configurable system to the selected action in the current configuration state.
4. Method according to claim 3, wherein the action is randomly selected.
5. Method according to claim 3, wherein the action is selected to maximize the target value function in a current configuration state.
6. Method according to any one of claims 1 to 5, wherein the records are stored in a buffer of the computing node and wherein the records sample is randomly selected from the records stored in the buffer of the computing node (102i).
7. Method according to any one of claims 1 to 6, wherein, in each computing node, the retrieving (406) of the target value function from a neighboring computing node comprises communicating with said neighboring computing node (102j) and copying the target value function of said neighboring computing node (102j) locally in the computing node (102i).
8. Method according to any one of claims 1 to 6, wherein, in each computing node, the retrieving of the target value function from a neighboring computing node comprises periodically communicating with said neighboring computing node and storing at each period the target value function of said neighboring computing node with a timestamp in a local memory of the computing node,
wherein the target value function is retrieved from said local memory in response to determining that a next configuration state of the records sample is outside the subset assigned to said computing node.
9. Method according to any one of claims 1 to 8, wherein, in each computing node, the target value function is approximated by a target neural network (104) having target parameters.
10. Method according to any one of claims 1 to 9, wherein the configurable system is an antenna suitable for wireless communication according to standard protocols of the fifth generation (5G), for example implementing a Multiple Input Multiple Output protocol for wireless communication with a user equipment, said antenna being configured to emit a plurality of beams, each beam having a direction and an intensity.
11. Method according to claim 10, wherein the set of configuration states comprises possible beams having different directions and intensities, each configuration state comprises a predefined number of beams, and the action comprises selecting a beam of the possible beams and adding it to the current configuration state.
12. A computing node (102i) configured for setting a configurable system, the computing node (102i) comprising a fast-learning neural network (106) and means for repeatedly performing:
determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node (102i), wherein the computing node belongs to a plurality of computing nodes and wherein a plurality of non-overlapping subsets of the set of configuration states are assigned to the plurality of the computing nodes and compose a partition of the set of configuration states,
and wherein the computing node (102i) further comprises means for performing:
selecting (402) a records sample from the stored records,
improving (408) a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,
in response to determining that a next configuration state contained in the records sample is outside the subset assigned to said computing node, retrieving (406) a target value function for said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned,
updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,
in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the current value function.
13. A computer program comprising executable code for causing a computing node to repeatedly perform:
determining and storing a record comprising a current configuration state, an action and a next configuration state, wherein the action is suitable for causing said configurable system to transition from a current configuration state to a next configuration state, wherein the current configuration state and the next configuration state belong to a set of configuration states which can be visited by the configurable system, wherein the current configuration state belongs to a subset of the set of configuration states assigned to the computing node, wherein the subsets of the set of configuration states assigned to the plurality of the computing nodes are non-overlapping subsets which compose a partition of the set of configuration states,
wherein each computing node comprises a fast-learning neural network and the computer program further comprises executable code for causing said computing node to perform:
selecting a records sample from the stored records,
improving a target value function for one or more actions and one or more next configuration states contained in the records sample, wherein the target value function represents a desirability of the configurable system as a function of a next configuration state and an action,
in response to determining that a next configuration state of the records sample is outside the subset of said computing node, retrieving a target value function of said next configuration state from a neighboring computing node to which a subset of the set of configuration states comprising said next configuration state is assigned, updating the fast-learning neural network based on the computed target value function and determining an optimal value function of the fast-learning neural network, the optimal value function being defined for the configuration states belonging to the subset assigned to the computing node,
in response to detecting that the target value function converges to the optimal value function according to a predefined convergence criterion, returning the optimal value function as a globally optimal value function to a configuration agent suitable for interacting with the configurable system, otherwise iterating the method using the optimal value function as the target value function.
PCT/EP2019/064380 — Method and apparatus for setting a configurable system — filed 2019-06-03, priority 2019-06-03, published as WO2020244735A1, status: Ceased

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/064380 WO2020244735A1 (en) 2019-06-03 2019-06-03 Method and apparatus for setting a configurable system

Publications (1)

Publication Number Publication Date
WO2020244735A1 true WO2020244735A1 (en) 2020-12-10

Family

ID=66793973


Country Status (1)

Country Link
WO (1) WO2020244735A1 (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7257624B2 (en) * 2002-06-04 2007-08-14 Lucent Technologies Inc. System for storing active and inactive configuration commands at a network node for managing its configuration state
WO2018215665A1 (en) * 2017-05-26 2018-11-29 Deepmind Technologies Limited Training action selection neural networks using look-ahead search


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EREN BALEVI ET AL: "Online Antenna Tuning in Heterogeneous Cellular Networks with Deep Reinforcement Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 15 March 2019 (2019-03-15), XP081370102 *
RANJAN R ET AL: "Reinforcement learning for dynamic channel allocation in mobile cellular systems", RECENT ADVANCES IN MICROWAVE THEORY AND APPLICATIONS, 2008. MICROWAVE 2008. INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 21 November 2008 (2008-11-21), pages 924 - 927, XP031413613, ISBN: 978-1-4244-2690-4 *
RIEDMILLER MARTIN ED - ANDO N ET AL: "10 Steps and Some Tricks to Set up Neural Reinforcement Controllers", 31 December 2012, INTERNATIONAL CONFERENCE ON FINANCIAL CRYPTOGRAPHY AND DATA SECURITY; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 735 - 757, ISBN: 978-3-642-17318-9, XP047388476 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220413721A1 (en) * 2021-06-28 2022-12-29 Google Llc Control of machine-learning systems
US12293093B2 (en) * 2021-06-28 2025-05-06 Google Llc Control of deterministic machine learning systems using trigger tables and configuration state registries


Legal Events

Date Code Title Description
121 — Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19729226; Country of ref document: EP; Kind code of ref document: A1)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: pct application non-entry in european phase (Ref document number: 19729226; Country of ref document: EP; Kind code of ref document: A1)