
WO2024199670A1 - Methods, apparatuses and computer programs - Google Patents

Methods, apparatuses and computer programs

Info

Publication number
WO2024199670A1
Authority
WO
WIPO (PCT)
Prior art keywords
base station
time step
station node
state vector
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/058488
Other languages
English (en)
Inventor
Alvaro VALCARCE RIAL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Solutions and Networks Oy
Original Assignee
Nokia Solutions and Networks Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Solutions and Networks Oy filed Critical Nokia Solutions and Networks Oy
Priority to PCT/EP2023/058488
Publication of WO2024199670A1
Anticipated expiration legal-status Critical
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • a communication system can be seen as a facility that enables communication sessions between two or more entities such as user terminals, base stations and/or other nodes by providing carriers between the various entities involved in the communications path.
  • a communication system can be provided for example by means of a communication network and one or more compatible communication devices.
  • the communication sessions may comprise, for example, communication of data for carrying communications such as voice, video, electronic mail (email), text message, multimedia and/or content data and so on.
  • Non- limiting examples of services provided comprise two-way or multi-way calls, data communication or multimedia services and access to a data network system, such as the Internet.
  • in a wireless communication system, at least a part of a communication session between at least two stations occurs over a wireless link.
  • wireless systems comprise public land mobile networks (PLMN), satellite-based communication systems and different wireless local networks, for example wireless local area networks (WLAN).
  • PLMN public land mobile networks
  • WLAN wireless local area networks
  • Some wireless systems can be divided into cells, and are therefore often referred to as cellular systems.
  • a user can access the communication system by means of an appropriate communication device or terminal.
  • a communication device of a user may be referred to as user equipment (UE) or user device.
  • UE user equipment
  • a communication device is provided with an appropriate signal receiving and transmitting apparatus for enabling communications, for example enabling access to a communication network or communications directly with other users.
  • the communication device may access a carrier provided by a station, for example a base station of a cell, and transmit and/or receive communications on the carrier.
  • the communication system and associated devices typically operate in accordance with a given standard or specification which sets out what the various entities associated with the system are permitted to do and how that should be achieved. Communication protocols and/or parameters which shall be used for the connection are also typically defined.
  • UTRAN 3G radio
  • Other examples of communication systems are the long-term evolution (LTE) of the Universal Mobile Telecommunications System (UMTS) radio-access technology and so-called 5G or New Radio (NR) networks.
  • LTE long-term evolution
  • UMTS Universal Mobile Telecommunications System
  • NR New Radio
  • an apparatus comprising network node apparatus for a cellular network, the network node apparatus comprising: means for receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate across user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; the apparatus comprising: means for training a neural network based on the information.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the apparatus comprises: means for sending, to the base station node and before receiving the information, current Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the apparatus comprises: means for sending, to a validation base station node, a first set of Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the first set of Q network parameters to determine a first average reward value between an initial time step and a final time step; means for sending, to the validation base station node, a second set of Q network parameters, the second set of Q network parameters determined based on training the first set of Q network parameters, wherein the base station node uses the second set of Q network parameters to determine a second average reward value between the initial time step and the final time step; means for receiving the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step; means for validating the training based on a comparison of the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step.
  • a method comprising: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; the method comprising: training a neural network based on the information.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • training the neural network based on the information comprises training using Q network reinforcement learning, and the method comprises: sending, to the base station node and before receiving the information, current Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the method comprises sending each data tuple, wherein each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the method comprises: sending, to the base station node and before receiving the information, current Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the method comprises: sending, to a validation base station node, a first set of Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the first set of Q network parameters to determine a first average reward value between an initial time step and a final time step; sending, to the validation base station node, a second set of Q network parameters, the second set of Q network parameters determined based on training the first set of Q network parameters, wherein the base station node uses the second set of Q network parameters to determine a second average reward value between the initial time step and the final time step; receiving the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step; validating the training based on a comparison of the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step.
  • an apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: receive, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and cause the apparatus at least to: train a neural network based on the information.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: training the neural network based on the information using Q network reinforcement learning; and sending, to the base station node and before receiving the information, current Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • training the neural network based on the information comprises training using Q network reinforcement learning
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: sending each data tuple, wherein each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: training the neural network based on the information using Q network reinforcement learning; and sending, to the base station node and before receiving the information, current Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: training the neural network based on the information using Q network reinforcement learning; and sending, to a validation base station node, a first set of Q network parameters being used in the Q network reinforcement learning, wherein the base station node uses the first set of Q network parameters to determine a first average reward value between an initial time step and a final time step; sending, to the validation base station node, a second set of Q network parameters, the second set of Q network parameters determined based on training the first set of Q network parameters, wherein the base station node uses the second set of Q network parameters to determine a second average reward value between the initial time step and the final time step; receiving the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step; validating the training based on a comparison of the first average reward value between the initial time step and the final time step and the second average reward value between the initial time step and the final time step.
  • an apparatus comprising: circuitry for: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and: training a neural network based on the information.
  • a computer program comprising instructions for causing an apparatus to perform at least the following: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and: training a neural network based on the information.
  • a computer program comprising instructions stored thereon for performing at least the following: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and: training a neural network based on the information.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and: training a neural network based on the information.
  • a non-transitory computer readable medium comprising program instructions stored thereon for performing at least the following: receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change in minimum data rate of user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; and: training a neural network based on the information.
  • a base station node apparatus comprising: means for sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node apparatus at a respective origin time step; an action performed by the base station node apparatus at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node apparatus based on a change of a function of a data rate across user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • a neural network is trained by the network node apparatus based on the information by using Q network reinforcement learning, and wherein the apparatus comprises: means for receiving, from the network node apparatus and before sending the information, current Q network parameters being used in the Q network reinforcement learning; means for using the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • an apparatus comprising: at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node apparatus at a respective origin time step; an action performed by the base station node apparatus at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node apparatus based on a change of a function of a data rate across user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • a neural network is trained by the network node apparatus based on the information by using Q network reinforcement learning, and the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: receiving, from the network node apparatus and before sending the information, current Q network parameters being used in the Q network reinforcement learning; using the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • a method performed by a base station node comprising: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • each data tuple further comprises: a value of an expected cumulative reward determined by the base station node for the first state vector; a value of the expected cumulative reward determined by the base station node for the second state vector.
  • a neural network is trained by the network node apparatus based on the information by using Q network reinforcement learning, and the method comprises: receiving, from the network node apparatus and before sending the information, current Q network parameters being used in the Q network reinforcement learning; using the current Q network parameters to determine the value of the expected cumulative reward for the first state vector and the value of the expected cumulative reward for the second state vector.
  • the action performed by the base station node at the respective origin time step comprises allocating a physical resource block to a user equipment of the user equipment devices controlled by the base station node.
  • the value of the reward function is zero when a threshold number of total time steps have been taken.
  • the threshold number of total time steps is an integer multiple of a product of a number of available physical resource blocks and a number of subframes in which the available physical resource blocks must be allocated.
  • each state vector comprises at least one of: a maximum number of user equipment devices that can be in a radio resource configuration connected mode in the system controlled by the base station node; one or more channel quality indicators for one or more respective user equipment devices; one or more buffer occupancy values for the one or more respective user equipment devices; an index of a subframe where physical resource blocks are being allocated in a current time step; an index of the physical resource block being allocated in the current time step; a current subframe number; an index of a user equipment to which a physical resource block has been allocated to.
  • an apparatus comprising circuitry for: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • a computer program comprising instructions for causing an apparatus to perform at least the following: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • according to a fourteenth aspect, there is provided a computer program comprising instructions stored thereon for performing at least the following: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • a non-transitory computer readable medium comprising program instructions stored thereon for performing at least the following: sending, to a network node apparatus, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by the base station node at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; a value of a reward function determined by the base station node based on a change of a function of a data rate of user equipment devices controlled by the base station node between the respective origin time step and the respective subsequent time step.
  • a non-transitory computer readable medium comprising program instructions that, when executed by an apparatus, cause the apparatus to perform at least the method according to any of the preceding aspects.
  • Figure 1 shows an example representation of a time-frequency radio resource grid
  • Figure 2 shows an example state/action transition diagram
  • Figure 3 shows an example network
  • Figure 4 shows an example method flow diagram
  • Figure 5 shows an example apparatus
  • Figure 6 shows an example apparatus
  • Figure 7 shows a method flow diagram
  • Figure 8 shows a schematic representation of a non-volatile memory medium storing instructions which when executed by a processor allow a processor to perform one or more of the steps of the method of Figures 4 and 7.
  • DETAILED DESCRIPTION In the following certain embodiments are explained with reference to mobile communication devices capable of communication via a wireless cellular system and mobile communication systems serving such mobile communication devices.
  • a method and apparatus for allocating time-frequency radio resources to User Equipments (UEs) in wireless systems.
  • This task may be carried out by a Medium Access Control (MAC) scheduler, where typical scheduling algorithms include variations of Proportional Fair (PF), Exponential Rule (EXP), Modified Largest Weighted Delay First (MLWDF), etc.
  • MAC Medium Access Control
  • PF Proportional Fair
  • EXP Exponential Rule
  • MLWDF Modified Largest Weighted Delay First
  • DRL deep reinforcement learning
  • DRL is used as an algorithm for some sequential decision-making tasks such as videogames and robotics.
  • DRL algorithms are usually data hungry and take a long time to train. Examples use a DRL-based scheduler that leverages data collection from numerous eNBs/gNBs to reduce training time and avoid the tuning requirements of other schedulers.
  • radio resource allocation is formulated as a Markov Decision Process (MDP).
  • MDP Markov Decision Process
  • the size of the action space is reduced by time-stepping through resource blocks (in uplink) or through resource block groups (in downlink). For example, time-stepping can be performed through Resource Block Groups each Transmission Time Interval (TTI).
  • TTI Transmission Time Interval
  • the number of UEs to schedule on each TTI is not imposed, thus supporting long time-horizon strategies that respect QoS latency budgets.
  • an MDP is solved using a distributed DRL technique that does not require stack assumptions or constraints.
  • MAC schedulers can generalize to more radio environments than when based on single eNB/gNB approaches. Therefore, the number of scenarios where the solution can provide a benefit is increased.
  • Scheduling data from multiple eNBs/gNBs (actor eNBs/gNBs) can be collected and stored in a so-called Replay memory, from which GPU-powered learner software draws samples to train its own Artificial Neural Network (ANN).
  • ANN Artificial Neural Network
  • the data collection requires no modification to the scheduling policies used by the actor eNBs/gNBs.
  • the method described herein can be used for downlink or uplink scheduling.
  • PRBs Physical Resource Blocks
  • some PRBs carry data bits (shown with small circles in a PRB), others control bits (shown with small crosses in a PRB), and others control & data bits (shown as blank PRBs).
  • the MAC scheduler allocates each and every one of these PRBs to active UEs (i.e. UEs that are in RRC CONNECTED mode). The MAC scheduler takes these decisions considering various metrics about the current conditions of the system and the UE requirements.
  • Time-stepping dynamics can be performed as two nested sweeps, first over the subframes, and then in each subframe, over the PRBs.
  • f_t is defined as the index of the subframe where PRBs are being allocated in time step t.
  • p_t is defined as the index of the PRB to be allocated in time step t. Following the nested sweep described above, p_t advances through the PRBs of the current subframe and f_t moves to the next subframe once all of its PRBs have been allocated. The concept of legal/illegal actions is described below.
  • F is defined as the number of subframes during which PRBs must be allocated.
  • the time step leading up to the final state is defined as the final step.
  • the sequence of time steps between the initial and the final states is called an episode. For the sake of time averaging, F should not be too small (e.g. 1), or else the performance evaluation would be too noisy and unreliable. On the other hand, it should not be too large either, or the learning algorithm may be too slow.
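As an illustration of the time-stepping dynamics described above, the following Python sketch enumerates the time steps of one episode as a nested sweep, first over subframes and then over PRBs. The function name, generator structure and the symbols F and P (number of subframes and PRBs per subframe) are illustrative assumptions, not taken from the publication.

```python
# Illustrative nested sweep over subframes and PRBs; one scheduling
# decision (one MDP time step) is taken per PRB.
def time_steps(num_subframes_F, num_prbs_P):
    """Yield (t, f, p): time step index, subframe index, PRB index."""
    t = 0
    for f in range(1, num_subframes_F + 1):   # sweep over subframes
        for p in range(num_prbs_P):           # then over the PRBs of each subframe
            yield t, f, p
            t += 1

# Example: an episode of F = 10 subframes with P = 6 PRBs each
# contains F * P = 60 time steps.
episode = list(time_steps(10, 6))
assert len(episode) == 10 * 6
```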
  • a scheduling decision 203b may be taken at state s_1 201b that moves the agent to a new state s_2 201c.
  • the current conditions of the system are aggregated into a state vector.
  • c_u ∈ [0...15] is the most recently available wideband Channel Quality Indicator (CQI) received from the u-th connected UE, where a higher value indicates better channel conditions for that UE.
  • this invention defines a value of 0 to indicate that the UE is not connected.
  • b_u ∈ {0, 1} is the buffer occupancy of the u-th connected UE; it takes value 0 if no downlink data is available for the UE and 1 if it is.
  • f_t ∈ [1...F + 1] is the index of the subframe where PRBs are being allocated in the current time step.
  • p_t ∈ [0...P − 1] is the index of the PRB being allocated in the current time step, where P is the number of available PRBs.
  • n_t ∈ [0...9] is the current subframe number as per the LTE or 5G NR standard within a 10 ms radio frame.
  • g_{f,p} ∈ [0...U] indicates the index of the UE to which the p-th resource has been allocated during the f-th subframe, where U is the maximum number of connected UEs.
  • a value of 0 indicates that the resource has not yet been allocated to any UE. It should be noted that in some examples, the state vector s_t may have more or fewer terms than those listed above.
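A minimal sketch of how the fields listed above might be packed into a flat state vector s_t. The function name, field ordering and use of NumPy are assumptions for illustration only.

```python
import numpy as np

def build_state_vector(max_ues, cqi, buffer_occ, f_t, p_t, sfn, alloc_grid):
    """Pack the fields listed above into a flat state vector s_t.

    cqi        : per-UE wideband CQI in [0..15] (0 = UE not connected)
    buffer_occ : per-UE buffer occupancy in {0, 1}
    f_t, p_t   : subframe index and PRB index of the current time step
    sfn        : current subframe number in [0..9]
    alloc_grid : (F, P) array of UE indices already allocated (0 = unallocated)
    """
    return np.concatenate([
        [max_ues],                              # max number of RRC-connected UEs
        np.asarray(cqi, dtype=float),           # one CQI per UE
        np.asarray(buffer_occ, dtype=float),    # one buffer-occupancy flag per UE
        [f_t, p_t, sfn],                        # time-stepping indices
        np.asarray(alloc_grid, dtype=float).ravel(),  # allocation history g_{f,p}
    ])
```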
  • the MAC scheduler decides to which UE the current PRB is to be allocated. These decisions are the so-called actions that the scheduler can take.
  • selecting action u is an attempt to allocate the current PRB to the u-th UE. It may not be possible, on each time step, to allocate the current PRB to any UE. Such an attempt may be defined as an illegal action. This is because, depending on the current aggregation level chosen by the Link Adaptation algorithm for the current and the previous UEs, there may not be Control Channel Elements (CCEs) left to assign a new Physical Downlink Control Channel (PDCCH) for a new UE on the current TTI.
  • CCEs Control Channel Elements
  • PDCCH Physical Downlink Control Channel
  • the learning algorithm used by this invention trains the scheduler to avoid this problem by providing zero reward if the scheduler cannot allocate all PRBs after a maximum of K · P · F time steps, where K is a positive integer.
  • a behavioural policy comprises a mapping from states to actions, and can be used to indicate which action to take in a given state.
  • An action-value method can be used to evaluate different actions and select which action to take.
  • the scheduler uses the current state to calculate the so-called Q value of each and every action.
  • a trained scheduler (e.g., one deployed in a commercial product) typically selects the action with the highest Q value.
  • a training scheduler may sometimes choose a non-optimal action for the purpose of exploring the state space.
  • the Q function is approximated using a Dueling neural network architecture as described in “Dueling Network Architectures for Deep Reinforcement Learning”, Z. Wang et al., Apr 2016, with one hidden layer.
  • after the hidden layer, the Dueling network has two streams: an advantage stream and a value stream.
  • an example network architecture is: one fully-connected input layer of size equal to the number of features in the state vector.
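The following PyTorch sketch shows one possible realisation of such a Dueling Q network with a single hidden layer, following the value/advantage decomposition of Wang et al. The class name, hidden-layer size and inference snippet are illustrative choices, not specified by the publication.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """One hidden layer feeding separate value and advantage streams,
    recombined as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, num_actions, hidden_dim=128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.value_stream = nn.Linear(hidden_dim, 1)                 # V(s)
        self.advantage_stream = nn.Linear(hidden_dim, num_actions)   # A(s, a)

    def forward(self, state):
        h = self.hidden(state)
        value = self.value_stream(h)
        advantage = self.advantage_stream(h)
        return value + advantage - advantage.mean(dim=-1, keepdim=True)

# Inference at the base station: pick the action (UE index) with the highest Q value.
# q_net = DuelingQNetwork(state_dim=len(s_t), num_actions=max_ues + 1)
# action = int(q_net(torch.as_tensor(s_t, dtype=torch.float32)).argmax())
```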
  • a scheduler After taking an action, a scheduler receives a reward, which is a scalar number that evaluates how good the state transition from ⁇ ⁇ to ⁇ ⁇ +1 was.
  • the sum of all rewards received during an episode is known as the return.
  • the return can be discounted or not by means of a discount factor. Due to the intrinsic randomness of the wireless channel and of the UE traffic requirements, the sequence of rewards received during two episodes may vary even if the initial states were the same. As such, expected return can be more interesting than the instantaneous reward from a policy evaluation viewpoint.
  • a scheduler's policy is modified to optimize the expected return.
  • the reward function received by the scheduler on time step t is defined in terms of d_k, the normalized downlink data rate experienced by the k-th UE averaged over the past episode (i.e. during the past F subframes); the reward is based on the change in the minimum of d_k across all UEs.
  • This reward function has the objective of maximizing the minimum data rate across all UEs.
  • This is a utility function for the allocation of radio resources to Non-GBR (Guaranteed Bit Rate) UEs. It should be noted that in other examples, other reward functions may be used. In some examples, these reward functions may be based on a data rate across UEs in the network. Training a neural network can be used as a process to improve an agent’s performance over time.
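A minimal sketch of a reward consistent with the description above, assuming the reward is the change in the minimum normalized downlink data rate across UEs between consecutive episodes and is forced to zero once the K · P · F time-step budget is exceeded. The function signature and default K are illustrative assumptions.

```python
def reward(prev_rates, curr_rates, t, num_prbs_P, num_subframes_F, K=2):
    """Illustrative max-min reward.

    prev_rates, curr_rates : normalized per-UE downlink data rates averaged
                             over the previous / current episode (F subframes)
    Returns 0 once more than K * P * F time steps have been taken.
    """
    if t >= K * num_prbs_P * num_subframes_F:   # time-step budget exhausted
        return 0.0
    return min(curr_rates) - min(prev_rates)    # improvement of the worst UE
```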
  • all training is centralized on a GPU-capable machine 309 which is not collocated with the base station node 305a, 305b.
  • data collected from multiple eNBs/gNBs via network 307 is used for training (see Figure 3). In general, training data can then be collected from large portions of live networks without changes to the current commercial MAC scheduling policies due to the off-policy learning properties of Q-learning.
  • Prioritized Experience Replay as described in “Reinforcement Learning Scheduler for Vehicle-to-Vehicle Communications Outside Coverage”, T. Sahin et al., Dec 2018, IEEE Vehicular Networking Conference (VNC) is used, which accelerates learning by retraining more often on the most interesting MDP transitions (i.e. those with a higher prediction error).
  • the MAC scheduler in an eNB/gNB (e.g., base station 305a, 305b) executes an action, i.e. it allocates a PRB to a UE or determines not to assign a PRB to a UE. This moves the scheduler from an initial state to a new state.
  • This transition can be constructed as a data tuple (s_t, a_t, s_{t+n}, r_{t+n}, Q(s_t), Q(s_{t+n})), where n is the size of the eligibility trace for Temporal Difference (TD) bootstrapping and Q(s_t) is the Q value estimate made by the eNB/gNB on state s_t.
  • These transitions can be sent from the base stations to device 309. According to some examples, these transitions are buffered locally at base station nodes 305a and 305b and periodically sent in lots to the central Experience Replay Memory 313 of device 309 for training using learner (GPU) 311.
  • Each gNB/eNB 305a, 305b can periodically request the latest Q network parameters from the learner in order to calculate the Q values. Note that this is an inference computation, and that no training takes place at the eNBs/gNBs 305a and 305b. Therefore, no hardware accelerator for training Artificial Neural Networks is required on the Radio Access Network (RAN), as it is provided at device 309. Base station nodes may be provided with backhaul links from the eNBs/gNBs 305a and 305b to device 309. In some examples, the experience transitions to be sent to the Replay Memory 313 have backhaul capacity requirements on the same order of magnitude as network logs. This can be taken into account when configuring the network.
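The sketch below illustrates one way an eNB/gNB might represent and buffer these transitions before shipping them in lots to the central Experience Replay Memory. The dataclass, buffer size and send callback are assumptions for illustration, not part of the publication.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Transition:
    s_t: np.ndarray      # state at the origin time step
    a_t: int             # action taken (UE index, 0 = no allocation)
    s_next: np.ndarray   # state n steps later
    r_next: float        # reward observed for the transition
    q_t: float           # Q(s_t) estimated locally with the latest parameters
    q_next: float        # Q(s_{t+n}) estimated locally

class TransitionBuffer:
    """Local buffer at the eNB/gNB; flushed periodically to the central replay memory."""
    def __init__(self, send_fn, batch_size=256):
        self._send_fn = send_fn          # e.g. a backhaul RPC towards device 309 (placeholder)
        self._batch_size = batch_size
        self._buf: List[Transition] = []

    def add(self, transition: Transition):
        self._buf.append(transition)
        if len(self._buf) >= self._batch_size:
            self._send_fn(self._buf)     # ship one lot over the backhaul
            self._buf = []
```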
  • RAN Radio Access Network
  • the learner software can collect transition batches from the Experience Replay Memory via prioritized sampling (see “Prioritized Experience Replay”, T. Schaul et al., Feb 2016) and apply root mean squared propagation (RMSProp) to them to change the current Q network parameters. These updates are compatible with most deep reinforcement learning algorithms, for example the Double Deep Q Network (DQN) architecture.
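A sketch of one such learner update, assuming a Double DQN target, importance-sampling weights from prioritized sampling and an RMSProp optimizer as named above. The replay-memory API (sample_prioritized, update_priorities), the batch layout and the hyperparameters are placeholders, not the publication's interfaces.

```python
import torch

def learner_step(online_net, target_net, replay_memory, optimizer, gamma=0.99):
    """One Double-DQN update on a prioritized batch (illustrative API)."""
    batch, weights, indices = replay_memory.sample_prioritized(batch_size=64)
    s, a, s_next, r = batch.s, batch.a, batch.s_next, batch.r

    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)   # action selection: online net
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluation: target net
        target = r + gamma * q_next

    td_error = q - target
    loss = (weights * td_error.pow(2)).mean()   # importance-sampling weights from PER

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # optimizer = torch.optim.RMSprop(...)

    replay_memory.update_priorities(indices, td_error.abs().detach())
```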
  • a so-called validation base station node can be configured. This is a base station (e.g. eNB/gNB) that also collects the latest Q network parameters from learner 311. However, the transitions generated by the validation base station shall not be sent to the Experience Replay Memory 313.
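One way the learner could use such a validation base station is sketched below: the previous and the newly trained parameter sets are evaluated over the same window, and the update is kept only if the reported average reward does not degrade. The helper functions send_params and request_avg_reward are hypothetical messaging primitives.

```python
def validate_update(send_params, request_avg_reward, old_params, new_params):
    """Accept the new Q-network parameters only if the validation eNB/gNB
    reports an improved average reward over the same evaluation window."""
    send_params(old_params)
    r_old = request_avg_reward()   # average reward between initial and final time step

    send_params(new_params)
    r_new = request_avg_reward()   # same window, new parameters

    return new_params if r_new >= r_old else old_params
```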
  • Figure 4 shows an example method flow that can take place at a scheduler (e.g. MAC scheduler) of a base station (e.g. eNB/gNB 305a, 305b).
  • a scheduler e.g. MAC scheduler
  • a base station e.g. eNB/gNB 305a, 305b.
  • Figure 4 uses a method of Q learning, but in other examples other learning methods may be used.
  • a state vector s_t is built for time step t.
  • an ANN is used to approximate a value of an expected cumulative reward (Q-value) for the state vector s_t for all actions belonging to an action space A.
  • the ANN parameters are referred to as Q network parameters.
  • the scheduler selects an action a_t that maximises the expected cumulative reward (Q-value) for the state vector s_t.
  • the scheduler allocates the p-th PRB to the UE indicated by a_t. This may be a UE controlled by the base station node comprising the scheduler. In some examples, if the action is illegal, the PRB is allocated to no UE. The method then moves on to a subsequent time step t+1 and the scheduler builds a new state vector s_{t+1}.
  • a reward based on the data rate of UEs controlled by the base station node is determined. This is a scalar number that evaluates how good the transition from t to t+1 was. This may be, for example, a measure of the minimum data rate across all UEs controlled by the base station node.
  • the scheduler sends a state transition data tuple (s_t, a_t, s_{t+1}, r_{t+1}, Q(s_t), Q(s_{t+1})) to replay memory. According to some examples, this may happen after multiple time steps such that the data tuple covers a state transition over more than one time step.
  • a learning device can use this state transition data tuple to train a neural network to maximise Q (or, in other examples that do not use Q learning, to maximise a reward function).
  • multiple state transition data tuples are sent to a DRL scheduler from a plurality of base stations (e.g., from the schedulers of each base station).
  • the DRL scheduler may use state transition data tuples to train an ANN. In some examples, this is performed using Q network reinforcement learning.
  • t is increased by 1 (i.e., the scheduler moves from an origin time step to a subsequent time step) and p is increased to (p + 1) mod P. The method returns to 422 and repeats.
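Pulling the steps of Figure 4 together, the following sketch shows an illustrative per-episode scheduler loop with greedy action selection. Here env stands for the scheduler's view of the cell; its methods (build_state, allocate_prb, reward) and the send_transition callback are placeholders, and q_net is a Q network such as the dueling model sketched earlier.

```python
import torch

def run_episode(q_net, env, send_transition, num_subframes_F, num_prbs_P):
    """Illustrative per-episode loop for the Figure 4 flow (greedy action selection)."""
    s_t = env.build_state()                                   # build state vector s_t
    for f in range(1, num_subframes_F + 1):                   # sweep over subframes
        for p in range(num_prbs_P):                           # then over PRBs (one time step each)
            q_values = q_net(torch.as_tensor(s_t, dtype=torch.float32))
            a_t = int(q_values.argmax())                      # action maximising Q(s_t, a)
            env.allocate_prb(subframe=f, prb=p, ue=a_t)       # allocate PRB p (0 = no UE)
            s_next = env.build_state()                        # state at the next time step
            r_next = env.reward()                             # data-rate based reward
            q_t = float(q_values.max())
            q_next = float(q_net(torch.as_tensor(s_next, dtype=torch.float32)).max())
            send_transition((s_t, a_t, s_next, r_next, q_t, q_next))  # towards replay memory
            s_t = s_next
```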
  • An origin time step may comprise any time step that is before a subsequent time step.
  • An origin time step may comprise a time step that is earlier relative to a subsequent time step.
  • the learning device can be trained with data from non-ML commercial schedulers without changes to their policies. Current commercial schedulers can continue to run as usual, while the new DRL scheduler learns in real time from their experience.
  • ANN code can capture non-intuitive and non-linear relationships between the scheduler inputs and the target utility function, and can therefore replace the code repositories of MAC schedulers, which have traditionally been built as expert systems due to their high complexity.
  • Traditional approaches to time-frequency resource allocation encounter challenges when applied to the massive bandwidths of mmWave bands.
  • FIG. 5 illustrates an example of a control apparatus 500 for controlling a function of the network as illustrated in Figure 3.
  • the control apparatus may comprise at least one random access memory (RAM) 511a, at least one read only memory (ROM) 511b, at least one processor 512, 513 and an input/output interface 514.
  • the at least one processor 512, 513 may be coupled to the RAM 511a and the ROM 511b.
  • the at least one processor 512, 513 may be configured to execute an appropriate software code 515.
  • the software code 515 may, for example, allow one or more steps of one or more of the present aspects to be performed.
  • the software code 515 may be stored in the ROM 511b.
  • the control apparatus 500 may be interconnected with another control apparatus 500 controlling another function of the RAN or the core network.
  • Figure 6 illustrates an example of a terminal 600, such as a UE.
  • the terminal 600 may be provided by any device capable of sending and receiving radio signals.
  • Non-limiting examples comprise a user equipment, a mobile station (MS) or mobile device such as a mobile phone or what is known as a ’smart phone’, a computer provided with a wireless interface card or other wireless interface facility (e.g., USB dongle), a personal data assistant (PDA) or a tablet provided with wireless communication capabilities, a machine-type communications (MTC) device, an Internet of things (IoT) type communication device or any combinations of these or the like.
  • the terminal 600 may provide, for example, communication of data for carrying communications.
  • the communications may be one or more of voice, electronic mail (email), text message, multimedia, data, machine data and so on.
  • the terminal 600 may receive signals over an air or radio interface 607 via appropriate apparatus for receiving and may transmit signals via appropriate apparatus for transmitting radio signals.
  • transceiver apparatus is designated schematically by block 606.
  • the transceiver apparatus 606 may be provided for example by means of a radio part and associated antenna arrangement.
  • the antenna arrangement may be arranged internally or externally to the mobile device.
  • the terminal 600 may be provided with at least one processor 601, at least one memory ROM 602a, at least one RAM 602b and other possible components 603 for use in software and hardware aided execution of tasks it is designed to perform, including control of access to and communications with access systems and other communication devices.
  • the at least one processor 601 is coupled to the RAM 602b and the ROM 602a.
  • the at least one processor 601 may be configured to execute an appropriate software code 608.
  • the software code 608 may, for example, allow one or more of the present aspects to be performed.
  • the software code 608 may be stored in the ROM 602a.
  • the processor, storage and other relevant control apparatus can be provided on an appropriate circuit board and/or in chipsets. This feature is denoted by reference 604.
  • the device may optionally have a user interface such as key pad 605, touch sensitive screen or pad, combinations thereof or the like.
  • one or more of a display, a speaker and a microphone may be provided depending on the type of the device.
  • Figure 7 shows an example method flow. In some examples, the method of Figure 7 may be performed by a central learning device such as device 309.
  • the device may comprise a DRL device.
  • the method comprises receiving, from at least one base station node, information comprising a plurality of data tuples.
  • Each respective data tuple may comprise: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; and a value of a reward function determined by the base station node based on a change of a function of a data rate across user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step.
  • the information may also comprise Q values determined by each base station at the origin time step and the subsequent time step.
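  • For illustration, the received data tuples could be represented as in the sketch below; the field names and the Python typing are assumptions for readability, not a prescribed wire format.

```python
from typing import NamedTuple, Sequence

class Transition(NamedTuple):
    # One experience tuple reported by a base station node to the learning device.
    state_t: Sequence[float]       # first state vector at the origin time step
    action_t: int                  # action performed at the origin time step
    state_t1: Sequence[float]      # second state vector at the subsequent time step
    reward_t1: float               # reward based on the change in UE data rates
    q_values_t: Sequence[float]    # Q-values at the origin time step (optional extra)
    q_values_t1: Sequence[float]   # Q-values at the subsequent time step (optional extra)
```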
  • the method comprises training a neural network based on the information received at 750.
  • the DRL device performing the method of Figure 7 may share the parameters of the trained neural network with one or more base stations. These base stations may then use these parameters to assign PRBs to UEs.
  • Figure 8 shows a schematic representation of non-volatile memory media 800a (e.g. computer disc (CD) or digital versatile disc (DVD)) and 800b (e.g. universal serial bus (USB) memory stick) storing instructions and/or parameters 802 which when executed by a processor allow the processor to perform one or more of the steps of the above-described methods.
  • the apparatuses may comprise or be coupled to other units or modules etc., such as radio parts or radio heads, used in or for transmission and/or reception.
  • although the apparatuses have been described as one entity, different modules and memory may be implemented in one or more physical or logical entities. It is noted that whilst some embodiments have been described in relation to 5G networks, similar principles can be applied in relation to other networks and communication systems. Therefore, although certain embodiments were described above by way of example with reference to certain example architectures for wireless networks, technologies and standards, embodiments may be applied to any other suitable forms of communication systems than those illustrated and described herein. It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
  • the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto.
  • circuitry may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • This definition of circuitry applies to all uses of this term in this application, including in any claims.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or another computing or network device.
  • the embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • Computer software or a program, also called a program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium, and comprises program instructions to perform particular tasks.
  • a computer program product may comprise one or more computer- executable components which, when the program is run, are configured to carry out embodiments.
  • the one or more computer-executable components may be at least one software code or portions of it.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the physical medium is a non-transitory medium.
  • the term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
  • Embodiments of the disclosure may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • the scope of protection sought for various embodiments of the disclosure is set out by the independent claims.

Abstract

The invention relates to a network node apparatus for a cellular network, the network node apparatus comprising: means for receiving, from one or more base station nodes, information comprising a plurality of data tuples, each respective data tuple comprising: a first state vector of a system controlled by a base station node of the one or more base station nodes at a respective origin time step; an action performed by the base station node at the respective origin time step; a second state vector of the system at a respective subsequent time, the respective subsequent time being later than the respective origin time step, the second state vector being a consequence of the action performed by the base station node at the respective origin time step; and a value of a reward function determined by the base station node based on a change of a function of a data rate across user equipment devices controlled by the base station node apparatus between the respective origin time step and the respective subsequent time step; the apparatus comprising: means for training an artificial neural network based on the information.
PCT/EP2023/058488 2023-03-31 2023-03-31 Procédés, appareils et programmes informatiques Pending WO2024199670A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/058488 WO2024199670A1 (fr) 2023-03-31 2023-03-31 Procédés, appareils et programmes informatiques

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/058488 WO2024199670A1 (fr) 2023-03-31 2023-03-31 Procédés, appareils et programmes informatiques

Publications (1)

Publication Number Publication Date
WO2024199670A1 true WO2024199670A1 (fr) 2024-10-03

Family

ID=85984935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/058488 Pending WO2024199670A1 (fr) 2023-03-31 2023-03-31 Procédés, appareils et programmes informatiques

Country Status (1)

Country Link
WO (1) WO2024199670A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021188022A1 (fr) * 2020-03-17 2021-09-23 Telefonaktiebolaget Lm Ericsson (Publ) Attribution de ressource radio
WO2022010409A1 (fr) * 2020-07-10 2022-01-13 Telefonaktiebolaget Lm Ericsson (Publ) Procédé et système d'ordonnancement basé sur l'apprentissage par renforcement profond (drl) dans un système sans fil
US20220014963A1 (en) * 2021-03-22 2022-01-13 Shu-Ping Yeh Reinforcement learning for multi-access traffic management
WO2022255694A1 (fr) * 2021-06-04 2022-12-08 Samsung Electronics Co., Ltd. Procédés et systèmes de détermination d'une politique dss entre de multiples rat

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
T. SAHIN ET AL.: "Reinforcement Learning Scheduler for Vehicle-to-Vehicle Communications Outside Coverage", IEEE VEHICULAR NETWORKING CONFERENCE (VNC), December 2018 (2018-12-01)
T. SCHAUL, PRIORITIZED EXPERIENCE REPLAY, February 2016 (2016-02-01)
Z. WANG, DUELING NETWORK ARCHITECTURES FOR DEEP REINFORCEMENT LEARNING, April 2016 (2016-04-01)

Similar Documents

Publication Publication Date Title
US12003971B2 (en) Method for sharing spectrum resources, apparatus, electronic device and storage medium
US12156240B2 (en) Scheduling method and apparatus in communication system, and storage medium
Wang et al. Optimal QoS-aware channel assignment in D2D communications with partial CSI
Ali et al. Sleeping multi-armed bandits for fast uplink grant allocation in machine type communications
US20230262683A1 (en) Method and system for deep reinforcement learning (drl) based scheduling in a wireless system
US20240236713A9 (en) Signalling support for split ml-assistance between next generation random access networks and user equipment
US20240291527A1 (en) Systems, devices and methods for scheduling wireless communications
US11917612B2 (en) Systems and methods to reduce network access latency and improve quality of service in wireless communication
Seguin et al. Deep reinforcement learning for downlink scheduling in 5G and beyond networks: A review
WO2019141894A1 (fr) Procédé, système et appareil
Yuan et al. Fairness-oriented user scheduling for bursty downlink transmission using multi-agent reinforcement learning
US20250008346A1 (en) Methods and devices for multi-cell radio resource management algorithms
WO2021164507A1 (fr) Procédé de planification, procédé d'entraînement d'algorithme de planification et système associé, et support de stockage
US11777866B2 (en) Systems and methods for intelligent throughput distribution amongst applications of a User Equipment
Alcaraz et al. Transmission control in NB-IoT with model-based reinforcement learning
EP4243480A1 (fr) Procédé de partage d'informations et appareil de communication
Bansbach et al. Deep reinforcement learning for wireless resource allocation using buffer state information
WO2024199670A1 (fr) Procédés, appareils et programmes informatiques
WO2025193348A1 (fr) Informations d'assistance pour prédiction de mobilité
Shahid et al. CSIT: channel state and idle time predictor using a neural network for cognitive LTE-Advanced network
CN115297536A (zh) 功率控制方法、装置及存储介质
Torabi et al. Wi-Fi aware radio resource management in 5G NR-U: a learning-based coexistence scheme for C-V2X
US20250132868A1 (en) Communication method, network node, storage medium, and program product
US20250351066A1 (en) Apparatus and method for a network device
US20250133543A1 (en) Method performed by network node in communication system and network node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23716495

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202547104955

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2023716495

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2023716495

Country of ref document: EP

Effective date: 20251031

WWP Wipo information: published in national office

Ref document number: 202547104955

Country of ref document: IN
