WO2025074369A1 - System and method for efficient collaborative MARL training using tensor networks - Google Patents

System and method for efficient collaborative MARL training using tensor networks

Info

Publication number
WO2025074369A1
WO2025074369A1
Authority
WO
WIPO (PCT)
Prior art keywords
agents
agent
task
training
joint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/IN2023/050901
Other languages
English (en)
Inventor
Saravanan M
Perepu SATHEESH KUMAR
RamKumar N
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB
Priority to PCT/IN2023/050901
Publication of WO2025074369A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/043 Distributed expert systems; Blackboards

Definitions

  • the agents involved in cMARL may often only acquire a partial view of the full environment.
  • the observations made by an agent may contain a considerable amount of noise, and be only weakly correlated with the true state of the environment.
  • learning of an optimal (joint) policy may be particularly challenging even under the (unrealistic) assumption that the policy of each agent can be made conditional on the observations of all other agents.
  • the present disclosure seeks to develop the use of cMARL-based solutions in general and to mitigate one or more of the above-identified shortcomings thereof, in particular within the field of telecommunications networks.
  • the joint task may be specific to a cognitive layer of a telecommunications network.
  • a cognitive layer implies functionality added by converging artificial intelligence, machine learning and the increasing powers of newer-generation telecommunications networks, such as e.g. fifth-generation (5G) networks or later.
  • the joint task may be performed as part of an intent handler of the cognitive layer.
  • the joint task may include making predictions related to, and/or controlling of, the telecommunications network.
  • each agent may be responsible for meeting a target key performance indicator (KPI) for a specific service of a network slice of the telecommunications network.
  • the joint task may include controlling of one or more network parameters relevant to the target KPIs for all agents.
  • the computer-readable storage medium of the eighth, tenth, twelfth or thirteenth aspect may e.g. be non-transitory, and be provided as e.g. a hard disk drive (HDD), solid state drive (SSD), USB flash drive, SD card, CD/DVD, and/or as any other storage medium capable of non-transitory storage of data.
  • the computer-readable storage medium may be transitory and e.g. correspond to a signal (electrical, optical, mechanical, or similar) present on e.g. a communication link, wire, or similar means of signal transferring.
  • Figure 1A schematically illustrates an exemplary system for cMARL according to the present disclosure
  • Figure 1B schematically illustrates an example ANN-based reinforcement learning agent
  • Figure 2 schematically illustrates an exemplary system for implementing QMIX according to the present disclosure
  • Figure 3 schematically illustrates a cMARL agent according to embodiments of the present disclosure
  • Figure 4 schematically illustrates a flowchart of an exemplary method for training a plurality of agents according to embodiments of the present disclosure
  • Figure 5 schematically illustrates a flowchart of an exemplary method for performing a joint task according to embodiments of the present disclosure
  • Figures 6A and 6B schematically illustrate exemplary training entities according to embodiments of the present disclosure
  • Figures 7A and 7B schematically illustrate exemplary task performing entities according to embodiments of the present disclosure
  • Figure 8 schematically illustrates exemplary computer program products, computer programs and computer-readable storage media according to embodiments of the present disclosure
  • Figure 9 schematically illustrates a plot of average reward
  • a Q-value function may e.g. be represented as a table, with one entry for each possible state-action pair, i.e. as a table whose size is the number of possible states times the number of possible actions.
  • DQL: deep Q-learning
  • DQNs: deep Q-networks
  • the Q-value function is instead parametrized by a limited number of parameters θ (defined e.g. by the various weights of the ANN), i.e. Q(s, a) ≈ Q(s, a; θ). A minimal sketch of such a parametrized value network is given after this list.
  • Other known variants of using deep ANNs to estimate the Q-value functions include e.g. deep recurrent Q-learning (DRQL), independent Q-learning (IQL), value decomposition networks (VDNs), and similar.
  • weights of the mixing network are non-negative.
  • the weights of the mixing network may e.g. be produced by separate so-called hypernetworks, wherein each such hypernetwork may take as input the state of the environment s and generate the weights of a layer of the mixing network.
  • Each hypernetwork may include a single linear layer followed by an absolute-value activation function, to ensure that the mixing network weights are non-negative. A minimal sketch of such a mixing network, with hypernetworks generating its weights, is given after this list.
  • the hypernetworks are not explicitly shown in Figure 2B, but may e.g.
  • One strategy to reduce the number of parameters that need to be trained involves the use of tensor decompositions, in order to represent the various matrices defining the layers of the ANN more efficiently while remaining sufficiently accurate. A minimal sketch of such a low-rank, factorized layer is given after this list.
  • one or more matrices representative of one or more layers of the ANN may instead be represented as one or more tensors, using e.g. a tensor train (TT) decomposition, a Tucker decomposition (TD), a canonical polyadic (CP) decomposition, or similar, wherein a tensor is expressed as a combination of tensors having lower rank.
  • Tensor representations of an ANN may also allow capturing more complex interactions among input features, which would not be evident in the flattened data normally associated with using regular matrices to represent the network. To avoid curse-of-dimensionality problems, decompositions of higher-order tensors into multiple lower-order tensors may be used, while still capturing higher-order relationships in the data.
  • a matrix may be decomposed into outer products of vectors (cf. e.g. singular value decomposition, SVD).
  • tensor decomposition may involve decomposing higher-dimensional tensors into a sum of products of lower-dimensional factors. Such low-rank approximations may thus be used to extract the most important data while offering a reduction of e.g. the amount of memory needed to store and/or process such tensors.
  • decompositions include one or more steps wherein at least part of a tensor that is to be decomposed is reshaped (e.g., flattened, unfolded) into a matrix, and wherein SVD is then used to represent at least this matrix using one or more lower-rank vectors. In conventional SVD, a matrix A (of size m × n) is factorized as A = UΣVᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix holding the singular values of A.
  • a general idea behind ACA is to approximate a matrix with a rank-1 outer product of one row and one column of the matrix, and then iteratively use this process to construct an approximation of arbitrary precision. Ideally, at each step, the best fitting row and column are selected. After the first iteration, the focus of the ACA shifts from trying to approximate the original matrix to instead approximating a residual matrix formed by the difference between the original matrix and the current approximation of that matrix. More specifically, ACA as envisaged herein approximates a matrix A (having size m × n) as an approximation matrix Ã. A minimal NumPy sketch of this iterative cross-approximation procedure is given after this list.
  • FIG. 3 schematically illustrates an agent 310-i according to embodiments of the present disclosure, wherein the agent 310-i is configured to take part as one (e.g.
  • the agent 310-i does not directly use an ANN 312-i to approximate/parametrize its value function. Instead, the agent 310-i uses a tensor 314-i to represent at least one layer 313-i-k of the ANN 312-i, wherein the tensor 314-i is represented as at least one tensor decomposition 316-i. In particular, as described above, the agent 310-i is configured to generate the tensor decomposition 316-i based on ACA 318-i, e.g.
  • the training entity 600 also includes a second module 610b configured to perform operation S412.
  • both modules 610a and 610b may instead be provided as part of a single module, e.g. as part of a (cMARL) training module with which the method 400 may be implemented.
  • the training entity 600 may also include one or more optional functional modules (illustrated by the dashed box 610c), such as for example an environment state module configured to obtain various parameters indicative of the state of the environment, and/or e.g. a task solving module configured to solve the joint task using the plurality of agents trained/generated by the modules 610a and 610b (in which case the training entity 600 is also capable of performing the operations S510 and S512 of the method 500 described with reference to Figure 5).
  • Figure 9 schematically illustrates a plot 900 of how an average reward (indicated on the y-axis) changes with time (indicated on the x-axis) during an example training run of agents (e.g. 310-i) using the ACA-based decomposition as envisaged herein.
  • two agents were deployed in a speaker-listener scenario (also referred to as e.g. a “cooperative communication” scenario).
  • the environment included three landmarks each having a different color (e.g. red, green and blue).
  • the listener agent must navigate to a landmark of a particular color, and obtains a reward based on its distance to the correct landmark.
  • the listener agent can only observe its relative position and the color of the landmarks, but is unaware of which of the three landmarks is the correct one (i.e. the one to which the listener agent should navigate).
  • the tensor decomposition-based networks require only approximately 10 % of the parameters. Further, the networks using tensor decompositions both required fewer training episodes to converge. More specifically, the ACA-based network managed to converge in only approximately 1000 episodes, while the SVD-based counterpart required more than ten times as many episodes (approximately 10000 episodes). It may thus be concluded that in this validation test, the use of the ACA-based tensor decomposition as envisaged herein required fewer episodes to converge and a smaller number of parameters to approximate the agent-specific value functions.
  • FIG. 10 schematically illustrates various layers of a telecommunications network solution 1000.
  • the solution 1000 includes a network layer 1020, which may also be considered as the environment.
  • the network layer 1020 includes e.g. the radio access network (RAN), core, business support system (BSS), customer experience management (CEM), Internet of things (IoT), and similar.
  • the solution 1000 also includes a business operations layer 1030, wherein things like various goals, service level agreements (SLAs), etc. are decided upon and conveyed in the form of one or more intents 1032.
  • UE: user equipment
  • radio base stations, such as gNBs
  • UPFs: User Plane Functions
  • the services 1128a-1128c may e.g. be assumed to form part of a same network slice, or similar.
  • the first service 1128a is, in this particular example, for Conversational Video (CV)
  • the second service 1128b is for Ultra-Reliable Low Latency Communications (URLLC)
  • the third service 1128c is for Massive IoT (mIoT).
  • An observation space may e.g. include the current locations of all vehicles, and an action space may e.g. include the next predicted locations of all the vehicles, while a reward may e.g. be based on the accuracy of the predicted locations and/or on whether there are conflicting locations, and similar.
  • the present disclosure provides an improved way of approximating value functions of agents taking part in cMARL, and in particular of using tensor decompositions based on ACA to reduce the number of parameters required to train/store the ANNs which are conventionally used to approximate the value functions (as in e.g. deep RL).
  • the use of ACA instead of e.g. SVD provides a reduction of both the number of parameters and the number of samples required to find convergent solutions during training, while e.g. still providing a high average reward.
  • the proposed ACA-based solution is also robust against noise (in e.g. the observations made by the agents).
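To make the parametrization idea above concrete, the following is a minimal, illustrative sketch (not taken from the patent) of a value network in which a small multilayer perceptron with parameters θ replaces a tabular Q-function; the layer sizes, module name and use of PyTorch are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; theta) with a small MLP instead of a table.

    All sizes below are illustrative assumptions, not values from the patent.
    """

    def __init__(self, state_dim: int = 8, num_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q-value per available action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# A tabular Q-function needs one entry per state-action pair; this network only
# needs the weights and biases of its two linear layers, however many states exist.
q_net = QNetwork()
state = torch.randn(1, 8)                # a single toy observation
q_values = q_net(state)                  # shape: (1, num_actions)
greedy_action = q_values.argmax(dim=-1)  # exploration (e.g. epsilon-greedy) would go here
```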
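The bullets on the mixing network can likewise be illustrated with a short sketch. Below is a minimal QMIX-style mixer in which single-layer hypernetworks map the environment state s to the mixing weights, and an absolute-value activation (torch.abs) keeps those weights non-negative; all dimensions, names and the choice of ELU for the hidden layer are illustrative assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QmixMixer(nn.Module):
    """Mixes per-agent Q-values into a joint Q_tot with state-dependent weights.

    Sizes and names are illustrative assumptions; only the non-negativity of the
    mixing weights (via torch.abs) follows the description above.
    """

    def __init__(self, n_agents: int = 2, state_dim: int = 10, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks: single linear layers mapping the state to mixing weights/biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(-1)  # one joint Q-value per batch element
```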
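As a sketch of the parameter-reduction idea behind representing ANN layers with tensor decompositions, the example below replaces a dense layer's weight matrix with a rank-r factorization implemented as two smaller linear layers; the sizes and the chosen rank are assumptions for illustration, and a full tensor-train, Tucker or CP treatment would factor the weights further than this simple matrix case.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Replaces a dense (out x in) weight matrix with two factors of rank r.

    Parameter count drops from out*in (+ out) to roughly r*(in + out) (+ out);
    the rank below is an illustrative assumption, not a value from the patent.
    """

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)  # first low-rank factor
        self.up = nn.Linear(rank, out_features, bias=True)    # second low-rank factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

dense = nn.Linear(512, 512)
low_rank = LowRankLinear(512, 512, rank=32)
n_dense = sum(p.numel() for p in dense.parameters())    # 262,656 parameters
n_low = sum(p.numel() for p in low_rank.parameters())   # 33,280 parameters
print(n_dense, n_low)
```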
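Finally, the iterative cross-approximation procedure described above can be sketched in a few lines of NumPy. The excerpt does not specify how the pivot row and column are chosen or when to stop, so the full-pivoting rule and tolerance below are illustrative assumptions; only the overall scheme (pick a pivot, form a rank-1 cross, subtract it from the residual, repeat) follows the description.

```python
import numpy as np

def aca_approximate(A: np.ndarray, max_rank: int = 10, tol: float = 1e-6):
    """Adaptive cross approximation of A (m x n) as a sum of rank-1 terms.

    Each iteration picks the largest entry of the current residual (full
    pivoting, an illustrative choice), forms a rank-1 cross from the pivot's
    row and column, and subtracts it from the residual.
    Returns factors U (m x r) and V (r x n) with A ~= U @ V.
    """
    m, n = A.shape
    residual = A.astype(float)
    us, vs = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        pivot = residual[i, j]
        if abs(pivot) < tol:
            break                         # residual is (numerically) negligible
        u = residual[:, j] / pivot        # column through the pivot, scaled
        v = residual[i, :].copy()         # row through the pivot
        residual -= np.outer(u, v)        # remove the new rank-1 term
        us.append(u)
        vs.append(v)
    U = np.column_stack(us) if us else np.zeros((m, 0))
    V = np.vstack(vs) if vs else np.zeros((0, n))
    return U, V

# Toy usage: approximate a low-rank weight matrix such as an ANN layer might have.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))  # rank-8, 64 x 64
U, V = aca_approximate(W, max_rank=8)
rel_err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
print(f"relative error: {rel_err:.2e}")   # tiny for an exactly rank-8 matrix
print(W.size, U.size + V.size)            # 4096 vs 1024 stored values
```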

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A computer-implemented method for training a plurality of agents (310-i) to perform a joint task is disclosed. The method comprises training, by cooperative multi-agent reinforcement learning (cMARL), a plurality of agents (310-i) to perform a joint task. The training further comprises, for each agent, approximating an agent-specific value function as a tensor decomposition (316-i) based on adaptive cross approximation (318-i). Claimed use cases include the use of such agents in a cognitive layer of a telecommunications network. A method of performing a task using the trained agents is also disclosed, as are corresponding entities, computer programs and computer program products.
PCT/IN2023/050901 2023-10-03 2023-10-03 System and method for efficient collaborative MARL training using tensor networks Pending WO2025074369A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IN2023/050901 WO2025074369A1 (fr) 2023-10-03 2023-10-03 System and method for efficient collaborative MARL training using tensor networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IN2023/050901 WO2025074369A1 (fr) 2023-10-03 2023-10-03 System and method for efficient collaborative MARL training using tensor networks

Publications (1)

Publication Number Publication Date
WO2025074369A1 (fr) 2025-04-10

Family

ID=95284225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2023/050901 Pending WO2025074369A1 (fr) System and method for efficient collaborative MARL training using tensor networks

Country Status (1)

Country Link
WO (1) WO2025074369A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120048123A (zh) * 2025-04-24 2025-05-27 重庆凯瑞机器人技术有限公司 Vehicle-road cooperative dynamic scheduling system and method based on multi-agent reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230229916A1 (en) * 2022-01-20 2023-07-20 Nvidia Corporation Scalable tensor network contraction using reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ARAYA TSEGA WELDU, IBN NAWAB MD RASHED, YUAN LING A. P.: "Research on Tensor-Based Cooperative and Competitive in Multi-Agent Reinforcement Learning", EUROPEAN JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, vol. 4, no. 6, 1 December 2020 (2020-12-01), pages 1 - 9, XP093301802, ISSN: 2736-5751, DOI: 10.24018/ejece.2020.4.6.262 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23954672

Country of ref document: EP

Kind code of ref document: A1