WO2025074369A1 - System and method for efficient collaborative MARL training using tensor networks
- Publication number: WO2025074369A1
- Application number: PCT/IN2023/050901
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agents
- agent
- task
- training
- joint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/043—Distributed expert systems; Blackboards
Definitions
- the agents involved in cMARL may often only acquire a partial view of the full environment.
- the observations made by an agent may contain a considerable amount of noise, and be only weakly correlated with the true state of the environment.
- learning of an optimal (joint) policy may be particularly challenging even under the (unrealistic) assumption that the policy of each agent can be made conditional on the observations of all other agents.
- the present disclosure seeks to develop the use of cMARL-based solutions in general and to mitigate one or more of the above-identified shortcomings thereof, in particular within the field of telecommunications networks.
- the joint task may be specific to a cognitive layer of a telecommunications network.
- a cognitive layer refers to functionality added by converging artificial intelligence, machine learning and the increased capabilities of newer-generation telecommunications networks, such as e.g. fifth-generation (5G) networks or later.
- the joint task may be performed as part of an intent handler of the cognitive layer.
- the joint task may include making predictions related to, and/or controlling of, the telecommunications network.
- each agent may be responsible for meeting a target key performance indicator (KPI) for a specific service of a network slice of the telecommunications network.
- the joint task may include controlling of one or more network parameters relevant to the target KPIs for all agents.
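For illustration only, a minimal sketch of what such per-agent KPI responsibilities could look like in code is given below; the service names reuse the CV, URLLC and mIoT services described further below, while the KPI names, target values and reward shaping are purely hypothetical assumptions and not taken from the disclosure.

```python
# Illustrative sketch only: one cMARL agent per network-slice service, each
# responsible for meeting a target KPI (KPI names and targets are assumptions).
KPI_TARGETS = {
    "CV": ("latency_ms", 150.0),     # Conversational Video
    "URLLC": ("latency_ms", 1.0),    # Ultra-Reliable Low Latency Communications
    "mIoT": ("conn_per_km2", 1e6),   # Massive IoT
}

def agent_reward(service: str, measured: float, lower_is_better: bool = True) -> float:
    """Per-agent reward: 1.0 when the KPI target is met, otherwise scaled by how close it is."""
    _, target = KPI_TARGETS[service]
    if lower_is_better:
        return 1.0 if measured <= target else target / measured
    return 1.0 if measured >= target else measured / target
```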
- the computer-readable storage medium of the eighth, tenth, twelfth or thirteenth aspect may e.g. be non-transitory, and be provided as e.g. a hard disk drive (HDD), solid state drive (SSD), USB flash drive, SD card, CD/DVD, and/or as any other storage medium capable of non-transitory storage of data.
- the computer-readable storage medium may be transitory and e.g. correspond to a signal (electrical, optical, mechanical, or similar) present on e.g. a communication link, wire, or similar means of signal transferring.
- Figure 1A schematically illustrates an exemplary system for cMARL according to the present disclosure
- Figure 1B schematically illustrates an example ANN-based reinforcement learning agent
- Figure 2 schematically illustrates an exemplary system for implementing QMIX according to the present disclosure
- Figure 3 schematically illustrates a cMARL agent according to embodiments of the present disclosure
- Figure 4 schematically illustrates a flowchart of an exemplary method for training a plurality of agents according to embodiments of the present disclosure
- Figure 5 schematically illustrates a flowchart of an exemplary method for performing a joint task according to embodiments of the present disclosure
- Figures 6A and 6B schematically illustrate exemplary training entities according to embodiments of the present disclosure
- Figures 7A and 7B schematically illustrate exemplary task performing entities according to embodiments of the present disclosure
- Figure 8 schematically illustrates exemplary computer program products, computer programs and computer-readable storage media according to embodiments of the present disclosure
- Figure 9 schematically illustrates a plot of average reward
- a Q-value function may e.g. be represented as a table, with one entry for each possible state-action pair, i.e. one entry Q(s, a) for each combination of a state s and an action a.
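As a minimal sketch (standard tabular Q-learning, not specific to the disclosure), such a table-based Q-value function and its one-step update could be written as:

```python
from collections import defaultdict

# Tabular Q-value function: one entry per (state, action) pair.
Q = defaultdict(float)           # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 0.99         # learning rate and discount factor (illustrative values)

def td_update(state, action, reward, next_state, actions):
    """One-step Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```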
- DQL deep Q-learning
- DQNs deep Q-networks
- the Q-value function is instead parametrized by a limited number of parameters M (defined e.g. by the various weights of the ANN), i.e. Q(s, a) ≈ Q(s, a; M).
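A minimal sketch of such an ANN-based parametrization is shown below (PyTorch is used here purely as an implementation choice, and the layer sizes are arbitrary assumptions). The trainable weights of the network collectively play the role of the parameters M:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Parametrized Q-value function Q(s, .; M): maps an observation to one Q-value per action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)  # shape: (batch, n_actions)
```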
- Other known variants of using deep ANNs to estimate the Q-value functions include e.g. deep recurrent Q-learning (DRQL), independent Q-learning (IQL), value decomposition networks (VDNs), and similar.
- weights of the mixing network are non-negative.
- the weights of the mixing network may e.g. be produced by separate so-called hypernetworks, wherein each such hypernetwork may take as input the state of the environment s and generate the weights of a layer of the mixing network.
- Each hypernetwork may include a single linear layer followed by an absolute activation function, to ensure that the mixing network weights are non-negative.
- the hypernetworks are not explicitly shown in Figure 2B.
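A minimal sketch of such a hypernetwork-driven mixing step is given below (a deliberately simplified, single-layer variant of the QMIX-style mixing described above; layer sizes and names are assumptions, not the disclosure's exact architecture):

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q-values into a joint Q-value using non-negative, state-dependent weights."""
    def __init__(self, n_agents: int, state_dim: int):
        super().__init__()
        # Each hypernetwork is a single linear layer; its output is passed through
        # an absolute value so that the generated mixing weights are non-negative.
        self.hyper_w = nn.Linear(state_dim, n_agents)  # generates mixing weights
        self.hyper_b = nn.Linear(state_dim, 1)         # generates a bias term

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w = torch.abs(self.hyper_w(state))             # non-negative weights
        b = self.hyper_b(state)
        return (w * agent_qs).sum(dim=1, keepdim=True) + b  # joint Q-value, (batch, 1)
```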
- One strategy to reduce the number of parameters that need to be trained involves the use of tensor decompositions, in order to represent the various matrices defining the layers of the ANN more efficiently while remaining sufficiently accurate.
- one or more matrices representative of one or more layers of the ANN may instead be represented as one or more tensors, using e.g. a tensor train (TT) decomposition, a Tucker decomposition (TD), a canonical polyadic (CP) decomposition, or similar, wherein a tensor is expressed as a combination of tensors having lower rank.
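As a rough numpy-only sketch of one such decomposition (a basic TT-SVD construction; a production implementation would typically use a dedicated tensor library, and the disclosure does not prescribe this particular procedure), a tensor-train decomposition can be obtained by repeatedly reshaping the remaining data into a matrix and taking a truncated SVD:

```python
import numpy as np

def tensor_train(tensor: np.ndarray, max_rank: int):
    """TT-SVD sketch: decompose a d-way tensor into d three-way cores of bounded rank."""
    dims = tensor.shape
    cores, rank = [], 1
    rest = tensor.reshape(rank * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, len(s))                             # truncate to the kept rank
        cores.append(u[:, :r].reshape(rank, dims[k], r))      # k-th TT core
        rest = (np.diag(s[:r]) @ vt[:r]).reshape(r * dims[k + 1], -1)
        rank = r
    cores.append(rest.reshape(rank, dims[-1], 1))             # final core
    return cores
```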
- Tensor representations of an ANN may also allow capturing more complex interactions among input features which would not be evident in the flattened data normally associated with using regular matrices to represent the network. To avoid curse-of-dimensionality problems, decompositions of higher-order tensors into multiple lower-order tensors may be used, while still allowing higher-order relationships in the data to be captured.
- a matrix may be decomposed into outer products of vectors (cf. e.g. SVD).
- tensor decomposition may involve decomposing higher-dimensional tensors into a sum of products of lower-dimensional factors. Such low-rank approximations may thus be used to extract the most important data while offering a reduction of e.g. the amount of memory needed to store and/or process such tensors.
- decompositions include one or more steps wherein at least part of a tensor that is to be decomposed is reshaped (e.g., flattened, unfolded) into a matrix, and wherein SVD is then used to represent at least this matrix using one or more lower-rank vectors.
- In conventional SVD, a matrix A (of size m × n) is decomposed as A = UΣVᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix containing the singular values of A.
- a general idea behind ACA is to approximate a matrix with a rank-1 outer product of one row and one column of the matrix, and then iteratively use this process to construct an approximation of arbitrary precision. Ideally, at each step, the best-fitting row and column are selected. After the first iteration, the focus of the ACA shifts from trying to approximate the original matrix to instead approximating a residual matrix formed by a difference between the original matrix and the current approximation of that matrix. More specifically, ACA as envisaged herein approximates a matrix A (having size m × n) as an approximating matrix Ã formed as a sum of such rank-1 outer products.
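A minimal numpy sketch of this idea is given below; it uses a simple full-pivot selection on the residual for clarity, whereas practical ACA variants (including, possibly, the one envisaged herein) typically use cheaper partial pivoting so that only selected rows and columns of the matrix ever need to be accessed:

```python
import numpy as np

def aca(A: np.ndarray, max_rank: int, tol: float = 1e-6):
    """Adaptive cross approximation: build A ~ U @ V from selected rows/columns of the residual."""
    R = A.astype(float).copy()        # residual matrix
    us, vs = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # best-fitting pivot entry
        if abs(R[i, j]) < tol:
            break                     # residual is already small enough
        u = R[:, j].copy()            # pivot column of the residual
        v = R[i, :].copy() / R[i, j]  # pivot row, normalised by the pivot entry
        us.append(u)
        vs.append(v)
        R -= np.outer(u, v)           # next iteration approximates what is left over
    U = np.column_stack(us) if us else np.zeros((A.shape[0], 0))
    V = np.vstack(vs) if vs else np.zeros((0, A.shape[1]))
    return U, V                       # A is approximated by U @ V
```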
- FIG. 3 schematically illustrates an agent 310-i according to embodiments of the present disclosure, wherein the agent 310-i is configured to take part as one (e.g. an i-th agent) of a plurality of agents used to perform a joint task using cMARL.
- the agent 310-i does not directly use an ANN 312-i to approximate/parametrize its value function. Instead, the agent 310-i uses a tensor 314-i to represent at least one layer 313-i-k of the ANN 312-i, wherein the tensor 314-i is represented as at least one tensor decomposition 316-i. In particular, as described above, the agent 310-i is configured to generate the tensor decomposition 316-i based on ACA 318-i.
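As a simplified sketch of what tensorizing a single layer can amount to (here, a plain two-factor low-rank layer; the disclosure's ACA-based decomposition 316-i of the tensor 314-i may factorize the layer differently), a dense layer 313-i-k could be replaced by its low-rank factors as follows:

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A linear layer whose weight matrix W (out x in) is stored as two low-rank factors U @ V."""
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.V = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # (rank x in)
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.01)  # (out x rank)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (U @ V).T + bias, but with (in + out) * rank parameters
        # instead of in * out, which is the kind of saving a tensorized layer targets.
        return (x @ self.V.T) @ self.U.T + self.bias
```

The factors could, for example, be initialised from an ACA or truncated-SVD approximation of a pre-trained dense weight matrix and then trained further, although the disclosure itself does not mandate any particular initialisation.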
- the training entity 600 also includes a second module 610b configured to perform operation S412.
- both modules 610a and 610b may instead be provided as part of a single module, e.g. as part of a (cMARL) training module with which the method 400 may be implemented.
- the training entity 600 may also include one or more optional functional modules (illustrated by the dashed box 610c), such as for example an environment state module configured to obtain various parameters indicative of the state of the environment, and/or e.g. a task solving module configured to solve the joint task using the plurality of agents trained/generated by the modules 610a and 610b (in which case the training entity 600 is also capable of performing the operations S510 and S512 of the method 500 described with reference to Figure 5).
- Figure 9 schematically illustrates a plot 900 of how an average reward (indicated on the y-axis) changes with time (indicated on the x-axis) during an example training run of agents (e.g. 310-i) using the ACA-based decomposition as envisaged herein.
- two agents were deployed in a speaker-listener scenario (also referred to as e.g. a “cooperative communication” scenario).
- the environment included three landmarks each having a different color (e.g. red, green and blue).
- the listener agent must navigate to a landmark of a particular color, and obtains a reward based on its distance to the correct landmark.
- the listener agent can only observe its position relative to the landmarks and the colors of the landmarks, but is unaware of which of the three landmarks is the correct one (i.e. the one to which the listener agent should navigate).
- the tensor decomposition-based networks require only approximately 10% of the parameters. Further, the networks using tensor decompositions both required fewer training episodes to converge. More specifically, the ACA-based network managed to converge in only approximately 1000 episodes, while the SVD-based counterpart required more than ten times as many episodes (approximately 10000 episodes). It may thus be concluded that in this validation test, the use of the ACA-based tensor decomposition as envisaged herein required fewer episodes to converge and a smaller number of parameters to approximate the agent-specific value functions.
- FIG. 10 schematically illustrates various layers of a telecommunications network solution 1000.
- the solution 1000 includes a network layer 1020, which may also be considered as the environment.
- the network layer 1020 includes e.g. the radio access network (RAN), core, business support system (BSS), customer experience management (CEM), Internet of things (IoT), and similar.
- the solution 1000 also includes a business operations layer 1030 wherein things like various goals, service level agreements (SLAs), etc. are decided upon and conveyed in the form of one or more intents 1032.
- UE (user equipment)
- radio base stations such as gNBs
- UPFs (User Plane Functions)
- the services 1128a-1128c may e.g. be assumed to form part of a same network slice, or similar.
- the first service 1128a is, in this particular example, for Conversational Video (CV)
- the second service 1128b is for Ultra-Reliable Low Latency Communications (URLLC)
- the third service 1128c is for Massive IoT (mIoT).
- An observation space may e.g. include current locations of all vehicles, and an action space may e.g. include the next predicted locations of all the vehicles, while a reward may be an accuracy of the predicted locations, and/or e.g. based on whether there are conflicting locations or not, and similar.
- the present disclosure provides an improved way of how to approximate value functions of agents taking part in cMARL, and in particular of how to use tensor decompositions based on ACA to reduce the number of parameters required to train/store the ANNs which are conventionally used to approximate the value functions (as in e.g. deep RL).
- the use of ACA instead of e.g. SVD provides a reduction of both the number of parameters and the number of samples required to find convergent solutions during training, while e.g. still providing a high average reward.
- the proposed ACA-based solution is also robust to noise (in e.g. the observations made by the agents).
Abstract
A computer-implemented method for training a plurality of agents (310-i) to perform a joint task is provided. The method includes training, using cooperative multi-agent reinforcement learning (cMARL), a plurality of agents (310-i) to perform a joint task. The training further includes, for each agent, approximating an agent-specific value function as a tensor decomposition (316-i) based on adaptive cross approximation (318-i). Claimed use cases include the use of such agents in a cognitive layer of a telecommunications network. Also provided are a method of performing a task using the trained agents, as well as corresponding entities, computer programs and computer program products.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IN2023/050901 | 2023-10-03 | 2023-10-03 | System and method for efficient collaborative MARL training using tensor networks |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025074369A1 | 2025-04-10 |
Family
ID=95284225
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120048123A (zh) * | 2025-04-24 | 2025-05-27 | 重庆凯瑞机器人技术有限公司 | Vehicle-road cooperative dynamic scheduling system based on multi-agent reinforcement learning, and method therefor |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230229916A1 (en) * | 2022-01-20 | 2023-07-20 | Nvidia Corporation | Scalable tensor network contraction using reinforcement learning |
Non-Patent Citations (1)
| Title |
|---|
| ARAYA TSEGA WELDU, IBN NAWAB MD RASHED, YUAN LING A. P.: "Research on Tensor-Based Cooperative and Competitive in Multi-Agent Reinforcement Learning", EUROPEAN JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE, vol. 4, no. 6, 1 December 2020 (2020-12-01), pages 1 - 9, XP093301802, ISSN: 2736-5751, DOI: 10.24018/ejece.2020.4.6.262 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23954672; Country of ref document: EP; Kind code of ref document: A1 |