WO2025074369A1 - System and method for efficient collaborative MARL training using tensor networks
- Publication number
- WO2025074369A1 (PCT/IN2023/050901)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agents
- agent
- task
- training
- joint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/043—Distributed expert systems; Blackboards
Definitions
- the training entity 600 also includes a second module 610b configured to perform operation S412.
- both modules 610a and 610b may instead be provided as part of a single module, e.g. as part of a (cMARL) training module with which the method 400 may be implemented.
- the training entity 600 may also include one or more optional functional modules (illustrated by the dashed box 610c), such as for example an environment state module configured to obtain various parameters indicative of the state of the environment, and/or e.g. a task solving module configured to solve the joint task using the plurality of agents trained/generated by the modules 610a and 610b (in which case the training entity 600 is also capable of performing the operations S510 and S512 of the method 500 described with reference to Figure 5).
- Figure 9 schematically illustrates a plot 900 of how an average reward (indicated on the y-axis) changes with time (indicated on the x-axis) during an example training of agents (e.g. 310-i) using the ACA-based decomposition as envisaged herein.
- two agents were deployed in a speaker-listener scenario (also referred to as e.g. a “cooperative communication” scenario).
- the environment included three landmarks each having a different color (e.g. red, green and blue).
- the listener agent must navigate to a landmark of a particular color, and obtains a reward based on its distance to the correct landmark.
- the listener agent can only observe its relative position and the colors of the landmarks, but is unaware of which of the three landmarks is the correct one (i.e. the one to which the listener agent should navigate).
- the tensor decomposition-based networks required only approximately 10 % of the parameters. Further, the networks using tensor decompositions both required fewer training episodes to converge. More specifically, the ACA-based network managed to converge in only approximately 1000 episodes, while the SVD-based counterpart required more than ten times as many episodes (approximately 10000 episodes). It may thus be concluded that in this validation test, the use of the ACA-based tensor decomposition as envisaged herein required fewer episodes to converge and a smaller number of parameters to approximate the agent-specific value functions.
- FIG. 10 schematically illustrates various layers of a telecommunications network solution 1000.
- the solution 1000 includes a network layer 1020, which may also be considered as the environment.
- the network layer 1020 includes e.g. the radio access network (RAN), core, business support system (BSS), customer experience management (CEM), Internet of things (IoT), and similar.
- the solution 1000 also includes a business operations layer 1030 wherein things like various goals, service level agreements (SLAs), etc. are decided upon and conveyed in the form of one or more intents 1032.
- user equipment (UE), radio base stations such as gNBs, and User Plane Functions (UPFs)
- the services 1128a-1128c may e.g. be assumed to form part of a same network slice, or similar.
- the first service 1128a is, in this particular example, for Conversational Video (CV)
- the second service 1128b is for Ultra-Reliable Low Latency Communications (URLLC)
- the third service 1128c is for Massive IoT (mIoT).
- An observation space may e.g. include current locations of all vehicles, and an action space may e.g. include the next predicted locations of all the vehicles, while a reward may be an accuracy of the predicted locations, and/or e.g. based on whether there are conflicting locations or not, and similar.
- the present disclosure provides an improved way of how to approximate value functions of agents taking part in cMARL, and in particular of how to use tensor decompositions based on ACA to reduce the number of parameters required to train/store the ANNs which are conventionally used to approximate the value functions (as in e.g. deep RL).
- the use of ACA instead of e.g. SVD provides a reduction of both the number of parameters and the number of samples required to find convergent solutions during training, while e.g. still providing a high average reward.
- the proposed ACA-based solution is also effective against noise (in e.g. the observations made by the agents).
Abstract
A computer-implemented method of training a plurality of agents (310-i) to perform a joint task is provided. The method includes training, using cooperative multi-agent reinforcement learning (cMARL), a plurality of agents (310-i) to perform a joint task. The training further includes, for each agent, approximating an agent-specific value function as a tensor decomposition (316-i) based on an adaptive cross-approximation (318-i). Claimed use cases include the use of such agents in a cognitive layer of a telecommunications network. A method of performing a task using the trained agents, as well as corresponding entities, computer programs and computer program products are also provided.
Description
SYSTEM AND METHOD FOR COLLABORATIVE MARL TRAINING USING TENSOR NETWORKS Technical field [0001] The present disclosure relates to the field of collaborative multi-agent reinforcement learning (cMARL). In particular, the present disclosure relates to cMARL models wherein agent-specific value functions are approximated using tensor decompositions/networks. An example use case relates to the use of cMARL for performing various tasks in a cognitive layer of a telecommunications network. Background
[0002] As its name implies, cooperative multi-agent reinforcement learning (cMARL) includes using multiple agents that are trained to cooperate in order to reach a common goal, i.e. to perform a joint task. One particular use case of cMARL exists within so-called cognitive networks or cognitive layers of a telecommunications network, in which it is envisaged that various machine learning algorithms will help to enable for example autonomous operation and self-adapting behavior of the network. Other related use cases are described in e.g. 3GPP SA2, 3GPP RAN3, 3GPP SA5, TMF, ETSI ZSM, and similar efforts of standardization. [0003] A telecommunications network serves to illustrate a situation wherein the number of actions that an agent may take is often large, and/or wherein the states between which the environment in which the agents operate may transition are often many. As a consequence, manual engineering of value and/or policy functions of the agents quickly becomes impractical. Instead, such value and/or policy functions may be approximated using artificial neural networks (ANNs), a practice referred to as deep reinforcement learning (deep RL). A disadvantage of such ANN-based function approximators is that as the dimensionality of the state- and/or action-space increases, deeper and more complex ANNs are often required to accurately capture the complexity of the environment. As a result, the number of parameters that need to be modeled/trained increases, more gradients are required to be computed during training of the ANNs, the number of episodes required to find convergent solutions increases, and the overall requirements on memory and computational power make
such solutions less suitable for use in end devices such as those intended for Internet-of-things, or similar. [0004] One way of reducing the number of parameters includes the use of tensor decompositions, which are also believed to present results which are more easily understandable and interpretable to a human than the “black box”-like solutions provided by traditional ANNs. Common tensor decomposition strategies often rely on the use of singular value decomposition, SVD, to perform low-rank approximations of matrices and higher-order tensors, wherein a tensor (a multi-dimensional array) is expressed as a sequence of elementary operations acting on other, simpler tensors, such as in e.g. tensor train techniques and other tensor networks. [0005] In many real-world applications, independently of whether e.g. the policy and/or value functions are approximated using conventional ANNs or with tensor decompositions, the agents involved in cMARL may often only acquire a partial view of the full environment. In addition, the observations made by an agent may contain a considerable amount of noise, and be only weakly correlated with the true state of the environment. Under such circumstances, learning of an optimal (joint) policy may be particularly challenging even under the (unrealistic) assumption that the policy of each agent can be made conditional on the observations of all other agents. [0006] The present disclosure seeks to develop the use of cMARL-based solutions in general and to mitigate one or more of the above-identified shortcomings thereof, in particular within the field of telecommunications networks. Summary [0007] To at least partially overcome some or all of the above-identified shortcomings of contemporary technology, the present disclosure provides a computer-implemented method of training a plurality of agents to perform a joint task, a method of performing a joint task, a training entity for training of a plurality of agents to perform a joint task, a task performing entity for performing a joint task, and corresponding computer programs and computer program products, as defined in and by the accompanying independent claims. Various embodiments of the methods, entities, computer programs and computer program products are defined in and by the accompanying dependent claims.
[0008] According to a first aspect, there is provided a computer-implemented method of training a plurality of agents to perform a joint task. The method includes training, using cooperative multi-agent reinforcement learning (cMARL), a plurality of agents to perform a joint task. The training further includes, for each agent, approximating an agent-specific value function as a tensor decomposition based on an adaptive cross-approximation (ACA). [0009] As used herein, performing a joint task may e.g. include controlling and/or making predictions about a particular system (or environment). The system may be a physical system, such as a telecommunications system or any other system for which control and/or predictions are needed. The term “collaborative multi-agent reinforcement learning” is assigned its usual meaning, wherein multiple agents, possibly with their own distinctive attributes and roles, work together to jointly handle one or more tasks. Such reinforcement learning may also be referred to as for example “cooperative multi-agent reinforcement learning”, or similar. In particular, the term “collaborative” implies that the agents are faced with a goal of at least partly maximizing group utility. [0010] As will be described in more detail later herein, an “adaptive cross-approximation” (ACA) builds on the assumption that a tensor (e.g. a matrix or higher-dimensional object) e.g. includes many numerically rank-deficient blocks, and that such a tensor may thus be represented by a much-reduced set of lower-dimensional objects. For example, blocks of a matrix may be represented by a set of column vectors and compressed by using the ACA. [0011] The present disclosure improves upon contemporary technology in that, as will be exemplified later herein, the ACA may provide a reduction of observation noise for an agent and thus improve the actions taken by such an agent as a result of making more accurate predictions and observations, and result in an improved possibility for successful cMARL applications. By reducing the noise through the use of the ACA as part of the tensor decomposition used to approximate the agent-specific value functions, convergence during training may be achieved more quickly (in a smaller number of episodes) and with a thereby reduced number of samples required for training. The thereby resulting reduced overall demand on memory and computing power due to the use of the ACA instead of e.g. conventional singular value decomposition (SVD) as part of forming the tensor decompositions may thus, as the inventors have realized,
make lower-end devices (such as those intended for IoT) more suitable for both training and running models capable of handling more complex environments. [0012] In one or more embodiments of the method, the joint task may be specific to a cognitive layer of a telecommunications network. As used herein, a “cognitive layer” implies functionality added by converging artificial intelligence, machine learning and the increasing powers of newer-generation telecommunications networks, such as e.g. fifth-generation (5G) networks or later. As mentioned earlier herein, examples of such cognitive layers, cognitive networks, and similar, are provided for example in 3GPP SA2, 3GPP RAN3, 3GPP SA5, TMF, ETSI ZSM, and similar (ongoing) efforts of standardization. [0013] In one or more embodiments of the method, the method may further include emulating the telecommunications network as part of the training. For example, the agents may be trained in a simulator (off-policy training) capable of emulating the behavior of a real telecommunications network, after which the agents are deployed as part of controlling (and/or making predictions about) one or more real such networks. Using a simulated training environment may be advantageous in that the agents are then e.g. able to make mistakes as they learn, without negatively affecting the real network. [0014] In one or more embodiments of the method, the joint task may include at least one of throughput prediction (as part of e.g. an interacting closed loops scenario), intent-based automation, prediction of trajectories for connected users and/or vehicles (in order to e.g. optimize Quality of Service, QoS, performance due to the reduced noise), and similar. [0015] According to a second aspect, there is provided a (computer-implemented) method of performing a joint task. The method includes obtaining a plurality of agents trained to perform the joint task, wherein, for each agent, an agent-specific value function is approximated as a tensor decomposition based on an adaptive cross-approximation (ACA). Phrased differently, the obtained agents are trained e.g. in accordance with the method of the first aspect. The method of the second aspect further includes performing the joint task using the trained plurality of agents.
[0016] In one or more embodiments of the method, the plurality of agents may (as stated above) be trained using the method according to the first aspect (or any embodiment thereof disclosed herein). [0017] In one or more embodiments of the method, if the joint task is specific to a cognitive layer of a telecommunications network, performing the joint task may be performed as part of an intent handler of the cognitive layer. [0018] In one or more embodiments of the method, the joint task may include making predictions related to, and/or controlling of, the telecommunications network. [0019] In one or more embodiments of the method, each agent may be responsible for meeting a target key performance indicator (KPI) for a specific service of a network slice of the telecommunications network. The joint task may include controlling of one or more network parameters relevant to the target KPIs for all agents. [0020] In one or more embodiments of the method, the plurality of agents may include at least two of: i) a first agent responsible for meeting a KPI in the form of Quality of Experience (QoE) for a Conversational Video (CV) service; ii) a second agent responsible for meeting a KPI in the form of a Packet Error Rate (PER) for an Ultra-Reliable Low Latency Communications (URLLC) service, and iii) a third agent responsible for meeting a KPI in the form of a PER for a Massive Internet of Things (IoT) service. The one or more network parameters may include priority and maximum bit rate (MBR) of the network slice. [0021] According to a third aspect, there is provided a training entity for training of a plurality of agents to perform a joint task. The training entity includes processing circuitry and a memory storing instructions. The instructions are such that they, when executed by the processing circuitry, cause the training entity to train, using cooperative multi-agent reinforcement learning (cMARL), a plurality of agents to perform a joint task. For each agent, an agent-specific value function is approximated as a tensor decomposition based on an adaptive cross-approximation (ACA), as described earlier herein. Phrased differently, the training entity is configured to perform e.g. the method of the first aspect.
[0022] In one or more embodiments of the training entity, as indicated above, the instructions may be further such that they, when executed by the processing circuitry, cause the training entity to perform any embodiments of the method of the first aspect described herein. [0023] In one or more embodiments of the training entity, the training entity may form part of a node for/of a telecommunications network. [0024] According to a fourth aspect, there is provided a task performing entity for performing a joint task. The task performing entity includes processing circuitry and a memory storing instructions. The instructions are such that they, when executed by the processing circuitry, cause the task performing entity to obtain a plurality of agents trained to perform a joint task, wherein, for each agent, an agent-specific value function is approximated as a tensor decomposition based on an adaptive cross-approximation (ACA) as described earlier herein, and to perform the joint task using the plurality of agents. Phrased differently, the task performing entity is configured to perform e.g. the method of the second aspect. [0025] In one or more embodiments of the task performing entity, as indicated above, the instructions may be further such that they, when executed by the processing circuitry, cause the task performing entity to perform any embodiments of the method of the second aspect described herein. [0026] In one or more embodiments of the task performing entity, the task performing entity may form part of a node for/of a telecommunications network. [0027] According to a fifth aspect, there is provided a computer program for training a plurality of agents to perform a joint task. The computer program includes computer code which, when executed by/run on processing circuitry, causes the processing circuitry to train, using cooperative multi-agent reinforcement learning (cMARL), a plurality of agents to perform a joint task, wherein, for each agent, an agent-specific value function is approximated as a tensor decomposition based on an adaptive cross-approximation (ACA). Phrased differently, the computer program is configured to make the processing circuitry execute (i.e. by controlling a suitable entity) the method (or any embodiment thereof) of the first aspect. [0028] According to a sixth aspect, there is provided a computer program product. The computer program product includes a computer program (such as that of the
fifth aspect), and a computer-readable medium on which the computer program is stored. [0029] According to a seventh aspect, there is provided a computer program for performing a joint task. The computer program includes computer code which, when executed by/run on processing circuitry, causes the processing circuitry to obtain a plurality of agents trained to perform a joint task, wherein, for each agent, an agent-specific value function is approximated as a tensor decomposition based on an adaptive cross-approximation (ACA), and to perform the joint task using the plurality of agents. Phrased differently, the computer program is configured to make the processing circuitry execute (i.e. by controlling a suitable entity) the method (or any embodiment thereof) of the second aspect. [0030] According to an eighth aspect, there is provided a computer program product. The computer program product includes a computer program (such as that of the seventh aspect), and a computer-readable storage medium on which the computer program is stored. [0031] According to a ninth aspect, there is provided a computer-readable storage medium, on which data representative of one or more agents trained using cMARL and in accordance with any of the methods described herein, such as the method of the first aspect or any embodiments thereof, is stored. [0032] As used herein, a computer-readable storage medium (such as that of e.g. the eighth, tenth, twelfth or thirteenth aspect) may e.g. be non-transitory, and be provided as e.g. a hard disk drive (HDD), solid state drive (SSD), USB flash drive, SD card, CD/DVD, and/or as any other storage medium capable of non-transitory storage of data. In other embodiments, the computer-readable storage medium may be transitory and e.g. correspond to a signal (electrical, optical, mechanical, or similar) present on e.g. a communication link, wire, or similar means of signal transferring. [0033] Other objects and advantages of the present disclosure will be apparent from the following detailed description, the drawings and the claims. Within the scope of the present disclosure, it is envisaged that all features and advantages described with reference to e.g. the method of the first aspect are relevant for, apply to, and may be used in combination with also the method of the second aspect, the
entities of the third and fourth aspects, computer programs of the fifth and seventh aspects, the computer program products of the sixth and eighth aspects, and the computer-readable storage medium of the ninth aspect, and vice versa. Brief description of the drawings [0034] Embodiments of the present disclosure will be described below with reference to the
accompanying drawings, in which: Figure 1A schematically illustrates an exemplary system for cMARL according to the present disclosure; Figure 1B schematically illustrates an example ANN-based reinforcement learning agent; Figure 2 schematically illustrates an exemplary system for implementing QMIX according to the present disclosure; Figure 3 schematically illustrates a cMARL agent according to embodiments of the present disclosure; Figure 4 schematically illustrates a flowchart of an exemplary method for training a plurality of agents according to embodiments of the present disclosure; Figure 5 schematically illustrates a flowchart of an exemplary method for performing a joint task according to embodiments of the present disclosure; Figures 6A and 6B schematically illustrate exemplary training entities according to embodiments of the present disclosure; Figures 7A and 7B schematically illustrate exemplary task performing entities according to embodiments of the present disclosure; Figure 8 schematically illustrates exemplary computer program products, computer programs and computer-readable storage media according to embodiments of the present disclosure; Figure 9 schematically illustrates a plot of average reward as a function of time during an exemplary training of agents according to embodiments of the present disclosure;
Figure 10 schematically illustrates layers of a telecommunications network solution according to embodiments of the present disclosure, and Figure 11 schematically illustrates another telecommunications network solution according to embodiments of the present disclosure. [0035] In the drawings, like reference numerals will be used for like elements unless stated otherwise. Unless explicitly stated to the contrary, the drawings show only such elements that are necessary to illustrate the example embodiments, while other elements, in the interest of clarity, may be omitted or merely suggested. Detailed description [0036] Figure 1A schematically illustrates an exemplary (computer) system 100 for (cooperative) multi-agent reinforcement learning, as envisaged herein. As an example, it may be assumed that the possible states of an environment 120 are given by a set $S = \{s_1, s_2, \dots\}$. The environment 120 may also be referred to as a system, but it is noted that such a system is not the same as the system 100. [0037] The system 100 includes a plurality of agents 110-1 to 110-N, where $N$ is an integer indicating the total number of agents. The agents 110-1 to 110-N may be considered to form a set $\mathbb{N} = \{1, 2, \dots, N\}$ of agents. [0038] The agents 110-1 to 110-N are assumed to interact with the environment 120 at discrete time steps/instances. At each time instance $t$, the environment 120 is assumed to be in a particular state $s[t] \in S$, and each $i$:th agent is capable of making its own individual observation $o_i[t]$ about the state $s[t]$ of the environment. Most likely, each agent 110-1 to 110-N is only capable of making partial observations about the true state of the environment 120, and the observations made by the agents 110-1 to 110-N are often noisy. [0039] At each time instance $t$, the environment 120 may be assumed to give rise to a joint observation $o[t] = \langle o_1[t], o_2[t], \dots, o_N[t] \rangle$, wherein each $i$:th agent 110-1 to 110-N is only capable of observing its own respective part $o_i$. Here, a set of joint observations may be defined as $O = \times_{i \in \mathbb{N}} O_i$, where $O_i$ is a set of observations available to the $i$:th agent (that is, each agent 110-1 to 110-N is capable of observing a part $o_i \in O_i$ of the full, true state of the environment 120).
[0040] For each $i$:th agent, there is a set $A_i = \{a_i^1, a_i^2, \dots\}$ of available actions, where $a_i^j$ indicates the $j$:th action which the $i$:th agent may take/propose. To deduce which action to propose at the time instance $t$, the $i$:th agent uses its current (partial) observation $o_i[t]$ and a reward $r_i[t]$ that is received from the environment 120 as a result of the previous transition of the environment 120, i.e. as the environment 120 moved from the state $s[t-1]$ to $s[t]$. The rewards $r_i[t]$ may, in a (fully) cooperative setting, be equal for all agents 110-1 to 110-N, in which case it may be said that a single, joint reward $r[t]$ is provided to all agents 110-1 to 110-N. In other settings, the rewards may of course be different for different agents. [0041] Based on its (partial) observation $o_i[t]$ about the state of the environment 120, as well as on the received reward $r_i[t]$, each $i$:th agent thus decides upon an action $a_i[t] \in A_i$. The various actions proposed by the agents 110-1 to 110-N lead to a joint action $a[t] = \langle a_1[t], a_2[t], \dots, a_N[t] \rangle \in A$, where $A = \times_{i \in \mathbb{N}} A_i$ is a set of joint actions of all the agents 110-1 to 110-N. As a result of the joint action $a[t]$, the environment 120 transitions from the current state $s[t]$ to a new state $s[t+1]$, and corresponding rewards $r[t+1] = \{r_1[t+1], r_2[t+1], \dots\}$ for transitioning from $s[t]$ to $s[t+1]$ are provided to the agents 110-1 to 110-N. The process then continues by each $i$:th agent making a (partial) observation $o_i[t+1]$, receiving its reward $r_i[t+1]$, and based thereon proposing a new action $a_i[t+1]$ leading to a joint action $a[t+1]$, and so on, as time progresses. [0042] How the joint action $a[t]$ influences the environment 120 may be provided by a transition probability function $P$, which specifies the probability of the environment 120 transitioning to state $s[t+1]$ as a result of taking action $a[t]$ when in the state $s[t]$, i.e. $P(s', a, s) = \Pr(s[t+1] = s' \mid s[t] = s, a[t] = a)$. [0043] An observation probability function $\Omega(o, a, s')$ may be defined as the probability of making the joint observation $o$ as a result of the environment 120 having transitioned to the next state $s[t+1]$ in response to the joint action $a[t]$, i.e. $\Omega(o, a, s') = \Pr(o[t] = o \mid a[t] = a, s[t+1] = s')$. [0044] The immediate reward awarded for each joint action $a$ may be defined by a function $R: S \times A \to \mathbb{R}$, i.e. a mapping from state-action space to real numbers.
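The interaction described in paragraphs [0036]-[0044] may be summarized by the following minimal sketch. It is an illustration only; the environment interface (the class and method names such as `reset` and `step`) is a hypothetical assumption and not part of the present disclosure.

```python
# Minimal sketch of the multi-agent interaction loop of Figure 1A (hypothetical API).

def run_episode(env, agents, horizon):
    """env: emits per-agent partial observations and rewards;
    agents: list of N agents, each proposing an action from its own action set A_i."""
    observations = env.reset()                 # o_i[0] for each agent i
    rewards = [0.0] * len(agents)              # r_i[0]
    for t in range(horizon):
        # Each agent i only sees its own partial, possibly noisy, observation o_i[t].
        joint_action = [agent.act(obs, rew)
                        for agent, obs, rew in zip(agents, observations, rewards)]
        # The environment transitions s[t] -> s[t+1] according to P(s', a, s) and
        # emits a new joint observation and (possibly shared) rewards r_i[t+1].
        observations, rewards, done = env.step(joint_action)
        if done:
            break
```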
[0045] In what follows, it will be assumed that the environment 120 and the interaction therewith may be described as a decentralized partially observable Markov decision process (Dec-POMDP), wherein each agent 110-1 to 110-N is only aware of its own actions and does not observe the actions taken/proposed by the other agents 110-1 to 110-N. Such a Dec-POMDP may be defined as a tuple $\langle \mathbb{N}, S, A, P, O, \Omega, R, \gamma, \dots \rangle$. In addition, one may provide some criterion of optimality which governs how the agents are to plan their work in order to achieve a common goal, i.e. how their immediate rewards are to be combined into a single number. As an example, for a finite time horizon, one may use an undiscounted expected cumulative reward $E\left[\sum_{t=0}^{h} R(s[t], a[t])\right]$, where $R(s[t], a[t])$ is the reward given to the agents 110-1 to 110-N from the environment 120 at time instance $t$ as a result of taking action $a[t]$ when the latter is in state $s[t]$, and where $h$ is an upper bound on the time horizon. In such a case, the goal of the agents 110-1 to 110-N may be to find a policy $\pi_i$ for each agent which maximizes the cumulative reward, which, due to the dependence on the actions of all agents 110-1 to 110-N, amounts to finding a tuple of policies forming a joint policy $\pi$ that maximizes the cumulative reward for all agents 110-1 to 110-N. Another example includes attempting to maximize a discounted expected cumulative reward, i.e. $E\left[\sum_{t=0}^{h} \gamma^t R(s[t], a[t])\right]$, where $\gamma \in [0, 1)$ is a discount factor. As the discount factor is below 1, such a discounted expected cumulative reward may also be used in case of an infinite time horizon, as the sum will then not diverge. [0046] Each $i$:th agent 110-1 to 110-N may be configured to maintain an individual action-observation history $\tau_i[t]$ which indicates a sequence of actions and observations taken/made by the agent, i.e. $\tau_i[t] = \langle a_i[0], o_i[0];\ a_i[1], o_i[1];\ \dots;\ a_i[t], o_i[t] \rangle$. Likewise, one may define a joint action-observation history for all agents 110-1 to 110-N, e.g. as $\tau[t] = \langle \tau_1[t], \tau_2[t], \dots, \tau_N[t] \rangle$. Similarly, one may also define individual and joint observation histories as well as individual and joint action histories, including only sequences of observations and actions, respectively. The policy $\pi_i$ for the $i$:th agent is thus a mapping from histories to actions, and such policies may be stochastic or deterministic in nature. [0047] A Q-value function $Q^\pi(\tau, a)$ may be defined as the expected (discounted) cumulative reward obtained if, after a certain action-observation history $\tau$ and for a given action $a$, the policy $\pi$ is followed when deciding what to do for all subsequent time steps. An optimal Q-value function $Q^*(\tau, a)$ may similarly be defined as the maximum value of $Q^\pi(\tau, a)$ obtainable by following any policy, i.e. $Q^*(\tau, a) = \max_\pi Q^\pi(\tau, a)$. An optimal policy $\pi^*$ may be derived from the optimal Q-value function as $\pi^*(\tau) = \arg\max_{a} Q^*(\tau, a)$.
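As a small concrete illustration of the Q-value function and the greedy policy $\pi^*(\tau) = \arg\max_a Q^*(\tau, a)$ just defined, the following toy sketch stores Q-values in a table and derives a greedy policy from it. The states, actions and values below are invented purely for illustration; as discussed in paragraph [0048], such tables quickly become intractable for realistic state spaces.

```python
import numpy as np

# Toy tabular Q-function: one entry per (state, action) pair (values are arbitrary).
states = ["s1", "s2"]
actions = ["left", "right"]
q_table = np.array([[0.2, 0.8],    # Q(s1, left), Q(s1, right)
                    [0.5, 0.1]])   # Q(s2, left), Q(s2, right)

def greedy_action(state_idx):
    """Greedy policy: pick the action maximizing Q for the given state."""
    return actions[int(np.argmax(q_table[state_idx]))]

print(greedy_action(0))  # -> "right"
print(greedy_action(1))  # -> "left"

# The table grows as |S| * |A|; for a 128x128 8-bit grayscale "state image" alone,
# |S| = 256**(128*128), which is why the Q-function is instead parametrized as
# Q(s, a; theta) by an ANN (deep Q-learning), as explained in the next paragraph.
```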
Each agent may generate and maintain its own Q-value function and policy (based on e.g. the action-observation histories $\tau_i$), and there may also be a joint Q-value function and joint policy for all agents (based e.g. on the joint action-observation history $\tau$). [0048] A Q-value function may e.g. be represented as a table, with one entry for each possible state-action pair, i.e. $(s_1, a^1); (s_1, a^2); \dots; (s_2, a^1); (s_2, a^2); \dots$, etc. (and similarly for the joint case). However, as soon as the state- and/or action-space grows sufficiently large, the number of elements in such a table would quickly become too large to handle. For example, if assuming that the environment 120 is e.g. a small 128x128 pixel grayscale image, the number of possible states in which the environment 120 may be (if assuming 256 possible grayscale values for each pixel) is $256^{128 \times 128}$, a number which exceeds e.g. an estimated number of atoms in the universe. Consequently, even without taking the number of possible actions into account, the requirements to store and process such a table most likely exceed the storage/processing capabilities of any modern computer. To overcome this issue, it is known to replace attempts to manually engineer the Q-value function(s) with the use of artificial neural networks (ANNs). In such a strategy, known as deep Q-learning (DQL) using deep Q-networks (DQNs), the Q-value function is instead parametrized by a limited number of parameters $\theta$ (defined e.g. by the various weights of the ANN), i.e. $Q(s, a) \approx Q(s, a; \theta)$. Other known variants of using deep ANNs to estimate the Q-value functions include e.g. deep recurrent Q-learning (DRQL), independent Q-learning (IQL), value decomposition networks (VDNs), and similar. [0049] Figure 1B schematically illustrates an example agent 110-i as envisaged herein, wherein an ANN 112-i is used to, based on the observation $o_i[t]$ and the received reward $r_i[t]$, deduce what action $a_i[t]$ to propose. [0050] Another alternative for how to use ANNs to approximate Q-value functions, in particular in a multi-agent reinforcement learning (MARL) setting, is provided in Rashid, T. et al., "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning", Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 4295-4304, PMLR 80, 2018. Here, the joint action $a[t]$ is decided upon by ensuring that an attempt to find a joint action $a$ that maximizes a joint Q-value function $Q(\tau, a)$, i.e. $a = \arg\max_{a} Q(\tau, a)$, yields the same result as a set of individual argmax operations $\arg\max_{a_1} Q_1(\tau_1, a_1), \arg\max_{a_2} Q_2(\tau_2, a_2), \dots, \arg\max_{a_N} Q_N(\tau_N, a_N)$. This allows each agent to take part in a decentralized execution by choosing greedy actions with respect to its own Q-value function $Q_i$. In particular, such a condition may be enforced through a constraint on a relationship between $Q$ and each $Q_i$, by requiring that $\partial Q / \partial Q_i \geq 0$ for each $i$:th agent. [0051] As proposed by Rashid, T. et al., this may be enforced by representing the joint Q-value function $Q$ with an architecture as illustrated in Figure 2. [0052] Figure 2A schematically illustrates an exemplary system 200 for implementing the QMIX architecture proposed by Rashid, T. et al. and as envisaged herein. The system 200 includes a plurality of agents 210-1 to 210-N, each having its own agent-specific ANN 212-1 to 212-N, which, for the $i$:th agent and each time instance $t$, takes as input the current (partial) observation $o_i[t]$ of an environment 220 and the last action $a_i[t-1]$, and which approximates and outputs the agent-specific Q-value function $Q_i(\tau_i, a_i[t])$. The outputs $Q_i(\tau_i, a_i[t])$ from all agents 210-1 to 210-N are provided to a mixing network 230 (implemented as one or more ANNs).
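A minimal numerical sketch of the monotonic mixing idea is given below: per-agent Q-values are combined by a mixing network whose weights are generated by hypernetworks from the global state and forced non-negative via an absolute value, so that $\partial Q / \partial Q_i \geq 0$ holds by construction (as elaborated in the following paragraph). The layer sizes, random weights and function names are illustrative assumptions only, not the exact architecture of the disclosure or of QMIX.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, STATE_DIM, EMBED_DIM = 3, 8, 16      # illustrative sizes

# Hypernetworks: single linear layers mapping the global state s[t] to the
# mixing-network weights; an absolute value keeps those weights non-negative,
# which enforces dQ/dQ_i >= 0 (monotonic mixing).
W_hyper1 = rng.normal(size=(STATE_DIM, N_AGENTS * EMBED_DIM))
W_hyper2 = rng.normal(size=(STATE_DIM, EMBED_DIM))
b_hyper1 = rng.normal(size=(STATE_DIM, EMBED_DIM))   # bias generator (may be signed)

def mix(agent_qs, state):
    """Combine per-agent Q-values Q_i into a joint Q(tau, a, s)."""
    w1 = np.abs(state @ W_hyper1).reshape(N_AGENTS, EMBED_DIM)   # non-negative weights
    b1 = state @ b_hyper1
    hidden = np.maximum(agent_qs @ w1 + b1, 0.0)     # elementwise nonlinearity (ReLU here)
    w2 = np.abs(state @ W_hyper2)                    # non-negative weights
    return float(hidden @ w2)

q_tot = mix(agent_qs=rng.normal(size=N_AGENTS), state=rng.normal(size=STATE_DIM))
print(q_tot)
```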
[0053] The mixing network 230 mixes the outputs $Q_i$ from all agents 210-1 to 210-N to produce the values of $Q(\tau, a)$, from which a joint action $a[t]$ is then decided and taken on the environment 220. To enforce that $\partial Q / \partial Q_i \geq 0$, the weights of the mixing network are non-negative. The weights of the mixing network may e.g. be produced by separate so-called hypernetworks, wherein each such hypernetwork may take as input the state of the environment $s[t]$ and generate the weights of a layer of the mixing network. Each hypernetwork may include a single linear layer followed by an absolute activation function, to ensure that the mixing network weights are non-negative. The hypernetworks are not explicitly shown in Figure 2B, but may e.g. be implemented as separate entities or as part of a same entity. The output vector from the hypernetwork may be reshaped to a matrix of appropriate size. The biases for the mixing network 230 may be produced in a similar manner, but need not necessarily be non-negative. The final bias may be generated by a 2-layer hypernetwork with e.g. a ReLU-type nonlinearity. To train the system 200, a loss function may be defined as $\mathcal{L}(\theta) = \sum_{b=1}^{B} \left[ \left( y_b^{tot} - Q(\tau, a, s; \theta) \right)^2 \right]$, where $B$ is the batch size (of samples drawn e.g. from a replay buffer), and where $y^{tot} = r + \gamma \max_{a'} Q(\tau', a', s'; \theta^-)$, with $\theta^-$ being parameters of a target network as in e.g. conventional DQN. The parameters $\theta$ may be updated/learned by minimizing this loss using e.g. a (stochastic) gradient descent method. Various tricks, such as only updating the parameters of the target Q-network every N:th iteration and/or the use of experience replay, may be used to avoid instabilities and slow/no convergence. In addition, it is known to e.g. use multiple networks to e.g. both select actions and evaluate actions, and similar. [0054] One disadvantage with the system 200 as illustrated in Figure 2 is that the required number of hidden layers (and nodes in general) of the ANNs in e.g. each agent 210-1 to 210-N may become large as the complexity of the environment 220 increases, and the execution time required to evaluate the ANNs 212-1 to 212-N for each time step may become too long to e.g. meet multiple incoming requests. As described earlier herein, this may be particularly relevant in a telecommunications network, and in particular in a cognitive layer of such a network, wherein several hundred thousand or even millions of requests may have to be processed by the system 200 in order to e.g. provide analytics, intent-based automation, or similar, as part of the desired cognitive layer functionality. As a result, and as the inventors have realized, the use of conventional deep RL methods in e.g. telecommunications networks may thus be limited, especially in multi-agent scenarios wherein many samples are often required before convergence is obtained. In addition, ANN-based agents such as the agents 210-1 to 210-N may suffer from lack of reliability, and tend to be highly variable in performance and also sensitive to a range of different factors, including implementation details, hyperparameters, choice of environment, and even random seeds. Such variability may hinder reproducible research and may be both costly and even dangerous in real-world applications, as practitioners cannot reliably evaluate or predict the performance of any particular algorithm, compare different algorithms, or even compare different implementations of the same algorithm. [0055] In addition to the above-mentioned disadvantages, the required number of parameters of the ANNs used to approximate e.g. Q-value functions may still be considerably large, especially as a greater number of more complex hidden layers is required to capture the behavior of more complex environments. With a greater number of parameters, the computational cost required to e.g. train the ANNs also increases, as more and more gradients are required to be computed for all parameters. In addition, more episodes may be required to find convergent solutions, and the overall requirements on memory and computing power may still be too high to make such models trainable and/or usable on lower-end hardware. [0056] One strategy to reduce the number of parameters that need to be trained involves the use of tensor decompositions in order to more efficiently represent the various matrices defining the layers of the ANN sufficiently accurately. By so doing, one or more matrices representative of one or more layers of the ANN may instead be represented as one or more tensors, using e.g. a tensor train (TT) decomposition, a Tucker decomposition (TD), a canonical polyadic (CP) decomposition, or similar, wherein a tensor is expressed as a combination of tensors having lower rank. Tensor representations of an ANN may also allow capturing more complex interactions among input features which would not be evident in the flattened data normally associated with using regular matrices to represent the network. To avoid curse-of-dimensionality problems, decompositions of higher-order tensors into multiple lower-order tensors may be used, while still allowing higher-order relationships in the data to be captured. Just like e.g. a matrix may be decomposed into outer products of vectors (c.f. singular value decomposition, SVD), tensor decomposition may involve decomposing higher-dimensional tensors into a sum of products of lower-dimensional factors. Such low-rank approximations may thus be used to extract the most important data while offering a reduction of e.g. the amount of memory needed to store and/or process such tensors. Often, such decompositions include one or more steps wherein at least part of a tensor that is to be decomposed is reshaped (e.g., flattened, unfolded) into a matrix, and wherein SVD is then used to represent at least this matrix using one or more lower-rank vectors. [0057] In conventional SVD, a matrix $A$ of size $m \times n$ (that is, a matrix having $m$ rows and $n$ columns) is decomposed into a product of three matrices $A = U \Sigma V^T$, where the size of $U$ is $m \times r$, the size of $\Sigma$ is $r \times r$, the size of $V$ is $n \times r$, and $r$ is the rank of the matrix $A$. If the singular values of $A$, provided on the diagonal of $\Sigma$, decay sufficiently quickly with increasing order, it is known to approximate $A$ as $\tilde{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^T$, where $r$ is replaced with $\tilde{r} \ll r$, such that $\tilde{\Sigma}$ includes (on its diagonal) only the first $\tilde{r}$ singular values of $A$, and wherein $\tilde{A}$ is known as a low-rank approximation of $A$. [0058] Although it may be shown that $\tilde{A}$ provides a best possible low-rank approximation of $A$ (in the Frobenius norm), computing $\tilde{A}$ may often be computationally costly, and sometimes also numerically unstable. The present disclosure proposes to, instead of e.g. SVD, use a so-called skeleton decomposition wherein $A$ is approximated as $\tilde{A} = C \hat{A}^{-1} R$. Here, $C$ is a matrix formed by selecting $r$ columns of the matrix $A$, $R$ is formed by selecting $r$ rows of the matrix $A$, and $\hat{A}$ is formed from the intersections of such rows and columns of $A$ (i.e. from the elements of $A$ which lie both in one of the selected columns and one of the selected rows). Such a technique may also be referred to as e.g. ”CUR-factorization” of the matrix $A$.
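The difference between a truncated SVD (paragraph [0057]) and a skeleton/CUR-type decomposition (paragraph [0058]) can be illustrated with a few lines of NumPy. The matrix, the chosen rank and the row/column selection below are arbitrary illustrative assumptions; in practice the rows and columns would be chosen adaptively, e.g. by the ACA described next.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 60, 40, 5
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # an (exactly) rank-r matrix

# Truncated SVD: best rank-r approximation in the Frobenius norm.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Skeleton / CUR-type decomposition: A ~ C @ pinv(A_hat) @ R, built from
# r selected columns (C), r selected rows (R) and their intersection block (A_hat).
cols = rng.choice(n, size=r, replace=False)
rows = rng.choice(m, size=r, replace=False)
C = A[:, cols]
R = A[rows, :]
A_hat = A[np.ix_(rows, cols)]
A_cur = C @ np.linalg.pinv(A_hat) @ R

print(np.linalg.norm(A - A_svd) / np.linalg.norm(A))   # ~0 (exact up to round-off)
print(np.linalg.norm(A - A_cur) / np.linalg.norm(A))   # ~0 here, since A is exactly rank r
```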
[0059] In particular, the present disclosure proposes to use a so-called adaptive cross-approximation (ACA) in order to decide upon which rows and columns to include in the skeleton decomposition. This approximation may also be referred to as e.g. an incomplete cross-approximation. [0060] A general idea behind ACA is to approximate a matrix with a rank-1 outer product of one row and one column of the matrix, and then iteratively use this process to construct an approximation of arbitrary precision. Ideally, at each step, the best fitting row and column is selected. After the first iteration, the focus of the ACA shifts from trying to approximate the original matrix to instead approximating a residual matrix formed by a difference between the original matrix and the current approximation of that matrix. More specifically, ACA as envisaged herein approximates a matrix $A$ (having size $m \times n$) as a matrix $\tilde{A} = UV = \sum_{k=1}^{r} \vec{u}_k \vec{v}_k^T$, where $r$ is an effective rank of $\tilde{A}$, where $U$ and $V$ are (dense) matrices of size $m \times r$ and $r \times n$, respectively, and where $\vec{u}_k$ and $\vec{v}_k^T$ are the $k$:th column and row vectors, respectively, of the matrices $U$ and $V$. A goal of ACA may be defined as attempting to achieve $\|R\| = \|A - \tilde{A}\| \leq \varepsilon \|A\|$, where $R$ is the residual matrix, and where $\|\cdot\|$ refers to the Frobenius norm. If the effective rank $r$ is less than both $m$ and $n$, the $m \times n$ entries of the matrix $A$ may thus be stored/approximated using only $(m + n) \times r$ entries. Further details about ACA are found e.g. in M. Bebendorf, “Approximation of boundary element matrices,” Numer. Math., vol. 86, no. 4, pp. 565-589, Jun. 2000. [0061] As envisaged herein, one example implementation of the ACA algorithm may be described as follows. Let $R^{(k)}$ denote the residual matrix after the $k$:th iteration, and let $I_k$ and $J_k$ denote a row and column index, respectively, selected during the $k$:th iteration. Initially, define empty sets $I = \{\}$ and $J = \{\}$ of such selected rows and columns, respectively, and set $R^{(0)} = A$. [0062] For the first iteration, set $k = 1$ and select $I_1$ and $J_1$ as the row and column indices at which the absolute value of the intersection element of $R^{(0)}$ is largest, i.e. $(I_1, J_1) = \arg\max_{i,j} |R^{(0)}_{ij}|$. Add $I_1$ to $I$ and $J_1$ to $J$, and set $\vec{u}_1 = R^{(0)}_{:J_1}$ and $\vec{v}_1^T = R^{(0)}_{I_1:} / R^{(0)}_{I_1 J_1}$. Finally, update the residual matrix as $R^{(1)} = R^{(0)} - \vec{u}_1 \vec{v}_1^T$. This procedure is then iterated, such that at each $k$:th iteration, $(I_k, J_k) = \arg\max_{i,j} |R^{(k-1)}_{ij}|$, $\vec{u}_k = R^{(k-1)}_{:J_k}$, $\vec{v}_k^T = R^{(k-1)}_{I_k:} / R^{(k-1)}_{I_k J_k}$, and $R^{(k)} = R^{(k-1)} - \vec{u}_k \vec{v}_k^T$. Iteration is continued until $\|R^{(k)}\| \leq \varepsilon \|A\|$. This procedure may be referred to as ACA with full pivoting. Herein, $X_{:j}$ and $X_{i:}$ denote the $j$:th column and $i$:th row, respectively, of a matrix $X$. [0063] As envisaged herein, ACA may also be implemented with only so-called partial pivoting, as will now be described. Initially, define the empty sets $I$ and $J$, and select a first row index $i^*$ either randomly or as e.g. $i^* = 1$. [0064] For each $k$:th iteration, select a column index $j^*$ according to $j^* = \arg\max_{j} |R^{(k-1)}_{i^* j}|$, and define $\delta_k = R^{(k-1)}_{i^* j^*}$. If $\delta_k = 0$, then stop iterating if the number of indices in the set $I$ equals $\min(m, n) - 1$. Otherwise, if $\delta_k \neq 0$, define $\vec{v}_k^T = R^{(k-1)}_{i^*:} / \delta_k$ and $\vec{u}_k = R^{(k-1)}_{:j^*}$. Update the sets $I$ and $J$ as $I \leftarrow I \cup \{i^*\}$ and $J \leftarrow J \cup \{j^*\}$, and select a new row index for the next iteration as $i^* = \arg\max_{i \notin I} |u_{ik}|$, where $u_{ik}$ is the $i$:th element of the vector $\vec{u}_k$. Iteration is then continued (i.e. $k = k + 1$) until some stopping criterion is met. In ACA with partial pivoting, the residual matrix $R$ is never formed explicitly, but the elements needed for each iteration are computed from $R^{(k)}_{ij} = A_{ij} - \sum_{l=1}^{k} u_{il} v_{lj}$. Thus, ACA with partial pivoting is computationally less expensive than ACA with full pivoting, and the matrix $A$ may be represented with fewer elements using ACA with partial pivoting as long as the stopping criterion is met before $k$ reaches the rank of the matrix $A$. In ACA with partial pivoting, the stopping criterion $\|\vec{u}_k\| \|\vec{v}_k\| \leq \varepsilon \|A\|$ may e.g. be replaced with the criterion $\|\vec{u}_k\| \|\vec{v}_k\| \leq \varepsilon \|\tilde{A}^{(k)}\|$, where $\|\tilde{A}^{(k)}\|^2 = \|\tilde{A}^{(k-1)}\|^2 + 2 \sum_{l=1}^{k-1} |\vec{u}_k^T \vec{u}_l| \cdot |\vec{v}_l^T \vec{v}_k| + \|\vec{u}_k\|^2 \|\vec{v}_k\|^2$, and where $\tilde{A}^{(k)}$ denotes the approximation of $A$ after the $k$:th iteration. [0065] As envisaged herein, the use of ACA as part of using tensors to represent/approximate agent-specific value functions of ANNs may, compared to e.g. the use of conventional SVD, be more resistant to noise in the observations about the environment made by the various agents taking part in the cMARL, and reduce the number of samples and/or episodes, and thereby also the time, required to obtain convergent solutions.
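The ACA with partial pivoting outlined in paragraphs [0063]-[0064] can be sketched in a few dozen lines of NumPy. This is an illustrative implementation under the assumptions stated above (including the simplified stopping test against the running-approximation norm estimate); it is not the exact implementation of the disclosure.

```python
import numpy as np

def aca_partial_pivoting(A, eps=1e-6, max_rank=None):
    """Approximate A (m x n) as U @ V with U (m x r) and V (r x n), using ACA
    with partial pivoting. Only individual rows/columns of A are ever accessed."""
    m, n = A.shape
    max_rank = max_rank or min(m, n)
    us, vs = [], []
    I = set()
    i_star = 0                               # first row index (could also be random)
    approx_norm_sq = 0.0                     # running estimate of ||A_tilde^(k)||_F^2

    for k in range(max_rank):
        # Residual row i_star, computed on the fly (R is never formed explicitly).
        row = A[i_star, :] - sum(u[i_star] * v for u, v in zip(us, vs))
        j_star = int(np.argmax(np.abs(row)))
        delta = row[j_star]
        if delta == 0.0:
            break                            # simplified handling of the zero-pivot case
        v = row / delta                      # v_k = residual row / delta_k
        u = A[:, j_star] - sum(v_prev[j_star] * u_prev
                               for u_prev, v_prev in zip(us, vs))   # u_k = residual column

        # Incremental update of the norm estimate of the running approximation.
        approx_norm_sq += 2 * sum(abs(u @ u_prev) * abs(v_prev @ v)
                                  for u_prev, v_prev in zip(us, vs))
        approx_norm_sq += (u @ u) * (v @ v)

        us.append(u); vs.append(v); I.add(i_star)

        # Stopping criterion: ||u_k|| ||v_k|| <= eps * ||A_tilde^(k)||_F.
        if np.linalg.norm(u) * np.linalg.norm(v) <= eps * np.sqrt(approx_norm_sq):
            break
        # Next pivot row: largest |u| entry among rows not yet used.
        candidates = [i for i in range(m) if i not in I]
        if not candidates:
            break
        i_star = candidates[int(np.argmax(np.abs(u[candidates])))]

    return np.column_stack(us), np.vstack(vs)

# Example: compress a noisy, nearly low-rank matrix.
rng = np.random.default_rng(2)
A = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 80)) + 1e-3 * rng.normal(size=(100, 80))
U, V = aca_partial_pivoting(A, eps=1e-2)
print(U.shape, V.shape, np.linalg.norm(A - U @ V) / np.linalg.norm(A))
```

Note that the sketch only reads individual rows and columns of the matrix, which is what makes partial pivoting cheaper than full pivoting when the matrix is expensive to form or store in full.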
[0066] An example of an agent as envisaged herein will now be described in more detail with reference also to Figure 3. [0067] Figure 3 schematically illustrates an agent 310-i according to embodiments of the present disclosure, wherein the agent 310-i is configured to take part as one (e.g. an $i$:th) of multiple agents in a cMARL system, for solving (together with the other such agents) a joint task as described earlier herein. The system may e.g. correspond to the system 100 or 200 described with reference to Figures 1A and 2, respectively, but wherein the agents 110-1 to 110-N (or 210-1 to 210-N) are replaced with agents such as the improved agent 310-i. [0068] The agent 310-i receives (or makes) (partial) observations $o_i[t]$ about the state of an environment (such as the environment 120 or 220), as well as a reward $r_i[t]$. The agent 310-i may also be capable of maintaining an action-observation history $\tau_i$, and is further configured to, based on the made observations (e.g. on the action-observation history), output either an action $a_i[t]$ or a Q-value function $Q_i$ as described earlier herein. In contrast to the agent 110-i described with reference to Figure 1B, the agent 310-i does not directly use an ANN 312-i to approximate/parametrize its value function. Instead, the agent 310-i uses a tensor 314-i to represent at least one layer 313-i-k of the ANN 312-i, wherein the tensor 314-i is represented as at least one tensor decomposition 316-i. In particular, as described above, the agent 310-i is configured to generate the tensor decomposition 316-i based on ACA 318-i, e.g. using ACA with full or partial pivoting as also described above. [0069] Although discussed herein in the context of the agents, the proposed ACA-based tensor decomposition may of course also be used to implement at least part of the mixing network 230 (if used), or in any situation in which a tensor decomposition representing at least one layer of an ANN is desirable. [0070] Various methods as envisaged herein will now be described in more detail with reference to Figures 4 and 5. [0071] Figure 4 schematically illustrates a flowchart of an embodiment of a method 400 for training a plurality of agents to perform a joint task according to the present disclosure. Such agents may e.g. be agents 310-i as described with reference to Figure 3. The method 400 may be implemented on one or more computers. In an operation S410, the method 400 includes training, using cMARL, a plurality of agents (such as the agent 310-i) to perform a joint task. The training further includes, for each agent (and e.g. as part of an operation S412 or as part of the operation S410), approximating the agent-specific value function (i.e., $Q_i$) as a tensor decomposition (e.g. the decomposition 316-i), where the tensor decomposition is based on ACA.
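To make the layer representation of paragraph [0068] concrete, the sketch below stores the weight matrix of one fully connected layer as two low-rank factors U and V (e.g. as produced by an ACA such as the sketch above) and evaluates the layer as U @ (V @ x), so that only (m + n) x r parameters are stored instead of m x n. The class name, layer sizes and randomly drawn factors are illustrative assumptions only.

```python
import numpy as np

class ACACompressedLayer:
    """A dense layer y = W x + b whose weight matrix W (m x n) is stored as
    low-rank factors U (m x r) and V (r x n), i.e. W ~ U @ V."""
    def __init__(self, U, V, bias=None):
        self.U, self.V = U, V
        self.bias = bias if bias is not None else np.zeros(U.shape[0])

    def __call__(self, x):
        # Evaluate as U @ (V @ x): O((m + n) r) work instead of O(m n).
        return self.U @ (self.V @ x) + self.bias

# Example: a 256 x 512 layer compressed to rank 8 -> (256 + 512) * 8 parameters
# instead of 256 * 512. Here U and V are simply drawn at random; in the envisaged
# agents they would come from an ACA-based decomposition of the layer's weight tensor.
rng = np.random.default_rng(3)
layer = ACACompressedLayer(rng.normal(size=(256, 8)), rng.normal(size=(8, 512)))
y = layer(rng.normal(size=512))
print(y.shape)  # (256,)
```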
[0072] Figure 5 schematically illustrates a flowchart of an embodiment of a method 500 for performing a joint task. The method 500 may be implemented on one or more computers. In an operation S510, the method 500 includes obtaining a plurality of agents (such as agents 310-i) trained to perform a joint task (e.g. trained using the method 400). For each agent, the agent-specific value function ($Q_i$) is approximated as a tensor decomposition (e.g. 316-i), where the tensor decomposition is based on ACA. In an operation S512, the method 500 further includes performing the joint task using the trained plurality of agents. [0073] Various entities and computer programs/computer program products as envisaged herein will now be described in more detail with reference also to Figures 6A, 6B, 7A, 7B and 8. [0074] Figure 6A schematically illustrates, in terms of a number of functional units, the components of an embodiment of a training entity 600 according to the present disclosure. The training entity 600 is configured for training a plurality of agents (e.g. 310-i) to perform a joint task in accordance with e.g. the method 400 described herein with reference to Figure 4, and includes processing circuitry 610. The processing circuitry 610 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 810a (see Figure 8 and the description thereof), e.g. in the form of a storage medium 630 that may also form part of the training entity 600. The processing circuitry 610 may further be provided as at least one application specific integrated circuit (ASIC), or field-programmable gate array (FPGA). [0075] Particularly, the processing circuitry 610 is configured to cause the training entity 600 to perform a set of operations, or steps, as disclosed above e.g. when describing the method 400 illustrated in Figure 4. For example, the storage medium 630 may store a set of operations, and the processing circuitry 610 may be configured to retrieve the set of operations from the storage medium 630 to cause the training entity 600 to perform the set of operations. The set of operations may be provided as a set of executable instructions. Thus, the processing circuitry 610 is thereby arranged to execute methods associated with training of a plurality of agents (in a cMARL setting) as disclosed herein e.g. with reference to Figures 2, 3 and 4.
[0076] The storage medium 630 may include persistent storage, which, for example, can be any single or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. [0077] The training entity 600 may further include a communications interface 620 for communications with other entities, functions, nodes, and devices, such as e.g. those of a telecommunications network or any other network associated with the task to be solved. For example, the communications interface 620 may allow the training entity 600 to communicate with e.g. other entities, with one or more Network Functions (NFs) or nodes of a telecommunications network, or e.g. with one or more internal operational modules within a same node, etc. For example, the communications interface 620 may be configured to receive one or more signals indicative of at least a partial state of an environment with which the training entity 600 interacts during training, or e.g. with one or more simulated such environments. As such, the communication interface 620 may include one or more transmitters and receivers, including analogue and/or digital components. For example, the communications interface 620 may be used to obtain data about the state of the environment in which the task is to be solved, whether the environment is a real environment or a simulated/emulated environment, or similar. The communications interface 620 may for example receive data from one or more sensors configured to detect/measure the state of the environment. The communications interface 620 may also be capable of communicating one or more actions that are to be taken in the environment, e.g. to various actuators or controllers which may influence the environment. [0078] The processing circuitry 610 controls the general operation of the training entity 600 e.g. by sending data and control signals to the communications interface 620 and the storage medium 630, by receiving data and reports from the communications interface 620, and by retrieving data and instructions from the storage medium 630. Other components, as well as their related functionality, of the training entity 600 are omitted in order not to obscure the concepts presented herein. [0079] Figure 6B schematically illustrates, in terms of a number of functional modules 610a, 610b and 610c, the components of a training entity 600 according to one embodiment of the present disclosure. The training entity 600 includes at least a first module 610a configured to perform operation S410 of the method 400 described
with reference to Figure 4. The training entity 600 also includes a second module 610b configured to perform operation S412. In other embodiments, both modules 610a and 610b may instead be provided as part of a single module, e.g. as part of a (cMARL) training module with which the method 400 may be implemented. The training entity 600 may also include one or more optional functional modules (illustrated by the dashed box 610c), such as for example an environment state module configured to obtain various parameters indicative of the state of the environment, and/or e.g. a task solving module configured to solve the joint task using the plurality of agents trained/generated by the modules 610a and 610b (in which case the training entity 600 is also capable of performing the operations S510 and S512 of the method 500 described with reference to Figure 5). The training entity 600 may in some embodiments include one or more functional modules for e.g. implementing a mixing network (such as e.g. 230), or similar. In some embodiments, the training entity 600 may be operable between a training stage and a deployment stage. In the training stage, the modules 610a and 610b may work together to generate a trained plurality of agents for solving the joint task. In the deployment stage, the training entity 600 may be configured to switch to using the plurality of trained agents to solve the joint task, including e.g. deactivating the modules 610a and 610b. The training entity 600 may also be operable in a retraining stage, in which the modules 610a and 610b are once again activated in order to e.g. further train (or retrain) the plurality of agents, e.g. in response to the environment changing in a way which makes the currently trained plurality of agents less successful in solving the joint task than before, and similar. [0080] In general terms, each functional module 610a-c may be implemented in hardware or in software. Preferably, one or more or all functional modules 610a-c may be implemented by the processing circuitry 610, possibly in cooperation with the communications interface 620 and/or the storage medium 630. The processing circuitry 610 may thus be arranged to fetch, from the storage medium 630, instructions as provided by a functional module 610a-c, and to execute these instructions and thereby perform any operations of the method 400 performed by/in the training entity 600 as disclosed herein. The training entity 600 may also be referred to as a cMARL model training entity, or similar.
[0081] Figure 7A schematically illustrates, in terms of a number of functional units, the components of an embodiment of a task performing entity 700 according to the present disclosure. The task performing entity 700 is configured to perform a joint task using a plurality of agents (e.g. 310-i). The task performing entity 700 includes processing circuitry 710. The processing circuitry 710 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), etc., capable of executing software instructions stored in a computer program product 810b (see Figure 8 and the description thereof), e.g. in the form of a storage medium 730 that may also form part of the task performing entity 700. The processing circuitry 710 may further be provided as at least one application specific integrated circuit (ASIC), or field-programmable gate array (FPGA). [0082] Particularly, the processing circuitry 710 is configured to cause the task performing entity 700 to perform a set of operations, or steps, needed to solve a joint task as envisaged herein, e.g. by performing the method 500 as described with reference to Figure 5. For example, the storage medium 730 may store a set of operations, and the processing circuitry 710 may be configured to retrieve the set of operations from the storage medium 730 to cause the task performing entity 700 to perform the set of operations. The set of operations may be provided as a set of executable instructions. Thus, the processing circuitry 710 is thereby arranged to execute methods associated with performing a joint task as disclosed herein, e.g. with reference to any one of Figures 2, 3, 4 and 5. [0083] The storage medium 730 may also include persistent storage, which, for example, can be any single or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. [0084] The task performing entity 700 may further include a communications interface 720 for communications with other entities, functions, nodes, and devices, such as e.g. those of a telecommunications network or any other network associated with the joint task to be solved. For example, the communications interface 720 may allow the task performing entity 700 to communicate with e.g. a training entity (such as the training entity 600) to receive trained agents, with one or more Network Functions (NFs) or nodes of a telecommunications network, or e.g. with one or more operational units/modules within a same NF or node. As such, the communication
interface 720 may include one or more transmitters and receivers, including analogue and/or digital components. For example, the communications interface 720 may be used to obtain data including details about the trained agents which are to be used to solve the joint task, data about the state of an environment in which the joint task is to be performed/solved and from which e.g. the state of the environment may be estimated/obtained and provided to the trained agents, or similar. The communications interface 720 may for example receive data from one or more sensors configured to detect/measure the state of the environment. The communications interface 720 may also be capable of communicating one or more actions that are to be taken in the environment, e.g. to various actuators or controllers which may influence the environment. The communications interface 720 may for example receive data indicative of trained agents, and may receive one or more subsequent updates of the agents from e.g. the training entity 600. [0085] The processing circuitry 710 controls the general operation of the task performing entity 700 e.g. by sending data and control signals to the communications interface 720 and the storage medium 730, by receiving data and reports from the communications interface 720, and by retrieving data and instructions from the storage medium 730. Other components, as well as their related functionality, of the task performing entity 700 are omitted in order not to obscure the concepts presented herein. [0086] Figure 7B schematically illustrates, in terms of a number of functional modules 710a-710c, the components of a task performing entity 700 according to one embodiment of the present disclosure. The task performing entity 700 includes at least an agent receiving module 710a configured to obtain (e.g. by receiving from an external entity, or from internal storage) a plurality of trained agents (e.g. 310-i) trained in accordance with e.g. method 400. The task performing entity 700 further includes a task performing module 710c configured to perform the joint task using the trained agents. The task performing entity 700 may also include one or more optional functional modules (illustrated by the dashed box 710b), such as for example an environment state module configured to obtain various parameters indicative of the state of the environment, and/or e.g. an action module configured to communicate one or more actions which are to be taken in the environment based on the output from the trained agents. The task performing entity 700 may further, in
some embodiments, include a mixing module configured to perform the function of a mixing network (e.g. 230), in order to mix the value functions $Q_i$ of the various agents to obtain a joint action as described earlier herein, and similar. [0087] In general terms, each functional module 710a-710c may be implemented in hardware or in software. Preferably, one or more of the functional modules 710a-710c may be implemented by the processing circuitry 710, possibly in cooperation with the communications interface 720 and/or the storage medium 730. The processing circuitry 710 may thus be arranged to fetch, from the storage medium 730, instructions as provided by a functional module 710a, 710b and/or 710c (if included), and to execute these instructions and thereby perform the joint task in accordance with e.g. method 500 as disclosed herein. [0088] Figure 8 schematically illustrates a computer program product 810a, 810b including computer readable means 830. On the computer readable means 830, a computer program 820a can be stored, which computer program 820a can cause the processing circuitry 610 and thereto operatively coupled entities and devices, such as the communication interface 620 and the storage medium 630, of the training entity 600 to execute method 400 according to embodiments described herein with reference to Figure 4. The computer program 820a and/or computer program product 810a may thus provide means for performing any operations of the method 400 performed by the training entity 600 as disclosed herein. [0089] On the computer readable means 830, a computer program 820b can also be stored, either in addition to or instead of the computer program 820a, which computer program 820b can cause the processing circuitry 710 and thereto operatively coupled entities and devices, such as the communication interface 720 and the storage medium 730, of the task performing entity 700 to retrieve the trained agents and perform the joint task in accordance with e.g. method 500 according to embodiments described herein with reference to Figure 5. The computer program 820b and/or computer program product 810b may thus provide means for performing any operations required to perform the joint task (using cMARL with ACA-based tensor decompositions for approximating agent-specific value functions). [0090] Figure 8 also serves to illustrate an embodiment of the computer readable means 830 in which it stores a computer program product 810c including data 820c representative of the plurality of trained agents (i.e. agents such as 310-i trained in
accordance with method 400). It is noted that the computer readable means 830 may include only such data, or such data in combination with one or more of the computer programs 820a, 820b described above, or in combination with any other data. The data 820c may for example include information about the various weights and/or biases of at least one layer of a trained ANN, and/or about various values associated with a tensor decomposition of such (at least one layer of) a trained ANN, and in particular associated with an ACA-based such tensor decomposition. [0091] In the example of Figure 8, the computer program product 810a, 810b, 810c and computer readable means 830 are illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc. The computer program product 810a, 810b, 810c and computer readable means 830 could also be embodied as a memory, such as a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory. Thus, while the computer program 820a, 820b and data 820c are here schematically shown as a track on the depicted optical disk, the computer program 820a, 820b and/or data 820c may be stored in any way which is suitable for the computer program product 810a, 810b, 810c and computer readable means 830. [0092] Results of a performed verification test of the concept envisaged herein will now be described with reference also to Figure 9. [0093] Figure 9 schematically illustrates a plot 900 of how an average reward (indicated on the y-axis) changes with time (indicated on the x-axis) during a performed example training of agents (e.g. 310-i) using the ACA-based decomposition as envisaged herein. In this particular example, two agents (a “speaker” and a “listener”) were deployed in a speaker-listener scenario (also referred to as e.g. a “cooperative communication” scenario). The environment included three landmarks each having a different color (e.g. red, green and blue). At each episode, the listener agent must navigate to a landmark of a particular color, and obtains a reward based on its distance to the correct landmark. The listener agent can only observe the relative positions and colors of the landmarks, but is unaware of which of the three landmarks is the correct one (i.e. the one to which the listener agent
should navigate). Conversely, the speaker agent may observe which one of the landmarks is the correct landmark, and may issue a communication which is observed by the listener agent. Based on the movement of the listener agent, the speaker agent must thus learn to output the correct landmark color, and the listener agent must learn how to navigate based on the observations it makes, including the color output by the speaker agent. Training was performed using both conventional deep neural networks (DNNs, or deep ANNs) and tensor networks. In the plot 900, the dashed curve 910 is the average reward as a function of time for a DNN, while the solid curve 920 is the average reward as a function of time for a value function estimated using a tensor decomposition instead of a DNN. As can be seen in Figure 9, the solution based on the tensor decomposition settles at a higher reward than the DNN-based counterpart. Further information obtained from this experiment is provided in Tables 1 and 2, which also show the differences in performance between tensor decompositions based on SVD and the tensor decompositions based on ACA as envisaged herein.
[Table 1 – Convergence time and average reward for each network type; the individual values are not legible in the source text.]
[Table 2 – Number of parameters, time per iteration, number of episodes to converge, and number of samples per episode for each network type. Legible row: ACA-based tensor network – 12461 parameters, 3.6 milliseconds per iteration, 981 episodes, 15 samples per episode.]
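For illustration, the reward structure of the speaker-listener scenario in paragraph [0093] can be sketched in a few lines; this is a simplified, single-step sketch with hypothetical positions, not the full multi-step environment used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three landmarks with distinct colours, at hypothetical 2-D positions.
landmark_pos = rng.uniform(-1.0, 1.0, size=(3, 2))
goal = int(rng.integers(0, 3))            # index of the correct landmark

# Speaker: observes only the goal colour and emits a message
# (a trained policy would map this observation to a learned message).
speaker_obs = np.eye(3)[goal]
message = speaker_obs

# Listener: observes the landmarks relative to its own position plus the
# message, but not which landmark is the goal.
listener_pos = np.zeros(2)
listener_obs = np.concatenate([(landmark_pos - listener_pos).ravel(), message])

# Reward: based on the (negative) distance to the correct landmark.
reward = -float(np.linalg.norm(listener_pos - landmark_pos[goal]))
print(listener_obs.shape, round(reward, 3))
```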
[0094] From Tables 1 and 2, it can be seen that, when compared to the DNN, the tensor decomposition-based networks require only approximately 10 % of the parameters. Further, the networks using tensor decompositions both required fewer training episodes to converge. More specifically, the ACA-based network managed to converge in only approximately 1000 episodes, while the SVD-based counterpart required more than ten times as many episodes (approximately 10000 episodes). It may thus be concluded that in this validation test, the use of the ACA-based tensor decomposition as envisaged herein required fewer episodes to converge and a smaller number of parameters to approximate the agent-specific value functions. It is further noted that the tensor decomposition-based networks required a longer time per iteration, which may be attributed to a larger number of required mathematical operations (such as multiplications, reshape operations, and similar) used in the decompositions. However, it is also noted that there are application-specific so-called tensor processing units (TPUs) available which are capable of speeding up such processes by e.g. making use of available parallelism and similar. If implemented using such TPUs, it is expected that the time per iteration for the tensor decomposition-based networks will go below that of the DNN-based solutions. [0095] A second validation test was performed, to investigate how accurately the ACA-based solution as envisaged herein is capable of representing e.g. a matrix, especially in comparison to the SVD-based solutions. In the second validation test, a matrix $A$ was approximated using both SVD ($\tilde{A} = U_k \Sigma_k V_k^T$) and the envisaged ACA-based solution ($\tilde{A} = UV^T$), and the error resulting from applying these low-rank approximations to a random vector $x$ was studied. The matrix-vector relative errors (i.e. $U_k \Sigma_k V_k^T x - Ax$ and $UV^T x - Ax$, respectively) were investigated using the Manhattan distance/taxicab norm ($\ell_1$-norm), the Euclidean norm ($\ell_2$-norm), and the $\ell_\infty$-norm
(i.e. the maximum absolute row sum). Also, the Frobenius norm of the approximated matrix was compared with the Frobenius norm of the original matrix for the respective approximations (i.e. $A - U_k \Sigma_k V_k^T$ and $A - UV^T$, respectively), with the Frobenius norm for a matrix $M$ being defined as $\|M\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |m_{i,j}|^2}$, where $m_{i,j}$ is the matrix element found in the $i$:th row and $j$:th column of $M$, and $m$ and $n$ are the total number of rows and columns, respectively, of $M$. An SVD recompression of the ACA-approximated matrix was also studied as part of the second validation test. Here, the ACA approximation $\tilde{A} = UV^T$ was first rewritten as $UV^T = Q_U R_U R_V^T Q_V^T$, where $Q_U$ and $R_U$, and $Q_V$ and $R_V$, are found by a QR-decomposition of $U$ and $V$, respectively. The square, low-dimensional subspace matrix $R_U R_V^T$ was decomposed using SVD as $R_U R_V^T = \hat{U} \hat{\Sigma} \hat{V}^T$, and the singular values of $\hat{\Sigma}$ were filtered to further reduce the rank of the ACA decomposition as $U' = Q_U \hat{U} \hat{\Sigma}$, $V' = Q_V \hat{V}$, and $A \approx U' V'^T$. Tests were performed for a fixed number of 50 iterations for different ranks (given in percent of the rank of the original matrix), and the results are shown in Tables 3 and 4.
[Table 3 – Matrix-vector errors in the $\ell_1$-, $\ell_2$- and $\ell_\infty$-norms, and the Frobenius norm, for different approximation types and ranks. Legible row: ACA, rank 50 % – $\ell_1$ 0.6594, $\ell_2$ 1.1360, $\ell_\infty$ 1.9964, Frobenius 1095.9406.]
[Table 4 – Corresponding results; only the column headers (Type, Rank, $\ell_1$, $\ell_2$, $\ell_\infty$, Frobenius) are legible in the source text.]
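The SVD recompression described in paragraph [0095] can be sketched in NumPy as follows; the synthetic test matrix, the deliberately over-ranked factors and the truncation tolerance are illustrative assumptions only.

```python
import numpy as np

def svd_recompress(U, V, tol=1e-8):
    """Recompress A ~= U @ V.T (e.g. ACA factors) via QR of the factors,
    SVD of the small core, and filtering of the singular values."""
    Qu, Ru = np.linalg.qr(U)                    # U = Qu @ Ru
    Qv, Rv = np.linalg.qr(V)                    # V = Qv @ Rv
    Uc, s, Vct = np.linalg.svd(Ru @ Rv.T)       # small k-by-k core
    keep = s > tol * s[0]                       # drop negligible singular values
    U_new = Qu @ (Uc[:, keep] * s[keep])        # absorb singular values into U'
    V_new = Qv @ Vct[keep, :].T
    return U_new, V_new                         # A ~= U_new @ V_new.T

# Illustration: an exactly rank-5 matrix represented by redundant rank-12 factors.
rng = np.random.default_rng(1)
B, C = rng.standard_normal((120, 5)), rng.standard_normal((80, 5))
A = B @ C.T
T = rng.standard_normal((5, 12))
U = B @ T                                       # over-ranked factor
V = C @ np.linalg.pinv(T).T                     # chosen so that U @ V.T equals A
U2, V2 = svd_recompress(U, V)
x = rng.standard_normal(80)
print(U2.shape[1])                              # recompressed rank (5)
print(np.linalg.norm(U2 @ V2.T @ x - A @ x))    # matrix-vector error (tiny)
print(np.linalg.norm(A - U2 @ V2.T))            # Frobenius-norm error (tiny)
```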
[0096] From Tables 3 and 4, it can be deduced that the envisaged ACA-based decomposition provides acceptably small errors, with consistent results within a small range of ranks, and a level of accuracy similar to that of the SVD-based solution. [0097] With reference also to Figures 10 and 11, various envisaged implementations and embodiments of the above concepts in a telecommunications network environment will now be described in more detail. [0098] Figure 10 schematically illustrates various layers of a telecommunications network solution 1000. The solution 1000 includes a network layer 1020, which may also be considered as the environment. The network layer 1020 includes e.g. the radio access network (RAN), core, business support system (BSS), customer experience management (CEM), Internet of things (IoT), and similar. The solution 1000 also includes a business operations layer 1030 wherein things like various goals, service level agreements (SLAs), etc. are decided upon and conveyed in the form of one or more intents 1032. In between the network layer/environment 1020 and the business operations layer 1030 is a cognitive layer 1010 which includes the capability to e.g. receive the intents 1032 and to control the network layer 1020 based thereon, e.g. by making observations about the state of the network layer 1020 and, based thereon, deciding upon one or more actions that should be taken. Phrased
differently, it is envisaged that the layer 1010 may serve as an interface between the business operations layer 1030 and the actual network (layer) 1020. In addition to being able to (autonomously) control the network layer 1020 to meet the goals established by the intents 1032, the cognitive layer 1010 may also be capable of analyzing the performance of the network layer 1020 and of e.g. compiling one or more reports that are then provided to the business operations layer 1030, or similar. [0099] It should be noted that the solution 1000 is only an exemplary solution, and that a similar cognitive layer may be provided in many different ways and for many different purposes. In any case, the cognitive layer 1010 preferably has the capability to derive optimal policies for how to control the network layer 1020, to gather data (stored in e.g. one or more storage units 1012), and to learn from past behaviors in order to improve its own capabilities. As part of the cognitive layer 1010, it is envisaged to provide machine learning agents, and in particular a plurality of agents 310-i collaboratively trained (using cMARL) as described earlier herein, e.g. in accordance with the method 400. For this purpose, the cognitive layer 1010 may e.g. include one or more training entities 600 as described with reference to Figures 6A and 6B, and/or one or more task performing entities 700 as described with reference to Figures 7A and 7B. [00100] A more specific use case as envisaged herein will now be described in more detail with reference also to Figure 11. [00101] Figure 11 schematically illustrates an embodiment of a solution 1100 similar to the solution 1000, wherein an environment 1120 (e.g., network layer) includes a telecommunications network. The environment 1120 may e.g. be a real environment, or an environment created by simulating/emulating a telecommunications network. The telecommunications network includes e.g. a plurality of user equipment (UE) 1122 which connect to radio base stations (such as gNBs) 1124a and 1124b, which in turn communicate with User Plane Functions (UPFs) 1126a and 1126b, respectively, and cooperate to offer three services 1128a, 1128b and 1128c. The services 1128a-1128c may e.g. be assumed to form part of a same network slice, or similar. The first service 1128a is, in this particular example, for Conversational Video (CV), the second service 1128b is for Ultra-Reliable Low Latency Communications (URLLC), and the third service 1128c is for Massive IoT (mIoT). For each service, there are one or more corresponding targets on specific key
performance indicators (KPIs), e.g. Quality of Experience (QoE) for the CV service 1128a, Packet Error Rate (PER) for the URLLC service 1128b, and PER also for the mIoT service 1128c. [00102] As part of an intent-handler 1110, multiple agents 310-1 to 310-3 perform closed-loop control in order to meet their individual KPI targets. However, as the resources of the environment 1120 are shared between all agents 310-1 to 310-3, the agents 310-1 to 310-3 need to work in tandem (i.e., cooperate) to find an optimal strategy for how to jointly control the environment 1120. For example, it may be assumed that there are two variables of the environment/network 1120 that are manipulable: a (service) priority, and a maximum bit rate (MBR). Any change in these two variables may affect a data throughput for the UEs 1122. As the air-link bandwidth is limited, any change in UE throughput may affect the other UEs. Hence, it is a joint task of the agents 310-1 to 310-3 to select values for the priority and MBR in such a way that the UEs share the available bandwidth in an optimal way. The various KPIs may e.g. be provided to the intent manager/handler 1110 as part of intents 1132, e.g. from a business operations layer such as 1030. [00103] As envisaged herein, the agents 310-1 to 310-3 are trained to perform such a joint task in accordance with e.g. the method 400 described earlier herein with reference to Figure 4. In this particular embodiment, it may be envisaged that each agent 310-1 to 310-3 controls the parameters for its respective service in order to achieve its respective target. Thus, an action space may be defined as all possible combinations of values for the (service) priority and MBR. An observation space may be defined as the observable KPI values of the respective service, as well as a difference between the observed KPI value and the desired KPI value for each service. A reward for each agent 310-1 to 310-3 may be defined based on e.g. the (absolute) difference between the observed KPI for the service/agent 310-1 to 310-3 and the corresponding desired KPI target. A global reward may be calculated as the sum of such differences for all services. In an alternative embodiment, the action space may instead be defined by various actions which either increase, maintain, or decrease a current value of the respective parameter, such as e.g. {"increase priority", "maintain current priority", "decrease priority"} or {"increase MBR", "maintain current MBR", "decrease MBR"}, or e.g. a set including all possible combinations of elements from such two sets, e.g. {"increase priority, increase MBR",
"increase priority, maintain MBR", "increase priority, decrease MBR", "maintain priority, increase MBR", "maintain priority, maintain MBR", "maintain priority, decrease MBR", "decrease priority, increase MBR", "decrease priority, maintain MBR", "decrease priority, decrease MBR"}, or similar (an illustrative sketch of these action, observation and reward definitions is given after Table 3 below). [00104] As explained earlier herein, it is assumed that for each time step, the agents 310-1 to 310-3 make observations about the current state of the environment (in terms of at least the current values of the KPIs), and based thereon arrive at a joint action (in terms of priority and MBR). Each agent 310-1 to 310-3 may in some embodiments only make partial observations of the full state of the environment, e.g. an observation including information only about the KPI relevant for the individual agent 310-1 to 310-3, and similar. [00105] To validate the above-proposed system, a simulated environment 1120 was used, in which it was possible to run the three services with traffic routed through the radio base stations 1124a and 1124b and the UPFs 1126a and 1126b. During training, goal-oriented training as defined above was used, and the training performance is shown in Table 3. To compare the proposed ACA-based solution against tensor decompositions based on SVD and also regular DNNs, simulations/trainings were performed also for such setups, and the results thereof are also shown in Table 4.
[Table 3 – Convergence time, average reward, number of parameters, number of episodes and samples per episode for the DNN-based, SVD-based and ACA-based networks; the individual values are not legible in the source text.]
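The action, observation, and reward definitions of paragraph [00103] can be summarised in the following minimal sketch; the discrete action levels, the example KPI numbers, and the sign convention of the reward (larger is better) are assumptions made purely for illustration.

```python
import itertools

# Per-agent action space: all combinations of the priority and MBR adjustments.
priority_actions = ["increase priority", "maintain current priority", "decrease priority"]
mbr_actions = ["increase MBR", "maintain current MBR", "decrease MBR"]
action_space = list(itertools.product(priority_actions, mbr_actions))   # 9 joint choices

# Per-service observation: the observed KPI value and its distance to the target.
def observation(observed_kpi: float, target_kpi: float) -> tuple:
    return (observed_kpi, observed_kpi - target_kpi)

# Per-agent reward, based on the absolute deviation from the KPI target
# (negated so that a smaller deviation gives a larger reward).
def agent_reward(observed_kpi: float, target_kpi: float) -> float:
    return -abs(observed_kpi - target_kpi)

# Global reward: the sum over all services (here CV, URLLC and mIoT).
def global_reward(observed: list, targets: list) -> float:
    return sum(agent_reward(o, t) for o, t in zip(observed, targets))

print(len(action_space))                                  # 9
print(observation(observed_kpi=0.92, target_kpi=0.95))    # e.g. (0.92, -0.03)
print(global_reward([0.92, 1e-3, 5e-4], [0.95, 1e-3, 1e-3]))
```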
[00106] As can be seen in Table 3, the tensor decomposition-based solutions both achieved higher average reward with similar convergence time, compared to the DNN-based counterpart. Both tensor decomposition-based solutions also require a substantially reduced number of parameters in order to approximate the agent-specific value functions compared with the DNN-based solution, wherein the number of parameters required for the latter is approximately 15-16 times higher. In particular, it can also be seen that the ACA-based solution as envisaged herein converges faster, requires a smaller number of parameters, and also requires fewer samples in total (as both the number of episodes required for convergence as well as the number of samples per episode are smaller) compared to the SVD-based solution. Consequently, it may be concluded that in the simulated situation, the ACA-based solution improves upon conventional SVD-based tensor decompositions, and allows for a reduction in required memory, processing power, and similar. It is also noted that although the reduction of the total number of samples for the tensor decomposition-based solutions vis-à-vis the DNN-based solution is not as significant as in the more trivial speaker-listener experiment discussed with reference to Tables 1 and 2, there is still some reduction also in the more complex example of interacting closed loops as discussed above with reference to Figure 11. [00107] An additional validation test was also performed, but for different numbers of agents in the intent manager 1110. In this test, simulations were performed for the ACA-based solution as envisaged herein only, but for: i) only one active service (CV, with random noise in the observations made by the agent 310-2), ii) two active services (CV and URLLC), and iii) all services active (CV, URLLC, mIoT). The last scenario, iii), is of course the same as that discussed with reference to Table 3, but is included also here for completeness. The results of this simulation are shown in Table 4.
[Table 4 – Convergence time, average reward, number of parameters, number of episodes and samples per episode for scenarios i)–iii). Legible row: scenario i), CV with random noise – convergence in approximately 1 day, average reward 0.981, 8145 parameters, 99 episodes, 11 samples per episode.]
[00108] From the results presented in Table 4, it can be seen that the proposed ACA-based solution as envisaged herein yields more or less similar results for all scenarios i), ii) and iii), from which it may be concluded that the proposed ACA-based solution is linearly scalable. [00109] As also envisaged herein, the proposed cMARL training of agents with ACA-based tensor decompositions to approximate agent-specific value functions may provide improvements also for other use cases than those described above. For example, it may be desirable to perform accurate and efficient prediction of trajectories for intelligent vehicles, especially in more sophisticated traffic scenarios wherein various vehicles and human crowds travel towards respective destinations with distinctive moving patterns. An intelligent vehicle is expected to be capable of taking actions proactively in order to handle emergencies encountered along the way, such as an upcoming need to slow down to enable surrounding vehicles to enter in front, or e.g. to speed up in order to switch lanes as part of overtaking another vehicle, and similar. Consequently, intelligent vehicles are required to reason about accurate future trajectories of surrounding (adjacent) vehicles in order to conduct risk assessment of vehicle behavior and to further take appropriate actions. [00110] For such tasks, techniques like optimization, reinforcement learning, and/or other artificial neural network-based methods to project trajectories may be used. An important aspect is, however, that the computational cost of such methods should be minimal as decisions need to be taken in real-time without additional delay. In such a case, it is envisaged to use the ACA-based solutions proposed herein,
as it has been shown that they may reduce both the required memory and the required computational power. The proposed ACA-based solution is also suitable for reducing noise in predicted trajectories caused e.g. by noisy observations of the environment, which will further improve performance in terms of e.g. Quality of Service (QoS) and similar. In particular, in case of cMARL implemented with a real environment, the state of the environment is often obtained using various sensors, whose outputs are often more or less corrupted by noise. If the noisy observations obtained from such sensors are correlated, the performance of MARL-based models may suffer. However, as the proposed ACA-based tensor decomposition inherently acts as a denoising process, the effect of such noise on the output of the network may thus be reduced, making the proposed ACA-based solution particularly useful in situations wherein the observations of the agents (as obtained from e.g. one or more sensors) are noisy. [00111] An observation space may e.g. include current locations of all vehicles, and an action space may e.g. include the next predicted locations of all the vehicles, while a reward may be an accuracy of the predicted locations, and/or e.g. based on whether there are conflicting locations or not, and similar. Phrased differently, although many of the use cases of the proposed ACA-based solution described herein are associated with actual controlling of an environment, the proposed solutions may just as well be utilized to make predictions about an environment, instead of, or in addition to, controlling the environment. Other tasks suitable for the proposed ACA-based solution may e.g. involve throughput prediction (wherein the agents learn to predict e.g. an expected throughput for one or more UEs based on for example the location and current signal-to-noise ratio of the UE), and/or intent-based automation such as e.g. described by way of example with reference to Figure 11. [00112] In summary of all of the above, the present disclosure provides an improved way of how to approximate value functions of agents taking part in cMARL, and in particular of how to use tensor decompositions based on ACA to reduce the number of parameters required to train/store the ANNs which are conventionally used to approximate the value functions (as in e.g. deep RL). As has been shown, the use of ACA instead of e.g. SVD provides a reduction of the number of parameters and of the number of samples required to find convergent solutions during training, while e.g. still providing a high average reward. The proposed ACA-based solution is also effective against noise (in e.g. the observations made by the agents). The proposed
ACA-based solution therefore offers an improvement over conventional technology in that it allows for a reduction of required memory and/or computational power, an improved noise tolerance, and e.g. a suitable alternative for use in, for example, a cognitive layer of a telecommunications network. The proposed ACA-based solution is therefore also more suitable for e.g. being implemented in lower-end devices wherein memory and/or computational power is limited, such as devices configured for IoT and similar. [00113] Although features and elements may be described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements. Additionally, variations to the disclosed embodiments may be understood and effected by the skilled person in practicing the claimed invention as defined by the appended patent claims, from a study of the drawings, the disclosure, and the appended claims themselves. In the claims, the words “comprising” and “including” do not exclude other elements, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that certain features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be used to advantage.
Claims
1. A computer-implemented method (400) of training a plurality of agents (310-i) to perform a joint task, comprising: - training (S410), using cooperative multi-agent reinforcement learning, cMARL, a plurality of agents (310-i) to perform a joint task, wherein said training further comprises, for each agent, approximating (S412) an agent-specific value function ($Q_i$) as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i).
2. The method according to claim 1, wherein the joint task is specific to a cognitive layer (1010, 1110) of a telecommunications network.
3. The method according to claim 2, further comprising emulating the telecommunications network as part of said training.
4. The method according to claim 2 or 3, wherein the joint task comprises at least one of throughput prediction, intent-based automation, and prediction of trajectories for connected users and/or vehicles.
5. A computer-implemented method (500) of performing a joint task, comprising: - obtaining (S510) a plurality of agents (310-i) trained to perform a joint task, wherein, for each agent, an agent-specific value function ($Q_i$) is approximated as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i), and - performing (S512) the joint task using the trained plurality of agents.
6. The method according to claim 5, wherein the plurality of agents is trained by performing the method (400) according to any one of claims 1 to 4.
7. The method according to claim 6 depending on claim 2, wherein the joint task is performed as part of an intent handler (1110) of the cognitive layer.
8. The method according to claim 7, wherein the joint task comprises making predictions related to, and/or controlling of, the telecommunications network.
9. The method according to claim 7 or 8, wherein each agent is responsible for meeting a target key performance indicator, KPI, for a specific service (1128a-c) of a network slice of the telecommunications network, and wherein the joint task comprises controlling of one or more network parameters relevant to the target KPIs for all agents.
10. The method according to claim 9, wherein the plurality of agents comprises at least two of: - a first agent (310-2) responsible for meeting a KPI in form of Quality of Experience, QoE, for a Conversational Video, CV, service; - a second agent (310-1) responsible for meeting a KPI in form of Packet Error Rate, PER, for an Ultra-Reliable Low Latency Communications, URLLC, service, and - a third agent (310-3) responsible for meeting a KPI in form of a PER for a Massive Internet of Things, mIoT, service, wherein the one or more network parameters comprises priority and maximum bit rate (MBR) of the network slice.
11. A training entity (600) for training of a plurality of agents (310-i) to perform a joint task, the training entity comprising processing circuitry (610) and a memory (630) storing instructions, wherein the instructions are such that they, when executed by the processing circuitry, cause the training entity to: - train, using cooperative multi-agent reinforcement learning, cMARL, a plurality of agents (310-i) to perform a joint task, wherein, for each agent, an agent-specific value function ($Q_i$) is approximated as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i).
12. The training entity according to claim 11, wherein the instructions are further such that they, when executed by the processing circuitry, cause the training entity to perform the method (400) according to any one of claims 2 to 4.
13. The training entity according to claim 11 or 12, wherein said training entity forms part of a node for a telecommunications network.
14. A task performing entity (700) for performing a joint task, the task performing entity comprising processing circuitry (710) and a memory (730) storing instructions, wherein the instructions are such that they, when executed by the processing circuitry, cause the task performing entity to: - obtain a plurality of agents (310-i) trained to perform a joint task, wherein, for each agent, an agent-specific value function ($Q_i$) is approximated as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i), and
- perform the joint task using the trained plurality of agents.
15. The task performing entity according to claim 14, wherein the instructions are further such that they, when executed by the processing circuitry, cause the task performing entity to obtain the plurality of agents by performing the method (500) according to any one of claims 6 to 10.
16. The task performing entity according to claim 14 or 15, wherein said task performing entity forms part of a node for a telecommunications network.
17. A computer program (820a) for training a plurality of agents (310-i) to perform a joint task, the computer program comprising computer code which, when run on processing circuitry (610), causes the processing circuitry to: - train, using cooperative multi-agent reinforcement learning, cMARL, a plurality of agents (310-i) to perform a joint task, wherein, for each agent, an agent-specific value function ($Q_i$) is approximated as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i).
18. A computer program product (810a) comprising a computer program (820a) according to claim 17, and a computer-readable storage medium (830) on which the computer program is stored.
19. A computer program (820b) for performing a joint task, the computer program comprising computer code which, when run on processing circuitry (710), causes the processing circuitry to: - obtain a plurality of agents (310-i) trained to perform a joint task, wherein, for each agent, an agent-specific value function ($Q_i$) is approximated as a tensor decomposition (316-i) based on an adaptive cross-approximation, ACA (318-i), and - perform the joint task using the trained plurality of agents.
20. A computer program product (810b) comprising a computer program (820b) according to claim 19, and a computer-readable storage medium (830) on which the computer program is stored.