WO2025140055A1 - Distributed training method, apparatus and system, and chip module and storage medium
- Publication number
- WO2025140055A1 (application PCT/CN2024/141150)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: model, information, expert, network, training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/098—Distributed learning, e.g. federated learning
Definitions
- the present application relates to artificial intelligence (AI), and in particular to a distributed training method, device, system, chip module and storage medium.
- Federated learning is a machine learning paradigm for distributed training.
- different sub-nodes train a public model based on local data to obtain a local model that can better fit the characteristics of the local data.
- after the central node aggregates the local models, the generalization of the overall model is improved, but the public model's ability to fit the local data characteristics of each sub-node decreases, so personalized enhancement is needed.
- the global model issued by the central node includes multiple expert models.
- Each child node selects at least a part of the output results of the expert model based on the output results of the gated network model to train the global model, obtains gradient information, and reports the gradient information to the central node for model aggregation.
- the present application provides a distributed training method, device, system, chip module and storage medium to improve the performance of the aggregation model of the central node.
- the central node indicates the selection parameters so that all sub-nodes participating in the training will apply the selection parameters to the output results of the gated network model during the training of the gated network model and the expert model, thereby enabling each sub-node to align the selection of the output results of the gated network model and improve the performance of the aggregation model of the central node.
- the second information is further used to indicate an identifier of the selected at least one expert model.
- the method further includes: updating the global model based on the plurality of second information.
- the method further includes: updating the selection parameter based on the plurality of second information; and sending fourth information, where the fourth information is used to indicate the updated selection parameter.
- a distributed training device for implementing the distributed training method in the above-mentioned first aspect or any one of the implementations of the first aspect.
- the device may be a sub-node/third-party device, or a module (such as a processor, chip, or chip system, etc.) applied to a sub-node/third-party device, or a logical node, logical module or software that can implement all or part of the functions of a sub-node/third-party device.
- the distributed training device may include a sending unit, a receiving unit, and may also include a processing unit. The sending unit and the receiving unit may be independent or combined together (which may be referred to as a "transceiver unit").
- a distributed training device for implementing the distributed training method in the second aspect or any one of the implementations of the second aspect.
- the device may be a central node, or a module applied to the central node (such as a processor, a chip, or a chip system, etc.), or a logical node, a logical module, or software that can implement all or part of the functions of the central node.
- the distributed training device may include a sending unit, a receiving unit, and may also include a processing unit. The sending unit and the receiving unit may be independent or combined together (which may be referred to as a "transceiver unit").
- the distributed training device in the third to fourth aspects includes a unit for respectively executing the method in any aspect or any implementation of the first to second aspects.
- the sending unit may be an output unit, such as an output circuit or a communication interface; the receiving unit may be an input unit, such as an input circuit or a communication interface.
- the sending unit may be a transmitter; the receiving unit may be a receiver.
- a computer program product comprising instructions, which, when executed on a distributed training device, enables the distributed training device to execute the methods described in the above aspects.
- a distributed training system which includes the distributed training device described in the third aspect and the distributed training device described in the fourth aspect.
- FIG1 is a schematic diagram of the architecture of a distributed training system provided in an embodiment of the present application.
- FIG2 is a simplified schematic diagram of a wireless communication system provided by an embodiment of the present application.
- FIG3 is a schematic diagram of the architecture of another distributed training system provided by the present application.
- FIGS. 4A to 4D are schematic diagrams of a network architecture provided in an embodiment of the present application.
- FIG5 is a schematic diagram of a neuron structure.
- FIG6 is a schematic diagram of a neural network.
- FIG7 is a schematic diagram of an AI application framework.
- FIG8 is a schematic diagram of the architecture of another communication system provided in an embodiment of the present application.
- FIGS. 9A to 9E are schematic diagrams of the structure of a global model provided in an embodiment of the present application.
- FIG10 is a schematic diagram of a flow chart of a distributed training method provided in an embodiment of the present application.
- FIG11 is a schematic diagram of optimal beam training according to an example of an embodiment of the present application.
- FIG12 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present application.
- FIG13 is a schematic diagram of the structure of another distributed training device provided in an embodiment of the present application.
- the embodiments of the present application can be applied to a distributed training system as shown in Figure 1, which includes a central node and multiple child nodes.
- the distributed training system can be a federated learning system or a Gossip learning system.
- Model parameter information can be transmitted between the central node and each child node.
- the distributed training system can also include a third-party device (not shown in the figure), which can serve as a model training entity for the central node/child node.
- the machine learning model trained by the distributed training system can be for non-wireless communication services, such as image recognition, natural language processing, etc., or for wireless communication services, such as beam selection based on environmental information.
- the communication system can be a fourth generation (4G) communication system (such as a long term evolution (LTE) system), a fifth generation (5G) communication system, a world-wide interoperability for microwave access (WiMAX), a wireless local area network (WLAN) system, a satellite communication system, a fusion system of multiple systems, or a future communication system, such as a sixth generation (6G) communication system.
- the 5G communication system can also be called a new radio (NR) system.
- a network element in a communication system can send a signal to another network element or receive a signal from another network element.
- the signal may include information, signaling, or data, etc.
- the network element may also be replaced by an entity, a network entity, a device, a terminal device, a communication module, a node, a communication node, etc.
- the network element is used as an example for description in this application.
- a communication system may include at least one terminal device and at least one network device.
- the network device may send a downlink signal to the terminal device, and/or the terminal device may send an uplink signal to the network device.
- the plurality of terminal devices may also send signals to each other, that is, the signal sending network element and the signal receiving network element may both be terminal devices.
- FIG. 2 is a simplified schematic diagram of a wireless communication system provided in an embodiment of the present application.
- the wireless communication system includes a wireless access network 100.
- the wireless access network 100 may be a next generation (e.g., 6G or higher) wireless access network, or a traditional (e.g., 5G, 4G) wireless access network.
- One or more terminal devices 120a-120j, collectively referred to as 120
- 120 may be connected to each other, or to one or more network devices (110a, 110b, collectively referred to as 110) in the wireless access network 100.
- FIG. 2 is only a schematic diagram, and other devices may also be included in the wireless communication system, such as core network devices, wireless relay devices, and/or wireless backhaul devices, which are not shown in FIG. 2 .
- the wireless communication system may include multiple network devices (also referred to as access network devices) at the same time, and may also include multiple terminal devices at the same time.
- a network device may serve one or more terminal devices at the same time.
- a terminal device may also access one or more network devices at the same time.
- the embodiment of the present application does not limit the number of terminal devices and network devices included in the wireless communication system.
- the network device can be an entity on the network side for transmitting or receiving signals.
- the network device can be an access device for the terminal device to access the wireless communication system by wireless means, such as the network device can be a base station.
- the base station can broadly cover the following various names, or be replaced with the following names, such as: radio access network (RAN) node, node B (NodeB), evolved NodeB (eNB), next generation NodeB (gNB), network equipment in open radio access network (O-RAN), relay station, access point, transmission and receiving point (TRP), transmitting point (TP), master eNB (MeNB), secondary eNB (SeNB), multi-standard radio (MSR) node, home base station, network controller, access node, wireless node, access point (AP), transmission node, transceiver node, building baseband unit (BBU), remote radio unit (RRU), etc.
- the base station can be a macro base station, a micro base station, a relay node, a donor node or the like, or a combination thereof.
- the network device can also refer to a communication module, a modem or a chip used to be set in the aforementioned device or apparatus.
- the network device may also be a mobile switching center and a device that performs base station functions in device-to-device (D2D), vehicle-to-everything (V2X), and machine-to-machine (M2M) communications, a network-side device in a 6G network, and a device that performs base station functions in future communication systems.
- the network device may support networks with the same or different access technologies. The embodiments of the present application do not limit the specific technology and specific device form used by the network device.
- the network equipment can be fixed or mobile.
- base stations 110a and 110b are stationary and are responsible for wireless transmission to and reception from terminal devices 120 in one or more cells.
- the helicopter or drone 120i shown in Figure 2 can be configured to act as a mobile base station, and one or more cells can move according to the location of the mobile base station 120i.
- the helicopter or drone (120i) can be configured to act as a terminal device that communicates with base station 110b.
- the communication device used to implement the above-mentioned access network function can be a network device, or a network device with partial access network functions, or a device capable of supporting the access network function, such as a chip system, a hardware circuit, a software module, or a hardware circuit plus a software module, which can be installed in the network device or used in combination with the network device.
- the communication device used to implement the network device function is described as a network device.
- the terminal device can be an entity on the user side for receiving or transmitting signals, such as a mobile phone.
- the terminal device can be used to connect people, objects and machines.
- the terminal device can communicate with one or more core networks through a network device.
- the terminal device includes a handheld device with a wireless connection function, other processing devices connected to a wireless modem, or a vehicle-mounted device.
- the terminal device can be a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device.
- the terminal device 120 can be widely used in various scenarios, such as cellular communication, D2D, V2X, point-to-point (P2P), machine-to-machine (M2M), machine type communication (MTC), Internet of Things (IoT), virtual reality (VR), augmented reality (AR), industrial control, automatic driving, telemedicine, smart grid, smart furniture, smart office, smart wear, smart transportation, smart city, drone, robot, remote sensing, passive sensing, positioning, navigation and tracking, autonomous delivery and mobility, etc.
- examples of terminal devices 120 include: 3GPP standard user equipment (UE), fixed equipment, mobile devices, handheld devices, wearable devices, cellular phones, smart phones, session initiation protocol (SIP) phones, laptops, personal computers, smart books, vehicles, satellites, global positioning system (GPS) equipment, target tracking equipment, drones, helicopters, aircraft, ships, remote control equipment, smart home equipment, industrial equipment, personal communication service (PCS) phones, wireless local loop (WLL) stations, personal digital assistants (PDAs), etc.
- the terminal device 120 may be a wireless device in the above-mentioned scenarios or a device used to be set in a wireless device, for example, a communication module, a modem or a chip in the above-mentioned device.
- the terminal device may also be referred to as a terminal, a terminal device, a UE, a mobile station (MS), a mobile terminal (MT), or the like.
- the terminal device may also be a terminal device in a future wireless communication system.
- the terminal device can be used in a dedicated network device or a general-purpose device. The embodiments of the present application do not limit the specific technology and specific device form used by the terminal device.
- the terminal device can be used to act as a base station.
- the UE can act as a scheduling entity that provides sidelink signals between UEs in V2X, D2D, or P2P, etc.
- the cellular phone 120a and the car 120b communicate with each other using sidelink signals.
- the cellular phone 120a and the smart home device 120e communicate without relaying the communication signal through the base station 110b.
- the communication device for realizing the functions of the terminal device may be a terminal device, or a terminal device having some functions of the above terminal devices, or a device capable of supporting the functions of the above terminal devices, such as a chip system, which may be installed in the terminal device or used in combination with the terminal device.
- the chip system may be composed of a chip, or may include a chip and other discrete devices.
- the communication device is described as a terminal device or UE as an example.
- the RAN node can be a CU, DU, CU-CP, CU-UP, or RU.
- the CU and DU can be set separately, or can also be included in the same network element, such as a BBU.
- the RU can be included in a radio frequency device or a radio frequency unit, such as an RRU, AAU, or RRH.
- the RAN node may support one or more types of fronthaul interfaces, and different fronthaul interfaces correspond to DUs and RUs with different functions. If the fronthaul interface between the DU and the RU is a common public radio interface (CPRI), the DU is configured to implement one or more of the baseband functions, and the RU is configured to implement one or more of the radio frequency functions.
- for uplink transmission, based on de-RE mapping, the DU is configured to implement one or more functions before de-mapping (i.e., one or more functions of decoding, de-rate matching, de-scrambling, demodulation, inverse discrete Fourier transform (IDFT), channel equalization, and de-RE mapping), while other functions after de-mapping (e.g., one or more functions of digital BF or FFT/CP removal) are moved to the RU for implementation.
- the processing unit for implementing the baseband function in the BBU is called a baseband high layer (BBH) unit, and the processing unit for implementing the baseband function in the RRU/AAU/RRH is called a baseband low layer (BBL) unit.
- the CU (or CU-CP and CU-UP), DU or RU may also have different names, but those skilled in the art can understand their meanings.
- in an open radio access network (O-RAN), the CU may also be called O-CU (open CU), the DU may also be called O-DU, the CU-CP may also be called O-CU-CP, the CU-UP may also be called O-CU-UP, and the RU may also be called O-RU.
- Any unit in the CU (or CU-CP, CU-UP), DU and RU in this application may be implemented by a software module, a hardware module, or a combination of a software module and a hardware module.
- the device for realizing the function of the network device may be a network device; or it may be a device capable of supporting the network device to realize the function, such as a chip system, a hardware circuit, a software module, or a hardware circuit plus a software module.
- the device may be installed in the network device or used in combination with the network device.
- only the device for realizing the function of the network device is a network device as an example for explanation, and the scheme of the embodiments of the present application is not limited.
- the protocol layer structure between the network device and the terminal device is described below.
- the protocol layer structure may include a control plane protocol layer structure and a user plane protocol layer structure.
- the control plane protocol layer structure may include the functions of the radio resource control (RRC) layer, the packet data convergence protocol (PDCP) layer, the radio link control (RLC) layer, the medium access control (MAC) layer and the physical layer.
- the user plane protocol layer structure may include the functions of the PDCP layer, the RLC layer, the MAC layer and the physical layer.
- the service data adaptation protocol (SDAP) layer may also be included above the PDCP layer.
- the protocol layer structure between the network device and the terminal device may also include an artificial intelligence (AI) layer for transmitting data related to AI functions.
- the terminal device may also have an application layer and a non-access layer.
- the application layer may be used to provide services to applications installed in the terminal device. For example, downlink data received by the terminal device may be sequentially transmitted from the physical layer to the application layer, and then provided to the application by the application layer; for another example, the application layer may obtain data generated by the application, and sequentially transmit the data to the physical layer and send it to other communication devices.
- the non-access layer may be used to forward user data, such as forwarding uplink data received from the application layer to the SDAP layer, or forwarding downlink data received from the SDAP layer to the application layer.
- the network device and UE1 to UE5 can form a distributed AI training system as shown in Figure 3.
- UE1 to UE5 can send data to the network device, and the network device needs to receive uplink data sent by UE1 to UE5.
- the uplink data can be the representation difference or model parameter calculated by the child node, or it can be the feedback amount containing its status information.
- the network device can send configuration information to UE1-UE5.
- the configuration information can be the model parameter data used by the central node to synchronize each child node, or it can be control data indicating the training method of the child node.
- AI nodes may also be introduced into the network.
- the AI node can be deployed in one or more of the following locations in the communication system: network equipment, terminal equipment, or core network equipment, etc., or the AI node can also be deployed separately, for example, deployed in a location other than any of the above devices, such as a host or cloud server in an over-the-top (OTT) system.
- the AI node can communicate with other devices in the communication system, and the other devices can be, for example, one or more of the following: network equipment, terminal equipment, or network elements of the core network, etc.
- An AI node can be an AI network element or an AI module.
- One or more AI modules are provided in one or more devices of these network element nodes, such as core network equipment, access network nodes (RAN nodes), terminals or OAM.
- the access network node can be a separate RAN node, or it can include multiple RAN nodes, for example, including CU and DU.
- the CU and/or DU can also be provided with one or more AI modules.
- the CU can also be split into CU-CP and CU-UP.
- One or more AI models are provided in the CU-CP and/or CU-UP.
- the AI module is used to implement the corresponding AI function.
- the AI modules deployed in different network elements can be the same or different.
- the model of the AI module can implement different functions according to different parameter configurations.
- the model of the AI module can be configured based on one or more of the following parameters: structural parameters (such as the number of neural network layers, the width of the neural network, the connection relationship between layers, the weight of the neuron, the activation function of the neuron, or at least one of the biases in the activation function), input parameters (such as the type of input parameters and/or the dimension of input parameters), or output parameters (such as the type of output parameters and/or the dimension of output parameters).
- the bias in the activation function can also be called the bias of the neural network.
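- as a purely illustrative sketch of such a configuration (all field names and values below are assumptions for illustration, not defined by the application):

```python
# Hypothetical configuration of the model of an AI module, grouped into the
# structural, input and output parameters listed above. Field names are illustrative.
model_config = {
    "structure": {
        "num_layers": 4,                   # number of neural network layers
        "layer_width": [64, 64, 32, 16],   # width of each layer
        "activation": "relu",              # activation function of the neurons
    },
    "input": {"type": "channel_feature", "dimension": 128},   # type/dimension of input parameters
    "output": {"type": "beam_id", "dimension": 1},            # type/dimension of output parameters
}
```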
- the communication system includes a RAN intelligent controller.
- the RIC may be the above-mentioned AI module, which is used to implement AI-related functions.
- the RIC includes near-real time RIC (near-real time RIC, near-RT RIC) and non-real time RIC (non-real time RIC, Non-RT RIC).
- the non-real-time RIC mainly processes non-real-time information, such as data that is not sensitive to delay, and the delay of the data can be on the order of seconds.
- the near-real-time RIC mainly processes near-real-time information, such as data that is relatively sensitive to delay, and the delay of the data is on the order of tens of milliseconds.
- Non-real-time RIC is also used for model training and reasoning. For example, it is used to train an AI model and use the model for reasoning.
- Non-real-time RIC can obtain information on the network side and/or the terminal side from RAN nodes (such as CU, CU-CP, CU-UP, DU and/or RU) and/or terminals. This information can be used as training data or reasoning data, and the reasoning results can be submitted to the RAN node and/or the terminal.
- the reasoning results can be exchanged between the CU and the DU, and/or between the DU and the RU.
- the non-real-time RIC submits the reasoning results to the DU, and the DU sends it to the RU.
- the near real-time RIC and the non-real-time RIC may also be separately set as a network element.
- the near real-time RIC and the non-real-time RIC may also be part of other devices, for example, the near real-time RIC is set in a RAN node (for example, in a CU or DU), and the non-real-time RIC is set in an OAM, a cloud server, a core network device, or other network devices.
- the configuration of near real-time RIC and non-real-time RIC in the network architecture may be as shown in FIG. 4A to FIG. 4D :
- the network device includes a near real-time RIC module for performing model learning and/or reasoning.
- a non-real-time RIC may be included outside a network device.
- the non-real-time RIC may be located in an OAM or a core network device.
- the network device includes a near real-time RIC, and the network device also includes a non-real-time RIC.
- the non-real-time RIC may be located in the OAM or in the core network device.
- the CU is separated into CU-CP and CU-UP in Fig. 4B.
- the settings of the near real-time RIC and the non-real-time RIC are the same as those in (c) in Fig. 4A.
- the network device includes one or more AI entities, and the function of the AI entity is similar to the above-mentioned near real-time RIC.
- the OAM includes one or more AI entities, and the function of the AI entity is similar to the above-mentioned non-real-time RIC.
- the core network device includes one or more AI entities, and the function of the AI entity is similar to the above-mentioned non-real-time RIC.
- the difference in models may include at least one of the following differences: structural parameters of the model (such as the number of layers of the model, and/or weights, etc.), input parameters of the model, or output parameters of the model.
- the network device in Figure 4D is separated into CU and DU.
- the CU may include an AI entity, and the function of the AI entity is similar to the above-mentioned near real-time RIC.
- the DU may include an AI entity, and the function of the AI entity is similar to the above-mentioned near real-time RIC.
- the CU in Figure 4D can be further split into CU-CP and CU-UP.
- one or more AI models can be deployed in the CU-CP.
- one or more AI models can be deployed in the CU-UP.
- the OAM of the network device and the OAM of the core network device can be deployed separately and independently.
- AI refers to the intelligence displayed by machines created by humans.
- artificial intelligence refers to the technology that presents human intelligence through ordinary computer programs.
- Artificial intelligence can be defined as machines or computers that imitate humans and have cognitive functions related to human thinking, such as learning and problem solving. Artificial intelligence is able to learn from past experiences, make reasonable decisions, and respond quickly.
- the goal of artificial intelligence is to understand intelligence by building computer programs capable of symbolic reasoning or inference.
- Machine learning is a way to achieve artificial intelligence, that is, to solve problems in artificial intelligence by means of machine learning.
- the theory of machine learning mainly designs and analyzes some algorithms that allow computers to "learn" automatically.
- Machine learning algorithms are a type of algorithm that automatically analyzes data to obtain patterns and uses the patterns to predict unknown data. Because learning algorithms involve a lot of statistical theories, machine learning is particularly closely related to inferential statistics, and is also called statistical learning theory.
- Machine learning can be divided into supervised learning, unsupervised learning, and reinforcement learning.
- Supervised learning uses machine learning algorithms to learn the mapping relationship from sample values to sample labels based on the collected sample values and sample labels, and uses machine learning models to express the learned mapping relationship.
- the process of training a machine learning model is the process of learning this mapping relationship.
- for example, in signal detection, the received signal containing noise is the sample, and the real constellation point corresponding to the signal is the label; machine learning expects to learn the mapping relationship between samples and labels through training, that is, to make the machine learning model learn a signal detector.
- the model parameters are optimized by calculating the error between the model's predicted value and the real label.
- the learned mapping can be used to predict the label of each new sample.
- the mapping relationship learned by supervised learning can include linear mapping and nonlinear mapping. According to the type of label, the learning task can be divided into classification task and regression task.
- Unsupervised learning is based only on the collected sample values, using algorithms to discover the inherent patterns of samples.
- the model parameters are optimized by calculating the error between the model's predicted value and the sample itself.
- Self-supervised learning can be used in applications such as signal compression and decompression recovery. Common algorithms include autoencoders and adversarial generative networks.
- Reinforcement learning is different from supervised learning. It is a type of algorithm that learns problem-solving strategies by interacting with the environment. Unlike supervised and unsupervised learning, reinforcement learning problems do not have clear "correct" action label data.
- the algorithm needs to interact with the environment to obtain reward signals from the environment, and then adjust the decision-making actions to obtain a larger reward signal value. For example, in downlink power control, the reinforcement learning model adjusts the downlink transmission power of each user according to the total system throughput fed back by the wireless network, and then expects to obtain a higher system throughput.
- the goal of reinforcement learning is also to learn the mapping relationship between the state of the environment and the optimal decision action. However, because the label of the "correct action" cannot be obtained in advance, the network cannot be optimized by calculating the error between the action and the "correct action”. Reinforcement learning training is achieved through iterative interaction with the environment.
- AI models are algorithms or computer programs that can realize AI functions. They are the specific implementation of AI technology functions. AI models represent the mapping relationship between the input and output of the model.
- the types of AI models can be neural networks, linear regression models, decision tree models, support vector machines (SVM), Bayesian networks, Q learning models or other machine learning models.
- Deep neural network is a specific implementation form of AI or machine learning technology. According to the universal approximation theorem, neural network can theoretically approximate any continuous function, so that neural network has the ability to learn any mapping.
- Traditional communication systems require rich expert knowledge to design communication modules, while DNN-based deep learning communication systems can automatically discover implicit pattern structures from large data sets, establish mapping relationships between data, and obtain performance that is superior to traditional modeling methods.
- each neuron performs a weighted sum operation on its input values and outputs the operation result through an activation function.
- FIG 5 it is a schematic diagram of the neuron structure.
- the bias used when the input values are weighted and summed according to the weights is, for example, b.
- b, w_i, and x_i can be decimals, integers (such as 0, positive integers or negative integers), or complex numbers.
- the activation functions of different neurons in a neural network can be the same or different.
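- as a compact restatement of the neuron operation described above (a sketch using the symbols implied by the description and FIG5: inputs x_i, weights w_i, bias b, and activation function f):

$$
z = \sum_{i} w_i x_i + b, \qquad y = f(z)
$$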
- a neural network generally includes multiple layers, each of which may include one or more neurons.
- by increasing the number of layers and/or the number of neurons per layer, the expressive power of the neural network can be improved, providing a more powerful information extraction and abstract modeling capability for complex systems.
- the depth of a neural network may refer to the number of layers included in the neural network, wherein the number of neurons included in each layer may be referred to as the width of the layer.
- the neural network includes an input layer and an output layer. The input layer of the neural network processes the received input information through neurons, passes the processing results to the output layer, and obtains the output result of the neural network from the output layer.
- the neural network includes an input layer, a hidden layer, and an output layer, and reference may be made to the schematic diagram of the neural network in FIG6.
- the input layer of the neural network processes the received input information through neurons, passes the processing results to the middle hidden layer, and the hidden layer calculates the received processing results to obtain the calculation results.
- the hidden layer passes the calculation results to the output layer or the adjacent hidden layer, and finally obtains the output result of the neural network from the output layer.
- a neural network may include one hidden layer, or include multiple hidden layers connected in sequence, without limitation.
- DNN can include feedforward neural network (FNN), convolutional neural network (CNN) and recurrent neural network (RNN).
- Figure 6 shows an FNN network, which is characterized by the neurons in adjacent layers being fully connected to each other, which usually requires a large amount of storage space and leads to high computational complexity.
- CNN is a neural network that is specifically designed to process data with a grid-like structure.
- time series data and image data can be considered to be data with a grid-like structure.
- CNN does not use all the input information for calculations at once, but uses a fixed-size window to intercept part of the information for convolution operations, which greatly reduces the amount of calculation of model parameters.
- each window can use different convolution kernel operations, which enables CNN to better extract the features of the input data.
- RNN is a type of DNN network that uses feedback time series information. Its input includes the new input value at the current moment and its own output value at the previous moment. RNN is suitable for obtaining sequence features that are correlated in time, and is particularly suitable for applications such as speech recognition and channel coding.
- the above-mentioned FNN, CNN, and RNN are common neural network structures, which are all constructed based on neurons.
- each neuron performs a weighted sum operation on its input values, and the weighted sum result generates an output through a nonlinear function.
- the parameters of all neurons in a neural network constitute the parameters of this neural network.
- the training data set is used for training the AI model.
- the training data set may include the input of the AI model, or the input and target output of the AI model.
- the training data set includes one or more training data, and the training data may be a training sample input to the AI model, or it may be the target output of the AI model.
- the target output may also be referred to as a label or a label sample.
- the training data set is one of the important parts of machine learning. Model training is essentially learning certain features from the training data so that the output of the AI model is as close to the target output as possible, for example, so that the difference between the output of the AI model and the target output is as small as possible.
- the composition and selection of the training data set can determine the performance of the trained AI model to a certain extent.
- a loss function can be defined.
- the loss function describes the gap or difference between the output value of the AI model and the target output value. This application does not limit the specific form of the loss function.
- the training process of the AI model is to adjust the model parameters of the AI model so that the value of the loss function is less than the threshold, or the value of the loss function meets the target requirements.
- the AI model is a neural network, and adjusting the model parameters of the neural network includes adjusting at least one of the following parameters: the number of layers, width, weights of neurons, or parameters in the activation function of neurons.
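- as one illustration of this adjustment process (the application does not prescribe a particular optimizer; gradient descent with a learning rate η is assumed here purely as an example), the model parameters θ can be updated iteratively so that the loss L(θ) decreases:

$$
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
$$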
- the design of the AI model mainly includes a data collection link (for example, collecting training data and/or reasoning data), a model training link, and a model reasoning link. It can also include a reasoning result application link. See Figure 7, which illustrates an AI application framework.
- the data collection link the data source is used to provide training data sets and reasoning data.
- the model training link the AI model is obtained by analyzing or training the training data provided by the data source.
- the AI model represents the mapping relationship between the input and output of the model. Learning the AI model through the model training node is equivalent to learning the mapping relationship between the input and output of the model using the training data.
- the AI model trained through the model training link is used to reason based on the reasoning data provided by the data source to obtain the reasoning result.
- This link can also be understood as: inputting the reasoning data into the AI model, and obtaining the output through the AI model, which is the reasoning result.
- the reasoning result can indicate: the configuration parameters used (executed) by the execution object, and/or the operation performed by the execution object.
- the reasoning results are published in the reasoning result application link.
- the reasoning results can be planned uniformly by the execution entity (actor).
- the execution entity can send the reasoning results to one or more execution objects (for example, core network equipment, network equipment, or terminal equipment, etc.) for execution.
- the execution entity can also feedback the performance of the model to the data source to facilitate the subsequent implementation of the model update training.
- a network element with artificial intelligence function may be included in the communication system.
- the above-mentioned AI model design-related links may be performed by one or more network elements with artificial intelligence function.
- an AI function (such as an AI module or an AI entity) may be configured in an existing network element in the communication system to implement AI-related operations, such as training and/or reasoning of an AI model.
- the existing network element may be a network device (such as a gNB), a terminal device, a core network device, or a network management system.
- the network management system may divide the network management work into three categories according to the actual needs of the operator's network operation: operation, administration, and maintenance.
- the network management system may also be referred to as an OAM network element, or OAM for short.
- Operation mainly completes the analysis, prediction, planning, and configuration of daily networks and services; maintenance mainly involves daily operational activities such as testing and fault management of the network and its services.
- the network management system may detect the network operation status, optimize network connections and performance, improve network operation stability, and reduce network maintenance costs.
- an independent network element may also be introduced into the communication system to perform AI-related operations, such as training an AI model.
- the independent network element can be called an AI network element or an AI node, etc., and this application does not limit this name.
- the AI network element can be directly connected to the network equipment in the communication system, or it can be indirectly connected through a third-party device and the network equipment.
- the third-party device can be a core network element such as an authentication management function (AMF) network element, a user plane function (UPF) network element, an OAM, a cloud server or other network elements, without limitation.
- the communication system includes a network device 810 and terminal devices 820 and 830, and an AI network element 840 is also introduced into the communication system.
- a model can be inferred to obtain one parameter, or multiple parameters.
- the training process of different models can be deployed in different devices or nodes, or in the same device or node.
- the reasoning process of different models can be deployed in different devices or nodes, or in the same device or node.
- the model parameters may include one or more of the following: structural parameters of the model (such as the number of layers of the model, and/or weights, etc.), the input parameters of the model (such as input dimension, number of input ports), or the output parameters of the model (such as output dimension, number of output ports).
- the input dimension may refer to the size of an input data.
- the input dimension corresponding to the sequence may indicate the length of the sequence.
- the number of input ports may refer to the number of input data.
- the output dimension may refer to the size of an output data.
- the output dimension corresponding to the sequence may indicate the length of the sequence.
- the number of output ports may refer to the number of output data.
- Distributed training is an effective solution to the above challenges.
- This type of technology allows the machine learning training process to be distributed across multiple sub-nodes on the user side to achieve scalability of the learning algorithm. It allows the cloud or server, as a central node, to collect machine learning models trained by multiple sub-nodes. The central node improves the effect of the entire machine learning training by integrating the models trained by the sub-nodes. Since the training data is always kept in the sub-nodes, distributed learning technology is expected to achieve the same performance as centralized training while using the UE's data and/or computing power and protecting user data privacy.
- Federated learning is a machine learning paradigm for distributed training. Its original intention was to effectively help multiple institutions use data and conduct machine learning modeling while meeting the requirements of user privacy protection and data security.
- what is transmitted between nodes is not the data itself, but the intermediate results obtained during training, such as model parameters or gradients.
- federated learning can effectively solve the problem of data silos, allowing participants to jointly model without sharing data, technically breaking down data silos and achieving AI collaboration.
- federated learning can be divided into three categories: horizontal federated learning, vertical federated learning, and federated transfer learning.
- Horizontal federated learning means that when there is a lot of overlap in user features but little overlap in users of two datasets, we split the dataset horizontally (i.e., user dimension) and take out the data with the same user features but different users for training.
- Vertical federated learning means that when there is a lot of overlap in users of two datasets but little overlap in user features, we split the dataset vertically (i.e., feature dimension) and take out the data with the same users but different user features for training.
- Federated transfer learning means that when there is little overlap in users and user features of two datasets, we do not split the data, but use transfer learning to overcome the lack of data or labels.
- PBCH: physical broadcast channel; SSB: synchronization signal block; CSI-RS: channel state information-reference signal.
- AI/ML can be used to train a model, using, for example, a plurality of received SSB/CSI-RS signals (part or all), or the strength (RSRP) (part or all) of a plurality of received SSB/CSI-RS signals, or the estimated channel as input to infer the optimal beam ID and feed it back to the network device.
- Each user can collect his/her own receiving beam/channel information and the corresponding optimal beam ID as samples (i.e., local samples) for training the above AI/ML model.
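- as a minimal sketch of what such an AI/ML model could look like (the network sizes, the number of candidate beams M, and the use of a two-layer fully connected network are assumptions for illustration, not specified by the application):

```python
# Hypothetical beam-selection model: maps the measured RSRP of M candidate
# SSB/CSI-RS beams to a score per candidate beam ID; the argmax is the inferred
# optimal beam ID that would be fed back to the network device.
import torch
import torch.nn as nn

M = 64  # assumed number of candidate SSB/CSI-RS beams

beam_model = nn.Sequential(
    nn.Linear(M, 128),   # input: RSRP of the M received beams
    nn.ReLU(),
    nn.Linear(128, M),   # output: one score per candidate beam ID
)

rsrp_sample = torch.randn(1, M)                            # placeholder local sample
optimal_beam_id = beam_model(rsrp_sample).argmax(dim=-1)   # inferred optimal beam ID
```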
- the number of samples that each user can collect is limited.
- the performance of the model obtained by the user using only local data for training will be limited, that is, due to the position relationship, the optimal beam ID of the user may be only a subset of the SSB/CSI-RS codebook.
- the server aggregates the data of each user for model training. Although the performance of the model can be improved, there is a risk of leaking the user's privacy information; for example, the user's current location can be inferred from the channel.
- federated learning can be used.
- the central node sends a global model to each user participating in federated learning. Each user uses local data to train the global model to obtain a local model, and sends the parameter information of the local model, such as gradients, weights, etc. (encrypted) to the server.
- the server performs model fusion (model aggregation, MA) to update the global model, and sends the global model to each user again. The user continues to update the local model and sends it to the central node. This is repeated multiple times until convergence.
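- a minimal sketch of the aggregation step at the central node, assuming each user reports weights of the same shape and that the fusion is a sample-count-weighted average; the application does not mandate this particular fusion rule:

```python
# Hypothetical federated-averaging step: fuse the local model weights reported
# by the users into updated global model weights.
import numpy as np

def aggregate(local_weights, sample_counts):
    """Weighted average of the per-user weight arrays (one possible aggregation choice)."""
    total = sum(sample_counts)
    return sum(w * (n / total) for w, n in zip(local_weights, sample_counts))

# Example: three users report updated weights after one round of local training.
local_weights = [np.array([0.1, 0.2]), np.array([0.3, 0.1]), np.array([0.2, 0.4])]
sample_counts = [100, 50, 150]
global_weights = aggregate(local_weights, sample_counts)  # new global model parameters
```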
- the global model sent by the central node includes multiple expert models.
- Each sub-node selects at least a part of the output results of the expert model based on the output results of the gated network model to train the global model, obtains gradient information, and reports the gradient information to the central node for model aggregation.
- the hybrid expert model combines multiple networks or multiple expert models to achieve better model performance.
- a large model based on the hybrid expert model can effectively improve training efficiency and scale up the number of model parameters.
- according to whether the global model includes a general layer, whether the expert layer includes a gated network model, and whether there is a clear definition of the expert layer, the global model can be classified as follows:
- the expert layer includes the gated network model and has a clear definition of the expert layer:
- the global model includes a general layer and an expert layer.
- the general layer is a network layer that is common to all input data, and may be a feature extraction network such as a convolutional neural network.
- the expert layer includes a gated network model and N expert models. N is a positive integer.
- the N expert models may be based on different types of neural networks, such as CNN, RNN, Transformer, MLP, etc.; or based on the same type of neural network, but with different parameter configurations, such as the number of layers, depth, and specific configurations including kernel size.
- the gated network model is trained, and the output of the gated network model is used to select the output of the expert models. Selecting an appropriate and matching gating output mechanism can merge and balance the experts' choices.
- the different expert models here can refer to different network structures, for example, expert model 1 is a convolutional network, expert model 2 is a fully connected network, and so on. However, it is necessary to ensure that the output of each expert network can be merged. Selecting a part of the expert models for prediction based on the output of the gated network model can reduce the amount of calculation and select the most appropriate expert model for different inputs.
- the output of the general layer serves as the common input of the expert model and the gating network model.
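- a minimal sketch of this first case, assuming (purely for illustration) linear layers, N = 4 experts, and a sigmoid so that each gate output lies in [0, 1] as described later; none of these choices are mandated by the application:

```python
# Hypothetical expert-layer forward pass: a shared general layer feeds N expert
# models and a gated network model, and the gate outputs weight the expert outputs.
import torch
import torch.nn as nn

N, d_in, d_hid, d_out = 4, 16, 32, 8   # assumed dimensions

general_layer = nn.Linear(d_in, d_hid)                      # general (shared) layer
experts = nn.ModuleList([nn.Linear(d_hid, d_out) for _ in range(N)])
gate = nn.Sequential(nn.Linear(d_hid, N), nn.Sigmoid())     # one output per expert, in [0, 1]

def global_model_forward(x):
    h = torch.relu(general_layer(x))                        # output of the general layer
    g = gate(h)                                             # N gating values
    expert_out = torch.stack([e(h) for e in experts], dim=1)    # (batch, N, d_out)
    return (g.unsqueeze(-1) * expert_out).sum(dim=1)        # gated combination of expert outputs

y = global_model_forward(torch.randn(2, d_in))
```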
- the global model includes only the expert layer, but not the general layer.
- the expert layer includes a gated network model and N expert models. The meanings of the expert model and the gated network model can be referred to in the above description.
- the expert layer does not include a gated network model, but only includes N expert models.
- the gated network model is independent of the expert layer.
- the meanings of the expert model and the gated network model can be referred to the above description.
- the input of the expert layer and the gated network model are different.
- the result of the first preprocessing of the input local data is used as the input of expert model 1 to expert model N; the result of the second preprocessing of the input local data is used as the input of the gated network model.
- the present application provides a distributed training solution, in which the central node indicates the selection parameters so that all sub-nodes participating in the training will apply the selection parameters to the output results of the gated network model during the training process of the gated network model and the expert model, thereby enabling each sub-node to align the selection of the output results of the gated network model and improve the performance of the aggregation model of the central node.
- the distributed training method provided by the embodiment of the present application is described in detail below. It can be understood that the present application uses the central node and the sub-node as an example to illustrate the execution subject of the interactive diagram, but the present application does not limit the execution subject of the interactive diagram.
- a flow chart of a distributed training method provided in an embodiment of the present application is provided.
- the method is applied to a distributed training system, which includes a central node and multiple child nodes.
- the global model of the distributed training system includes N expert models and a gated network model, where N is a positive integer.
- the method may include the following steps:
- the subnode sends third information to the central node.
- the central node receives the third information.
- the child node is any child node participating in the distributed training. This embodiment is described by taking the interaction process between a child node and a central node as an example, and the interaction between other child nodes and the central node refers to the interaction process of this embodiment.
- when initially constructing the distributed training system, the child node can report its own capabilities to the central node. Exemplarily, the child node sends third information to the central node, where the third information is used to indicate at least one of the following information: the size of the child node's memory space, the child node's computing power information, whether model training is supported, and the type of model supported for training.
- when training starts, the central node will send the initialization model of the central node to the child node, so that the child node can train the initialization model. Therefore, before the training starts, the child node can report the size of its memory space to the central node, so that the central node knows whether the child node's memory space is large enough to store the initialization model of the central node and subsequent training data.
- the size of the child node's memory space refers to the size of the memory space that the child node can use to store AI/ML models.
- before training begins, the child node can also report its computing power information to the central node, so that the central node knows whether the child node has strong enough computing power and can feed back the model information in a timely manner after training.
- the computing power information of the child node refers to the computing power of running the AI/ML model.
- before training begins, the child node can also report to the central node whether it supports model training, so that the central node can determine whether the child node can participate in distributed training and send the global model to it.
- the third information may also include hardware information of the sub-node, including but not limited to the antenna configuration of the sub-node (number of antennas, polarization direction, etc.), number of RF channels, sensor type (position sensor/global positioning system (GPS), motion sensor, etc.) and parameters.
- the child nodes use local data for training, so the child nodes do not need to report a series of information related to the actual collected data or involving privacy, such as the amount of data that can be processed.
- the central node may also obtain the above information of the child nodes in advance. Therefore, this step is optional and is indicated by a dotted line in the figure.
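- as a purely illustrative sketch of the kind of capability report (third information) described above, with hypothetical field names and values not defined by the application:

```python
# Hypothetical content of the third information reported by a child node.
# All field names and values are illustrative assumptions.
third_information = {
    "memory_space_mb": 512,                    # size of memory space usable for AI/ML models
    "compute_tops": 2.0,                       # computing power available for running AI/ML models
    "supports_model_training": True,           # whether model training is supported
    "supported_model_types": ["CNN", "MLP"],   # types of model supported for training
    "hardware": {                              # optional hardware information
        "num_antennas": 4,
        "num_rf_chains": 2,
        "sensors": ["GPS"],
    },
}
```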
- the central node sends first information to the child node.
- the child node receives the first information.
- after receiving the third information reported by each sub-node, the central node selects the sub-nodes to participate in this round of distributed training according to the third information of each sub-node.
- the global model includes N expert models (or network 1 to network N as shown in FIG. 9E ), and these N expert models can be based on different types of neural networks, such as CNN, RNN, Transformer, MLP, etc.; or based on the same type of neural network, but corresponding to different parameter configurations, such as the number of layers, depth, and specific configurations including kernel size. Therefore, the output results of the N expert models may be different.
- The more sparsely the output results of the expert models are selected, the more types of data features the expert networks can adapt to; the extreme case is that one expert network learns only one type of feature data.
- the global model includes a gated network model, and the sub-nodes also train the gated network model when training the expert model.
- The gated network model may include a different type of neural network from the expert models, or the same type of neural network with a different parameter configuration, such as the number of layers, depth, and specific configurations including kernel size. The gated network model has N outputs, and the N outputs of the gated network model are respectively connected to the outputs of the N expert models.
- The output results of the gated network model can be used to select the output results of the expert models, and the output results of the gated network model correspond one-to-one to the output results of the expert models connected to them. Therefore, in order to align each sub-node's selection of the output results of the expert models, the central node sends the first information to the sub-nodes, where the first information is used to indicate the selection parameter and the selection parameter is used to select the output results of the gated network model.
- the sub-node can determine whether to retain the output result of the gated network model or discard the output result of the gated network model according to the selection parameter.
- If it is determined according to the selection parameter to retain the output result of the gated network model, the output result of the corresponding expert model is also retained; if it is determined according to the selection parameter to discard the output result of the gated network model, the output result of the corresponding expert model is also discarded.
- the output result of the gated network model is a value between [0, 1] after being processed by the set function.
- the selection parameter is a first threshold.
- the value range of the first threshold may be (0, 1). If the output result of the gated network model is greater than the first threshold, the output result of the gated network model is retained; if the output result of the gated network model is less than the first threshold, the output result of the gated network model is discarded. In particular, if the output result of the gated network model is equal to the first threshold, it can be agreed to retain or discard the output result of the gated network model.
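- Written as a rule (the notation below is ours, for illustration only): with g_i denoting the i-th output of the gated network model and τ the first threshold,

```latex
\text{retain expert } i \iff g_i > \tau, \qquad
\text{discard expert } i \iff g_i < \tau, \qquad
g_i \in [0,1],\ \tau \in (0,1),
```

and the case g_i = τ is handled according to the agreed convention.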
- the first information can also be used to indicate at least one of the following information: the identification of the N expert models, the competition mode of the N expert models, the training task, and the type of input and output of the global model.
- The first information is used to indicate the identifiers of the N expert models, so that it is clear which expert models' output results the model parameter information subsequently reported by the child nodes is based on.
- the central node can further instruct the child nodes on how to train each expert model.
- the central node will instruct the child nodes on the purpose of training the network. For example, channel recovery is to make the recovered channel (as the output of the central node) as close to the true value as possible (that is, to minimize the normalized mean squared error (NMSE)).
- NMSE normalized mean squared error
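- For reference, the NMSE mentioned above is commonly defined as follows (a standard definition stated here for clarity; the application itself does not give the formula):

```latex
\mathrm{NMSE} = \mathbb{E}\!\left[ \frac{\lVert \hat{H} - H \rVert_2^2}{\lVert H \rVert_2^2} \right],
```

where H is the true channel and Ĥ is the recovered channel output by the network.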
- The competition mode of the N expert models, where the competition mode can also be called the usage mode, cooperation mode, collaboration mode, an indication of whether to collaborate, etc.:
- A represents the true value (which can represent a single sample);
- p_i represents the weight of the i-th expert model (determined based on the output of the gated network model);
- o_i represents the output result of the i-th expert model.
- the central node may carry the competition mode of the N experts in the first information, where the competition mode indicates whether the N expert models are cooperating or competing.
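- One common way to formalize the two modes under the notation defined above (this loss formulation is an assumption made for illustration; the application does not spell out explicit expressions) is:

```latex
L_{\mathrm{cooperate}} = \Big\lVert A - \sum_{i} p_i\, o_i \Big\rVert^2, \qquad
L_{\mathrm{compete}} = \sum_{i} p_i\, \lVert A - o_i \rVert^2 .
```

- In the cooperative form, the selected experts jointly reconstruct the true value A; in the competitive form, each selected expert is trained to match A on its own, and its gate weight p_i decides how much its individual error contributes to the loss.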
- The central node also indicates the training task of the distributed training, and the training task can be understood as the function of the global model, that is, what the trained global model can be used for.
- the training task can be beam prediction, channel recovery, channel prediction, etc.
- the input of the global model may be, for example, channel quality, RSRP of the beam, etc.
- the output of the global model may be, for example, the optimal beam index, etc.
- the central node can also send model parameter information of the global model to the child nodes.
- When a child node performs training, it is necessary to input sample data into the N expert models and the gated network model. Therefore, before executing this step, the child node collects different types of sample data in different application scenarios.
- the subnodes in the federated learning architecture can be network devices, while the central node can be an independent federated learning management node; or the subnodes can be terminal devices, while the central node can be a network device that functions as a central node.
- the global model to be trained is an AI/ML model that takes the estimated channel measurement value or the received signal itself as input and the optimal beam index as output
- the subnode is responsible for collecting the channel measurement value or received signal as the model input and the label used for training the model, that is, the optimal beam index, during the data collection stage.
- All possible beams can be sent to the terminal device one by one through the network device, and the terminal device selects the beam direction index with the best performance (the best can refer to the beam with the largest physical layer-reference signal receiving power (layer1-reference signal receiving power, L1-RSRP) or signal-to-noise ratio (SNR) measurement value among all SSB/CSI-RS beams) as a label.
- SSB synchronization signal block
- CSI-RS channel state information-reference signal
- L1-RSRP layer1-reference signal receiving power
- SNR signal-to-noise ratio
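- A minimal sketch of the label-collection step described above, assuming the terminal device already holds one L1-RSRP measurement per swept beam (the array names and values are illustrative):

```python
import numpy as np

def best_beam_label(l1_rsrp_dbm: np.ndarray) -> int:
    """Return the index of the SSB/CSI-RS beam with the largest L1-RSRP,
    which serves as the training label for the beam-prediction model."""
    return int(np.argmax(l1_rsrp_dbm))

# Example: L1-RSRP (dBm) measured on 8 swept beams
rsrp = np.array([-92.1, -88.4, -95.0, -84.7, -90.3, -86.2, -97.8, -89.9])
label = best_beam_label(rsrp)   # -> 3, the index of the strongest beam
```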
- the central node can also configure downlink resources for the subnodes to send the initial global model information of the central node.
- the downlink resources can be control channel resources, such as PDCCH resources; or data channel resources, such as PDSCH resources.
- the downlink resources include frequency domain resource block number, starting position, subband number, subband bandwidth, frequency hopping parameters, modulation and coding scheme (MCS) and other parameters.
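- Purely as an illustration of the parameters listed above (the field names and values are assumptions, not signaling defined by this application), a downlink resource configuration might look like:

```python
pdsch_config = {
    "num_resource_blocks": 24,   # frequency domain resource block number
    "start_rb": 10,              # starting position
    "num_subbands": 4,           # subband number
    "subband_bandwidth_rb": 6,   # subband bandwidth
    "frequency_hopping": False,  # frequency hopping parameters
    "mcs_index": 16,             # modulation and coding scheme (MCS)
}
```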
- the global model can be sent down by the central node in a broadcast or multicast manner.
- In a single-cell federated learning architecture where the central node is a network device and the sub-nodes are terminal devices, the global model can be sent down in a broadcast manner; due to the characteristics of broadcast, sub-nodes that are not involved in the federated learning can also receive the broadcast information. In a multi-cell federated learning architecture where a network device with a federated learning management function serves as the central node and other network devices serve as sub-nodes, the central node can also send the global model to each sub-node in a broadcast manner.
- Since sub-nodes that are not involved in the federated learning can also receive the broadcast information, multicast can instead be used for the sub-nodes participating in the federated learning.
- Sub-nodes associated with the same central node are grouped together, have the same group number, and are configured with the same downlink resources. In multicast mode, sub-nodes that do not participate in the federated learning will not receive the multicast information.
- The central node can also configure uplink resources for the subnodes to report their local models, that is, to report models/gradients/weights.
- Alternatively, another federated learning management node can configure uplink resources for the central node and the subnodes, which the subnodes use to report local model information and necessary signaling. Similar to the downlink resource configuration, the uplink resources can be control channel resources, such as PUCCH resources, or data channel resources, such as PUSCH resources.
- After the child node collects a certain amount of sample data, it inputs the sample data into the N expert models and the gated network model to obtain the N first output results of the N expert models and the N second output results of the gated network model. Since the N outputs of the gated network model are connected to the outputs of the N expert models, the N second output results correspond to the N first output results one by one.
- For each second output result among the N second output results that satisfies the selection parameter, the child node selects the first output result corresponding to that second output result.
- the child node determines whether to retain or discard the second output result according to the selection parameter.
- the selection parameter is a first threshold, and the child node determines whether the second output result is greater than or equal to the first threshold. If so, the second output result is retained; otherwise, the second output result is discarded.
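- A minimal numpy sketch of this selection step, assuming the second output results have already been squashed to [0, 1] and using a first threshold of 0.5 (all names and values are illustrative):

```python
import numpy as np

def select_outputs(first_outputs: np.ndarray, second_outputs: np.ndarray, threshold: float = 0.5):
    """first_outputs:  (N, D) outputs of the N expert models for one sample
       second_outputs: (N,)   outputs of the gated network model, one per expert
       Returns the retained expert outputs, their gate values and their expert indices."""
    keep = second_outputs >= threshold            # retain when >= the first threshold
    return first_outputs[keep], second_outputs[keep], np.flatnonzero(keep)

# Example with N = 4 experts and 2-dimensional expert outputs
experts = np.array([[0.2, 0.8], [0.5, 0.1], [0.9, 0.4], [0.3, 0.3]])
gates   = np.array([0.7, 0.2, 0.6, 0.4])
sel_out, sel_gate, sel_ids = select_outputs(experts, gates)   # experts 0 and 2 are retained
```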
- the subnode obtains model parameter information of at least one selected expert model and model parameter information of the gated network model based on the selected at least one first output result and the true value information.
- the model parameter information of the expert model includes the weight, gradient, gradient change, etc. of the expert model.
- the model parameter information of the gated network model includes the weight, gradient, gradient change, etc. of the gated network model.
- the child node obtains the model parameter information of the selected at least one expert model and the model parameter information of the gated network model based on the selected at least one first output result and the true value information, and there are two possible implementation methods:
- the child node obtains the model parameter information of the selected at least one expert model and the model parameter information of the gated network model based on the average value and true value information of the selected at least one first output result. For example, if the first threshold is 0.5, the child node can retain the second output results of the gated network model whose values are greater than 0.5, and select at least one first output result of the expert model corresponding to the retained at least one second output result. The child node multiplies each of the selected at least one first output results by 0.5, sums and averages their products, and obtains the average value of the selected at least one first output result.
- the child node weights and averages the at least one selected first output result based on the at least one second output result corresponding to the at least one selected first output result, obtains the weighted average value of the at least one selected first output result, and obtains the model parameter information of the at least one selected expert model and the model parameter information of the gated network model based on the weighted average value and true value information of the at least one selected first output result.
- For example, if the first threshold is 0.5, the child node can retain the second output results of the gated network model whose values are greater than 0.5, and select the at least one first output result of the expert models corresponding to the at least one retained second output result. The child node then multiplies each first output result of the at least one selected first output result by its corresponding second output result (for example, some second output results are 0.6 and some are 0.8), and sums and averages their products to obtain the weighted average value of the at least one selected first output result.
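- The two combination options above can be sketched as follows (again illustrative numpy code under the same assumptions; the 0.5 factor in option 1 is the first threshold from the example):

```python
import numpy as np

def combine_simple(selected_outputs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Option 1: multiply every selected expert output by the first threshold and average."""
    return (selected_outputs * threshold).mean(axis=0)

def combine_weighted(selected_outputs: np.ndarray, selected_gates: np.ndarray) -> np.ndarray:
    """Option 2: weight every selected expert output by its own gate value and average."""
    return (selected_outputs * selected_gates[:, None]).mean(axis=0)

# Outputs of two retained experts and their gate values (e.g. 0.6 and 0.8)
selected_outputs = np.array([[0.2, 0.8], [0.9, 0.4]])
selected_gates   = np.array([0.6, 0.8])
y1 = combine_simple(selected_outputs)                    # option 1
y2 = combine_weighted(selected_outputs, selected_gates)  # option 2
# either value is then compared with the true value to compute the loss and hence
# the gradients/weights of the selected experts and of the gated network model
```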
- the subnode sends the second information to the central node.
- the central node receives the second information.
- After the child node obtains the model parameter information of the at least one selected expert model and the model parameter information of the gated network model, it sends second information to the central node, wherein the second information is used to indicate the model parameter information of the gated network model and the model parameter information of the at least one expert model selected from the N expert models.
- the second information is also used to indicate the identifier of the selected at least one expert model.
- the subnode carries the indication information of the identifier of the selected at least one expert model in the second information, so that the central node performs model aggregation based on the indication information.
- the central node updates the global model and the selection parameters based on the plurality of second information received from the plurality of child nodes.
- the central node receives multiple pieces of second information from multiple child nodes respectively, and can update the global model based on the multiple pieces of second information.
- the central node can further update the selection parameters based on the updated global model. For example, during initial training, the first threshold configured by the central node is relatively large, and the first threshold is updated based on information such as training feedback from the child nodes. The first threshold can also be obtained based on neural network learning.
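- A sketch of how the central node might fuse the reports, assuming each second information carries the gated-network parameters plus one parameter (or gradient) vector per reported expert identifier (the message layout is an assumption made for illustration):

```python
import numpy as np

def aggregate(reports):
    """reports: one dict per child node, e.g. {"gate": np.ndarray, "experts": {expert_id: np.ndarray}}.
       The gate parameters are averaged over all child nodes; each expert is averaged only
       over the child nodes that actually selected (and therefore reported) that expert."""
    gate_sum, gate_cnt, expert_sum, expert_cnt = None, 0, {}, {}
    for rep in reports:
        gate_sum = rep["gate"] if gate_sum is None else gate_sum + rep["gate"]
        gate_cnt += 1
        for eid, w in rep["experts"].items():
            expert_sum[eid] = expert_sum.get(eid, 0) + w
            expert_cnt[eid] = expert_cnt.get(eid, 0) + 1
    return {"gate": gate_sum / gate_cnt,
            "experts": {eid: expert_sum[eid] / expert_cnt[eid] for eid in expert_sum}}

# Two child nodes report the gate parameters plus the experts each of them selected
r1 = {"gate": np.array([0.1, 0.2]), "experts": {0: np.array([1.0, 1.0]), 2: np.array([0.5, 0.5])}}
r2 = {"gate": np.array([0.3, 0.4]), "experts": {0: np.array([3.0, 3.0])}}
updated = aggregate([r1, r2])   # expert 0 averaged over both nodes, expert 2 over the first only
```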
- the central node sends the fourth information.
- the subnode receives the fourth information.
- The fourth information may be sent to each child node, wherein the fourth information is used to indicate the updated selection parameters, so that each child node selects the output result of the gated network model based on the updated selection parameters during the next round of training.
- the distributed training system may execute the above steps S1002 to S1008 multiple times until the global model converges.
- the global model includes a general layer Net_com, three expert models Net1/2/3, and a fully connected network Net_FC.
- The central node (e.g., a base station) sends model parameter information of the global model to each child node, including Net_com, Net1/2/3, and Net_FC.
- Their corresponding neural network gradient/weight data are represented as W_com, W_1/W_2/W_3, and W_FC.
- After local training, UE1 obtains W_com(UE1), W_1(UE1)/W_2(UE1)/W_3(UE1) and W_FC(UE1), and UE2 obtains W_com(UE2), W_1(UE2)/W_2(UE2)/W_3(UE2) and W_FC(UE2).
- the sub-nodes (UE1, UE2) feed back the above weights (which may also be gradients or gradient changes) to the central node (base station).
- After the central node obtains the model parameter information from the two child nodes, it maps the information onto the same global model and starts to fuse the two sets of parameters.
- FIG. 11 is a schematic diagram of the optimal beam training of an example embodiment of the present application.
- the network device side performs beam scanning of the synchronization signal block or CSI-RS based on the codebook
- The channels between the users at different locations (UE1 to UEN, which may also include training nodes with similar data characteristics to these UEs) and the network device are different.
- the network device sends a unified global model to users at different locations. Users at different locations measure the received SSB or CSI-RS beam, for example, measure L1-RSRP, and feedback the beam identifier corresponding to the maximum RSRP value.
- AI/ML can be used to train a model, using, for example, a plurality of received SSB/CSI-RS signals (part or all of them), or the strength (RSRP) of a plurality of received SSB/CSI-RS signals (part or all of them) or the estimated channel as input to infer the optimal beam ID and feed it back to the network device.
- RSRP reference signal receiving power
- the global model includes a gated network model and N expert models, and may also include other layers (i.e., the above-mentioned general layers, which may be trained or not), and the other layers are connected to the gated network model and the N expert models.
- the gated network model has N outputs, and each output is connected to the output of an expert model.
- each user inputs the measured RSRP into other layers, and the training output results of other layers are used as the common input of the gated network model and the N expert models.
- Each user trains the gated network model and the N expert models based on the input.
- Before training, the network device sends the selection parameters to each user, and each user determines whether to retain the N second output results of the gated network model according to the selection parameters, and obtains the first output result of the at least one expert model corresponding to the at least one retained second output result. Then, each user obtains the model parameter information of the selected at least one expert model and the model parameter information of the gated network model based on the selected at least one first output result and the true value information, and sends them to the network device. Furthermore, each user obtains an optimal beam identifier based on the selected at least one first output result and the true value information.
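- A compact PyTorch-style sketch of the global model structure described above (the layer sizes, the sigmoid gate and the beam-prediction head are assumptions made for illustration, not a definitive implementation):

```python
import torch
import torch.nn as nn

class GlobalModel(nn.Module):
    def __init__(self, in_dim: int, hidden: int, num_experts: int, num_beams: int):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # other/general layers
        self.gate = nn.Linear(hidden, num_experts)                         # gated network model, N outputs
        self.experts = nn.ModuleList(
            nn.Linear(hidden, num_beams) for _ in range(num_experts))      # N expert models

    def forward(self, rsrp: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        h = self.common(rsrp)                      # common input of the gate and the experts
        g = torch.sigmoid(self.gate(h))            # second output results, squashed to [0, 1]
        outs = torch.stack([e(h) for e in self.experts], dim=1)  # first output results, (B, N, num_beams)
        mask = (g >= threshold).float()            # retain/discard according to the selection parameter
        weights = g * mask
        combined = (outs * weights.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return combined                            # scores over the candidate beam indices

# Example: 8 measured RSRP values in, 3 experts, 16 candidate beams
model = GlobalModel(in_dim=8, hidden=32, num_experts=3, num_beams=16)
scores = model(torch.randn(4, 8))
best_beam = scores.argmax(dim=-1)                  # predicted optimal beam index per sample
```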
- the central node is a third-party device that performs the aforementioned actions related to the central node.
- the above steps S1001-S1002, S1006-S1008 are all performed by a third-party device.
- the subnode is a third-party device that performs the aforementioned subnode-related actions.
- the above steps S1001-S1008 are all performed by a third-party device.
- the central node is a network device.
- the network device can complete the training of the model.
- the above steps S1001-S1002 and S1006-S1008 are all performed by the network device.
- In this application, "sending information to ..." (e.g., to a child node) means that the destination end of the information is the child node, and includes sending the information to the child node directly or indirectly.
- Similarly, "receiving information from ..." (e.g., from a child node) means that the source end of the information is the child node, and includes receiving the information from the child node directly or indirectly.
- the information may be processed as necessary between the source end and the destination end of the information transmission, such as format changes, etc., but the destination end can understand the valid information from the source end. Similar expressions in this application can be understood similarly and will not be repeated here.
- the embodiment of the present application can divide the functional modules of the distributed training device according to the above method embodiment.
- each functional module can be divided according to each function, or two or more functions can be integrated into one processing unit.
- the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. It should be noted that the division of modules in the embodiment of the present application is schematic and is only a logical functional division. There may be other division methods in actual implementation.
- the present application also provides the following distributed training device:
- the distributed training device 1200 includes a transceiver unit 1201 and a processing unit 1202; wherein:
- the transceiver unit 1201 is used to perform one or more of the operations of the subnode in steps S1001, S1002, S1006 and S1008 of the embodiment shown in Figure 10, and the processing unit 1202 is used to perform one or more of steps S1003-S1005 of the embodiment shown in Figure 10.
- the distributed training device can be a terminal device, or a third-party device such as an OTT or cloud server, or a system composed of a terminal device and a third-party device.
- the transceiver unit 1201 is used to perform one or more of the operations of the central node in steps S1001, S1002, S1006 and S1008 of the embodiment shown in Figure 10, and the processing unit 1202 is used to perform step S1007 of the embodiment shown in Figure 10.
- the distributed training device can be a network device, or a third-party device such as an OTT or cloud server, or a system consisting of a network device and a third-party device.
- the aforementioned transceiver unit and/or processing unit can be implemented through a virtual module, for example, the processing unit can be implemented through a software function unit or a virtual device, and the transceiver unit can be implemented through a software function or a virtual device.
- the processing unit or the transceiver unit can also be implemented through a physical circuit, for example, if the device is implemented using a chip/chip circuit, the transceiver unit can be an input-output circuit and/or a communication interface, performing input operations (corresponding to the aforementioned receiving operations) and output operations (corresponding to the aforementioned sending operations); the processing unit is a processing circuit, such as an integrated processor or microprocessor or integrated circuit.
- the distributed training device 1300 includes one or more processing circuits 1301 (one processing circuit is illustrated in the figure).
- the distributed training device 1300 may also include a memory 1303 (indicated by a dotted line in the figure).
- the memory 1303 is used to store instructions executed by the processing circuit 1301, or to store input data required for the processing circuit 1301 to run instructions, or to store data generated after the processing circuit 1301 runs instructions.
- the distributed training device 1300 may also include an interface circuit 1302 (indicated by a dotted line in the figure), and the processing circuit 1301 and the interface circuit 1302 are coupled to each other. It can be understood that the interface circuit 1302 can be a transceiver or an input-output interface.
- the processing circuit may be a processor or a circuit in a processor used for processing.
- the interface circuit 1302 is used to execute one or more of the operations of the sub-node in steps S1001, S1002, S1006 and S1008 of the embodiment shown in Figure 10
- the processing circuit 1301 is used to execute one or more of steps S1003-S1005 of the embodiment shown in Figure 10.
- the interface circuit 1302 is used to execute one or more of the operations of the central node in steps S1001, S1002, S1006 and S1008 of the embodiment shown in Figure 10, and the processing circuit 1301 is used to execute step S1007 of the embodiment shown in Figure 10.
- the chip implements the function of the central node in the above-mentioned method embodiment.
- the chip receives information from other modules in the central node, and the information is sent by the subnode to the central node; or, the chip sends information to other modules in the central node, and the information is sent by the central node to the subnode.
- the chip implements the functions of the subnode in the above-mentioned method embodiment.
- the chip receives information from other modules in the subnode, and the information is sent from the central node to the subnode; or, the chip sends information to other modules in the subnode, and the information is sent from the subnode to the central node.
- the module of the subnode here can be the baseband chip of the subnode, or, the baseband chip and the processing chip.
- the processing chip can be used to implement AI training.
- the module of the subnode here can be the processing chip of the third-party device.
- the processing chip can be used to implement AI training.
- An embodiment of the present application further provides a computer-readable storage medium, in which a computer program or instruction is stored. When the computer program or instruction is executed, the method in the above embodiment is implemented.
- the embodiments of the present application also provide a computer program product including instructions, which, when executed on a computer, enables the computer to execute the method in the above embodiments.
- the embodiment of the present application further provides a chip system, including: at least one processor and an interface, the at least one processor is coupled to a memory via the interface, and when the at least one processor runs a computer program or instruction in the memory, the chip system executes a method in any of the above method embodiments.
- the chip system may be composed of a chip, or may include a chip and other discrete devices, which is not specifically limited in the embodiment of the present application.
- A/B can represent A or B; wherein A and B can be singular or plural.
- multiple refers to two or more than two.
- "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single items or plural items.
- at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c can be single or multiple.
- The above embodiments can be implemented in whole or in part by software, hardware, firmware or any combination thereof.
- When implemented using a software program, they can be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer program instructions When the computer program instructions are loaded and executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) mode to another website site, computer, server or data center.
- wired e.g., coaxial cable, optical fiber, digital subscriber line (DSL)
- wireless e.g., infrared, wireless, microwave, etc.
Abstract
Description
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on December 26, 2023, with application number 202311819054.6 and invention name "Distributed training method, device, system, chip module and storage medium", all contents of which are incorporated by reference in this application.
The present application relates to artificial intelligence (AI), and in particular to a distributed training method, device, system, chip module and storage medium.
Federated learning (FL) is a machine learning paradigm for distributed training. In the federated learning framework, different sub-nodes train a public model based on local data to obtain a local model that can better fit the characteristics of the local data. However, after the local models of these sub-nodes are aggregated by the central node, the generalization of the entire model is improved, but the public model's ability to fit the local data characteristics of each sub-node decreases, and personalized enhancement is needed.
Therefore, a mixture of experts (MOE) model is proposed. The global model issued by the central node includes multiple expert models. Each child node selects at least a part of the output results of the expert model based on the output results of the gated network model to train the global model, obtains gradient information, and reports the gradient information to the central node for model aggregation.
However, if each child node has a different understanding of how to use the output of the gating network model and how to train the expert model, the performance of the aggregation model of the central node will be degraded.
In view of this, for federated learning scenarios involving hybrid expert models, how to improve the performance of the aggregation model of the central node is an urgent problem to be solved.
The present application provides a distributed training method, device, system, chip module and storage medium to improve the performance of the aggregation model of the central node.
In a first aspect, a distributed training method is provided, the method being applied to a distributed training system, the global model of the distributed training system comprising N expert models and one gated network model, wherein N is a positive integer, the method comprising: receiving first information, the first information being used to indicate a selection parameter, the selection parameter being used to select an output result of the gated network model; and sending second information, the second information being used to indicate model parameter information of the gated network model and model parameter information of at least one expert model selected from the N expert models, the selected at least one expert model being obtained based on the output result of the gated network model.
In this aspect, all sub-nodes participating in the training receive the selection parameters sent by the central node, and during the training process of the gated network model and the expert model, the selection parameters are applied to the output results of the gated network model, thereby enabling each sub-node to align the selection of the output results of the gated network model and improving the performance of the aggregation model of the central node.
Exemplarily, the output result of the gated network model is a value between [0, 1] after being processed by a set function.
In a possible implementation, the method further includes: inputting sample data into the N expert models and the gated network model to obtain N first output results of the N expert models and N second output results of the gated network model, wherein the N second output results correspond one-to-one to the N first output results respectively; for each first output result of the N first output results and each second output result of the N second output results, for the second output result that satisfies the selection parameter, selecting the first output result corresponding to the second output result; and based on the selected at least one first output result and the true value information, obtaining model parameter information of the selected at least one expert model and the model parameter information of the gated network model.
In another possible implementation, the selection parameter is a first threshold.
In this implementation, for example, the value range of the first threshold can be (0,1). If the output result of the gated network model is greater than the first threshold, the output result of the gated network model is retained; if the output result of the gated network model is less than the first threshold, the output result of the gated network model is discarded. In particular, if the output result of the gated network model is equal to the first threshold, it can be agreed to retain or discard the output result of the gated network model.
In another possible implementation, the first information is further used to indicate at least one of the following information: identifications of the N expert models, a competition mode of the N expert models, a training task, and types of input and output of the global model.
In this implementation, the competition mode can also be called a usage mode, a cooperation mode, a collaboration mode, an indication of whether to collaborate, etc. The competition mode indicates whether the N expert models collaborate or compete. Since the same training node obtains different gate weights and expert weights by training the hybrid expert model based on different competition modes, it is hoped that all child nodes participating in the training use/train the expert model in at least the same way. Therefore, the central node can indicate the competition mode of the N expert models to the child nodes.
In another possible implementation, the method further includes: sending third information, wherein the third information is used to indicate at least one of the following information: the memory space size of the child node, the computing power information of the child node, whether model training is supported, and the type of model supported for training.
In yet another possible implementation, the second information is further used to indicate an identifier of the selected at least one expert model.
In another possible implementation, the acquiring, based on the selected at least one first output result and the true value information, model parameter information of the selected at least one expert model and model parameter information of the gated network model comprises any one of the following operations: acquiring model parameter information of the selected at least one expert model and model parameter information of the gated network model based on an average value and true value information of the selected at least one first output result; or weighting and averaging the selected at least one first output result based on at least one second output result respectively corresponding to the selected at least one first output result to obtain a weighted average value of the selected at least one first output result, and acquiring model parameter information of the selected at least one expert model and model parameter information of the gated network model based on the weighted average value and true value information of the selected at least one first output result.
In yet another possible implementation, the method further includes: receiving fourth information, where the fourth information is used to indicate an updated selection parameter, and the updated selection parameter is obtained based on the second information.
In this implementation, if the central node updates the selection parameters, fourth information can be sent to each child node, where the fourth information is used to indicate the updated selection parameters, so that each child node selects the output result of the gated network model based on the updated selection parameters during the next round of training.
In a second aspect, a distributed training method is provided, which is applied to a distributed training system, wherein a global model of the distributed training system includes N expert models and a gated network model, wherein N is a positive integer, and the method includes: sending first information to multiple child nodes, wherein the first information is used to indicate a selection parameter, and the selection parameter is used to select an output result of the gated network model; and receiving multiple second information respectively, wherein each of the multiple second information is used to indicate model parameter information of the gated network model and model parameter information of at least one expert model selected from the N expert models, and the selected at least one expert model is obtained based on the output result of the gated network model.
In this aspect, the central node indicates the selection parameters so that all sub-nodes participating in the training will apply the selection parameters to the output results of the gated network model during the training of the gated network model and the expert model, thereby enabling each sub-node to align the selection of the output results of the gated network model and improve the performance of the aggregation model of the central node.
Exemplarily, the output result of the gated network model is a value between [0, 1] after being processed by a set function.
In a possible implementation, the selection parameter is a first threshold.
In another possible implementation, the first information is further used to indicate at least one of the following information: identifications of the N expert models, a competition mode of the N expert models, a training task, and types of input and output of the global model.
In another possible implementation, the method further includes: receiving third information, where the third information is used to indicate at least one of the following information: the memory space size of the child node, the computing power information of the child node, whether model training is supported, and the type of model supported for training.
In yet another possible implementation, the second information is further used to indicate an identifier of the selected at least one expert model.
In yet another possible implementation, the method further includes: updating the global model based on the plurality of second information.
In yet another possible implementation, the method further includes: updating the selection parameter based on the plurality of second information; and sending fourth information, where the fourth information is used to indicate the updated selection parameter.
In a third aspect, a distributed training device is provided for implementing the distributed training method in the above-mentioned first aspect or any one of the implementations of the first aspect. The device may be a sub-node/third-party device, or a module (such as a processor, chip, or chip system, etc.) applied to a sub-node/third-party device, or a logical node, logical module or software that can implement all or part of the functions of a sub-node/third-party device. In one implementation, the distributed training device may include a sending unit, a receiving unit, and may also include a processing unit. The sending unit and the receiving unit may be independent or combined together (which may be referred to as a "transceiver unit").
In a fourth aspect, a distributed training device is provided for implementing the distributed training method in the second aspect or any one of the implementations of the second aspect. The device may be a central node, or a module applied to the central node (such as a processor, a chip, or a chip system, etc.), or a logical node, a logical module, or software that can implement all or part of the functions of the central node. In one implementation, the distributed training device may include a sending unit, a receiving unit, and may also include a processing unit. The sending unit and the receiving unit may be independent or combined together (which may be referred to as a "transceiver unit").
In a possible implementation, the distributed training device in the third to fourth aspects includes a unit for respectively executing the method in any aspect or any implementation of the first to second aspects.
In another possible implementation, the distributed training device in the third to fourth aspects above includes a processing circuit coupled to a memory; the processing circuit is configured to enable the device to perform corresponding functions in the above distributed training method. The memory is used to couple with the processing circuit, and it stores the necessary programs (instructions) and/or data for the device. Optionally, the distributed training device may further include a communication interface for enabling communication between the device and other network elements. Optionally, the memory may be located inside the distributed training device or outside the distributed training device. Exemplarily, the processing circuit may be a processor or a circuit in a processor used for processing.
When the distributed training device in the third to fourth aspects is a chip, the sending unit may be an output unit, such as an output circuit or a communication interface; the receiving unit may be an input unit, such as an input circuit or a communication interface. When the distributed training device is a terminal device, the sending unit may be a transmitter and the receiving unit may be a receiver.
In a fifth aspect, a computer-readable storage medium is provided, in which a computer program or instruction is stored. When the computer program or instruction is executed, the methods described in the above aspects are implemented.
In a sixth aspect, a computer program product comprising instructions is provided, which, when executed on a distributed training device, enables the distributed training device to execute the methods described in the above aspects.
In a seventh aspect, a distributed training system is provided, which includes the distributed training device described in the third aspect and the distributed training device described in the fourth aspect.
FIG. 1 is a schematic diagram of the architecture of a distributed training system provided in an embodiment of the present application;
FIG. 2 is a simplified schematic diagram of a wireless communication system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of another distributed training system provided by the present application;
FIG. 4A to FIG. 4D are schematic diagrams of a network architecture provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a neuron structure;
FIG. 6 is a schematic diagram of a neural network;
FIG. 7 is a schematic diagram of an AI application framework;
FIG. 8 is a schematic diagram of the architecture of another communication system provided in an embodiment of the present application;
FIG. 9A to FIG. 9E are schematic diagrams of the structure of a global model provided in an embodiment of the present application;
FIG. 10 is a schematic flow chart of a distributed training method provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of optimal beam training according to an example of an embodiment of the present application;
FIG. 12 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present application;
FIG. 13 is a schematic diagram of the structure of another distributed training device provided in an embodiment of the present application.
The embodiments of the present application are described below in conjunction with the drawings in the embodiments of the present application.
The embodiments of the present application can be applied to a distributed training system as shown in Figure 1, which includes a central node and multiple child nodes. Exemplarily, the distributed training system can be a federated learning system or a Gossip learning system. Model parameter information can be transmitted between the central node and each child node. The distributed training system can also include a third-party device (not shown in the figure), which can serve as a model training entity for the central node/child node.
The machine learning model trained by the distributed training system can be for non-wireless communication services, such as image recognition, natural language processing, etc., or for wireless communication services, such as beam selection based on environmental information.
The technology provided in this application can be applied to various communication systems, for example, the communication system can be a fourth generation (4G) communication system (such as a long term evolution (LTE) system), a fifth generation (5G) communication system, a worldwide interoperability for microwave access (WiMAX) system, a wireless local area network (WLAN) system, a satellite communication system, a fusion system of multiple systems, or a future communication system, such as a sixth generation (6G) communication system. Among them, the 5G communication system can also be called a new radio (NR) system.
A network element in a communication system can send a signal to another network element or receive a signal from another network element. The signal may include information, signaling, or data, etc. The network element may also be replaced by an entity, a network entity, a device, a terminal device, a communication module, a node, a communication node, etc. The network element is used as an example for description in this application. For example, a communication system may include at least one terminal device and at least one network device. The network device may send a downlink signal to the terminal device, and/or the terminal device may send an uplink signal to the network device. In addition, it can be understood that if a plurality of terminal devices are included in the communication system, the plurality of terminal devices may also send signals to each other, that is, the signal sending network element and the signal receiving network element may both be terminal devices.
Referring to FIG. 2, FIG. 2 is a simplified schematic diagram of a wireless communication system provided in an embodiment of the present application. As shown in FIG. 2, the wireless communication system includes a wireless access network 100. The wireless access network 100 may be a next generation (e.g., 6G or higher) wireless access network, or a traditional (e.g., 5G, 4G) wireless access network. One or more terminal devices (120a-120j, collectively referred to as 120) may be connected to each other, or to one or more network devices (110a, 110b, collectively referred to as 110) in the wireless access network 100. Optionally, FIG. 2 is only a schematic diagram, and other devices may also be included in the wireless communication system, such as core network devices, wireless relay devices, and/or wireless backhaul devices, which are not shown in FIG. 2.
Optionally, in practical applications, the wireless communication system may include multiple network devices (also referred to as access network devices) at the same time, and may also include multiple terminal devices at the same time. A network device may serve one or more terminal devices at the same time. A terminal device may also access one or more network devices at the same time. The embodiment of the present application does not limit the number of terminal devices and network devices included in the wireless communication system.
其中,网络设备可以是网络侧的一种用于发射或接收信号的实体。网络设备可以为终端设备通过无线方式接入到该无线通信系统中的接入设备,如网络设备可以是基站。基站可以广义的覆盖如下中的各种名称,或与如下名称进行替换,比如:无线接入网(radio access network,RAN)节点、节点B(NodeB)、演进型基站(evolved NodeB,eNB)、下一代基站(next generation NodeB,gNB)、开放无线接入网(open radio access network,O-RAN)中的网络设备、中继站、接入点、传输点(transmitting and receiving point,TRP)、发射点(transmitting point,TP)、主站(master eNB,MeNB)、辅站(secondary eNB,SeNB)、多制式无线(multi-standard radio,MSR)节点、家庭基站、网络控制器、接入节点、无线节点、接入点(access point,AP)、传输节点、收发节点、基带单元(building baseband unit,BBU)、射频拉远单元(remote radio unit,RRU)、有源天线单元(active antenna unit,AAU)、射频头(remote radio head,RRH)、集中式单元(centralized unit,CU)、分布式单元(distributed unit,DU)、无线单元(radio unit,RU)、集中式单元控制面(CU control plane,CU-CP)节点、集中式单元用户面(CU user plane,CU-UP)节点、定位节点、RAN智能控制器(RAN intelligent controller,RIC)等。基站可以是宏基站、微基站、中继节点、施主节点或类似物,或其组合。网络设备还可以指用于设置于前述设备或装置内的通信模块、调制解调器或芯片。网络设备还可以是移动交换中心以及设备到设备(device-to-device,D2D)、车辆外联(vehicle-to-everything,V2X)、机器到机器(machine-to-machine,M2M)通信中承担基站功能的设备、6G网络中的网络侧设备、未来的通信系统中承担基站功能的设备等。网络设备可以支持相同或不同接入技术的网络。本申请的实施例对网络设备所采用的具体技术和具体设备形态不做限定。Among them, the network device can be an entity on the network side for transmitting or receiving signals. The network device can be an access device for the terminal device to access the wireless communication system by wireless means, such as the network device can be a base station. The base station can broadly cover the following various names, or be replaced with the following names, such as: radio access network (RAN) node, node B (NodeB), evolved NodeB (evolved NodeB, eNB), next generation NodeB (next generation NodeB, gNB), network equipment in open radio access network (open radio access network, O-RAN), relay station, access point, transmission point (transmitting and receiving point, TRP), transmitting point (transmitting point, TP), master eNB (MeNB), secondary eNB (SeNB), multi-standard radio (multi-standard radio, MSR) node, home base station, network controller, access Node, wireless node, access point (AP), transmission node, transceiver node, building baseband unit (BBU), remote radio unit (RRU), active antenna unit (AAU), remote radio head (RRH), centralized unit (CU), distributed unit (DU), radio unit (RU), centralized unit control plane (CU-CP) node, centralized unit user plane (CU-UP) node, positioning node, RAN intelligent controller (RIC), etc. The base station can be a macro base station, a micro base station, a relay node, a donor node or the like, or a combination thereof. The network device can also refer to a communication module, a modem or a chip used to be set in the aforementioned device or apparatus. The network device may also be a mobile switching center and a device that performs base station functions in device-to-device (D2D), vehicle-to-everything (V2X), and machine-to-machine (M2M) communications, a network-side device in a 6G network, and a device that performs base station functions in future communication systems. The network device may support networks with the same or different access technologies. The embodiments of the present application do not limit the specific technology and specific device form used by the network device.
The network device may be fixed or mobile. For example, the base stations 110a and 110b are stationary and are responsible for wireless transmission to and reception from the terminal device 120 in one or more cells. The helicopter or drone 120i shown in Figure 2 may be configured to act as a mobile base station, and one or more cells may move according to the location of the mobile base station 120i. In other examples, the helicopter or drone 120i may be configured to act as a terminal device that communicates with the base station 110b.
In the present application, the communication apparatus for implementing the above access network functions may be a network device, a network device having some of the access network functions, or an apparatus capable of supporting the implementation of the access network functions, for example, a chip system, a hardware circuit, a software module, or a hardware circuit plus a software module; the apparatus may be installed in a network device or used in combination with a network device. In the methods of the present application, the description takes the case in which the communication apparatus for implementing the network device functions is a network device as an example.
The terminal device may be an entity on the user side for receiving or transmitting signals, such as a mobile phone. The terminal device may be used to connect people, objects, and machines, and may communicate with one or more core networks through a network device. The terminal device includes a handheld device with a wireless connection function, another processing device connected to a wireless modem, or a vehicle-mounted device, and may be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile apparatus. The terminal device 120 may be widely used in various scenarios, for example, cellular communication, D2D, V2X, point-to-point (P2P), machine-to-machine (M2M), machine type communication (MTC), Internet of things (IoT), virtual reality (VR), augmented reality (AR), industrial control, automatic driving, telemedicine, smart grid, smart furniture, smart office, smart wearables, smart transportation, smart city, drones, robots, remote sensing, passive sensing, positioning, navigation and tracking, autonomous delivery and mobility, and the like. Some examples of the terminal device 120 are: a user equipment (UE) of the 3GPP standard, a fixed device, a mobile device, a handheld device, a wearable device, a cellular phone, a smart phone, a session initiation protocol (SIP) phone, a laptop, a personal computer, a smart book, a vehicle, a satellite, a global positioning system (GPS) device, a target tracking device, a drone, a helicopter, an aircraft, a ship, a remote control device, a smart home device, an industrial device, a personal communication service (PCS) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a wireless network camera, a tablet computer, a palmtop computer, a mobile Internet device (MID), a wearable device such as a smart watch, a VR device, an AR device, a wireless terminal in industrial control, a terminal in an Internet of vehicles system, a wireless terminal in self driving, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city such as a smart fuel dispenser, a terminal device on a high-speed railway, and a wireless terminal in a smart home, such as a smart speaker, a smart coffee machine, or a smart printer. The terminal device 120 may be a wireless device in any of the above scenarios or an apparatus arranged in a wireless device, for example, a communication module, a modem, or a chip in the above device. The terminal device may also be referred to as a terminal, UE, a mobile station (MS), a mobile terminal (MT), or the like. The terminal device may also be a terminal device in a future wireless communication system. The terminal device may be used in a dedicated network device or a general-purpose device. The embodiments of the present application do not limit the specific technology and the specific device form used by the terminal device.
Optionally, the terminal device may be used to act as a base station. For example, a UE may act as a scheduling entity that provides sidelink signals between UEs in V2X, D2D, P2P, or the like. As shown in Figure 2, the cellular phone 120a and the car 120b communicate with each other using sidelink signals, and the cellular phone 120a and the smart home device 120e communicate without relaying the communication signal through the base station 110b.
In the present application, the communication apparatus for implementing the functions of the terminal device may be a terminal device, a terminal device having some of the functions of the above terminal device, or an apparatus capable of supporting the functions of the above terminal device, such as a chip system; the apparatus may be installed in the terminal device or used in combination with the terminal device. In the present application, a chip system may consist of a chip, or may include a chip and other discrete components. In the technical solutions provided in the present application, the description takes the case in which the communication apparatus is a terminal device or a UE as an example.
Optionally, a wireless communication system is usually composed of cells. A base station provides management of a cell and provides communication services to multiple mobile stations (MSs) in the cell. The base station includes a baseband unit (BBU) and a remote radio unit (RRU). The BBU and the RRU may be placed in different locations, for example, the RRU is placed remotely in an area with a high traffic volume while the BBU is placed in a central equipment room; the BBU and the RRU may also be placed in the same equipment room, or may be different components in one rack. Optionally, one cell may correspond to one carrier or component carrier.
In some deployments, the network device mentioned in the embodiments of the present application may be a device including a CU, or a DU, or a device including both a CU and a DU, or a device including a control-plane CU node (centralized unit-control plane, CU-CP), a user-plane CU node (centralized unit-user plane, CU-UP), and a DU node. For example, the network device may include a gNB-CU-CP, a gNB-CU-UP, and a gNB-DU.
In some deployments, multiple RAN nodes cooperate to assist a terminal in achieving wireless access, with different RAN nodes implementing different parts of the base station functions. For example, a RAN node may be a CU, a DU, a CU-CP, a CU-UP, or an RU. The CU and the DU may be set up separately, or may be included in the same network element, such as a BBU. The RU may be included in a radio frequency device or radio frequency unit, for example, in an RRU, an AAU, or an RRH.
A RAN node may support one or more types of fronthaul interfaces, and different fronthaul interfaces correspond to DUs and RUs with different functions. If the fronthaul interface between the DU and the RU is the common public radio interface (CPRI), the DU is configured to implement one or more of the baseband functions, and the RU is configured to implement one or more of the radio frequency functions. If the fronthaul interface between the DU and the RU is another interface, then relative to CPRI, some downlink and/or uplink baseband functions are moved from the DU to the RU: for the downlink, one or more of precoding, digital beamforming (BF), or inverse fast Fourier transform (IFFT)/cyclic prefix (CP) insertion; for the uplink, one or more of digital beamforming (BF), or fast Fourier transform (FFT)/cyclic prefix removal. In a possible implementation, this interface may be the enhanced common public radio interface (eCPRI). Under the eCPRI architecture, different splits between the DU and the RU correspond to different categories (Cat) of eCPRI, such as eCPRI Cat A, B, C, D, E, and F.
Taking eCPRI Cat A as an example, for downlink transmission the split is at layer mapping: the DU is configured to implement layer mapping and one or more of the functions before it (that is, one or more of coding, rate matching, scrambling, modulation, and layer mapping), while the functions after layer mapping (for example, one or more of resource element (RE) mapping, digital beamforming (BF), or inverse fast Fourier transform (IFFT)/cyclic prefix (CP) insertion) are moved to the RU. For uplink transmission the split is at RE demapping: the DU is configured to implement RE demapping and one or more of the functions before it (that is, one or more of decoding, de-rate matching, descrambling, demodulation, inverse discrete Fourier transform (IDFT), channel equalization, and RE demapping), while the other functions after demapping (for example, one or more of digital BF or FFT/CP removal) are moved to the RU. It can be understood that, for the functional description of the DU and the RU corresponding to the various categories of eCPRI, reference may be made to the eCPRI protocol, which is not repeated here.
In one possible design, the processing unit for implementing baseband functions in the BBU is called a baseband high (BBH) unit, and the processing unit for implementing baseband functions in the RRU/AAU/RRH is called a baseband low (BBL) unit.
In different systems, the CU (or the CU-CP and the CU-UP), the DU, or the RU may also have different names, but those skilled in the art can understand their meanings. For example, in an open radio access network (ORAN) system, the CU may also be called an O-CU (open CU), the DU an O-DU, the CU-CP an O-CU-CP, the CU-UP an O-CU-UP, and the RU an O-RU. Any of the CU (or CU-CP, CU-UP), DU, and RU in the present application may be implemented by a software module, a hardware module, or a combination of a software module and a hardware module.
In the embodiments of the present application, the apparatus for implementing the functions of the network device may be a network device, or may be an apparatus capable of supporting the network device in implementing those functions, such as a chip system, a hardware circuit, a software module, or a hardware circuit plus a software module. The apparatus may be installed in the network device or used in combination with the network device. In the embodiments of the present application, the description takes the case in which the apparatus for implementing the functions of the network device is a network device only as an example, which does not limit the solutions of the embodiments of the present application.
It can be understood that the present application can be applied between network devices and terminal devices.
Protocol layer structure between the network device and the terminal device:
Communication between the network device and the terminal device follows a certain protocol layer structure, which may include a control-plane protocol layer structure and a user-plane protocol layer structure. For example, the control-plane protocol layer structure may include the functions of protocol layers such as the radio resource control (RRC) layer, the packet data convergence protocol (PDCP) layer, the radio link control (RLC) layer, the medium access control (MAC) layer, and the physical layer. For example, the user-plane protocol layer structure may include the functions of protocol layers such as the PDCP layer, the RLC layer, the MAC layer, and the physical layer; in a possible implementation, a service data adaptation protocol (SDAP) layer may further be included above the PDCP layer.
Optionally, the protocol layer structure between the network device and the terminal device may further include an artificial intelligence (AI) layer for transmitting data related to AI functions.
Taking data transmission between the network device and the terminal device as an example, the data transmission needs to pass through the user-plane protocol layers, for example, the SDAP layer, the PDCP layer, the RLC layer, the MAC layer, and the physical layer. The SDAP layer, the PDCP layer, the RLC layer, the MAC layer, and the physical layer may also be collectively referred to as the access stratum. Data is either sent or received according to its transmission direction, and each of the above layers is divided into a sending part and a receiving part. Taking downlink data transmission as an example, after the PDCP layer obtains data from an upper layer, it passes the data to the RLC layer and the MAC layer; the MAC layer then generates a transport block, which is transmitted wirelessly through the physical layer. Data is encapsulated correspondingly in each layer: the data that a layer receives from the layer above it is regarded as the service data unit (SDU) of that layer, becomes a protocol data unit (PDU) after being encapsulated by that layer, and is then passed to the next layer.
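The following is a minimal sketch of the SDU-to-PDU encapsulation idea just described. It is illustrative only: the header strings are placeholders and do not reflect the actual SDAP/PDCP/RLC/MAC header formats.

```python
# Each user-plane layer treats the data received from the layer above as its
# SDU and emits a PDU; physical-layer transmission is omitted from this sketch.
LAYERS = ["SDAP", "PDCP", "RLC", "MAC"]

def encapsulate(app_data: bytes) -> bytes:
    pdu = app_data
    for layer in LAYERS:
        sdu = pdu                      # the upper layer's PDU is this layer's SDU
        header = f"{layer}|".encode()  # placeholder header, not a real format
        pdu = header + sdu             # encapsulation: header + SDU -> PDU
    return pdu                         # payload of the MAC transport block

if __name__ == "__main__":
    print(encapsulate(b"downlink user data"))
```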
Exemplarily, the terminal device may further have an application layer and a non-access stratum. The application layer may be used to provide services to applications installed in the terminal device; for example, downlink data received by the terminal device may be passed up from the physical layer, layer by layer, to the application layer and then provided by the application layer to an application, and, for another example, the application layer may obtain data generated by an application and pass the data down, layer by layer, to the physical layer to be sent to another communication apparatus. The non-access stratum may be used to forward user data, for example, forwarding uplink data received from the application layer to the SDAP layer, or forwarding downlink data received from the SDAP layer to the application layer.
It should be understood that the number and type of devices in the communication system shown in Figure 2 are merely illustrative, and the present application is not limited thereto. In practical applications, the communication system may further include more terminal devices and more network devices, and may also include other network elements, for example, core network devices and/or network elements for implementing artificial intelligence functions.
It can be understood that all or some of the functions implemented by one or more of the terminal device, the network device, the core network device, or the network element for implementing artificial intelligence functions may be virtualized, that is, implemented by one or more of a dedicated processor or a general-purpose processor together with corresponding software modules. Since the terminal device and the network device involve an air-interface transmission interface, the transceiver functions of that interface may be implemented by hardware. Core network devices, such as the operation administration and maintenance (OAM) network element, may all be virtualized. Optionally, one or more functions of the virtualized terminal device, network device, core network device, or network element for implementing artificial intelligence functions may be implemented by a cloud device, for example, a cloud device in an over the top (OTT) system.
In the embodiments of the present application, when the central node is a network device and the child nodes are terminal devices (such as UEs), the network device and UE1 to UE5 may form a distributed AI training system as shown in Figure 3. In this communication system, UE1 to UE5 may send data to the network device, and the network device needs to receive the uplink data sent by UE1 to UE5. The uplink data may be the representation differences or model parameters computed by the child nodes, or feedback quantities containing their status information. At the same time, the network device may send configuration information to UE1 to UE5; the configuration information may be model parameter data used by the central node to synchronize the child nodes, or control data indicating the training method of the child nodes. The data between the network device and the UEs may be carried on physical channels, for example, the physical downlink control channel (PDCCH), the physical downlink shared channel (PDSCH), the physical uplink shared channel (PUSCH), or the physical uplink control channel (PUCCH); for another example, the physical sidelink control channel (PSCCH) and the physical sidelink shared channel (PSSCH).
In order to support AI technology in a wireless network, an AI node may also be introduced into the network.
Optionally, the AI node may be deployed at one or more of the following locations in the communication system: a network device, a terminal device, a core network device, or the like; alternatively, the AI node may be deployed separately, for example, at a location other than any of the above devices, such as a host or cloud server of an over the top (OTT) system. The AI node may communicate with other devices in the communication system, and the other devices may be, for example, one or more of the following: a network device, a terminal device, or a network element of the core network.
It can be understood that the present application does not limit the number of AI nodes. For example, when there are multiple AI nodes, the AI nodes may be divided based on functions, that is, different AI nodes are responsible for different functions.
It can also be understood that AI nodes may be independent devices, may be integrated into the same device to implement different functions, may be network elements in hardware devices, may be software functions running on dedicated hardware, or may be virtualized functions instantiated on a platform (for example, a cloud platform). The present application does not limit the specific form of the above AI nodes.
An AI node may be an AI network element or an AI module.
One or more AI modules are provided in one or more of these network element nodes, such as core network devices, access network nodes (RAN nodes), terminals, or OAM. The access network node may be a single RAN node, or may include multiple RAN nodes, for example, a CU and a DU. The CU and/or the DU may also be provided with one or more AI modules. Optionally, the CU may further be split into a CU-CP and a CU-UP, and one or more AI models may be provided in the CU-CP and/or the CU-UP.
An AI module is used to implement a corresponding AI function. The AI modules deployed in different network elements may be the same or different. Depending on its parameter configuration, the model of an AI module can implement different functions. The model of the AI module may be configured based on one or more of the following parameters: structural parameters (for example, at least one of the number of neural network layers, the width of the neural network, the connection relationships between layers, the weights of the neurons, the activation functions of the neurons, or the biases in the activation functions), input parameters (for example, the type and/or dimension of the input parameters), or output parameters (for example, the type and/or dimension of the output parameters). The bias in an activation function may also be called the bias of the neural network.
An AI module may have one or more models. One model can be used to infer one output, and the output includes one parameter or multiple parameters. The learning, training, or inference processes of different models may be deployed in different nodes or devices, or may be deployed in the same node or device.
The communication system includes a RAN intelligent controller (RIC). For example, the RIC may be the above AI module and is used to implement AI-related functions. The RIC includes a near-real time RIC (near-RT RIC) and a non-real time RIC (non-RT RIC). The non-real time RIC mainly processes non-real-time information, for example, data that is not sensitive to delay, whose delay may be on the order of seconds. The near-real time RIC mainly processes near-real-time information, for example, data that is relatively sensitive to delay, whose delay is on the order of tens of milliseconds.
The near-real time RIC is used for model training and inference, for example, training an AI model and performing inference with that AI model. The near-real time RIC may obtain network-side and/or terminal-side information from RAN nodes (for example, a CU, a CU-CP, a CU-UP, a DU, and/or an RU) and/or terminals, and this information may be used as training data or inference data. Optionally, the near-real time RIC may deliver the inference results to the RAN nodes and/or terminals. Optionally, the inference results may be exchanged between the CU and the DU, and/or between the DU and the RU; for example, the near-real time RIC delivers the inference result to the DU, and the DU sends it to the RU.
The non-real time RIC is also used for model training and inference, for example, training an AI model and performing inference with that model. The non-real time RIC may obtain network-side and/or terminal-side information from RAN nodes (for example, a CU, a CU-CP, a CU-UP, a DU, and/or an RU) and/or terminals; this information may be used as training data or inference data, and the inference results may be delivered to the RAN nodes and/or terminals. Optionally, the inference results may be exchanged between the CU and the DU, and/or between the DU and the RU; for example, the non-real time RIC delivers the inference result to the DU, and the DU sends it to the RU.
The near-real time RIC and the non-real time RIC may also each be set up as a separate network element. Optionally, the near-real time RIC and the non-real time RIC may also be part of other devices; for example, the near-real time RIC is set up in a RAN node (for example, in a CU or a DU), and the non-real time RIC is set up in the OAM, a cloud server, a core network device, or another network device.
Exemplarily, the placement of the near-real time RIC and the non-real time RIC in the network architecture may be as shown in Figures 4A to 4D:
As shown in (a) of Figure 4A, in a first possible implementation, the network device includes a near-real time RIC module for performing model learning and/or inference.
As shown in (b) of Figure 4A, in a second possible implementation, the communication system may include a non-real time RIC outside the network device; optionally, the non-real time RIC may be located in the OAM or in a core network device.
As shown in (c) of Figure 4A, in a third possible implementation, the network device includes a near-real time RIC, and a non-real time RIC is further included outside the network device. Optionally, the non-real time RIC may be located in the OAM or in a core network device.
Compared with (c) of Figure 4A, in Figure 4B the CU is separated into a CU-CP and a CU-UP. The placement of the near-real time RIC and the non-real time RIC is the same as in (c) of Figure 4A.
As shown in Figure 4C, optionally, the network device includes one or more AI entities whose functions are similar to the above near-real time RIC. Optionally, the OAM includes one or more AI entities whose functions are similar to the above non-real time RIC. Optionally, the core network device includes one or more AI entities whose functions are similar to the above non-real time RIC. When both the OAM and the core network device include AI entities, the models trained by their respective AI entities are different, and/or the models used for inference are different. In the present application, models being different may include at least one of the following differences: the structural parameters of the model (for example, the number of layers and/or the weights of the model), the input parameters of the model, or the output parameters of the model.
Compared with Figure 4C, the network device in Figure 4D is separated into a CU and a DU. Optionally, the CU may include an AI entity whose function is similar to the above near-real time RIC. Optionally, the DU may include an AI entity whose function is similar to the above near-real time RIC. When both the CU and the DU include AI entities, the models trained by their respective AI entities are different, and/or the models used for inference are different. Optionally, the CU in Figure 4D may further be split into a CU-CP and a CU-UP. Optionally, one or more AI models may be deployed in the CU-CP, and/or one or more AI models may be deployed in the CU-UP. Optionally, in Figure 4C or Figure 4D, the OAM of the network device and the OAM of the core network device may be deployed separately and independently.
For ease of understanding, the AI technology involved in the present application is first introduced below. It can be understood that this introduction is not intended to limit the present application.
(1) AI model
AI refers to the intelligence exhibited by machines made by humans. Usually, artificial intelligence refers to technology that presents human intelligence through ordinary computer programs. Artificial intelligence can be defined as machines or computers that imitate humans and have cognitive functions related to human thinking, such as learning and problem solving. Artificial intelligence can learn from past experience, make reasonable decisions, and respond quickly. The goal of artificial intelligence is to understand intelligence by building computer programs capable of symbolic reasoning or inference.
Machine learning (ML) is one way to achieve artificial intelligence, that is, solving problems in artificial intelligence by means of machine learning. Machine learning theory is mainly concerned with designing and analyzing algorithms that allow computers to "learn" automatically. Machine learning algorithms are a class of algorithms that automatically analyze data to obtain patterns and use the patterns to make predictions on unknown data. Because learning algorithms involve a great deal of statistical theory, machine learning is particularly closely related to inferential statistics and is also called statistical learning theory.
Machine learning can be divided into supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning uses a machine learning algorithm to learn, from collected sample values and sample labels, the mapping relationship from sample values to sample labels, and expresses the learned mapping relationship with a machine learning model. The process of training a machine learning model is the process of learning this mapping relationship. For example, in signal detection, the received signal containing noise is the sample and the true constellation point corresponding to the signal is the label; machine learning is expected to learn the mapping between samples and labels through training, that is, to make the machine learning model learn a signal detector. During training, the model parameters are optimized by computing the error between the model's predicted value and the true label. Once the mapping relationship has been learned, the learned mapping can be used to predict the label of each new sample. The mapping relationships learned by supervised learning may include linear mappings and nonlinear mappings. According to the type of label, learning tasks can be divided into classification tasks and regression tasks.
Unsupervised learning uses algorithms to discover the inherent patterns of samples based only on the collected sample values. One class of algorithms in unsupervised learning uses the sample itself as the supervision signal, that is, the model learns a mapping from sample to sample; this is called self-supervised learning. During training, the model parameters are optimized by computing the error between the model's predicted value and the sample itself. Self-supervised learning can be used in applications such as signal compression and decompression/recovery; common algorithms include autoencoders and generative adversarial networks.
Reinforcement learning, unlike supervised learning, is a class of algorithms that learn strategies for solving problems by interacting with the environment. Unlike supervised and unsupervised learning, reinforcement learning problems have no explicit "correct" action label data; the algorithm needs to interact with the environment, obtain the reward signal fed back by the environment, and then adjust its decision actions to obtain a larger reward value. For example, in downlink power control, a reinforcement learning model adjusts the downlink transmit power of each user according to the total system throughput fed back by the wireless network, with the expectation of obtaining a higher system throughput. The goal of reinforcement learning is also to learn the mapping relationship between the environment state and the optimal decision action. However, because the label of the "correct action" cannot be obtained in advance, the network cannot be optimized by computing the error between the action and the "correct action". Reinforcement learning is trained through iterative interaction with the environment.
An AI model is an algorithm or computer program that can realize an AI function; it is the specific implementation of an AI technical function and represents the mapping relationship between the input and output of the model. The type of the AI model may be a neural network, a linear regression model, a decision tree model, a support vector machine (SVM), a Bayesian network, a Q-learning model, or another machine learning model.
(2) Deep neural network (DNN)
A deep neural network is a specific implementation form of AI or machine learning technology. According to the universal approximation theorem, a neural network can in theory approximate any continuous function, which gives the neural network the ability to learn arbitrary mappings. Traditional communication systems rely on rich expert knowledge to design communication modules, whereas a DNN-based deep learning communication system can automatically discover implicit pattern structures from large data sets, establish mapping relationships between data, and obtain performance superior to that of traditional modeling methods.
The idea of the DNN comes from the neuron structure of brain tissue. For example, each neuron performs a weighted sum operation on its input values and outputs the operation result through an activation function. Figure 5 is a schematic diagram of a neuron structure. Assume that the input of the neuron is x = [x0, x1, …, xn] and the weights corresponding to the inputs are w = [w0, w1, …, wn], where wi is the weight of xi and is used to weight xi; the bias of the weighted sum of the input values according to the weights is, for example, b. The activation function can take many forms. Assuming that the activation function of a neuron is y = f(z) = max(0, z), the output of the neuron is: y = max(0, w0·x0 + w1·x1 + … + wn·xn + b).
For another example, if the activation function of a neuron is y = f(z) = z, the output of the neuron is: y = w0·x0 + w1·x1 + … + wn·xn + b.
Here, b, wi, and xi may take various possible values such as decimals, integers (for example, 0, positive integers, or negative integers), or complex numbers. The activation functions of different neurons in a neural network may be the same or different.
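As a minimal sketch of the weighted-sum-plus-activation computation described above (the function and variable names are illustrative; a ReLU-style activation max(0, z) is assumed):

```python
def neuron_output(x, w, b, activation=lambda z: max(0.0, z)):
    """Weighted sum of the inputs plus a bias, passed through an activation.

    x: input values [x0, ..., xn]
    w: weights      [w0, ..., wn]
    b: bias of the weighted sum
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Example: y = max(0, 0.5*1.0 + (-0.2)*2.0 + 0.1) = 0.2
print(neuron_output([1.0, 2.0], [0.5, -0.2], 0.1))
```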
A neural network generally includes multiple layers, and each layer may include one or more neurons. Increasing the depth and/or width of a neural network improves its expressive power, providing more powerful information extraction and abstract modeling capabilities for complex systems. The depth of a neural network may refer to the number of layers it includes, and the number of neurons in each layer may be called the width of that layer. In one implementation, the neural network includes an input layer and an output layer: the input layer processes the received input information through neurons and passes the processing result to the output layer, which produces the output result of the neural network. In another implementation, the neural network includes an input layer, hidden layers, and an output layer; reference may be made to the schematic diagram of the neural network in Figure 6. The input layer processes the received input information through neurons and passes the processing result to an intermediate hidden layer; the hidden layer computes on the received result and passes its computation result to the output layer or to an adjacent hidden layer; finally, the output layer produces the output result of the neural network. A neural network may include one hidden layer, or multiple hidden layers connected in sequence, without limitation.
Depending on how the network is constructed, DNNs may include feedforward neural networks (FNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Figure 6 shows an FNN, which is characterized by every neuron in one layer being fully connected to every neuron in the adjacent layer; as a result, an FNN usually requires a large amount of storage space and has high computational complexity.
A CNN is a neural network designed to process data with a grid-like structure. For example, both time series data and image data can be regarded as grid-like data. A CNN does not use all the input information at once for its computation; instead, it uses a fixed-size window to intercept part of the information for the convolution operation, which greatly reduces the amount of computation of the model parameters. In addition, depending on the type of information intercepted by the window (for example, the people and the objects in the same picture are different types of information), each window can use a different convolution kernel, which enables the CNN to better extract the features of the input data.
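A minimal sketch (assuming a one-dimensional input and a single illustrative kernel) of how a fixed-size window slides over the input and performs the convolution operation described above:

```python
def conv1d(signal, kernel):
    """Slide a fixed-size window over the signal and take the weighted sum
    of each window with the kernel (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Example: a 3-tap averaging kernel applied to a short sequence
print(conv1d([1.0, 2.0, 3.0, 4.0, 5.0], [1 / 3, 1 / 3, 1 / 3]))
```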
An RNN is a class of DNNs that uses feedback of time-series information. Its input includes the new input value at the current time and its own output value at the previous time. RNNs are suitable for capturing sequence features that are correlated in time and are particularly suitable for applications such as speech recognition and channel encoding/decoding.
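A minimal sketch (with illustrative weights and a ReLU-style activation) of the feedback just described, where each time step combines the new input value with the output fed back from the previous time step:

```python
def rnn_run(inputs, w_in=0.8, w_rec=0.5, b=0.0):
    """Simple recurrent step: the output at time t depends on the new input
    x_t and on the output y_{t-1} fed back from the previous time step."""
    y_prev = 0.0
    outputs = []
    for x_t in inputs:
        y_t = max(0.0, w_in * x_t + w_rec * y_prev + b)
        outputs.append(y_t)
        y_prev = y_t  # feedback used at the next time step
    return outputs

print(rnn_run([1.0, 0.0, 2.0]))  # later outputs carry memory of earlier inputs
```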
The above FNN, CNN, and RNN are common neural network structures, all of which are constructed on the basis of neurons. As described above, each neuron performs a weighted sum operation on its input values, and the weighted sum result produces an output through a nonlinear function; the weights of the neurons' weighted sum operations and the nonlinear functions are called the parameters of the neural network. Taking a neuron whose nonlinear function is max{0, x} as an example, the neuron performs the operation y = max{0, w0·x0 + … + wn·xn + b}, and its parameters are the weights w = [w0, …, wn], the bias b of the weighted sum, and the nonlinear function max{0, x}. The parameters of all the neurons in a neural network constitute the parameters of that neural network.
(3) Training data set and inference data
A training data set is used for training an AI model. The training data set may include the inputs of the AI model, or the inputs and the target outputs of the AI model. The training data set includes one or more pieces of training data; a piece of training data may be a training sample input to the AI model, or may be a target output of the AI model. The target output may also be called a label or a label sample. The training data set is one of the important parts of machine learning; model training is essentially learning certain features from the training data so that the output of the AI model is as close as possible to the target output, for example, so that the difference between the output of the AI model and the target output is as small as possible. The composition and selection of the training data set can, to a certain extent, determine the performance of the trained AI model.
In addition, a loss function can be defined during the training of an AI model (such as a neural network). The loss function describes the gap or difference between the output value of the AI model and the target output value; the present application does not limit the specific form of the loss function. The training process of the AI model is the process of adjusting the model parameters of the AI model so that the value of the loss function becomes smaller than a threshold, or so that the value of the loss function meets the target requirement. For example, if the AI model is a neural network, adjusting the model parameters of the neural network includes adjusting at least one of the following parameters: the number of layers of the neural network, its width, the weights of the neurons, or the parameters in the activation functions of the neurons.
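A minimal sketch of this training loop, assuming a single-parameter linear model, a squared-error loss, and plain gradient descent; the sample values and the stopping threshold are illustrative:

```python
# Training data: (input, target output) pairs, i.e. samples and labels
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0            # the single model parameter to be adjusted
lr = 0.01          # learning rate
threshold = 0.05   # illustrative threshold on the loss value

def loss(w):
    # Squared-error loss between the model output w*x and the target output y
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

while loss(w) > threshold:
    # Gradient of the loss with respect to w, used to adjust the parameter
    grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
    w -= lr * grad

print(f"trained parameter w = {w:.3f}, loss = {loss(w):.4f}")
```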
Inference data can be used as the input to a trained AI model for inference by the AI model. During model inference, the inference data is input into the AI model, and the corresponding output obtained is the inference result.
(4) Design of an AI model
The design of an AI model mainly includes a data collection stage (for example, collecting training data and/or inference data), a model training stage, and a model inference stage, and may further include an inference result application stage. Figure 7 illustrates an AI application framework. In the data collection stage, a data source is used to provide the training data set and the inference data. In the model training stage, an AI model is obtained by analyzing or training the training data provided by the data source; the AI model represents the mapping relationship between the input and output of the model, and learning the AI model through the model training node is equivalent to using the training data to learn the mapping relationship between the model's input and output. In the model inference stage, the AI model trained in the model training stage performs inference based on the inference data provided by the data source to obtain the inference result. This stage can also be understood as: inputting the inference data into the AI model and obtaining an output through the AI model, where that output is the inference result. The inference result may indicate configuration parameters to be used (executed) by an execution object, and/or operations to be performed by the execution object. In the inference result application stage, the inference result is published; for example, the inference result may be planned in a unified manner by an actor, and the actor may send the inference result to one or more execution objects (for example, core network devices, network devices, or terminal devices) for execution. For another example, the actor may also feed the performance of the model back to the data source, to facilitate subsequent update training of the model.
It can be understood that the communication system may include network elements with artificial intelligence functions. The above stages related to AI model design may be performed by one or more network elements with artificial intelligence functions. In one possible design, an AI function (such as an AI module or an AI entity) may be configured in an existing network element of the communication system to implement AI-related operations, such as training and/or inference of an AI model. For example, the existing network element may be a network device (such as a gNB), a terminal device, a core network device, or the network management system. The network management system may divide network management work into three categories according to the actual needs of the operator's network operation: operation, administration, and maintenance; the network management system may also be called an OAM network element, or OAM for short. Operation mainly completes the analysis, prediction, planning, and configuration of the daily network and services; maintenance mainly involves daily operational activities such as testing and fault management of the network and its services. The network management system can detect the network operating state, optimize network connections and performance, improve network operating stability, and reduce network maintenance costs. In another possible design, an independent network element may be introduced into the communication system to perform AI-related operations, such as training AI models. The independent network element may be called an AI network element or an AI node, and the present application does not limit this name. The AI network element may be directly connected to the network devices in the communication system, or may be indirectly connected to the network devices through a third-party device. The third-party device may be a core network element such as an access and mobility management function (AMF) network element or a user plane function (UPF) network element, an OAM, a cloud server, or another network element, without limitation. Exemplarily, referring to Figure 8, the communication system includes a network device 810 and terminal devices 820 and 830, and an AI network element 840 is further introduced into the communication system.
In the present application, one model may be used to infer one parameter or multiple parameters. The training processes of different models may be deployed in different devices or nodes, or in the same device or node; the inference processes of different models may likewise be deployed in different devices or nodes, or in the same device or node.
The model parameters may include one or more of the following: the structural parameters of the model (for example, the number of layers and/or the weights of the model), the input parameters of the model (such as the input dimension and the number of input ports), or the output parameters of the model (such as the output dimension and the number of output ports). It can be understood that the input dimension may refer to the size of one piece of input data; for example, when the input data is a sequence, the input dimension corresponding to the sequence may indicate the length of the sequence. The number of input ports may refer to the number of input data items. Similarly, the output dimension may refer to the size of one piece of output data; for example, when the output data is a sequence, the output dimension corresponding to the sequence may indicate the length of the sequence. The number of output ports may refer to the number of output data items.
(5) Centralized training
Over the past decade, the number of smart devices such as mobile terminals and wearable devices has continued to increase. It is foreseeable that in the near future billions of IoT devices will be deployed throughout the communication network to automate and make intelligent the operation of society. The performance of current intelligent services based on advanced machine learning may benefit from the explosively growing data on these devices and from the computing power available on the UEs themselves. Most machine learning techniques, for example learning algorithms based on deep neural networks, require all available data to be gathered centrally for training. However, centralized training requires collecting a large amount of data, and since this data often originates from UEs, the UEs need to upload it, which incurs a large upload overhead. Moreover, when massive training data is concentrated in one node, training efficiency is limited by the storage space. In addition, collecting data from the UE side may infringe on users' privacy, and storing massive amounts of data for centralized training may raise significant concerns about the leakage of private data. On the other hand, if users train machine learning algorithms using only the data they themselves own, the training effect is often limited by the limited amount of local data.
(6) Distributed training
Distributed training is an effective solution to the above challenges. This type of technology allows the machine learning training process to be divided across multiple sub-nodes on the user side, achieving scalability of the learning algorithm. It allows a cloud or server acting as the central node to collect the machine learning models trained by multiple sub-nodes, and the central node improves the effect of the overall machine learning training by integrating the models trained by the sub-nodes. Since the training data always remains at the sub-nodes, distributed learning technology is expected to use the data and/or computing power of the UEs to achieve the same performance as centralized training while protecting user data privacy.
Federated learning is a machine learning paradigm for distributed training. Its original intention is to effectively help multiple institutions use data and perform machine learning modeling while meeting the requirements of user privacy protection and data security. Within the federated learning framework, what is transferred between nodes is not the data itself but the intermediate results obtained during training, such as model parameters or gradients. As a distributed machine learning paradigm, federated learning can effectively solve the problem of data silos, allowing participants to jointly build models without sharing data, technically breaking down data silos and achieving AI collaboration. According to the distribution of data sources among the participating parties, federated learning can be divided into three categories: horizontal federated learning, vertical federated learning, and federated transfer learning.
Horizontal federated learning means that, when the user features of two data sets overlap considerably but the users overlap little, the data sets are split horizontally (that is, along the user dimension), and the part of the data in which the user features of both parties are the same but the users are not exactly the same is taken out for training. Vertical federated learning means that, when the users of two data sets overlap considerably but the user features overlap little, the data sets are split vertically (that is, along the feature dimension), and the part of the data in which the users of both parties are the same but the user features are not exactly the same is taken out for training. Federated transfer learning means that, when both the users and the user features of two data sets overlap little, the data is not split, and transfer learning can be used to overcome the lack of data or labels.
Taking the high-frequency beam management problem as an example, when the network device side performs beam sweeping of codebook-based synchronization signal and physical broadcast channel (PBCH) blocks (that is, synchronization signal blocks (SSBs)) or channel state information-reference signals (CSI-RSs), the channels between the network device and users at different locations are different. A user measures the received SSB or CSI-RS beams, for example, measures the layer 1 reference signal received power (L1-RSRP), and feeds back the beam identity (ID) corresponding to the maximum RSRP value. AI/ML can be used to train a model that takes, for example, some or all of the plurality of received SSB/CSI-RS signals, or some or all of the strengths (RSRP) of the plurality of received SSB/CSI-RS signals, or the estimated channel as input, to infer the optimal beam ID and feed it back to the network device. Each user can collect its own received beam/channel information and the corresponding optimal beam ID as samples (that is, local samples) for training the above AI/ML model. However, the number of samples each user can collect is limited, and the performance of a model trained by a user using only local data is limited; that is, due to its location, the optimal beam IDs of a user may be only a subset of the SSB/CSI-RS codebook. If the users send local data to a server and the server aggregates the data of all users for model training, the model performance can be improved, but there is a risk of leaking users' private information; for example, the user's current location can be inferred from the channel. To solve this problem, federated learning can be used: the central node delivers a global model to each user participating in federated learning; each user trains the global model using local data to obtain a local model, and sends the parameter information of the local model, such as gradients and weights (encrypted), to the server; the server performs model aggregation (MA) to update the global model and sends the global model to the users again; the users continue to update their local models and send them to the central node; this is iterated multiple times until convergence.
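A minimal sketch of one such federated learning round is given below; it is illustrative only, and the function names (for example, user.train_locally) are assumptions rather than interfaces defined in this application:

    import numpy as np

    def fedavg(updates):
        # Element-wise average of the parameter vectors reported by the sub-nodes.
        return np.mean(np.stack(updates), axis=0)

    def federated_round(global_weights, users):
        updates = []
        for user in users:
            # Each user trains the delivered global model on its local beam samples
            # and reports only the resulting parameters (or gradients), not the data.
            updates.append(user.train_locally(global_weights))
        return fedavg(updates)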
In the federated learning architecture, different sub-nodes train a public model based on local data to obtain local models, and a local model can fit the characteristics of the local data well. However, after the local models of these sub-nodes are aggregated by the central node, the generalization of the overall model is improved, but the ability of the public model to fit the local data characteristics of each sub-node decreases, and personalized enhancement is needed.
Therefore, a mixture-of-experts model is proposed: the global model delivered by the central node includes multiple expert models. Each sub-node selects the output results of at least a part of the expert models based on the output result of the gated network model to train the global model, obtains gradient information, and reports the gradient information and the like to the central node for model aggregation.
A mixture-of-experts model combines multiple networks, also referred to as multiple expert models, to obtain better model performance. A large model based on the mixture-of-experts structure can effectively improve its training efficiency while scaling up the number of model parameters.
As shown in FIG. 9A to FIG. 9E, various global model structures are provided.
Depending on whether the global model includes a general layer, whether the expert layer includes the gated network model, and whether there is an explicit definition of the expert layer, the structures can be classified as follows:
(1) The expert layer includes the gated network model, and there is an explicit definition of the expert layer:
In FIG. 9A, the global model includes a general layer and an expert layer. The general layer is a network layer common to all input data, and may be, for example, a feature extraction network such as a convolutional neural network. The expert layer includes a gated network model and N expert models, where N is a positive integer. The N expert models may be based on different types of neural networks, for example, CNN, RNN, Transformer, or MLP; or they may be based on the same type of neural network but with different parameter configurations, for example, the number of layers, the depth, and specific configurations including the kernel size.
The gated network model is trained, and the output result of the gated network model is used to select the output results of the expert models. Selecting an appropriate, matching gating output mechanism can merge and balance the selection of experts. The different expert models here may refer to networks with different structures; for example, expert model 1 is a convolutional network, expert model 2 is a fully connected network, and so on. However, it must be ensured that the output results of the expert networks can be merged. Selecting a part of the expert models for prediction based on the output result of the gated network model can both reduce the amount of computation and select the most suitable expert models for different inputs.
In this architecture, the output result of the general layer serves as the common input of the expert models and the gated network model.
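A minimal sketch of the structure in FIG. 9A is given below; the layer choices, activation functions, and parameter names are assumptions for illustration and are not defined in this application:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    class MoEGlobalModel:
        def __init__(self, experts, gate_weights, common_weights):
            self.experts = experts          # list of N callables (the expert models)
            self.gate_w = gate_weights      # parameters of the gated network model
            self.common_w = common_weights  # parameters of the general layer

        def forward(self, x):
            h = np.tanh(self.common_w @ x)       # general layer, shared by gate and experts
            gate_out = softmax(self.gate_w @ h)  # N gate outputs in [0, 1], one per expert
            expert_out = [f(h) for f in self.experts]
            # Fuse the expert outputs, each weighted by the corresponding gate output.
            return sum(g * o for g, o in zip(gate_out, expert_out))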
In FIG. 9B, the global model includes only the expert layer and does not include the general layer. The expert layer includes a gated network model and N expert models. For the meanings of the expert models and the gated network model, refer to the foregoing description.
In this architecture, the input of the local model serves as the common input of the expert models and the gated network model.
(2) The expert layer does not include the gated network model, and there is an explicit definition of the expert layer:
In FIG. 9C, the expert layer does not include the gated network model and includes only the N expert models. The gated network model is independent of the expert layer. For the meanings of the expert models and the gated network model, refer to the foregoing description. The expert layer and the gated network model have the same input: the result of a general preprocessing serves as the common input of the expert layer and the gated network model. The result of the general preprocessing may be the output result of a general-layer model, or may not be based on the output result of a general-layer model.
In FIG. 9D, the expert layer does not include the gated network model and includes only the N expert models. The gated network model is independent of the expert layer. For the meanings of the expert models and the gated network model, refer to the foregoing description. Moreover, the inputs of the expert layer and the gated network model are different: the result of a first type of preprocessing on the input local data serves as the input of expert layer 1 to expert layer N, and the result of a second type of preprocessing on the input local data serves as the input of the gated network model.
(3) There is no explicit definition of the expert layer:
In FIG. 9E, there is no explicit definition of the expert layer, and the global model includes N networks and a gated network model. For the functions and roles of the N networks, refer to the expert models described above. For the meaning of the gated network model, refer to the foregoing description. The inputs of network 1 to network N and of the gated network model are different: the result of a first type of preprocessing on the input local data serves as the input of network 1; the result of a second type of preprocessing on the input local data serves as the input of network 2; and so on; the result of an (N+1)-th type of preprocessing on the input serves as the input of the gated network model.
However, if the sub-nodes have different understandings of how to use the output result of the gated network model and how to train the expert models, the performance of the aggregated model at the central node will be degraded.
To address the above problem, this application provides a distributed training solution: the central node indicates a selection parameter, so that all sub-nodes participating in the training apply the selection parameter to the output result of the gated network model during the training of the gated network model and the expert models. In this way, the sub-nodes align their selection of the output results of the gated network model, which improves the performance of the aggregated model at the central node.
The distributed training method provided in the embodiments of this application is described in detail below. It can be understood that this application uses the central node and the sub-node as an example of the execution bodies of the interaction illustration, but this application does not limit the execution bodies of the interaction illustration. For example, the central node in the method provided in this application may also be a chip, a chip system, a circuit, or a processor applied to the central node, or may be a logical node, a logical module, or software that can implement all or part of the functions of the central node; the sub-node in the method provided in this application may also be a chip, a chip system, a circuit, or a processor applied to the sub-node, or may be a logical node, a logical module, or software that can implement all or part of the functions of the sub-node.
As shown in FIG. 10, which is a schematic flowchart of a distributed training method provided in an embodiment of this application, the method is applied to a distributed training system that includes a central node and multiple sub-nodes. The global model of the distributed training system includes N expert models and one gated network model, where N is a positive integer. Exemplarily, the method may include the following steps:
S1001. A sub-node sends third information to the central node. Correspondingly, the central node receives the third information.
The sub-node is any sub-node participating in the distributed training. This embodiment is described using the interaction procedure between one sub-node and the central node as an example; for the interaction between other sub-nodes and the central node, refer to the interaction procedure of this embodiment.
When the distributed training system is initially constructed, a sub-node may report its own capabilities to the central node. Exemplarily, the sub-node sends third information to the central node, where the third information is used to indicate at least one of the following: the memory space size of the sub-node, computing power information of the sub-node, whether model training is supported, and the types of models supported for training.
When training starts, the central node sends an initialization model of the central node to the sub-node, so that the sub-node trains the initialization model. Therefore, before training starts, the sub-node may report the memory space size of the sub-node to the central node, so that the central node knows whether the memory space of the sub-node is large enough to store the initialization model of the central node and the subsequent training data. The memory space size of the sub-node refers to the size of the memory space that the sub-node can use to store AI/ML models.
Before training starts, the sub-node may also report computing power information of the sub-node to the central node, so that the central node knows whether the sub-node has sufficiently strong computing power to feed back the trained model information in time. The computing power information of the sub-node refers to the computing capability for running AI/ML models.
Before training starts, the sub-node may also report to the central node indication information about whether it supports model training, so that the central node determines whether the sub-node can participate in the distributed training and delivers the global model to it.
Before training starts, the sub-node may also report to the central node the types of models it supports for training, so that the central node determines whether the sub-node can participate in the distributed training. Exemplarily, the types of models the sub-node supports for training include CNN, RNN, fully connected, and random forest models.
In addition, the third information may also include hardware information of the sub-node, including but not limited to the antenna configuration of the sub-node (the number of antennas, polarization directions, and the like), the number of radio frequency channels, sensor types (position sensor/global positioning system (GPS), motion sensor, and the like), and parameters.
It can be understood that, because this is federated learning and the sub-node uses local data for training, the sub-node does not need to report information related to the actually collected data or involving privacy, such as the amount of data that can be processed.
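As an illustration only, the third information could be organized as in the sketch below; the field names and values are hypothetical assumptions and not message formats defined in this application:

    third_information = {
        "memory_space_mb": 256,            # memory available for storing AI/ML models
        "compute_capability_tops": 2.0,    # computing power for running AI/ML models
        "supports_model_training": True,
        "supported_model_types": ["CNN", "RNN", "fully_connected", "random_forest"],
        "hardware": {"num_antennas": 4, "num_rf_chains": 2, "sensors": ["GPS"]},
    }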
It can be understood that the central node may also have obtained the above information of the sub-nodes in advance. Therefore, this step is optional and is indicated by a dotted line in the figure.
S1002. The central node sends first information to the sub-node. Correspondingly, the sub-node receives the first information.
After receiving the third information reported by the sub-nodes, the central node selects, according to the third information of each sub-node, the sub-nodes to participate in this round of distributed training.
As shown in FIG. 9A to FIG. 9E, the global model includes N expert models (or network 1 to network N as shown in FIG. 9E). The N expert models may be based on different types of neural networks, for example, CNN, RNN, Transformer, or MLP; or they may be based on the same type of neural network but with different parameter configurations, for example, the number of layers, the depth, and specific configurations including the kernel size. Therefore, the output results of the N expert models may differ.
Experience and simulation results show that the more sparsely the output results of the expert models are selected, the more data feature types the expert networks can adapt to; in the extreme case, one network learns only one type of feature data.
However, the sub-nodes participating in the distributed training need to align their selection of the output results of the N expert models; otherwise, the performance of the global model aggregated by the central node deteriorates. In FIG. 9A to FIG. 9E, the global model includes a gated network model, and when training the expert models, a sub-node also trains the gated network model. The gated network model may include different types of neural networks, or the same type of neural network with different parameter configurations, for example, the number of layers, the depth, and specific configurations including the kernel size. The gated network model has N outputs, and the N outputs of the gated network model are respectively connected to the outputs of the N expert models; the output results of the gated network model can be used to select the output results of the expert models, and the output results of the gated network model are in one-to-one correspondence with the output results of the expert models connected to them. Therefore, to align the selection of the output results of the expert models by the sub-nodes, the central node sends first information to the sub-node, where the first information is used to indicate a selection parameter, and the selection parameter is used to select the output result of the gated network model. The sub-node may determine, according to the selection parameter, whether to retain or discard an output result of the gated network model. If an output result of the gated network model is retained according to the selection parameter, the output result of the corresponding expert model is also retained; if an output result of the gated network model is discarded according to the selection parameter, the output result of the corresponding expert model is also discarded.
Exemplarily, the output result of the gated network model is a value in [0, 1] after being processed by a set function.
Exemplarily, the selection parameter is a first threshold. For example, the value range of the first threshold may be (0, 1). If an output result of the gated network model is greater than the first threshold, the output result of the gated network model is retained; if an output result of the gated network model is less than the first threshold, the output result of the gated network model is discarded. In particular, when an output result of the gated network model is equal to the first threshold, it may be agreed in advance whether to retain or discard the output result of the gated network model.
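A minimal sketch of this threshold-based selection is given below; it is illustrative only, and the variable names and the threshold value 0.5 are assumptions:

    import numpy as np

    def select_experts(gate_outputs, expert_outputs, first_threshold):
        # Keep a gate output (and the corresponding expert output) only if it
        # exceeds the first threshold indicated by the central node.
        kept = [i for i, g in enumerate(gate_outputs) if g > first_threshold]
        return kept, [expert_outputs[i] for i in kept]

    # Usage: three gate outputs in [0, 1], one per expert, assumed first threshold 0.5.
    kept_ids, kept_outputs = select_experts(
        np.array([0.8, 0.2, 0.6]),
        [np.array([1.0]), np.array([2.0]), np.array([3.0])],
        0.5,
    )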
Further, in addition to indicating the selection parameter, the first information may also be used to indicate at least one of the following: the identifiers of the N expert models, the competition/cooperation mode of the N expert models, the training task, and the types of the input and the output of the global model.
The first information indicates the identifiers of the N expert models, so that it is clear which expert models' output results the model parameter information subsequently reported by the sub-node is based on.
For the N expert models, the central node may further indicate to the sub-nodes in which manner to train each expert model. Generally, the central node indicates to the sub-nodes the purpose of training the network; for channel recovery, for example, the purpose is to make the recovered channel (as the output of the central node) as close to the true value as possible (that is, to minimize the normalized mean squared error (NMSE)). However, under the mixture-of-experts architecture, even for the same task (for example, making the recovered channel approach the true value), there may still be at least the following two manners, whose main difference is the competition/cooperation mode of the N expert models, where the competition/cooperation mode may also be called the usage mode, the cooperation mode, the collaboration mode, indication information of whether to collaborate, and so on:
Mode 1: the N expert models cooperate with each other:
loss = ||A - Σ_i p_i·o_i||^2.
Mode 2: the N expert models compete with each other:
loss = Σ_i p_i·||A - o_i||^2.
In the above two formulas, A represents the true value (which may represent a single sample), p_i represents the weight of the i-th expert model (determined based on the output of the gated network model), and o_i represents the output result of the i-th expert model. It can be seen that, in the cooperation mode, multiple expert models cooperate to approximate the true value, whereas in the competition mode, each expert model has a separate loss and makes a separate judgment without relying on the results of the other expert models. Even the same training node, when training the mixture-of-experts model based on different competition/cooperation modes, obtains different gating weights and expert weights. Therefore, it is desirable that all sub-nodes participating in the training use/train the expert models in at least the same manner.
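The two training manners can be sketched as follows; this is a simplified illustration, and the function names and tensor shapes are assumptions:

    import numpy as np

    def loss_cooperation(A, p, o):
        # Mode 1: the experts jointly approximate the true value A.
        fused = sum(p_i * o_i for p_i, o_i in zip(p, o))
        return float(np.sum((A - fused) ** 2))

    def loss_competition(A, p, o):
        # Mode 2: each expert has its own loss, weighted by its gate output p_i.
        return float(sum(p_i * np.sum((A - o_i) ** 2) for p_i, o_i in zip(p, o)))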
Therefore, the central node may carry the competition/cooperation mode of the N expert models in the first information. The competition/cooperation mode indicates whether the N expert models cooperate or compete with each other.
The central node also indicates the training task of the distributed training. The training task can be understood as the function of the global model, that is, what the global model can be trained to do. Exemplarily, the training task may be beam prediction, channel recovery, channel prediction, or the like.
In a beam management scenario, the input of the global model may be, for example, the channel quality or the RSRP of the beams, and the output of the global model may be, for example, the optimal beam index.
In addition, the central node may also send model parameter information of the global model to the sub-nodes.
S1003. The sub-node inputs sample data into the N expert models and the gated network model to obtain N first output results of the N expert models and N second output results of the gated network model, where the N second output results are in one-to-one correspondence with the N first output results.
During training, the sub-node needs to input sample data into the N expert models and the gated network model. Therefore, before performing this step, the sub-node collects different types of sample data in different application scenarios.
For example, in a beam management scenario, the sub-nodes in the federated learning architecture may be network devices and the central node may be an independent federated learning management node; or the sub-nodes may be terminal devices and the central node may be a network device serving the central node function. Assuming that the global model to be trained is an AI/ML model that takes the estimated channel measurement values or the received signal itself as input and the optimal beam index as output, the sub-node is responsible, in the data collection stage, for collecting the channel measurement values or received signals used as the model input and the labels used for training the model, that is, the optimal beam indices. The network device may send all possible beams (codebook-based synchronization signal block (SSB) or channel state information-reference signal (CSI-RS) beams) one by one to the terminal device, and the terminal device selects, as the label, the beam direction index with the best performance, where the best performance may refer to the beam with the largest layer 1 reference signal received power (L1-RSRP) or signal-to-noise ratio (SNR) measurement value among all SSB/CSI-RS beams.
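A minimal sketch of this label collection is given below; it is illustrative, and the array names and the measured values are assumptions:

    import numpy as np

    def build_sample(rsrp_per_beam):
        # rsrp_per_beam: measured L1-RSRP of every swept SSB/CSI-RS beam.
        optimal_beam_id = int(np.argmax(rsrp_per_beam))  # label: index of the best beam
        return rsrp_per_beam, optimal_beam_id            # (model input, training label)

    # Usage with 8 swept beams (values made up for illustration).
    x, y = build_sample(np.array([-92.1, -88.5, -95.0, -84.2, -90.3, -87.7, -99.8, -93.4]))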
Before model training, the central node may also configure downlink resources for the sub-nodes for delivering the initial global model information of the central node. The downlink resources may be control channel resources, such as PDCCH resources, or data channel resources, such as PDSCH resources. Exemplarily, the downlink resources include parameters such as the frequency-domain resource block number, the starting position, the subband number, the subband bandwidth, frequency hopping parameters, and the modulation and coding scheme (MCS).
The global model may be delivered by the central node in a broadcast or multicast manner. For example, in a single-cell federated learning architecture in which the central node is a network device and the sub-nodes are terminal devices, the global model may be delivered by broadcast; due to the nature of broadcast, sub-nodes not participating in the federated learning can also receive the broadcast information. In a multi-cell federated learning architecture in which a network device with a federated learning management function serves as the central node and other network devices serve as sub-nodes, the central node may also deliver the global model to the sub-nodes by broadcast; similarly, other sub-nodes not participating in the federated learning can also receive the broadcast information. Multicast may also be used for the sub-nodes participating in the federated learning: the sub-nodes associated with the same central node form a group, have the same group number, and are configured with the same downlink resources. In multicast mode, sub-nodes not participating in the federated learning do not receive the multicast information.
The central node may also configure, for the sub-nodes, uplink resources for reporting the local models, that is, for the sub-nodes to report models/gradients/weights. Alternatively, another federated learning management node may configure, for the central node and the sub-nodes, the uplink resources used by the sub-nodes to report local model information and the necessary signaling. Similar to the downlink resource configuration, the uplink resources may be control channel resources, such as PUCCH resources, or data channel resources, such as PUSCH resources.
After collecting a certain amount of sample data, the sub-node inputs the sample data into the N expert models and the gated network model to obtain the N first output results of the N expert models and the N second output results of the gated network model. Since the N outputs of the gated network model are respectively connected to the outputs of the N expert models, the N second output results are in one-to-one correspondence with the N first output results.
S1004. For each of the N first output results and each of the N second output results, the sub-node selects, for a second output result that satisfies the selection parameter, the first output result corresponding to the second output result.
For each second output result of the gated network model, the sub-node determines, according to the selection parameter, whether to retain or discard the second output result. For example, the selection parameter is the first threshold, and the sub-node determines whether the second output result is greater than or equal to the first threshold; if so, the second output result is retained; otherwise, the second output result is discarded.
After the at least one retained second output result is determined, the at least one first output result corresponding to the at least one retained second output result is also retained, that is, selected.
S1005. The sub-node obtains, based on the selected at least one first output result and true value information, model parameter information of the selected at least one expert model and model parameter information of the gated network model.
After selecting the at least one first output result, the sub-node obtains, based on the selected at least one first output result and the true value information, the model parameter information of the selected at least one expert model and the model parameter information of the gated network model.
The model parameter information of an expert model includes the weights, gradients, gradient changes, and the like of the expert model. The model parameter information of the gated network model includes the weights, gradients, gradient changes, and the like of the gated network model.
Exemplarily, the sub-node obtains, based on the selected at least one first output result and the true value information, the model parameter information of the selected at least one expert model and the model parameter information of the gated network model in either of the following two possible implementations:
In one possible implementation, the sub-node obtains the model parameter information of the selected at least one expert model and the model parameter information of the gated network model based on the average value of the selected at least one first output result and the true value information. For example, the first threshold is 0.5, and the sub-node may retain the second output results of the gated network model whose values are greater than 0.5 and select the at least one first output result of the expert models corresponding to the retained at least one second output result. The sub-node multiplies each of the selected at least one first output result by 0.5, and sums and averages the products to obtain the average value of the selected at least one first output result.
In another possible implementation, the sub-node weights and averages the selected at least one first output result based on the at least one second output result respectively corresponding to the selected at least one first output result, to obtain a weighted average of the selected at least one first output result, and obtains the model parameter information of the selected at least one expert model and the model parameter information of the gated network model based on the weighted average of the selected at least one first output result and the true value information. For example, the first threshold is 0.5, and the sub-node may retain the second output results of the gated network model whose values are greater than 0.5 and select the at least one first output result of the expert models corresponding to the retained at least one second output result. The sub-node multiplies each of the selected at least one first output result by its corresponding second output result (for example, one second output result is 0.6 and another is 0.8), and sums and averages the products to obtain the weighted average of the selected at least one first output result.
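The two combination manners can be sketched as follows; the function names are assumptions, and the threshold value 0.5 is only an example:

    import numpy as np

    def combine_with_threshold(kept_expert_outputs, first_threshold=0.5):
        # Implementation 1: each kept expert output is scaled by the first threshold,
        # then the scaled outputs are summed and averaged.
        scaled = [first_threshold * o for o in kept_expert_outputs]
        return sum(scaled) / len(scaled)

    def combine_with_gate_weights(kept_expert_outputs, kept_gate_outputs):
        # Implementation 2: each kept expert output is weighted by its own gate output,
        # then the weighted outputs are summed and averaged.
        weighted = [g * o for g, o in zip(kept_gate_outputs, kept_expert_outputs)]
        return sum(weighted) / len(weighted)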
S1006. The sub-node sends second information to the central node. Correspondingly, the central node receives the second information.
After obtaining the model parameter information of the selected at least one expert model and the model parameter information of the gated network model, the sub-node sends second information to the central node, where the second information is used to indicate the model parameter information of the gated network model and the model parameter information of the selected at least one expert model among the N expert models.
Further, the second information is also used to indicate the identifier of the selected at least one expert model. The sub-node carries, in the second information, indication information of the identifier of the selected at least one expert model, so that the central node performs model aggregation based on the indication information.
S1007. The central node updates the global model and updates the selection parameter based on multiple pieces of second information received from the multiple sub-nodes.
The central node receives the multiple pieces of second information from the multiple sub-nodes respectively, and may update the global model based on the multiple pieces of second information.
In some scenarios, the central node may further update the selection parameter based on the updated global model. For example, during initial training, the first threshold configured by the central node is relatively large, and the first threshold is updated based on information such as the training feedback from the sub-nodes. The first threshold may also be obtained through neural network learning.
S1008. The central node sends fourth information. Correspondingly, the sub-node receives the fourth information.
If the central node has updated the selection parameter, it may send fourth information to the sub-nodes, where the fourth information is used to indicate the updated selection parameter, so that, in the next round of training, the sub-nodes select the output results of the gated network model based on the updated selection parameter.
The distributed training system may perform the above steps S1002 to S1008 multiple times until the global model converges.
In one example, the global model includes one general layer Net_com, three expert models Net1/2/3, and one fully connected network Net_FC. The central node (for example, a base station) sends the model parameter information of the global model, including Net_com, Net1/2/3, and Net_FC, to each sub-node. The corresponding neural network gradient/weight data are denoted W_com, W_1/W_2/W_3, and W_FC.
Assume that only two sub-nodes participate in this round of federated learning: UE1 and UE2. Their training procedures are the same: each inputs local sample data into the model and performs the following training:
i) The outputs after the general layer and Net_1/2/3 are W_1_out, W_2_out, and W_3_out, and the output of the gating network Net_FC is W_FC (a vector of three elements in [0, 1]). After fusion, the final output is:
Out = W_FC(1)*W_1_out + W_FC(2)*W_2_out + W_FC(3)*W_3_out;
ii) The loss is computed based on, for example, the loss function loss = ||Out - A||^2, where A is the true value/label;
iii) After training is completed, the following weights are obtained:
UE1: W_com(UE1), W_1(UE1)/W_2(UE1)/W_3(UE1), and W_FC(UE1);
UE2: W_com(UE2), W_1(UE2)/W_2(UE2)/W_3(UE2), and W_FC(UE2).
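A sketch of the fusion and loss computation in steps i) and ii) is given below; the numeric values are placeholders used only for illustration:

    import numpy as np

    # Outputs of the three experts after the general layer (placeholder values).
    W_1_out, W_2_out, W_3_out = np.array([0.2]), np.array([0.7]), np.array([0.1])
    W_FC = np.array([0.6, 0.3, 0.1])  # gating output: three elements in [0, 1]

    # Fused output (W_FC(1) in the text corresponds to W_FC[0] here).
    Out = W_FC[0] * W_1_out + W_FC[1] * W_2_out + W_FC[2] * W_3_out
    A = np.array([0.5])                     # true value / label
    loss = float(np.sum((Out - A) ** 2))    # loss = ||Out - A||^2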
The sub-nodes (UE1 and UE2) feed back the above weights (or gradients or gradient changes) to the central node (base station).
After obtaining the model parameter information from the two sub-nodes, which corresponds to the same global model, the central node starts to fuse the two sets of parameters. Taking fusion by averaging as an example, the central node obtains:
i) the fusion result of the general layer, for example W_com = (W_com(UE1) + W_com(UE2))/2;
ii) the fusion results of the three expert models, for example W_k = (W_k(UE1) + W_k(UE2))/2 for k = 1, 2, 3;
iii) the fusion result of the gated network model, for example W_FC = (W_FC(UE1) + W_FC(UE2))/2.
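A sketch of this averaging-based fusion is given below; the dictionary structure and the placeholder parameter values are assumptions for illustration:

    import numpy as np

    def fuse_by_averaging(update_ue1, update_ue2):
        # Average each reported parameter set of the same global model element-wise.
        return {name: (update_ue1[name] + update_ue2[name]) / 2 for name in update_ue1}

    # Usage: the keys correspond to W_com, W_1, W_2, W_3, and W_FC reported by UE1 and UE2.
    fused = fuse_by_averaging(
        {"W_com": np.ones(4), "W_1": np.ones(2), "W_2": np.ones(2), "W_3": np.ones(2), "W_FC": np.ones(3)},
        {"W_com": np.zeros(4), "W_1": np.zeros(2), "W_2": np.zeros(2), "W_3": np.zeros(2), "W_FC": np.zeros(3)},
    )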
After this round of federated learning is completed, the central node delivers new model parameter information to the sub-nodes for the next round of training.
As shown in FIG. 11, which is a schematic diagram of optimal beam training in an example of an embodiment of this application, taking the high-frequency beam management problem as an example, when the network device side performs codebook-based beam sweeping of synchronization signal blocks or CSI-RSs, the channels between the network device and users at different locations (UE1 to UEN, which may also include training nodes whose data characteristics are similar to those of these UEs) are different. The network device delivers a unified global model to the users at different locations. The users at different locations measure the received SSB or CSI-RS beams, for example, measure the L1-RSRP, and feed back the beam identifier corresponding to the maximum RSRP value. AI/ML can be used to train a model that takes, for example, some or all of the plurality of received SSB/CSI-RS signals, or some or all of the strengths (RSRP) of the plurality of received SSB/CSI-RS signals, or the estimated channel as input, to infer the optimal beam ID and feed it back to the network device.
The global model includes one gated network model and N expert models, and may further include other layers (that is, the above-mentioned general layer, which may or may not be trained), where the other layers are connected to the gated network model and the N expert models. The gated network model has N outputs, and each output is connected to the output of one expert model. During the training process of each user, the user inputs the measured RSRP into the other layers, and the training output results of the other layers serve as the common input of the gated network model and the N expert models. Each user trains the gated network model and the N expert models based on this input. Before training, the network device sends the selection parameter to each user; each user determines, according to the selection parameter, whether to retain each of the N second output results of the gated network model, and obtains, according to the retained at least one second output result, the first output result of the at least one expert model corresponding to the retained at least one second output result. Then, each user obtains, based on the selected at least one first output result and the true value information, the model parameter information of the selected at least one expert model and the model parameter information of the gated network model, and sends the model parameter information of the selected at least one expert model and the model parameter information of the gated network model to the network device. In addition, each user obtains the optimal beam identifier based on the selected at least one first output result and the true value information.
Optionally, the central node is a third-party device that performs the above actions related to the central node. For example, the above steps S1001-S1002 and S1006-S1008 are all performed by the third-party device.
Optionally, the sub-node is a third-party device that performs the above actions related to the sub-node. For example, the above steps S1001-S1008 are all performed by the third-party device.
Optionally, the central node is a network device. In this case, the network device can complete the training of the model. For example, the above steps S1001-S1002 and S1006-S1008 are all performed by the network device.
Optionally, the sub-node is a terminal device. In this case, the terminal device can complete the training of the model. For example, the above steps S1001-S1008 are all performed by the terminal device.
Optionally, the central node includes a network device and a third-party device. In one example, the above step S1007 may be performed by a third-party device, such as an OTT or a cloud server, and one or more of the above steps S1001-S1002, S1006, and S1008 may be performed by the network device. In addition, the network device and the third-party device may also communicate with each other to transfer the content transmitted in one or more of the above steps S1001-S1002 and S1006-S1008.
Optionally, the sub-node includes a terminal device and a third-party device. In one example, one or more of the above steps S1003-S1005 may also be performed by a third-party device, such as an OTT or a cloud server, and one or more of the above steps S1001-S1002, S1006, and S1008 may be performed by the terminal device. In addition, the terminal device and the third-party device may also communicate with each other to transfer the content transmitted in one or more of the above steps S1001-S1008.
According to the distributed training method provided in the embodiments of this application, the central node indicates the selection parameter, so that all sub-nodes participating in the training apply the selection parameter to the output result of the gated network model during the training of the gated network model and the expert models; in this way, the sub-nodes align their selection of the output results of the gated network model, which improves the performance of the aggregated model at the central node.
In this application, "sending information to ... (for example, a sub-node)" or the related illustrations in the accompanying drawings can be understood to mean that the destination of the information is the sub-node, and may include sending the information to the sub-node directly or indirectly. "Receiving information from ... (for example, a sub-node)" or the related illustrations in the accompanying drawings can be understood to mean that the source of the information is the sub-node, and may include receiving the information from the sub-node directly or indirectly. The information may undergo necessary processing between the source and the destination of the information transmission, such as format changes, but the destination can understand the valid information from the source. Similar expressions in this application can be understood similarly and are not described again here.
The above mainly describes the solutions provided in the embodiments of this application from the perspective of interaction between the nodes. Correspondingly, an embodiment of this application further provides a distributed training apparatus, and the distributed training apparatus is configured to implement the above methods. The distributed training apparatus may be the central node in the above method embodiments or a component usable for the central node; or the distributed training apparatus may be the sub-node in the above method embodiments or a component usable for the sub-node. It can be understood that, to implement the above functions, the distributed training apparatus includes corresponding hardware structures and/or software modules for performing the functions. A person skilled in the art should readily appreciate that, with reference to the units and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
In the embodiments of this application, the distributed training apparatus may be divided into functional modules according to the above method embodiments; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing unit. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is schematic and is merely a logical functional division; there may be other division manners in actual implementation.
Based on the same concept as the above distributed training method, this application further provides the following distributed training apparatuses:
As shown in FIG. 12, which is a schematic structural diagram of a distributed training apparatus provided in an embodiment of this application, the distributed training apparatus 1200 includes a transceiver unit 1201 and a processing unit 1202, where:
该分布式训练装置用于实现上述方法实施例中子节点的功能时,收发单元1201用于执行如图10所示实施例的步骤S1001、S1002、S1006和S1008中子节点的操作中的一项或多项,以及处理单元1202用于执行如图10所示实施例的步骤S1003-S1005中的一项或多项。可选的,该分布式训练装置可以为终端设备,或者,第三方设备,如OTT或云服务器,或者,可以为终端设备和第三方设备构成的系统。When the distributed training device is used to implement the functions of the subnode in the above method embodiment, the transceiver unit 1201 is used to perform one or more of the operations of the subnode in steps S1001, S1002, S1006 and S1008 of the embodiment shown in Figure 10, and the processing unit 1202 is used to perform one or more of steps S1003-S1005 of the embodiment shown in Figure 10. Optionally, the distributed training device can be a terminal device, or a third-party device such as an OTT or cloud server, or a system composed of a terminal device and a third-party device.
When the distributed training apparatus is configured to implement the functions of the central node in the foregoing method embodiments, the transceiver unit 1201 is configured to perform one or more of the central-node operations in steps S1001, S1002, S1006, and S1008 of the embodiment shown in FIG. 10, and the processing unit 1202 is configured to perform step S1007 of the embodiment shown in FIG. 10. Optionally, the distributed training apparatus may be a network device, a third-party device such as an OTT or cloud server, or a system composed of a network device and a third-party device.
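Correspondingly, a minimal sketch of the central-node side is given below. It assumes that each report received from a sub-node maps expert identifiers to gradient arrays, and it updates the global model by averaging, per expert, only the gradients reported for that expert. The aggregation rule, the learning rate, and all names (aggregate_reports, lr) are illustrative assumptions rather than the claimed procedure.

```python
# Minimal sketch of central-node aggregation of the sub-node gradient reports.
# Assumption: each report maps expert identifiers to gradient arrays of matching shape.
import numpy as np

def aggregate_reports(expert_ws, reports, lr=0.1):
    """Average, per expert, the gradients reported by the sub-nodes that selected it,
    then apply a gradient-descent update to the corresponding global expert weights."""
    sums = {e: np.zeros_like(w) for e, w in enumerate(expert_ws)}
    counts = {e: 0 for e in range(len(expert_ws))}
    for report in reports:
        for e, g in report["gradients"].items():
            sums[e] += g
            counts[e] += 1
    updated = []
    for e, w in enumerate(expert_ws):
        if counts[e]:
            w = w - lr * sums[e] / counts[e]   # update only experts that were selected
        updated.append(w)
    return updated

# Usage: aggregate reports from two hypothetical sub-nodes for a 3-expert global model.
rng = np.random.default_rng(1)
experts = [rng.normal(size=(8, 2)) for _ in range(3)]
reports = [{"gradients": {0: rng.normal(size=(8, 2)), 2: rng.normal(size=(8, 2))}},
           {"gradients": {0: rng.normal(size=(8, 2)), 1: rng.normal(size=(8, 2))}}]
experts = aggregate_reports(experts, reports)
print([w.shape for w in experts])
```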
For specific implementations of the transceiver unit 1201 and the processing unit 1202, refer to the descriptions in the foregoing method embodiments.
In addition, it should be noted that the foregoing transceiver unit and/or processing unit may be implemented by virtual modules. For example, the processing unit may be implemented by a software functional unit or a virtual apparatus, and the transceiver unit may likewise be implemented by a software functional unit or a virtual apparatus. Alternatively, the processing unit or the transceiver unit may be implemented by a physical circuit. For example, if the apparatus is implemented by a chip or chip circuit, the transceiver unit may be an input/output circuit and/or a communication interface that performs an input operation (corresponding to the foregoing receiving operation) and an output operation (corresponding to the foregoing sending operation), and the processing unit is a processing circuit, such as an integrated processor, a microprocessor, or an integrated circuit.
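As a rough illustration of this software/hardware duality, the sketch below models the transceiver unit as an abstract interface with two interchangeable realizations: one a plain software functional unit backed by in-memory queues, and one standing in for an input/output circuit whose read and write operations are supplied externally. The class names and the queue-based "circuit" are purely hypothetical; only the separation of roles is illustrated.

```python
# Sketch of the transceiver-unit abstraction: the same processing logic can sit
# behind either a software functional unit or an input/output-circuit interface.
from abc import ABC, abstractmethod
from collections import deque

class TransceiverUnit(ABC):
    @abstractmethod
    def receive(self): ...
    @abstractmethod
    def send(self, message): ...

class SoftwareTransceiver(TransceiverUnit):
    """Virtual-module realization: messages are exchanged through in-memory queues."""
    def __init__(self):
        self.inbox, self.outbox = deque(), deque()
    def receive(self):
        return self.inbox.popleft() if self.inbox else None
    def send(self, message):
        self.outbox.append(message)

class ChipIOTransceiver(TransceiverUnit):
    """Stand-in for an input/output circuit: input and output operations are
    delegated to externally supplied callables (e.g. register reads/writes)."""
    def __init__(self, read_fn, write_fn):
        self.read_fn, self.write_fn = read_fn, write_fn
    def receive(self):
        return self.read_fn()
    def send(self, message):
        self.write_fn(message)

class ProcessingUnit:
    """Processing logic stays identical regardless of how the transceiver is realized."""
    def __init__(self, transceiver):
        self.transceiver = transceiver
    def step(self):
        msg = self.transceiver.receive()
        if msg is not None:
            self.transceiver.send({"ack": msg})

# Usage: the same ProcessingUnit driven by the software realization.
tx = SoftwareTransceiver()
tx.inbox.append({"selection_parameter": 2})
ProcessingUnit(tx).step()
print(tx.outbox)
```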
The division of modules in this application is schematic and is merely a logical function division; there may be other division manners in actual implementation. In addition, the functional modules in the examples of this application may be integrated into one processor, may exist physically separately, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
As shown in FIG. 13, which is a schematic structural diagram of another distributed training apparatus provided in an embodiment of this application, the distributed training apparatus 1300 includes one or more processing circuits 1301 (one processing circuit is illustrated in the figure). Optionally, the distributed training apparatus 1300 may further include a memory 1303 (shown by a dashed line in the figure). The memory 1303 is configured to store instructions executed by the processing circuit 1301, input data required by the processing circuit 1301 to run the instructions, or data generated after the processing circuit 1301 runs the instructions. Optionally, the distributed training apparatus 1300 may further include an interface circuit 1302 (shown by a dashed line in the figure), and the processing circuit 1301 and the interface circuit 1302 are coupled to each other. It can be understood that the interface circuit 1302 may be a transceiver or an input/output interface.
The processing circuit may be a processor or a circuit in a processor that is used for processing.
When the distributed training apparatus is configured to implement the functions of the sub-node in the foregoing method embodiments, the interface circuit 1302 is configured to perform one or more of the sub-node operations in steps S1001, S1002, S1006, and S1008 of the embodiment shown in FIG. 10, and the processing circuit 1301 is configured to perform one or more of steps S1003 to S1005 of the embodiment shown in FIG. 10.
When the distributed training apparatus is configured to implement the functions of the central node in the foregoing method embodiments, the interface circuit 1302 is configured to perform one or more of the central-node operations in steps S1001, S1002, S1006, and S1008 of the embodiment shown in FIG. 10, and the processing circuit 1301 is configured to perform step S1007 of the embodiment shown in FIG. 10.
When the foregoing distributed training apparatus is a chip applied to the central node, the chip implements the functions of the central node in the foregoing method embodiments. The chip receives information from other modules in the central node, where the information is sent by a sub-node to the central node; or the chip sends information to other modules in the central node, where the information is sent by the central node to a sub-node. When the central node is a network device, the module of the central node here may be a baseband chip of the central node, a CU, a DU, or another module, or may be an apparatus under the open radio access network (O-RAN) architecture, such as an open CU or an open DU. When the central node is a third-party device, the module of the central node here may be a processing chip of the third-party device, where the processing chip may be used to implement AI training.
When the foregoing distributed training apparatus is a chip applied to a sub-node, the chip implements the functions of the sub-node in the foregoing method embodiments. The chip receives information from other modules in the sub-node, where the information is sent by the central node to the sub-node; or the chip sends information to other modules in the sub-node, where the information is sent by the sub-node to the central node. When the sub-node is a terminal device, the module of the sub-node here may be a baseband chip of the sub-node, or a baseband chip and a processing chip, where the processing chip may be used to implement AI training. When the sub-node is a third-party device, the module of the sub-node here may be a processing chip of the third-party device, where the processing chip may be used to implement AI training.
An embodiment of this application further provides a computer-readable storage medium storing a computer program or instructions. When the computer program or instructions are executed, the method in the foregoing embodiments is implemented.
An embodiment of this application further provides a computer program product including instructions. When the instructions are run on a computer, the computer is caused to perform the method in the foregoing embodiments.
An embodiment of this application further provides a distributed training system, including the foregoing distributed training apparatus.
An embodiment of this application further provides a circuit coupled to a memory, where the circuit is configured to perform the method shown in the foregoing embodiments. The circuit may include a chip circuit.
Optionally, an embodiment of this application further provides a chip system, including at least one processor and an interface. The at least one processor is coupled to a memory through the interface, and when the at least one processor runs a computer program or instructions in the memory, the chip system is caused to perform the method in any one of the foregoing method embodiments. Optionally, the chip system may consist of a chip, or may include a chip and other discrete components; this is not specifically limited in the embodiments of this application.
The memory in this application may also be a circuit or any other apparatus capable of implementing a storage function, and is configured to store program instructions and/or data. The memory may be any medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. For example, the memory may be a non-volatile memory, such as a digital versatile disc (DVD), a hard disk drive (HDD), or a solid-state drive (SSD), or may be a volatile memory, such as a random-access memory (RAM).
The terms "including" and "having" and any variants thereof mentioned in the description of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes other steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
It should be understood that, in the description of this application, unless otherwise specified, "/" indicates an "or" relationship between the associated objects; for example, A/B may represent A or B, where A and B may be singular or plural. In addition, in the description of this application, unless otherwise specified, "a plurality of" means two or more than two. "At least one of the following" or a similar expression refers to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c each may be singular or plural. In addition, to clearly describe the technical solutions of the embodiments of this application, the terms "first", "second", and the like are used in the embodiments of this application to distinguish between identical or similar items whose functions and effects are substantially the same. A person skilled in the art may understand that the terms "first", "second", and the like do not limit the quantity or the execution order, and do not necessarily indicate a difference. Meanwhile, in the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be interpreted as being preferred or more advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concepts in a concrete manner for ease of understanding.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner.
Although this application is described herein with reference to the embodiments, in the process of implementing the claimed application, a person skilled in the art may understand and implement other variations of the disclosed embodiments by studying the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other components or steps, and "a" or "an" does not exclude a plurality. A single processor or another unit may fulfill several functions recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that these measures cannot be combined to advantage.
It can be understood that the various numbers involved in the embodiments of this application are merely used for ease of description and distinction, and are not intended to limit the scope of the embodiments of this application. The sequence numbers of the foregoing processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions in other embodiments.
The components in the apparatus of the embodiments of this application may be combined, divided, or deleted according to actual needs. A person skilled in the art may combine the different embodiments described in this specification and the features of the different embodiments.
In this application, provided that no logical contradiction arises, the examples may refer to one another; for example, the methods and/or terms in the method embodiments may refer to one another, the functions and/or terms in the apparatus embodiments may refer to one another, and the functions and/or terms between the apparatus examples and the method examples may refer to one another.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311819054.6 | 2023-12-26 | ||
| CN202311819054.6A CN120218123A (en) | 2023-12-26 | 2023-12-26 | Distributed training method, device, system, chip module and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025140055A1 (en) | 2025-07-03 |
Family
ID=96104869
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/141150 Pending WO2025140055A1 (en) | Distributed training method, apparatus and system, and chip module and storage medium | 2023-12-26 | 2024-12-20 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120218123A (en) |
| WO (1) | WO2025140055A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120803003B (en) * | 2025-09-10 | 2025-11-28 | National University of Defense Technology | Unmanned aerial vehicle body cognition alignment method based on man-machine cooperation |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210073677A1 (en) * | 2019-09-06 | 2021-03-11 | Oracle International Corporation | Privacy preserving collaborative learning with domain adaptation |
| CN112560991A (en) * | 2020-12-25 | 2021-03-26 | Sun Yat-sen University | Personalized federal learning method based on hybrid expert model |
| CN115906921A (en) * | 2022-11-30 | 2023-04-04 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Deep learning model training method, target object detection method and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120218123A (en) | 2025-06-27 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24910983; Country of ref document: EP; Kind code of ref document: A1 |