
WO2023077597A1 - Cell selection method and device - Google Patents


Info

Publication number
WO2023077597A1
Authority
WO
WIPO (PCT)
Prior art keywords
cell
terminal device
reward
information
moment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/135532
Other languages
French (fr)
Chinese (zh)
Inventor
尤心
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202180100906.9A priority Critical patent/CN118044261A/en
Publication of WO2023077597A1 publication Critical patent/WO2023077597A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W36/00: Hand-off or reselection arrangements
    • H04W36/16: Performing reselection for specific purposes
    • H04W36/22: Performing reselection for specific purposes for handling the traffic
    • H04W36/24: Reselection being triggered by specific parameters
    • H04W36/30: Reselection being triggered by specific parameters by measured or perceived connection quality data
    • H04W36/32: Reselection being triggered by specific parameters by location or mobility data, e.g. speed data

Definitions

  • the embodiments of the present application relate to the communication field, and in particular to a method and device for cell selection.
  • in a New Radio (NR) system, traditional handover relies on the measurement report of the terminal device to select the target cell, which may trigger unnecessary handover procedures, such as ping-pong handover.
  • the signaling interaction caused by measurement configuration and measurement reporting may cause a large transmission delay. In this way, the timeliness of the measurement results cannot be guaranteed when the handover command is issued, resulting in handover failure. Therefore, how to perform cell selection to improve the handover success rate is an urgent problem to be solved.
  • the present application provides a method and device for cell selection, which uses a reinforcement learning model to select a cell, which is beneficial for a terminal device to select a suitable cell.
  • a method for cell selection is provided, including: determining at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition; and training a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.
  • a method for cell selection including: using a reinforcement learning model to determine a selected target cell according to state information of a terminal device in multiple cells.
  • a device for cell selection configured to perform the method in the above first aspect or various implementations thereof.
  • the terminal device includes a functional module for executing the method in the above first aspect or its various implementation manners.
  • a device for cell selection configured to execute the method in the above second aspect or various implementations thereof.
  • the network device includes a functional module for executing the method in the above second aspect or each implementation manner thereof.
  • a communication device including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to call and run the computer program stored in the memory to execute the method in the above first aspect or its various implementations.
  • a communication device including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to call and run the computer program stored in the memory to execute the method in the above second aspect or its various implementations.
  • a chip is provided for implementing the method in any one of the above first to second aspects or their implementations.
  • the chip includes: a processor, configured to call and run a computer program from the memory, so that a device installed with the chip executes the method in any one of the above first to second aspects or their implementations.
  • a computer-readable storage medium for storing a computer program, and the computer program causes a computer to execute any one of the above-mentioned first to second aspects or the method in each implementation manner thereof.
  • a ninth aspect provides a computer program product, including computer program instructions, the computer program instructions cause a computer to execute any one of the above first to second aspects or the method in each implementation manner.
  • a computer program which, when running on a computer, causes the computer to execute any one of the above-mentioned first to second aspects or the method in each implementation manner.
  • FIG. 1 is a schematic diagram of a communication system architecture provided by an embodiment of the present application.
  • Fig. 2 is a schematic diagram of a decision flow of cell handover in the related art.
  • Fig. 3 is a schematic flowchart of reinforcement learning.
  • Fig. 4 is a schematic flowchart of a method for cell selection according to an embodiment of the present application.
  • Fig. 5 is a schematic diagram of cell deployment according to an embodiment of the present application.
  • Fig. 6 is a schematic diagram of the maximum selectable positions of scattered points according to an embodiment of the present application.
  • Fig. 7 is a schematic diagram of cell deployment according to another embodiment of the present application.
  • Fig. 8 is a schematic diagram of overlapping sectors according to an embodiment of the present application.
  • Fig. 9 is a schematic flowchart of another method for cell selection according to an embodiment of the present application.
  • Fig. 10 is a schematic block diagram of a device for cell selection according to an embodiment of the present application.
  • Fig. 11 is a schematic block diagram of another device for cell selection according to an embodiment of the present application.
  • Fig. 12 is a schematic block diagram of a communication device provided according to an embodiment of the present application.
  • Fig. 13 is a schematic block diagram of a chip provided according to an embodiment of the present application.
  • the technical solutions of the embodiments of the present application can be applied to various communication systems, such as: the Global System of Mobile communication (GSM) system, the Code Division Multiple Access (CDMA) system, the Wideband Code Division Multiple Access (WCDMA) system, the General Packet Radio Service (GPRS), the Long Term Evolution (LTE) system, the LTE-Advanced (LTE-A) system, the New Radio (NR) system, evolved systems of the NR system, the LTE-based access to unlicensed spectrum (LTE-U) system, the NR-based access to unlicensed spectrum (NR-U) system, Non-Terrestrial Networks (NTN) systems, the Universal Mobile Telecommunications System (UMTS), Wireless Local Area Networks (WLAN), Wireless Fidelity (WiFi), the fifth-generation (5G) communication system, or other communication systems.
  • Optionally, the technical solutions of the embodiments of the present application may also be applied to Device to Device (D2D) communication, Machine to Machine (M2M) communication, Machine Type Communication (MTC), Vehicle to Vehicle (V2V) communication, Vehicle to everything (V2X) communication, etc.
  • the communication system in the embodiment of the present application may be applied to a Carrier Aggregation (CA) scenario, a Dual Connectivity (DC) scenario, or a standalone (SA) network deployment scenario.
  • the communication system in the embodiment of the present application may be applied to an unlicensed spectrum, where the unlicensed spectrum may also be considered as a shared spectrum; or, the communication system in the embodiment of the present application may also be applied to a licensed spectrum, where, Licensed spectrum can also be considered as non-shared spectrum.
  • the embodiments of the present application describe various embodiments in conjunction with network devices and terminal devices, where a terminal device may also be referred to as User Equipment (UE), an access terminal, a user unit, a user station, a mobile station, a remote station, a remote terminal, a mobile device, a user terminal, a terminal, a wireless communication device, a user agent, or a user device.
  • the terminal device may be a station (ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication functions, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device, a wearable device, a terminal device in a next-generation communication system such as an NR network, or a terminal device in a future evolved Public Land Mobile Network (PLMN), etc.
  • the terminal device can be deployed on land, including indoor or outdoor, handheld, wearable, or vehicle-mounted; it can also be deployed on water (such as on ships); it can also be deployed in the air (such as on aircraft, balloons, and satellites).
  • the terminal device may be a mobile phone, a tablet computer (Pad), a computer with a wireless transceiver function, a Virtual Reality (VR) terminal device, an Augmented Reality (AR) terminal device, a wireless terminal device in industrial control, a wireless terminal device in self-driving, a wireless terminal device in remote medicine, a wireless terminal device in a smart grid, a wireless terminal device in transportation safety, a wireless terminal device in a smart city, or a wireless terminal device in a smart home.
  • the terminal device may also be a wearable device.
  • Wearable devices may also be called wearable smart devices, a general term for devices such as glasses, gloves, watches, clothing, and shoes that are developed by applying wearable technology to the intelligent design of everyday wear.
  • A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. Wearable devices are not just hardware devices; they also achieve powerful functions through software support, data interaction, and cloud interaction.
  • Generalized wearable smart devices include devices that are full-featured and large-sized and can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only a certain type of application function and need to be used together with other devices such as smartphones, for example various smart bracelets and smart jewelry for vital-sign monitoring.
  • the network device may be a device for communicating with mobile devices, and may be an Access Point (AP) in WLAN, a Base Transceiver Station (BTS) in GSM or CDMA, a base station (NodeB, NB) in WCDMA, an evolved base station (Evolutional Node B, eNB or eNodeB) in LTE, a relay station or access point, a vehicle-mounted device, a wearable device, a network device (gNB) in an NR network, a network device in a future evolved PLMN network, a network device in an NTN network, etc.
  • the network device may have a mobile feature, for example, the network device may be a mobile device.
  • the network equipment may be a satellite or a balloon station.
  • the satellite may be a low earth orbit (LEO) satellite, a medium earth orbit (MEO) satellite, a geosynchronous earth orbit (GEO) satellite, a high elliptical orbit (HEO) satellite, etc.
  • the network device may also be a base station installed on land, water, and other locations.
  • the network device may provide services for a cell, and the terminal device communicates with the network device through the transmission resources (for example, frequency domain resources or spectrum resources) used by the cell. The cell may be a cell corresponding to a network device (for example, a base station); the cell may belong to a macro base station, or to a base station corresponding to a small cell (Small cell). The small cells here may include a metro cell (Metro cell), a micro cell (Micro cell), a pico cell (Pico cell), a femto cell (Femto cell), etc. These small cells have the characteristics of small coverage and low transmission power, and are suitable for providing high-speed data transmission services.
  • the communication system 100 may include a network device 110, and the network device 110 may be a device for communicating with a terminal device 120 (or called a communication terminal, terminal).
  • the network device 110 can provide communication coverage for a specific geographical area, and can communicate with terminal devices located in the coverage area.
  • FIG. 1 exemplarily shows one network device and two terminal devices.
  • the communication system 100 may include multiple network devices, and the coverage of each network device may include other numbers of terminal devices, which is not limited in this embodiment of the present application.
  • the communication system 100 may further include other network entities such as a network controller and a mobility management entity, which is not limited in this embodiment of the present application.
  • a device with a communication function in the network/system in the embodiment of the present application may be referred to as a communication device.
  • the communication equipment may include a network equipment 110 and a terminal equipment 120 with communication functions.
  • the network equipment 110 and the terminal equipment 120 may be the specific equipment described above, and will not be repeated here.
  • the communication device may also include other devices in the communication system 100, such as network controllers, mobility management entities and other network entities, which are not limited in this embodiment of the present application.
  • the "indication" mentioned in the embodiments of the present application may be a direct indication, may also be an indirect indication, and may also mean that there is an association relationship.
  • A indicates B can mean that A directly indicates B, for example, B can be obtained through A; it can also mean that A indirectly indicates B, for example, A indicates C, and B can be obtained through C; it can also mean that there is an association relation between A and B.
  • the term "corresponding" may indicate that there is a direct or indirect correspondence between the two, may indicate that there is an association between the two, or may indicate a relationship of indicating and being indicated, configuring and being configured, etc.
  • predefinition may be realized by pre-saving corresponding codes, tables, or other means that can be used to indicate related information in devices (for example, including terminal devices and network devices).
  • the implementation method is not limited.
  • pre-defined may refer to defined in the protocol.
  • the "protocol” may refer to a standard protocol in the communication field, for example, it may include the LTE protocol, the NR protocol, and related protocols applied in future communication systems, which is not limited in the present application.
  • the NR system supports the handover process of the UE in the connected state.
  • the system transfers the communication link between the user and the source cell to a new cell, that is, the handover process is performed.
  • Handover preparation including measurement control and reporting, handover request and confirmation.
  • the handover confirmation message includes the handover command generated by the target cell, and the source cell does not allow any modification to the handover command generated by the target cell, and directly forwards the handover command to the UE.
  • Handover execution: the UE executes the handover process immediately after receiving the handover command, that is, the UE disconnects from the source cell and connects to the target cell (for example, performing random access and sending a Radio Resource Control (RRC) handover complete message to the target base station); sequence number (SN) status transfer and data forwarding are also performed.
  • FIG. 2 is a schematic interaction diagram of the handover decision process.
  • the AMF entity provides mobile control information
  • the source gNB decides to perform handover;
  • the source gNB sends a handover request to the target gNB;
  • the target gNB performs admission control
  • the target gNB sends a handover request acknowledgment (Acknowledgment, ACK) to the source gNB;
  • the UE and the source gNB start the radio access network (RAN) handover.
  • the UE receives the measurement configuration of the source base station, and performs measurement and reporting according to the measurement configuration.
  • the source base station decides whether to perform handover based on the UE's measurement report.
  • Reinforcement learning means that the agent (Agent) learns in a "trial and error" manner, guiding its behavior by the rewards obtained through interaction with the environment; the goal is for the agent to obtain the maximum reward.
  • Reinforcement learning differs from supervised learning in connectionist learning mainly in the reinforcement signal.
  • the reinforcement signal provided by the environment in reinforcement learning is an evaluation of the quality of the generated action (usually a scalar signal), rather than telling the reinforcement learning system (RLS) how to produce the correct action. Since little information is provided by the external environment, the RLS must learn from its own experience. In this way, the RLS acquires knowledge in an action-evaluation environment and improves its action plan to suit the environment.
  • Reinforcement learning is mostly used in scenarios that need to interact with the environment.
  • the agent selects a corresponding action (Action) according to a certain policy (Policy); after this Action is executed, the environment changes, that is, the state transitions to a new state S', and the agent obtains a reward value (Reward) after each Action is executed. The agent adjusts its policy according to the size of the reward values obtained, so that after all steps are executed, that is, when the state reaches the terminal state (Terminal), the sum of the obtained Rewards is the largest.
  • Fig. 3 is a schematic diagram of reinforcement learning execution flow.
  • the agent can be understood as a program: it observes the environment to obtain a state, takes an action on that state according to its policy, and obtains a reward; since the environment has changed, the agent obtains a new state and continues to execute.
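The observe-act-reward cycle above can be sketched as a short loop. The environment below is a toy stand-in (its states, dynamics, and reward are illustrative only, not from this application):

```python
class ToyEnvironment:
    """Hypothetical discrete environment used only to illustrate the loop."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def step(self, action):
        """Execute an Action; the state changes and a Reward is returned."""
        self.state = (self.state + action) % self.n_states
        done = self.state == self.n_states - 1   # terminal state reached?
        return self.state, (1.0 if done else 0.0), done

def run_episode(env, policy, max_steps=100):
    """Observe state -> act per Policy -> collect Reward -> new state."""
    state, total_reward = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent picks an Action
        state, reward, done = env.step(action)   # environment changes
        total_reward += reward                   # agent accumulates Reward
        if done:                                 # Terminal: episode ends
            break
    return total_reward

print(run_episode(ToyEnvironment(), policy=lambda s: 1))  # prints 1.0
```

In a real deployment the state would be the terminal device's observations and the actions would be cell choices; here both are abstract integers.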
  • the improvements of the deep Q network (DQN) over Q-learning are: first, using a neural network to approximate the value function; second, using a target Q-network to update the target; and third, using experience replay.
  • it mainly includes three parts: state, action, and reward.
  • the goal is to maximize the reward observed during the interaction between the agent and the environment.
  • the agent observes a set of states from the state space, and selects an action to execute based on the learning strategy from the action space.
  • the decision-making strategy is determined by DQN, and the strategy principle is to maximize the reward for the model.
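As a rough sketch of the three DQN ingredients just listed (a learned approximation of the value function, a target Q-network for computing update targets, and experience replay), the following uses a linear approximator in place of a neural network; all names, sizes, and rates are arbitrary assumptions:

```python
import random
from collections import deque

import numpy as np

class TinyDQN:
    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.9):
        self.W = np.zeros((n_features, n_actions))  # online Q approximator
        self.W_target = self.W.copy()               # target Q-network
        self.replay = deque(maxlen=1000)            # experience replay buffer
        self.lr, self.gamma = lr, gamma

    def q(self, s, target=False):
        """Q-values for every action in state s (linear stand-in for a NN)."""
        return s @ (self.W_target if target else self.W)

    def act(self, s, eps=0.1):
        """Epsilon-greedy action selection from the online network."""
        if random.random() < eps:
            return random.randrange(self.W.shape[1])
        return int(np.argmax(self.q(s)))

    def remember(self, s, a, r, s2, done):
        self.replay.append((s, a, r, s2, done))

    def train_step(self, batch_size=8):
        """Sample past transitions and take a TD step toward the target."""
        batch = random.sample(self.replay, min(batch_size, len(self.replay)))
        for s, a, r, s2, done in batch:
            # The TD target is computed with the *target* network.
            target = r if done else r + self.gamma * np.max(self.q(s2, target=True))
            td_error = target - self.q(s)[a]
            self.W[:, a] += self.lr * td_error * s  # gradient step (linear Q)

    def sync_target(self):
        """Periodically copy the online weights into the target network."""
        self.W_target = self.W.copy()
```

In a full setup, `sync_target()` would be called every fixed number of training steps so that update targets change slowly, which is the stabilizing role of the target Q-network.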
  • the signaling interaction caused by measurement configuration and measurement reporting may cause a large transmission delay, and the timeliness of the measurement results cannot be guaranteed when the handover command is issued, resulting in handover failure. Therefore, how to perform cell handover to improve the handover success rate is an urgent problem to be solved.
  • FIG. 4 is a schematic flowchart of a method 200 for cell selection according to an embodiment of the present application. As shown in FIG. 4, the method 200 includes at least part of the following:
  • the method 200 may be executed by an agent (agent), and the agent may be set on a terminal device or a network device. That is, a terminal device or a network device can use a reinforcement learning model to select a cell, which is conducive to selecting a suitable cell and improving user experience.
  • the cell selection by the terminal device using the reinforcement learning model is taken as an example for illustration, and the implementation manner on the network device side is similar, so details are not repeated here.
  • For example, the reinforcement learning model may be a deep Q network (DQN).
  • the reward condition corresponding to the reinforcement learning model is related to the first information, where the first information includes but is not limited to at least one of the following:
  • the signal quality information of the cell, the coverage area of the cell, the load information of the cell, and the residence time of the terminal device in the cell.
  • the signal quality information of a cell may be characterized by at least one of the following indicators:
  • Reference Signal Receiving Power (RSRP)
  • Reference Signal Receiving Quality (RSRQ)
  • Signal to Interference plus Noise Ratio (SINR)
  • the RSRP is used as an example to describe the signal quality information of a cell, but the present application is not limited thereto.
  • the at least one reward condition includes a target reward condition that the target cell needs to meet.
  • a cell that meets the target reward condition can be considered as a target cell, and a cell that does not meet the target reward condition is not considered a target cell.
  • when the cell selected by the terminal device satisfies the target reward condition, it is considered that the handover is successful; when the cell selected by the terminal device does not satisfy the target reward condition, it is considered that the handover fails.
  • a corresponding reward value is assigned to the at least one reward condition.
  • For example, when the cell selected by the terminal device meets the target reward condition, a larger reward value is configured, such as a positive reward value; when the cell selected by the terminal device does not meet the target reward condition, a smaller reward value is configured, such as a negative reward value.
  • the target reward conditions include at least one of the following:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold
  • the cell has the largest signal quality information among multiple candidate cells
  • the terminal device is located within the coverage of the cell;
  • the load of the cell meets the load threshold
  • the dwell time of the terminal device in the cell is greater than the time threshold.
  • If the signal quality information of the cell is greater than or equal to the signal quality threshold, it may be considered that the signal quality of the cell is good and can meet the transmission requirements of the terminal device.
  • If the load of the cell meets the load threshold, it may be considered that the load capacity of the cell is good and can satisfy the access of the terminal device.
  • If the dwell time of the terminal device in the cell is longer than the duration threshold, the handover of the terminal device can be considered a successful handover.
  • the load of the cell meets a load threshold, comprising:
  • the available load of the cell is greater than or equal to the first load threshold, that is, the cell has sufficient available load;
  • the used load of the cell is less than or equal to the second load threshold, that is, the existing load of the cell is less, in other words, the cell still has enough available load.
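Putting the bullets above together, a cell's eligibility as a target cell can be expressed as one predicate. The field names (`rsrp`, `available_load`, `dwell_time_s`, `in_coverage`) and the threshold values in the example are hypothetical, not taken from this application:

```python
def is_target_cell(cell, ue, thresholds):
    """True when every target reward condition listed above holds."""
    return all([
        cell["rsrp"] >= thresholds["rsrp_min"],            # signal quality >= threshold
        ue["in_coverage"],                                 # UE inside cell coverage
        cell["available_load"] >= thresholds["load_min"],  # load meets load threshold
        ue["dwell_time_s"] > thresholds["dwell_min_s"],    # dwell time > time threshold
    ])

# Hypothetical values purely for illustration.
thresholds = {"rsrp_min": -100.0, "load_min": 0.2, "dwell_min_s": 5.0}
cell = {"rsrp": -90.0, "available_load": 0.5}
ue = {"in_coverage": True, "dwell_time_s": 10.0}
print(is_target_cell(cell, ue, thresholds))  # prints True
```

The load check uses the "available load" form; the "used load" form described above would compare `used_load <= second_load_threshold` instead.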
  • the signal quality threshold may be preconfigured.
  • the signal quality threshold may be preconfigured to the terminal device as environmental data (specifically, a reinforcement learning model in the terminal device).
  • the first load threshold or the second load threshold may be preconfigured.
  • the first load threshold or the second load threshold may be preconfigured to the terminal device (specifically, a reinforcement learning model in the terminal device) as environment data.
  • the duration threshold may be preconfigured.
  • the duration threshold may be preconfigured to the terminal device (specifically, a reinforcement learning model in the terminal device) as environment data.
  • corresponding target reward conditions may be set according to different optimization targets.
  • the target reward condition includes a first reward condition, wherein the first reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, and the signal quality information of the cell is the largest among multiple candidate cells.
  • In this way, performing model training based on the first reward condition is conducive to selecting the cell with the best signal quality.
  • the target reward condition includes a second reward condition, wherein the second reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.
  • performing model training based on the second reward condition is conducive to selecting a cell with the best signal quality and the terminal location meeting the cell coverage.
  • the target reward condition includes a third reward condition, wherein the third reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, and the signal quality information of the cell is the largest among multiple candidate cells.
  • performing model training based on the third reward condition is conducive to selecting a cell with the best signal quality and the cell load meeting the load threshold.
  • the target reward condition includes a fourth reward condition, wherein the fourth reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.
  • performing model training based on the fourth reward condition is conducive to selecting a cell with the best signal quality, the terminal location meets the cell coverage, and the cell load meets the load threshold.
  • the target reward condition includes at least one of the first reward condition, the second reward condition, the third reward condition, and the fourth reward condition.
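The first through fourth reward conditions nest naturally, each adding one constraint to the previous; the sketch below mirrors that structure with hypothetical field names (`rsrp`, `load_ok`):

```python
def has_best_rsrp(cell, candidates):
    """The cell's signal quality is the largest among the candidate cells."""
    return cell["rsrp"] == max(c["rsrp"] for c in candidates)

def first_condition(cell, candidates, rsrp_min):
    """Signal quality >= threshold and best among candidates."""
    return cell["rsrp"] >= rsrp_min and has_best_rsrp(cell, candidates)

def second_condition(cell, candidates, rsrp_min, in_coverage):
    """First condition plus: the UE is within the cell's coverage."""
    return first_condition(cell, candidates, rsrp_min) and in_coverage

def third_condition(cell, candidates, rsrp_min):
    """First condition plus: the cell's load meets the load threshold."""
    return first_condition(cell, candidates, rsrp_min) and cell["load_ok"]

def fourth_condition(cell, candidates, rsrp_min, in_coverage):
    """Third condition plus coverage, i.e. all constraints combined."""
    return third_condition(cell, candidates, rsrp_min) and in_coverage
```

Here `load_ok` stands in for "the load of the cell meets the load threshold", and `candidates` is the set of neighbouring cells being compared.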
  • the plurality of candidate cells may include all neighboring cells around the current location of the terminal device.
  • the terminal device may sequentially train the reinforcement learning model based on different target reward conditions.
  • For example, the training may be performed based on each target reward condition in sequence, in order from fewer to more constraints in the reward condition.
  • the reinforcement learning model is trained based on the first reward condition; when it converges, it is trained based on the second reward condition; when it converges again, it is trained based on the third reward condition; and when it converges once more, it is trained based on the fourth reward condition.
  • the reward value when the selected cell satisfies different reward conditions may be defined.
  • for example, when the selected cell is the target cell, a first reward value is given; otherwise, a second reward value is given.
  • the first reward value is greater than the second reward value.
  • it can be set that cells satisfying the aforementioned first reward condition, second reward condition, third reward condition and fourth reward condition correspond to the same reward value, or to different reward values.
  • for example, when the terminal device selects a cell that satisfies the first, second, third or fourth reward condition, the same first reward value is given.
  • alternatively, when the terminal device selects a cell satisfying the first reward condition, reward value X is given; when it selects a cell satisfying the second reward condition, reward value Y is given; when it selects a cell satisfying the third reward condition, reward value Z is given; and when it selects a cell satisfying the fourth reward condition, reward value P is given, where X ≤ Y ≤ Z ≤ P.
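A sketch of the graded reward assignment. Only the ordering X ≤ Y ≤ Z ≤ P is taken from the text; the numeric levels, the `miss` penalty, and the function shape are assumptions:

```python
# Graded reward levels; the text only fixes the ordering X <= Y <= Z <= P,
# the numeric values and the miss penalty are illustrative assumptions.
X, Y, Z, P = 1.0, 2.0, 3.0, 4.0

def graded_reward(meets_first, meets_second, meets_third, meets_fourth, miss=-1.0):
    """Reward for a selected cell given which reward conditions it
    satisfies; the strictest satisfied condition wins."""
    if meets_fourth:
        return P   # signal quality + coverage + load all satisfied
    if meets_third:
        return Z
    if meets_second:
        return Y
    if meets_first:
        return X
    return miss    # assumed penalty when no reward condition holds
```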
  • the reinforcement learning model used for cell selection may be trained according to the state space of the terminal device, the aforementioned at least one reward condition, and the corresponding reward value.
  • the reinforcement learning model may be trained according to the state space of the terminal device, the behavior space of the terminal device, the aforementioned at least one reward condition, and the corresponding reward value.
  • the state space of the terminal device may be used to describe state information of the terminal device in multiple cells.
  • the state space of the terminal device includes state information of the terminal device at multiple moments, for example, includes state information of the terminal device at a first moment.
  • the status information of the terminal device at the first moment includes but is not limited to at least one of the following:
  • cell identification information, such as the Cell ID;
  • signal quality information of the cell to which the terminal device belongs at the first moment, such as the RSRP value;
  • the switching status information of the terminal device at the first moment is used to indicate whether the terminal device is switched at the first moment;
  • the location information of the terminal device at the first moment, such as its three-dimensional coordinates.
  • the state information of terminal device n at time t can be expressed as
  • RSRP_t^n represents the RSRP of the cell to which terminal device n belongs at time t.
  • a value of 0 means no switching
  • a value of 1 means switching.
  • the basis for judging whether to switch is: whether the cell selected by the terminal device at time t has changed from the cell selected at the previous time.
  • P_t^n represents the location information of terminal device n at time t.
  • the above status information of the terminal device is only an example, and in other embodiments, the status information of the terminal device may also include other information, such as cell load information, or other information used to assist in cell selection. There is no limit to this.
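The per-moment state described above (cell identity, RSRP, handover flag, three-dimensional position) can be sketched as a small record; the field names are assumptions:

```python
from typing import NamedTuple, Tuple

class UEState(NamedTuple):
    cell_id: int                          # identity of the serving cell at time t
    rsrp: float                           # RSRP of the serving cell at time t
    handover: int                         # 0 = no handover at time t, 1 = handover
    position: Tuple[float, float, float]  # three-dimensional coordinates of the terminal

def handover_flag(prev_cell_id, cell_id):
    # a handover occurred iff the cell chosen at t differs from the one at t-1
    return 0 if prev_cell_id == cell_id else 1
```

Extra fields such as cell load could be appended to the record without changing the rest of the sketch.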
  • the behavior space of the terminal device may include behavior information of the terminal device at multiple moments, for example, behavior information of the terminal device at the first moment, which is used to indicate that the terminal device selects a certain cell at the first moment.
  • the behavior information of terminal device n at time t can be expressed as a value indicating the cell selected by terminal device n at time t.
  • the terminal device may determine the corresponding reward value according to the state information at each time in the multiple moments in combination with the aforementioned at least one reward condition.
  • for example, it is determined whether the cell to which the terminal device belongs at the first moment is the target cell, and the corresponding reward value is further determined.
  • for example, if the RSRP of the cell to which the terminal device belongs meets the RSRP threshold at the first moment, the cell is determined to be the target cell, and the first reward value is given.
  • if the RSRP of the cell to which the terminal device belongs does not meet the RSRP threshold at the first moment, it is determined that the cell is not the target cell, and a second reward value is given.
  • the aforementioned first reward condition and corresponding reward value can be defined as:
  • the aforementioned second reward condition and corresponding reward value can be defined as:
  • the aforementioned third reward condition and corresponding reward value can be defined as:
  • the aforementioned fourth reward condition and corresponding reward value can be defined as:
  • the state space and behavior space of the terminal device may be acquired according to a simulated action trajectory of the terminal device.
  • for example, the terminal device randomly selects a cell within the range of selectable cells as the starting point of the trajectory, and determines, at the current time and coordinates, the cell to which it belongs, the signal quality information of that cell, the handover state information, and so on, that is, the state information at the current moment. It is further determined whether the cell is a target cell based on the aforementioned target reward condition, and the corresponding reward value is then obtained.
  • the terminal device starts to move.
  • for example, the terminal device uses the probability hyperparameter and the Q strategy to select the cell to switch to. After switching to this cell, it can obtain the state information at the coordinates after the next move, that is, the cell information after the handover, the signal quality information of the cell after the handover, the handover state information, and the like.
  • the corresponding reward value can be obtained.
  • state switching can be established, and state switching samples can be obtained, including state information at the current moment, action (that is, which cell to switch to), reward value, and state information at the next moment.
  • the state switching samples are then stored in the experience pool for training the reinforcement learning model.
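The experience pool described above can be sketched as a bounded buffer of (state, action, reward, next state) samples; the capacity and the sampling interface are assumptions:

```python
import random
from collections import deque

class ExperiencePool:
    """Bounded pool of state-switching samples for training the model."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first

    def push(self, state, action, reward, next_state):
        # `action` is the index of the cell switched to
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # random mini-batch for one training step
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```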
  • the selection of a cell by the terminal device may be regarded as corresponding to one handover, that is, the selection of a cell by the terminal device may be equivalent to switching to the cell by the terminal device.
  • the terminal device may determine the immediate reward value corresponding to cell selection (or cell handover) according to the state space of the terminal device. In other embodiments, the terminal device may also determine a delayed reward value for cell selection (or cell handover) by considering the state of a cell some time after the terminal device has switched to it.
  • for example, the terminal device can determine the event type corresponding to a handover according to the dwell time in the cell, for example, whether it is a ping-pong handover, a premature handover, or a handover to a wrong cell, and give the corresponding delayed reward value. This is beneficial for avoiding handover events such as ping-pong handover, premature handover, and handover to a wrong cell.
  • for example, when the cell selected by the terminal device satisfies the aforementioned target reward condition and is different from the cell selected at the previous moment, it may be considered a successful handover, and a third reward value is given.
  • when the cell selected by the terminal device satisfies the target reward condition but is the same as the cell selected at the previous moment, it can be considered a ping-pong handover, and a fourth reward value is given, where the fourth reward value is smaller than the third reward value.
  • when the handover event is a premature handover, a fifth reward value is given, where the fifth reward value is smaller than the third reward value.
  • when the handover event is a handover to a wrong cell, a sixth reward value is given, where the sixth reward value is smaller than the third reward value.
  • the cumulative reward value for this handover is determined according to the dwell time of the terminal device in the cell. For example, if the cell selected by the terminal device at the previous moment is the same as the cell selected at the current moment, the reward value will be accumulated, and the longer the dwell time, the higher the accumulated reward value.
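The delayed-reward logic above (successful handover, ping-pong, premature, wrong cell, dwell-time accumulation) can be sketched as follows. The numeric levels and the way events are passed in are assumptions; only the requirement that the third reward value be the largest comes from the text:

```python
# Assumed reward levels; the text only requires that the third reward value
# (successful handover) be larger than the fourth, fifth and sixth.
R_SUCCESS, R_PINGPONG, R_PREMATURE, R_WRONG = 3.0, 1.0, 0.5, 0.2

def handover_reward(meets_target, prev_cell, new_cell, event=None):
    """Delayed reward for one selection step; `event` marks a premature or
    wrong-cell handover detected from the dwell time afterwards."""
    if event == "premature":
        return R_PREMATURE          # fifth reward value
    if event == "wrong_cell":
        return R_WRONG              # sixth reward value
    if meets_target and new_cell != prev_cell:
        return R_SUCCESS            # third reward value: successful handover
    if meets_target and new_cell == prev_cell:
        return R_PINGPONG           # fourth reward value: ping-pong case
    return 0.0                      # assumed default when no condition holds

def dwell_bonus(steps_in_cell, per_step=0.1):
    # the longer the dwell time in the same cell, the larger the accumulated reward
    return steps_in_cell * per_step
```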
  • the terminal device also needs to acquire environment data for model training, for example, location coordinates of the terminal device, location coordinates of the base station, signal quality information of the cell, and the like.
  • environment data may be collected in any network environment, which is not limited in this application.
  • the following describes the collection of environmental data in two typical network scenarios.
  • Step 1 Set up the network environment.
  • the network environment may be an urban microcell (UMI) scenario, and the base station is lower than surrounding buildings.
  • the scene layout is a hexagonal network; considering 19 micro base stations, each with 3 sectors, there are 57 cells in total.
  • Fig. 5 is a schematic diagram of cell deployment.
  • Step 2 Determine the trajectory of the end device.
  • determine the starting point of the terminal equipment's movement, for example, using the scatter-point positioning method.
  • the starting position of the terminal device can be randomly selected within the scatter-point range.
  • Fig. 6 is a schematic diagram of the scatter points of the terminal equipment's optional positions.
  • each step can choose four directions: up, down, left, and right, and the direction is selected randomly.
  • Step 3 Determine the environment data, such as the base station locations, the terminal equipment locations along the movement track, the signal quality of the cells, and other information.
  • Base station location coordinates: the data dimension is 3×19, covering the three-dimensional coordinates of the 19 base stations.
  • Terminal equipment location coordinates: the data dimension is 3×1681, covering the 1681 three-dimensional coordinates of the terminal's optional trajectory points.
  • RSRP: the data dimension is 1681×57, covering all RSRP values of the 57 cells at each of the 1681 terminal points.
  • RSRP threshold: the selected cell must meet the RSRP threshold; for example, the threshold is set to -114dB.
  • Cell load threshold: loads are randomly assigned to all cells, for example uniformly distributed between 0 and 20; the selected cell must meet the cell load threshold, which is set to 15.
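The scenario-1 environment data can be sketched with the stated dimensions; the uniform random generators below are placeholders for the actual UMI channel model and map geometry:

```python
import numpy as np

rng = np.random.default_rng(0)

N_BS, N_CELLS, N_POINTS = 19, 57, 1681  # 19 micro base stations x 3 sectors

bs_coords = rng.uniform(0.0, 500.0, size=(3, N_BS))          # 3 x 19 base-station coordinates
ue_coords = rng.uniform(0.0, 500.0, size=(3, N_POINTS))      # 3 x 1681 optional trajectory points
rsrp = rng.uniform(-130.0, -70.0, size=(N_POINTS, N_CELLS))  # 1681 x 57 RSRP map (placeholder)

RSRP_THRESHOLD = -114.0                           # selected cell must meet this threshold
cell_load = rng.uniform(0.0, 20.0, size=N_CELLS)  # random load per cell, between 0 and 20
LOAD_THRESHOLD = 15.0

# cells admissible for selection at one trajectory point
point = 0
admissible = (rsrp[point] >= RSRP_THRESHOLD) & (cell_load <= LOAD_THRESHOLD)
```

Scenario 2 below differs only in the dimensions (38 base stations, 114 cells).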
  • Step 1 Set up the network environment.
  • the network environment may be an urban microcell (UMI) scenario, and the base station is lower than surrounding buildings.
  • the scene layout is a hexagonal network; considering 38 micro base stations, each with 3 sectors, there are 114 cells in total.
  • FIG. 7 is a schematic diagram of cell deployment, and FIG. 8 is a schematic diagram of overlapping sectors.
  • Step 2 Determine the trajectory of the end device.
  • determine the starting point of the terminal equipment's movement, for example, using the scatter-point positioning method.
  • the starting position of the terminal device can be randomly selected within the scatter-point range.
  • Fig. 5 is a schematic diagram of the scatter points of the terminal equipment's optional positions.
  • each step can choose four directions: up, down, left, and right, and the direction is selected randomly.
  • Step 3 Determine the environment data, such as the base station locations, the terminal equipment locations along the movement track, the signal quality of the cells, and other information.
  • Base station location coordinates: the data dimension is 3×38, covering the three-dimensional coordinates of the 38 base stations.
  • Terminal equipment location coordinates: the data dimension is 3×1681, covering the 1681 three-dimensional coordinates of the terminal's optional trajectory points.
  • RSRP: the data dimension is 1681×114, covering all RSRP values of the 114 cells at each of the 1681 terminal points.
  • RSRP threshold: the selected cell must meet the RSRP threshold; for example, the threshold is set to -114dB.
  • Cell load threshold: loads are randomly assigned to all cells, for example uniformly distributed between 0 and 20; the selected cell must meet the cell load threshold, which is set to 15.
  • the training process of the model will be described by taking the reinforcement learning model as the DQN model as an example.
  • Step 1 Initialize the DQN model; for example, set the number of training rounds (1000 rounds as an example) and the batch size (for example, 64 or 128).
  • Step 2 reset the state space of the terminal device.
  • a location is randomly selected as the starting point of the trajectory of the terminal device. Taking the scenario shown in FIG. 6 as an example, a certain location in cell 1 (denoted as location 1) is selected as the starting point, and the current moment is recorded as the first moment.
  • the terminal device determines its state information at the first moment, that is, at the coordinates of the current location (namely, location 1): the cell to which the terminal device belongs (such as cell 1), the RSRP of that cell, and the corresponding switching state (at the initial moment, the switching state is 0, indicating no handover).
  • for example, it is determined according to the aforementioned fourth reward condition whether the cell to which the terminal device belongs is the target cell. Further, the reward value corresponding to the current state of the terminal device is determined, which can also be regarded as the reward value for the terminal device selecting cell 1, or the reward value for the terminal device switching to cell 1.
  • Step 3 The terminal device starts to move.
  • for example, the terminal device uses the probability hyperparameter and the Q strategy to select a cell to switch to, and obtains the state information at the coordinates after the move (recorded as location 2), that is, the cell information after the handover, the signal quality information of the cell after the handover, the handover state information, and the like.
  • for example, the state information at the second moment may include the identification information of cell 6, the signal quality information of cell 6, the location information of the terminal device, and the corresponding switching state (for example, a value of 1 indicating that a handover has occurred).
  • it is determined, according to the state information at the second moment, whether the cell to which the terminal device belongs (such as cell 6) is a target cell; the specific judgment condition is related to the optimization target and is determined, for example, according to the aforementioned first reward condition, second reward condition, third reward condition or fourth reward condition. Further, the reward value corresponding to the current state of the terminal device is determined, which can also be regarded as the reward value for the terminal device selecting cell 6, or the reward value for the terminal device switching to cell 6.
  • state switching can be established, and state switching samples can be obtained, including state information at the current moment, action (that is, which cell to switch to), reward value, and state information at the next moment.
  • the state switching samples are then stored in the experience pool for training the reinforcement learning model.
  • the state information at the current moment in the state switching sample can be the state information at the first moment
  • the action can be switching to cell 6
  • the state information at the next moment can be the state information at the second moment
  • the reward value is the reward value for switching to cell 6.
  • Step 4 When the number of samples in the experience pool is greater than the batch size, select (for example, randomly) a batch of samples from the experience pool, and use them to train the DQN model.
  • Step 5 Return to step 3 until the trajectory exploration of 1000 steps is completed.
  • Step 6 Output the reward training map.
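Steps 1 to 6 can be sketched end to end. As a runnable stand-in for the DQN, the sketch uses a tabular Q function and a toy mobility and reward model; every numeric choice here (number of states, epsilon, learning rate, toy reward) is an assumption for illustration, not a value from this application:

```python
import random
from collections import defaultdict, deque

random.seed(0)

N_CELLS, STEPS, BATCH = 4, 200, 16   # toy sizes; the text uses 1000 steps, batch 64/128
EPS, ALPHA, GAMMA = 0.1, 0.5, 0.9    # probability hyperparameter, learning rate, discount

Q = defaultdict(float)       # Q[(state, action)]; a tabular stand-in for the DQN
pool = deque(maxlen=1000)    # experience pool of state-switching samples

def toy_reward(state, action):
    # placeholder for checking the target reward condition of the selected cell
    return 1.0 if action == state % N_CELLS else -1.0

state = 0                                              # step 2: reset / trajectory start
for _ in range(STEPS):                                 # step 3: the terminal moves
    if random.random() < EPS:                          # epsilon-greedy cell selection
        action = random.randrange(N_CELLS)
    else:
        action = max(range(N_CELLS), key=lambda a: Q[(state, a)])
    next_state = (state + 1) % 8                       # toy mobility model
    pool.append((state, action, toy_reward(state, action), next_state))
    if len(pool) >= BATCH:                             # step 4: mini-batch training
        for s, a, r, s2 in random.sample(pool, BATCH):
            target = r + GAMMA * max(Q[(s2, a2)] for a2 in range(N_CELLS))
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    state = next_state                                 # step 5: continue the trajectory
```

After training, the learned Q values favor the action (cell) that the toy reward marks as correct in each state.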
  • the DQN model may be trained according to the aforementioned first reward condition, second reward condition, third reward condition and fourth reward condition in sequence.
  • the reward conditions take into account various factors of cell selection (for example, the signal quality of the cell, the load, and the coverage of the cell). Further, according to the at least one reward condition and the corresponding reward value, the reinforcement learning model is trained using the terminal device's historical cell-selection trajectories as experience, which is conducive to selecting a suitable cell and avoiding problems such as ping-pong handover and premature handover in traditional cell selection.
  • the method for cell selection according to the embodiment of the present application is described above in conjunction with FIG. 4 to FIG. 8 from the perspective of model training.
  • the method for cell selection according to another embodiment of the present application is described in detail below in conjunction with FIG. 9.
  • FIG. 9 is a schematic flowchart of a method 300 for cell selection according to another embodiment of the present application. As shown in FIG. 9, the method 300 includes at least part of the following content:
  • the method 300 may be executed by a terminal device, or may also be executed by a network device.
  • the reinforcement learning model may be obtained through training using the method described in method 200 .
  • the reinforcement learning model may be, for example, a deep Q-network (DQN) model.
  • the state information of the terminal device in multiple cells may correspond to the state space of the terminal device in the method 200.
  • the method 300 may be a testing method after reinforcement learning model training.
  • the method 300 may also include:
  • whether the handover is successful is determined according to whether the target cell satisfies a preset condition; that is, the handover success rate of cell handover based on the reinforcement learning model is determined.
  • the preset condition is related to at least one of the following:
  • the signal quality information of the cell, the coverage area of the cell, the load information of the cell, and the residence time of the terminal device in the cell.
  • the preset conditions include at least one of the following:
  • the signal quality information of the target cell is greater than or equal to the signal quality threshold
  • the signal quality information of the target cell among multiple candidate cells is the largest
  • the terminal device is located within the coverage of the target cell
  • the load of the target cell meets the load threshold
  • the dwell time of the terminal device in the target cell is longer than the time threshold.
  • the load of the target cell meets a load threshold, including:
  • the available load of the target cell is greater than or equal to the first load threshold
  • the used load of the target cell is less than or equal to the second load threshold.
  • the preset condition may correspond to the target reward condition in the aforementioned method 200 .
  • the preset condition may be the first reward condition.
  • the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the first preset condition.
  • the preset condition may be the second reward condition.
  • the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the target cell, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the second preset condition.
  • the preset condition may be the third reward condition.
  • the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the load of the target cell meets the load threshold, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the third preset condition.
  • the preset condition may be the fourth reward condition.
  • the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the load of the target cell meets the load threshold, the terminal device is located within the coverage of the target cell, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the fourth preset condition.
  • the state information of the terminal device in multiple cells includes the state information of the terminal device in the first cell.
  • the state information of the terminal device in the first cell includes at least one of the following:
  • signal quality information of the first cell, load information of the first cell, whether the terminal device is within the coverage of the first cell, location information of the terminal device, location information of the cell (or coverage information of the cell), and the handover state.
  • the handover state is used to indicate whether the first cell is the same as the cell selected at the last moment, for example, if they are the same, it indicates that handover has not occurred; otherwise, it indicates that handover has occurred.
  • the S310 includes:
  • a reinforcement learning model is used to determine a handover target cell according to at least one reward condition and its corresponding reward value, as well as state information of the terminal equipment in multiple cells.
  • a target cell among the plurality of cells is determined according to a reward value for the terminal device to switch to each cell.
  • the cell with the largest reward value is selected as the target cell.
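Selecting the cell with the largest reward value is a simple argmax; `q_values` below stands in for the trained model's per-cell output:

```python
def select_target_cell(q_values):
    """Return the index of the cell with the largest predicted reward value;
    `q_values` holds one model output per candidate cell."""
    best = 0
    for i, value in enumerate(q_values):
        if value > q_values[best]:
            best = i
    return best
```

Ties resolve to the lowest index, an arbitrary but deterministic choice.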
  • test method of the reinforcement learning model will be described below in combination with specific embodiments.
  • Step 1 Reset the state space of the terminal device.
  • Step 2 The terminal device starts to move, and the terminal device uses the trained reinforcement learning model to determine which cell to switch to, assuming that the reinforcement learning model determines to switch to cell X.
  • Step 3 Determine whether the handover is successful according to whether the cell X satisfies a preset condition.
  • Step 4 Output a simulation diagram of the handover success rate results.
  • the reinforcement learning model can be tested separately according to different preset conditions.
  • for example, first test the reinforcement learning model according to the first preset condition; when the handover success rate meets the requirement, test it according to the second preset condition; if the handover success rate again meets the requirement, test it according to the third preset condition; and when the handover success rate once more meets the requirement, test it according to the fourth preset condition.
  • the method 300 may be a method for using a reinforcement learning model.
  • for example, a terminal device may report its state information in multiple cells to the network device, and the network device may use a reinforcement learning model to determine, according to that state information, a target cell for the terminal device to switch to, and further instruct the terminal device to switch to the target cell.
  • the terminal device may use a reinforcement learning model to determine a target cell for the terminal device to switch to according to status information of the terminal device in multiple cells. Further, the terminal device initiates handover to the target cell.
  • Fig. 10 shows a schematic block diagram of a device 400 for cell selection according to an embodiment of the present application.
  • the device 400 includes:
  • a processing unit 410, configured to determine at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition, and to train a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.
  • the at least one reward condition includes a target reward condition that the target cell needs to meet, and the target reward condition includes at least one of the following:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold
  • the cell has the largest signal quality information among multiple candidate cells
  • the terminal device is located within the coverage of the cell;
  • the load of the cell meets the load threshold
  • the dwell time of the terminal device in the cell is longer than the time threshold.
  • the load of the cell meets the load threshold, including:
  • the available load of the cell is greater than or equal to the first load threshold; and/or
  • the used load of the cell is less than or equal to the second load threshold.
  • the target reward condition includes a first reward condition, wherein the first reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, and the signal quality information of the cell is the largest among multiple candidate cells.
  • the target reward condition includes a second reward condition, wherein the second reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.
  • the target reward condition includes a third reward condition, wherein the third reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, and the signal quality information of the cell is the largest among multiple candidate cells.
  • the target reward condition includes a fourth reward condition, wherein the fourth reward condition includes:
  • the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.
  • the determining the reward value corresponding to the at least one reward condition includes:
  • the first reward value is greater than the second reward value.
  • the processing unit 410 is further configured to:
  • when the handover event is a handover to a wrong cell, giving a sixth reward value
  • the third reward value is greater than the fourth reward value
  • the third reward value is greater than the fifth reward value
  • the third reward value is greater than the sixth reward value.
  • the processing unit 410 is further configured to:
  • the state space of the terminal device includes state information of the terminal device at multiple moments, and the behavior space of the terminal device includes behavior information of the terminal device at multiple moments;
  • the reinforcement learning model is trained according to the state space and behavior space of the terminal device, and the at least one reward condition and the reward value corresponding to the at least one reward condition.
  • the state information of the terminal device at multiple moments includes state information at a first moment, and the state information at the first moment includes at least one of the following:
  • the switching status information of the terminal device at the first moment is used to indicate whether the terminal device is switched at the first moment;
  • the location information of the terminal device at the first moment.
  • the handover state information of the terminal device at the first moment is determined based on whether the cell to which the terminal device belongs at the first moment is the same as the cell to which it belongs at the second moment, wherein the second moment is the moment immediately preceding the first moment.
  • the behavior information of the terminal device at multiple moments includes behavior information of the terminal device at a first moment, and the behavior information of the terminal device at the first moment is used to indicate that the terminal device selects the first cell at the first moment.
  • the processing unit 410 is further configured to:
  • the state information at the second moment, the behavior information at the first moment, the reward value corresponding to the behavior at the first moment and the state information at the first moment are stored in the experience pool, wherein the second moment is a moment immediately preceding said first moment;
  • the reinforcement learning model is trained using the experience pool.
  • the reinforcement learning model includes a deep Q-network model.
  • the above-mentioned communication unit may be a communication interface or a transceiver, or an input-output interface of a communication chip or a system-on-chip.
  • the aforementioned processing unit may be one or more processors.
  • the device 400 according to the embodiment of the present application may correspond to the terminal device or the network device in the method embodiments of the present application, and the above-mentioned and other operations and/or functions of each unit in the device 400 are respectively intended to realize the corresponding processes of the terminal device or the network device in the method 200 shown in FIG. 4 to FIG. 8, which are not repeated here for the sake of brevity.
  • Fig. 11 shows a schematic block diagram of a device 500 for cell selection according to an embodiment of the present application.
  • the device 500 includes:
  • the processing unit 510 is configured to use a reinforcement learning model to determine a selected target cell according to state information of the terminal device in multiple cells.
  • processing unit 510 is further configured to:
  • Whether the handover is successful is determined according to whether the target cell satisfies a preset condition.
  • the preset conditions include at least one of the following:
  • the signal quality information of the target cell is greater than or equal to a signal quality threshold
  • the target cell has the largest signal quality information among multiple candidate cells
  • the terminal device is located within the coverage of the target cell
  • the load of the target cell meets a load threshold
  • the dwell time of the terminal device in the target cell is greater than a time threshold.
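The preset conditions listed above can be checked with a simple predicate. The sketch below is illustrative only: the field names and threshold values are hypothetical, and an actual embodiment may use any subset of the conditions rather than all of them.

```python
def handover_successful(cell, candidates, dwell_time,
                        quality_threshold=-100.0, load_threshold=0.8,
                        time_threshold=5.0):
    """Hypothetical check of the preset conditions for a successful handover."""
    return (
        cell["rsrp"] >= quality_threshold                       # signal quality >= threshold
        and cell["rsrp"] == max(c["rsrp"] for c in candidates)  # largest among candidate cells
        and cell["in_coverage"]                                 # terminal inside target coverage
        and cell["load"] <= load_threshold                      # load meets the load threshold
        and dwell_time > time_threshold                         # dwell time exceeds time threshold
    )

target = {"rsrp": -80.0, "load": 0.5, "in_coverage": True}
others = [target, {"rsrp": -95.0, "load": 0.3, "in_coverage": True}]
ok = handover_successful(target, others, dwell_time=12.0)
```

In this hypothetical example all five conditions hold, so the handover is judged successful.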
  • the state information of the terminal device in multiple cells includes state information of the terminal device in a first cell, wherein the state information of the terminal device in the first cell includes at least one of the following:
  • the signal quality information of the first cell, the load information of the first cell, and whether the terminal device is within the coverage of the first cell.
  • processing unit 510 is further configured to:
  • a target cell among the plurality of cells is determined according to a reward value for the terminal device to switch to each cell.
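One hedged reading of "determining the target cell according to the reward value for switching to each cell" is an argmax over per-cell values, for example the Q-values produced by the trained model. The sketch below is a minimal illustration with hypothetical cell names and reward values.

```python
def select_target_cell(reward_by_cell):
    """Pick the cell whose predicted handover reward (e.g. Q-value) is largest."""
    return max(reward_by_cell, key=reward_by_cell.get)

# hypothetical per-cell reward values predicted by the reinforcement learning model
rewards = {"cell_A": 4.2, "cell_B": 7.9, "cell_C": -1.0}
target = select_target_cell(rewards)  # -> "cell_B"
```

Ties and exploration (e.g. epsilon-greedy selection during training) are deliberately left out, since the embodiment does not specify them.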
  • the above-mentioned communication unit may be a communication interface or a transceiver, or an input-output interface of a communication chip or a system-on-chip.
  • the aforementioned processing unit may be one or more processors.
  • the device 500 may correspond to the terminal device or the network device in the method embodiments of the present application, and the above-mentioned and other operations and/or functions of each unit in the device 500 are respectively intended to implement the corresponding processes of the terminal device or the network device in the method 300, which are not repeated here for the sake of brevity.
  • Fig. 12 is a schematic structural diagram of a communication device 600 provided by an embodiment of the present application.
  • the communication device 600 shown in FIG. 12 includes a processor 610, and the processor 610 can call and run a computer program from a memory, so as to implement the method in the embodiment of the present application.
  • the communication device 600 may further include a memory 620 .
  • the processor 610 can invoke and run a computer program from the memory 620, so as to implement the method in the embodiment of the present application.
  • the memory 620 may be an independent device independent of the processor 610 , or may be integrated in the processor 610 .
  • the communication device 600 may further include a transceiver 630, and the processor 610 may control the transceiver 630 to communicate with other devices, specifically, to send information or data to other devices, or to receive information or data sent by other devices.
  • the transceiver 630 may include a transmitter and a receiver.
  • the transceiver 630 may further include antennas, and the number of antennas may be one or more.
  • the communication device 600 may specifically be the network device of the embodiments of the present application, and the communication device 600 may implement the corresponding processes implemented by the network device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the communication device 600 may specifically be the mobile terminal/terminal device of the embodiments of the present application, and the communication device 600 may implement the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • FIG. 13 is a schematic structural diagram of a chip according to an embodiment of the present application.
  • the chip 700 shown in FIG. 13 includes a processor 710, and the processor 710 can call and run a computer program from a memory, so as to implement the method in the embodiment of the present application.
  • the chip 700 may further include a memory 720 .
  • the processor 710 can invoke and run a computer program from the memory 720, so as to implement the method in the embodiment of the present application.
  • the memory 720 may be an independent device independent of the processor 710 , or may be integrated in the processor 710 .
  • the chip 700 may also include an input interface 730 .
  • the processor 710 can control the input interface 730 to communicate with other devices or chips, specifically, can obtain information or data sent by other devices or chips.
  • the chip 700 may also include an output interface 740 .
  • the processor 710 can control the output interface 740 to communicate with other devices or chips, specifically, can output information or data to other devices or chips.
  • the chip can be applied to the network device in the embodiments of the present application, and the chip can implement the corresponding processes implemented by the network device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the chip can be applied to the mobile terminal/terminal device in the embodiments of the present application, and the chip can implement the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the chip mentioned in the embodiments of the present application may also be called a system-level chip, a system chip, a chip system, or a system-on-chip.
  • the processor in the embodiment of the present application may be an integrated circuit chip, which has a signal processing capability.
  • each step of the above-mentioned method embodiments may be completed by an integrated logic circuit of hardware in a processor or instructions in the form of software.
  • the above-mentioned processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • the memory in the embodiments of the present application may also be a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), a direct rambus random access memory (Direct Rambus RAM, DR RAM), or the like. That is, the memory in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
  • the embodiment of the present application also provides a computer-readable storage medium for storing computer programs.
  • the computer-readable storage medium can be applied to the network device in the embodiments of the present application, and the computer program causes the computer to execute the corresponding processes implemented by the network device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the computer-readable storage medium can be applied to the mobile terminal/terminal device in the embodiments of the present application, and the computer program causes the computer to execute the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the embodiment of the present application also provides a computer program product, including computer program instructions.
  • the computer program product may be applied to the network device in the embodiments of the present application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the network device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the computer program product can be applied to the mobile terminal/terminal device in the embodiments of the present application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application, For the sake of brevity, details are not repeated here.
  • the embodiment of the present application also provides a computer program.
  • the computer program can be applied to the network device in the embodiment of the present application.
  • when run on a computer, the computer program causes the computer to execute the corresponding processes implemented by the network device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the computer program can be applied to the mobile terminal/terminal device in the embodiment of the present application.
  • when run on a computer, the computer program causes the computer to execute the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application, which are not repeated here for the sake of brevity.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or other media capable of storing program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A cell selection method and device, the method comprising: determining at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition; and training a reinforcement learning model for cell selection according to the reward value corresponding to the at least one reward condition.

Description

Method and Device for Cell Selection

This application claims priority to PCT patent application No. PCT/CN2021/128217, entitled "Method and Device for Cell Selection" and filed with the China Patent Office on November 2, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The embodiments of the present application relate to the field of communications, and in particular to a method and device for cell selection.

Background

In a New Radio (NR) system, when a terminal device that is using network services moves from one cell to another, or when reasons such as radio transmission load adjustment, activation of operation and maintenance, or equipment failure arise, the system transfers the communication link between the terminal device and the source cell to a new cell in order to guarantee the continuity of communication and the quality of service; that is, it performs a handover procedure.

Traditional handover relies on measurement reports from the terminal device to select the target cell, which may trigger unnecessary handover procedures such as ping-pong handover. The signaling interaction required for measurement configuration and measurement reporting may introduce a large transmission delay, so the timeliness of the measurement results cannot be guaranteed when the handover command is issued, which in turn leads to handover failure. Therefore, how to perform cell selection so as to improve the handover success rate is a problem that urgently needs to be solved.

Summary

The present application provides a method and device for cell selection, which use a reinforcement learning model to perform cell selection, helping a terminal device select a suitable cell.

In a first aspect, a method for cell selection is provided, including: determining at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition; and training a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.

In a second aspect, a method for cell selection is provided, including: using a reinforcement learning model to determine a selected target cell according to state information of a terminal device in multiple cells.

In a third aspect, a device for cell selection is provided, configured to execute the method in the above first aspect or any of its implementations.

Specifically, the terminal device includes functional modules for executing the method in the above first aspect or any of its implementations.

In a fourth aspect, a device for cell selection is provided, configured to execute the method in the above second aspect or any of its implementations.

Specifically, the network device includes functional modules for executing the method in the above second aspect or any of its implementations.

In a fifth aspect, a communication device is provided, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method in the above first aspect or any of its implementations.

In a sixth aspect, a communication device is provided, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method in the above second aspect or any of its implementations.

In a seventh aspect, a chip is provided for implementing the method in any one of the above first to second aspects or any of their implementations.

Specifically, the chip includes a processor, configured to call and run a computer program from a memory, so that a device on which the chip is installed executes the method in any one of the above first to second aspects or any of their implementations.

In an eighth aspect, a computer-readable storage medium is provided for storing a computer program, and the computer program causes a computer to execute the method in any one of the above first to second aspects or any of their implementations.

In a ninth aspect, a computer program product is provided, including computer program instructions, and the computer program instructions cause a computer to execute the method in any one of the above first to second aspects or any of their implementations.

In a tenth aspect, a computer program is provided, which, when run on a computer, causes the computer to execute the method in any one of the above first to second aspects or any of their implementations.

Through the above technical solutions, at least one reward condition for cell selection and a corresponding reward value are set, and the reinforcement learning model for cell selection is then trained based on the at least one reward condition and its corresponding reward value. This helps the terminal device select a suitable cell and avoids the problems of traditional handover.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of a communication system architecture provided by an embodiment of the present application.

Fig. 2 is a schematic diagram of a decision-making procedure for cell handover in the related art.

Fig. 3 is a schematic flowchart of reinforcement learning.

Fig. 4 is a schematic flowchart of a method for cell selection provided by an embodiment of the present application.

Fig. 5 is a schematic diagram of cell deployment according to an embodiment of the present application.

Fig. 6 is a schematic diagram of scattering points over the maximum selectable positions according to an embodiment of the present application.

Fig. 7 is a schematic diagram of cell deployment according to another embodiment of the present application.

Fig. 8 is a schematic diagram of overlapping sectors according to an embodiment of the present application.

Fig. 9 is a schematic flowchart of another method for cell selection provided by an embodiment of the present application.

Fig. 10 is a schematic block diagram of a device for cell selection provided by an embodiment of the present application.

Fig. 11 is a schematic block diagram of another device for cell selection provided by an embodiment of the present application.

Fig. 12 is a schematic block diagram of a communication device provided by an embodiment of the present application.

Fig. 13 is a schematic block diagram of a chip provided by an embodiment of the present application.

具体实施方式Detailed ways

下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。针对本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of the embodiments. With regard to the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

本申请实施例的技术方案可以应用于各种通信系统,例如:全球移动通讯(Global System of Mobile communication,GSM)系统、码分多址(Code Division Multiple Access,CDMA)系统、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)系统、通用分组无线业务(General Packet Radio Service,GPRS)、长期演进(Long Term Evolution,LTE)系统、先进的长期演进(Advanced long term evolution,LTE-A)系统、新无线(New Radio,NR)系统、NR系统的演进系统、非授权频谱上的LTE(LTE-based access to unlicensed spectrum,LTE-U)系统、非授权频谱上的NR(NR-based access to unlicensed spectrum,NR-U)系统、非地面通信网络(Non-Terrestrial Networks,NTN)系统、通用移动通信系统(Universal Mobile Telecommunication System,UMTS)、无线局域网(Wireless Local Area Networks,WLAN)、无线保真(Wireless Fidelity,WiFi)、第五代通信(5th-Generation,5G)系统或其他通信系统等。The technical solution of the embodiment of the present application can be applied to various communication systems, such as: Global System of Mobile communication (Global System of Mobile communication, GSM) system, code division multiple access (Code Division Multiple Access, CDMA) system, broadband code division multiple access (Wideband Code Division Multiple Access, WCDMA) system, General Packet Radio Service (GPRS), Long Term Evolution (LTE) system, Advanced long term evolution (LTE-A) system , New Radio (NR) system, evolution system of NR system, LTE (LTE-based access to unlicensed spectrum, LTE-U) system on unlicensed spectrum, NR (NR-based access to unlicensed spectrum) on unlicensed spectrum unlicensed spectrum (NR-U) system, Non-Terrestrial Networks (NTN) system, Universal Mobile Telecommunications System (UMTS), Wireless Local Area Networks (WLAN), Wireless Fidelity (Wireless Fidelity, WiFi), fifth-generation communication (5th-Generation, 5G) system or other communication systems, etc.

通常来说,传统的通信系统支持的连接数有限,也易于实现,然而,随着通信技术的发展,移动通信系统将不仅支持传统的通信,还将支持例如,设备到设备(Device to Device,D2D)通信,机器到机器(Machine to Machine,M2M)通信,机器类型通信(Machine Type Communication,MTC),车辆间(Vehicle to Vehicle,V2V)通信,或车联网(Vehicle to everything,V2X)通信等,本申请实施例也可以应用于这些通信系统。Generally speaking, the number of connections supported by traditional communication systems is limited and easy to implement. However, with the development of communication technology, mobile communication systems will not only support traditional communication, but also support, for example, Device to Device (Device to Device, D2D) communication, Machine to Machine (M2M) communication, Machine Type Communication (MTC), Vehicle to Vehicle (V2V) communication, or Vehicle to everything (V2X) communication, etc. , the embodiments of the present application may also be applied to these communication systems.

可选地,本申请实施例中的通信系统可以应用于载波聚合(Carrier Aggregation,CA)场景,也可以应用于双连接(Dual Connectivity,DC)场景,还可以应用于独立(Standalone,SA)布网场景。Optionally, the communication system in the embodiment of the present application may be applied to a carrier aggregation (Carrier Aggregation, CA) scenario, may also be applied to a dual connectivity (Dual Connectivity, DC) scenario, and may also be applied to an independent (Standalone, SA) deployment Web scene.

可选地,本申请实施例中的通信系统可以应用于非授权频谱,其中,非授权频谱也可以认为是共享频谱;或者,本申请实施例中的通信系统也可以应用于授权频谱,其中,授权频谱也可以认为是非共享频谱。Optionally, the communication system in the embodiment of the present application may be applied to an unlicensed spectrum, where the unlicensed spectrum may also be considered as a shared spectrum; or, the communication system in the embodiment of the present application may also be applied to a licensed spectrum, where, Licensed spectrum can also be considered as non-shared spectrum.

本申请实施例结合网络设备和终端设备描述了各个实施例,其中,终端设备也可以称为用户设备(User Equipment,UE)、接入终端、用户单元、用户站、移动站、移动台、远方站、远程终端、移动设备、用户终端、终端、无线通信设备、用户代理或用户装置等。The embodiments of the present application describe various embodiments in conjunction with network equipment and terminal equipment, wherein the terminal equipment may also be referred to as user equipment (User Equipment, UE), access terminal, user unit, user station, mobile station, mobile station, remote station, remote terminal, mobile device, user terminal, terminal, wireless communication device, user agent or user device, etc.

终端设备可以是WLAN中的站点(STATION,ST),可以是蜂窝电话、无绳电话、会话启动协议(Session Initiation Protocol,SIP)电话、无线本地环路(Wireless Local Loop,WLL)站、个人数字助理(Personal Digital Assistant,PDA)设备、具有无线通信功能的手持设备、计算设备或连接到无线调制解调器的其它处理设备、车载设备、可穿戴设备、下一代通信系统例如NR网络中的终端设备,或者未来演进的公共陆地移动网络(Public Land Mobile Network,PLMN)网络中的终端设备等。The terminal device can be a station (STATION, ST) in a WLAN, a cellular phone, a cordless phone, a Session Initiation Protocol (Session Initiation Protocol, SIP) phone, a wireless local loop (Wireless Local Loop, WLL) station, a personal digital assistant (Personal Digital Assistant, PDA) devices, handheld devices with wireless communication functions, computing devices or other processing devices connected to wireless modems, vehicle-mounted devices, wearable devices, next-generation communication systems such as terminal devices in NR networks, or future Terminal equipment in the evolved public land mobile network (Public Land Mobile Network, PLMN) network, etc.

在本申请实施例中,终端设备可以部署在陆地上,包括室内或室外、手持、穿戴或车载;也可以部署在水面上(如轮船等);还可以部署在空中(例如飞机、气球和卫星上等)。In the embodiment of this application, the terminal device can be deployed on land, including indoor or outdoor, handheld, wearable or vehicle-mounted; it can also be deployed on water (such as ships, etc.); it can also be deployed in the air (such as aircraft, balloons and satellites) superior).

在本申请实施例中,终端设备可以是手机(Mobile Phone)、平板电脑(Pad)、带无线收发功能的电脑、虚拟现实(Virtual Reality,VR)终端设备、增强现实(Augmented Reality,AR)终端设备、工业控制(industrial control)中的无线终端设备、无人驾驶(self driving)中的无线终端设备、远程医疗(remote medical)中的无线终端设备、智能电网(smart grid)中的无线终端设备、运输安全(transportation safety)中的无线终端设备、智慧城市(smart city)中的无线终端设备或智慧家庭(smart home)中的无线终端设备等。In this embodiment of the application, the terminal device may be a mobile phone (Mobile Phone), a tablet computer (Pad), a computer with a wireless transceiver function, a virtual reality (Virtual Reality, VR) terminal device, an augmented reality (Augmented Reality, AR) terminal Equipment, wireless terminal equipment in industrial control, wireless terminal equipment in self driving, wireless terminal equipment in remote medical, wireless terminal equipment in smart grid , wireless terminal equipment in transportation safety, wireless terminal equipment in smart city, or wireless terminal equipment in smart home.

作为示例而非限定,在本申请实施例中,该终端设备还可以是可穿戴设备。可穿戴设备也可以称为穿戴式智能设备,是应用穿戴式技术对日常穿戴进行智能化设计、开发出可以穿戴的设备的总称,如眼镜、手套、手表、服饰及鞋等。可穿戴设备即直接穿在身上,或是整合到用户的衣服或配件的一 种便携式设备。可穿戴设备不仅仅是一种硬件设备,更是通过软件支持以及数据交互、云端交互来实现强大的功能。广义穿戴式智能设备包括功能全、尺寸大、可不依赖智能手机实现完整或者部分的功能,例如:智能手表或智能眼镜等,以及只专注于某一类应用功能,需要和其它设备如智能手机配合使用,如各类进行体征监测的智能手环、智能首饰等。As an example but not a limitation, in this embodiment of the present application, the terminal device may also be a wearable device. Wearable devices can also be called wearable smart devices, which is a general term for the application of wearable technology to intelligently design daily wear and develop wearable devices, such as glasses, gloves, watches, clothing and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. Wearable devices are not only a hardware device, but also achieve powerful functions through software support, data interaction, and cloud interaction. Generalized wearable smart devices include full-featured, large-sized, complete or partial functions without relying on smart phones, such as smart watches or smart glasses, etc., and only focus on a certain type of application functions, and need to cooperate with other devices such as smart phones Use, such as various smart bracelets and smart jewelry for physical sign monitoring.

In the embodiments of this application, the network device may be a device that communicates with mobile devices. The network device may be an access point (AP) in a WLAN, a base transceiver station (BTS) in GSM or CDMA, a NodeB (NB) in WCDMA, an evolved NodeB (eNB or eNodeB) in LTE, a relay station or access point, a vehicle-mounted device, a wearable device, a network device (gNB) in an NR network, a network device in a future evolved PLMN network, a network device in an NTN network, or the like.

By way of example and not limitation, in the embodiments of this application, the network device may have mobility characteristics; for example, the network device may be a moving device. Optionally, the network device may be a satellite or a balloon station. For example, the satellite may be a low earth orbit (LEO) satellite, a medium earth orbit (MEO) satellite, a geostationary earth orbit (GEO) satellite, a high elliptical orbit (HEO) satellite, or the like. Optionally, the network device may also be a base station located on land, on water, or at other locations.

In the embodiments of this application, the network device may serve a cell, and the terminal device communicates with the network device over the transmission resources (for example, frequency-domain resources, that is, spectrum resources) used by the cell. The cell may be a cell corresponding to a network device (for example, a base station); the cell may belong to a macro base station or to a base station corresponding to a small cell. Small cells here may include metro cells, micro cells, pico cells, femto cells, and the like. These small cells feature small coverage and low transmit power and are suitable for providing high-rate data transmission services.

Exemplarily, FIG. 1 shows a communication system 100 to which the embodiments of this application are applied. The communication system 100 may include a network device 110, which may be a device that communicates with a terminal device 120 (also called a communication terminal or terminal). The network device 110 may provide communication coverage for a specific geographical area and may communicate with terminal devices located within that coverage area.

FIG. 1 exemplarily shows one network device and two terminal devices. Optionally, the communication system 100 may include multiple network devices, and the coverage area of each network device may include another number of terminal devices, which is not limited in the embodiments of this application.

Optionally, the communication system 100 may further include other network entities, such as a network controller and a mobility management entity, which is not limited in the embodiments of this application.

It should be understood that a device with a communication function in the network/system in the embodiments of this application may be referred to as a communication device. Taking the communication system 100 shown in FIG. 1 as an example, the communication devices may include the network device 110 and the terminal device 120, both having communication functions; the network device 110 and the terminal device 120 may be the specific devices described above, which are not repeated here. The communication devices may also include other devices in the communication system 100, such as a network controller, a mobility management entity, and other network entities, which is not limited in the embodiments of this application.

It should be understood that the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

It should be understood that the "indication" mentioned in the embodiments of this application may be a direct indication, an indirect indication, or an indication that an association relationship exists. For example, "A indicates B" may mean that A directly indicates B, for example, B can be obtained through A; it may mean that A indirectly indicates B, for example, A indicates C and B can be obtained through C; or it may mean that there is an association relationship between A and B.

In the description of the embodiments of this application, the term "correspond" may mean that there is a direct or indirect correspondence between two items, that there is an association relationship between them, or a relationship such as indicating and being indicated, or configuring and being configured.

In the embodiments of this application, "predefined" may be implemented by pre-storing, in devices (for example, including terminal devices and network devices), corresponding code, tables, or other means that can be used to indicate related information; this application does not limit the specific implementation. For example, "predefined" may mean defined in a protocol.

In the embodiments of this application, the "protocol" may refer to a standard protocol in the communication field, and may include, for example, the LTE protocol, the NR protocol, and related protocols applied in future communication systems, which is not limited in this application.

To facilitate understanding of the technical solutions of the embodiments of this application, the technical solutions of this application are described in detail below through specific embodiments. The following related technologies, as optional solutions, may be arbitrarily combined with the technical solutions of the embodiments of this application, and all such combinations fall within the protection scope of the embodiments of this application. The embodiments of this application include at least part of the following content.

To facilitate a better understanding of the embodiments of this application, the handover related to this application is described below.

Similar to the LTE system, the NR system supports the handover procedure for a UE in the connected state. When a user who is using network services moves from one cell to another, or when, for reasons such as radio transmission load adjustment, activation of operation and maintenance, or equipment failure, the communication link between the user and the source cell must be transferred to a new cell in order to guarantee continuity of communication and quality of service, the system performs the handover procedure.

Taking the Xn-interface handover procedure as an example, the whole handover procedure is divided into the following three phases:

(1) Handover preparation: this includes measurement control and reporting, and the handover request and its acknowledgement. The handover acknowledgement message contains the handover command generated by the target cell; the source cell is not allowed to modify the handover command generated by the target cell in any way and forwards it directly to the UE.

(2) Handover execution: the UE executes the handover procedure immediately after receiving the handover command, that is, the UE disconnects from the source cell and connects to the target cell (for example, it performs random access and sends a radio resource control (RRC) handover complete message to the target base station); sequence number (SN) status transfer and data forwarding are performed.

(3) Handover completion: the target cell performs path switching (Path Switch) with the access and mobility management function (AMF) entity and the user plane function (UPF) entity, and the UE context at the source base station is released.

FIG. 2 is a schematic interaction diagram of the handover decision procedure.

S1. The AMF entity provides mobility control information;

S2. Measurement and reporting;

S3. The source gNB makes the handover decision;

S4. The source gNB sends a handover request to the target gNB;

S5. The target gNB performs admission control;

S6. The target gNB sends a handover request acknowledgement (ACK) to the source gNB;

S7. The UE and the source gNB start the radio access network (RAN) handover.

That is, the UE receives the measurement configuration from the source base station, and performs measurement and reporting according to that configuration. The source base station decides whether to perform handover based on the UE's measurement reports.

To facilitate understanding of the technical solutions of the embodiments of this application, the reinforcement learning related to this application is described below.

In reinforcement learning, an agent learns by trial and error, with its behavior guided by the rewards obtained through interaction with the environment; the goal is for the agent to obtain the maximum reward. Reinforcement learning differs from supervised learning in connectionist learning mainly in the reinforcement signal: in reinforcement learning, the reinforcement signal provided by the environment is an evaluation (usually a scalar signal) of how good the produced action was, rather than telling the reinforcement learning system (RLS) how to produce the correct action. Since the external environment provides little information, the RLS must learn from its own experience. In this way, the RLS acquires knowledge in an action-evaluation environment and improves its action plan to adapt to the environment.

Reinforcement learning is mostly used in scenarios that require interaction with an environment. Given a state of the environment, the agent selects a corresponding action according to some policy. After this action is executed, the environment changes, that is, the state transitions to a new state S', and after each action the program receives a reward value. The agent adjusts its policy according to the magnitude of the received reward, so that when all steps have been executed, that is, when the state reaches the terminal state, the sum of the obtained rewards is maximized.

FIG. 3 is a schematic diagram of the reinforcement learning execution flow. Here, the agent can be understood as a program: the agent observes the environment to obtain a state, takes an action for that state according to the policy, and receives a reward; the environment changes as a result, so the agent obtains a new state and continues to execute.

The improvements of the deep Q-network (DQN) model over reinforcement learning (Q-learning) are as follows: first, a neural network is used to approximate the value function; second, a target Q-network is used to update the targets; and third, experience replay is used. The deep reinforcement learning model mainly comprises three parts: state, action, and reward. The objective is to maximize the reward observed during the interaction between the agent and the environment. Specifically, in the iterative process, the agent observes a set of states from the state space and selects an action from the action space to execute based on the learned policy; the decision policy is determined by the DQN, and the policy principle is to obtain the maximum return for the model.
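As an illustrative sketch (not part of the claimed method), the observe-act-reward loop described above can be written as a minimal tabular Q-learning update; the toy two-state environment and the hyperparameters below are hypothetical stand-ins for a real cell-selection environment.

```python
import random

random.seed(0)  # deterministic run for the toy example

def train_q_table(step_fn, n_states, n_actions, episodes=200,
                  alpha=0.5, gamma=0.9, epsilon=0.1):
    """Minimal tabular Q-learning: observe a state, pick an action
    (epsilon-greedy), receive a reward and next state from the
    environment, and update the value estimate toward the target."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if random.random() < epsilon:
                action = random.randrange(n_actions)      # explore
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])
            next_state, reward, done = step_fn(state, action)
            # Q-learning target: reward plus discounted best future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

# Hypothetical toy environment: action 1 reaches the terminal state
# with reward 1; action 0 stays in state 0 with reward 0.
def toy_step(state, action):
    if action == 1:
        return 1, 1.0, True
    return 0, 0.0, False

q = train_q_table(toy_step, n_states=2, n_actions=2)
print(q[0][1] > q[0][0])  # the rewarded action ends up with the higher value
```

The same loop structure carries over to the DQN case, with the table replaced by a neural network and the update fed from an experience-replay buffer.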

Traditional handover relies on the UE's measurement reports to select the target cell, which may trigger unnecessary handover procedures, such as ping-pong handover.

The signaling interaction brought about by measurement configuration and measurement reporting may introduce a large transmission delay, so the timeliness of the measurement results cannot be guaranteed when the handover command is issued, which in turn can lead to handover failure. Therefore, how to perform cell handover so as to improve the handover success rate is a problem that urgently needs to be solved.

FIG. 4 is a schematic flowchart of a cell selection method 200 according to an embodiment of this application. As shown in FIG. 4, the method 200 includes at least part of the following:

S210. Determine at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition;

S220. Train a reinforcement learning model for cell selection according to the reward value corresponding to the at least one reward condition.

In some embodiments of this application, the method 200 may be executed by an agent, and the agent may be deployed on a terminal device or on a network device. That is, a terminal device or a network device can use a reinforcement learning model to perform cell selection, which helps select a suitable cell and improve user experience. In the following, cell selection by a terminal device using a reinforcement learning model is taken as an example for description; the implementation on the network device side is similar and is not repeated here.

It should be understood that this application does not limit the specific reinforcement learning model, which may include, but is not limited to, a deep Q-network (DQN) model.

In some embodiments of this application, the reward condition corresponding to the reinforcement learning model is related to first information, where the first information includes, but is not limited to, at least one of the following:

signal quality information of a cell, the coverage of a cell, load information of a cell, and the dwell time of the terminal device in a cell.

By designing reward conditions related to the first information, and further designing the reward values corresponding to those reward conditions, the terminal device can be assisted in selecting a suitable cell, reducing problems common in cell selection, such as ping-pong handover, too-early handover, and handover to a wrong cell.

In some embodiments of this application, the signal quality information of a cell may be characterized by at least one of the following metrics:

reference signal received power (RSRP), reference signal received quality (RSRQ), and signal-to-interference-plus-noise ratio (SINR).

In the following, characterizing the signal quality information of a cell by RSRP is taken as an example for description, but this application is not limited thereto.

In some embodiments of this application, the at least one reward condition includes a target reward condition that the target cell needs to satisfy.

That is, a cell that satisfies the target reward condition may be regarded as a target cell, and a cell that does not satisfy the target reward condition is regarded as not being a target cell. Alternatively, when the cell selected by the terminal device satisfies the target reward condition, the handover is considered successful; when the cell selected by the terminal device does not satisfy the target reward condition, the handover is considered failed.

Further, a corresponding reward value is configured for the at least one reward condition.

For example, when the cell selected by the terminal device satisfies the target reward condition, a larger reward value, for example a positive reward value, is configured; when the cell selected by the terminal device does not satisfy the target reward condition, a smaller reward value, for example a negative reward value, is configured.

In some embodiments of this application, the target reward condition includes at least one of the following:

the signal quality information of the cell is greater than or equal to a signal quality threshold;

the signal quality information of the cell is the largest among multiple candidate cells;

the terminal device is located within the coverage of the cell;

the load of the cell satisfies a load threshold;

the dwell time of the terminal device in the cell is greater than a duration threshold.

In some embodiments, if the signal quality information of a cell is greater than or equal to the signal quality threshold, the signal quality of the cell may be considered good and able to satisfy the transmission requirements of the terminal device.

In some embodiments, if the load of a cell satisfies the load threshold, the load capacity of the cell may be considered good and able to accommodate access by the terminal device.

In some embodiments, if the dwell time of the terminal device in a cell is greater than the duration threshold, this handover by the terminal device may be considered a successful handover.

In some embodiments, that the load of the cell satisfies the load threshold includes:

the available load of the cell is greater than or equal to a first load threshold, that is, the cell still has sufficient available load; and/or

the used load of the cell is less than or equal to a second load threshold, that is, the cell carries little load; in other words, the cell still has sufficient available load.

In some embodiments, the signal quality threshold may be preconfigured. For example, the signal quality threshold may be preconfigured for the terminal device (specifically, the reinforcement learning model in the terminal device) as environment data.

In some embodiments, the load threshold, for example the first load threshold or the second load threshold, may be preconfigured. For example, the first load threshold or the second load threshold may be preconfigured for the terminal device (specifically, the reinforcement learning model in the terminal device) as environment data.

In some embodiments, the duration threshold may be preconfigured. For example, the duration threshold may be preconfigured for the terminal device (specifically, the reinforcement learning model in the terminal device) as environment data.

In some embodiments of this application, corresponding target reward conditions may be set according to different optimization objectives.

In some embodiments, the target reward condition includes a first reward condition, where the first reward condition includes:

the signal quality information of the cell is greater than or equal to the signal quality threshold, and the signal quality information of the cell is the largest among the multiple candidate cells.

For example, among the candidate cells whose signal quality satisfies the signal quality threshold, the cell with the best signal quality is selected.

Therefore, training the model based on the first reward condition helps select the cell with the best signal quality.

In other embodiments, the target reward condition includes a second reward condition, where the second reward condition includes:

the signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among the multiple candidate cells.

For example, among the candidate cells whose signal quality satisfies the signal quality threshold and within whose coverage the terminal device is located, the cell with the best signal quality is selected.

Therefore, training the model based on the second reward condition helps select a cell that has the best signal quality and whose coverage includes the terminal's location.

In still other embodiments, the target reward condition includes a third reward condition, where the third reward condition includes:

the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, and the signal quality information of the cell is the largest among the multiple candidate cells.

For example, among the candidate cells whose signal quality satisfies the signal quality threshold and whose load satisfies the load threshold, the cell with the best signal quality is selected.

Therefore, training the model based on the third reward condition helps select a cell that has the best signal quality and whose load satisfies the load threshold.

In still other embodiments, the target reward condition includes a fourth reward condition, where the fourth reward condition includes:

the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among the multiple candidate cells.

For example, among the candidate cells whose signal quality satisfies the signal quality threshold, within whose coverage the terminal is located, and whose load satisfies the load threshold, the cell with the best signal quality is selected.

Therefore, training the model based on the fourth reward condition helps select a cell that has the best signal quality, whose coverage includes the terminal's location, and whose load satisfies the load threshold.

In some embodiments, the target reward condition includes at least one of the first reward condition, the second reward condition, the third reward condition, and the fourth reward condition.
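As an illustrative sketch, the four target reward conditions above can be expressed as a single check over a set of candidate cells. The field names (`rsrp`, `in_coverage`, `load`) and the threshold values are hypothetical; they stand in for whatever environment data is actually preconfigured.

```python
def satisfies_condition(candidates, selected, condition,
                        rsrp_threshold=-100.0, load_threshold=0.8):
    """Check whether the selected candidate cell satisfies the first,
    second, third, or fourth target reward condition.

    Each candidate is a dict with hypothetical fields:
      rsrp        -- signal quality (dBm)
      in_coverage -- True if the terminal lies within the cell's coverage
      load        -- used-load ratio in [0, 1]
    """
    cell = candidates[selected]
    best_rsrp = max(c["rsrp"] for c in candidates)
    # Common to all four conditions: quality above the threshold,
    # and the largest signal quality among the candidates.
    ok = cell["rsrp"] >= rsrp_threshold and cell["rsrp"] == best_rsrp
    if condition in (2, 4):          # second/fourth add the coverage check
        ok = ok and cell["in_coverage"]
    if condition in (3, 4):          # third/fourth add the load check
        ok = ok and cell["load"] <= load_threshold
    return ok

cells = [
    {"rsrp": -95.0, "in_coverage": True,  "load": 0.5},
    {"rsrp": -90.0, "in_coverage": False, "load": 0.9},
]
print(satisfies_condition(cells, 1, 1))  # best RSRP, above threshold -> True
print(satisfies_condition(cells, 1, 4))  # fails coverage and load -> False
```

The progression from the first to the fourth condition simply adds constraints on top of the shared signal-quality check.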

In some embodiments, the multiple candidate cells may include all neighboring cells around the current location of the terminal device.

In some embodiments of this application, the terminal device may train the reinforcement learning model sequentially based on different target reward conditions.

For example, training is performed based on each target reward condition in turn, in increasing order of the number of constraints in the reward condition.

As an example, the reinforcement learning model is first trained based on the first reward condition; once the reinforcement learning model has converged, it is trained based on the second reward condition; once it has converged again, it is trained based on the third reward condition; and once it has converged, it is trained based on the fourth reward condition.
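As an illustrative sketch of this staged training, a generic loop can train until convergence under one reward condition before advancing to the next. `train_until_converged` is a hypothetical stand-in for whatever training routine and convergence test are actually used.

```python
def staged_training(train_until_converged, conditions=(1, 2, 3, 4)):
    """Train the model under each target reward condition in turn,
    from the fewest constraints to the most, advancing only after
    training under the current condition has converged."""
    history = []
    for condition in conditions:
        train_until_converged(condition)  # blocks until convergence
        history.append(condition)
    return history

# Stand-in trainer that records which condition it was asked to train on.
trained = []
order = staged_training(trained.append)
print(order)  # [1, 2, 3, 4]
```

Each stage starts from the model produced by the previous stage, so the stricter conditions refine rather than restart the policy.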

In some embodiments of this application, the reward values given when the selected cell satisfies different reward conditions may be defined.

For example, when the selected cell satisfies the target reward condition, a first reward value is given.

For another example, when the selected cell does not satisfy the target reward condition, a second reward value is given.

Here, the first reward value is greater than the second reward value.

In some embodiments, when the terminal device selects a cell satisfying the aforementioned first reward condition, second reward condition, third reward condition, or fourth reward condition, the same reward value may apply in each case; alternatively, different reward values may apply.

For example, when the terminal device selects a cell satisfying the first reward condition, the second reward condition, the third reward condition, or the fourth reward condition, the first reward value is given in each case.

For another example, when the terminal device selects a cell satisfying the first reward condition, a reward value X is given; when the terminal device selects a cell satisfying the second reward condition, a reward value Y is given; when the terminal device selects a cell satisfying the third reward condition, a reward value Z is given; and when the terminal device selects a cell satisfying the fourth reward condition, a reward value P is given, where X < Y < Z < P.
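As an illustrative sketch, graded reward values satisfying X < Y < Z < P can be assigned by rewarding the most restrictive condition the selected cell meets. The numeric values and the penalty for an unsatisfied target condition are hypothetical.

```python
# Hypothetical graded values: X < Y < Z < P for conditions 1..4.
REWARDS = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}

def reward_for_selection(met_conditions, penalty=-1.0):
    """Return the reward for one cell selection.

    met_conditions -- set of target reward conditions (1..4) that the
                      selected cell satisfies; since the conditions are
                      nested by constraints, the highest one met earns
                      the largest applicable reward.
    """
    if not met_conditions:
        return penalty  # target reward condition not met: negative reward
    return REWARDS[max(met_conditions)]

print(reward_for_selection({1}))        # X only
print(reward_for_selection({1, 2, 4}))  # P, the largest applicable reward
print(reward_for_selection(set()))      # negative reward on failure
```

With this shape, the training signal pushes the model toward cells that satisfy the stricter, multi-constraint conditions.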

In some embodiments of this application, the reinforcement learning model used for cell selection may be trained according to the state space of the terminal device, the aforementioned at least one reward condition, and the corresponding reward values. Alternatively, the reinforcement learning model may be trained according to the state space of the terminal device, the action space of the terminal device, the aforementioned at least one reward condition, and the corresponding reward values.

Optionally, the state space of the terminal device may be used to describe the state information of the terminal device in multiple cells.

In some embodiments, the state space of the terminal device includes state information of the terminal device at multiple moments, for example, state information of the terminal device at a first moment. Optionally, the state information of the terminal device at the first moment includes, but is not limited to, at least one of the following:

所述终端设备在所述第一时刻所属的小区信息,例如小区标识信息(Cell ID);Information about the cell to which the terminal device belongs at the first moment, such as cell identification information (Cell ID);

所述终端设备在所述第一时刻所属小区的信号质量信息,例如RSRP大小;Signal quality information of the cell to which the terminal device belongs at the first moment, such as RSRP size;

所述终端设备在所述第一时刻的切换状态信息,用于指示所述终端设备在所述第一时刻是否发生切换;The switching status information of the terminal device at the first moment is used to indicate whether the terminal device is switched at the first moment;

所述终端设备在所述第一时刻的位置信息,例如三维坐标。The location information of the terminal device at the first moment, such as three-dimensional coordinates.

例如,以信号质量信息通过RSRP表征为例,终端设备n在时刻t的状态信息可以表示为

Figure PCTCN2021135532-appb-000001
For example, taking signal quality information represented by RSRP as an example, the state information of terminal device n at time t can be expressed as
Figure PCTCN2021135532-appb-000001

其中,
Figure PCTCN2021135532-appb-000002
表示时刻t终端设备n所属(或者,所选择的,所驻留的)的小区(或者说,扇区)。Wherein,
Figure PCTCN2021135532-appb-000002
denotes the cell (or sector) to which terminal device n belongs (or has selected, or camps on) at time t.

RSRP t n表示时刻t终端设备n所属小区的RSRP大小。 RSRP t n represents the RSRP size of the cell to which terminal device n belongs at time t.

Figure PCTCN2021135532-appb-000003
表示切换的状态,例如,取值为0表示没有切换,取值为1表示切换。是否切换的判断依据为:终端设备在时刻t选择的小区和上一时刻选择的小区是否发生了变化。
Figure PCTCN2021135532-appb-000003
Indicates the state of switching, for example, a value of 0 means no switching, and a value of 1 means switching. The basis for judging whether to switch is: whether the cell selected by the terminal device at time t has changed from the cell selected at the previous time.

P t n表示终端设备n的位置信息。 P t n represents the location information of terminal device n.
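As an illustrative sketch only, the state information described above can be collected into a simple structure; the field names and types here are assumptions of this description, not definitions from the application.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class UEState:
    """State of terminal device n at time t (illustrative field names)."""
    cell_id: int                          # cell the terminal belongs to / camps on at time t
    rsrp: float                           # RSRP of that cell (assumed in dBm)
    handover: int                         # 1 if the selected cell changed from the previous moment, else 0
    position: Tuple[float, float, float]  # 3D coordinates of the terminal

# A terminal camped on cell 1 with no handover at the initial moment:
s_t = UEState(cell_id=1, rsrp=-100.0, handover=0, position=(10.0, 20.0, 1.5))
```

Each `UEState` instance corresponds to one moment of the state space; a trajectory is then simply a list of such states.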

应理解,以上终端设备的状态信息仅为示例,在其他实施例中,该终端设备的状态信息也可以包括其他信息,例如小区的负载信息,或者其他用于辅助作小区选择的信息,本申请对此不作限定。It should be understood that the above state information of the terminal device is only an example; in other embodiments, the state information of the terminal device may also include other information, such as load information of a cell, or other information used to assist cell selection, which is not limited in this application.

在一些实施例中,终端设备的行为空间可以包括终端设备在多个时刻的行为信息,例如,包括终端设备在第一时刻的行为信息,该终端设备在第一时刻的行为信息用于指示终端设备在第一时刻选择了某个小区。In some embodiments, the behavior space of the terminal device may include behavior information of the terminal device at multiple moments, for example, behavior information of the terminal device at a first moment, which is used to indicate the cell selected by the terminal device at the first moment.

例如,终端设备n在时刻t的行为信息可以表述为
Figure PCTCN2021135532-appb-000004
表示终端设备n在时刻t选择的小区。For example, the behavior information of terminal device n at time t can be expressed as
Figure PCTCN2021135532-appb-000004
which denotes the cell selected by terminal device n at time t.

其中,若
Figure PCTCN2021135532-appb-000005
表示发生了小区切换,否则,表示未发生切换。Wherein, if
Figure PCTCN2021135532-appb-000005
holds, a cell handover has occurred; otherwise, no handover has occurred.
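Under this behavior definition, whether a handover occurred follows directly from comparing two consecutive actions. A minimal sketch (the function name is assumed here):

```python
def handover_flag(a_t: int, a_prev: int) -> int:
    """Return 1 if the cell selected at time t differs from the cell
    selected at the previous moment (i.e., a handover occurred), else 0."""
    return int(a_t != a_prev)

# The terminal moves from cell 1 to cell 6, then stays on cell 6:
print(handover_flag(6, 1))  # 1: a cell handover occurred
print(handover_flag(6, 6))  # 0: no handover
```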

在本申请一些实施例中,终端设备可以根据多个时刻中的每个时刻的状态信息结合前述至少一个奖励条件,确定对应的奖励值。In some embodiments of the present application, the terminal device may determine the corresponding reward value according to the state information at each time in the multiple moments in combination with the aforementioned at least one reward condition.

例如,根据第一时刻的状态信息,确定终端设备在第一时刻所属的小区是否为目标小区,进一步确定对应的奖励值。For example, according to the status information at the first moment, it is determined whether the cell to which the terminal device belongs at the first moment is the target cell, and the corresponding reward value is further determined.

作为示例,若在第一时刻,终端设备所属小区的RSRP满足RSRP阈值,并且该小区的RSRP在候选小区中是最大的,则确定该小区为目标小区,给予第一奖励值。As an example, if at the first moment, the RSRP of the cell to which the terminal device belongs satisfies the RSRP threshold, and the RSRP of the cell is the largest among the candidate cells, then the cell is determined to be the target cell, and the first reward value is given.

作为示例,若在第一时刻,终端设备所属小区的RSRP满足RSRP阈值,终端设备的位置位于该小区的覆盖范围内,并且该小区的RSRP在候选小区中是最大的,则确定该小区为目标小区,给予第一奖励值。As an example, if at the first moment the RSRP of the cell to which the terminal device belongs satisfies the RSRP threshold, the terminal device is located within the coverage of the cell, and the RSRP of the cell is the largest among the candidate cells, then the cell is determined to be the target cell, and the first reward value is given.

作为示例,若第一时刻,终端设备所属小区的RSRP不满足RSRP阈值,确定该小区不是目标小区,给予第二奖励值。As an example, if the RSRP of the cell to which the terminal device belongs does not meet the RSRP threshold at the first moment, it is determined that the cell is not the target cell, and a second reward value is given.
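The RSRP-based examples above can be sketched as one reward function. The threshold, the coverage check, and the concrete reward values below are illustrative assumptions; which checks apply depends on which reward condition is being used.

```python
def cell_reward(selected, rsrp_by_cell, in_coverage=True,
                rsrp_threshold=-114.0, r_target=1.0, r_not_target=-1.0):
    """Reward for the terminal selecting cell `selected` at one moment.

    rsrp_by_cell: dict mapping candidate cell id -> measured RSRP.
    The selected cell counts as the target cell when its RSRP meets the
    threshold, it is the strongest candidate, and (optionally) the
    terminal lies inside its coverage."""
    strongest = max(rsrp_by_cell, key=rsrp_by_cell.get)
    is_target = (rsrp_by_cell[selected] >= rsrp_threshold
                 and selected == strongest
                 and in_coverage)
    return r_target if is_target else r_not_target

rsrp = {1: -100.0, 2: -110.0, 3: -120.0}
print(cell_reward(1, rsrp))  # 1.0: strongest cell above the threshold -> target cell
print(cell_reward(3, rsrp))  # -1.0: below the threshold and not the strongest
```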

作为示例,前述的第一奖励条件以及对应的奖励值可以定义为:As an example, the aforementioned first reward condition and corresponding reward value can be defined as:

Figure PCTCN2021135532-appb-000006
Figure PCTCN2021135532-appb-000006

其中,
Figure PCTCN2021135532-appb-000007
表示终端设备n在时刻t执行动作
Figure PCTCN2021135532-appb-000008
得到的奖励值。Wherein,
Figure PCTCN2021135532-appb-000007
denotes the reward value obtained by terminal device n for executing action
Figure PCTCN2021135532-appb-000008
at time t.

作为示例,前述的第二奖励条件以及对应的奖励值可以定义为:As an example, the aforementioned second reward condition and corresponding reward value can be defined as:

Figure PCTCN2021135532-appb-000009
Figure PCTCN2021135532-appb-000009

其中,
Figure PCTCN2021135532-appb-000010
表示终端设备n在时刻t执行动作
Figure PCTCN2021135532-appb-000011
得到的奖励值。Wherein,
Figure PCTCN2021135532-appb-000010
denotes the reward value obtained by terminal device n for executing action
Figure PCTCN2021135532-appb-000011
at time t.

作为示例,前述的第三奖励条件以及对应的奖励值可以定义为:As an example, the aforementioned third reward condition and corresponding reward value can be defined as:

Figure PCTCN2021135532-appb-000012
Figure PCTCN2021135532-appb-000012

其中,
Figure PCTCN2021135532-appb-000013
表示终端设备n在时刻t执行动作
Figure PCTCN2021135532-appb-000014
得到的奖励值。Wherein,
Figure PCTCN2021135532-appb-000013
denotes the reward value obtained by terminal device n for executing action
Figure PCTCN2021135532-appb-000014
at time t.

作为示例,前述的第四奖励条件以及对应的奖励值可以定义为:As an example, the aforementioned fourth reward condition and corresponding reward value can be defined as:

Figure PCTCN2021135532-appb-000015
Figure PCTCN2021135532-appb-000015

其中,
Figure PCTCN2021135532-appb-000016
表示终端设备n在时刻t执行动作
Figure PCTCN2021135532-appb-000017
得到的奖励值。Wherein,
Figure PCTCN2021135532-appb-000016
denotes the reward value obtained by terminal device n for executing action
Figure PCTCN2021135532-appb-000017
at time t.

在本申请一些实施例中,所述终端设备的状态空间和行为空间可以是根据终端设备的模拟行动轨迹获取的。In some embodiments of the present application, the state space and behavior space of the terminal device may be acquired according to a simulated action trajectory of the terminal device.

例如,首先,终端设备在可选小区范围内随机选择一个小区作为轨迹起始点,确定当前时刻,当前坐标下,该终端设备所属的小区,该小区的信号质量信息,切换状态信息等,即当前时刻的状态信息。进一步基于前述的目标奖励条件确定该小区是否为目标小区,进而得到对应的奖励值。For example, first, the terminal device randomly selects a cell within the range of selectable cells as the starting point of the trajectory, and determines, at the current moment and the current coordinates, the cell to which the terminal device belongs, the signal quality information of that cell, the handover state information, and the like, i.e., the state information at the current moment. Whether the cell is a target cell is further determined based on the aforementioned target reward condition, and the corresponding reward value is then obtained.

进一步地,终端设备开始移动,例如,终端设备利用概率超参数和Q策略选择切换的小区,切换到该小区后,可以得到下一时刻移动后的坐标下的状态信息,即切换后的小区信息,切换后的小区的信号质量信息,切换状态信息等。基于前述的目标奖励条件确定该小区是否为目标小区,进而可以得到对应的奖励值。基于时间上相邻的两个状态信息,可以建立状态切换,得到状态切换样本,包括当前时刻的状态信息,动作(即切换到哪个小区),奖励值,下一时刻的状态信息。然后将状态切换样本存储在经验池中,用于强化学习模型的训练。Further, the terminal device starts to move; for example, the terminal device uses a probability hyper-parameter and the Q policy to select the cell to switch to. After switching to that cell, the state information at the coordinates after the move at the next moment can be obtained, i.e., the cell information after the handover, the signal quality information of the cell after the handover, the handover state information, and the like. Whether that cell is a target cell is determined based on the aforementioned target reward condition, and the corresponding reward value can then be obtained. Based on two temporally adjacent pieces of state information, a state switch can be established to obtain a state-switch sample, including the state information at the current moment, the action (i.e., which cell is switched to), the reward value, and the state information at the next moment. The state-switch samples are then stored in an experience pool for training the reinforcement learning model.
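The state-switch sample and experience pool described here can be sketched as follows; the tuple layout, pool size, and batch size are assumptions of this description.

```python
import random
from collections import deque

experience_pool = deque(maxlen=10000)   # bounded experience pool

def store_transition(pool, state, action, reward, next_state):
    """Store one state-switch sample: (state at t, action taken,
    reward obtained, state at t+1)."""
    pool.append((state, action, reward, next_state))

# One step of the simulated trajectory: camped on cell 1, switch to cell 6.
s_t    = (1, -100.0, 0)    # (cell id, RSRP, handover flag)
s_next = (6, -95.0, 1)
store_transition(experience_pool, s_t, action=6, reward=1.0, next_state=s_next)

# Once the pool is large enough, a random mini-batch is drawn for training:
batch = random.sample(experience_pool, k=min(len(experience_pool), 64))
```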

在本申请一些实施例中,终端设备选择一个小区可以认为对应一次切换,即,终端设备选择一个小区可以等价于终端设备切换至该小区。In some embodiments of the present application, the selection of a cell by the terminal device may be regarded as corresponding to one handover, that is, the selection of a cell by the terminal device may be equivalent to switching to the cell by the terminal device.

在本申请一些实施例中,终端设备可以根据该终端设备的状态空间确定小区选择(或者,小区切换)对应的即时奖励值,在另一些实施例中,终端设备也可以考虑终端设备切换至某个小区一段时间后的状态,确定小区选择(或者,小区切换)对应的延时奖励值。In some embodiments of the present application, the terminal device may determine the immediate reward value corresponding to cell selection (or cell handover) according to the state space of the terminal device; in other embodiments, the terminal device may also consider the state a period of time after the terminal device switches to a certain cell, to determine the delayed reward value corresponding to cell selection (or cell handover).

例如,终端设备可以根据在小区的驻留时长,切换对应的事件类型,例如,是否为乒乓切换,是否为切换过早事件,是否为切换至错误小区事件等,给予对应的延时奖励值。有利于避免乒乓切换,切换过早,切换至错误小区等切换事件的发生。For example, the terminal device may give a corresponding delayed reward value according to the dwell time in the cell and the event type of the handover, for example, whether it is a ping-pong handover, a too-early handover event, or a handover-to-wrong-cell event. This helps avoid handover events such as ping-pong handover, premature handover, and handover to a wrong cell.

例如,在所述终端设备选择的小区满足前述目标奖励条件,并且选择的小区与上一个时刻选择的小区不同时,此情况下,可以认为是一次成功的切换,给予第三奖励值。For example, when the cell selected by the terminal device satisfies the aforementioned target reward condition, and the selected cell is different from the cell selected at the previous moment, in this case, it may be considered as a successful handover, and a third reward value is given.

又例如,在所述终端设备选择的小区满足所述目标奖励条件,但是选择的小区与上一个时刻选择的小区相同时,此情况下,可以认为是乒乓切换,给予第四奖励值,其中,该第四奖励值小于第三奖励值。For another example, when the cell selected by the terminal device satisfies the target reward condition, but the selected cell is the same as the cell selected at the previous moment, in this case it may be regarded as a ping-pong handover, and a fourth reward value is given, where the fourth reward value is smaller than the third reward value.

再例如,在所述终端设备选择的小区不满足所述目标奖励条件,但是一定时长内再次切换至该小区成功,或者说,在一定时长后,该小区满足目标奖励条件时,此情况可以认为是发生切换过早事件,给予第五奖励值。其中,该第五奖励值小于第三奖励值。For another example, when the cell selected by the terminal device does not satisfy the target reward condition, but the handover to that cell succeeds again within a certain time period, or in other words, the cell satisfies the target reward condition after a certain time period, this situation may be regarded as a too-early handover event, and a fifth reward value is given, where the fifth reward value is smaller than the third reward value.

再例如,在所述终端设备选择的小区不满足所述目标奖励条件时,并且切换事件为切换至错误小区时,给予第六奖励值。其中,第六奖励值小于第三奖励值。For another example, when the cell selected by the terminal device does not satisfy the target reward condition and the handover event is a handover to a wrong cell, a sixth reward value is given, where the sixth reward value is smaller than the third reward value.

再例如,在所述终端设备选择的小区满足所述目标奖励条件的情况下,根据终端设备在小区的驻留时长确定此次切换的累计奖励值。比如若上一时刻终端设备所选小区和当前时刻所选小区相同,则奖励值累计,驻留时长越长,累计奖励值越高。For another example, when the cell selected by the terminal device satisfies the target reward condition, the cumulative reward value of this handover is determined according to the dwell time of the terminal device in the cell. For example, if the cell selected by the terminal device at the previous moment is the same as the cell selected at the current moment, the reward value is accumulated; the longer the dwell time, the higher the cumulative reward value.
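The delayed-reward cases above (successful handover, ping-pong, too-early, wrong cell) can be summarized in one sketch; the concrete reward values and event flags are illustrative assumptions, chosen only to respect the ordering described in the text.

```python
def delayed_reward(meets_target, same_as_previous,
                   succeeded_again_within_window=False, wrong_cell=False,
                   r_success=1.0, r_pingpong=0.3, r_too_early=0.2, r_wrong=-1.0):
    """Delayed reward for one handover decision.

    meets_target: the selected cell satisfies the target reward condition.
    same_as_previous: the selected cell equals the previously selected cell."""
    if meets_target:
        # third reward value for a successful handover; a smaller fourth
        # reward value for a ping-pong handover
        return r_success if not same_as_previous else r_pingpong
    if succeeded_again_within_window:
        return r_too_early     # fifth reward value: too-early handover
    if wrong_cell:
        return r_wrong         # sixth reward value: handover to a wrong cell
    return 0.0                 # other failure cases (assumption)

print(delayed_reward(True, False))                                       # 1.0
print(delayed_reward(False, False, succeeded_again_within_window=True))  # 0.2
```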

在本申请一些实施例中,终端设备还需要获取用于模型训练的环境数据,例如,终端设备的位置坐标,基站的位置坐标,小区的信号质量信息等。In some embodiments of the present application, the terminal device also needs to acquire environment data for model training, for example, location coordinates of the terminal device, location coordinates of the base station, signal quality information of the cell, and the like.

应理解,该环境数据可以是在任意网络环境下采集的,本申请对此不作限定。以下,对两种典型网络场景下的环境数据采集进行说明。It should be understood that the environment data may be collected in any network environment, which is not limited in this application. The following describes the collection of environmental data in two typical network scenarios.

场景一:小区不重叠Scenario 1: Cells do not overlap

步骤1:设置网络环境。Step 1: Set up the network environment.

例如,网络环境可以为城市微小区(UMI)场景,且基站低于周围建筑物高度。For example, the network environment may be an urban microcell (UMI) scenario, and the base station is lower than surrounding buildings.

作为示例,场景布局为六边形网络,考虑19个微基站,每个微基站3个扇区,则共存在57个小区。图5是一种小区部署示意图。As an example, the scene layout is a hexagonal network. Considering 19 micro base stations, and each micro base station has 3 sectors, there are 57 cells in total. Fig. 5 is a schematic diagram of cell deployment.

步骤2:确定终端设备的轨迹。Step 2: Determine the trajectory of the end device.

首先,确定终端设备移动的起点位置。例如,采用撒点定位方式。例如,可以在撒点范围内随机选择终端设备的起点位置。First, determine the starting point of terminal equipment movement. For example, use the sprinkle point positioning method. For example, the starting position of the terminal device can be randomly selected within the range of the sprinkle point.

例如,在[-200,200]中,以横竖十米间隔撒点,撒点区间覆盖基站1-7,小区1-21。图6是终端设备可选位置的撒点示意图。For example, in [-200,200], sprinkle points at intervals of ten meters horizontally and vertically, covering base stations 1-7 and cells 1-21. Fig. 6 is a schematic diagram of scattered spots of optional positions of terminal equipment.

然后,当终端设备开始移动时,每一步都可以选择上下左右四个方向,选择方向随机。Then, when the terminal device starts to move, each step can choose four directions: up, down, left, and right, and the direction is selected randomly.

步骤3:确定环境数据,例如终端设备移动轨迹中的基站位置,终端设备位置,小区的信号质量等信息。Step 3: Determine the environment data, such as the base station location in the terminal equipment movement track, the terminal equipment location, the signal quality of the cell and other information.

a、基站位置坐标:数据维度为3x19,包含19个基站的三维坐标点。a. Base station location coordinates: the data dimension is 3x19, containing the three-dimensional coordinate points of 19 base stations.

b、终端设备的位置坐标:数据维度为3x1681,包含1681个终端可选的轨迹撒点的三维坐标点。b. Location coordinates of the terminal device: the data dimension is 3x1681, containing the three-dimensional coordinate points of 1681 optional trajectory drop points of the terminal.

c、RSRP:数据维度为1681x57,包含1681个终端点下对应的57个小区的所有RSRP值。c. RSRP: the data dimension is 1681x57, containing all RSRP values of the 57 cells corresponding to the 1681 terminal points.

步骤4:阈值设置Step 4: Threshold Setting

例如,设置RSRP阈值:考虑所选小区必须满足RSRP阈值范围,例如,设定值为-114dB。For example, setting the RSRP threshold: consider that the selected cell must meet the RSRP threshold range, for example, the set value is -114dB.

又例如,设置小区负载阈值:针对所有小区随机分配负载,例如,在0-20之间随机分布,考虑所选小区必须满足小区负载阈值范围,设定值为15。For another example, set the cell load threshold: randomly assign loads to all cells, for example, randomly distribute between 0-20, consider that the selected cell must meet the cell load threshold range, and set the value to 15.
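The environment data of scenario 1 can be mocked with the stated dimensions as follows. The random value ranges are assumptions for illustration, and the load threshold is assumed here to mean "used load ≤ 15".

```python
import random
random.seed(0)

N_BS, N_POINTS, N_CELLS = 19, 1681, 57
bs_pos = [[random.uniform(-200, 200) for _ in range(N_BS)] for _ in range(3)]            # 3 x 19
ue_pos = [[random.uniform(-200, 200) for _ in range(N_POINTS)] for _ in range(3)]        # 3 x 1681
rsrp   = [[random.uniform(-130, -70) for _ in range(N_CELLS)] for _ in range(N_POINTS)]  # 1681 x 57
load   = [random.randint(0, 20) for _ in range(N_CELLS)]   # random load per cell in [0, 20]

RSRP_THRESHOLD, LOAD_THRESHOLD = -114.0, 15
# Cells selectable at drop point 0: both the RSRP and the load thresholds hold.
eligible = [c for c in range(N_CELLS)
            if rsrp[0][c] >= RSRP_THRESHOLD and load[c] <= LOAD_THRESHOLD]
```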

场景二:小区重叠Scenario 2: Cells overlap

步骤1:设置网络环境。Step 1: Set up the network environment.

例如,网络环境可以为城市微小区(UMI)场景,且基站低于周围建筑物高度。For example, the network environment may be an urban microcell (UMI) scenario, and the base station is lower than surrounding buildings.

作为示例,场景布局为六边形网络,考虑38个微基站,每个微基站3个扇区,则共存在114个小区。图7是一种小区部署示意图,图8是一种重叠扇区示意图。As an example, the scene layout is a hexagonal network. Considering 38 micro base stations, each with 3 sectors, there are 114 cells in total. FIG. 7 is a schematic diagram of cell deployment, and FIG. 8 is a schematic diagram of overlapping sectors.

步骤2:确定终端设备的轨迹。Step 2: Determine the trajectory of the end device.

首先,确定终端设备移动的起点位置。例如,采用撒点定位方式。例如,可以在撒点范围内随机选择终端设备的起点位置。First, determine the starting position of the terminal device's movement, for example, using the drop-point positioning method; the starting position of the terminal device may be randomly selected within the drop-point range.

例如,在[-200,200]中,以横竖十米间隔撒点,撒点区间覆盖基站1-7,小区1-21。图6是终端设备可选位置的撒点示意图。For example, in [-200,200], points are dropped at ten-meter intervals horizontally and vertically, and the drop-point area covers base stations 1-7 and cells 1-21. FIG. 6 is a schematic diagram of the drop points of the optional positions of the terminal device.

然后,当终端设备开始移动时,每一步都可以选择上下左右四个方向,选择方向随机。Then, when the terminal device starts to move, each step can choose four directions: up, down, left, and right, and the direction is selected randomly.

步骤3:确定环境数据,例如终端设备移动轨迹中的基站位置,终端设备位置,小区的信号质量等信息。Step 3: Determine the environment data, such as the base station location in the terminal equipment movement track, the terminal equipment location, the signal quality of the cell and other information.

a、基站位置坐标:数据维度为3x38,包含38个基站的三维坐标点。a. Base station location coordinates: the data dimension is 3x38, containing the three-dimensional coordinate points of 38 base stations.

b、终端设备的位置坐标:数据维度为3x1681,包含1681个终端可选的轨迹撒点的三维坐标点。b. Location coordinates of the terminal device: the data dimension is 3x1681, containing the three-dimensional coordinate points of 1681 optional trajectory drop points of the terminal.

c、RSRP:数据维度为1681x114,包含1681个终端点下对应的114个小区的所有RSRP值。c. RSRP: the data dimension is 1681x114, containing all RSRP values of the 114 cells corresponding to the 1681 terminal points.

步骤4:阈值设置Step 4: Threshold Setting

例如,设置RSRP阈值:考虑所选小区必须满足RSRP阈值范围,例如,设定值为-114dB。For example, setting the RSRP threshold: consider that the selected cell must meet the RSRP threshold range, for example, the set value is -114dB.

又例如,设置小区负载阈值:针对所有小区随机分配负载,例如,在0-20之间随机分布,考虑所选小区必须满足小区负载阈值范围,设定值为15。For another example, set the cell load threshold: randomly assign loads to all cells, for example, randomly distribute between 0-20, consider that the selected cell must meet the cell load threshold range, and set the value to 15.

应理解,以上所示例的场景布局以及环境数据仅为示例,其可以根据具体的优化目标进行调整,本申请对此不作限定。It should be understood that the above-mentioned scene layout and environment data are only examples, which may be adjusted according to a specific optimization goal, which is not limited in the present application.

以下,以强化学习模型为DQN模型为例,说明模型的训练过程。In the following, the training process of the model will be described by taking the reinforcement learning model as the DQN model as an example.

步骤一、初始化DQN模型,例如,设置DQN模型的训练回合数,以1000轮为例,以及batch大小,例如设置为64,或128等。Step 1. Initialize the DQN model, for example, set the number of training rounds of the DQN model, take 1000 rounds as an example, and the batch size, for example, set it to 64, or 128, etc.

步骤二、重置终端设备的状态空间。Step 2, reset the state space of the terminal device.

进一步地,随机选择一个位置作为终端设备的轨迹起始点,以图6所示场景为例,选择小区1的某个位置(记为位置1)作为起始点,当前时刻记为第一时刻。Further, a location is randomly selected as the starting point of the trajectory of the terminal device. Taking the scenario shown in FIG. 6 as an example, a certain location in cell 1 (denoted as location 1) is selected as the starting point, and the current moment is recorded as the first moment.

确定终端设备在第一时刻的状态信息,即在当前位置(即位置1)坐标下,终端设备所属的小区例如小区1,对应小区的RSRP大小(例如,小区1的RSRP大小),以及相应的切换状态(对于初始时刻切换状态为0,表示未切换)。Determine the status information of the terminal device at the first moment, that is, under the coordinates of the current location (namely, location 1), the cell to which the terminal device belongs, such as cell 1, the RSRP size of the corresponding cell (for example, the RSRP size of cell 1), and the corresponding Switching state (for the initial moment, the switching state is 0, indicating no switching).

然后根据该第一时刻的状态信息,确定该终端设备所属小区是否为目标小区,具体的判断条件跟优化目标有关,例如可以根据前述的第一奖励条件,第二奖励条件,第三奖励条件或第四奖励条件确定终端设备所属小区是否为目标小区。进一步,确定终端设备的当前状态对应的奖励值,或者,也可以认为是终端设备选择小区1对应的奖励值,或者,终端设备切换到小区1对应的奖励值。Then, whether the cell to which the terminal device belongs is the target cell is determined according to the state information at the first moment; the specific judgment condition is related to the optimization goal, and may be determined, for example, according to the aforementioned first, second, third, or fourth reward condition. Further, the reward value corresponding to the current state of the terminal device is determined, which may also be regarded as the reward value corresponding to the terminal device selecting cell 1, or the reward value corresponding to the terminal device switching to cell 1.

步骤三、终端设备开始移动,例如,终端设备利用概率超参数和Q策略选择切换的小区,切换到该小区后,可以得到下一时刻(记为第二时刻)移动后的坐标(记为位置2)下的状态信息。即,切换后的小区信息,切换后的小区的信号质量信息,切换状态信息等。Step 3: the terminal device starts to move; for example, the terminal device uses a probability hyper-parameter and the Q policy to select the cell to switch to. After switching to that cell, the state information at the coordinates after the move (denoted as position 2) at the next moment (denoted as the second moment) can be obtained, i.e., the cell information after the handover, the signal quality information of the cell after the handover, the handover state information, and the like.
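"Using a probability hyper-parameter and the Q policy" corresponds to the standard epsilon-greedy rule. A minimal sketch (the epsilon value is an assumption):

```python
import random

def select_cell(q_values, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: with probability epsilon pick a random
    candidate cell (exploration); otherwise pick the cell with the
    highest Q value (exploitation via the Q policy)."""
    cells = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(cells)
    return max(cells, key=q_values.get)

q = {1: 0.2, 6: 0.9, 7: 0.4}
print(select_cell(q, epsilon=0.0))  # 6: pure exploitation picks the best-scored cell
```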

以图6所示场景为例,假设终端设备在第二时刻切换至小区6,则第二时刻的状态信息可以包括小区6的标识信息,小区6的信号质量信息,终端设备的位置信息,对应的切换状态,例如取值为1表示发生了切换。Taking the scenario shown in FIG. 6 as an example, assuming the terminal device switches to cell 6 at the second moment, the state information at the second moment may include the identification information of cell 6, the signal quality information of cell 6, the location information of the terminal device, and the corresponding handover state, where, for example, a value of 1 indicates that a handover has occurred.

然后根据该第二时刻的状态信息确定该终端设备所属小区(例如小区6)是否为目标小区,具体的判断条件跟优化目标有关,例如可以根据前述的第一奖励条件,第二奖励条件,第三奖励条件或第四奖励条件确定。进一步,确定终端设备的当前状态对应的奖励值,或者,也可以认为是终端设备选择小区6对应的奖励值,或者,终端设备切换到小区6对应的奖励值。Then, whether the cell to which the terminal device belongs (e.g., cell 6) is the target cell is determined according to the state information at the second moment; the specific judgment condition is related to the optimization goal, and may be determined, for example, according to the aforementioned first, second, third, or fourth reward condition. Further, the reward value corresponding to the current state of the terminal device is determined, which may also be regarded as the reward value corresponding to the terminal device selecting cell 6, or the reward value corresponding to the terminal device switching to cell 6.

进一步地,基于时间上相邻的两个状态信息,可以建立状态切换,得到状态切换样本,包括当前时刻的状态信息,动作(即切换到哪个小区),奖励值,下一时刻的状态信息。然后将状态切换样本存储在经验池中,用于强化学习模型的训练。Furthermore, based on two temporally adjacent state information, state switching can be established, and state switching samples can be obtained, including state information at the current moment, action (that is, which cell to switch to), reward value, and state information at the next moment. The state switching samples are then stored in the experience pool for training the reinforcement learning model.

例如,对于前述示例,状态切换样本中的当前时刻的状态信息可以为第一时刻的状态信息,动作可以为切换至小区6,下一时刻的状态信息可以为第二时刻的状态信息,奖励值为切换至小区6的奖励值。For example, for the aforementioned example, the state information at the current moment in the state switching sample can be the state information at the first moment, the action can be switching to cell 6, the state information at the next moment can be the state information at the second moment, and the reward value is the reward value for switching to cell 6.

步骤四、当经验池中的样本数量大于batch大小时,从经验池中选择(例如随机选择)batch大小的样本,利用该样本对DQN模型进行训练。Step 4: when the number of samples in the experience pool is greater than the batch size, select (e.g., randomly select) a batch-sized set of samples from the experience pool, and use these samples to train the DQN model.

例如,在步骤四中,可以利用公式Q(s,a)=E[R_s+γmax_{a'}Q(s',a')|s,a],计算DQN模型的Q值,其中,R_s为在状态s下采取动作a得到的奖励值,γ为折扣因子,γ∈(0,1],例如,设置为0.9,折扣因子反映了旧动作对Q值影响的大小。For example, in Step 4, the Q value of the DQN model may be calculated using the formula Q(s,a)=E[R_s+γmax_{a'}Q(s',a')|s,a], where R_s is the reward value obtained by taking action a in state s, and γ is the discount factor, γ∈(0,1], e.g., set to 0.9; the discount factor reflects how much old actions influence the Q value.
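This Q-value update corresponds to the usual Bellman target, sketched below with the γ=0.9 from the text; the terminal-state handling is an assumption of this sketch.

```python
def q_target(reward, next_q_values, gamma=0.9, done=False):
    """Bellman target r + gamma * max_a' Q(s', a') used to train the DQN.
    `next_q_values` holds Q(s', a') for every action available in s'."""
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)

target = q_target(reward=1.0, next_q_values=[0.5, 2.0])  # 1.0 + 0.9 * max(0.5, 2.0)
```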

步骤五:返回执行步骤三,直到完成1000步的轨迹探索。Step 5: Return to step 3 until the trajectory exploration of 1000 steps is completed.

当完成1000步轨迹探索时,返回执行步骤二,进入下一轮训练,直至训练完成,输出DQN训练模型。When the 1000-step trajectory exploration is completed, return to step 2 and enter the next round of training until the training is completed, and output the DQN training model.

步骤六:输出奖励训练图。Step 6: Output the reward training map.

在本申请一些实施例中,可以依次根据前述的第一奖励条件,第二奖励条件,第三奖励条件和第四奖励条件对DQN模型进行训练。In some embodiments of the present application, the DQN model may be trained according to the aforementioned first reward condition, second reward condition, third reward condition and fourth reward condition in sequence.

综上,在本申请实施例中,通过设置用于小区选择的至少一个奖励条件以及对应的奖励值,该奖励条件考虑了用于小区选择的多种因素(例如,小区的信号质量,小区的负载,小区的覆盖范围等),进一步地,根据该至少一个奖励条件以及对应的奖励值,利用终端设备进行小区选择的历史轨迹作为经验对强化学习模型进行训练,有利于选择到合适的小区,避免传统小区选择中的乒乓切换,切换过早等问题。To sum up, in the embodiments of the present application, at least one reward condition for cell selection and a corresponding reward value are set, where the reward condition takes into account multiple factors for cell selection (e.g., the signal quality of a cell, the load of a cell, the coverage of a cell, etc.); further, according to the at least one reward condition and the corresponding reward value, the reinforcement learning model is trained using the historical trajectories of cell selection of the terminal device as experience, which helps select a suitable cell and avoids problems such as ping-pong handover and premature handover in traditional cell selection.

上文结合图4至图8,从模型训练的角度描述了根据本申请实施例的小区选择的方法,下文结合图9,从模型测试或模型使用的角度详细描述根据本申请另一实施例的小区选择的方法。The method for cell selection according to an embodiment of the present application has been described above from the perspective of model training with reference to FIG. 4 to FIG. 8; the method for cell selection according to another embodiment of the present application is described in detail below from the perspective of model testing or model use with reference to FIG. 9.

图9是根据本申请另一实施例的小区选择的方法300的示意性流程图,如图9所示,该方法300包括如下至少部分内容:FIG. 9 is a schematic flowchart of a method 300 for cell selection according to another embodiment of the present application. As shown in FIG. 9, the method 300 includes at least part of the following content:

S310,利用强化学习模型根据终端设备在多个小区的状态信息确定选择的目标小区(或者说,切换的目标小区)。S310, using a reinforcement learning model to determine a selected target cell (or in other words, a handover target cell) according to state information of the terminal device in multiple cells.

应理解,在申请实施例中,该方法300可以由终端设备执行,或者,也可以由网络设备执行。It should be understood that, in the embodiment of the application, the method 300 may be executed by a terminal device, or may also be executed by a network device.

在本申请一些实施例中,该强化学习模型可以是采用方法200中所述的方法训练得到的。In some embodiments of the present application, the reinforcement learning model may be obtained through training using the method described in method 200 .

应理解,本申请并不限定具体的强化学习模型,例如可以包括但不限于深度Q网络(DQN)模型。It should be understood that the present application does not limit a specific reinforcement learning model, for example, it may include but not limited to a deep Q network (DQN) model.

在该方法300中,该终端设备在多个小区的状态信息可以对应于方法200中的终端设备的状态空间。In the method 300, the state information of the terminal device in multiple cells may correspond to the state space of the terminal device in the method 200.

在本申请一些实施例中,该方法300可以为强化学习模型训练后的测试方法。In some embodiments of the present application, the method 300 may be a testing method after reinforcement learning model training.

此情况下,该方法300还可以包括:In this case, the method 300 may also include:

根据所述目标小区是否满足预设条件,确定切换是否成功。即确定基于该强化学习模型进行小区切换的切换成功率。Whether the handover is successful is determined according to whether the target cell satisfies a preset condition. That is, determine the handover success rate of cell handover based on the reinforcement learning model.

在本申请一些实施例中,所述预设条件与以下中的至少一项相关:In some embodiments of the present application, the preset condition is related to at least one of the following:

小区的信号质量信息,小区的覆盖范围,小区的负载信息,终端设备在小区的驻留时长。The signal quality information of the cell, the coverage area of the cell, the load information of the cell, and the residence time of the terminal equipment in the cell.

作为示例而非限定,所述预设条件包括以下中至少一个:As an example but not a limitation, the preset conditions include at least one of the following:

目标小区的信号质量信息大于或等于信号质量阈值;The signal quality information of the target cell is greater than or equal to the signal quality threshold;

目标小区在多个候选小区中的信号质量信息最大;The signal quality information of the target cell among multiple candidate cells is the largest;

终端设备位于目标小区的覆盖范围内;The terminal device is located within the coverage of the target cell;

目标小区的负载满足负载阈值;The load of the target cell meets the load threshold;

终端设备在目标小区的驻留时长大于时长阈值。The dwell time of the terminal device in the target cell is longer than the time threshold.

在一些实施例中,所述目标小区的负载满足负载阈值,包括:In some embodiments, the load of the target cell meets a load threshold, including:

目标小区的可用负载大于或等于第一负载阈值;和/或The available load of the target cell is greater than or equal to the first load threshold; and/or

目标小区的已用负载小于或等于第二负载阈值。The used load of the target cell is less than or equal to the second load threshold.
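The preset conditions listed above can be combined into a single success check used at test time. The thresholds below are illustrative assumptions, and which subset of conditions applies depends on the reward condition the model was trained with.

```python
def handover_successful(target, rsrp_by_cell, in_coverage, used_load, dwell_time,
                        rsrp_threshold=-114.0, load_threshold=15, dwell_threshold=5):
    """Return True if the chosen target cell satisfies the preset conditions:
    RSRP meets the threshold, the cell is the strongest candidate, the
    terminal is inside its coverage, the used load is under the load
    threshold, and the dwell time exceeds the dwell threshold."""
    return (rsrp_by_cell[target] >= rsrp_threshold
            and rsrp_by_cell[target] == max(rsrp_by_cell.values())
            and in_coverage
            and used_load <= load_threshold
            and dwell_time > dwell_threshold)

rsrp = {1: -100.0, 2: -110.0}
print(handover_successful(1, rsrp, in_coverage=True, used_load=10, dwell_time=8))  # True
```

Counting the fraction of True results over many test trajectories gives the handover success rate mentioned in the text.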

在一些实施例中,所述预设条件可以与前述方法200中的目标奖励条件对应。In some embodiments, the preset condition may correspond to the target reward condition in the aforementioned method 200 .

在一些实施例中,例如,若强化学习模型是基于前述方法200中的第一奖励条件训练得到的,则该预设条件可以为第一奖励条件。In some embodiments, for example, if the reinforcement learning model is trained based on the first reward condition in the foregoing method 200, the preset condition may be the first reward condition.

例如,预设条件为目标小区的信号质量信息大于或等于信号质量阈值,并且,目标小区在多个候选小区中的信号质量信息最大,记为第一预设条件。For example, the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the first preset condition.

在一些实施例中,例如,若强化学习模型是基于前述方法200中的第二奖励条件训练得到的,则该预设条件可以为第二奖励条件。In some embodiments, for example, if the reinforcement learning model is trained based on the second reward condition in the aforementioned method 200, the preset condition may be the second reward condition.

例如,预设条件为目标小区的信号质量信息大于或等于信号质量阈值,所述终端设备位于目标小区的覆盖范围内,并且目标小区在多个候选小区中的信号质量信息最大,记为第二预设条件。For example, the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the target cell, and the signal quality information of the target cell is the largest among the multiple candidate cells, which is recorded as the second preset condition.

在一些实施例中,例如,若强化学习模型是基于前述方法200中的第三奖励条件训练得到的,则该预设条件可以为第三奖励条件。In some embodiments, for example, if the reinforcement learning model is trained based on the third reward condition in the foregoing method 200, the preset condition may be the third reward condition.

例如,预设条件为目标小区的信号质量信息大于或等于信号质量阈值,目标小区的负载满足负载阈值,并且目标小区在多个候选小区中的信号质量信息最大,记为第三预设条件。For example, the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the load of the target cell meets the load threshold, and the signal quality information of the target cell is the largest among multiple candidate cells, which is recorded as the third preset condition.

在一些实施例中，例如，若强化学习模型是基于前述方法200中的第四奖励条件训练得到的，则该预设条件可以为第四奖励条件。In some embodiments, for example, if the reinforcement learning model is trained based on the fourth reward condition in the aforementioned method 200, the preset condition may be the fourth reward condition.

例如，预设条件为目标小区的信号质量信息大于或等于信号质量阈值，目标小区的负载满足负载阈值，所述终端设备位于目标小区的覆盖范围内，并且目标小区在多个候选小区中的信号质量信息最大，记为第四预设条件。For example, the preset condition is that the signal quality information of the target cell is greater than or equal to the signal quality threshold, the load of the target cell meets the load threshold, the terminal device is located within the coverage of the target cell, and the signal quality information of the target cell is the largest among the multiple candidate cells, which is recorded as the fourth preset condition.
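As a non-limiting sketch, the four preset conditions above can be checked as follows; the threshold values and the candidate-cell representation are assumptions made for this illustration, not values specified in the present application:

```python
# Hypothetical sketch of the four preset conditions. The thresholds and the
# dictionary field names are illustrative assumptions.
SIGNAL_QUALITY_THRESHOLD = -110.0  # assumed RSRP threshold in dBm
LOAD_THRESHOLD = 0.8               # assumed maximum used-load ratio

def meets_preset_condition(target, candidates, level):
    """level 1..4 selects the first..fourth preset condition."""
    quality_ok = target["rsrp"] >= SIGNAL_QUALITY_THRESHOLD
    best_quality = target["rsrp"] == max(c["rsrp"] for c in candidates)
    in_coverage = target.get("in_coverage", False)
    load_ok = target.get("load", 0.0) <= LOAD_THRESHOLD
    if level == 1:
        return quality_ok and best_quality
    if level == 2:
        return quality_ok and in_coverage and best_quality
    if level == 3:
        return quality_ok and load_ok and best_quality
    return quality_ok and load_ok and in_coverage and best_quality
```

Each condition adds one check on top of the first, so a model that passes the fourth condition also passes the other three for the same cell.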

在本申请一些实施例中，所述终端设备在多个小区的状态信息包括所述终端设备在第一小区的状态信息，作为示例而非限定，所述终端设备在所述第一小区的状态信息包括以下中的至少一项：In some embodiments of the present application, the state information of the terminal device in multiple cells includes the state information of the terminal device in a first cell. As an example rather than a limitation, the state information of the terminal device in the first cell includes at least one of the following:

第一小区的信号质量信息,第一小区的负载信息,终端设备是否在第一小区的覆盖范围,终端设备的位置信息,小区的位置信息(或者,小区的覆盖范围信息),切换状态。Signal quality information of the first cell, load information of the first cell, whether the terminal device is within the coverage of the first cell, location information of the terminal device, location information of the cell (or coverage information of the cell), and handover status.

其中,该切换状态用于指示第一小区与上一时刻所选择的小区是否相同,例如,若相同,则表示未发生切换,否则,表示发生切换。Wherein, the handover state is used to indicate whether the first cell is the same as the cell selected at the last moment, for example, if they are the same, it indicates that handover has not occurred; otherwise, it indicates that handover has occurred.

在本申请一些实施例中,所述S310包括:In some embodiments of the present application, the S310 includes:

利用强化学习模型根据至少一个奖励条件及其对应的奖励值,以及终端设备在多个小区的状态信息确定切换的目标小区。A reinforcement learning model is used to determine a handover target cell according to at least one reward condition and its corresponding reward value, as well as state information of the terminal equipment in multiple cells.

例如,根据所述终端设备在所述多个小区中的每个小区的状态信息以及所述至少一个奖励条件,确定所述终端设备切换至所述每个小区的奖励值;For example, determining a reward value for the terminal device to switch to each cell according to the state information of the terminal device in each of the multiple cells and the at least one reward condition;

根据所述终端设备切换至所述每个小区的奖励值,确定所述多个小区中的目标小区。A target cell among the plurality of cells is determined according to a reward value for the terminal device to switch to each cell.

作为示例,选择奖励值最大的小区为目标小区。As an example, the cell with the largest reward value is selected as the target cell.
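The selection rule above can be sketched as follows; `reward_fn` is an assumed stand-in for the per-cell reward estimate produced by the trained model:

```python
# Illustrative sketch: evaluate the reward for switching to each candidate
# cell and select the cell with the largest reward value.
def select_target_cell(states, reward_fn):
    """states maps cell id -> state info; returns the id of the best cell."""
    rewards = {cell: reward_fn(state) for cell, state in states.items()}
    return max(rewards, key=rewards.get)
```

With per-cell RSRP as a toy state and the identity function as the reward, the cell with the strongest signal is returned.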

以下结合具体实施例,对强化学习模型的测试方式进行说明。The test method of the reinforcement learning model will be described below in combination with specific embodiments.

步骤一、重置终端设备的状态空间。Step 1. Reset the state space of the terminal device.

随机选择终端设备的轨迹起始点，确定终端设备在当前位置坐标下，所属的小区，该小区的RSRP大小，以及对应的切换状态。Randomly select a starting point of the terminal device's trajectory, and determine, at the terminal device's current location coordinates, the cell to which the terminal device belongs, the RSRP of that cell, and the corresponding handover state.

步骤二、终端设备开始移动,终端设备利用训练完的强化学习模型判断切换至哪个小区,假设强化学习模型判断切换至小区X。Step 2: The terminal device starts to move, and the terminal device uses the trained reinforcement learning model to determine which cell to switch to, assuming that the reinforcement learning model determines to switch to cell X.

步骤三、根据小区X是否满足预设条件,确定切换是否成功。Step 3: Determine whether the handover is successful according to whether the cell X satisfies a preset condition.

步骤四、输出切换成功率的结果仿真图。Step 4: Output a simulation diagram of the handover success rate results.

在一些实施例中,可以根据不同的预设条件,分别对该强化学习模型进行测试。In some embodiments, the reinforcement learning model can be tested separately according to different preset conditions.

例如，首先根据第一预设条件对强化学习模型进行测试，在切换成功率满足要求的情况下，再根据第二预设条件对强化学习模型进行测试，在切换成功率满足要求的情况下，再根据第三预设条件对强化学习模型进行测试，在切换成功率满足要求的情况下，再根据第四预设条件对强化学习模型进行测试。For example, the reinforcement learning model is first tested according to the first preset condition; if the handover success rate meets the requirement, the model is then tested according to the second preset condition; if the handover success rate again meets the requirement, the model is tested according to the third preset condition; and if the handover success rate still meets the requirement, the model is finally tested according to the fourth preset condition.

在本申请另一些实施例中,该方法300可以为强化学习模型的使用方法。In other embodiments of the present application, the method 300 may be a method for using a reinforcement learning model.

例如，终端设备可以向网络设备上报其在多个小区的状态信息，网络设备可以利用强化学习模型根据终端设备在多个小区的状态信息确定终端设备切换的目标小区，进一步指示该终端设备切换至该目标小区。For example, the terminal device may report its state information in multiple cells to the network device, and the network device may use the reinforcement learning model to determine, according to the state information of the terminal device in the multiple cells, the target cell for the terminal device to switch to, and further instruct the terminal device to switch to the target cell.

又例如,终端设备可以利用强化学习模型根据该终端设备在多个小区的状态信息确定终端设备切换的目标小区。进一步地,终端设备发起向目标小区的切换。For another example, the terminal device may use a reinforcement learning model to determine a target cell for the terminal device to switch to according to status information of the terminal device in multiple cells. Further, the terminal device initiates handover to the target cell.

综上，在本申请实施例中，通过利用根据至少一个奖励条件以及对应的奖励值训练得到的强化学习模型进行小区选择，有利于选择到合适的小区，避免传统小区选择中的乒乓切换，切换过早等问题。To sum up, in the embodiments of the present application, performing cell selection with a reinforcement learning model trained according to at least one reward condition and the corresponding reward value helps select a suitable cell and avoids problems in traditional cell selection such as ping-pong handover and premature handover.

上文结合图4至图9，详细描述了本申请的方法实施例，下文结合图10至图13，详细描述本申请的装置实施例，应理解，装置实施例与方法实施例相互对应，类似的描述可以参照方法实施例。The method embodiments of the present application are described in detail above in conjunction with FIG. 4 to FIG. 9, and the device embodiments of the present application are described in detail below in conjunction with FIG. 10 to FIG. 13. It should be understood that the device embodiments correspond to the method embodiments; for similar descriptions, reference may be made to the method embodiments.

图10示出了根据本申请实施例的小区选择的设备400的示意性框图。如图10所示,该设备400包括:Fig. 10 shows a schematic block diagram of a device 400 for cell selection according to an embodiment of the present application. As shown in Figure 10, the device 400 includes:

处理单元410，用于确定用于小区选择的至少一个奖励条件以及所述至少一个奖励条件对应的奖励值；以及根据所述至少一个奖励条件以及所述至少一个奖励条件对应的奖励值对用于小区选择的强化学习模型进行训练。The processing unit 410 is configured to determine at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition, and to train a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.

在本申请一些实施例中,所述至少一个奖励条件包括目标小区需要满足的目标奖励条件,所述目标奖励条件包括以下中至少一个:In some embodiments of the present application, the at least one reward condition includes a target reward condition that the target cell needs to meet, and the target reward condition includes at least one of the following:

小区的信号质量信息大于或等于信号质量阈值;The signal quality information of the cell is greater than or equal to the signal quality threshold;

小区在多个候选小区中的信号质量信息最大;The cell has the largest signal quality information among multiple candidate cells;

所述终端设备位于小区的覆盖范围内;The terminal device is located within the coverage of the cell;

小区的负载满足负载阈值;The load of the cell meets the load threshold;

终端设备在小区的驻留时长大于时长阈值。The dwell time of the terminal device in the cell is longer than the time threshold.

在本申请一些实施例中,所述小区的负载满足负载阈值,包括:In some embodiments of the present application, the load of the cell meets the load threshold, including:

小区的可用负载大于或等于第一负载阈值;和/或The available load of the cell is greater than or equal to the first load threshold; and/or

小区的已用负载小于或等于第二负载阈值。The used load of the cell is less than or equal to the second load threshold.

在本申请一些实施例中,所述目标奖励条件包括第一奖励条件,其中,所述第一奖励条件包括:In some embodiments of the present application, the target reward condition includes a first reward condition, wherein the first reward condition includes:

小区的信号质量信息大于或等于信号质量阈值,并且,小区在多个候选小区中的信号质量信息最大。The signal quality information of the cell is greater than or equal to the signal quality threshold, and the signal quality information of the cell is the largest among multiple candidate cells.

在本申请一些实施例中,所述目标奖励条件包括第二奖励条件,其中,所述第二奖励条件包括:In some embodiments of the present application, the target reward condition includes a second reward condition, wherein the second reward condition includes:

小区的信号质量信息大于或等于信号质量阈值,所述终端设备位于小区的覆盖范围内,并且小区在多个候选小区中的信号质量信息最大。The signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.

在本申请一些实施例中,所述目标奖励条件包括第三奖励条件,其中,所述第三奖励条件包括:In some embodiments of the present application, the target reward condition includes a third reward condition, wherein the third reward condition includes:

小区的信号质量信息大于或等于信号质量阈值,小区的负载满足负载阈值,并且小区在多个候选小区中的信号质量信息最大。The signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, and the signal quality information of the cell is the largest among multiple candidate cells.

在本申请一些实施例中,所述目标奖励条件包括第四奖励条件,其中,所述第四奖励条件包括:In some embodiments of the present application, the target reward condition includes a fourth reward condition, wherein the fourth reward condition includes:

小区的信号质量信息大于或等于信号质量阈值,小区的负载满足负载阈值,所述终端设备位于小区的覆盖范围内,并且小区在多个候选小区中的信号质量信息最大。The signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell meets the load threshold, the terminal device is located within the coverage of the cell, and the signal quality information of the cell is the largest among multiple candidate cells.

在本申请一些实施例中,所述确定所述至少一个奖励条件对应的奖励值,包括:In some embodiments of the present application, the determining the reward value corresponding to the at least one reward condition includes:

在选择的小区满足所述目标奖励条件时,给予第一奖励值;或者Giving a first reward value when the selected cell satisfies the target reward condition; or

在选择的小区不满足所述目标奖励条件时,给予第二奖励值;When the selected cell does not meet the target reward condition, give a second reward value;

其中,所述第一奖励值大于所述第二奖励值。Wherein, the first reward value is greater than the second reward value.

在本申请一些实施例中,所述处理单元410还用于:In some embodiments of the present application, the processing unit 410 is further configured to:

在选择的小区满足所述目标奖励条件,并且选择的小区与上一个时刻选择的小区不同的情况下,给予第三奖励值;或者If the selected cell satisfies the target reward condition and the selected cell is different from the cell selected at the previous moment, giving a third reward value; or

在选择的小区满足所述目标奖励条件,但是选择的小区与上一个时刻选择的小区相同的情况下,给予第四奖励值;或者When the selected cell satisfies the target reward condition, but the selected cell is the same as the cell selected at the previous moment, giving a fourth reward value; or

在选择的小区不满足所述目标奖励条件,并且切换事件为切换过早的情况下,给予第五奖励值;或者In the case that the selected cell does not meet the target reward condition, and the handover event is too early handover, giving a fifth reward value; or

在选择的小区不满足所述目标奖励条件,并且切换事件为切换至错误小区的情况下,给予第六奖励值;或者If the selected cell does not meet the target reward condition, and the handover event is a handover to a wrong cell, giving a sixth reward value; or

在选择的小区满足所述目标奖励条件的情况下,根据终端设备在所述小区的驻留时长,给予奖励值;In the case that the selected cell satisfies the target reward condition, giving a reward value according to the residence time of the terminal device in the cell;

其中,所述第三奖励值大于所述第四奖励值,所述第三奖励值大于所述第五奖励值,所述第三奖励值大于所述第六奖励值。Wherein, the third reward value is greater than the fourth reward value, the third reward value is greater than the fifth reward value, and the third reward value is greater than the sixth reward value.
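The cases above (except the dwell-time-based variant) can be sketched as a reward function; the numeric values are assumptions chosen only to respect the stated ordering, in which the third reward value exceeds the fourth, fifth, and sixth:

```python
# Illustrative reward values; only their relative ordering matters here.
R_THIRD = 1.0    # condition met and a different cell was selected
R_FOURTH = 0.5   # condition met but the same cell as the previous moment
R_FIFTH = -1.0   # condition not met, handover occurred too early
R_SIXTH = -1.0   # condition not met, handover to a wrong cell

def reward(meets_condition, same_as_previous, too_early=False):
    if meets_condition:
        return R_FOURTH if same_as_previous else R_THIRD
    return R_FIFTH if too_early else R_SIXTH
```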

在本申请一些实施例中,所述处理单元410还用于:In some embodiments of the present application, the processing unit 410 is further configured to:

获取终端设备的状态空间和行为空间，其中，所述终端设备的状态空间包括所述终端设备在多个时刻的状态信息，所述终端设备的行为空间包括所述终端设备在多个时刻的行为信息；Obtain the state space and behavior space of the terminal device, wherein the state space of the terminal device includes state information of the terminal device at multiple moments, and the behavior space of the terminal device includes behavior information of the terminal device at multiple moments;

根据所述终端设备的状态空间和行为空间,以及所述至少一个奖励条件和所述至少一个奖励条件对应的奖励值,对所述强化学习模型进行训练。The reinforcement learning model is trained according to the state space and behavior space of the terminal device, and the at least one reward condition and the reward value corresponding to the at least one reward condition.

在本申请一些实施例中,所述终端设备在多个时刻的状态信息包括第一时刻的状态信息,所述第一时刻的状态信息包括以下中的至少一项:In some embodiments of the present application, the state information of the terminal device at multiple moments includes state information at a first moment, and the state information at the first moment includes at least one of the following:

所述终端设备在所述第一时刻所属的小区信息;Information about the cell to which the terminal device belongs at the first moment;

所述终端设备在所述第一时刻所属小区的信号质量信息;Signal quality information of the cell to which the terminal device belongs at the first moment;

所述终端设备在所述第一时刻的切换状态信息,用于指示所述终端设备在所述第一时刻是否发生切换;The switching status information of the terminal device at the first moment is used to indicate whether the terminal device is switched at the first moment;

所述终端设备在所述第一时刻的位置信息。The location information of the terminal device at the first moment.

在本申请一些实施例中，所述终端设备在所述第一时刻的切换状态信息根据所述终端设备在所述第一时刻所属的小区信息和所述终端设备在第二时刻所属的小区是否相同确定，其中，所述第二时刻为所述第一时刻的上一时刻。In some embodiments of the present application, the handover state information of the terminal device at the first moment is determined according to whether the cell to which the terminal device belongs at the first moment is the same as the cell to which the terminal device belongs at the second moment, where the second moment is the moment immediately preceding the first moment.

在本申请一些实施例中,所述终端设备在多个时刻的行为信息包括所述终端设备在第一时刻的行为信息,所述终端设备在所述第一时刻的行为信息用于指示所述终端设备在所述第一时刻选择了第一小区。In some embodiments of the present application, the behavior information of the terminal device at multiple moments includes behavior information of the terminal device at a first moment, and the behavior information of the terminal device at the first moment is used to indicate the The terminal device selects the first cell at the first moment.

在本申请一些实施例中,所述处理单元410还用于:In some embodiments of the present application, the processing unit 410 is further configured to:

根据第一时刻的状态信息和所述至少一个奖励条件,确定第一时刻的行为信息对应的奖励值;Determine the reward value corresponding to the behavior information at the first moment according to the status information at the first moment and the at least one reward condition;

将第二时刻的状态信息,所述第一时刻的行为信息,所述第一时刻的行为对应的奖励值和所述第一时刻的状态信息存入经验池,其中,所述第二时刻为所述第一时刻的上一时刻;The state information at the second moment, the behavior information at the first moment, the reward value corresponding to the behavior at the first moment and the state information at the first moment are stored in the experience pool, wherein the second moment is a moment immediately preceding said first moment;

利用所述经验池对所述强化学习模型进行训练。The reinforcement learning model is trained using the experience pool.
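The experience-pool step above stores transitions of the form (state at the second moment, behavior at the first moment, reward, state at the first moment). A minimal sketch, assuming a bounded buffer with random minibatch sampling; the capacity is an assumption:

```python
import random
from collections import deque

# Bounded experience pool: old transitions are evicted once capacity is
# reached, and training draws random minibatches from the stored tuples.
class ExperiencePool:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, prev_state, action, reward, state):
        self.buffer.append((prev_state, action, reward, state))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

Sampling uniformly from the pool decorrelates consecutive transitions, which is the usual reason for training from a replay buffer rather than from the latest transition alone.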

在本申请一些实施例中,所述强化学习模型包括深度Q网络模型。In some embodiments of the present application, the reinforcement learning model includes a deep Q-network model.
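As a minimal illustration of the learning target that a deep Q network approximates, a tabular Q-learning update can be written as follows; in a deep Q network the table is replaced by a neural network, and the learning rate and discount factor shown are assumptions:

```python
from collections import defaultdict

# One tabular Q-learning update:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a))
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```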

可选地,在一些实施例中,上述通信单元可以是通信接口或收发器,或者是通信芯片或者片上系统的输入输出接口。上述处理单元可以是一个或多个处理器。Optionally, in some embodiments, the above-mentioned communication unit may be a communication interface or a transceiver, or an input-output interface of a communication chip or a system-on-chip. The aforementioned processing unit may be one or more processors.

应理解，根据本申请实施例的设备400可对应于本申请方法实施例中的终端设备或网络设备，并且设备400中的各个单元的上述和其它操作和/或功能分别为了实现图4至图8所示方法200中终端设备或网络设备的相应流程，为了简洁，在此不再赘述。It should be understood that the device 400 according to the embodiments of the present application may correspond to the terminal device or the network device in the method embodiments of the present application, and the above and other operations and/or functions of the units in the device 400 are respectively intended to implement the corresponding processes of the terminal device or the network device in the method 200 shown in FIG. 4 to FIG. 8; for the sake of brevity, they are not repeated here.

图11示出了根据本申请实施例的小区选择的设备500的示意性框图。如图11所示,该设备500包括:Fig. 11 shows a schematic block diagram of a device 500 for cell selection according to an embodiment of the present application. As shown in Figure 11, the device 500 includes:

处理单元510,用于利用强化学习模型根据终端设备在多个小区的状态信息确定选择的目标小区。The processing unit 510 is configured to use a reinforcement learning model to determine a selected target cell according to state information of the terminal device in multiple cells.

在本申请一些实施例中,所述处理单元510还用于:In some embodiments of the present application, the processing unit 510 is further configured to:

根据所述目标小区是否满足预设条件,确定切换是否成功。Whether the handover is successful is determined according to whether the target cell satisfies a preset condition.

在本申请一些实施例中,所述预设条件包括以下中至少一个:In some embodiments of the present application, the preset conditions include at least one of the following:

所述目标小区的信号质量信息大于或等于信号质量阈值;The signal quality information of the target cell is greater than or equal to a signal quality threshold;

所述目标小区在多个候选小区中的信号质量信息最大;The target cell has the largest signal quality information among multiple candidate cells;

终端设备位于所述目标小区的覆盖范围内;The terminal device is located within the coverage of the target cell;

所述目标小区的负载满足负载阈值;The load of the target cell meets a load threshold;

所述终端设备在所述目标小区的驻留时长大于时长阈值。The dwell time of the terminal device in the target cell is greater than a time threshold.

在本申请一些实施例中，所述终端设备在多个小区的状态信息包括所述终端设备在第一小区的状态信息，其中，所述终端设备在所述第一小区的状态信息包括以下中的至少一项：In some embodiments of the present application, the state information of the terminal device in multiple cells includes the state information of the terminal device in a first cell, wherein the state information of the terminal device in the first cell includes at least one of the following:

所述第一小区的信号质量信息,所述第一小区的负载信息,所述终端设备是否在第一小区的覆盖范围。Signal quality information of the first cell, load information of the first cell, and whether the terminal device is within the coverage of the first cell.

在本申请一些实施例中,所述处理单元510还用于:In some embodiments of the present application, the processing unit 510 is further configured to:

根据所述终端设备在所述多个小区中的每个小区的状态信息以及至少一个奖励条件,确定所述终端设备切换至所述每个小区的奖励值;determining a reward value for the terminal device to switch to each cell according to the state information of the terminal device in each of the multiple cells and at least one reward condition;

根据所述终端设备切换至所述每个小区的奖励值,确定所述多个小区中的目标小区。A target cell among the plurality of cells is determined according to a reward value for the terminal device to switch to each cell.

可选地,在一些实施例中,上述通信单元可以是通信接口或收发器,或者是通信芯片或者片上系统的输入输出接口。上述处理单元可以是一个或多个处理器。Optionally, in some embodiments, the above-mentioned communication unit may be a communication interface or a transceiver, or an input-output interface of a communication chip or a system-on-chip. The aforementioned processing unit may be one or more processors.

应理解，根据本申请实施例的设备500可对应于本申请方法实施例中的终端设备或网络设备，并且设备500中的各个单元的上述和其它操作和/或功能分别为了实现图9所示方法300中终端设备或网络设备的相应流程，为了简洁，在此不再赘述。It should be understood that the device 500 according to the embodiments of the present application may correspond to the terminal device or the network device in the method embodiments of the present application, and the above and other operations and/or functions of the units in the device 500 are respectively intended to implement the corresponding processes of the terminal device or the network device in the method 300 shown in FIG. 9; for the sake of brevity, they are not repeated here.

图12是本申请实施例提供的一种通信设备600示意性结构图。图12所示的通信设备600包括处理器610,处理器610可以从存储器中调用并运行计算机程序,以实现本申请实施例中的方法。Fig. 12 is a schematic structural diagram of a communication device 600 provided by an embodiment of the present application. The communication device 600 shown in FIG. 12 includes a processor 610, and the processor 610 can call and run a computer program from a memory, so as to implement the method in the embodiment of the present application.

可选地,如图12所示,通信设备600还可以包括存储器620。其中,处理器610可以从存储器620中调用并运行计算机程序,以实现本申请实施例中的方法。Optionally, as shown in FIG. 12 , the communication device 600 may further include a memory 620 . Wherein, the processor 610 can invoke and run a computer program from the memory 620, so as to implement the method in the embodiment of the present application.

其中,存储器620可以是独立于处理器610的一个单独的器件,也可以集成在处理器610中。Wherein, the memory 620 may be an independent device independent of the processor 610 , or may be integrated in the processor 610 .

可选地，如图12所示，通信设备600还可以包括收发器630，处理器610可以控制该收发器630与其他设备进行通信，具体地，可以向其他设备发送信息或数据，或接收其他设备发送的信息或数据。Optionally, as shown in FIG. 12, the communication device 600 may further include a transceiver 630, and the processor 610 may control the transceiver 630 to communicate with other devices; specifically, it may send information or data to other devices, or receive information or data sent by other devices.

其中,收发器630可以包括发射机和接收机。收发器630还可以进一步包括天线,天线的数量可以为一个或多个。Wherein, the transceiver 630 may include a transmitter and a receiver. The transceiver 630 may further include antennas, and the number of antennas may be one or more.

可选地，该通信设备600具体可为本申请实施例的网络设备，并且该通信设备600可以实现本申请实施例的各个方法中由网络设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the communication device 600 may specifically be the network device of the embodiments of the present application, and the communication device 600 may implement the corresponding processes implemented by the network device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

可选地，该通信设备600具体可为本申请实施例的移动终端/终端设备，并且该通信设备600可以实现本申请实施例的各个方法中由移动终端/终端设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the communication device 600 may specifically be the mobile terminal/terminal device of the embodiments of the present application, and the communication device 600 may implement the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

图13是本申请实施例的芯片的示意性结构图。图13所示的芯片700包括处理器710,处理器710可以从存储器中调用并运行计算机程序,以实现本申请实施例中的方法。FIG. 13 is a schematic structural diagram of a chip according to an embodiment of the present application. The chip 700 shown in FIG. 13 includes a processor 710, and the processor 710 can call and run a computer program from a memory, so as to implement the method in the embodiment of the present application.

可选地,如图13所示,芯片700还可以包括存储器720。其中,处理器710可以从存储器720中调用并运行计算机程序,以实现本申请实施例中的方法。Optionally, as shown in FIG. 13 , the chip 700 may further include a memory 720 . Wherein, the processor 710 can invoke and run a computer program from the memory 720, so as to implement the method in the embodiment of the present application.

其中,存储器720可以是独立于处理器710的一个单独的器件,也可以集成在处理器710中。Wherein, the memory 720 may be an independent device independent of the processor 710 , or may be integrated in the processor 710 .

可选地，该芯片700还可以包括输入接口730。其中，处理器710可以控制该输入接口730与其他设备或芯片进行通信，具体地，可以获取其他设备或芯片发送的信息或数据。Optionally, the chip 700 may further include an input interface 730. The processor 710 may control the input interface 730 to communicate with other devices or chips; specifically, it may obtain information or data sent by other devices or chips.

可选地,该芯片700还可以包括输出接口740。其中,处理器710可以控制该输出接口740与其他设备或芯片进行通信,具体地,可以向其他设备或芯片输出信息或数据。Optionally, the chip 700 may also include an output interface 740 . Wherein, the processor 710 can control the output interface 740 to communicate with other devices or chips, specifically, can output information or data to other devices or chips.

可选地,该芯片可应用于本申请实施例中的网络设备,并且该芯片可以实现本申请实施例的各个方法中由网络设备实现的相应流程,为了简洁,在此不再赘述。Optionally, the chip can be applied to the network device in the embodiment of the present application, and the chip can implement the corresponding processes implemented by the network device in the methods of the embodiment of the present application. For the sake of brevity, details are not repeated here.

可选地，该芯片可应用于本申请实施例中的移动终端/终端设备，并且该芯片可以实现本申请实施例的各个方法中由移动终端/终端设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the chip may be applied to the mobile terminal/terminal device in the embodiments of the present application, and the chip may implement the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

应理解，本申请实施例提到的芯片还可以称为系统级芯片，系统芯片，芯片系统或片上系统芯片等。It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, etc.

应理解，本申请实施例的处理器可能是一种集成电路芯片，具有信号的处理能力。在实现过程中，上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器，闪存、只读存储器，可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器，处理器读取存储器中的信息，结合其硬件完成上述方法的步骤。It should be understood that the processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. In an implementation, the steps of the above method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The above processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.

可以理解，本申请实施例中的存储器可以是易失性存储器或非易失性存储器，或可包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM)，其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。应注意，本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。It can be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM), and direct rambus random access memory (Direct Rambus RAM, DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.

应理解，上述存储器为示例性但不是限制性说明，例如，本申请实施例中的存储器还可以是静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)以及直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)等等。也就是说，本申请实施例中的存储器旨在包括但不限于这些和任意其它适合类型的存储器。It should be understood that the above description of the memory is illustrative rather than restrictive. For example, the memory in the embodiments of the present application may also be a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synch link DRAM, SLDRAM), a direct rambus random access memory (Direct Rambus RAM, DR RAM), or the like. That is, the memory in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.

本申请实施例还提供了一种计算机可读存储介质,用于存储计算机程序。The embodiment of the present application also provides a computer-readable storage medium for storing computer programs.

可选的，该计算机可读存储介质可应用于本申请实施例中的网络设备，并且该计算机程序使得计算机执行本申请实施例的各个方法中由网络设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the computer-readable storage medium may be applied to the network device in the embodiments of the present application, and the computer program causes the computer to execute the corresponding processes implemented by the network device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

可选地，该计算机可读存储介质可应用于本申请实施例中的移动终端/终端设备，并且该计算机程序使得计算机执行本申请实施例的各个方法中由移动终端/终端设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the computer-readable storage medium may be applied to the mobile terminal/terminal device in the embodiments of the present application, and the computer program causes the computer to execute the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

本申请实施例还提供了一种计算机程序产品,包括计算机程序指令。The embodiment of the present application also provides a computer program product, including computer program instructions.

可选的，该计算机程序产品可应用于本申请实施例中的网络设备，并且该计算机程序指令使得计算机执行本申请实施例的各个方法中由网络设备实现的相应流程，为了简洁，在此不再赘述。Optionally, the computer program product may be applied to the network device in the embodiments of the present application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the network device in the methods of the embodiments of the present application; for the sake of brevity, they are not repeated here.

Optionally, the computer program product may be applied to the mobile terminal/terminal device in the embodiments of the present application, and the computer program instructions cause a computer to execute the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application. For brevity, details are not repeated here.

An embodiment of the present application further provides a computer program.

Optionally, the computer program may be applied to the network device in the embodiments of the present application. When the computer program runs on a computer, the computer executes the corresponding processes implemented by the network device in the methods of the embodiments of the present application. For brevity, details are not repeated here.

Optionally, the computer program may be applied to the mobile terminal/terminal device in the embodiments of the present application. When the computer program runs on a computer, the computer executes the corresponding processes implemented by the mobile terminal/terminal device in the methods of the embodiments of the present application. For brevity, details are not repeated here.

Those of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.

Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

The above are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (50)

1. A method for cell selection, comprising:
determining at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition; and
training a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.

2. The method according to claim 1, wherein the at least one reward condition comprises a target reward condition that a target cell needs to satisfy, and the target reward condition comprises at least one of the following:
signal quality information of a cell is greater than or equal to a signal quality threshold;
the cell has the largest signal quality information among multiple candidate cells;
a terminal device is located within the coverage of the cell;
a load of the cell satisfies a load threshold; and
a dwell time of the terminal device in the cell is greater than a time threshold.

3. The method according to claim 2, wherein the load of the cell satisfying the load threshold comprises:
an available load of the cell is greater than or equal to a first load threshold; and/or
a used load of the cell is less than or equal to a second load threshold.
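The target reward condition above can be read as a conjunction of per-cell checks. The following is a minimal, hypothetical sketch of such a check; the field names and threshold values are illustrative assumptions, not values taken from the application.

```python
# Hypothetical check of the target reward condition of claim 2: a candidate
# cell qualifies only if every configured sub-condition holds. Thresholds
# (quality_thr, load_thr, dwell_thr) are illustrative assumptions.

def meets_target_reward_condition(cell, candidates, terminal_in_coverage,
                                  quality_thr=-100.0, load_thr=0.8,
                                  dwell_thr=5.0):
    """Return True if `cell` satisfies all configured sub-conditions."""
    best_quality = max(c["quality"] for c in candidates)
    return (cell["quality"] >= quality_thr        # signal quality threshold
            and cell["quality"] == best_quality   # largest among candidates
            and terminal_in_coverage              # terminal inside coverage
            and cell["used_load"] <= load_thr     # load satisfies threshold
            and cell["dwell_time"] > dwell_thr)   # dwell time long enough
```

In practice only a subset of the sub-conditions may be configured (the claim says "at least one of the following"), so a deployment would enable or disable individual checks rather than always requiring all five.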
4. The method according to claim 2 or 3, wherein the target reward condition comprises a first reward condition, and the first reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, and the cell has the largest signal quality information among the multiple candidate cells.

5. The method according to any one of claims 2 to 4, wherein the target reward condition comprises a second reward condition, and the second reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the cell has the largest signal quality information among the multiple candidate cells.

6. The method according to any one of claims 2 to 5, wherein the target reward condition comprises a third reward condition, and the third reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, and the cell has the largest signal quality information among the multiple candidate cells.
7. The method according to any one of claims 2 to 6, wherein the target reward condition comprises a fourth reward condition, and the fourth reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, the terminal device is located within the coverage of the cell, and the cell has the largest signal quality information among the multiple candidate cells.

8. The method according to any one of claims 2 to 7, wherein the determining the reward value corresponding to the at least one reward condition comprises:
giving a first reward value when a selected cell satisfies the target reward condition; or
giving a second reward value when the selected cell does not satisfy the target reward condition;
wherein the first reward value is greater than the second reward value.
9. The method according to any one of claims 2 to 8, wherein the determining the reward value corresponding to the at least one reward condition comprises:
giving a third reward value when a selected cell satisfies the target reward condition and the selected cell is different from a cell selected at a previous moment; or
giving a fourth reward value when the selected cell satisfies the target reward condition but the selected cell is the same as the cell selected at the previous moment; or
giving a fifth reward value when the selected cell does not satisfy the target reward condition and a handover event is a too-early handover; or
giving a sixth reward value when the selected cell does not satisfy the target reward condition and the handover event is a handover to a wrong cell; or
giving a reward value according to a dwell time of the terminal device in the selected cell when the selected cell satisfies the target reward condition;
wherein the third reward value is greater than the fourth reward value, the third reward value is greater than the fifth reward value, and the third reward value is greater than the sixth reward value.
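The tiered assignment of claim 9 can be sketched as a small reward function. The concrete values below are assumptions chosen only to respect the ordering the claim states (the third reward value exceeds the fourth, fifth, and sixth); the application does not fix the numbers themselves.

```python
# Illustrative reward tiers for claim 9. R3 > R4, R3 > R5, R3 > R6 is the
# only ordering the claim requires; the specific magnitudes are assumptions.
R3, R4, R5, R6 = 1.0, 0.5, -1.0, -1.0  # third..sixth reward values

def reward(satisfies_target, same_as_previous, handover_event=None):
    """Map the outcome of one cell selection to a scalar reward."""
    if satisfies_target:
        # A successful handover to a new qualifying cell is rewarded more
        # than re-selecting the same cell.
        return R3 if not same_as_previous else R4
    if handover_event == "too_early":
        return R5
    if handover_event == "wrong_cell":
        return R6
    return R5  # fallback penalty for an unsuccessful selection (assumption)
```

A dwell-time-proportional variant (the last branch of claim 9) would replace the constant `R3`/`R4` with a function of the residence time in the selected cell.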
10. The method according to any one of claims 1 to 9, wherein the training the reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition comprises:
obtaining a state space and a behavior space of a terminal device, wherein the state space of the terminal device comprises state information of the terminal device at multiple moments, and the behavior space of the terminal device comprises behavior information of the terminal device at multiple moments; and
training the reinforcement learning model according to the state space and the behavior space of the terminal device, the at least one reward condition, and the reward value corresponding to the at least one reward condition.

11. The method according to claim 10, wherein the state information of the terminal device at multiple moments comprises state information at a first moment, and the state information at the first moment comprises at least one of the following:
information about a cell to which the terminal device belongs at the first moment;
signal quality information of the cell to which the terminal device belongs at the first moment;
handover state information of the terminal device at the first moment, used to indicate whether the terminal device is handed over at the first moment; and
location information of the terminal device at the first moment.
12. The method according to claim 11, wherein the handover state information of the terminal device at the first moment is determined according to whether the cell to which the terminal device belongs at the first moment is the same as the cell to which the terminal device belongs at a second moment, wherein the second moment is a moment immediately preceding the first moment.

13. The method according to any one of claims 10 to 12, wherein the behavior information of the terminal device at multiple moments comprises behavior information of the terminal device at a first moment, and the behavior information of the terminal device at the first moment is used to indicate that the terminal device selects a first cell at the first moment.
14. The method according to any one of claims 10 to 13, wherein the training the reinforcement learning model according to the state space and the behavior space of the terminal device, the at least one reward condition, and the reward value corresponding to the at least one reward condition comprises:
determining, according to state information at a first moment and the at least one reward condition, a reward value corresponding to behavior information at the first moment;
storing state information at a second moment, the behavior information at the first moment, the reward value corresponding to the behavior at the first moment, and the state information at the first moment into an experience pool, wherein the second moment is a moment immediately preceding the first moment; and
training the reinforcement learning model by using the experience pool.

15. The method according to any one of claims 1 to 14, wherein the reinforcement learning model comprises a deep Q-network model.

16. A method for cell selection, comprising:
determining a selected target cell by using a reinforcement learning model according to state information of a terminal device in multiple cells.

17. The method according to claim 16, wherein the method further comprises:
determining whether a handover is successful according to whether the target cell satisfies a preset condition.
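The experience-pool training of claims 14-15 follows the standard experience-replay pattern: store (previous state, behavior, reward, current state) transitions, then train on random mini-batches. The application specifies a deep Q-network; the sketch below substitutes a tabular Q function so the example stays dependency-free, and all names and hyperparameters are illustrative assumptions.

```python
import random
from collections import deque, defaultdict

# Sketch of the training loop of claims 14-15, with a tabular Q function
# standing in for the deep Q-network. Transitions mirror the tuple the claim
# stores: (second-moment state, first-moment behavior, reward, first-moment
# state). Capacity, alpha, gamma, and batch_size are assumed values.

class ExperiencePool:
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, s_prev, action, reward, s_next):
        self.buffer.append((s_prev, action, reward, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def train(q, pool, actions, alpha=0.1, gamma=0.9, batch_size=4):
    """One Q-learning update pass over a sampled mini-batch."""
    for s_prev, a, r, s_next in pool.sample(batch_size):
        target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
        q[(s_prev, a)] += alpha * (target - q[(s_prev, a)])
    return q
```

A DQN version would replace the `q` table with a neural network and the in-place update with a gradient step toward the same target, but the pool and sampling logic are unchanged.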
18. The method according to claim 17, wherein the preset condition comprises at least one of the following:
signal quality information of the target cell is greater than or equal to a signal quality threshold;
the target cell has the largest signal quality information among multiple candidate cells;
the terminal device is located within the coverage of the target cell;
a load of the target cell satisfies a load threshold; and
a dwell time of the terminal device in the target cell is greater than a time threshold.

19. The method according to any one of claims 16 to 18, wherein the state information of the terminal device in multiple cells comprises state information of the terminal device in a first cell, and the state information of the terminal device in the first cell comprises at least one of the following:
signal quality information of the first cell, load information of the first cell, and whether the terminal device is within the coverage of the first cell.

20. The method according to any one of claims 16 to 19, wherein the determining the selected target cell by using the reinforcement learning model according to the state information of the terminal device in multiple cells comprises:
determining, according to the state information of the terminal device in each of the multiple cells and at least one reward condition, a reward value for the terminal device to hand over to each cell; and
determining a target cell among the multiple cells according to the reward value for the terminal device to hand over to each cell.
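The inference step of claim 20 can be sketched as scoring each candidate cell and picking the highest score. The scoring rule below (signal quality, heavily penalized when a condition fails) is an illustrative assumption; in the claimed method the scores would come from the trained reinforcement learning model.

```python
# Minimal sketch of claim 20: assign each candidate cell a reward value
# derived from the terminal's per-cell state, then select the cell with the
# largest value as the handover target. Scoring rule is an assumption.

def select_target_cell(cell_states, quality_thr=-100.0):
    """Return the cell id with the largest reward value."""
    def reward_value(state):
        r = state["quality"]              # base score: signal quality
        if state["quality"] < quality_thr:
            r -= 1000.0                   # heavy penalty: below quality threshold
        if not state["in_coverage"]:
            r -= 1000.0                   # heavy penalty: terminal out of coverage
        return r
    return max(cell_states, key=lambda cid: reward_value(cell_states[cid]))
```

With a trained model, `reward_value` would instead evaluate the Q-network on the (state, hand-over-to-cell) pair, and the argmax over cells stays the same.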
21. A device for cell selection, comprising:
a processing unit, configured to determine at least one reward condition for cell selection and a reward value corresponding to the at least one reward condition; and
train a reinforcement learning model for cell selection according to the at least one reward condition and the reward value corresponding to the at least one reward condition.

22. The device according to claim 21, wherein the at least one reward condition comprises a target reward condition that a target cell needs to satisfy, and the target reward condition comprises at least one of the following:
signal quality information of a cell is greater than or equal to a signal quality threshold;
the cell has the largest signal quality information among multiple candidate cells;
a terminal device is located within the coverage of the cell;
a load of the cell satisfies a load threshold; and
a dwell time of the terminal device in the cell is greater than a time threshold.

23. The device according to claim 22, wherein the load of the cell satisfying the load threshold comprises:
an available load of the cell is greater than or equal to a first load threshold; and/or
a used load of the cell is less than or equal to a second load threshold.
24. The device according to claim 22 or 23, wherein the target reward condition comprises a first reward condition, and the first reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, and the cell has the largest signal quality information among the multiple candidate cells.

25. The device according to any one of claims 22 to 24, wherein the target reward condition comprises a second reward condition, and the second reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the terminal device is located within the coverage of the cell, and the cell has the largest signal quality information among the multiple candidate cells.

26. The device according to any one of claims 22 to 25, wherein the target reward condition comprises a third reward condition, and the third reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, and the cell has the largest signal quality information among the multiple candidate cells.
27. The device according to any one of claims 22 to 26, wherein the target reward condition comprises a fourth reward condition, and the fourth reward condition comprises:
the signal quality information of the cell is greater than or equal to the signal quality threshold, the load of the cell satisfies the load threshold, the terminal device is located within the coverage of the cell, and the cell has the largest signal quality information among the multiple candidate cells.

28. The device according to any one of claims 22 to 27, wherein the determining the reward value corresponding to the at least one reward condition comprises:
giving a first reward value when a selected cell satisfies the target reward condition; or
giving a second reward value when the selected cell does not satisfy the target reward condition;
wherein the first reward value is greater than the second reward value.
29. The device according to any one of claims 22 to 28, wherein the processing unit is further configured to:
give a third reward value when a selected cell satisfies the target reward condition and the selected cell is different from a cell selected at a previous moment; or
give a fourth reward value when the selected cell satisfies the target reward condition but the selected cell is the same as the cell selected at the previous moment; or
give a fifth reward value when the selected cell does not satisfy the target reward condition and a handover event is a too-early handover; or
give a sixth reward value when the selected cell does not satisfy the target reward condition and the handover event is a handover to a wrong cell; or
give a reward value according to a dwell time of the terminal device in the selected cell when the selected cell satisfies the target reward condition;
wherein the third reward value is greater than the fourth reward value, the third reward value is greater than the fifth reward value, and the third reward value is greater than the sixth reward value.
30. The device according to any one of claims 21 to 29, wherein the processing unit is further configured to:
obtain a state space and a behavior space of a terminal device, wherein the state space of the terminal device comprises state information of the terminal device at multiple moments, and the behavior space of the terminal device comprises behavior information of the terminal device at multiple moments; and
train the reinforcement learning model according to the state space and the behavior space of the terminal device, the at least one reward condition, and the reward value corresponding to the at least one reward condition.

31. The device according to claim 30, wherein the state information of the terminal device at multiple moments comprises state information at a first moment, and the state information at the first moment comprises at least one of the following:
information about a cell to which the terminal device belongs at the first moment;
signal quality information of the cell to which the terminal device belongs at the first moment;
handover state information of the terminal device at the first moment, used to indicate whether the terminal device is handed over at the first moment; and
location information of the terminal device at the first moment.
32. The device according to claim 31, wherein the handover state information of the terminal device at the first moment is determined according to whether the cell to which the terminal device belongs at the first moment is the same as the cell to which the terminal device belongs at a second moment, wherein the second moment is a moment immediately preceding the first moment.

33. The device according to any one of claims 30 to 32, wherein the behavior information of the terminal device at multiple moments comprises behavior information of the terminal device at a first moment, and the behavior information of the terminal device at the first moment is used to indicate that the terminal device selects a first cell at the first moment.

34. The device according to any one of claims 30 to 33, wherein the processing unit is further configured to:
determine, according to state information at a first moment and the at least one reward condition, a reward value corresponding to behavior information at the first moment;
store state information at a second moment, the behavior information at the first moment, the reward value corresponding to the behavior at the first moment, and the state information at the first moment into an experience pool, wherein the second moment is a moment immediately preceding the first moment; and
train the reinforcement learning model by using the experience pool.
35. The device according to any one of claims 21 to 34, wherein the reinforcement learning model comprises a deep Q-network model.

36. A device for cell selection, comprising:
a processing unit, configured to determine a selected target cell by using a reinforcement learning model according to state information of a terminal device in multiple cells.

37. The device according to claim 36, wherein the processing unit is further configured to:
determine whether a handover is successful according to whether the target cell satisfies a preset condition.

38. The device according to claim 37, wherein the preset condition comprises at least one of the following:
signal quality information of the target cell is greater than or equal to a signal quality threshold;
the target cell has the largest signal quality information among multiple candidate cells;
the terminal device is located within the coverage of the target cell;
a load of the target cell satisfies a load threshold; and
a dwell time of the terminal device in the target cell is greater than a time threshold.

39. The device according to any one of claims 36 to 38, wherein the state information of the terminal device in multiple cells comprises state information of the terminal device in a first cell, and the state information of the terminal device in the first cell comprises at least one of the following:
signal quality information of the first cell, load information of the first cell, and whether the terminal device is within the coverage of the first cell.
40. The device according to any one of claims 36 to 39, wherein the processing unit is further configured to:
determine, according to the state information of the terminal device in each of the multiple cells and at least one reward condition, a reward value for the terminal device to hand over to each cell; and
determine a target cell among the multiple cells according to the reward value for the terminal device to hand over to each cell.

41. A communication device, comprising: a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke and run the computer program stored in the memory to execute the method according to any one of claims 1 to 15.

42. A chip, comprising: a processor, configured to invoke and run a computer program from a memory, so that a device on which the chip is installed executes the method according to any one of claims 1 to 15.

43. A computer-readable storage medium for storing a computer program, wherein the computer program causes a computer to execute the method according to any one of claims 1 to 15.

44. A computer program product, comprising computer program instructions, wherein the computer program instructions cause a computer to execute the method according to any one of claims 1 to 15.

45. A computer program, wherein the computer program causes a computer to execute the method according to any one of claims 1 to 15.
46. A communication device, comprising: a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke and run the computer program stored in the memory to execute the method according to any one of claims 16 to 20.

47. A chip, comprising: a processor, configured to invoke and run a computer program from a memory, so that a device on which the chip is installed executes the method according to any one of claims 16 to 20.

48. A computer-readable storage medium for storing a computer program, wherein the computer program causes a computer to execute the method according to any one of claims 16 to 20.

49. A computer program product, comprising computer program instructions, wherein the computer program instructions cause a computer to execute the method according to any one of claims 16 to 20.

50. A computer program, wherein the computer program causes a computer to execute the method according to any one of claims 16 to 20.
PCT/CN2021/135532 2021-11-02 2021-12-03 Cell selection method and device Ceased WO2023077597A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180100906.9A CN118044261A (en) 2021-11-02 2021-12-03 Cell selection method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021128217 2021-11-02
CNPCT/CN2021/128217 2021-11-02

Publications (1)

Publication Number Publication Date
WO2023077597A1 WO2023077597A1 (en)

Family

ID=86240600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/135532 Ceased WO2023077597A1 (en) 2021-11-02 2021-12-03 Cell selection method and device

Country Status (2)

Country Link
CN (1) CN118044261A (en)
WO (1) WO2023077597A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119183135A (en) * 2024-11-26 2024-12-24 深圳市大数据研究院 Parameter optimization method and system for cell switching
WO2025222344A1 (en) * 2024-04-22 2025-10-30 北京小米移动软件有限公司 Communication method, terminal, communication device, and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102905307A (en) * 2012-09-12 2013-01-30 北京邮电大学 A system for joint optimization of neighbor list and load balancing
CN109451523A (en) * 2018-11-23 2019-03-08 南京邮电大学 The fast switch over method learnt based on flow identification technology and Q
CN110770761A (en) * 2017-07-06 2020-02-07 华为技术有限公司 Deep learning systems and methods and wireless network optimization using deep learning
US20210241090A1 (en) * 2020-01-31 2021-08-05 At&T Intellectual Property I, L.P. Radio access network control with deep reinforcement learning

Also Published As

Publication number Publication date
CN118044261A (en) 2024-05-14

Similar Documents

Publication Publication Date Title
US20240114408A1 (en) Cell handover method and apparatus, device, and storage medium
WO2020125069A1 (en) Handover control method and device
US11838815B2 (en) Cell handover method, terminal device, and network device
CN114727342A (en) Handover processing method and apparatus, terminal, network device and storage medium
US12219410B2 (en) Message sending method and apparatus, message receiving method and apparatus, and device and storage medium
US12028771B2 (en) Wireless communication method, terminal device and network device
WO2023077597A1 (en) Cell selection method and device
US11689995B2 (en) Information transmission method and apparatus
CN120075875A (en) Measurement reporting method and device
WO2025039199A1 (en) Frequency band switching method, terminal, network device, and storage medium
CN120512722B (en) Communication methods and communication devices
WO2025152037A1 (en) Wireless communication method, terminal device, and network device
WO2025232889A1 (en) Communication method and apparatus
WO2024243910A1 (en) Channel state adjustment method, and terminal device and network device
WO2025222525A1 (en) Communication methods, devices, communication system, and storage medium
WO2025222344A1 (en) Communication method, terminal, communication device, and storage medium
WO2025208452A1 (en) Communication method and apparatus, and storage medium
WO2025184878A1 (en) Measurement methods, devices, and storage medium
WO2025222524A1 (en) Communication methods and devices, and communication system and storage medium
WO2025118152A1 (en) Methods for cell handover, terminal devices, and network devices
WO2025054833A1 (en) Communication method, terminal, network device, communication system, and storage medium
WO2025081320A1 (en) Information transmission method, and device and storage medium
CN120935612A (en) Communication method and device
WO2025081490A1 (en) Communication method, device, and storage medium
CN120825749A (en) Switching method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21963104

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180100906.9

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21963104

Country of ref document: EP

Kind code of ref document: A1