CN111601490B

CN111601490B - Reinforced learning control method for data center active ventilation floor

Info

Publication number: CN111601490B
Application number: CN202010456237.6A
Authority: CN
Inventors: 万剑雄; 周杰; 熊伟
Original assignee: Inner Mongolia University of Technology
Current assignee: Inner Mongolia University of Technology
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2022-08-02
Anticipated expiration: 2040-05-26
Also published as: CN111601490A

Abstract

A reinforcement learning control method for an active ventilation floor in a data center. A Markov decision process model is established for the rack hotspot problem in raised-floor data centers, and a reinforcement learning model-solving algorithm is provided as the core of the control algorithm. Without increasing the power of the machine room air conditioner, the method intelligently controls the fan speed of the active ventilation floor (an ordinary ventilation floor with fans attached to its back) according to the current rack temperature distribution. By actively delivering sufficient cold air, it evens out the rack inlet temperature distribution, which addresses the rack hotspot problem common in raised-floor data centers, saves cooling energy, and helps ensure the safety and stability of the servers. Compared with existing rack-level airflow management methods for data centers, the method is easier to deploy, more cost-effective, and more broadly applicable.

Description

Reinforcement Learning Control Method for Active Ventilation Floors in Data Centers

Technical Field

The invention belongs to the technical field of automatic control, and in particular relates to a reinforcement learning control method for an active ventilation floor in a data center.

Background

A rack hotspot is a high-temperature point at one or several locations on a rack in a data center machine room where the temperature is significantly higher than at other locations. Excessive temperature reduces the working efficiency of some servers in the data center, which in turn lowers the overall power density and reduces reliability, clearly contrary to the needs of a data center.

Alleviating or eliminating rack hotspots by global regulation, for example by increasing the power of the machine room air conditioner to supply sufficient cold air, inevitably leaves most rack areas overcooled. This wastes cooling resources and further increases cooling energy consumption, which already accounts for nearly half of a data center's total energy consumption. Rack-level cooling solutions are therefore better suited to alleviating the rack hotspot problem.

Rack-level cooling solutions already exist, such as installing adaptive ventilation floors, installing baffles, or enclosing an individual rack and fitting it with a ventilation duct. However, these are all "passive" cooling solutions that cannot actively supply cold airflow to the rack; when the cold air supply is insufficient, they are ineffective.

The active ventilation floor is another rack-level cooling solution. It alleviates rack hotspots by actively delivering cold air, and it is easier to deploy and more cost-effective than the solutions above. The main difficulty in controlling it lies in the diversity and dynamics of the environment in which it is placed: the machine room air conditioners, the relative positions of the racks, and the distribution of servers inside the racks differ; the enclosure state of the cold and hot aisles differs, as do server rack standards and sealing conditions; and the power of the machine room air conditioner and the heat load of the servers in different racks differ. As a result, the thermal efficiency and airflow of a data center are generally difficult to describe with an analytical model.

Most existing research on active ventilation floors concerns performance modeling and evaluation based on measurement or simulation; there is as yet no research literature on the control problem of active ventilation floors.

Summary of the Invention

To overcome the above shortcomings of the prior art, the object of the present invention is to provide a reinforcement learning control method for an active ventilation floor in a data center that, without increasing the power of the machine room air conditioner, automatically learns the optimal operating strategy and plans the rack airflow so that the rack temperature distribution becomes uniform and rack hotspots are alleviated. No complex airflow and heat exchange model needs to be built or calibrated, which improves the general applicability of the active ventilation floor.

To achieve the above object, the technical solution adopted by the present invention is:

A reinforcement learning control method for an active ventilation floor in a data center establishes a Markov decision process model for the rack hotspot problem in a raised-floor data center and provides a reinforcement learning model-solving algorithm, the array algorithm, as the core of the reinforcement learning control algorithm. The model consists of four parts: system state, behavior, reward, and value function. The solution of the model is to keep selecting the optimal behavior over a series of system states so that the cumulative reward of the system is maximized. Using the uniformity of the rack inlet temperature distribution and the energy consumption of the active ventilation floor as evaluation criteria, the reinforcement learning control algorithm continually explores and learns the complex relationship between the PWM duty cycle value and whether that value should be raised, lowered, or held, and adjusts the fan speed of the active ventilation floor so that the rack inlet temperature distribution becomes uniform and rack hotspots are alleviated.

Compared with the prior art, the beneficial effects of the present invention are:

The present invention does not need to build or calibrate complex airflow and heat exchange models. Using the array control algorithm, it copes with the diversity and dynamics of the environment in which the active ventilation floor is placed and, based on the uniformity of the rack inlet temperature distribution and the energy consumption of the active ventilation floor, automatically matches the PWM duty cycle value to whether that value should be raised, lowered, or held. It is only necessary to replace the original ordinary ventilation floor with an active ventilation floor running the present invention; the invention then operates autonomously, finds the optimal PWM duty cycle value, adjusts the fan speed of the active ventilation floor, improves the rack inlet temperature distribution, and alleviates rack hotspots. Compared with other solutions, the present invention is more broadly applicable, easier to deploy, and more cost-effective.

Compared with the intelligent control method using three intelligent algorithms, the reinforcement learning control method using the array algorithm is simpler and requires fewer computing resources.

Compared with the reinforcement learning control method using the array algorithm, the intelligent control method using the three intelligent algorithms defines states and behaviors in a way that addresses the hotspot problem more directly and effectively, and its non-discretized state definition and approximation of the Q function strengthen the general applicability of the intelligent control method.

Description of Drawings

Figure 1 shows the design and deployment of the active ventilation floor. In the figure, reference numeral 1 is a temperature sensor, 2 is a rack, 3 is a microcontroller, 4 is a driver board, 5 is a switching power supply, 6 is a PC, and 7 is the active ventilation floor.

Detailed Description

Embodiments of the present invention are described in detail below with reference to the drawings and examples.

Figure 1 is a schematic diagram of a detailed deployment of the present invention. A number of first temperature sensors 1 are evenly distributed at the air inlet of rack 2 to monitor the temperature distribution at the air inlet of rack 2, and a second temperature sensor is placed under the active ventilation floor to monitor the supply air temperature beneath it.

In the art, rack 2 is a rectangular iron box holding a number of servers, and many racks are arranged row by row. Within a row, the left and right panels of a rack generally sit flush against the neighboring racks. The front panel of the rack is the air inlet, which draws in cold air to cool the servers, and the rear panel is the air outlet, which discharges the heated air. Monitoring the rack inlet temperature distribution means monitoring the temperature at certain positions on the front panel; the temperatures at these positions make up the rack inlet temperature distribution, so the number of first temperature sensors 1 depends on the number of these positions.

The active ventilation floor reinforcement learning control method of the present invention runs on the PC. PC 6 is connected to microcontroller 3, microcontroller 3 is connected to driver board 4, and driver board 4, after being connected to switching power supply 5 (12 V, 20 A), is connected to active ventilation floor fan 7. Based on the temperature distribution returned by the first temperature sensors 1, the method generates a PWM duty cycle value and passes it to microcontroller 3. Microcontroller 3 generates the corresponding PWM signal from this duty cycle value and sends it to driver board 4, which uses the PWM signal to control the voltage that switching power supply 5 supplies to active ventilation floor fan 7. Controlling the fan supply voltage in this way adjusts the fan speed.

The control method includes the following parts:

1. A Markov decision process model is established for the rack hotspot problem in a raised-floor data center. (A raised floor is the air supply structure of a data center: the machine room floor is elevated, leaving an under-floor space 60-100 cm high through which the machine room air conditioner delivers cold air. Most data centers in China currently use this construction.) The model consists of the following four parts, A to D:

A. The system state $s_t$ is defined as the discretized duty cycle of the PWM square wave.

In its defining formula, $s_t$ is the system state at time t and s is a system state in the state space; DC is the duty cycle value of the PWM square wave, max(DC) is the maximum value of DC, $D_{TQ}$ is the equal-division ratio used to discretize DC, and k is the number of $D_{TQ}$ units contained in a given state.

B. The system behavior space $\mathcal{A}$ is defined as the change in the fan speed of the active ventilation floor, that is, raising, lowering, or holding the PWM duty cycle value.
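
The state and behavior definitions above can be illustrated with a minimal Python sketch. The concrete values of max(DC) and $D_{TQ}$, the form `s = k * D_TQ * max(DC)`, the one-step size of each behavior, and the clamping at the ends of the range are all assumptions made for illustration; the names `STATES`, `BEHAVIORS`, and `next_state_index` are hypothetical.

```python
from fractions import Fraction

# Assumed values for illustration only; the text does not fix them.
MAX_DC = Fraction(1)       # max(DC): maximum duty cycle (100 %)
D_TQ = Fraction(1, 10)     # D_TQ: discretization ratio (steps of 10 %)

# Assumed form of the state space: s = k * D_TQ * max(DC), k = 0 .. 1/D_TQ.
NUM_STATES = int(1 / D_TQ) + 1
STATES = [k * D_TQ * MAX_DC for k in range(NUM_STATES)]

# Behaviors: lower, hold, or raise the duty cycle by one step (assumed step size).
BEHAVIORS = (-1, 0, +1)


def next_state_index(k: int, behavior: int) -> int:
    """Apply a behavior to state index k, clamping to the valid index range."""
    return min(max(k + behavior, 0), NUM_STATES - 1)
```

Exact fractions are used so that the discretized duty cycles compare equal without floating-point drift; a real controller would likely convert them to whatever integer range the microcontroller expects.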

C. The reward $R_{t+1}$ consists of two parts: a quantitative index of how uniform the rack inlet temperature distribution is, and the energy consumption of the active ventilation floor fan.

In the reward formula, $R_{t+1}$ is the reward obtained after the system takes a behavior at time t. The first term expresses the uniformity of the rack inlet temperature distribution; its value is always negative, and the closer it is to 0, the more uniform the distribution. Its symbols are: $T_{t,i}$, the temperature reading at time t of the first temperature sensor numbered i; the rack reference temperature at time t; $T_{t,under}$, the reading of the second temperature sensor at time t; $\Delta_T$, a fixed positive temperature difference set according to how much the cold and hot airflows above and below the active ventilation floor mix; the set of first temperature sensors; and the total number of first temperature sensors. The second term, $-(A_{ref} \times DC_t)^3$, expresses the energy consumption of the active ventilation floor fan; its value is always negative, and the closer it is to 0, the lower the fan energy consumption. Here $A_{ref}$ is a reference behavior value that keeps this term on the same order of magnitude as the uniformity term, and $DC_t$ is the duty cycle of the PWM square wave at time t.
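
As a rough illustration of how such a reward could be computed, the Python sketch below combines a uniformity term with the energy term $-(A_{ref} \times DC_t)^3$ stated in the text. Two things are assumptions and not taken from the source: the reference temperature is taken to be `t_under + delta_t`, and the uniformity term is taken to be the negative mean squared deviation of the inlet readings from that reference. The function name `reward` and its parameters are hypothetical.

```python
def reward(inlet_temps, t_under, delta_t, a_ref, duty_cycle):
    """Sketch of R_{t+1}: uniformity term plus fan-energy term (both <= 0)."""
    # Assumption: rack reference temperature = under-floor reading + fixed offset.
    t_ref = t_under + delta_t

    # Assumption: uniformity measured as negative mean squared deviation
    # of the first-sensor readings from the reference temperature.
    deviations = [(t - t_ref) ** 2 for t in inlet_temps]
    uniformity = -sum(deviations) / len(deviations)

    # Energy term as stated in the text: -(A_ref * DC_t)^3.
    energy = -(a_ref * duty_cycle) ** 3

    return uniformity + energy
```

With this shape, a rack whose inlet readings all sit at the reference temperature is penalized only for fan energy, so the controller is pushed toward the lowest duty cycle that still keeps the inlet uniform.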

D. The value function $Q(s_t, a_t)$ is a behavior-value function, given by:

$$Q(s_t, a_t) = \mathbb{E}\left[\sum_{y=0}^{\infty} \gamma^{y} R_{t+y+1} \,\middle|\, s_t, a_t\right]$$

where the value function $Q(s, a)$ is called the Q function, $a_t$ is the behavior taken by the system at time t, $\mathbb{E}$ is the expectation, y indexes future times relative to time t, $R_{t+y+1}$ is the reward obtained after the system takes a behavior at time t+y, and $\gamma$ is the decay factor, which expresses how much weight the model gives to future rewards (the influence of the environment), with $0 \le \gamma < 1$. $\gamma^{y}$, the y-th power of $\gamma$, is the decay factor applied to $R_{t+y+1}$ at time t+y.
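
To make the role of the decay factor concrete, the following Python sketch evaluates the discounted sum $\sum_y \gamma^y R_{t+y+1}$ for one finite, already observed reward sequence. It is only an illustration: the Q function is the expectation of this quantity over an infinite horizon, and the function name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum gamma**y * R_{t+y+1} over a finite list of observed rewards."""
    total = 0.0
    for y, r in enumerate(rewards):
        # The y-th future reward is weighted by gamma to the power y.
        total += (gamma ** y) * r
    return total


# Three identical rewards of -1 with gamma = 0.5: -1 - 0.5 - 0.25
example = discounted_return([-1.0, -1.0, -1.0], 0.5)
```

A smaller $\gamma$ makes the controller care mostly about the immediate reward, while a value close to 1 weights later temperature and energy outcomes almost as heavily as the next one.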

E. The Markov decision process model can be summarized as follows: in the system state at any time t, the optimal behavior is selected so that the cumulative reward is maximized. The model is:

$$\max \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]$$

subject to $s_t$ lying in the state space and $a_t$ lying in the behavior space $\mathcal{A}$,

where $\gamma^{t}$ is the decay factor applied to $R_{t+1}$ at time t.

2. Solution of the model and solving algorithm

a. Solution of the model. Once the optimal Q function has been computed, the optimal behavior can be selected from it in the system state at any time t so that the cumulative reward is maximized. The optimal Q function is computed as:

$$Q^{*}(s_t, a_t) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q^{*}(s_{t+1}, a) \,\middle|\, s_t, a_t\right]$$

At any time t, the optimal behavior is selected by:

$$a_t = \arg\max_{a \in \mathcal{A}} Q^{*}(s_t, a)$$

where $Q^{*}(s_t, a_t)$ is the optimal Q function, $s_{t+1}$ is the system state at time t+1, a is any one of the behaviors the system may take at time t+1, that is, a behavior in the behavior space $\mathcal{A}$, and $\max_{a \in \mathcal{A}} Q^{*}(s_{t+1}, a)$ is the largest optimal Q-function value obtainable in state $s_{t+1}$ over all behaviors in $\mathcal{A}$.
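
When the Q function is held as a table, selecting the optimal behavior reduces to finding the column with the largest value in the row for the current state. The Python sketch below shows that lookup; the function name `greedy_behavior` is hypothetical, and breaking ties in favor of the lowest column index is an assumption.

```python
def greedy_behavior(q_row):
    """Return the column index of the largest Q value in one state's row."""
    # max() over indices keeps the first index when several values tie.
    return max(range(len(q_row)), key=q_row.__getitem__)


# Columns might stand for (lower, hold, raise); here "hold" has the best value.
best = greedy_behavior([-3.0, -1.0, -2.0])
```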

b. The solving algorithm computes the optimal Q function and selects the optimal behavior at each decision so that the cumulative reward is maximized. The reinforcement learning model-solving algorithm is the array algorithm: it stores the Q function in a two-dimensional array (rows indexed by state, columns by behavior), iteratively updates the Q values in the array using the difference $\delta_{t+1}$ between the Q sample value $Q_{t+1,target}$ and the Q lookup value $Q_t(s_t, a_t)$, thereby computes the optimal Q function, and then selects the optimal behavior by looking it up in the array so that the cumulative reward of the model is maximized. The Q sample value is computed from the optimal Q function formula together with the $R_{t+1}$ and $s_{t+1}$ obtained from the live system; the Q lookup value is the value read from the array at the row and column given by the $s_t$ and $a_t$ obtained from the live system.

The Q sample value is computed as:

$$Q_{t+1,target} = R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q_t(s_{t+1}, a)$$

where $\max_{a \in \mathcal{A}} Q_t(s_{t+1}, a)$ is the largest Q lookup value in the row of the two-dimensional array corresponding to $s_{t+1}$ at time t. The array is updated by:

$$\delta_{t+1} = Q_{t+1,target} - Q_t(s_t, a_t)$$

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \beta(s_t, a_t)\,\delta_{t+1}$$

where $Q_t(s_t, a_t)$ is the Q lookup value for $s_t$ and $a_t$ in the two-dimensional array at time t, $Q_{t+1}(s_t, a_t)$ is the Q lookup value for $s_t$ and $a_t$ in the two-dimensional array at time t+1, and $\beta(s_t, a_t) \in [0,1]$ is the learning step size for each state-behavior pair in the array.
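
One array update can be sketched in Python as follows, with the array held as a list of rows. The sketch uses a single scalar step size `beta` where the text allows a separate $\beta(s_t, a_t)$ per state-behavior pair; that simplification, and the function name `q_update`, are assumptions.

```python
def q_update(q, s, a, r_next, s_next, gamma, beta):
    """Update q[s][a] in place and return the difference delta_{t+1}."""
    # Q sample value: reward plus discounted best value in the next state's row.
    target = r_next + gamma * max(q[s_next])
    # Difference between the Q sample value and the current Q lookup value.
    delta = target - q[s][a]
    # Move the stored value a fraction beta of the way toward the sample.
    q[s][a] += beta * delta
    return delta


# Two states, three behaviors; update the entry for state 0, behavior 1.
table = [[0.0, 0.0, 0.0], [1.0, 4.0, 0.0]]
delta = q_update(table, s=0, a=1, r_next=-1.0, s_next=1, gamma=0.5, beta=0.5)
```

Only the visited entry changes on each step, which is why the method needs no airflow model: the table is filled in from observed rewards alone.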

3. The model is solved with the reinforcement learning model-solving algorithm. Using the uniformity of the rack inlet temperature distribution and the energy consumption of the active ventilation floor as evaluation criteria, the algorithm continually explores and learns the complex relationship between the PWM duty cycle value and whether that value should be raised, lowered, or held, and adjusts the fan speed of the active ventilation floor so that the rack inlet temperature distribution becomes uniform and rack hotspots are alleviated. Its operating logic on the PC is as follows:

1: Set the reference temperature; initialize $\beta(s_t, a_t)$; initialize the array;

2: Set the initial time t = 0, the exploration-probability change interval random_slots, the initial behavior exploration probability $\varepsilon$, the amount $\Delta_{\varepsilon}$ by which the exploration rate decreases with t, and the minimum exploration probability $\varepsilon_{min}$;

3: Select the initial state $s_0$ = max(DC);

4: Start of loop body

5: If t is less than random_slots, select a behavior at random from the behavior space and go to 7; otherwise go to 6;

6: Set the exploration probability $\varepsilon$ to the minimum of $\varepsilon - \Delta_{\varepsilon}$ and $\varepsilon_{min}$, and select the behavior according to the following formula:

$$a_t = \begin{cases} \text{a random behavior from } \mathcal{A}, & \text{with probability } \varepsilon \\ \arg\max_{a \in \mathcal{A}} Q_t(s_t, a), & \text{with probability } 1 - \varepsilon \end{cases}$$

7: Execute $a_t$ (the PC sends a duty cycle command to the microcontroller), obtain the next system state $s_{t+1}$ (the PC sends a temperature request command to obtain the rack temperature distribution), and compute $R_{t+1}$ from the reward formula;

8: Update the corresponding value in the array according to the update formula;

9: Increase time t by 1;

10: End of loop body.
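
The ten steps above can be sketched end to end in Python against a stand-in environment. Several things here are assumptions and not part of the source: `toy_reward` is a made-up plant that replaces the real rack and sensors, all numeric constants are illustrative, a single scalar `beta` is used, and the exploration probability is floored at `eps_min` with `max()` so that it acts as a lower bound (the step list words this as a minimum). The name `run_controller` is hypothetical.

```python
import random


def run_controller(steps=2000, seed=0):
    """Tabular control loop following steps 1-10, on a toy environment."""
    rng = random.Random(seed)
    n_states, behaviors = 11, (-1, 0, 1)      # duty cycle 0..100 % in 10 % steps
    gamma, beta = 0.9, 0.1                    # illustrative constants
    random_slots, eps, d_eps, eps_min = 200, 1.0, 0.001, 0.05

    # Step 1: initialize the array (rows: states, columns: behaviors).
    q = [[0.0] * len(behaviors) for _ in range(n_states)]
    # Step 3: initial state s_0 = max(DC), i.e. the last state index.
    k = n_states - 1

    def toy_reward(k_idx):
        # Hypothetical plant: more airflow shrinks the hotspot but costs energy.
        duty = k_idx / (n_states - 1)
        hot_spot = 6.0 * (1.0 - duty)
        return -(hot_spot ** 2) - (2.0 * duty) ** 3

    for t in range(steps):                    # steps 4 and 10: loop body
        if t < random_slots:                  # step 5: pure exploration phase
            a = rng.randrange(len(behaviors))
        else:                                 # step 6: decaying epsilon-greedy
            eps = max(eps - d_eps, eps_min)   # assumption: eps_min is a floor
            if rng.random() < eps:
                a = rng.randrange(len(behaviors))
            else:
                a = max(range(len(behaviors)), key=lambda j: q[k][j])

        # Step 7: apply the behavior, observe next state and reward.
        k_next = min(max(k + behaviors[a], 0), n_states - 1)
        r = toy_reward(k_next)

        # Step 8: array update toward the Q sample value.
        target = r + gamma * max(q[k_next])
        q[k][a] += beta * (target - q[k][a])

        k = k_next                            # step 9 is the loop increment of t
    return q, k
```

In a real deployment the body of step 7 would be replaced by the duty cycle command and temperature request exchanged between the PC and the microcontroller, and `toy_reward` by the reward computed from the sensor readings.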

In summary, the present invention establishes a Markov decision process model for the rack hotspot problem in a raised-floor data center and provides a reinforcement learning model-solving algorithm as the core of the reinforcement learning control algorithm. Without increasing the power of the machine room air conditioner, it intelligently controls the fan speed of the active ventilation floor (an ordinary ventilation floor with fans attached to its back) according to the current rack temperature distribution. By actively delivering sufficient cold air in this way, it makes the rack inlet temperature distribution uniform and alleviates the rack hotspot problem common in raised-floor data centers, thereby saving cooling energy and helping to ensure the safety and stability of the servers. Compared with existing rack-level airflow management methods for data centers, the present invention is easier to deploy, more cost-effective, and more broadly applicable.

Claims

1. A reinforcement learning control method for an active ventilation floor in a data center, characterized by comprising the following steps:

step 1, arranging a number of first temperature sensors at the air inlet of a rack to monitor the temperature distribution at the air inlet of the rack, and arranging a second temperature sensor under the active ventilation floor to monitor the supply air temperature under the active ventilation floor;

step 2, establishing a Markov decision process model for the rack hotspot problem of a raised-floor data center, the model consisting of four parts: a system state $s_t$, a behavior space $\mathcal{A}$, a reward $R_{t+1}$, and a value function $Q(s_t, a_t)$;

wherein: the system state $s_t$ is the system state at time t; the state space is defined as the discretized duty cycle of the PWM square wave by a formula in which s is a system state in the state space, DC is the duty cycle value of the PWM square wave, max(DC) is the maximum value of DC, $D_{TQ}$ is the equal-division ratio of the DC discretization, and k denotes the number of $D_{TQ}$ units in a given state; the PWM square wave is generated as follows: a duty cycle value of the PWM signal is generated according to the temperature distribution returned by the first temperature sensors and transmitted to a microcontroller, and the microcontroller generates the corresponding PWM signal according to the duty cycle value;

the behavior space $\mathcal{A}$ is defined as the change in the rotational speed of the active ventilation floor fan;

the reward $R_{t+1}$ consists of a quantitative index of the uniformity of the rack inlet temperature distribution and the energy consumption of the active ventilation floor fan, and is given by a formula in which: $R_{t+1}$ is the reward obtained after the system takes a behavior at time t; the first term expresses the uniformity of the rack inlet temperature distribution, its value is always negative, and the closer it is to 0, the more uniform the rack inlet temperature distribution; $T_{t,i}$ is the temperature reading at time t of the first temperature sensor numbered i; the rack reference temperature is taken at time t; $T_{t,under}$ is the reading of the second temperature sensor at time t; $\Delta_T$ is a fixed, positive temperature difference set according to the degree of mixing of the cold and hot airflows above and below the active ventilation floor; the term is taken over the set of first temperature sensors and uses the total number of first temperature sensors; $-(A_{ref} \times DC_t)^3$ expresses the energy consumption of the active ventilation floor fan, its value is always negative, and the closer it is to 0, the lower the fan energy consumption, wherein $A_{ref}$ is a reference behavior value that keeps this term on the same order of magnitude as the uniformity of the rack inlet temperature distribution, and $DC_t$ is the duty cycle of the PWM square wave at time t;

the value function $Q(s_t, a_t)$ is a behavior-value function given by:

$$Q(s_t, a_t) = \mathbb{E}\left[\sum_{y=0}^{\infty} \gamma^{y} R_{t+y+1} \,\middle|\, s_t, a_t\right]$$

wherein the value function $Q(s, a)$ is referred to as the Q function, $a_t$ is the behavior taken by the system at time t, $\mathbb{E}$ is the expectation, y is a future time relative to time t, $R_{t+y+1}$ is the reward obtained after the system takes a behavior at time t+y, $\gamma$ is the decay factor with $0 \le \gamma < 1$, and $\gamma^{y}$, the y-th power of $\gamma$, is the decay factor of $R_{t+y+1}$ at time t+y;

the Markov decision process model is summarized as: in the system state at any time t, the cumulative reward is maximized by selecting the optimal behavior, and the model is:

$$\max \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right]$$

subject to $s_t$ lying in the state space and $a_t$ lying in the behavior space $\mathcal{A}$,

wherein $\gamma^{t}$ is the decay factor of $R_{t+1}$ at time t;

step 3, solving the model with a reinforcement learning model-solving algorithm, using the uniformity of the rack inlet temperature distribution and the energy consumption of the active ventilation floor as evaluation criteria, and adjusting the fan speed of the active ventilation floor by continually exploring and learning the complex relationship between the PWM duty cycle value and whether that value is raised, lowered, or held, so that the rack inlet temperature distribution becomes uniform and the rack hotspot problem is alleviated.

2. The reinforcement learning control method for an active ventilation floor in a data center according to claim 1, wherein in step 2 an optimal Q function is obtained by calculation, so that the optimal behavior can be selected according to the optimal Q function in the system state at any time t and the cumulative reward is maximized, the optimal Q function being computed as:

$$Q^{*}(s_t, a_t) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q^{*}(s_{t+1}, a) \,\middle|\, s_t, a_t\right]$$

at any time t, the optimal behavior is selected by:

$$a_t = \arg\max_{a \in \mathcal{A}} Q^{*}(s_t, a)$$

wherein $Q^{*}(s_t, a_t)$ is the optimal Q function, $s_{t+1}$ is the system state at time t+1, a is any one of the behaviors the system may take at time t+1, namely a behavior in the behavior space $\mathcal{A}$, and $\max_{a \in \mathcal{A}} Q^{*}(s_{t+1}, a)$ is the largest optimal Q-function value obtainable in state $s_{t+1}$ when the system takes any behavior in $\mathcal{A}$.

3. The reinforcement learning control method for an active ventilation floor in a data center according to claim 1, wherein in step 3 the reinforcement learning model-solving algorithm is an array algorithm in which the Q function is stored in a two-dimensional array whose row index is the state and whose column index is the behavior; the Q values in the array are iteratively updated by computing the difference $\delta_{t+1}$ between the Q sample value $Q_{t+1,target}$ and the Q lookup value $Q_t(s_t, a_t)$, the optimal Q function is thereby computed, and the optimal behavior is then selected by looking it up in the array so that the cumulative reward of the model is maximized; wherein the Q sample value is computed from the optimal Q function formula and the $R_{t+1}$ and $s_{t+1}$ obtained from the live system, and the Q lookup value is the value read from the corresponding row and column of the two-dimensional array according to the $s_t$ and $a_t$ obtained from the live system;

the Q sample value is computed as:

$$Q_{t+1,target} = R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q_t(s_{t+1}, a)$$

wherein $\max_{a \in \mathcal{A}} Q_t(s_{t+1}, a)$ is the largest Q lookup value in the row of the two-dimensional array corresponding to $s_{t+1}$ at time t, and the array is updated by:

$$\delta_{t+1} = Q_{t+1,target} - Q_t(s_t, a_t)$$

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \beta(s_t, a_t)\,\delta_{t+1}$$

wherein $Q_t(s_t, a_t)$ is the Q lookup value for $s_t$ and $a_t$ in the two-dimensional array at time t, $Q_{t+1}(s_t, a_t)$ is the Q lookup value for $s_t$ and $a_t$ in the two-dimensional array at time t+1, and $\beta(s_t, a_t) \in [0,1]$ is the learning step size for each state-behavior pair in the array.