CN117034738A - Event-driven-based deep reinforcement learning building control method - Google Patents
Event-driven-based deep reinforcement learning building control method
- Publication number
- CN117034738A (application CN202310796694.3A)
- Authority
- CN
- China
- Prior art keywords
- event
- reinforcement learning
- driven
- deep reinforcement
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/10—Geometric CAD
- G06F30/13—Architectural design, e.g. computer-aided architectural design [CAAD] related to design of buildings, bridges, landscapes, production plants or roads
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/08—Construction
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/08—Thermal analysis or thermal optimisation
Abstract
The invention discloses an event-driven deep reinforcement learning building control method, which includes: step S1, obtaining weather data, generating a dataset, and dividing the dataset into a training set and a test set; step S2, selecting a residential building with multiple zones as the building model and creating a Python-based experimental platform according to thermodynamic formulas; step S3, designing an event-driven MDP model and, based on expert knowledge and prior knowledge, designing two types of events; and step S4, designing an event-driven DQN algorithm to optimize the control policy when an event is triggered. The invention provides an event-driven deep reinforcement learning building control method that achieves model-free optimal control for multi-zone residential buildings, accelerates learning and reduces the frequency of HVAC switching actions while maintaining better control performance, minimizes building energy consumption, and satisfies occupants' demand for indoor thermal comfort.
Description
Technical field
The invention relates to an event-driven deep reinforcement learning building control method and belongs to the field of building energy conservation.
Background
With global climate change intensifying, reducing building energy consumption and improving thermal comfort have become particularly urgent. According to the International Energy Agency, residential buildings account for a considerable share of global energy consumption, using 35% of building energy in 2020 alone. Within building systems, HVAC systems account for more than 50% of energy consumption, so reducing HVAC energy use has become an important research direction in building control optimization. However, the pursuit of building energy savings must not come at the expense of thermal comfort. Especially during the pandemic, people spent more time indoors, so minimizing energy consumption while maintaining thermal comfort in residential buildings has become a focus for researchers and practitioners.
In the prior art, most HVAC systems are controlled with rule-based control (RBC), proportional-integral-derivative (PID) control, Lagrangian relaxation, or model predictive control (MPC). However, RBC is limited by its rule settings and offers limited control accuracy, making it difficult to adapt to complex real environments; PID controllers rely on fixed parameters and may not provide the best performance when the environment changes; and although MPC performs better, it is not easy in practice to create a simplified yet sufficiently accurate building model, because the indoor environment is affected by many factors such as the building structure, layout, internal heat gains, and outdoor environment. When the model cannot accurately describe the building's thermal dynamics and deviates substantially, the control performance may fall short of expectations.
Reinforcement learning is an effective data-driven control method. Compared with traditional control methods, reinforcement learning does not require a precise thermodynamic model and adapts better to changes and uncertainties in the environment. Although reinforcement learning has shown great potential for HVAC systems, it still faces several challenges. First, traditional reinforcement learning methods learn at fixed time steps, whereas consecutive time steps in HVAC control are highly similar, which can lead to data redundancy and inefficient use of data. Second, the choice of time interval also affects control performance: a long interval degrades control accuracy, while a short interval causes excessive action switching. In addition, the slow dynamics of building thermal processes slow down reinforcement learning, and HVAC control problems often involve high-dimensional state spaces, which further increases the complexity of reinforcement learning methods. Deep reinforcement learning has the potential to solve more complex HVAC control problems by combining the advantages of deep learning and reinforcement learning, yet it still faces the challenges above. It is therefore necessary to explore new methods to improve the efficiency and performance of HVAC control, which will help drive technological innovation in building energy conservation and contribute to sustainable building development and global energy savings.
Summary of the invention
The technical problem to be solved by the invention is to overcome the shortcomings of the prior art and to provide an event-driven deep reinforcement learning building control method that achieves model-free optimal control for multi-zone residential buildings, accelerates learning, and reduces the frequency of HVAC switching actions while maintaining better control performance. By optimizing the control of the HVAC system, building energy consumption is minimized while occupants' demand for indoor thermal comfort is satisfied.
To solve the above technical problem, the technical solution of the invention is as follows:
An event-driven deep reinforcement learning building control method, comprising:
Step S1: obtain weather data, generate a dataset, and divide the dataset into a training set and a test set;
Step S2: select a residential building with multiple zones as the building model and, according to thermodynamic formulas, create a Python-based experimental platform;
Step S3: design an event-driven MDP model and, based on expert knowledge and prior knowledge, design two types of events;
Step S4: design an event-driven DQN algorithm to optimize the control policy when an event is triggered;
Step S5: verify the effectiveness of the event settings through ablation experiments and adjust the event parameters according to the results;
Step S6: demonstrate the control effect and superiority of the event-driven DQN algorithm through comparative experiments.
Further, in step S3, designing the event-driven MDP model specifically includes the following steps:
constructing an event-driven six-tuple MDP model consisting of states, actions, transitions, a reward function, a discount factor, and an event trigger function, expressed as:
(S, A, P, R, γ, e)
where S is the state set; A is the set of actions the agent can take; P is the system transition; R is the reward function; γ is the discount factor, which measures how much the agent values future rewards; and e is the event trigger function;
when the value of the event trigger function exceeds a preset threshold, the agent is triggered and updates its policy, and a state transition occurs at the same time; the transition function is:
P{s(t+1) | s(t), a, e};
with the multi-zone residential building as the environment and event-driven deep reinforcement learning as the agent, the states, actions, and rewards of the HVAC system are designed.
Further, the state is expressed as:
[T_in,z(t), K_z(t), T_out(t), λ_electricity(t)]
where T_in,z(t) is the indoor temperature of room z, K_z(t) is the occupancy, T_out(t) is the outdoor temperature, λ_electricity(t) is the electricity price, z is the room index, and t is the current time step.
Further, the action is the temperature setpoint of the HVAC system, expressed as:
A(t) = [SP_z(t)]
where SP_z(t) is the temperature setpoint of room z; the setpoint of each zone is a discrete variable and follows the following logic:
if the indoor temperature is above the setpoint temperature, the HVAC system is turned on; if the indoor temperature is below the setpoint temperature, the HVAC system is turned off.
Further, the reward function combines the retail price λ_retail(t'), the energy consumption E_HVAC(t'), the reward R_comfort(t') for keeping the temperature within the comfort range, and the penalty term SW_penalty(t') for switching the HVAC on and off;
given the comfort range, when the indoor temperature T_in(t) is above the minimum comfort temperature th_min, the agent adjusts the settings of the HVAC system according to the reward; when T_in(t) is below the minimum comfort temperature th_min, the HVAC is turned off;
when the executed action deviates from the threshold, a negative reward is added;
when the indoor temperature is within the thermal comfort threshold, a positive reward is given; an optimal comfort temperature th_best is defined, and the closer the indoor temperature is to th_best, the higher the positive reward.
Further, the two types of events in step S3 include state transition events and combined events.
Further, the state transition events include a first event and a second event; the first event is triggered when the current retail price λ_retail(t) differs from the retail price λ_retail(t') at the previous moment,
where λ_retail(t) and λ_retail(t') both lie within the price range [λ_low, λ_high], λ_low being the lowest retail price and λ_high the highest retail price.
The second event is triggered when the occupancy K_z(t) differs from the occupancy K_z(t') at the previous moment,
where K_z(t') and K_z(t) lie within [-1, 1], -1 indicating that the room is unoccupied at that moment and 1 indicating that it is occupied.
Further, the combined event is a third event, which is composed of TH(t'), K_z(t'), and λ_retail(t');
when the room occupancy or the price changes, the third event is triggered if TH(t') is greater than the thermal comfort value χ,
where TH(t') denotes the thermal comfort term.
Further, in step S4, designing the event-driven DQN algorithm specifically includes the following steps:
Step S41: initialize the Q-network parameters ω, assign the target Q-network ω* = ω, and initialize all states, actions, and corresponding q values based on the randomly initialized Q-network parameters ω;
Step S42: initialize the minibatch size B and the experience buffer M;
Step S43: iterate.
Further, the iteration process of step S43 is as follows:
Step S431: obtain the current state quantities from the initialized environment and perform preliminary processing to obtain the feature state parameters s = (T_out(0), T_in,z(0), λ_electricity(0), K_z(0));
Step S432: take s as the input to the Q network; if no event has been triggered, execute the previous action; if one of the events has been triggered, obtain the action outputs and corresponding q values of the Q network and select the corresponding action a from the current q values with the ε-greedy strategy, the greedy action being expressed as:
a = argmax_{a∈A} Q(s, a; ω)
where s denotes a state and A is the action set;
Step S433: execute action a in the current state s(t) to obtain the processed feature state vector s(t+1) of the new environment state and the reward R of the action;
Step S434: put the tuple [s(t), Setpt_z(t), r(t), s(t+1)] into the experience buffer M;
Step S435: randomly draw B samples [s^(i)(t), Setpt_z^(i)(t), r^(i)(t), s^(i)(t+1)] from the experience buffer M, where i is the sample index;
Step S436: when a sufficient number of samples has been collected, randomly select a batch of samples and compute their target q values for updating the Q network, calculated as:
y^(i) = r^(i)(t) + γ·max_{a'} Q(s^(i)(t+1), a'; ω*)
Step S437: correct the weights ω using the mean squared error as the loss function, calculated as:
L(ω) = (1/B)·Σ_i [y^(i) − Q(s^(i)(t), Setpt_z^(i)(t); ω)]²
Step S438: according to the delayed-update strategy, update the target Q network after every U steps by copying the network weights ω* = ω;
Step S439: when the environment simulation stops, the algorithm stops.
With the above technical solution, the invention has the following beneficial effects:
1. The invention first introduces the event-driven approach and designs a new type of event-driven MDP model that enables the agent to learn more efficiently. When designing the trigger rules, important state changes are selected as events according to prior knowledge, and reasonable trigger conditions are designed so that the agent performs optimization control only when necessary. Finally, an event-driven deep reinforcement learning algorithm improved from the DQN algorithm is proposed. Compared with traditional reinforcement learning algorithms, the event-driven deep reinforcement learning algorithm makes better use of events, accelerates the learning process, and improves the learning effect.
2. Traditional reinforcement learning methods perform control at a fixed time step, which may waste resources and lower learning efficiency. In contrast, the event-driven deep reinforcement learning method of the invention is based on the concept of "intermittency" and updates its decisions only after important events occur, improving data utilization.
3. The event-driven deep reinforcement learning method of the invention learns dynamic nonlinear characteristics, such as the indoor temperatures of different zones, and, through predefined events, can capture and exploit states that occur infrequently.
4. The event-driven deep reinforcement learning method of the invention can incorporate prior knowledge and assign variable weights in the event definition stage, adapting flexibly to unknown environments and improving learning speed.
5. The event-driven deep reinforcement learning method of the invention requires only historical data of the building system as input and does not require a complete model of the building system.
Description of the drawings
Figure 1 is a flow chart of the event-driven deep reinforcement learning building control method of the invention;
Figure 2 is a framework diagram of the event-driven deep reinforcement learning method of the invention;
Figure 3 is the event-trigger flow chart of the invention;
Figure 4 is the backup diagram of traditional reinforcement learning;
Figure 5 is the backup diagram of the event-driven deep reinforcement learning method of the invention.
Detailed description of the embodiments
To make the content of the invention easier to understand clearly, the invention is described in further detail below on the basis of specific embodiments and with reference to the accompanying drawings.
As shown in Figure 1, this embodiment provides an event-driven deep reinforcement learning building control method, which includes:
Step S1: obtain weather data, generate a dataset, and divide the dataset into a training set and a test set. In the invention, the weather data come from the meteorological bureau; part of the weather data is used for training and part for testing. Since the research focuses on cooling, the hotter season was chosen for the experiments. In addition, a simulated electricity price sequence was created in which the price alternates between a high and a low value every four hours. The purpose of this simulated price sequence is to test whether the agent can detect the influence of the price signal on the reward function and adjust the control policy accordingly. Furthermore, in line with people's work schedules, the occupancy of the residence varies with the time of the week, with higher occupancy on weekends.
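As an illustration of how such inputs can be generated, the Python sketch below builds a price signal that alternates between a low and a high value every four hours and a simple weekly occupancy schedule; the concrete price levels, hourly resolution, and schedule details are assumptions for illustration, not values given in the patent.

```python
import numpy as np

def simulated_price_sequence(n_hours, low=0.05, high=0.15, block_hours=4):
    """Retail price alternates between a low and a high value every `block_hours`."""
    blocks = np.arange(n_hours) // block_hours
    return np.where(blocks % 2 == 0, low, high)

def weekly_occupancy_schedule(n_hours):
    """Occupancy per hour: 1 = someone home, -1 = nobody home.

    Weekdays assume absence during working hours; weekends assume the home
    is occupied for most of the day (an illustrative schedule only).
    """
    occupancy = np.full(n_hours, -1, dtype=int)
    for h in range(n_hours):
        day, hour = (h // 24) % 7, h % 24
        if day >= 5:                                   # weekend
            occupancy[h] = -1 if 10 <= hour < 14 else 1
        else:                                          # weekday
            occupancy[h] = 1 if (hour < 8 or hour >= 18) else -1
    return occupancy

prices = simulated_price_sequence(24 * 7)      # one week of hourly prices
occupancy = weekly_occupancy_schedule(24 * 7)  # matching occupancy signal
```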
Step S2: select a residential building with multiple zones as the building model and, according to thermodynamic formulas, create a Python-based experimental platform. In the invention, a multi-zone residential building is selected as the building model and implemented on a Python-based experimental platform; the building model is constructed from thermodynamic formulas to simulate the influence of the HVAC system on the indoor temperature. Through this experimental platform, the building temperature and other relevant parameters can be adjusted, providing a realistic environment for the event-driven deep reinforcement learning method.
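The patent builds the platform from thermodynamic formulas that are not reproduced here; as a minimal sketch of such a multi-zone simulator, the following Python code uses a simple first-order (RC-style) temperature update, and the time constant, cooling rate, COP, time step, and initial temperature are illustrative assumptions.

```python
import numpy as np

class MultiZoneBuilding:
    """Toy multi-zone thermal model: each zone drifts toward the outdoor
    temperature and is cooled while its HVAC unit is switched on."""

    def __init__(self, n_zones=4, tau=4.0, cool_rate=2.0, dt=0.25, cop=3.0):
        self.n_zones = n_zones
        self.tau = tau              # envelope time constant in hours (assumed)
        self.cool_rate = cool_rate  # cooling effect in degrees C per hour (assumed)
        self.dt = dt                # simulation time step in hours
        self.cop = cop              # coefficient of performance (assumed)
        self.T_in = np.full(n_zones, 26.0)

    def step(self, setpoints, T_out):
        """Advance one time step; return indoor temperatures and energy use (kWh)."""
        hvac_on = self.T_in > np.asarray(setpoints)              # on/off setpoint logic
        drift = (T_out - self.T_in) / self.tau * self.dt          # exchange with outdoors
        cooling = np.where(hvac_on, self.cool_rate * self.dt, 0.0)
        self.T_in = self.T_in + drift - cooling
        energy = hvac_on.sum() * self.cool_rate * self.dt / self.cop
        return self.T_in.copy(), energy
```

Stepping such a model with the outdoor temperature series from step S1 would yield the indoor temperatures and energy consumption that feed the state and reward defined below.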
Step S3: design the event-driven MDP model and, based on expert knowledge and prior knowledge, design two types of events.
Step S4: design the event-driven DQN algorithm to optimize the control policy when an event is triggered.
Step S5: verify the effectiveness of the event settings through ablation experiments and adjust the event parameters appropriately according to the results. The invention verifies the effectiveness of the event settings through ablation experiments, observes which factors have the greatest influence on the results, and finally adjusts the corresponding parameters to balance energy consumption and thermal comfort.
Step S6: demonstrate the control effect and superiority of the event-driven DQN algorithm through comparative experiments, followed by result analysis and optimization. Compared with traditional methods and general reinforcement learning methods, the event-driven DQN learns faster, switches actions less frequently, and better balances thermal comfort and energy consumption. Testing verifies that the event-driven DQN has good adaptability and generalization to different building thermal environments.
In step S3 of this embodiment, designing the event-driven MDP model specifically includes the following steps:
constructing an event-driven six-tuple MDP model consisting of states, actions, transitions, a reward function, a discount factor, and an event trigger function, expressed as:
(S, A, P, R, γ, e)
where S is the state set; A is the set of actions the agent can take; P is the system transition; R is the reward function; γ is the discount factor, which measures how much the agent values future rewards; and e is the event trigger function;
when the value of the event trigger function exceeds a preset threshold, the agent is triggered and updates its policy, and a state transition occurs at the same time; the transition function is:
P{s(t+1) | s(t), a, e};
Under this setting, as shown in Figure 2, with the multi-zone residential building as the environment and event-driven deep reinforcement learning as the agent, the states, actions, and rewards of the HVAC system are designed.
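One way to hold this six-tuple in code is a small container that couples the usual MDP ingredients with the trigger function e; the structure below is purely illustrative and not an interface defined by the invention.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class EventDrivenMDP:
    """Container for the event-driven six-tuple (S, A, P, R, gamma, e)."""
    states: Any                               # S: description of the state space
    actions: Sequence[Any]                    # A: actions the agent can take
    transition: Callable[[Any, Any], Any]     # P: maps (s(t), a) to s(t+1)
    reward: Callable[[Any, Any, Any], float]  # R: reward for (s, a, s')
    gamma: float                              # discount factor
    trigger: Callable[[Any, Any], bool]       # e: event trigger function

    def should_update(self, prev_obs, obs):
        """The agent searches for a new action only when the trigger fires;
        otherwise the previously selected action is carried over."""
        return self.trigger(prev_obs, obs)
```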
Specifically, the state of this embodiment is determined by the environment. The invention considers the indoor environment state (the indoor temperature and occupancy of each room), the outdoor environment state (the outdoor temperature), and the state affecting energy consumption (the electricity price). The state is expressed as:
[T_in,z(t), K_z(t), T_out(t), λ_electricity(t)]
where T_in,z(t) is the indoor temperature of room z, K_z(t) is the occupancy, T_out(t) is the outdoor temperature, λ_electricity(t) is the electricity price, z is the room index, and t is the current time step.
Specifically, the action of this embodiment can be defined as a control variable of the HVAC system. In the invention, the action is defined as the temperature setpoint of the HVAC system and is expressed as:
A(t) = [SP_z(t)]
where SP_z(t) is the temperature setpoint of room z; the setpoint of each zone is a discrete variable and follows the following logic:
as can be seen, if the indoor temperature is above the setpoint temperature, the HVAC system is turned on; if the indoor temperature is below the setpoint temperature, the HVAC system is turned off; in all other cases the state remains unchanged.
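A minimal Python sketch of this setpoint logic is given below; the optional deadband argument is an assumption added only to make the "otherwise unchanged" branch explicit.

```python
def hvac_switch(T_in, setpoint, currently_on, deadband=0.0):
    """On/off decision for one zone from its indoor temperature and setpoint.

    With deadband=0 this reduces to the rule in the text: above the setpoint
    the HVAC turns on, below it turns off, otherwise the state is unchanged.
    """
    if T_in > setpoint + deadband:
        return True            # HVAC on
    if T_in < setpoint - deadband:
        return False           # HVAC off
    return currently_on        # otherwise keep the previous state
```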
Specifically, for the reward of this embodiment, the primary objective of optimizing the HVAC system in the invention is to maintain occupant comfort at the lowest possible energy consumption. For this multi-objective problem of balancing energy consumption and thermal comfort, two weight factors α and β are introduced into the reward function, which combines the retail price λ_retail(t'), the energy consumption E_HVAC(t'), the reward R_comfort(t') for keeping the temperature within the comfort range, and the penalty term SW_penalty(t') for switching the HVAC on and off.
Given the comfort range, when the indoor temperature T_in(t) is above the minimum comfort temperature th_min, the agent adjusts the settings of the HVAC system according to the reward; when T_in(t) is below the minimum comfort temperature th_min, the HVAC can be switched off to avoid wasting energy.
When the executed action deviates from the threshold, a negative reward is added.
When the indoor temperature is within the thermal comfort threshold, a positive reward is given; an optimal comfort temperature th_best is defined, and the closer the indoor temperature is to th_best, the higher the positive reward.
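The patent gives the reward as a formula combining these terms with the weights α and β, and that exact formula is not reproduced here; the Python sketch below is therefore only one plausible arrangement of the named components (electricity cost, a comfort reward peaking at th_best, and a switching penalty), with all constants and functional forms assumed.

```python
def reward(retail_price, energy, T_in, switched, alpha=1.0, beta=1.0,
           th_min=22.0, th_max=26.0, th_best=24.0, switch_penalty=0.1):
    """Illustrative reward: pay for energy, earn a comfort bonus near th_best,
    and lose a penalty when the HVAC is switched. All constants are assumed."""
    cost = retail_price * energy
    if th_min <= T_in <= th_max:
        # positive reward, largest at th_best and shrinking toward the edges
        comfort = 1.0 - abs(T_in - th_best) / (th_max - th_min)
    else:
        # negative reward that grows with the deviation from the comfort range
        comfort = -min(abs(T_in - th_min), abs(T_in - th_max))
    penalty = switch_penalty if switched else 0.0
    return -alpha * cost + beta * comfort - penalty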
As shown in Figure 3, the trigger flow of this embodiment presets a series of events. If the current HVAC system environment is relatively stable, the trigger conditions are not met, and the agent can continue executing the current action without searching for a new policy. If, however, the environment changes and a trigger condition is satisfied, the agent needs to update its policy. In the event-driven MDP model, the key lies in designing the trigger rules: after the agent completes an observation, it can decide whether an event needs to be triggered from the rate of change between the previous observation and the current one.
For example, when the indoor temperature exceeds a certain threshold, an event is triggered and the system automatically adjusts the temperature to maintain comfort. By defining these events in advance, the system can more easily capture the prior factors that affect the environment's response, improving its efficiency and adaptability.
As shown in Figure 1, the two types of events in step S3 of this embodiment are the two event classes designed in the invention: state transition events and combined events. Other types of events can easily be added to the event-driven deep reinforcement learning framework if needed.
Specifically, state transition events: certain state changes have a large influence on the operation of the system. Since the retail price λ_retail(t) and the room occupancy K_z(t) strongly affect energy consumption and thermal comfort, changes in these two states are designated as the first event and the second event. The state transition events therefore include a first event and a second event. Specifically, the first event is triggered when the current retail price λ_retail(t) differs from the retail price λ_retail(t') at the previous moment,
where λ_retail(t) and λ_retail(t') both lie within the price range [λ_low, λ_high], λ_low being the lowest retail price and λ_high the highest retail price.
Similarly, the second event is triggered when the occupancy K_z(t) differs from the occupancy K_z(t') at the previous moment,
where K_z(t') and K_z(t) lie within [-1, 1], -1 indicating that the room is unoccupied at that moment and 1 indicating that it is occupied.
Specifically, combined events: when different states change at the same time, the change can be treated as a combined event. Since thermal comfort and energy consumption are the optimization objectives and vary with the room occupancy and the price, the combined event is the third event, composed of TH(t'), K_z(t'), and λ_retail(t').
When the room occupancy or the price changes, the third event is triggered if TH(t') is greater than the previously defined thermal comfort value χ,
where TH(t') denotes the thermal comfort term and its magnitude equals r_comfort(t'), which allows the event to respond better to changes in the reward and keeps the indoor temperature within a comfortable range.
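The three triggers described above can be sketched as simple predicates over consecutive observations. The function and argument names below, and the way the comfort term is passed in, are assumptions for illustration; only the trigger conditions themselves follow the text.

```python
def event_price_change(price_now, price_prev):
    """First event: the retail price differs from the previous observation."""
    return price_now != price_prev

def event_occupancy_change(occ_now, occ_prev):
    """Second event: occupancy (+1 occupied, -1 unoccupied) has changed."""
    return occ_now != occ_prev

def event_combined(comfort_term, price_now, price_prev, occ_now, occ_prev, chi):
    """Third event: occupancy or price changed and the comfort term exceeds chi."""
    changed = (event_price_change(price_now, price_prev)
               or event_occupancy_change(occ_now, occ_prev))
    return changed and comfort_term > chi

def any_event_triggered(obs, prev_obs, chi):
    """obs/prev_obs are dicts with keys 'price', 'occ', 'comfort' (assumed layout)."""
    return (event_price_change(obs["price"], prev_obs["price"])
            or event_occupancy_change(obs["occ"], prev_obs["occ"])
            or event_combined(obs["comfort"], obs["price"], prev_obs["price"],
                              obs["occ"], prev_obs["occ"], chi))
```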
Figures 4 and 5 show the backup diagrams of conventional reinforcement learning and of event-driven deep reinforcement learning. Each solid dot represents a state-action pair and each hollow dot represents a state. In the traditional reinforcement learning process, the agent completes every learning step from one state-action pair to the next in sequence. In event-driven deep reinforcement learning, states and rewards remain periodic, but actions become aperiodic: if an event is triggered, the optimal policy pair (s', a') is obtained at s'; if no event is triggered next, the optimal policy is continued directly as (s'', a'). It is worth noting that aperiodic actions do not mean that actions are sometimes not executed; rather, no policy search is performed and the previous action is simply continued.
As shown in Figure 1, in step S4 of this embodiment, designing the event-driven DQN algorithm specifically includes the following steps:
Step S41: initialize the Q-network parameters ω, assign the target Q-network ω* = ω, and initialize all states, actions, and corresponding q values based on the randomly initialized Q-network parameters ω;
Step S42: initialize the minibatch size B and the experience buffer M;
Step S43: iterate (the number of iterations T is determined by the environment, until the simulated environment ends).
Specifically, the iteration process of step S43 is as follows:
Step S431: obtain the current state quantities from the initialized environment and perform preliminary processing to obtain the feature state parameters s = (T_out(0), T_in,z(0), λ_electricity(0), K_z(0));
Step S432: take s as the input to the Q network. If no event has been triggered, execute the previous action; if one of the events has been triggered, obtain the action outputs and corresponding q values of the Q network and select the corresponding action a from the current q values with the ε-greedy strategy: a random action is chosen with probability ε, and the action with the largest value function is chosen with probability 1-ε. Note that ε decreases gradually during training until it reaches its minimum value. The greedy action a is expressed as:
a = argmax_{a∈A} Q(s, a; ω)
where s and A are elements of the ED-MDP model: s denotes a state and A is the action set;
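As a small illustration of this ε-greedy rule with a decaying ε, the Python sketch below assumes a callable q_values(state) that returns one value per action; the decay schedule and bounds are illustrative assumptions.

```python
import random
import numpy as np

def select_action(q_values, state, actions, eps):
    """epsilon-greedy: a random action with probability eps, otherwise the action
    with the largest q value; q_values(state) returns one value per action."""
    if random.random() < eps:
        return random.choice(actions)
    return actions[int(np.argmax(q_values(state)))]

def decay_eps(eps, eps_min=0.05, step=1e-4):
    """Linear decay of epsilon toward eps_min during training (assumed schedule)."""
    return max(eps_min, eps - step)
```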
Step S433: execute action a in the current state s(t) to obtain the processed feature state vector s(t+1) of the new environment state and the reward R of the action;
Step S434: put the tuple [s(t), Setpt_z(t), r(t), s(t+1)] into the experience buffer M;
Step S435: randomly draw B samples [s^(i)(t), Setpt_z^(i)(t), r^(i)(t), s^(i)(t+1)] from the experience buffer M, where i is the sample index;
Step S436: when a sufficient number of samples has been collected, randomly select a small batch of samples and compute their target q values for updating the Q network, calculated as:
y^(i) = r^(i)(t) + γ·max_{a'} Q(s^(i)(t+1), a'; ω*)
Step S437: correct the weights ω using the mean squared error as the loss function, calculated as:
L(ω) = (1/B)·Σ_i [y^(i) − Q(s^(i)(t), Setpt_z^(i)(t); ω)]²
Step S438: according to the delayed-update strategy, update the target Q network after every U steps by copying the network weights ω* = ω;
Step S439: when the environment simulation stops, the algorithm stops.
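To make the flow of steps S431 to S439 concrete, the following Python sketch assembles the pieces into an event-driven DQN training loop. It assumes PyTorch, an env object exposing reset() and step(setpoint) with the return values described in the docstring, the any_event_triggered helper from the event sketch above, and illustrative hyperparameters (network size, ε schedule, buffer size, and the target-update period U); none of these concrete choices are prescribed by the patent.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

def train_event_driven_dqn(env, setpoints, episodes=50, steps_per_episode=24 * 7,
                           gamma=0.95, eps=1.0, eps_min=0.05, eps_decay=1e-3,
                           batch_size=32, buffer_size=10_000, target_update=200,
                           chi=0.5, lr=1e-3):
    """Event-driven DQN loop (steps S431-S439). `env` is assumed to expose
    reset() -> (state_vector, obs_dict) and step(setpoint) -> (next_state,
    reward, next_obs); all hyperparameters are illustrative."""
    q_net = QNet(env.state_dim, len(setpoints))
    target_net = QNet(env.state_dim, len(setpoints))
    target_net.load_state_dict(q_net.state_dict())           # omega* = omega
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    buffer = deque(maxlen=buffer_size)                        # experience buffer M
    grad_steps = 0

    for _ in range(episodes):
        state, obs = env.reset()
        prev_obs, action = obs, random.randrange(len(setpoints))
        for _ in range(steps_per_episode):
            # S432: re-optimize only when an event fires, else keep the old action.
            if any_event_triggered(obs, prev_obs, chi):
                if random.random() < eps:
                    action = random.randrange(len(setpoints))
                else:
                    with torch.no_grad():
                        qs = q_net(torch.as_tensor(state, dtype=torch.float32))
                    action = int(torch.argmax(qs).item())
                eps = max(eps_min, eps - eps_decay)

            # S433-S434: apply the setpoint and store the transition in M.
            next_state, reward, next_obs = env.step(setpoints[action])
            buffer.append((state, action, reward, next_state))
            prev_obs, obs, state = obs, next_obs, next_state

            # S435-S437: sample a minibatch and take one MSE gradient step on omega.
            if len(buffer) >= batch_size:
                batch = random.sample(list(buffer), batch_size)
                s, a, r, s2 = map(np.array, zip(*batch))
                s = torch.as_tensor(s, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
                r = torch.as_tensor(r, dtype=torch.float32)
                with torch.no_grad():
                    y = r + gamma * target_net(s2).max(dim=1).values
                loss = nn.functional.mse_loss(q_net(s).gather(1, a).squeeze(1), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                grad_steps += 1

                # S438: delayed target-network update every U = target_update steps.
                if grad_steps % target_update == 0:
                    target_net.load_state_dict(q_net.state_dict())
    return q_net
```

The loop mirrors the gating described above: the Q network is queried only when one of the predefined events fires, while learning updates from the replay buffer continue at every step.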
The specific embodiments described above further describe in detail the technical problem solved by the invention, its technical solution, and its beneficial effects. It should be understood that the above are only specific embodiments of the invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the invention shall fall within the protection scope of the invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310796694.3A CN117034738A (en) | 2023-07-03 | 2023-07-03 | Event-driven-based deep reinforcement learning building control method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117034738A true CN117034738A (en) | 2023-11-10 |
Family
ID=88625213
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310796694.3A Pending CN117034738A (en) | 2023-07-03 | 2023-07-03 | Event-driven-based deep reinforcement learning building control method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117034738A (en) |
- 2023-07-03: CN CN202310796694.3A patent/CN117034738A/en active Pending
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120180575A (en) * | 2025-05-22 | 2025-06-20 | 中建西南咨询顾问有限公司 | A model-data dual-driven method for creating a training environment for intelligent building energy management agents |
| CN120180575B (en) * | 2025-05-22 | 2025-08-26 | 中建西南咨询顾问有限公司 | Model-data dual-driven building energy management intelligent agent training environment creation method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||