
WO2022083029A1 - Decision-making method based on deep reinforcement learning - Google Patents

Decision-making method based on deep reinforcement learning

Info

Publication number
WO2022083029A1
WO2022083029A1 PCT/CN2021/074974 CN2021074974W
Authority
WO
WIPO (PCT)
Prior art keywords
action
decision
experience
information
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/074974
Other languages
French (fr)
Chinese (zh)
Inventor
张昊迪
伍楷舜
陈振浩
高子航
李启凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Publication of WO2022083029A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/822 Strategy games; Role-playing games
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/837 Shooting of targets
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present invention relates to the field of artificial intelligence, and more particularly, to a decision-making method based on deep reinforcement learning.
  • Reinforcement learning is a field in machine learning used to describe and solve problems in which agents learn strategies to maximize rewards or achieve specific goals in the process of interacting with the environment.
  • the purpose of the present invention is to overcome the above-mentioned defects of the prior art, and to provide a decision-making method based on deep reinforcement learning, which is a new technical solution for dynamic decision-making by combining high abstraction level rules and deep reinforcement learning.
  • the present invention provides a decision-making method based on deep reinforcement learning.
  • the method includes the following steps:
  • the agent makes a decision according to the environmental information and selects a post-decision action;
  • the agent compares the post-decision action with the knowledge base and, based on the rule set configured in the knowledge base, decides whether to replace the post-decision action with a random action from the rule set;
  • a set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model to guide the next iteration.
  • determining whether to replace the post-decision action with a random action in the rule set according to the set rule set in the knowledge base includes:
  • the rule set is set according to a decision application scenario to avoid catastrophic decision-making or to improve learning efficiency, and is used to guide the actions of the agent in the application scenario.
  • combining old environment information, actions, rewards and new environment information into one experience information, and storing it into the experience playback pool includes:
  • randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model includes:
  • a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated:
  • a' represents an optional action at time j+1
  • a_j represents the action at time j
  • s_j and s_{j+1} represent the environmental information at time j and time j+1 respectively
  • φ represents the preprocessing process.
  • Compared with the prior art, the present invention has the advantage that, in deep reinforcement learning, applicable rules are considered in addition to the Q values of all possible actions; by combining high-abstraction-level rules with deep reinforcement learning, training performance is improved and catastrophic decisions are avoided.
  • FIG. 1 is a flowchart of a decision-making method based on deep reinforcement learning according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a rule intervention learning framework according to an embodiment of the present invention.
  • FIG. 3 is a screen diagram in the Flappybird game according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the effective range of rules in the Flappybird game according to an embodiment of the present invention.
  • FIG. 6 is a screen diagram in the Spacewar game according to an embodiment of the present invention.
  • FIG. 8 is a screen diagram in a Breakout game according to an embodiment of the present invention.
  • FIG. 10 is a screen diagram in the GridWorld game according to an embodiment of the present invention.
  • FIG. 11 is a graph of average-reward experimental results in the GridWorld game according to an embodiment of the present invention.
  • the decision-making method based on deep reinforcement learning includes the following steps:
  • Step S110 preset the rule set and related hyperparameters in the knowledge base, and initialize the deep reinforcement learning model.
  • the proposed decision-making method combining high-abstraction-level rules with deep reinforcement learning includes a rule intervention learning (RIL) framework, specifically comprising a deep Q network (DQN, a deep reinforcement learning algorithm that combines Q-learning and deep learning), a knowledge base and the environment, where the DQN and the knowledge base interact so that the rules in the knowledge base are combined with deep Q-learning.
  • a deep reinforcement learning model utilizes a neural network consisting of 3 convolutional layers, 1 hidden layer and 1 output layer.
  • the first convolutional layer has 32 convolution kernels of size 8*8*4 with a stride of 4, and its output passes through a 2*2 max-pooling layer;
  • the second convolutional layer has 64 convolution kernels of size 4*4*32 with a stride of 2;
  • the third convolutional layer has 64 convolution kernels of size 3*3*64 with a stride of 1;
  • the hidden layer is a fully connected layer of 256 ReLU nodes; since the number of actions may differ between embodiments, the output layer may differ.
  • the ReLU function is f(x) = max(0, x).
  • the specific structure, number of layers, convolution kernel size, activation function, etc. of the neural network can be set according to the requirements for training accuracy and training time, which are not limited in the present invention.
  • the initialization process of step S110 includes setting the rule set in the knowledge base, the rule intervention probability, the preprocessing process, the loss function, the experience replay pool, the number of training rounds, and so on.
  • the rule set can contain several different types of rules and can be configured according to different application scenarios. It should be noted that each type of rule can act alone or in combination with other types of rules. Each rule in the rule set specifies the scope in which the rule takes effect and the actions corresponding to that scope.
  • a preprocessing process φ is set to process the raw input information; for example, a color game screen becomes a gray-scale screen after preprocessing. Specifically, the agent interacts with the environment for M rounds, but before each decision step the environmental information is preprocessed: if the initial environmental information of a round is s_1, the preprocessed initial information is φ(s_1).
  • an exploration probability ε is set: in addition to selecting the action with the largest Q value, the agent selects a random action with probability ε. Specifically, in each step t of each round of interaction with the environment, the agent selects a random action to execute in the environment with probability ε.
  • an experience replay pool D with maximum capacity N is set, which is used to store experience information for deep Q-learning training.
  • the action-value function Q is initialized with its parameters randomly set to θ, and the target action-value function Q* is initialized with its parameters θ* set to the value of θ; the target action-value function Q* is used to select the action with the largest Q* value, expressed as a_t = argmax_a Q*(φ(s_t), a; θ).
  • the loss function is calculated based on the action-value function Q and the value at the current time j to update the network parameters θ; for example, the loss function is expressed as (y_j − Q(φ(s_j), a_j; θ))².
  • a' represents an optional action at time j+1.
  • an upper limit M on the number of training rounds needs to be set; when the number of training rounds reaches M, the training ends.
  • step S120 the agent observes the environment and obtains information obtained from the environment.
  • the agent observes the environment and obtains the environmental information s_t, which it preprocesses into φ(s_t); for example, a color game screen s_t becomes a gray-scale screen φ(s_t) after preprocessing. It should be noted that each time the agent receives environmental information from the environment, this preprocessing operation is applied.
  • Step S130 the agent makes a decision according to the environmental information, and selects an action after the decision.
  • step S140: the agent compares the post-decision action with the knowledge base and, according to the rules in the knowledge base, judges whether a more suitable action exists; if a more suitable action exists, the post-decision action is replaced under certain conditions, and otherwise the post-decision action is executed.
  • step S150: the agent executes the post-decision action or the replacement action in the environment, obtains a reward and other new information from the environment, combines the old environmental information, the action, the reward and the new environmental information into one piece of experience information, and stores it in the experience replay pool.
  • in each step t of each round of interaction between the agent and the environment, the finally selected action a_t is executed in the environment, the reward value r_t and the new environmental observation s_{t+1} are obtained, the observation is preprocessed to obtain φ(s_{t+1}), and φ(s_t), a_t, r_t and φ(s_{t+1}) are combined into one unit of experience information, denoted (φ(s_t), a_t, r_t, φ(s_{t+1})), which is stored in the experience replay pool D.
  • the maximum capacity of the experience replay pool D is N; if the pool has already reached its maximum capacity N when new experience information is stored, earlier experience information needs to be deleted to make room.
  • step S160: a set amount of experience information is randomly selected from the experience replay pool and used to update the model to guide the next iteration.
  • a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated
  • the agent automatically enters the next round of interaction, and steps S120 to S160 are repeated until the preset upper limit M on the number of training rounds, the preset number of iterations, or the preset convergence condition of the loss function is reached.
  • a bird manipulated by the agent tries to fly through pairs of pipes while avoiding hitting any pipes.
  • the agent has two actions available, that is, control the bird to flap its wings or do nothing.
  • the bird gets a temporary upward acceleration, so the bird can ascend a certain distance. If nothing is done, the bird will descend a certain distance due to gravity.
  • Birds flying through pairs of pipes will be rewarded, but if the bird hits the pipes or falls to the ground, the round will end and a certain amount of the reward will be lost.
  • for this game, the present invention uses a rule set to tell the bird not to fly too high or too low when flying through a pair of pipes; the rules affect the bird's training only when it is flying inside the box of FIG. 4 (i.e., the region between an opposing upper and lower pipe).
  • the acceleration rule is used, and the effective probability is set as P_t = p_0 · γ^t;
  • δ(r_1) represents the recommended action of rule r_1.
  • the above rule means that when the bird is between the upper and lower pipes and the distance between the bird and the upper pipe is greater than the vertical height of one bird, the bird flies up.
  • η(r_2) is crossing(p_u, p_l) ∧ less(distance(bird, p_l), size(bird)), and δ(r_2) is {null}.
  • the above rule means that when the bird is between the upper and lower pipes and the distance between the bird and the lower pipe is greater than the vertical height of one bird, no action is taken.
  • the average reward and average Q value of Flappybird with the acceleration rule are shown in FIG. 5, where RIL corresponds to the present invention (marked by curve S11) and DQN corresponds to traditional reinforcement learning (marked by curve S12).
  • a time limit was set for the training phase, and the reward per game increased as training time increased. The average reward shown in FIG. 5 indicates that, within the same training time, the RIL of the present invention achieves better performance with less training data than the traditional DQN.
  • the average Q value also shows that the rules introduced by the present invention can speed up the learning progress.
  • the rule set used in this game is a greedy strategy: always move toward the horizontally closest enemy plane.
  • the acceleration rule is used, and the effective probability is set as P_t = p_0 · γ^t;
  • η(r_4) is on_right(nearest_jet, agent), and δ(r_4) is {move_right};
  • move_left means moving left
  • move_right means moving right
  • the acceleration rule is used in this embodiment, and the effective probability is P_t = p_0 · γ^t;
  • the above rule means that when the ball is to the left of the paddle, the paddle moves left.
  • the GridWorld game contains destination areas, impassable walls (shown as black grids), traps, and the like;
  • once the agent falls into a trap, the game ends and the agent is heavily penalized.
  • the ultimate goal of the game is to find the shortest path to the destination without falling into a trap. For example, the agent receives a negative reward of -1 for every move; falling into a trap gives a negative reward of -600, and reaching the destination gives a reward of 100.
  • safety rules are used to ensure the safety of the agent during training.
  • Safety rules are used to prevent the agent from making catastrophic decisions with irreversible consequences.
  • the set safety rules are always in effect during the training process.
  • for example, the optional actions are "up", "down", "left" and "right"; if the trap is to the "right" of the agent, one of the other 3 actions is chosen.
  • this rule simply prevents the agent from entering the trap, and it is mandatory.
  • the GridWorld experimental results, shown in FIG. 11, show that even in the initial stage of training, the performance of the rule intervention learning (RIL) of the present invention is much better than that of the traditional DQN. Because of the safety rules, the agent never takes a catastrophic action, which guarantees the agent's safety. Using safety rules to avoid catastrophic decisions is particularly effective for cold-start problems.
  • by combining high-abstraction-level rules with deep reinforcement learning, the present invention can guide the agent to move in a more "correct" direction in specific scenarios.
  • the configured rule set can be individually designed for specific application scenarios, so that, while improving the generality of the model, it can better meet individual needs in terms of safety, learning time and learning efficiency. Therefore, by using the present invention, training time can be shortened and catastrophic decisions can be avoided, and the method can be widely applied in the field of dynamic decision-making.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • the computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), may execute the computer-readable program instructions in order to implement various aspects of the present invention.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A decision-making method based on deep reinforcement learning, the method comprising: an intelligent agent makes a decision according to environment information and selects an action after the decision; the intelligent agent compares said action with a knowledge base, and decides, on the basis of a set ruleset in the knowledge base, to execute said action or to replace said action; the intelligent agent executes said action or the replaced action in the environment, obtains a reward and new environment information from the environment, combines the old environment information, the action, the reward and the new environment information into experience information, and stores the experience information into an experience playback pool; and a set amount of experience information is randomly selected from the experience playback pool, so as to update a deep reinforcement learning model to thereby guide the next iteration. By utilizing the present application, training time may be shortened, catastrophic decision-making may be avoided, and the method may be widely applied to the field of dynamic decision-making.

Description

A decision-making method based on deep reinforcement learning

Technical Field

The present invention relates to the field of artificial intelligence, and more particularly, to a decision-making method based on deep reinforcement learning.

Background Art

Reinforcement learning is a field of machine learning used to describe and solve problems in which an agent learns a strategy through interaction with its environment in order to maximize reward or achieve a specific goal.

Currently, deep reinforcement learning has been successfully applied in a variety of dynamic decision-making domains, especially those with very large state spaces. However, deep reinforcement learning also faces several problems. First, its training process can be very slow and resource-intensive; the final system is often fragile, its results are difficult to interpret, and it performs poorly for a long period at the beginning of training. Furthermore, for applications in robotics and critical decision-support systems, deep reinforcement learning may even make catastrophic decisions with very costly consequences.

Therefore, the existing technology needs to be improved in order to obtain more efficient and safer decision-making methods.

Summary of the Invention

The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a decision-making method based on deep reinforcement learning, a new technical solution for dynamic decision-making that combines high-abstraction-level rules with deep reinforcement learning.

The present invention provides a decision-making method based on deep reinforcement learning. The method includes the following steps:

The agent makes a decision according to the environmental information and selects a post-decision action;

The agent compares the post-decision action with the knowledge base and, based on the rule set configured in the knowledge base, decides whether to replace the post-decision action with a random action from the rule set;

If it is decided to replace the post-decision action, the replacement action is executed in the environment; a reward and new environmental information are obtained from the environment, and the old environmental information, the action, the reward and the new environmental information are combined into a piece of experience information, which is stored in the experience replay pool;

A set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model and thereby guide the next iteration.

In one embodiment, deciding, according to the rule set configured in the knowledge base, whether to replace the post-decision action with a random action from the rule set includes:

determining whether the rule set in the knowledge base satisfies a predetermined condition;

if the set condition is satisfied, replacing the post-decision action with a random action from the rule set with a set probability.

In one embodiment, if the set condition is satisfied, the post-decision action is replaced with a random action from the compliant action set α(R, t) with probability P_t = p_0 · γ^t, where p_0 is the initial rule intervention probability, t is the running time, γ is the decay rate, R denotes the rule set, and α(R, t) denotes all actions that comply with the rule set R at time t.

In one embodiment, the rule set is configured according to the decision application scenario, with the goal of avoiding catastrophic decisions or improving learning efficiency, and is used to guide the agent's actions in that scenario.

In one embodiment, combining the old environmental information, the action, the reward and the new environmental information into one piece of experience information and storing it in the experience replay pool includes:

after the new environmental information is obtained, storing one unit of experience information (φ(s_t), a_t, r_t, φ(s_{t+1})) in the experience replay pool D;

if the capacity of the experience pool exceeds the set threshold N when new experience information is stored, deleting earlier experience information based on the storage time.

In one embodiment, randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model includes:

In each step t of each round of interaction between the agent and the environment, a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

Gradient descent is performed with (y_j − Q(φ(s_j), a_j; θ))² as the objective function to optimize the neural network parameters θ;

Finally, every fixed number of steps C, the target action-value function Q* is synchronized with the action-value function Q;

where a' denotes an optional action at time j+1, a_j denotes the action at time j, s_j and s_{j+1} denote the environmental information at times j and j+1 respectively, and φ denotes the preprocessing process.

Compared with the prior art, the present invention has the advantage that, in deep reinforcement learning, the applicable rules are considered in addition to the Q values of all possible actions; by combining high-abstraction-level rules with deep reinforcement learning, the training effect is improved and catastrophic decisions can be avoided.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.

Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a flowchart of a decision-making method based on deep reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the rule intervention learning framework according to an embodiment of the present invention;

FIG. 3 is a screen diagram in the Flappybird game according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the effective range of the rules in the Flappybird game according to an embodiment of the present invention;

FIG. 5 is a graph of experimental results for average reward and average Q value in the Flappybird game according to an embodiment of the present invention;

FIG. 6 is a screen diagram in the Spacewar game according to an embodiment of the present invention;

FIG. 7 is a graph of experimental results for average reward and average Q value in the Spacewar game according to an embodiment of the present invention;

FIG. 8 is a screen diagram in the Breakout game according to an embodiment of the present invention;

FIG. 9 is a graph of experimental results for average reward and average Q value in the Breakout game according to an embodiment of the present invention;

FIG. 10 is a screen diagram in the GridWorld game according to an embodiment of the present invention;

FIG. 11 is a graph of average-reward experimental results in the GridWorld game according to an embodiment of the present invention.

In the drawings, the axis labels are Average Reward, Average Q value, and Training Epochs.

Detailed Description of Embodiments

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the invention.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or its uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.

In all examples shown and discussed herein, any specific value should be construed as merely illustrative and not limiting. Accordingly, other instances of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters refer to similar items in the following figures; therefore, once an item is defined in one figure, it does not require further discussion in subsequent figures.

Referring to FIG. 1, the decision-making method based on deep reinforcement learning provided by this embodiment includes the following steps:

Step S110: preset the rule set and related hyperparameters in the knowledge base, and initialize the deep reinforcement learning model.

To facilitate understanding of the present invention, refer first to FIG. 2. The proposed decision-making method combining high-abstraction-level rules with deep reinforcement learning adopts a rule intervention learning (RIL) framework, which specifically includes a deep Q network (DQN, a deep reinforcement learning algorithm that combines Q-learning and deep learning), a knowledge base and the environment. The DQN and the knowledge base interact so that the rules in the knowledge base are combined with deep Q-learning: in deep Q-learning, in addition to considering the Q values of all possible actions, the agent also considers whether any rule applies.

For example, the neural network used by the deep reinforcement learning model includes 3 convolutional layers, 1 hidden layer and 1 output layer. For example, the first convolutional layer has 32 convolution kernels of size 8*8*4 with a stride of 4, and its output passes through a 2*2 max-pooling layer; the second convolutional layer has 64 convolution kernels of size 4*4*32 with a stride of 2; the third convolutional layer has 64 convolution kernels of size 3*3*64 with a stride of 1; the hidden layer is a fully connected layer of 256 ReLU nodes. Since the number of actions may differ between embodiments, the output layer may differ accordingly.

The ReLU function is:

f(x) = max(0, x)   (1)

It should be noted that the specific structure, number of layers, convolution kernel sizes, activation functions, etc. of the neural network can be set according to the requirements for training accuracy and training time, which are not limited by the present invention.
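
By way of illustration only, the following is a minimal PyTorch sketch of the network described above. The input resolution, the number of stacked input channels, the use of ReLU activations after the convolutional layers, and the number of actions are assumptions rather than values fixed by the text, so they are left as parameters; nn.LazyLinear is used so that the 256-node hidden layer does not depend on an assumed input size.

```python
# Illustrative sketch only; layer sizes follow the description above, while the
# input channels/resolution and the number of actions are assumed parameters.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # 32 kernels of 8*8*4, stride 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                          # 2*2 max pooling
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # 64 kernels of 4*4*32, stride 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # 64 kernels of 3*3*64, stride 1
            nn.ReLU(),
            nn.Flatten(),
        )
        self.hidden = nn.Sequential(nn.LazyLinear(256), nn.ReLU())  # hidden layer of 256 ReLU nodes
        self.output = nn.Linear(256, num_actions)                   # one Q value per available action

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(self.features(x)))
```

The output layer simply emits one Q value per action, which is why its size varies between embodiments.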

The initialization process of step S110 includes setting the rule set in the knowledge base, the rule intervention probability, the preprocessing process, the loss function, the experience replay pool, the number of training rounds, and so on.

Set the rule set R in the knowledge base. The rule set may contain several different types of rules and can be configured for different application scenarios. It should be noted that each type of rule can act alone or in combination with other types of rules. Each rule in the rule set specifies the scope in which the rule takes effect and the actions corresponding to that scope.

Set the rule intervention condition or intervention probability. For example, the rule intervention probability at the current moment is set to P_t = p_0 · γ^t, where p_0 is the initial rule intervention probability, γ is the decay rate, and t is the running time. It should be understood that this intervention probability is only exemplary; other forms of rule intervention probability, such as P_t = γ^t, may also be used.

Set the preprocessing process φ, which is used to process the raw input information. For example, a color game screen becomes a gray-scale screen after preprocessing. Specifically, the agent interacts with the environment for M rounds, but before each decision step the environmental information is preprocessed: if the initial environmental information of a round is s_1, the preprocessed initial information is φ(s_1).
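
As a concrete illustration of the preprocessing process φ, the sketch below converts a color frame to a gray-scale frame with NumPy; the luminance weights and the scaling to [0, 1] are conventional choices assumed here, not values specified by the text.

```python
# Minimal sketch of a preprocessing function phi: color frame -> gray-scale frame.
# The luminance weights and the scaling to [0, 1] are assumptions.
import numpy as np

def phi(frame_rgb: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 color frame to an (H, W) float32 gray-scale frame."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame_rgb.astype(np.float32) @ weights
    return gray / 255.0
```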

Set the exploration probability ε: in addition to selecting the action with the largest Q value, the agent selects a random action with probability ε. Specifically, in each step t of each round of interaction with the environment, the agent selects a random action to execute in the environment with probability ε.

Set the experience replay pool D with a maximum capacity of N, which is used to store experience information for deep Q-learning training.

Initialize the action-value function Q with its parameters randomly set to θ, and initialize the target action-value function Q* with its parameters θ* set to the value of θ. The target action-value function Q* is used to select the action with the largest Q* value, expressed as:

a_t = argmax_a Q*(φ(s_t), a; θ)   (2)

Then, the loss function is calculated based on the action-value function Q and the value at the current time j in order to update the network parameters θ. For example, the loss function is expressed as:

(y_j − Q(φ(s_j), a_j; θ))²   (3)

where the value at the current time j is expressed as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

where a' denotes an optional action at time j+1.

In addition, an upper limit M on the number of training rounds needs to be set; when the number of training rounds reaches M, training ends.

Step S120: the agent observes the environment and obtains information from the environment.

In each step t of each round of interaction between the agent and the environment, the agent observes the environment and obtains the environmental information s_t, which it preprocesses into φ(s_t); for example, a color game screen s_t becomes a gray-scale screen φ(s_t) after preprocessing. It should be noted that each time the agent receives environmental information from the environment, this preprocessing operation is applied.

Step S130: the agent makes a decision according to the environmental information and selects a post-decision action.

In each step t of each round of interaction between the agent and the environment, after observing the preprocessed environmental information φ(s_t), the agent selects a random action with probability ε; otherwise, it selects the post-decision action according to the formula a_t = argmax_a Q*(φ(s_t), a; θ).

Step S140: the agent compares the post-decision action with the knowledge base and, according to the rules in the knowledge base, judges whether a more suitable action exists. If a more suitable action exists, the post-decision action is replaced under certain conditions; otherwise, the post-decision action is executed.

Specifically, in each step t of each round of interaction between the agent and the environment, the agent compares the post-decision action with the knowledge base and, according to the rule set configured in the knowledge base, judges whether a more suitable action exists. If a more suitable action exists, the post-decision action is replaced under certain conditions; otherwise, the post-decision action is executed. For example, with the rule set R configured, if the compliant action set α(R, t) is continuously non-empty and the set intervention condition holds, then the post-decision action a_t is replaced with a random action from the rule set with probability P_t = p_0 · γ^t, where R denotes the rule set, t denotes the running time, and α(R, t) denotes all actions that comply with the rule set R at time t.
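
The following sketch illustrates one way steps S130 and S140 could be combined: an ε-greedy choice over Q* followed by probabilistic rule intervention with P_t = p_0 · γ^t. The Rule interface (an applies predicate for η(r) and a set of recommended actions for δ(r)), the default hyperparameter values, and the intervention condition (the chosen action lying outside the compliant set α(R, t)) are assumptions made for illustration.

```python
# Illustrative sketch of steps S130/S140; the Rule interface and the exact
# intervention condition are assumptions, not the patent's literal definition.
import random
import torch

def compliant_actions(rules, state, actions):
    """alpha(R, t): actions recommended by every rule whose condition holds."""
    allowed = set(actions)
    for rule in rules:
        if rule.applies(state):            # eta(r): rule condition (assumed interface)
            allowed &= set(rule.actions)   # delta(r): recommended actions (assumed interface)
    return allowed

def select_action(q_star, phi_s, actions, rules, t,
                  epsilon=0.1, p0=1.0, decay=0.8):
    # Step S130: epsilon-greedy selection over Q* values.
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        with torch.no_grad():
            q_values = q_star(phi_s.unsqueeze(0)).squeeze(0)
        a = actions[int(torch.argmax(q_values))]

    # Step S140: with probability P_t = p0 * decay**t, replace a non-compliant
    # action with a random action drawn from the compliant set alpha(R, t).
    allowed = compliant_actions(rules, phi_s, actions)
    p_t = p0 * (decay ** t)
    if allowed and a not in allowed and random.random() < p_t:
        a = random.choice(list(allowed))
    return a
```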

Step S150: the agent executes the post-decision action or the replacement action in the environment, obtains a reward and other new information from the environment, combines the old environmental information, the action, the reward and the new environmental information into one piece of experience information, and stores it in the experience replay pool.

In each step t of each round of interaction between the agent and the environment, the finally selected action a_t is executed in the environment and the reward value r_t and the new environmental observation s_{t+1} are obtained. The environmental observation is then preprocessed to obtain φ(s_{t+1}), and φ(s_t), a_t, r_t and φ(s_{t+1}) are combined into one unit of experience information, denoted (φ(s_t), a_t, r_t, φ(s_{t+1})), which is stored in the experience replay pool D.

It should be noted that the maximum capacity of the experience replay pool D is N. If the pool has already reached its maximum capacity N when new experience information is stored, earlier experience information needs to be deleted to make room.
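
A minimal sketch of the experience replay pool D with maximum capacity N is given below; collections.deque already discards the oldest entry when the pool is full, which matches the deletion of earlier experience information described above.

```python
# Minimal sketch of the experience replay pool D with maximum capacity N.
from collections import deque
import random

class ReplayPool:
    def __init__(self, capacity_n: int):
        # deque(maxlen=N) drops the oldest experience automatically when full.
        self.pool = deque(maxlen=capacity_n)

    def __len__(self):
        return len(self.pool)

    def store(self, phi_s, a, r, phi_s_next):
        # One unit of experience information (phi(s_t), a_t, r_t, phi(s_{t+1})).
        self.pool.append((phi_s, a, r, phi_s_next))

    def sample(self, batch_size: int):
        # Randomly select a set amount of experience information for training.
        return random.sample(list(self.pool), batch_size)
```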

Step S160: a set amount of experience information is randomly selected from the experience replay pool and used to update the model, which guides the next iteration.

In each step t of each round of interaction between the agent and the environment, a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

Then, gradient descent is performed with (y_j − Q(φ(s_j), a_j; θ))² as the objective function to optimize the neural network parameters θ. Finally, every fixed number of steps C, the target action-value function Q* is synchronized with the action-value function Q, and the updated model guides the next iteration.
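 
One possible implementation of the update in step S160, reusing the ReplayPool and QNetwork sketches above, is outlined below; y_j is computed with the target network, the squared error (y_j − Q(φ(s_j), a_j; θ))² is minimized by gradient descent, and Q* is synchronized with Q every C steps. The batch layout and the choice of optimizer are assumptions.

```python
# Illustrative sketch of step S160; batch layout and optimizer are assumptions.
import torch
import torch.nn.functional as F

def dqn_update(q, q_target, optimizer, pool, batch_size, gamma, step, sync_every_c):
    if len(pool) < batch_size:
        return None  # wait until enough experience has been collected

    batch = pool.sample(batch_size)
    phi_s  = torch.stack([b[0] for b in batch])
    a      = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    phi_s1 = torch.stack([b[3] for b in batch])

    with torch.no_grad():
        # y_j = r_j + gamma * max_a' Q*(phi(s_{j+1}), a')
        y = r + gamma * q_target(phi_s1).max(dim=1).values

    q_sa = q(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(phi(s_j), a_j; theta)
    loss = F.mse_loss(q_sa, y)                            # (y_j - Q(...))^2 averaged over the batch

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C steps, synchronize the target action-value function Q* with Q.
    if step % sync_every_c == 0:
        q_target.load_state_dict(q.state_dict())
    return float(loss)
```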

At the end of each round of interaction with the environment, the agent automatically enters the next round, and steps S120 to S160 are repeated until the preset upper limit M on the number of training rounds, the preset number of iterations, or the preset convergence condition of the loss function is reached.
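
To show how steps S120 to S160 fit together, a compact training-loop skeleton follows; env.reset and env.step, the preprocess wrapper (φ plus any frame stacking and tensor conversion), and the helper functions from the sketches above are assumed interfaces rather than a fixed API.

```python
# Compact skeleton of the overall interaction loop (steps S120-S160); the
# environment interface and the helpers from the earlier sketches are assumptions.
def train(env, q, q_target, optimizer, rules, actions, pool,
          max_rounds_m, batch_size, gamma, epsilon, sync_every_c):
    step = 0
    for episode in range(max_rounds_m):                  # at most M training rounds
        s = env.reset()                                  # S120: observe the environment
        done = False
        while not done:
            phi_s = preprocess(s)                        # phi plus stacking/conversion (assumed)
            a = select_action(q_target, phi_s, actions, rules, step, epsilon)  # S130/S140
            s_next, r, done = env.step(a)                # S150: execute the selected action
            pool.store(phi_s, a, r, preprocess(s_next))
            dqn_update(q, q_target, optimizer, pool,     # S160: update the model
                       batch_size, gamma, step, sync_every_c)
            s = s_next
            step += 1
```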

To further illustrate the effect of the present invention, specific application scenarios are described below as examples.

Referring to FIG. 3, in the Flappybird game, a bird controlled by the agent tries to fly through pairs of pipes while avoiding hitting any pipe. In this scenario, the agent has two available actions: controlling the bird to flap its wings, or doing nothing. By flapping its wings, the bird obtains a temporary upward acceleration and can therefore rise a certain distance. If nothing is done, the bird descends a certain distance due to gravity. The bird is rewarded for flying through a pair of pipes, but if it hits a pipe or falls to the ground, the round ends and a certain amount of reward is lost. For this game, the present invention uses a rule set to tell the bird not to fly too high or too low when flying through a pair of pipes; the rules affect the bird's training only when it is flying inside the box shown in FIG. 4 (i.e., the region between an opposing upper and lower pipe).

In this embodiment, an acceleration rule is used, and the effective probability is set as:

P_t = p_0 · γ^t

where 0 < γ < 1 and p_0 is a constant; for example, p_0 = 1 and γ = 0.8.

Formally, the Flappybird knowledge base is R_fb = {r_1, r_2}, and η(r_1) (where r_1 denotes rule 1, r_2 denotes rule 2, and η(r_1) denotes the first-order logic proposition of rule r_1) is:

crossing(p_u, p_l) ∧ less(distance(bird, p_u), size(bird)),

and δ(r_1) (the recommended action of rule r_1) is:

{flap},

where δ(r_1) denotes the recommended action of rule r_1. This rule means that when the bird is between the upper and lower pipes and the distance between the bird and the upper pipe is greater than the vertical height of one bird, the bird flies up.

η(r_2) is:

crossing(p_u, p_l) ∧ less(distance(bird, p_l), size(bird)),

and δ(r_2) is:

{null}.

This rule means that when the bird is between the upper and lower pipes and the distance between the bird and the lower pipe is greater than the vertical height of one bird, no action is taken.

Here, (p_u, p_l) denotes the pair of pipes the bird is flying through, null means taking no action, and flap means flapping the wings.
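
As an illustration of how the two Flappybird rules could be encoded for the intervention mechanism sketched earlier, the snippet below represents η(r) as a predicate and δ(r) as a set of recommended actions; the state fields (whether the bird is crossing a pipe pair, its height, and its vertical distances to the upper and lower pipes) are assumed names, and the conditions follow the prose reading of the rules above.

```python
# Illustrative encoding of the Flappybird rules r1 and r2; the state fields are
# assumed names, and the conditions follow the prose description of the rules.
from dataclasses import dataclass
from typing import Callable, Tuple

FLAP, NULL = "flap", "null"

@dataclass
class Rule:
    applies: Callable          # eta(r): condition on the (preprocessed) state
    actions: Tuple[str, ...]   # delta(r): recommended actions when the rule applies

r1 = Rule(  # between the pipes and far enough below the upper pipe -> flap
    applies=lambda s: s.crossing and s.dist_to_upper_pipe > s.bird_height,
    actions=(FLAP,),
)
r2 = Rule(  # between the pipes and far enough above the lower pipe -> do nothing
    applies=lambda s: s.crossing and s.dist_to_lower_pipe > s.bird_height,
    actions=(NULL,),
)
R_fb = [r1, r2]  # knowledge base for Flappybird
```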

The average reward and average Q value of Flappybird with the acceleration rule are shown in FIG. 5, where RIL corresponds to the present invention (marked by curve S11) and DQN corresponds to traditional reinforcement learning (marked by curve S12). In the experiment, a time limit was set for the training phase, and the reward per game rose steadily as training time increased. The average reward shown in FIG. 5 indicates that, within the same training time, the RIL of the present invention achieves better performance with less training data than the traditional DQN. Moreover, the average Q value also shows that the rules introduced by the present invention can speed up the learning progress.

Referring to FIG. 6, in the Spacewar game, enemy planes appear randomly at the top of the screen and fly vertically toward the bottom of the screen. The agent, i.e., our own plane, continuously shoots at the enemy planes at a certain frequency and receives a certain reward each time it hits an enemy plane; if our plane collides with an enemy plane, the game ends and the reward is lost. In this scenario, our plane can only move horizontally, and the available actions are moving left and moving right.

For example, the rule set used in this game is a greedy strategy: always move toward the horizontally closest enemy plane.

In this embodiment, the acceleration rule is used, and the effective probability is set as:

P_t = p_0 · γ^t

where 0 < γ < 1, with p_0 = 1 and γ = 0.8.

Formally, the Spacewar knowledge base is R_sw = {r_3, r_4}, and η(r_3) is:

on_left(nearest_jet, agent),

and δ(r_3) is:

{move_left}.

This rule means that when the nearest enemy plane is to the left, the agent moves left.

η(r_4) is:

on_right(nearest_jet, agent),

and δ(r_4) is:

{move_right}.

This rule means that when the nearest enemy plane is to the right, the agent moves right.

Here, move_left means moving left and move_right means moving right.

The results for Spacewar with the acceleration rule are shown in FIG. 7. It can be seen that the learning speed of the rule-intervention-driven DQN of the present invention is much faster than that of the traditional DQN.

Referring to FIG. 8, for the Breakout game, the following rules are used: if the ball is to the left of the paddle, the paddle moves left; if the ball is to the right of the paddle, the paddle moves right.

In this embodiment, the acceleration rule is used, and the effective probability is:

P_t = p_0 · γ^t,

where 0 < γ < 1, with p_0 = 1 and γ = 0.8.

形式上,Breakout的知识库为R bo={r 5,r 6},η(r 5)为 Formally, Breakout's knowledge base is R bo = {r 5 , r 6 }, and η(r 5 ) is

on_left(ball,paddle),on_left(ball, paddle),

且δ(r 5)为 and δ(r 5 ) is

{move_left},{move_left},

上述规则表示当球在球拍左方,左移。The above rule means that when the ball is on the left side of the racket, move left.

η(r 6)为 η(r 6 ) is

on_right(ball, paddle),

and δ(r_6) is

{move_right}.

The results for Breakout with the acceleration rules are shown in Fig. 9; the rule-intervention-based learning of the present invention improves markedly over the conventional DQN.

The experimental results of the above three embodiments are summarized in Table 1, showing that acceleration rule sets effectively improve learning efficiency.

Table 1: Comparison of experimental results

(Table 1 is provided as image PCTCN2021074974-appb-000007 in the original publication.)

Referring to Fig. 10, the GridWorld game contains a destination region, unreachable walls (the black cells), and traps. Once the agent falls into a trap the game ends and the agent is heavily penalized. The goal is to find the shortest path to the destination without falling into a trap. For example, the agent receives a reward of -1 for every move, -600 for falling into a trap, and 100 for reaching the destination.

In this embodiment, a safety rule is used to keep the agent safe during training. Safety rules prevent the agent from making catastrophic decisions with irreversible consequences. Unlike the acceleration rules above, the safety rule remains in effect throughout training.

In this embodiment, a single safety rule is used in the knowledge base, denoted R_gw = {r_7}, where η(r_7) is

near_trap ∧ trap_in(directions),

and δ(r_7) is:

A - {move(dir) : dir ∈ directions}

The above rule states that when the agent is adjacent to a trap, it selects an action that excludes the directions leading into the trap.

For example, if the available actions are "up", "down", "left", and "right" and the trap is to the agent's right, one of the other three actions is selected.

Here A is the set of all actions. The rule simply prevents the agent from stepping into a trap, and it is mandatory.
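A minimal sketch of this mandatory safety masking is given below; the reward values follow the example above, while the state encoding, predicate names, and the uniform choice among the remaining actions are assumptions made for illustration.

```python
import random

A = {"up", "down", "left", "right"}                     # the set of all actions
STEP_REWARD, TRAP_REWARD, GOAL_REWARD = -1, -600, 100   # rewards used in the example above

def safe_actions(trap_directions):
    # Rule r7: next to a trap, only A - {move(dir) : dir in trap_directions} is allowed.
    return A - set(trap_directions)

def apply_safety_rule(decided_action, near_trap, trap_directions):
    # Mandatory rule: always in effect, overriding any decided action that would enter a trap.
    if near_trap and decided_action in set(trap_directions):
        return random.choice(sorted(safe_actions(trap_directions)))
    return decided_action
```

For instance, with the trap to the agent's right, apply_safety_rule("right", True, {"right"}) returns one of "up", "down", or "left".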

The GridWorld experimental results are shown in Fig. 11, which indicates that even in the initial stage of training, the rule-intervention learning (RIL) of the present invention performs much better than the conventional DQN. Because of the safety rule, the agent never takes a catastrophic action, which guarantees its safety. Using safety rules to avoid catastrophic decisions is particularly effective for cold-start problems.

In summary, by combining high-abstraction-level rules with deep reinforcement learning, the present invention can guide the agent toward more "correct" behavior in specific scenarios. Moreover, the rule set can be tailored to the specific application scenario, so that while the generality of the model is preserved, individual requirements for safety, training time, and learning efficiency can be better satisfied. The present invention therefore shortens training time, avoids catastrophic decisions, and can be widely applied in the field of dynamic decision-making.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.

The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction-execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions so as to implement various aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data-processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data-processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data-processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, other programmable data-processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present invention is defined by the appended claims.

Claims (8)

1. A decision-making method based on deep reinforcement learning, comprising the following steps: the agent makes a decision according to environment information and selects the decided action; the agent compares the decided action with a knowledge base and, based on a set rule set in the knowledge base, decides whether to replace the decided action with a random action from the rule set; when it is determined to replace the decided action, the replaced action is executed in the environment, a reward and new environment information are obtained from the environment, and the old environment information, the action, the reward, and the new environment information are combined into a piece of experience information and stored in an experience replay pool; and a set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model, thereby guiding the next iteration.

2. The method according to claim 1, wherein deciding, according to the set rule set in the knowledge base, whether to replace the decided action with a random action from the rule set comprises: judging whether the rule set in the knowledge base satisfies a predetermined condition; and, when the set condition is satisfied, replacing the decided action with a random action from the rule set with a set probability.

3. The method according to claim 2, wherein, when the set condition is satisfied, the decided action is replaced, with probability P_t = p_0·γ^t, by a random action from the compliant action set α(R, t), where p_0 is the initial rule intervention probability, t is the running time, γ is the decay rate, R denotes the rule set, and α denotes all actions that comply with the rule set R at time t.

4. The method according to claim 1, wherein the rule set is set according to the decision application scenario, with the goal of avoiding catastrophic decisions or improving learning efficiency, and is used to guide the agent's actions in that application scenario.

5. The method according to claim 1, wherein combining the old environment information, the action, the reward, and the new environment information into one piece of experience information and storing it in the experience replay pool comprises: after the new environment information is obtained, storing one unit of experience information (φ(s_t), a_t, r_t, φ(s_{t+1})) in the experience replay pool D; and, if storing new experience information causes the capacity of the experience pool to exceed a set threshold N, deleting earlier experience information with reference to the storage time.
6. The method according to claim 5, wherein randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model comprises: at each step t of each round of interaction between the agent and the environment, randomly selecting a certain amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) from the experience replay pool D and computing the value y_j of each piece of experience information at the current time j:

(the expression for y_j is given as image PCTCN2021074974-appb-100001 in the original publication)

performing gradient descent with (y_j - Q(φ(s_j), a_j; θ))^2 as the objective function to optimize the neural network parameters θ; and finally, every fixed number of steps C, synchronizing the target action-value function Q* with the action-value function Q; where a' denotes the optional actions at time j+1, a_j denotes the action at time j, s_j and s_{j+1} denote the environment information at times j and j+1, respectively, and φ denotes the preprocessing process.
7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.

8. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 6.
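For illustration only, the sketch below shows one possible implementation of the loop described in claims 1, 5, and 6. PyTorch, the epsilon-greedy behaviour policy, the environment interface (reset and step), the handling of terminal states, and all identifiers are assumptions introduced here; the target value y_j is written as the standard DQN target because the corresponding expression appears only as an image in the original text.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def train(env, compliant_actions, obs_dim, n_actions, steps=10000, N=10000,
          batch=32, gamma_q=0.99, C=500, p0=1.0, gamma_rule=0.8, eps=0.1, lr=1e-3):
    q, q_target = QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
    q_target.load_state_dict(q.state_dict())         # target action-value function Q*
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    D = deque(maxlen=N)                               # experience replay pool with capacity N (FIFO)
    s = env.reset()
    for t in range(steps):
        # Decide an action from the current Q estimate (epsilon-greedy assumed).
        a = random.randrange(n_actions) if random.random() < eps else \
            int(q(torch.as_tensor(s, dtype=torch.float32)).argmax())
        # Compare with the knowledge base: with probability P_t = p0 * gamma_rule**t,
        # replace the decided action by a random rule-compliant action.
        allowed = compliant_actions(s)                # set of action indices permitted by the rules
        if allowed and random.random() < p0 * gamma_rule ** t:
            a = random.choice(sorted(allowed))
        # Execute the action, observe reward and new state, store the experience tuple.
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next, float(done)))
        s = env.reset() if done else s_next
        # Randomly select a batch of experience and update the network parameters theta.
        if len(D) >= batch:
            sj, aj, rj, sj1, dj = map(list, zip(*random.sample(D, batch)))
            sj = torch.as_tensor(sj, dtype=torch.float32)
            sj1 = torch.as_tensor(sj1, dtype=torch.float32)
            rj = torch.as_tensor(rj, dtype=torch.float32)
            dj = torch.as_tensor(dj, dtype=torch.float32)
            aj = torch.as_tensor(aj, dtype=torch.int64)
            with torch.no_grad():                     # y_j = r_j + gamma * max_a' Q*(s_{j+1}, a')
                yj = rj + gamma_q * (1.0 - dj) * q_target(sj1).max(dim=1).values
            qj = q(sj).gather(1, aj.view(-1, 1)).squeeze(1)
            loss = ((yj - qj) ** 2).mean()            # objective (y_j - Q(phi(s_j), a_j; theta))^2
            opt.zero_grad(); loss.backward(); opt.step()
        # Every C steps, synchronize the target network Q* with Q.
        if t % C == 0:
            q_target.load_state_dict(q.state_dict())
    return q
```

The compliant_actions argument plays the role of the knowledge-base lookup: it returns the action indices permitted by the rule set in the current state, and passing a function that always returns the empty set reduces the loop to a conventional DQN.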