
WO2022083029A1 - Decision-making method based on deep reinforcement learning - Google Patents

Decision-making method based on deep reinforcement learning

Info

Publication number
WO2022083029A1
WO2022083029A1 PCT/CN2021/074974 CN2021074974W
Authority
WO
WIPO (PCT)
Prior art keywords
action
decision
experience
information
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/074974
Other languages
French (fr)
Chinese (zh)
Inventor
张昊迪
伍楷舜
陈振浩
高子航
李启凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Publication of WO2022083029A1
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/822 Strategy games; Role-playing games
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80 Special adaptations for executing a specific game genre or game mode
    • A63F13/837 Shooting of targets
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present invention relates to the field of artificial intelligence, and more particularly, to a decision-making method based on deep reinforcement learning.
  • Reinforcement learning is a field in machine learning used to describe and solve problems in which agents learn strategies to maximize rewards or achieve specific goals in the process of interacting with the environment.
  • the purpose of the present invention is to overcome the above-mentioned defects of the prior art, and to provide a decision-making method based on deep reinforcement learning, which is a new technical solution for dynamic decision-making by combining high abstraction level rules and deep reinforcement learning.
  • the present invention provides a decision-making method based on deep reinforcement learning.
  • the method includes the following steps:
  • the agent makes a decision according to the environmental information and selects a post-decision action;
  • the agent compares the post-decision action with the knowledge base and, based on the rule set configured in the knowledge base, decides whether to replace the post-decision action with a random action from the rule set;
  • a set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model to guide the next iteration.
  • determining whether to replace the post-decision action with a random action in the rule set according to the set rule set in the knowledge base includes:
  • the rule set is set according to a decision application scenario to avoid catastrophic decision-making or to improve learning efficiency, and is used to guide the actions of the agent in the application scenario.
  • combining old environment information, actions, rewards and new environment information into one experience information, and storing it into the experience playback pool includes:
  • randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model includes:
  • a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated:
  • a' represents an optional action at time j+1
  • a_j represents the action at time j
  • s_j and s_{j+1} represent the environmental information at time j and time j+1 respectively
  • φ represents the preprocessing process.
  • Compared with the prior art, the present invention has the advantage that, in deep reinforcement learning, applicable rules are considered in addition to the Q values of all possible actions; by combining high-abstraction-level rules with deep reinforcement learning, training performance is improved and catastrophic decisions are avoided.
  • FIG. 1 is a flowchart of a decision-making method based on deep reinforcement learning according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a rule intervention learning framework according to an embodiment of the present invention.
  • FIG. 3 is a screen diagram in the Flappybird game according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the effective range of rules in the Flappybird game according to an embodiment of the present invention.
  • FIG. 6 is a screen diagram in the Spacewar game according to an embodiment of the present invention.
  • FIG. 8 is a screen diagram in a Breakout game according to an embodiment of the present invention.
  • FIG. 10 is a screen diagram in the GridWorld game according to an embodiment of the present invention.
  • FIG. 11 is a graph of average-reward experimental results in the GridWorld game according to an embodiment of the present invention.
  • the decision-making method based on deep reinforcement learning includes the following steps:
  • Step S110 preset the rule set and related hyperparameters in the knowledge base, and initialize the deep reinforcement learning model.
  • the proposed decision-making method combining high-abstraction-level rules with deep reinforcement learning includes a rule intervention learning (RIL) framework, specifically comprising a deep Q network (DQN, a deep reinforcement learning algorithm that combines Q-learning and deep learning), a knowledge base and the environment, where the DQN and the knowledge base interact so that the rules in the knowledge base are combined with deep Q-learning.
  • a deep reinforcement learning model utilizes a neural network consisting of 3 convolutional layers, 1 hidden layer and 1 output layer.
  • the first convolutional layer has 32 convolution kernels of size 8*8*4 with a stride of 4, and its output passes through a 2*2 max-pooling layer;
  • the second convolutional layer has 64 convolution kernels of size 4*4*32 with a stride of 2;
  • the third convolutional layer has 64 convolution kernels of size 3*3*64 with a stride of 1;
  • the hidden layer is a fully connected layer of 256 ReLU nodes; since the number of actions may differ between embodiments, the output layer may differ.
  • the ReLU function is f(x) = max(0, x).
  • the specific structure, number of layers, convolution kernel size, activation function, etc. of the neural network can be set according to the requirements for training accuracy and training time, which are not limited in the present invention.
  • the initialization process of step S110 includes setting the rule set in the knowledge base, the rule intervention probability, the preprocessing process, the loss function, the experience replay pool, the number of training rounds, and so on.
  • the rule set can contain several different types of rules and can be configured according to different application scenarios. It should be noted that each type of rule can act alone or in combination with other types of rules. Each rule in the rule set specifies the scope in which the rule takes effect and the actions corresponding to that scope.
  • a preprocessing process φ is set to process the raw input information; for example, a color game screen becomes a gray-scale screen after preprocessing. Specifically, the agent interacts with the environment for M rounds, but before each decision step the environmental information is preprocessed: if the initial environmental information of a round is s_1, the preprocessed initial information is φ(s_1).
  • an exploration probability ε is set: in addition to selecting the action with the largest Q value, the agent selects a random action with probability ε. Specifically, in each step t of each round of interaction with the environment, the agent selects a random action to execute in the environment with probability ε.
  • an experience replay pool D with maximum capacity N is set, which is used to store experience information for deep Q-learning training.
  • the action-value function Q is initialized with its parameters randomly set to θ, and the target action-value function Q* is initialized with its parameters θ* set to the value of θ; the target action-value function Q* is used to select the action with the largest Q* value, expressed as a_t = argmax_a Q*(φ(s_t), a; θ).
  • the loss function is calculated based on the action-value function Q and the value at the current time j to update the network parameters θ; for example, the loss function is expressed as (y_j − Q(φ(s_j), a_j; θ))².
  • a' represents an optional action at time j+1.
  • an upper limit M on the number of training rounds needs to be set; when the number of training rounds reaches M, the training ends.
  • step S120 the agent observes the environment and obtains information obtained from the environment.
  • the agent observes the environment and obtains the environmental information s_t, which it preprocesses into φ(s_t); for example, a color game screen s_t becomes a gray-scale screen φ(s_t) after preprocessing. It should be noted that each time the agent receives environmental information from the environment, this preprocessing operation is applied.
  • Step S130 the agent makes a decision according to the environmental information, and selects an action after the decision.
  • step S140: the agent compares the post-decision action with the knowledge base and, according to the rules in the knowledge base, judges whether a more suitable action exists; if a more suitable action exists, the post-decision action is replaced under certain conditions, and otherwise the post-decision action is executed.
  • step S150: the agent executes the post-decision action or the replacement action in the environment, obtains a reward and other new information from the environment, combines the old environmental information, the action, the reward and the new environmental information into one piece of experience information, and stores it in the experience replay pool.
  • in each step t of each round of interaction between the agent and the environment, the finally selected action a_t is executed in the environment, the reward value r_t and the new environmental observation s_{t+1} are obtained, the observation is preprocessed to obtain φ(s_{t+1}), and φ(s_t), a_t, r_t and φ(s_{t+1}) are combined into one unit of experience information, denoted (φ(s_t), a_t, r_t, φ(s_{t+1})), which is stored in the experience replay pool D.
  • the maximum capacity of the experience replay pool D is N; if the pool has already reached its maximum capacity N when new experience information is stored, earlier experience information needs to be deleted to make room.
  • step S160: a set amount of experience information is randomly selected from the experience replay pool and used to update the model to guide the next iteration.
  • a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated
  • the agent automatically enters the next round of interaction, and steps S120 to S160 are repeated until the preset upper limit M on the number of training rounds, the preset number of iterations, or the preset convergence condition of the loss function is reached.
  • a bird manipulated by the agent tries to fly through pairs of pipes while avoiding hitting any pipes.
  • the agent has two actions available, that is, control the bird to flap its wings or do nothing.
  • the bird gets a temporary upward acceleration, so the bird can ascend a certain distance. If nothing is done, the bird will descend a certain distance due to gravity.
  • Birds flying through pairs of pipes will be rewarded, but if the bird hits the pipes or falls to the ground, the round will end and a certain amount of the reward will be lost.
  • for this game, the present invention uses a rule set to tell the bird not to fly too high or too low when flying through a pair of pipes; the rules affect the bird's training only when it is flying inside the box of FIG. 4 (i.e., the region between an opposing upper and lower pipe).
  • the acceleration rule is used, and the effective probability is set as P_t = p_0 · γ^t;
  • δ(r_1) represents the recommended action of rule r_1.
  • the above rule means that when the bird is between the upper and lower pipes and the distance between the bird and the upper pipe is greater than the vertical height of one bird, the bird flies up.
  • η(r_2) is crossing(p_u, p_l) ∧ less(distance(bird, p_l), size(bird)), and δ(r_2) is {null}.
  • the above rule means that when the bird is between the upper and lower pipes and the distance between the bird and the lower pipe is greater than the vertical height of one bird, no action is taken.
  • the average reward and average Q value of Flappybird with the acceleration rule are shown in FIG. 5, where RIL corresponds to the present invention (marked by curve S11) and DQN corresponds to traditional reinforcement learning (marked by curve S12).
  • a time limit was set for the training phase, and the reward per game increased as training time increased. The average reward shown in FIG. 5 indicates that, within the same training time, the RIL of the present invention achieves better performance with less training data than the traditional DQN.
  • the average Q value also shows that the rules introduced by the present invention can speed up the learning progress.
  • the rule set used in this game is a greedy strategy: always move toward the horizontally closest enemy plane.
  • the acceleration rule is used, and the effective probability is set as P_t = p_0 · γ^t;
  • η(r_4) is on_right(nearest_jet, agent), and δ(r_4) is {move_right};
  • move_left means moving left
  • move_right means moving right
  • the acceleration rule is used in this embodiment, and the effective probability is P_t = p_0 · γ^t;
  • the above rule means that when the ball is to the left of the paddle, the paddle moves left.
  • the GridWorld game contains destination areas, impassable walls (shown as black grids), traps, and the like;
  • once the agent falls into a trap, the game ends and the agent is heavily penalized.
  • the ultimate goal of the game is to find the shortest path to the destination without falling into a trap. For example, the agent receives a negative reward of -1 for every move; falling into a trap gives a negative reward of -600, and reaching the destination gives a reward of 100.
  • safety rules are used to ensure the safety of the agent during training.
  • Safety rules are used to prevent the agent from making catastrophic decisions with irreversible consequences.
  • the set safety rules are always in effect during the training process.
  • for example, the optional actions are "up", "down", "left" and "right"; if the trap is to the "right" of the agent, one of the other 3 actions is chosen.
  • this rule simply prevents the agent from entering the trap, and it is mandatory.
  • the GridWorld experimental results, shown in FIG. 11, show that even in the initial stage of training, the performance of the rule intervention learning (RIL) of the present invention is much better than that of the traditional DQN. Because of the safety rules, the agent never takes a catastrophic action, which guarantees the agent's safety. Using safety rules to avoid catastrophic decisions is particularly effective for cold-start problems.
  • by combining high-abstraction-level rules with deep reinforcement learning, the present invention can guide the agent to move in a more "correct" direction in specific scenarios.
  • the configured rule set can be individually designed for specific application scenarios, so that, while improving the generality of the model, it can better meet individual needs in terms of safety, learning time and learning efficiency. Therefore, by using the present invention, training time can be shortened and catastrophic decisions can be avoided, and the method can be widely applied in the field of dynamic decision-making.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the above.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • the computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), may execute the computer-readable program instructions in order to implement various aspects of the present invention.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, so that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures; for example, two successive blocks may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A decision-making method based on deep reinforcement learning, the method comprising: an intelligent agent makes a decision according to environment information and selects an action after the decision; the intelligent agent compares said action with a knowledge base, and decides, on the basis of a set ruleset in the knowledge base, to execute said action or to replace said action; the intelligent agent executes said action or the replaced action in the environment, obtains a reward and new environment information from the environment, combines the old environment information, the action, the reward and the new environment information into experience information, and stores the experience information into an experience playback pool; and a set amount of experience information is randomly selected from the experience playback pool, so as to update a deep reinforcement learning model to thereby guide the next iteration. By utilizing the present application, training time may be shortened, catastrophic decision-making may be avoided, and the method may be widely applied to the field of dynamic decision-making.

Description

A decision-making method based on deep reinforcement learning

Technical Field

The present invention relates to the field of artificial intelligence, and more particularly, to a decision-making method based on deep reinforcement learning.

Background Art

Reinforcement learning is a field of machine learning used to describe and solve problems in which an agent learns a strategy through interaction with its environment in order to maximize reward or achieve a specific goal.

Currently, deep reinforcement learning has been successfully applied in a variety of dynamic decision-making domains, especially those with very large state spaces. However, deep reinforcement learning also faces several problems. First, its training process can be very slow and resource-intensive; the final system is often fragile, its results are difficult to interpret, and it performs poorly for a long period at the beginning of training. Furthermore, for applications in robotics and critical decision-support systems, deep reinforcement learning may even make catastrophic decisions with very costly consequences.

Therefore, the existing technology needs to be improved in order to obtain more efficient and safer decision-making methods.

Summary of the Invention

The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide a decision-making method based on deep reinforcement learning, a new technical solution for dynamic decision-making that combines high-abstraction-level rules with deep reinforcement learning.

The present invention provides a decision-making method based on deep reinforcement learning. The method includes the following steps:

The agent makes a decision according to the environmental information and selects a post-decision action;

The agent compares the post-decision action with the knowledge base and, based on the rule set configured in the knowledge base, decides whether to replace the post-decision action with a random action from the rule set;

If it is decided to replace the post-decision action, the replacement action is executed in the environment; a reward and new environmental information are obtained from the environment, and the old environmental information, the action, the reward and the new environmental information are combined into a piece of experience information, which is stored in the experience replay pool;

A set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model and thereby guide the next iteration.

In one embodiment, deciding, according to the rule set configured in the knowledge base, whether to replace the post-decision action with a random action from the rule set includes:

determining whether the rule set in the knowledge base satisfies a predetermined condition;

if the set condition is satisfied, replacing the post-decision action with a random action from the rule set with a set probability.

In one embodiment, if the set condition is satisfied, the post-decision action is replaced with a random action from the compliant action set α(R, t) with probability P_t = p_0 · γ^t, where p_0 is the initial rule intervention probability, t is the running time, γ is the decay rate, R denotes the rule set, and α(R, t) denotes all actions that comply with the rule set R at time t.

In one embodiment, the rule set is configured according to the decision application scenario, with the goal of avoiding catastrophic decisions or improving learning efficiency, and is used to guide the agent's actions in that scenario.

In one embodiment, combining the old environmental information, the action, the reward and the new environmental information into one piece of experience information and storing it in the experience replay pool includes:

after the new environmental information is obtained, storing one unit of experience information (φ(s_t), a_t, r_t, φ(s_{t+1})) in the experience replay pool D;

if the capacity of the experience pool exceeds the set threshold N when new experience information is stored, deleting earlier experience information based on the storage time.

In one embodiment, randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model includes:

In each step t of each round of interaction between the agent and the environment, a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

Gradient descent is performed with (y_j − Q(φ(s_j), a_j; θ))² as the objective function to optimize the neural network parameters θ;

Finally, every fixed number of steps C, the target action-value function Q* is synchronized with the action-value function Q;

where a' denotes an optional action at time j+1, a_j denotes the action at time j, s_j and s_{j+1} denote the environmental information at times j and j+1 respectively, and φ denotes the preprocessing process.

Compared with the prior art, the present invention has the advantage that, in deep reinforcement learning, the applicable rules are considered in addition to the Q values of all possible actions; by combining high-abstraction-level rules with deep reinforcement learning, the training effect is improved and catastrophic decisions can be avoided.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings.

Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a flowchart of a decision-making method based on deep reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the rule intervention learning framework according to an embodiment of the present invention;

FIG. 3 is a screen diagram in the Flappybird game according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the effective range of the rules in the Flappybird game according to an embodiment of the present invention;

FIG. 5 is a graph of experimental results for average reward and average Q value in the Flappybird game according to an embodiment of the present invention;

FIG. 6 is a screen diagram in the Spacewar game according to an embodiment of the present invention;

FIG. 7 is a graph of experimental results for average reward and average Q value in the Spacewar game according to an embodiment of the present invention;

FIG. 8 is a screen diagram in the Breakout game according to an embodiment of the present invention;

FIG. 9 is a graph of experimental results for average reward and average Q value in the Breakout game according to an embodiment of the present invention;

FIG. 10 is a screen diagram in the GridWorld game according to an embodiment of the present invention;

FIG. 11 is a graph of average-reward experimental results in the GridWorld game according to an embodiment of the present invention.

In the drawings, the axis labels are Average Reward, Average Q value, and Training Epochs.

Detailed Description of Embodiments

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the invention.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or its uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and apparatus should be considered part of the specification.

In all examples shown and discussed herein, any specific value should be construed as merely illustrative and not limiting. Accordingly, other instances of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters refer to similar items in the following figures; therefore, once an item is defined in one figure, it does not require further discussion in subsequent figures.

Referring to FIG. 1, the decision-making method based on deep reinforcement learning provided by this embodiment includes the following steps:

Step S110: preset the rule set and related hyperparameters in the knowledge base, and initialize the deep reinforcement learning model.

To facilitate understanding of the present invention, refer first to FIG. 2. The proposed decision-making method combining high-abstraction-level rules with deep reinforcement learning adopts a rule intervention learning (RIL) framework, which specifically includes a deep Q network (DQN, a deep reinforcement learning algorithm that combines Q-learning and deep learning), a knowledge base and the environment. The DQN and the knowledge base interact so that the rules in the knowledge base are combined with deep Q-learning: in deep Q-learning, in addition to considering the Q values of all possible actions, the agent also considers whether any rule applies.

For example, the neural network used by the deep reinforcement learning model includes 3 convolutional layers, 1 hidden layer and 1 output layer. For example, the first convolutional layer has 32 convolution kernels of size 8*8*4 with a stride of 4, and its output passes through a 2*2 max-pooling layer; the second convolutional layer has 64 convolution kernels of size 4*4*32 with a stride of 2; the third convolutional layer has 64 convolution kernels of size 3*3*64 with a stride of 1; the hidden layer is a fully connected layer of 256 ReLU nodes. Since the number of actions may differ between embodiments, the output layer may differ accordingly.

The ReLU function is:

f(x) = max(0, x)   (1)

It should be noted that the specific structure, number of layers, convolution kernel sizes, activation functions, etc. of the neural network can be set according to the requirements for training accuracy and training time, which are not limited by the present invention.
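
By way of illustration only, the following is a minimal PyTorch sketch of the network described above. The input resolution, the number of stacked input channels, the use of ReLU activations after the convolutional layers, and the number of actions are assumptions rather than values fixed by the text, so they are left as parameters; nn.LazyLinear is used so that the 256-node hidden layer does not depend on an assumed input size.

```python
# Illustrative sketch only; layer sizes follow the description above, while the
# input channels/resolution and the number of actions are assumed parameters.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),  # 32 kernels of 8*8*4, stride 4
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                          # 2*2 max pooling
            nn.Conv2d(32, 64, kernel_size=4, stride=2),           # 64 kernels of 4*4*32, stride 2
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),           # 64 kernels of 3*3*64, stride 1
            nn.ReLU(),
            nn.Flatten(),
        )
        self.hidden = nn.Sequential(nn.LazyLinear(256), nn.ReLU())  # hidden layer of 256 ReLU nodes
        self.output = nn.Linear(256, num_actions)                   # one Q value per available action

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(self.features(x)))
```

The output layer simply emits one Q value per action, which is why its size varies between embodiments.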

The initialization process of step S110 includes setting the rule set in the knowledge base, the rule intervention probability, the preprocessing process, the loss function, the experience replay pool, the number of training rounds, and so on.

Set the rule set R in the knowledge base. The rule set may contain several different types of rules and can be configured for different application scenarios. It should be noted that each type of rule can act alone or in combination with other types of rules. Each rule in the rule set specifies the scope in which the rule takes effect and the actions corresponding to that scope.

Set the rule intervention condition or intervention probability. For example, the rule intervention probability at the current moment is set to P_t = p_0 · γ^t, where p_0 is the initial rule intervention probability, γ is the decay rate, and t is the running time. It should be understood that this intervention probability is only exemplary; other forms of rule intervention probability, such as P_t = γ^t, may also be used.

Set the preprocessing process φ, which is used to process the raw input information. For example, a color game screen becomes a gray-scale screen after preprocessing. Specifically, the agent interacts with the environment for M rounds, but before each decision step the environmental information is preprocessed: if the initial environmental information of a round is s_1, the preprocessed initial information is φ(s_1).
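
As a concrete illustration of the preprocessing process φ, the sketch below converts a color frame to a gray-scale frame with NumPy; the luminance weights and the scaling to [0, 1] are conventional choices assumed here, not values specified by the text.

```python
# Minimal sketch of a preprocessing function phi: color frame -> gray-scale frame.
# The luminance weights and the scaling to [0, 1] are assumptions.
import numpy as np

def phi(frame_rgb: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) uint8 color frame to an (H, W) float32 gray-scale frame."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame_rgb.astype(np.float32) @ weights
    return gray / 255.0
```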

Set the exploration probability ε: in addition to selecting the action with the largest Q value, the agent selects a random action with probability ε. Specifically, in each step t of each round of interaction with the environment, the agent selects a random action to execute in the environment with probability ε.

Set the experience replay pool D with a maximum capacity of N, which is used to store experience information for deep Q-learning training.

Initialize the action-value function Q with its parameters randomly set to θ, and initialize the target action-value function Q* with its parameters θ* set to the value of θ. The target action-value function Q* is used to select the action with the largest Q* value, expressed as:

a_t = argmax_a Q*(φ(s_t), a; θ)   (2)

Then, the loss function is calculated based on the action-value function Q and the value at the current time j in order to update the network parameters θ. For example, the loss function is expressed as:

(y_j − Q(φ(s_j), a_j; θ))²   (3)

where the value at the current time j is expressed as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

where a' denotes an optional action at time j+1.

In addition, an upper limit M on the number of training rounds needs to be set; when the number of training rounds reaches M, training ends.

Step S120: the agent observes the environment and obtains information from the environment.

In each step t of each round of interaction between the agent and the environment, the agent observes the environment and obtains the environmental information s_t, which it preprocesses into φ(s_t); for example, a color game screen s_t becomes a gray-scale screen φ(s_t) after preprocessing. It should be noted that each time the agent receives environmental information from the environment, this preprocessing operation is applied.

Step S130: the agent makes a decision according to the environmental information and selects a post-decision action.

In each step t of each round of interaction between the agent and the environment, after observing the preprocessed environmental information φ(s_t), the agent selects a random action with probability ε; otherwise, it selects the post-decision action according to the formula a_t = argmax_a Q*(φ(s_t), a; θ).

Step S140: the agent compares the post-decision action with the knowledge base and, according to the rules in the knowledge base, judges whether a more suitable action exists. If a more suitable action exists, the post-decision action is replaced under certain conditions; otherwise, the post-decision action is executed.

Specifically, in each step t of each round of interaction between the agent and the environment, the agent compares the post-decision action with the knowledge base and, according to the rule set configured in the knowledge base, judges whether a more suitable action exists. If a more suitable action exists, the post-decision action is replaced under certain conditions; otherwise, the post-decision action is executed. For example, with the rule set R configured, if the compliant action set α(R, t) is continuously non-empty and the set intervention condition holds, then the post-decision action a_t is replaced with a random action from the rule set with probability P_t = p_0 · γ^t, where R denotes the rule set, t denotes the running time, and α(R, t) denotes all actions that comply with the rule set R at time t.
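
The following sketch illustrates one way steps S130 and S140 could be combined: an ε-greedy choice over Q* followed by probabilistic rule intervention with P_t = p_0 · γ^t. The Rule interface (an applies predicate for η(r) and a set of recommended actions for δ(r)), the default hyperparameter values, and the intervention condition (the chosen action lying outside the compliant set α(R, t)) are assumptions made for illustration.

```python
# Illustrative sketch of steps S130/S140; the Rule interface and the exact
# intervention condition are assumptions, not the patent's literal definition.
import random
import torch

def compliant_actions(rules, state, actions):
    """alpha(R, t): actions recommended by every rule whose condition holds."""
    allowed = set(actions)
    for rule in rules:
        if rule.applies(state):            # eta(r): rule condition (assumed interface)
            allowed &= set(rule.actions)   # delta(r): recommended actions (assumed interface)
    return allowed

def select_action(q_star, phi_s, actions, rules, t,
                  epsilon=0.1, p0=1.0, decay=0.8):
    # Step S130: epsilon-greedy selection over Q* values.
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        with torch.no_grad():
            q_values = q_star(phi_s.unsqueeze(0)).squeeze(0)
        a = actions[int(torch.argmax(q_values))]

    # Step S140: with probability P_t = p0 * decay**t, replace a non-compliant
    # action with a random action drawn from the compliant set alpha(R, t).
    allowed = compliant_actions(rules, phi_s, actions)
    p_t = p0 * (decay ** t)
    if allowed and a not in allowed and random.random() < p_t:
        a = random.choice(list(allowed))
    return a
```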

Step S150: the agent executes the post-decision action or the replacement action in the environment, obtains a reward and other new information from the environment, combines the old environmental information, the action, the reward and the new environmental information into one piece of experience information, and stores it in the experience replay pool.

In each step t of each round of interaction between the agent and the environment, the finally selected action a_t is executed in the environment and the reward value r_t and the new environmental observation s_{t+1} are obtained. The environmental observation is then preprocessed to obtain φ(s_{t+1}), and φ(s_t), a_t, r_t and φ(s_{t+1}) are combined into one unit of experience information, denoted (φ(s_t), a_t, r_t, φ(s_{t+1})), which is stored in the experience replay pool D.

It should be noted that the maximum capacity of the experience replay pool D is N. If the pool has already reached its maximum capacity N when new experience information is stored, earlier experience information needs to be deleted to make room.
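
A minimal sketch of the experience replay pool D with maximum capacity N is given below; collections.deque already discards the oldest entry when the pool is full, which matches the deletion of earlier experience information described above.

```python
# Minimal sketch of the experience replay pool D with maximum capacity N.
from collections import deque
import random

class ReplayPool:
    def __init__(self, capacity_n: int):
        # deque(maxlen=N) drops the oldest experience automatically when full.
        self.pool = deque(maxlen=capacity_n)

    def __len__(self):
        return len(self.pool)

    def store(self, phi_s, a, r, phi_s_next):
        # One unit of experience information (phi(s_t), a_t, r_t, phi(s_{t+1})).
        self.pool.append((phi_s, a, r, phi_s_next))

    def sample(self, batch_size: int):
        # Randomly select a set amount of experience information for training.
        return random.sample(list(self.pool), batch_size)
```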

Step S160: a set amount of experience information is randomly selected from the experience replay pool and used to update the model, which guides the next iteration.

In each step t of each round of interaction between the agent and the environment, a set amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) is randomly selected from the experience replay pool D, and the value of each piece of experience information at the current time j is calculated as:

y_j = r_j + γ · max_{a'} Q*(φ(s_{j+1}), a'; θ*)

Then, gradient descent is performed with (y_j − Q(φ(s_j), a_j; θ))² as the objective function to optimize the neural network parameters θ. Finally, every fixed number of steps C, the target action-value function Q* is synchronized with the action-value function Q, and the updated model guides the next iteration.
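 
One possible implementation of the update in step S160, reusing the ReplayPool and QNetwork sketches above, is outlined below; y_j is computed with the target network, the squared error (y_j − Q(φ(s_j), a_j; θ))² is minimized by gradient descent, and Q* is synchronized with Q every C steps. The batch layout and the choice of optimizer are assumptions.

```python
# Illustrative sketch of step S160; batch layout and optimizer are assumptions.
import torch
import torch.nn.functional as F

def dqn_update(q, q_target, optimizer, pool, batch_size, gamma, step, sync_every_c):
    if len(pool) < batch_size:
        return None  # wait until enough experience has been collected

    batch = pool.sample(batch_size)
    phi_s  = torch.stack([b[0] for b in batch])
    a      = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    phi_s1 = torch.stack([b[3] for b in batch])

    with torch.no_grad():
        # y_j = r_j + gamma * max_a' Q*(phi(s_{j+1}), a')
        y = r + gamma * q_target(phi_s1).max(dim=1).values

    q_sa = q(phi_s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(phi(s_j), a_j; theta)
    loss = F.mse_loss(q_sa, y)                            # (y_j - Q(...))^2 averaged over the batch

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C steps, synchronize the target action-value function Q* with Q.
    if step % sync_every_c == 0:
        q_target.load_state_dict(q.state_dict())
    return float(loss)
```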

At the end of each round of interaction with the environment, the agent automatically enters the next round, and steps S120 to S160 are repeated until the preset upper limit M on the number of training rounds, the preset number of iterations, or the preset convergence condition of the loss function is reached.
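
To show how steps S120 to S160 fit together, a compact training-loop skeleton follows; env.reset and env.step, the preprocess wrapper (φ plus any frame stacking and tensor conversion), and the helper functions from the sketches above are assumed interfaces rather than a fixed API.

```python
# Compact skeleton of the overall interaction loop (steps S120-S160); the
# environment interface and the helpers from the earlier sketches are assumptions.
def train(env, q, q_target, optimizer, rules, actions, pool,
          max_rounds_m, batch_size, gamma, epsilon, sync_every_c):
    step = 0
    for episode in range(max_rounds_m):                  # at most M training rounds
        s = env.reset()                                  # S120: observe the environment
        done = False
        while not done:
            phi_s = preprocess(s)                        # phi plus stacking/conversion (assumed)
            a = select_action(q_target, phi_s, actions, rules, step, epsilon)  # S130/S140
            s_next, r, done = env.step(a)                # S150: execute the selected action
            pool.store(phi_s, a, r, preprocess(s_next))
            dqn_update(q, q_target, optimizer, pool,     # S160: update the model
                       batch_size, gamma, step, sync_every_c)
            s = s_next
            step += 1
```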

To further illustrate the effect of the present invention, specific application scenarios are described below as examples.

Referring to FIG. 3, in the Flappybird game, a bird controlled by the agent tries to fly through pairs of pipes while avoiding hitting any pipe. In this scenario, the agent has two available actions: controlling the bird to flap its wings, or doing nothing. By flapping its wings, the bird obtains a temporary upward acceleration and can therefore rise a certain distance. If nothing is done, the bird descends a certain distance due to gravity. The bird is rewarded for flying through a pair of pipes, but if it hits a pipe or falls to the ground, the round ends and a certain amount of reward is lost. For this game, the present invention uses a rule set to tell the bird not to fly too high or too low when flying through a pair of pipes; the rules affect the bird's training only when it is flying inside the box shown in FIG. 4 (i.e., the region between an opposing upper and lower pipe).

In this embodiment, an acceleration rule is used, and the effective probability is set as:

P_t = p_0 · γ^t

where 0 < γ < 1 and p_0 is a constant; for example, p_0 = 1 and γ = 0.8.

Formally, the Flappybird knowledge base is R_fb = {r_1, r_2}, and η(r_1) (where r_1 denotes rule 1, r_2 denotes rule 2, and η(r_1) denotes the first-order logic proposition of rule r_1) is:

crossing(p_u, p_l) ∧ less(distance(bird, p_u), size(bird)),

and δ(r_1) (the recommended action of rule r_1) is:

{flap},

where δ(r_1) denotes the recommended action of rule r_1. This rule means that when the bird is between the upper and lower pipes and the distance between the bird and the upper pipe is greater than the vertical height of one bird, the bird flies up.

η(r_2) is:

crossing(p_u, p_l) ∧ less(distance(bird, p_l), size(bird)),

and δ(r_2) is:

{null}.

This rule means that when the bird is between the upper and lower pipes and the distance between the bird and the lower pipe is greater than the vertical height of one bird, no action is taken.

Here, (p_u, p_l) denotes the pair of pipes the bird is flying through, null means taking no action, and flap means flapping the wings.
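
As an illustration of how the two Flappybird rules could be encoded for the intervention mechanism sketched earlier, the snippet below represents η(r) as a predicate and δ(r) as a set of recommended actions; the state fields (whether the bird is crossing a pipe pair, its height, and its vertical distances to the upper and lower pipes) are assumed names, and the conditions follow the prose reading of the rules above.

```python
# Illustrative encoding of the Flappybird rules r1 and r2; the state fields are
# assumed names, and the conditions follow the prose description of the rules.
from dataclasses import dataclass
from typing import Callable, Tuple

FLAP, NULL = "flap", "null"

@dataclass
class Rule:
    applies: Callable          # eta(r): condition on the (preprocessed) state
    actions: Tuple[str, ...]   # delta(r): recommended actions when the rule applies

r1 = Rule(  # between the pipes and far enough below the upper pipe -> flap
    applies=lambda s: s.crossing and s.dist_to_upper_pipe > s.bird_height,
    actions=(FLAP,),
)
r2 = Rule(  # between the pipes and far enough above the lower pipe -> do nothing
    applies=lambda s: s.crossing and s.dist_to_lower_pipe > s.bird_height,
    actions=(NULL,),
)
R_fb = [r1, r2]  # knowledge base for Flappybird
```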

The average reward and average Q value of Flappybird with the acceleration rule are shown in FIG. 5, where RIL corresponds to the present invention (marked by curve S11) and DQN corresponds to traditional reinforcement learning (marked by curve S12). In the experiment, a time limit was set for the training phase, and the reward per game rose steadily as training time increased. The average reward shown in FIG. 5 indicates that, within the same training time, the RIL of the present invention achieves better performance with less training data than the traditional DQN. Moreover, the average Q value also shows that the rules introduced by the present invention can speed up the learning progress.

Referring to FIG. 6, in the Spacewar game, enemy planes appear randomly at the top of the screen and fly vertically toward the bottom of the screen. The agent, i.e., our own plane, continuously shoots at the enemy planes at a certain frequency and receives a certain reward each time it hits an enemy plane; if our plane collides with an enemy plane, the game ends and the reward is lost. In this scenario, our plane can only move horizontally, and the available actions are moving left and moving right.

For example, the rule set used in this game is a greedy strategy: always move toward the horizontally closest enemy plane.

In this embodiment, the acceleration rule is used, and the effective probability is set as:

P_t = p_0 · γ^t

where 0 < γ < 1, with p_0 = 1 and γ = 0.8.

Formally, the Spacewar knowledge base is R_sw = {r_3, r_4}, and η(r_3) is:

on_left(nearest_jet, agent),

and δ(r_3) is:

{move_left}.

This rule means that when the nearest enemy plane is to the left, the agent moves left.

η(r_4) is:

on_right(nearest_jet, agent),

and δ(r_4) is:

{move_right}.

This rule means that when the nearest enemy plane is to the right, the agent moves right.

Here, move_left means moving left and move_right means moving right.

The results for Spacewar with the acceleration rule are shown in FIG. 7. It can be seen that the learning speed of the rule-intervention-driven DQN of the present invention is much faster than that of the traditional DQN.

Referring to FIG. 8, for the Breakout game, the following rules are used: if the ball is to the left of the paddle, the paddle moves left; if the ball is to the right of the paddle, the paddle moves right.

In this embodiment, the acceleration rule is used, and the effective probability is:

P_t = p_0 · γ^t,

where 0 < γ < 1, with p_0 = 1 and γ = 0.8.

形式上,Breakout的知识库为R bo={r 5,r 6},η(r 5)为 Formally, Breakout's knowledge base is R bo = {r 5 , r 6 }, and η(r 5 ) is

on_left(ball,paddle),on_left(ball, paddle),

且δ(r 5)为 and δ(r 5 ) is

{move_left},{move_left},

上述规则表示当球在球拍左方,左移。The above rule means that when the ball is on the left side of the racket, move left.

η(r 6)为 η(r 6 ) is

on_right(ball, paddle),

and δ(r_6) is

{move_right}.

The results for Breakout with the acceleration rules are shown in Fig. 9; the rule-intervention-based learning of the present invention improves markedly over the conventional DQN.

The experimental results of the above three embodiments are summarized in Table 1, showing that acceleration rule sets effectively improve learning efficiency.

Table 1: Comparison of experimental results

(Table 1 is provided as image PCTCN2021074974-appb-000007 in the original publication.)

Referring to Fig. 10, the GridWorld game contains a destination region, unreachable walls (the black cells), and traps. Once the agent falls into a trap the game ends and the agent is heavily penalized. The goal is to find the shortest path to the destination without falling into a trap. For example, the agent receives a reward of -1 for every move, -600 for falling into a trap, and 100 for reaching the destination.

In this embodiment, a safety rule is used to keep the agent safe during training. Safety rules prevent the agent from making catastrophic decisions with irreversible consequences. Unlike the acceleration rules above, the safety rule remains in effect throughout training.

In this embodiment, a single safety rule is used in the knowledge base, denoted R_gw = {r_7}, where η(r_7) is

near_trap ∧ trap_in(directions),

and δ(r_7) is:

A - {move(dir) : dir ∈ directions}

The above rule states that when the agent is adjacent to a trap, it selects an action that excludes the directions leading into the trap.

For example, if the available actions are "up", "down", "left", and "right" and the trap is to the agent's right, one of the other three actions is selected.

Here A is the set of all actions. The rule simply prevents the agent from stepping into a trap, and it is mandatory.
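A minimal sketch of this mandatory safety masking is given below; the reward values follow the example above, while the state encoding, predicate names, and the uniform choice among the remaining actions are assumptions made for illustration.

```python
import random

A = {"up", "down", "left", "right"}                     # the set of all actions
STEP_REWARD, TRAP_REWARD, GOAL_REWARD = -1, -600, 100   # rewards used in the example above

def safe_actions(trap_directions):
    # Rule r7: next to a trap, only A - {move(dir) : dir in trap_directions} is allowed.
    return A - set(trap_directions)

def apply_safety_rule(decided_action, near_trap, trap_directions):
    # Mandatory rule: always in effect, overriding any decided action that would enter a trap.
    if near_trap and decided_action in set(trap_directions):
        return random.choice(sorted(safe_actions(trap_directions)))
    return decided_action
```

For instance, with the trap to the agent's right, apply_safety_rule("right", True, {"right"}) returns one of "up", "down", or "left".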

The GridWorld experimental results are shown in Fig. 11, which indicates that even in the initial stage of training, the rule-intervention learning (RIL) of the present invention performs much better than the conventional DQN. Because of the safety rule, the agent never takes a catastrophic action, which guarantees its safety. Using safety rules to avoid catastrophic decisions is particularly effective for cold-start problems.

In summary, by combining high-abstraction-level rules with deep reinforcement learning, the present invention can guide the agent toward more "correct" behavior in specific scenarios. Moreover, the rule set can be tailored to the specific application scenario, so that while the generality of the model is preserved, individual requirements for safety, training time, and learning efficiency can be better satisfied. The present invention therefore shortens training time, avoids catastrophic decisions, and can be widely applied in the field of dynamic decision-making.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.

The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction-execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions so as to implement various aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data-processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data-processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data-processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions which implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, other programmable data-processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.

Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present invention is defined by the appended claims.

Claims (8)

1. A decision-making method based on deep reinforcement learning, comprising the following steps: the agent makes a decision according to environment information and selects the decided action; the agent compares the decided action with a knowledge base and, based on a set rule set in the knowledge base, decides whether to replace the decided action with a random action from the rule set; when it is determined to replace the decided action, the replaced action is executed in the environment, a reward and new environment information are obtained from the environment, and the old environment information, the action, the reward, and the new environment information are combined into a piece of experience information and stored in an experience replay pool; and a set amount of experience information is randomly selected from the experience replay pool to update the deep reinforcement learning model, thereby guiding the next iteration.

2. The method according to claim 1, wherein deciding, according to the set rule set in the knowledge base, whether to replace the decided action with a random action from the rule set comprises: judging whether the rule set in the knowledge base satisfies a predetermined condition; and, when the set condition is satisfied, replacing the decided action with a random action from the rule set with a set probability.

3. The method according to claim 2, wherein, when the set condition is satisfied, the decided action is replaced, with probability P_t = p_0·γ^t, by a random action from the compliant action set α(R, t), where p_0 is the initial rule intervention probability, t is the running time, γ is the decay rate, R denotes the rule set, and α denotes all actions that comply with the rule set R at time t.

4. The method according to claim 1, wherein the rule set is set according to the decision application scenario, with the goal of avoiding catastrophic decisions or improving learning efficiency, and is used to guide the agent's actions in that application scenario.

5. The method according to claim 1, wherein combining the old environment information, the action, the reward, and the new environment information into one piece of experience information and storing it in the experience replay pool comprises: after the new environment information is obtained, storing one unit of experience information (φ(s_t), a_t, r_t, φ(s_{t+1})) in the experience replay pool D; and, if storing new experience information causes the capacity of the experience pool to exceed a set threshold N, deleting earlier experience information with reference to the storage time.
6. The method according to claim 5, wherein randomly selecting a set amount of experience information from the experience replay pool to update the deep reinforcement learning model comprises: at each step t of each round of interaction between the agent and the environment, randomly selecting a certain amount of experience information (φ(s_j), a_j, r_j, φ(s_{j+1})) from the experience replay pool D and computing the value y_j of each piece of experience information at the current time j:

(the expression for y_j is given as image PCTCN2021074974-appb-100001 in the original publication)

performing gradient descent with (y_j - Q(φ(s_j), a_j; θ))^2 as the objective function to optimize the neural network parameters θ; and finally, every fixed number of steps C, synchronizing the target action-value function Q* with the action-value function Q; where a' denotes the optional actions at time j+1, a_j denotes the action at time j, s_j and s_{j+1} denote the environment information at times j and j+1, respectively, and φ denotes the preprocessing process.
7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.

8. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 6.
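For illustration only, the sketch below shows one possible implementation of the loop described in claims 1, 5, and 6. PyTorch, the epsilon-greedy behaviour policy, the environment interface (reset and step), the handling of terminal states, and all identifiers are assumptions introduced here; the target value y_j is written as the standard DQN target because the corresponding expression appears only as an image in the original text.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

def train(env, compliant_actions, obs_dim, n_actions, steps=10000, N=10000,
          batch=32, gamma_q=0.99, C=500, p0=1.0, gamma_rule=0.8, eps=0.1, lr=1e-3):
    q, q_target = QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
    q_target.load_state_dict(q.state_dict())         # target action-value function Q*
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    D = deque(maxlen=N)                               # experience replay pool with capacity N (FIFO)
    s = env.reset()
    for t in range(steps):
        # Decide an action from the current Q estimate (epsilon-greedy assumed).
        a = random.randrange(n_actions) if random.random() < eps else \
            int(q(torch.as_tensor(s, dtype=torch.float32)).argmax())
        # Compare with the knowledge base: with probability P_t = p0 * gamma_rule**t,
        # replace the decided action by a random rule-compliant action.
        allowed = compliant_actions(s)                # set of action indices permitted by the rules
        if allowed and random.random() < p0 * gamma_rule ** t:
            a = random.choice(sorted(allowed))
        # Execute the action, observe reward and new state, store the experience tuple.
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next, float(done)))
        s = env.reset() if done else s_next
        # Randomly select a batch of experience and update the network parameters theta.
        if len(D) >= batch:
            sj, aj, rj, sj1, dj = map(list, zip(*random.sample(D, batch)))
            sj = torch.as_tensor(sj, dtype=torch.float32)
            sj1 = torch.as_tensor(sj1, dtype=torch.float32)
            rj = torch.as_tensor(rj, dtype=torch.float32)
            dj = torch.as_tensor(dj, dtype=torch.float32)
            aj = torch.as_tensor(aj, dtype=torch.int64)
            with torch.no_grad():                     # y_j = r_j + gamma * max_a' Q*(s_{j+1}, a')
                yj = rj + gamma_q * (1.0 - dj) * q_target(sj1).max(dim=1).values
            qj = q(sj).gather(1, aj.view(-1, 1)).squeeze(1)
            loss = ((yj - qj) ** 2).mean()            # objective (y_j - Q(phi(s_j), a_j; theta))^2
            opt.zero_grad(); loss.backward(); opt.step()
        # Every C steps, synchronize the target network Q* with Q.
        if t % C == 0:
            q_target.load_state_dict(q.state_dict())
    return q
```

The compliant_actions argument plays the role of the knowledge-base lookup: it returns the action indices permitted by the rule set in the current state, and passing a function that always returns the empty set reduces the loop to a conventional DQN.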