KR102709703B1

KR102709703B1 - Apparatus and method for designing reinforcement learning based on game behavior replication

Info

Publication number: KR102709703B1
Application number: KR1020230134969A
Authority: KR
Inventors: 조경은; 최태혁; 강경원
Original assignee: 동국대학교 산학협력단
Priority date: 2023-10-11
Filing date: 2023-10-11
Publication date: 2024-09-26
Anticipated expiration: 2043-10-11

Abstract

The present invention relates to behavior replication technology and, more particularly, to a design device that improves learning performance through reinforcement learning based on game behavior replication. According to an embodiment of the present invention, when policy-based reinforcement learning is applied using preprocessing techniques and a behavior replication algorithm, the learning-time and state-space problems that arise during learning are optimized, so that reinforcement learning can be applied in real time in a complex game environment such as a ball sports game with offense-defense transitions. In addition, if the episodes and reward function are designed to suit another complex game environment, together with the preprocessing techniques of the present invention, the invention can also be applied to a variety of games.

Description

{APPARATUS AND METHOD FOR DESIGNING REINFORCEMENT LEARNING BASED ON GAME BEHAVIOR REPLICATION}

The present invention relates to behavior replication technology and, more specifically, to a design device that improves learning performance through reinforcement learning based on game behavior replication.

Research on game artificial intelligence (AI) has recently been active. Many commercial games use AI based on a Finite State Machine (FSM). However, FSM-based AI performs only uniform actions in the same situation, so user satisfaction is low. For this reason, a variety of studies apply reinforcement learning to AI that replaces Non-Player Characters (NPCs) or players.

Reinforcement learning algorithms can generally be classified into value-function-based methods, such as Q-Learning and DQN (Deep Q-Network), and policy-based methods, such as A2C and A3C. Q-Learning evaluates the value function through a Q-table over the states provided by the environment and the actions the agent selects in them. It therefore requires a large amount of memory as the agent's state and action spaces grow, and it learns poorly when earlier observations influence later ones. DQN predicts the value function by combining Q-Learning with a deep neural network and resolves the data-correlation problem of Q-Learning by using a replay buffer and a target network, but it requires a large replay buffer and a long training time. A2C and A3C are policy-based reinforcement learning algorithms in which the agent learns while interacting with the environment in real time, which enables faster learning and convergence. They also learn smoothly in continuous action spaces, so in environments with many kinds of actions they can learn more accurately than value-function-based reinforcement learning.
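The memory cost of a Q-table mentioned above can be illustrated with a minimal sketch of one tabular Q-learning update. The function name, the state and action labels, and the values of `alpha` and `gamma` are assumptions chosen for illustration and do not come from the patent.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    # Greedy estimate of the value of the next state.
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical states and actions; every (state, action) pair that is touched
# gets its own table entry, which is why memory grows with both spaces.
q_table = defaultdict(float)
actions = ["move", "shoot"]
new_q = q_learning_update(q_table, "s0", "shoot", 1.0, "s1", actions)
```

Even this single update creates three table entries, one per (state, action) pair it reads or writes.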

The behavior replication (behavior cloning) algorithm is a representative imitation learning algorithm. It extracts states and actions from collected expert game data and uses them as training data. On this basis, an AI can be trained to behave similarly to the expert's policy.
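As a minimal sketch of this idea, the example below fits a tabular policy to expert (state, action) pairs by picking the most frequent expert action in each state. The tabular model, the function name, and the demonstration data are assumptions for illustration; the patent does not specify the model used for behavior replication.

```python
from collections import Counter, defaultdict


def fit_behavior_cloning(demonstrations):
    """Return a state -> action policy that imitates the expert's most common choice."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    # For a tabular categorical model, the most frequent action is the
    # maximum-likelihood prediction of the expert's action.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Hypothetical expert data extracted from game logs.
demos = [
    ("open_near_rim", "shoot"),
    ("open_near_rim", "shoot"),
    ("open_near_rim", "pass"),
    ("marked", "pass"),
]
policy = fit_behavior_cloning(demos)
```

In practice a neural network trained with a cross-entropy loss would usually take the place of the count table when states are continuous.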

Reinforcement learning is an artificial intelligence algorithm that solves problems through the rewards an agent obtains by interacting with its surrounding environment. Because it learns without a defined environment model, it is well suited to complex game environments, but training requires a great deal of time and cost. A design method is therefore needed that reduces the time and cost of reinforcement learning so that it can be applied in real time in complex game environments.

The background art of the present invention is disclosed in Republic of Korea Patent Publication No. 2020-0108718.

The purpose of the present invention is to provide a design method and device for policy-based reinforcement learning that can be applied in real time in a complex game environment with offense-defense transitions, such as a ball sports game. The design uses preprocessing techniques to reduce the complexity of learning and a behavior replication algorithm to minimize trial and error in the early stage of learning.

To achieve the above purpose, the present invention uses three preprocessing techniques and a behavior replication algorithm to solve the learning-time and state-space problems that arise when reinforcement learning is applied to a ball sports game with complex offense-defense transitions.

First, to reduce the amount of learning that grows with all the raw data generated in a ball sports game environment with offense-defense transitions, the raw data is expressed as normalized relative information. This shrinks the state space of reinforcement learning and expresses the raw data of an unstable game environment in a form suitable for learning, which enables stable learning.

Second, because defining the reward function for reinforcement learning in a ball sports game with offense-defense transitions raises various difficulties, the reward is designed by considering the main actions related to offense and defense in the complex environment and how frequently those actions occur.

Third, because learning a variety of episodes at once increases learning complexity, episodes are divided by situation. The representative situations in a ball sports game with offense-defense transitions are the offense situation, in which the own team has the ball; the defense situation, in which the enemy team has the ball; and the other situation, in which no one has the ball. Learning each situation separately reduces learning complexity.

In addition, when reinforcement learning starts from an arbitrary policy, a great deal of trial and error occurs in the early stage of learning, and this lowers learning performance. To improve on this, a behavior replication algorithm, which trains the AI on states and actions extracted from expert game data, is applied to the reinforcement learning algorithm to raise learning performance. The present invention thus seeks to provide a game behavior replication-based reinforcement learning design device and method that can be applied in real time in complex game environments with offense-defense transitions, such as ball sports games.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by a person having ordinary skill in the technical field to which the present invention belongs.

According to one aspect of the present invention, a game behavior replication-based reinforcement learning design device is provided.

A game behavior replication-based reinforcement learning design device according to one embodiment of the present invention may include a preprocessing unit that performs preprocessing for learning by normalizing the state representation in the game environment, evaluating actions, and classifying episodes; a reinforcement learning unit that performs reinforcement learning based on the preprocessing; and an action decision unit that receives a state, determines an action, and applies it to the game.

According to another aspect of the present invention, a game behavior replication-based reinforcement learning design method and a computer program for executing it are provided. A game behavior replication-based reinforcement learning design method according to one embodiment of the present invention may include a step of performing preprocessing for learning by normalizing the state representation in the game environment, evaluating actions, and classifying episodes; a step of performing reinforcement learning based on the preprocessing; and a step of receiving a state, determining an action, and applying it to the game.

According to an embodiment of the present invention, when policy-based reinforcement learning is applied using preprocessing techniques and a behavior replication algorithm, the learning-time and state-space problems that arise during learning are optimized, so that reinforcement learning can be applied in real time in a complex game environment such as a ball sports game with offense-defense transitions. In addition, if the episodes and reward function are designed to suit another complex game environment, together with the preprocessing techniques of the present invention, the invention can also be applied to a variety of games.

The effects of the present invention are not limited to those described above and should be understood to include all effects that can be inferred from the configuration of the invention described in the description or claims.

FIG. 1 is a diagram for explaining a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.
FIG. 2 is a diagram for explaining the preprocessing unit of a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.
FIG. 3 is a diagram explaining the structure of a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.
FIG. 4 is a diagram for explaining the normalization process for agent position state values in a game behavior replication-based reinforcement learning design device applied to a basketball game, which is one embodiment of the present invention.
FIGS. 5 and 6 are examples comparing a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention with a conventional reinforcement learning device.
FIG. 7 is an example for explaining the experimental results of a traditional FSM-based AI and the game behavior replication-based reinforcement learning design device.
FIG. 8 is a diagram for explaining the average number of successful actions per game of a traditional FSM-based AI and the game behavior replication-based reinforcement learning design device.
FIG. 9 is a diagram for explaining a game behavior replication-based reinforcement learning design method according to one embodiment of the present invention.

The present invention can be modified in various ways and can have various embodiments. A specific embodiment, a basketball game, is illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to a specific embodiment, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the present invention, detailed descriptions of related known technology are omitted where they could unnecessarily obscure the gist of the invention. In addition, singular expressions used in this specification and the claims should generally be interpreted to mean "one or more" unless otherwise stated.

Throughout the specification, when a part is said to be "connected (joined, in contact, coupled)" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another member in between. Also, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, the present invention is described with reference to the attached drawings. The present invention can, however, be implemented in many different forms and is therefore not limited to the embodiments described here. To describe the present invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

FIG. 1 is a diagram for explaining a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.

Referring to FIG. 1, the game behavior replication-based reinforcement learning design device includes a preprocessing unit (110), a reinforcement learning unit (120), and an action decision unit (130).

The preprocessing unit (110) performs preprocessing for learning by normalizing the state representation in the game environment, evaluating actions, and classifying episodes. The preprocessing unit (110) may include a normalization module for state representation, a reward module for calculating the reward function, and an episode classification module for classifying episodes. The preprocessing unit (110) is described in more detail with reference to FIG. 2 below.

The reinforcement learning unit (120) can perform reinforcement learning based on the preprocessing performed by the preprocessing unit (110). The episode classification module classifies the states, actions, and rewards of the entire episode into offense, loose-ball, and defense situations and passes them to the A2C (Advantage Actor-Critic) module corresponding to each situation. The reinforcement learning unit (120) then performs policy-based reinforcement learning for each classified episode, using the preprocessing and imitating expert actions. The policy that imitates expert actions is applicable to offense, defense, and other situations alike. That is, to solve the learning-time and state-space problems that arise when reinforcement learning is applied in a complex game environment with offense-defense transitions, such as a ball sports game, the reinforcement learning unit (120) uses the normalization module, the reward module, the episode classification module, and a behavior replication algorithm that can imitate expert actions.

The pseudo-code of behavior replication-based reinforcement learning proceeds as follows. First, the learning parameter (updated through the policy loss), a second parameter (updated through the value loss), and the update buffer are initialized. Learning is then repeated as follows until the return value converges. Using the current state, an action is obtained from the policy, and the reward is determined by the state and the action. An expert action is obtained from the expert policy, and the state, action, expert action, and reward are recorded in the update buffer. When the episode ends, the learning parameter is updated through the policy loss function and the second parameter is updated through the value loss function. Once the update for one episode is finished, the update buffer is reset, and the process is repeated until the return value converges.
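The loop described above can be sketched as follows on a toy problem. This is an illustration under stated assumptions, not the patent's implementation: the tabular policy `theta` and value table `phi`, the toy environment `step`, the hypothetical `expert_policy`, and all hyperparameters are invented for the example. In particular, the text does not say how the expert action enters the policy loss; this sketch assumes a cross-entropy imitation term weighted by `bc_weight` added to the advantage-weighted policy gradient.

```python
import numpy as np

N_STATES, N_ACTIONS, EPISODE_LEN = 3, 2, 5


def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()


def expert_policy(state):
    # Hypothetical expert: always picks action (state mod N_ACTIONS).
    return state % N_ACTIONS


def step(state, action, rng):
    # Toy environment: reward 1 when the action matches the expert's choice.
    reward = 1.0 if action == expert_policy(state) else 0.0
    return int(rng.integers(N_STATES)), reward


def train(episodes=300, lr_policy=0.2, lr_value=0.1, gamma=0.5,
          bc_weight=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))  # policy parameters (logits)
    phi = np.zeros(N_STATES)                 # value parameters
    buffer = []                              # update buffer
    for _ in range(episodes):
        state = int(rng.integers(N_STATES))
        for _ in range(EPISODE_LEN):
            action = int(rng.choice(N_ACTIONS, p=softmax(theta[state])))
            next_state, reward = step(state, action, rng)
            buffer.append((state, action, expert_policy(state), reward))
            state = next_state
        # Episode ended: walk the buffer backwards to build discounted returns.
        ret = 0.0
        for s, a, a_exp, r in reversed(buffer):
            ret = r + gamma * ret
            advantage = ret - phi[s]
            probs = softmax(theta[s])
            # d/dlogits of log pi(a|s) is onehot(a) - probs.
            grad_pg = (np.eye(N_ACTIONS)[a] - probs) * advantage
            grad_bc = np.eye(N_ACTIONS)[a_exp] - probs  # assumed imitation term
            theta[s] += lr_policy * (grad_pg + bc_weight * grad_bc)
            phi[s] += lr_value * advantage  # step on 0.5 * (ret - phi[s]) ** 2
        buffer.clear()  # reset the update buffer for the next episode
    return theta, phi


theta, phi = train()
```

A fixed episode count stands in for the "until the return value converges" condition; a practical version would monitor a running average of the return instead.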

The action decision unit (130) can receive a state, determine an action, and apply it to the game. That is, the actions learned through reinforcement learning can be applied to and reflected in a basketball game, for example in defense, offense, and loose-ball situations.

FIG. 2 is a diagram for explaining the preprocessing unit of a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.

Referring to FIG. 2, the preprocessing unit (110) may include a state expression unit (111), a reward unit (113), and an episode classification unit (115).

The state expression unit (111) provides environment information to the agent at a specific time. To stabilize the state, it can use a normalization technique based on relative state values. The state expression unit (111) can use a normalization module to express the state; the normalization module receives only the state from the game and normalizes it. Taking the state values expressed as absolute values as the reference, the normalization module converts them to relative values and normalizes them so that the maximum value is 1. In a ball sports game, the state values can include the agent position, the positions of own-team and enemy-team players, the ball position, the distance to the ball, the ball-clear condition, the remaining game time, the team score, and so on, and positions can be expressed in two or three dimensions. For example, the states can be expressed as in [Table 1].

[Table 1]
State | Formula
AI position |
Team relative position |
Enemy relative position |
Ball relative position |
Ball height (absolute value) |
Rim relative position |
Enemy relative angle |
Team relative angle |
Mark relative angle |
Ball distance |
Rim distance |
Ball clear status |
Pivot status |
Remaining game time |
Remaining violation time |
Team score |

Here, team(AI, n) denotes the n-th member of the AI's team, enemy(AI, m) denotes the m-th member of the opposing team, and mark(AI) denotes the mark state. Position refers to the positions of the characters, the basketball, and the rim. Each character can treat its own position as the origin and express the other characters, the basketball, and the rim as relative positions. Angle is not a state provided by the environment, but it can be computed and used as a state value; it serves as the reference value for judging whether a character is open. Distance is measured in meters, for example, and the distance from the reference object to the other object divided by the maximum allowable distance of 10 m can be used as the state. The conditions can be whether the AI's team has cleared the ball, ball_clear(AI), and whether the AI is pivoting, pivot(AI); each condition takes a value of 0 or 1. Time is measured in seconds and comprises the time remaining until the end of the basketball game and the violation time. For example, these are divided by their maximum values of 240 seconds and 20 seconds, respectively, to normalize them to values between 0 and 1. Score is the current score of the AI's team, and the value divided by the maximum allowable score of 30 points can be used as the state.
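The normalization rules in this paragraph can be sketched as below. The constants 10 m, 240 s, 20 s, and 30 points come from the text; the function names, the dictionary keys, the 2D tuple representation of positions, and the clipping of distances beyond 10 m to 1.0 are assumptions made for illustration. How the relative position itself is scaled is not stated in the text, so it is left as a raw offset here.

```python
import math

MAX_DISTANCE_M = 10.0        # maximum allowable distance
MAX_GAME_TIME_S = 240.0      # maximum remaining game time
MAX_VIOLATION_TIME_S = 20.0  # maximum violation time
MAX_SCORE = 30.0             # maximum allowable score


def relative_position(origin, target):
    """Express target with the agent's own position as the origin (2D)."""
    return (target[0] - origin[0], target[1] - origin[1])


def normalized_distance(origin, target):
    dist = math.hypot(target[0] - origin[0], target[1] - origin[1])
    # Clipping to 1.0 is an assumption; the text only gives the divisor.
    return min(dist / MAX_DISTANCE_M, 1.0)


def normalize_state(agent_pos, ball_pos, remaining_time_s, violation_time_s,
                    team_score, ball_clear, pivot):
    return {
        "ball_relative_position": relative_position(agent_pos, ball_pos),
        "ball_distance": normalized_distance(agent_pos, ball_pos),
        "remaining_time": remaining_time_s / MAX_GAME_TIME_S,
        "violation_time": violation_time_s / MAX_VIOLATION_TIME_S,
        "team_score": team_score / MAX_SCORE,
        "ball_clear": 1.0 if ball_clear else 0.0,  # condition: 0 or 1
        "pivot": 1.0 if pivot else 0.0,            # condition: 0 or 1
    }


state = normalize_state(agent_pos=(1.0, 2.0), ball_pos=(4.0, 6.0),
                        remaining_time_s=120.0, violation_time_s=5.0,
                        team_score=15, ball_clear=True, pivot=False)
```

With these inputs the ball is 5 m from the agent, so `ball_distance` is 0.5, and the time and score entries all fall between 0 and 1.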

The reward unit (113) assigns a value to the agent's action at a specific point in time. It can apply reward values to shooting, passing, breakthrough, marking, evasion, ball clear, ball pickup, and under-the-basket situations, and in breakthrough, marking, evasion, and ball pickup situations it can apply weights based on frequency of occurrence. For action evaluation, the reward unit (113) can use a reward module that designs the reward function from predefined detailed rewards. So that a reinforcement learning algorithm can be applied to a complex game environment with offense-defense transitions, such as a ball sports game, the reward module designs the reward by considering the main actions required on offense and defense and how often they occur. The reward can be defined in terms of shooting, passing, breakthrough, marking, evasion, and so on, and is calculated by summing all the rewards that occurred. For example, the terms can be calculated as follows, where each condition takes the value 0.0 or 1.0:
Shot: open × within effective shot range × shot taken
Pass: pivoting × pass made
Breakthrough: within effective breakthrough range × breakthrough made × weight (0.5)
Mark: mark angle within 30 degrees × mark distance within 2 m × weight (0.002)
Evasion: open × within effective shot range × weight (0.01)
Ball clear: ball clear succeeded
Ball pickup: ball pickup succeeded - moved in a direction away from the ball × weight (0.002)
Under the basket: rim distance within 2 m × weight (0.001)
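The reward terms above can be sketched as a single function over 0.0/1.0 condition flags. The weights are those given in the text; the flag names and the function name are hypothetical. Two readings are assumed because the operators are missing from the source text: the conditions within each term are multiplied together, and in the ball pickup term the weight applies only to the wrong-direction penalty.

```python
FLAG_NAMES = (
    "open", "in_shot_range", "shot_taken", "pivot", "pass_made",
    "in_breakthrough_range", "breakthrough_made", "mark_angle_ok",
    "mark_distance_ok", "ball_clear_success", "pickup_success",
    "moved_away_from_ball", "near_rim",
)


def compute_reward(f):
    """Sum the detailed reward terms; f maps each flag name to 0.0 or 1.0."""
    shot = f["open"] * f["in_shot_range"] * f["shot_taken"]
    pass_term = f["pivot"] * f["pass_made"]
    breakthrough = f["in_breakthrough_range"] * f["breakthrough_made"] * 0.5
    mark = f["mark_angle_ok"] * f["mark_distance_ok"] * 0.002
    evasion = f["open"] * f["in_shot_range"] * 0.01
    ball_clear = f["ball_clear_success"]
    # Assumed precedence: success minus (wrong-direction flag times weight).
    pickup = f["pickup_success"] - f["moved_away_from_ball"] * 0.002
    under_basket = f["near_rim"] * 0.001
    return (shot + pass_term + breakthrough + mark + evasion
            + ball_clear + pickup + under_basket)


# An open shot taken inside the effective range: shot term plus evasion term.
flags = dict.fromkeys(FLAG_NAMES, 0.0)
flags.update(open=1.0, in_shot_range=1.0, shot_taken=1.0)
open_shot_reward = compute_reward(flags)

# Moving away from a loose ball without picking it up yields a small penalty.
penalty_flags = dict.fromkeys(FLAG_NAMES, 0.0)
penalty_flags["moved_away_from_ball"] = 1.0
wrong_way_reward = compute_reward(penalty_flags)
```

The small weights on marking, evasion, and under-the-basket terms keep frequently occurring conditions from dominating rare but decisive events such as a successful shot.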

The episode classification unit (115) can classify at least one episode for learning. To reduce learning complexity, it performs preprocessing that classifies the states, actions, and rewards of the entire episode into offense, defense, and other situations so that reinforcement learning can be performed on each. Classification is based on ball possession: offense if the own team has the ball, defense if the enemy team has the ball, and other if no one has the ball. Reinforcement learning can then be performed with different states, rewards, and actions depending on the classified episode; that is, the states, rewards, and actions used in learning differ among the offense, loose-ball, and defense situations. For example, an offense episode can be represented with 11 actions consisting of 8-direction movement, shooting, breakthrough, and passing. Because this reduces the number of actions, the learning space shrinks, which has the advantage of raising learning performance with the same episodes.
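The possession-based classification rule can be sketched as follows. The rule itself (own team has the ball: offense; enemy team has the ball: defense; no one has the ball: other) is from the text, while the `Situation` enum, the function names, and the dictionary layout of a transition are assumptions for illustration.

```python
from enum import Enum


class Situation(Enum):
    OFFENSE = "offense"
    DEFENSE = "defense"
    OTHER = "other"  # no one has the ball (loose ball)


def classify_situation(ball_owner_team, own_team):
    """Classify one step by ball possession; None means nobody has the ball."""
    if ball_owner_team is None:
        return Situation.OTHER
    if ball_owner_team == own_team:
        return Situation.OFFENSE
    return Situation.DEFENSE


def split_by_situation(transitions, own_team):
    """Group recorded transitions so each situation can be learned separately."""
    groups = {situation: [] for situation in Situation}
    for t in transitions:
        groups[classify_situation(t["ball_owner_team"], own_team)].append(t)
    return groups


# Hypothetical recorded transitions for team "A".
transitions = [
    {"ball_owner_team": "A", "action": "shoot"},
    {"ball_owner_team": None, "action": "move_north"},
    {"ball_owner_team": "B", "action": "mark"},
    {"ball_owner_team": "A", "action": "pass"},
]
groups = split_by_situation(transitions, own_team="A")
```

Each group would then be passed to its own learner, so that, for instance, the offense learner only has to choose among the 11 offense actions.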

FIG. 3 is a diagram explaining the structure of a game behavior replication-based reinforcement learning design device according to one embodiment of the present invention.

Referring to FIG. 3, the preprocessing stage of the game behavior replication-based reinforcement learning design device includes a normalization module for representing the state, a reward module for calculating the reward function, and an episode classification module for classifying episodes. The device then performs policy-based reinforcement learning for each classified episode type, using the preprocessed data and imitating expert behavior. In FIG. 4, the policy that imitates expert behavior is applied to defense, but it can also be applied to offense and other situations.

The normalization module converts state values expressed as absolute values into relative values and normalizes them so that the maximum value is 1. In a ball sports game, the state values include the agent's position, the positions of teammates and opponents, the position of the ball, the distance to the ball, the ball-clear condition, the remaining game time, and the team scores; positions can be expressed in two or three dimensions. The ball-clear condition takes a value of 0 or 1 indicating whether the agent's own team has cleared the ball, time is normalized by the maximum time, and scores can be normalized by the maximum allowed score.

The reward module designs the reward so that a reinforcement learning algorithm can be applied to a complex game environment with offense-defense transitions, such as a ball sports game, taking into account the key actions required in offense and defense and how frequently those actions occur.

The episode classification module reduces learning complexity by classifying the states, actions, and rewards of the entire episode into offense, defense, and other situations before reinforcement learning is performed. Classification is based on ball possession: if the agent's own team has the ball, the situation is offense; if the opposing team has the ball, it is defense; and if no one has the ball, it is other. The states, rewards, and actions used in learning differ according to the classified episode type.

Applying reinforcement learning to a complex game environment with offense-defense transitions, such as a ball sports game, raises learning-time and state-space problems. The preprocessing-based reinforcement learning process addresses these problems by combining the normalization module, the reward module, and the episode classification module with a behavior replication algorithm that imitates expert behavior. Once preprocessing-based reinforcement learning has been performed, the result can be applied to the game.
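The possession-based classification rule can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the implementation of the device: the names `Episode`, `classify_episode`, and `split_by_episode`, the team labels, and the tuple layout of a transition are all hypothetical.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Optional, Tuple


class Episode(Enum):
    OFFENSE = "offense"
    DEFENSE = "defense"
    OTHER = "other"  # loose ball: no team has possession


# Hypothetical transition layout: (ball_owner, state, action, reward).
Transition = Tuple[Optional[str], tuple, int, float]


def classify_episode(ball_owner: Optional[str], own_team: str) -> Episode:
    """Classify a situation by ball possession, as described in the text."""
    if ball_owner is None:
        return Episode.OTHER
    if ball_owner == own_team:
        return Episode.OFFENSE
    return Episode.DEFENSE


def split_by_episode(
    transitions: Iterable[Transition], own_team: str
) -> Dict[Episode, List[Transition]]:
    """Group recorded transitions so each episode type can be learned separately."""
    buckets: Dict[Episode, List[Transition]] = defaultdict(list)
    for transition in transitions:
        buckets[classify_episode(transition[0], own_team)].append(transition)
    return dict(buckets)
```

Splitting the data this way means each learner only sees the states, actions, and rewards relevant to its own situation, which is the complexity reduction the paragraph describes.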

FIG. 4 is a diagram illustrating the normalization of agent position state values in the game behavior replication-based reinforcement learning design device, using a basketball game as one embodiment of the present invention.

Referring to FIG. 4, the normalization process for the agent's position state values is shown. As shown in FIG. 4, the blue circle represents the reference agent, and the red circle represents the position of the opposing team. Positions are first expressed as coordinates relative to the blue circle and then normalized by the size of the court. The AI's own position is expressed relative to the center of the court. Positions are expressed in two dimensions, except that height can also be considered for the ball. For example, the height of the ball is measured from 0, with a maximum allowed height of 5 m.
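The relative-coordinate normalization can be sketched as follows. This is a hedged illustration, not the device's actual code: the function names are hypothetical, the 28 m by 15 m court size in the usage note is an assumed example value, and clamping the ball height to the allowed range is an added assumption.

```python
from typing import Tuple

Vec2 = Tuple[float, float]


def normalize_relative(position: Vec2, reference: Vec2, scale: Vec2) -> Vec2:
    """Express `position` relative to `reference`, then divide by `scale`.

    With `scale` equal to the court size, the result of comparing any two
    points on the court stays within [-1, 1] on each axis.
    """
    return (
        (position[0] - reference[0]) / scale[0],
        (position[1] - reference[1]) / scale[1],
    )


def normalize_ball_height(height_m: float, max_height_m: float = 5.0) -> float:
    """Scale the ball height to [0, 1], using the 5 m maximum from the example."""
    clamped = min(max(height_m, 0.0), max_height_m)  # clamping is an assumption
    return clamped / max_height_m
```

For example, with an assumed court of 28 m by 15 m, an opponent at (21.0, 7.5) seen from an agent at (7.0, 7.5) would be encoded as `normalize_relative((21.0, 7.5), (7.0, 7.5), (28.0, 15.0))`, that is, half a court length ahead and level with the agent.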

FIGS. 5 and 6 are examples comparing the game behavior replication-based reinforcement learning design device according to one embodiment of the present invention with a conventional reinforcement learning device.

FIG. 5(a) shows the experimental results of conventional A2C without behavior replication, as a graph of the number of evasion reward occurrences and the number of mark reward occurrences. FIG. 5(b) shows the learning results of A2C with behavior replication, as a graph of the same two counts. In offense episodes, which have no expert policy, the evasion reward counts of conventional A2C and of the game behavior replication-based reinforcement learning design device according to one embodiment of the present invention can be similar. In contrast, in defense episodes, where an expert policy exists and behavior replication is applied, the average over the last 100 data points at the end of learning is, for example, 224 reward occurrences for conventional A2C and 428 for the device, so the device achieves about 1.9 times as many successful marks as conventional A2C.

Because the evasion and mark counts fluctuate, the results of the device and of conventional A2C can be compared using linear trend lines anchored at 0. The red line in FIG. 5(b) is the trend line of the mark reward count for the device, and the red line in FIG. 5(a) is the trend line of the mark reward count for conventional A2C. The slope is about 1.47 for the device and about 0.68 for conventional A2C, so the device's slope is about 2.1 times as steep. This indicates that, compared with conventional A2C, the device according to one embodiment of the present invention selects and performs actions that generate relatively more reward.
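The trend-line comparison can be reproduced in outline as follows. The text does not say how the trend line is fitted; the sketch assumes that a linear trend line based on 0 means a least-squares line forced through the origin, and `slope_through_origin` is a hypothetical helper name.

```python
from typing import Sequence


def slope_through_origin(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of y = k * x (no intercept term)."""
    denominator = sum(x * x for x in xs)
    if denominator == 0:
        raise ValueError("at least one non-zero x value is required")
    return sum(x * y for x, y in zip(xs, ys)) / denominator


# Ratio of the reported last-100 averages (428 vs. 224 reward occurrences).
mark_ratio = 428 / 224  # about 1.9, matching the figure quoted in the text
```

Here `xs` would be the training iteration index and `ys` the mark reward count at that iteration; a steeper slope means the reward count grows faster as learning proceeds.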

FIG. 6 shows example results of experiments against a conventional FSM-based AI. FIG. 6(a) is the scoring graph of the game behavior replication-based reinforcement learning design device according to one embodiment of the present invention, and FIG. 6(b) is the scoring graph of the conventional FSM-based AI. FIG. 6(c) shows the cumulative win/draw/loss record, obtained by accumulating +1 for a win, +0 for a draw, and -1 for a loss from the perspective of the behavior replication-based A2C AI. FIG. 6(d) shows the total reward per game obtained by the device's AI during learning.

For example, over all games the device scores about 0.43 points more per game on average. Over the first 1,000 games, in which the FSM-based AI has the higher win rate, the FSM-based AI scores about 0.4 points more on average. Over the last 1,000 games, in which the device's AI has the higher win rate, the proposed method scores about 0.83 points more on average.

In the cumulative win/draw/loss record, the device loses relatively often early in learning, but as learning progresses, losses decrease and the share of wins increases. The total reward is likewise low at first, then gradually increases and converges to a steady value. In addition, for example, in the last 1,000 games the device recorded 512 wins, 146 draws, and 342 losses, a win rate of about 60% (512 wins out of the 854 decided games).
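The cumulative win/draw/loss curve and the quoted win rate can be checked with a few lines of arithmetic. This is an illustrative sketch: the function names and the single-letter result encoding are hypothetical, and computing the win rate over decided games only (draws excluded) is an inference from the reported numbers, since 512 / (512 + 342) is about 0.60 while 512 / 1000 is 0.512.

```python
from itertools import accumulate
from typing import Iterable, List

# +1 for a win, +0 for a draw, -1 for a loss, as described for FIG. 6(c).
RESULT_SCORE = {"W": 1, "D": 0, "L": -1}


def cumulative_record(results: Iterable[str]) -> List[int]:
    """Running total of the win/draw/loss scores, one entry per game."""
    return list(accumulate(RESULT_SCORE[result] for result in results))


def win_rate_excluding_draws(wins: int, losses: int) -> float:
    """Share of wins among decided games (draws are left out)."""
    return wins / (wins + losses)
```

A curve that first falls and later rises, as described for the early and late stages of learning, corresponds to a run of `"L"` results followed by a growing share of `"W"` results.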

FIG. 7 is an example illustrating the experimental results of a conventional FSM-based AI and the game behavior replication-based reinforcement learning design device.

FIG. 7 shows example results of an experiment in which the AI of the game behavior replication-based reinforcement learning design device played against an FSM-based AI designed by a basketball game expert. For example, over a total of 103 games, the device's AI recorded 68 wins and 35 losses, a win rate of about 66%.

FIG. 8 is a diagram illustrating the average number of successful actions per game for a conventional FSM-based AI and the game behavior replication-based reinforcement learning design device.

FIG. 8 shows the average number of successful actions per game when the AI of the game behavior replication-based reinforcement learning design device plays against an FSM-based AI designed by a basketball game expert. Referring to FIG. 8, the average number of successful actions per game can be seen for each AI. Compared with the expert-designed FSM-based AI, the device's AI shows relatively higher figures in offense (two-point shots) and defense (blocks and steals) episodes.

FIG. 9 is a diagram illustrating a game behavior replication-based reinforcement learning design method according to one embodiment of the present invention.

Referring to FIG. 9, in step S910 the game behavior replication-based reinforcement learning design device receives the input used to represent the state.

In step S930, the game behavior replication-based reinforcement learning design device performs preprocessing for learning by normalizing the state representation in the game environment, evaluating actions, and classifying episodes. The preprocessing may use a normalization module for representing the state, a reward module for calculating the reward function, and an episode classification module for classifying episodes.

The state representation provides environment information to the agent at a specific time, and a normalization technique based on relative state values can be used for stability. The normalization module used to represent the state receives only the state from the game in step S910 and normalizes it: state values expressed as absolute values are converted to relative values and scaled so that the maximum value is 1. In a ball sports game, the state values can include the agent's position, the positions of teammates and opponents, the position of the ball, the distance to the ball, the ball-clear condition, the remaining game time, and the team scores, and positions can be expressed in two or three dimensions.

The reward assigns a value to the agent's action at a specific point in time. Reward values can be applied to shooting, passing, breakthrough, marking, evasion, ball clear, ball pickup, and under-the-basket situations, and in the breakthrough, marking, evasion, and ball pickup situations a weight based on the frequency of occurrence can be applied. To evaluate actions, the device can use a reward module that designs the reward function from predefined detailed rewards. The reward module designs the reward so that a reinforcement learning algorithm can be applied to a complex game environment with offense-defense transitions, such as a ball sports game, taking into account the key actions required in offense and defense and how frequently they occur. The reward can be defined in terms of shooting, passing, breakthrough, marking, evasion, and so on, and is calculated by summing all rewards that occur. For example, the components can be calculated as follows, where each condition takes the value 0.0 or 1.0:

Shot: open × within effective shot range × shot taken
Pass: pivot × pass made
Breakthrough: within effective breakthrough range × breakthrough made × weight (0.5)
Mark: mark angle condition (30 degrees) × mark distance condition (2 m) × weight (0.002)
Evasion: open × within effective shot range × weight (0.01)
Ball clear: ball clear succeeded
Ball pickup: ball pickup succeeded - moved in a different direction from the ball × weight (0.002)
Under the basket: rim distance condition (2 m) × weight (0.001)

The device can classify at least one episode and learn from it. The episode classification step performs preprocessing that reduces learning complexity by classifying the states, actions, and rewards of the entire episode into offense, defense, and other situations so that reinforcement learning can be performed. Classification is based on ball possession: if the agent's own team has the ball, the situation is offense; if the opposing team has the ball, it is defense; and if no one has the ball, it is other. The states, rewards, and actions used in learning are applied differently according to the classified episode type.

Depending on the episode classification result, the states, rewards, and actions used in learning differ among offense, loose-ball, and defense situations. For example, an offense episode can be represented with 11 actions consisting of movement in 8 directions, shooting, breakthrough, and passing. Because this reduces the number of actions, the learning space shrinks, which has the advantage of improving learning performance with the same episodes.
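The example reward components of step S930 can be written as a small function. This is a sketch under explicit assumptions rather than the patented reward module: it assumes the factors of each component are multiplied together, that the 30-degree and 2 m conditions mean "at most" those values, and that the per-step reward is the plain sum of all components. The `StepEvents` container and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """Hypothetical per-step observations needed by the example reward terms."""
    open_shot: bool = False
    in_shot_range: bool = False
    shot_taken: bool = False
    pivoting: bool = False
    passed: bool = False
    in_breakthrough_range: bool = False
    breakthrough: bool = False
    mark_angle_deg: float = 180.0
    mark_distance_m: float = 100.0
    ball_cleared: bool = False
    picked_up_ball: bool = False
    moved_away_from_ball: bool = False
    rim_distance_m: float = 100.0


def total_reward(e: StepEvents) -> float:
    """Sum of the example reward components (weights taken from the text)."""
    f = float  # each condition contributes 0.0 or 1.0
    components = [
        f(e.open_shot) * f(e.in_shot_range) * f(e.shot_taken),              # shot
        f(e.pivoting) * f(e.passed),                                        # pass
        f(e.in_breakthrough_range) * f(e.breakthrough) * 0.5,               # breakthrough
        # Assumption: thresholds are upper bounds (angle <= 30 deg, distance <= 2 m).
        f(e.mark_angle_deg <= 30.0) * f(e.mark_distance_m <= 2.0) * 0.002,  # mark
        f(e.open_shot) * f(e.in_shot_range) * 0.01,                         # evasion
        f(e.ball_cleared),                                                  # ball clear
        f(e.picked_up_ball) - f(e.moved_away_from_ball) * 0.002,            # ball pickup
        f(e.rim_distance_m <= 2.0) * 0.001,                                 # under the basket
    ]
    return sum(components)
```

The small weights on marking, evasion, ball pickup, and under-the-basket positioning reflect the text's point that frequently occurring situations are down-weighted so that they do not dominate rare but important events such as a shot.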

In step S950, the game behavior replication-based reinforcement learning design device can perform reinforcement learning based on the preprocessing performed in step S930. The episode classification module classifies the states, actions, and rewards of the entire episode into offense, loose-ball, and defense situations and passes them to the corresponding A2C (Advantage Actor-Critic) module, so that policy-based reinforcement learning is performed for each classified episode type using the preprocessed data and imitating expert behavior. The policy that imitates expert behavior can be applied to offense, defense, and other situations. In other words, the learning-time and state-space problems that arise when reinforcement learning is applied to a complex game environment with offense-defense transitions, such as a ball sports game, can be addressed by using the normalization module, the reward module, and the episode classification module together with a behavior replication algorithm that imitates expert behavior.

The pseudo-code of behavior replication-based reinforcement learning uses a learning parameter for the policy, a parameter for the value function, and an update buffer, in which the state, action, expert action, and reward are recorded. When an episode ends, the policy parameter is updated through the policy loss function and the value parameter is updated through the value loss function. After the update for one episode is finished, the update buffer is reset, and the procedure is repeated until the return converges.
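The per-episode update described for step S950 can be sketched as a loss computation over the update buffer. The loss formulas themselves are not given in the text, so this is an assumed, commonly used combination: an advantage-weighted policy-gradient term plus a behavior replication term that is the cross-entropy between the policy and the recorded expert action, with a squared-error value loss. The names `Transition`, `discounted_returns`, `episode_losses`, `gamma`, and `bc_weight` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One entry of the update buffer."""
    action_probs: List[float]  # policy output for the recorded state
    value: float               # critic estimate for the recorded state
    action: int                # action the agent took
    expert_action: int         # action the expert policy would have taken
    reward: float


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def episode_losses(
    buffer: List[Transition], gamma: float = 0.99, bc_weight: float = 1.0
) -> Tuple[float, float]:
    """Mean policy loss (A2C term + behavior replication term) and mean value loss."""
    if not buffer:
        raise ValueError("the update buffer is empty")
    returns = discounted_returns([t.reward for t in buffer], gamma)
    policy_loss = 0.0
    value_loss = 0.0
    for t, g in zip(buffer, returns):
        advantage = g - t.value
        policy_loss -= math.log(t.action_probs[t.action]) * advantage
        # Assumed behavior replication term: cross-entropy to the expert action.
        policy_loss -= bc_weight * math.log(t.action_probs[t.expert_action])
        value_loss += advantage ** 2
    return policy_loss / len(buffer), value_loss / len(buffer)
```

In a full training loop, the two losses would be minimized with respect to the policy and value parameters at the end of each episode, after which the buffer is cleared, matching the update-then-reset cycle described above. The advantage is treated as a constant in the policy term, which is the usual convention.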

In step S970, the game behavior replication-based reinforcement learning design device determines an action and applies it to the game. That is, the action learned through reinforcement learning is applied to the game, for example to defense, offense, and loose-ball situations in a basketball game.

Although all components constituting the embodiments of the present invention have been described above as being combined into one or as operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated.

Although the operations are depicted in the drawings in a particular order, this should not be understood to mean that the operations must be performed in the particular order shown or in sequential order, or that all depicted operations must be performed, to achieve a desired result. In certain situations, multitasking and parallel processing may be advantageous. Furthermore, the separation of the various components in the embodiments described above should not be understood as requiring such separation, and it should be understood that the described program components and systems may generally be integrated together in a single software product or packaged into multiple software products.

The present invention has been described above with a focus on its embodiments. Those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered from an illustrative rather than a restrictive perspective. The scope of the present invention is indicated by the claims rather than the foregoing description, and all differences within the scope equivalent thereto should be interpreted as being included in the present invention.

100: Game behavior replication-based reinforcement learning design device
110: Preprocessing unit
120: Reinforcement learning unit
130: Action decision unit

Claims

A device for designing reinforcement learning based on game behavior replication, the device comprising:
a preprocessing unit that performs preprocessing for learning by normalizing state representation in a game environment, evaluating actions, and classifying episodes;
a reinforcement learning unit that performs reinforcement learning based on the preprocessing; and
an action decision unit that receives a state, decides an action, and applies the action to the game,
wherein the preprocessing unit includes:
a state representation unit that provides environment information to an agent at a specific time and uses a normalization module to represent the state, wherein the normalization module receives only the state from the game and normalizes it so that state values expressed as absolute values take relative values with a maximum value of 1, wherein the state values use, in a ball sports game, the agent position, the positions of the own team and the enemy team, the position of the ball, the distance to the ball, the ball-clear condition, the remaining game time, and the team scores, and wherein positions are expressed in two or three dimensions;
a reward unit that assigns value to the agent's action at a specific point in time, applies reward values to shooting, passing, breakthrough, marking, evasion, ball clear, ball pickup, and under-the-basket situations, and applies weights based on the frequency of occurrence in the breakthrough, marking, evasion, and ball pickup situations; and
an episode classification unit that classifies the states, actions, and rewards of the entire episode into offense, defense, and loose-ball situations and passes them to the corresponding modules.

(Deleted)

A method by which a game behavior replication-based reinforcement learning design device designs reinforcement learning, the method comprising:
performing preprocessing for learning by normalizing state representation in a game environment, evaluating actions, and classifying episodes;
performing reinforcement learning based on the preprocessing; and
receiving a state, deciding an action, and applying the action to the game,
wherein the step of performing preprocessing includes:
providing environment information to an agent at a specific time and using a normalization module to represent the state, wherein the normalization module receives only the state from the game and normalizes it so that state values expressed as absolute values take relative values with a maximum value of 1, wherein the state values use, in a ball sports game, the agent position, the positions of the own team and the enemy team, the position of the ball, the distance to the ball, the ball-clear condition, the remaining game time, and the team scores, and wherein positions are expressed in two or three dimensions;
assigning value to the agent's action at a specific point in time, applying reward values to shooting, passing, breakthrough, marking, evasion, ball clear, ball pickup, and under-the-basket situations, and applying weights based on the frequency of occurrence in the breakthrough, marking, evasion, and ball pickup situations; and
learning by classifying the states, actions, and rewards of the entire episode into offense, defense, and loose-ball situations and passing them to the corresponding modules.

(Deleted)

A computer program recorded on a computer-readable recording medium that executes the game behavior replication-based reinforcement learning design method according to claim 6.