KR20030071939A

KR20030071939A - Apparatus and method for construction model of data mining using ensemble machines

Info

Publication number: KR20030071939A
Application number: KR1020020011208A
Authority: KR
Inventors: 김용대
Original assignee: 김용대; 전종우; (주)비엘시스템스
Priority date: 2002-03-02
Filing date: 2002-03-02
Publication date: 2003-09-13
Anticipated expiration: 2022-03-02
Also published as: KR100640264B1; WO2003075187A1; AU2003212671A1

Abstract

본 발명은 다차원 거대 자료의 분석 및 예측에 있어서 기존의 방법에 비하여 보다 빠른 계산 시간 및 안정적인 모델의 구축이 용이하고 뛰어난 해석력과 예측력을 가지는 앙상블 구축 장치 및 그 방법에 관한 것이다.The present invention relates to an ensemble construction apparatus and method having faster analysis time and stable model construction, superior analysis and predictive power, and a method for analyzing and predicting multidimensional large data.

본 발명에 따르면, 앙상블(Ensemble) 모형을 이용한 데이터 마이닝(Data Mining) 모형 구축 장치에 있어서, M 개의 앙상블 모형 구축 수단을 포함하고, 첫 번째 앙상블 모형 구축 수단은 입력되는 다차원 학습 자료로부터 첫 번째 학습기를 구축하여, 상기 구축된 첫 번째 학습기 자체를 첫 번째 앙상블 모형으로 구축하고; k 번째 앙상블 모형 구축 수단은 k-1 번째 앙상블 모형 구축 수단의 결과물인 k-1 번째 앙상블 모형을 입력받아 다차원 공간 상에서 직교인 k 번째 학습기를 구축하고, 상기 구축된 k 번째 학습기와 k-1 번째 앙상블 모형을 결합함으로써, k 번째 앙상블 모형을 구축하는 것을 특징으로 하는 앙상블 모형을 이용한 데이터 마이닝 모형 구축 장치를 제공한다.According to the present invention, in a data mining model construction apparatus using an ensemble model, the apparatus includes M ensemble model construction means, and the first ensemble model construction means includes a first learner from the input multidimensional training material. Constructing the first learner itself as a first ensemble model; The k th ensemble model construction means receives a k-1 ensemble model that is the result of the k-1 ensemble model construction means, constructs an orthogonal k th learner in a multidimensional space, and constructs the k th learner and the k-1 th Provided is a data mining model building apparatus using the ensemble model, wherein the ensemble model is constructed by combining the ensemble model.

Description

Apparatus and method for construction model of data mining using ensemble machines}

본 발명은 앙상블 모형을 이용한 데이터 마이닝 모형 구축 장치 및 그 방법에 관한 것으로서, 보다 상세하게는, 다차원 거대 자료의 분석 및 예측에 있어서 기존의 방법에 비하여 보다 빠른 계산 시간 및 안정적인 모델의 구축이 용이하고 뛰어난 해석력과 예측력을 가지는 데이터 마이닝 모형 구축 장치 및 그 방법에 관한 것이다.The present invention relates to an apparatus and method for constructing a data mining model using an ensemble model, and more specifically, to a faster calculation time and a more stable construction of a model than a conventional method for analyzing and predicting multidimensional large data. The present invention relates to an apparatus and method for constructing a data mining model having excellent analysis and predictive power.

1. 서설1. Introduction

데이터 마이닝(DataMining)에서 앙상블 알고리즘(Ensemble Algorithm)은 'Breiman'의 배깅(Bagging) 기법을 효시로 하여 최근까지 많은 연구가 진행되고 있다.Ensemble Algorithm in DataMining has been studied in recent years with 'Breiman''s bagging technique as the first.

즉, 상술한 'Breiman'의 배깅 기법을 보다 개량하는 수많은 연구가 진행되고있는 바, 이러한 종래의 연구 성과로는, 'Freund and Schapire'가 제안하고 있는 부스팅(Boosting) 알고리즘, 'Breiman'이 제안하고 있는 아킹(Arcing) 알고리즘 및 'Breiman'이 제안하고 있는 랜덤 포레스트(Random Forest) 알고리즘 등이 있다.In other words, a number of studies are being conducted to further improve the above-described bagging technique of Breiman. As a result of the conventional research, the boosting algorithm proposed by Freund and Schapire, Breiman, is proposed. The Arching algorithm and the Random Forest algorithm proposed by Breiman.

이러한 앙상블 기법 중 'Freund and Schapire'가 제안한 부스팅 알고리즘은 그 예측력의 뛰어남으로 인하여 최근에 이를 기반으로 한 다양한 개선된 알고리즘이 등장하고 있다. 이러한 개선된 알고리즘으로는 'Schapire and Singer'가 제안한 리얼 부스팅(Real Boosting) 알고리즘 및 'Friedman'이 제안한 그레디언트 부스팅(Gradient Boosting) 알고리즘이 있다.Among these ensemble techniques, the boosting algorithm proposed by 'Freund and Schapire' has recently appeared in various improved algorithms based on its superior predictive power. These improved algorithms include Real Boosting algorithm proposed by 'Schapire and Singer' and Gradient Boosting algorithm proposed by 'Friedman'.

즉, 종래의 데이터 마이닝에 사용하는 앙상블 알고리즘은 주로 부스팅 알고리즘에 기반을 두고 있다.That is, the ensemble algorithm used in the conventional data mining is mainly based on the boosting algorithm.

이하에서는 분류(Classification) 문제에 적용되는 종래의 여러 가지 부스팅 알고리즘에 대하여 간략히 소개한다. 특히, 현재 가장 널리 사용되고 있는 리얼 부스팅 알고리즘, 로지트(Logit) 부스팅 알고리즘 및 그레디언트 부스팅 알고리즘에 대하여 설명하도록 하겠다.The following briefly introduces a number of conventional boosting algorithms applied to classification problems. In particular, the most widely used real boosting algorithm, logit boosting algorithm, and gradient boosting algorithm will be described.

2. 2-클래스에 관한 종래의 부스팅 알고리즘2. Conventional Boosting Algorithms for Two-Classes

부스팅 알고리즘은 분류(Classfication) 문제에 주로 사용되는 방법이다. 분류 문제의 기본 모형은 다음과 같다.The boosting algorithm is the main method used for classification problems. The basic model of the classification problem is as follows.

n 개의 학습 자료 (x₁, y₁), ..., (x_n, y_n)이 주어졌다고 가정하자. 여기서,x_i는 p 차원의 설명 변수, 즉, x_i= (x_1i, ..., x_pi)이고, 반응 변수 y_i는 자료가 속하는 그룹을 나타낸다. 즉, J 개의 그룹이 있을 때, y_i는 1부터 J 중의 하나의 정수값을 가진다.Suppose that _n training materials (x ₁ , y ₁ ), ..., (x _n , y _n ) are given. Here, x _i is an explanatory variable of p dimension, that is, x _i = (x _1i , ..., x _pi ), and the response variable y _i represents a group to which the data belongs. That is, when there are J groups, y _i has an integer value from 1 to J.

분류 문제의 목적은 n 개의 학습 자료를 이용하여 설명 변수로 반응 변수를 가장 잘 설명하는 관계를 찾는 것이다. 다시 말하면, 학습 자료를 이용하여 최적의 함수 H : R^p--> {1, 2, ..., J}를 만드는 것이다. 그리고, 새로운 설명 변수 x가 주어지면, 이 자료를 그룹 H(x)로 할당한다.The purpose of the classification problem is to find relationships that best describe the response variables with explanatory variables using n training data. In other words, we use the training material to create the optimal function H: R ^p- > {1, 2, ..., J}. And given a new explanatory variable x, we assign this data to group H (x).

먼저, 그룹이 두 개인 경우, 즉, J = 2 인 경우를 고려한다. 그룹이 여러 개인 경우는 후술하도록 한다. 다음은 여러 가지 부스팅 알고리즘에 관한 것이다.First, consider the case of two groups, that is, J = 2. If there are multiple groups, they will be described later. The following are the various boosting algorithms.

2-1. 리얼 부스팅 알고리즘(Schapire and Singer, 1999)2-1. Real Boosting Algorithm (Schapire and Singer, 1999)

리얼 부스팅 알고리즘은 데이터 마이닝을 위한 앙상블 구축 알고리즘에 있어서 가장 대표적인 알고리즘이다.Real boosting algorithm is the most representative algorithm for ensemble construction algorithm for data mining.

본 알고리즘은 미국 특허(US 5,819,247) 'Apparatus and methods for machine learning hypotheses'에 상세히 기재되어 있는 바, 이를 상세히 설명하면, 다음과 같다.The algorithm is described in detail in US Pat. No. 5,819,247, Apparatus and methods for machine learning hypotheses, which will be described in detail as follows.

도 1은 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 리얼 부스팅 방법을 나타낸 흐름도이다. 이를 상세히 설명하면, 다음과 같다.1 is a flowchart illustrating a conventional real boosting method applied to two groups in a classification problem. This will be described in detail as follows.

(1) 스텝 S101 : 반응 변수 y_i를 자료가 그룹 2에 속하면, 1로, 그룹 1에 속하면, -1로 놓는다.(1) Step S101: The response variable y _{i is set} to 1 if the data belongs to group 2, and to -1 if it belongs to group 1.

(2) 스텝 S102 : n 개의 가중치 w₁, ..., w_n을 w_i= 1/n으로 놓음으로써, 초기화한다.(2) Step S102: Initialize by setting n weights w ₁ , ..., w _n to w _i = 1 / n.

(3) 스텝 S103 : n 개의 학습 자료 (x₁, y₁), ..., (x_n, y_n)과 가중치 w₁, ..., w_n을 이용하여 주어진 설명 변수 x에 대하여 반응 변수가 1일 확률을 기본 학습기를 이용하여 추정한다. 이때, 반응 변수가 1일 확률은 아래의 [수학식 1]에 의하여 결정된다.(3) Step S103: Respond to the explanatory variable x given using _n training materials (x ₁ , y ₁ ), ..., (x _n , y _n ) and weights w ₁ , ..., w _n The probability that the variable is 1 is estimated using the basic learner. At this time, the probability that the response variable is 1 is determined by Equation 1 below.

(4) 스텝 S104 : 상기 P_m을 변환하여 f_m을 구한다. 이는 아래의 [수학식 2]에 의하여 결정된다.(4) Step S104: The above-mentioned P _m is converted to find f _m . This is determined by Equation 2 below.

(5) 스텝 S105 : 새로운 가중치를 아래의 [수학식 3]에 의하여 구한 후, 이를이 되도록 정규화(Normalization)한다.(5) Step S105: After the new weight is obtained by the following Equation 3, it is obtained. Normalize to

(6) 스텝 S106 : 상기 스텝 S103 내지 스텝 S105를 m = 1, ..., M 번까지 반복함으로써, M 개의 기본 학습기를 생성한다.(6) Step S106: By repeating steps S103 to S105 up to m = 1, ..., M times, M basic learners are generated.

(7) 스텝 S107 : 최종 앙상블 모형을 아래의 [수학식 4]에 의하여 결정한다. 이때, 새로운 반응 변수 x에 대하여 H(x) > 0 이면, 그룹 2에, H(x) < 0 이면, 그룹 1에 할당한다.(7) Step S107: The final ensemble model is determined by the following [Equation 4]. At this time, if H (x)> 0 for the new response variable x, it is assigned to Group 2, if H (x) <0, it is assigned to Group 1.

위와 같은 과정을 통하여 최종적으로 앙상블 모형을 구축하게 된다.Through the above process, the ensemble model is finally constructed.

2-2. 로지트 부스팅 알고리즘(Friedman et al., 2000)2-2. Logit Boosting Algorithm (Friedman et al., 2000)

도 2는 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 로지트 부스팅 방법을 나타낸 흐름도이다. 이를 상세히 설명하면, 다음과 같다.2 is a flowchart illustrating a conventional logit boosting method applied to two groups in a classification problem. This will be described in detail as follows.

(1) 스텝 S201 : 반응 변수 y_i를 자료가 그룹 2에 속하면, 1로, 그룹 1에 속하면, 0으로 놓는다.(1) Step S201: The reaction variable y _{i is set} to 1 if the data belongs to group 2 and to 0 if the data belongs to group 1.

(2) 스텝 S202 : n 개의 가중치 w₁, ..., w_n을 w_i= 1/n, F(x) = 0, p(x) = 1/2로 초기화한다.(2) Step S202: n weights w ₁ , ..., w _n are initialized to w _i = 1 / n, F (x) = 0, p (x) = 1/2.

(3) 스텝 S203 : 주어진 설명 변수에 대하여 새로운 반응 변수 z_i및 가중치 w_i를 아래의 [수학식 5]에 의하여 구한다.(3) Step S203: The new response variable z _i and the weight w _i are obtained for the given explanatory variable by the following [Equation 5].

(4) 스텝 S204 : 반응 변수 z_i, 설명 변수 x_i및 가중치 w_i를 이용하고, 기본 학습기를 참조하여 회귀 모형 f_m(x)를 구축한다.(4) Step S204: Using the response variable z _i , the explanatory variable x _i, and the weight w _i , a regression model f _m (x) is constructed with reference to the basic learner.

(5) 스텝 S205 : 상기 스텝 S204에서 구한 f_m(x)를 이용하여 확률을 갱신한다. 이를 나타낸 것이 아래의 [수학식 6]이다.(5) Step S205: The probability is updated using f _m (x) obtained in the above step S204. This is shown in Equation 6 below.

(6) 스텝 S206 : 상기 스텝 S203 내지 스텝 S205를 M 번 반복함으로써, 확률 갱신을 M 번 수행한다.(6) Step S206: By repeating Step S203 to Step S205 M times, probability update is performed M times.

(7) 스텝 S207 : 최종 앙상블 모형을 H(x) = F(x)로 하여, 구축한다. 이때, 새로운 반응 변수 x에 대하여 H(x) > 0 이면, 그룹 2에, H(x) < 0 이면, 그룹 1에 할당한다.(7) Step S207: The final ensemble model is constructed with H (x) = F (x). At this time, if H (x)> 0 for the new response variable x, it is assigned to Group 2, if H (x) <0, it is assigned to Group 1.

2-3. 그레디언트 부스팅 알고리즘(Friedman, 2001)2-3. Gradient Boosting Algorithm (Friedman, 2001)

도 3은 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 그레디언트 부스팅 방법을 나타낸 흐름도이다. 이를 상세히 설명하면, 다음과 같다.3 is a flowchart illustrating a conventional gradient boosting method applied to two groups in a classification problem. This will be described in detail as follows.

(1) 스텝 S301 : 반응 변수 y_i를 자료가 그룹 2에 속하면, 1로, 그룹 1에 속하면, -1로 놓는다.(1) Step S301: The response variable y _{i is set} to 1 if the data belongs to group 2, and to -1 if it belongs to group 1.

(2) 스텝 S302 : 아래의 [수학식 7]과 같이 각종 변수를 초기화한다.(2) Step S302: Initialize various variables as shown in Equation 7 below.

(3) 스텝 S303 : 주어진 설명 변수에 대하여 새로운 반응 변수 z_i를 아래의 [수학식 8]에 의하여 계산한다.(3) Step S303: The new response variable z _i is calculated for the given explanatory variable by the following [Equation 8].

(4) 스텝 S304 : 반응 변수 z_i, 설명 변수 x_i및 기본 학습기(회귀 의사 결정 나무 모형 : Regression Decision Tree)를 이용하여 f_m(x)를 추정한다.(4) Step S304: f _m (x) is estimated using the response variable z _i , the explanatory variable x _i, and the basic learner (Regression Decision Tree Model).

(5) 스텝 S305 : 상기 f_m(x)의 l 번째 최종 노드(Terminal Node)의 예측값을 아래의 [수학식 9]를 이용하여 추정한다.(5) Step S305: Predicted value of the l th last node of the f _m (x) Is estimated using Equation 9 below.

여기서, R_i는 i 번째 최종 노드에 속하는 자료의 집합이다.Where R _i is a set of data belonging to the i th last node.

(6) 스텝 S306 : F(x) = F(x) + f_m(x)로 갱신한다.(6) Step S306: Update to F (x) = F (x) + f _m (x).

(7) 스텝 S307 : 상기 스텝 S303 내지 스텝 S306을 M번 반복함으로써, F(x)를 M번 갱신한다.(7) Step S307: By repeating the steps S303 to S306 M times, F (x) is updated M times.

(8) 스텝 S308 : 최종 앙상블 모형을 H(x) = F(x)로 구축한다. 이때, 새로운 반응 변수 x에 대하여 H(x) > 0 이면, 그룹 2에, H(x) < 0 이면, 그룹 1에 할당한다.(8) Step S308: The final ensemble model is constructed as H (x) = F (x). At this time, if H (x)> 0 for the new response variable x, it is assigned to Group 2, if H (x) <0, it is assigned to Group 1.

한편, 상술한 리얼 부스팅 알고리즘 또는 로지트 부스팅 알고리즘은 다양한 기본 학습기를 사용할 수 있지만, 그레디언트 부스팅 알고리즘은 기본 학습기로 반드시 의사 결정 나무를 사용하여야 한다.Meanwhile, the above-described real boosting algorithm or logit boosting algorithm may use various basic learners, but the gradient boosting algorithm must use a decision tree as the basic learner.

또한, 상기 알고리즘에서 기본 학습기를 구축할 때, 학습기의 복잡도(Complexity)를 미리 정하여야 한다. 기본 학습기가 의사 결정 나무인 경우에는 최종 노드의 수로 학습기의 복잡도를 조절할 수 있다. 한편, 부스팅 알고리즘에서는 기본 학습기들이 약한 학습기이므로, 학습기의 복잡도를 최소화한다. 일반적으로는 2 개 내지 8 개의 최종 노드를 가지는 의사 결정 나무가 기본 학습기로 많이 사용되고 있다.In addition, when constructing the basic learner in the above algorithm, the complexity of the learner should be determined in advance. If the basic learner is a decision tree, the complexity of the learner can be controlled by the number of final nodes. Meanwhile, in the boosting algorithm, since the basic learners are weak learners, the complexity of the learners is minimized. In general, decision trees having 2 to 8 final nodes are commonly used as basic learners.

3. 멀티 클래스(Multi Class)에 대한 종래의 부스팅 알고리즘3. Conventional Boosting Algorithms for Multi Classes

상술한 3 가지 부스팅 알고리즘들은 그룹이 2 개인 경우에 적용되는 알고리즘들이고, 그룹이 2 개 이상으로 확장되는 경우에는 상기 알고리즘들을 확장 변형하여야만 한다.The three boosting algorithms described above are algorithms applied when there are two groups, and when the groups are extended to two or more groups, the algorithms must be expanded and modified.

이하에서는 멀티 클래스에 적용되는 리얼 부스팅 알고리즘, 로지트 부스팅 알고리즘 및 그레디언트 부스팅 알고리즘에 대하여 살펴 보도록 하겠다.Hereinafter, the real boosting algorithm, the logit boosting algorithm, and the gradient boosting algorithm applied to the multi-class will be described.

3-1. 멀티 클래스에 적용되는 리얼 부스팅 알고리즘3-1. Real Boosting Algorithm Applies to Multiclasses

도 4는 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 리얼 부스팅 방법을 나타낸 흐름도로서, 이를 상세히 설명하면, 다음과 같다.4 is a flowchart illustrating a conventional real boosting method applied to a multi-class in a classification problem, which will be described in detail as follows.

(1) 스텝 S401 : 주어진 하나의 학습 자료 (x_i, y_i)에 대하여 J 개의 새로운 자료 ((x_i, 1), y_i1), ..., ((x_i, J), y_iJ)를 생성한다. 이때, y_ik는 y_i가 k이면, 1이고, k가 아니면, -1이다.(1) Step S401: J new materials ((x _i , 1), y _i1 ), ..., ((x _i , J), y _{iJ for a} given learning material (x _i , y _i ) ) At this time, y _ik is 1 if y _i is k, and -1 if k is not.

(2) 스텝 S402 : 2-클래스 리얼 부스팅 알고리즘을 n X J 개의 새로운 자료에 적용시켜(즉, 멀티 클래스를 2-클래스로 변환), 최종 앙상블 모형을 아래의 [수학식 10]에 의하여 구축한다.(2) Step S402: By applying a two-class real boosting algorithm to n X J new materials (i.e., converting multi-class to two-class), a final ensemble model is constructed by Equation 10 below.

(3) 스텝 S403 : 새로운 설명 변수 x에 대하여 argmax_jH(x, j) 그룹에 할당한다.(3) Step S403: The new explanation variable x is assigned to the argmax _j H (x, j) group.

한편, 상술한 멀티 클래스에 적용되는 리얼 부스팅 알고리즘은 실행 횟수, 즉, 계산 횟수가 부스팅 횟수에 해당하므로 시간이 특히 너무 오래 걸린다는 문제점이 있다.On the other hand, the real boosting algorithm applied to the above-described multi-class has a problem in that it takes too long since the number of executions, that is, the number of calculations corresponds to the number of boosting times.

3-2. 멀티 클래스에 적용되는 로지트 부스팅 알고리즘3-2. Logit Boosting Algorithm Applies to Multiclasses

도 5는 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 로지트 부스팅 방법을 나타낸 흐름도로서, 이를 상세히 설명하면, 다음과 같다.5 is a flowchart illustrating a conventional logit boosting method applied to a multi-class in a classification problem, which will be described in detail as follows.

(1) 스텝 S501 : 각종 변수들을 초기화한다. 즉, F_j(x) = 0, p_j(x) = 1/J, j = 1, 2, ..., J로 놓는다.(1) Step S501: Initialize various variables. That is, let F _j (x) = 0, p _j (x) = 1 / J, j = 1, 2, ..., J.

(2) 스텝 S502 : i 번째 자료가 j 그룹에 포함되면 y_i ^*= 1로, 포함되지 아니하면, y_i ^*= 0로 놓는다.(2) Step S502: If the i th data is included in the j group, y _i ^* = 1, and if not, y _i ^* = 0.

(3) 스텝 S503 : 새로운 반응 변수 z_i및 새로운 가중치 w_i를 아래의 [수학식 11]에 의하여 결정한다.(3) Step S503: The new response variable z _i and the new weight w _i are determined by Equation 11 below.

(4) 스텝 S504 : 반응 변수 z_i, 설명 변수 x_i, 가중치 w_i를 이용하고, 기본 학습기를 참조함으로써, 회귀 모형 f_mj(x)를 산출한다.(4) Step S504: The regression model f _mj (x) is calculated by referring to the basic learner using the response variable z _i , the explanatory variable x _i , and the weight w _i .

(5) 스텝 S505 : 상기 스텝 S502 내지 스텝 S504를 J번 반복한다.(5) Step S505: The above steps S502 to S504 are repeated J times.

(6) 스텝 S506 : j 번째 F_j(x)를 아래의 [수학식 12]에 의하여 갱신한다.(6) Step S506: The j th F _j (x) is updated by Equation 12 below.

(7) 스텝 S507 : j 번째 확률 p_j(x)를 아래의 [수학식 13]에 의하여 갱신한다.(7) Step S507: The j th probability p _j (x) is updated by Equation 13 below.

(8) 스텝 S508 : 상기 스텝 S502 내지 스텝 S507을 M 번 반복한다.(8) Step S508: The above steps S502 to S507 are repeated M times.

(9) 스텝 S509 : 새로운 설명 변수 x에 대하여 argmax_jF_j(x) 그룹에 할당한다.(9) Step S509: Assign the new explanatory variable x to the argmax _j F _j (x) group.

3-3. 멀티 클래스에 적용되는 그레디언트 부스팅 알고리즘3-3. Gradient Boosting Algorithm Applies to Multiclasses

도 6은 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 그레디언트 부스팅 방법을 나타낸 흐름도로서, 이를 상세히 설명하면 다음과 같다.6 is a flowchart illustrating a conventional gradient boosting method applied to a multi-class in a classification problem, which will be described in detail as follows.

(1) 스텝 S601 : 각종 변수들을 초기화한다. 즉, F_j(x) = 0, p_j(x) = 1/J, j = 1, 2, ..., J로 놓는다.(1) Step S601: Initialize various variables. That is, let F _j (x) = 0, p _j (x) = 1 / J, j = 1, 2, ..., J.

(2) 스텝 S602 : i 번째 자료가 j 그룹에 포함되면, y_i ^*= 1, 포함되지 아니하면, y_i ^*= 0로 놓는다.(2) Step S602: If the i th data is included in the j group, y _i ^* = 1, and if not, y _i ^* = 0.

(3) 스텝 S603 : 새로운 반응 변수 z_i를 아래의 [수학식 14]로 결정한다.(3) Step S603: The new reaction variable z _i is determined by Equation 14 below.

(4) 스텝 S604 : 반응 변수 z_i, 설명 변수 x_i를 이용하고, 기본 학습기(회귀 의사 결정 나무 모형 : Regression Decision Tree)를 이용하여 회귀 모형 f_mj(x)를 구축한다.(4) Step S604: Using the response variable z _i and the explanatory variable x _i , a regression model f _mj (x) is constructed by using a basic learner (Regression Decision Tree).

(5) 스텝 S605 : 상기 회귀 모형 f_mj(x)의 l 번째 최종 노드(Terminal Node)의 예측값을 아래의 [수학식 15]에 의하여 추정한다.(5) step S605: predicted value of the l-th final node of the regression model f _mj (x) It is estimated by Equation 15 below.

(6) 스텝 S606 : F_j(x)를 아래의 [수학식 16]에 의하여 결정한다.(6) Step S606: F _j (x) is determined by the following [Equation 16].

(7) 스텝 S607 : 상기 스텝 S602 내지 스텝 S606을 j = 1, ..., J 번 반복한다.(7) Step S607: Steps S602 to S606 are repeated j = 1, ..., J times.

(8) 스텝 S608 : 상기 스텝 S602 내지 스텝 S607을 m = 1, ..., M 번 반복한다.(8) Step S608: The steps S602 to S607 are repeated m = 1, ..., M times.

(9) 스텝 S609 : 새로운 설명 변수 x에 대하여 argmax_jF_j(x) 그룹에 할당한다.(9) Step S609: Assign the new explanatory variable x to the argmax _j F _j (x) group.

4. 종래 기술들의 문제점 정리4. Problem solving of prior arts

상술한 바와 같은 부스팅 기법의 기본 아이디어는 여러 개의 약한 학습기를 구축한 후, 이를 결합하여 새로운 강한 학습기를 만드는 것이다. 그러나, 이러한 아이디어는 여러 가지 문제점을 가지고 있는데, 이를 정리하면 다음과 같다.The basic idea of the boosting technique as described above is to build several weak learners and combine them to create a new strong learner. However, this idea has a number of problems, which can be summarized as follows.

첫째, 구축된 모형의 해석이 어렵다.First, it is difficult to interpret the constructed model.

둘째, 여러 개의 튜닝 모수(Tuning Parameter), 즉, 의사 결정 나무의 크기, 의사 결정 나무의 개수 등이 있는데, 이를 일반 사용자가 정하는 것이 쉽지 않다.Second, there are several tuning parameters, that is, the size of the decision tree, the number of decision trees, and so on.

셋째, 자료가 과적합(Overfitting)되는 경향이 있다.(Ridgeway, 2000)Third, data tends to be overfitting (Ridgeway, 2000).

넷째, 반응 변수의 종류가 연속형인 경우에 알고리즘의 사용이 쉽지 않다.Fourth, the use of the algorithm is not easy when the type of response variable is continuous.

다섯째, 분류(Classification) 문제에서 확률의 추정이 쉽지 않다. 즉, 새로운 데이터가 입력될 때, 입력된 새로운 데이터가 취할 분류 확률을 추정할 수가 없다.Fifth, it is not easy to estimate the probability in the classification problem. That is, when new data is inputted, it is not possible to estimate the classification probability that the new data inputted will take.

여섯째, 부스팅 기법의 일반적인(통계학적) 이론이 미비하여 그 안정성에 대한 검증이 쉽지 않다.Sixth, the general (statistical) theory of boosting techniques is insufficient, making it difficult to verify its stability.

상기와 같은 종래 기술의 문제점을 해결하기 위한 본 발명의 목적은 다차원 거대 자료의 분석 및 예측에 있어서 기존의 방법에 비하여 보다 빠른 계산 시간 및 안정적인 모델의 구축이 용이하고 뛰어난 해석력과 예측력을 가지는 앙상블 구축 장치 및 그 방법을 제공하기 위한 것이다.An object of the present invention for solving the problems of the prior art as described above is to build an ensemble with faster analysis time and stable model construction and excellent analysis and predictive power than conventional methods in analyzing and predicting multidimensional large data An apparatus and a method thereof are provided.

도 1은 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 리얼 부스팅 방법을 나타낸 흐름도이고,1 is a flowchart illustrating a conventional real boosting method applied to two groups in a classification problem.

도 2는 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 로지트 부스팅 방법을 나타낸 흐름도이고,2 is a flowchart illustrating a conventional logit boosting method applied to two groups in a classification problem.

도 3은 분류 문제에서 그룹이 두 개인 경우에 적용되는 종래의 그레디언트 부스팅 방법을 나타낸 흐름도이고,3 is a flowchart illustrating a conventional gradient boosting method applied to two groups in a classification problem.

도 4는 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 리얼 부스팅 방법을 나타낸 흐름도이고,4 is a flowchart illustrating a conventional real boosting method applied to a multi-class in a classification problem.

도 5는 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 로지트 부스팅 방법을 나타낸 흐름도이고,5 is a flowchart illustrating a conventional logit boosting method applied to a multi-class in a classification problem.

도 6은 분류 문제에서 멀티 클래스인 경우에 적용되는 종래의 그레디언트 부스팅 방법을 나타낸 흐름도이고,6 is a flowchart illustrating a conventional gradient boosting method applied to a multi-class in a classification problem.

도 7은 본 발명의 일 실시예에 따른 앙상블 모형 구축 방법을 개략적으로 도시한 흐름도이고,7 is a flowchart schematically illustrating a method for constructing an ensemble model according to an embodiment of the present invention.

도 8은 최적의 의사 결정 나무 선택을 위한 교차 확인 과정을 개략적으로 도시한 흐름도이고,8 is a flowchart schematically illustrating a cross-check process for selecting an optimal decision tree,

도 9는 TIC를 이용한 최적의 의사 결정 나무를 선택하는 방법의 전체 개요를 보여주는 흐름도이고,9 is a flowchart showing an overall overview of a method of selecting an optimal decision tree using TIC,

도 10은 본 발명의 일 실시예에 따른 부스트랩 자료의 생성 과정을 나타낸 흐름도이고,10 is a flowchart illustrating a process of generating boost strap data according to an embodiment of the present invention.

도 11은 종래의 기본 학습기들의 집합에 포함되는 학습기 중 참 학습기와 거리가 작은 학습기들이 다수개 존재함을 보여 주는 기본 개념도이고,11 is a basic conceptual view showing that there are a plurality of learners having a small distance from a true learner included in a set of conventional basic learners.

도 12a는 본 발명에서 제안하는 캠 알고리즘에 따라 첫 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이고,12a is a diagram conceptually illustrating a method of constructing a first learner according to a cam algorithm proposed by the present invention;

도 12b는 본 발명에서 제안하는 캠 알고리즘에 따라 두 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이고,12B is a diagram conceptually illustrating a method of building a second learner according to the cam algorithm proposed by the present invention.

도 12c는 본 발명에서 제안하는 캠 알고리즘에 따라 세 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이고,12c conceptually illustrates a method of constructing a third learner according to the cam algorithm proposed by the present invention;

도 12d는 본 발명에서 제안하는 캠 알고리즘에 따라 m 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이고,12D is a diagram conceptually showing a method of building an m-th learner according to the cam algorithm proposed by the present invention.

도 13은 본 발명의 일 실시예에 따른 캠 알고리즘의 개요를 보여주는 흐름도이고,13 is a flowchart showing an outline of a cam algorithm according to an embodiment of the present invention;

도 14는 본 발명의 일 실시예에 따른 연속형 변수를 위한 앙상블 알고리즘을 개략적으로 도시한 흐름도이고,14 is a flowchart schematically illustrating an ensemble algorithm for continuous variables according to an embodiment of the present invention;

도 15는 연관성 규칙 생성 알고리즘의 개요를 나타내는 흐름도이고,15 is a flowchart showing an overview of an association rule generation algorithm;

도 16a 내지 도 16d는 종래의 앙상블 기법과 본 발명에서 제안하는 앙상블 기법의 성능을 알아 보기 위한 가상 실험 결과를 보여 주는 그래프이고,16A to 16D are graphs showing the results of a virtual experiment for examining the performance of the conventional ensemble technique and the ensemble technique proposed by the present invention.

도 17a 내지 도 17i는 종래의 앙상블 기법과 본 발명에서 제안하는 앙상블 기법의 성능을 알아 보기 위한 실제 자료 분석의 결과를 보여주는 그래프이다.17A to 17I are graphs showing the results of actual data analysis to determine the performance of the conventional ensemble technique and the ensemble technique proposed by the present invention.

상기한 목적을 달성하기 위하여 본 발명에 따르면, 앙상블(Ensemble) 모형을 이용한 데이터 마이닝(Data Mining) 모형 구축 장치에 있어서, M 개의 앙상블 모형 구축 수단을 포함하고, 첫 번째 앙상블 모형 구축 수단은 입력되는 다차원 학습 자료로부터 첫 번째 학습기를 구축하여, 상기 구축된 첫 번째 학습기 자체를 첫 번째 앙상블 모형으로 구축하고; k 번째 앙상블 모형 구축 수단은 k-1 번째 앙상블 모형 구축 수단의 결과물인 k-1 번째 앙상블 모형을 입력받아 다차원 공간 상에서 직교인 k 번째 학습기를 구축하고, 상기 구축된 k 번째 학습기와 k-1 번째 앙상블 모형을 결합함으로써, k 번째 앙상블 모형을 구축하는 것을 특징으로 하는 앙상블 모형을 이용한 데이터 마이닝 모형 구축 장치를 제공한다.In order to achieve the above object, according to the present invention, in a data mining model building apparatus using an ensemble model, comprising M ensemble model building means, the first ensemble model building means is input Constructing a first learner from multidimensional learning materials, and constructing the constructed first learner itself as a first ensemble model; The k th ensemble model construction means receives a k-1 ensemble model that is the result of the k-1 ensemble model construction means, constructs an orthogonal k th learner in a multidimensional space, and constructs the k th learner and the k-1 th Provided is a data mining model building apparatus using the ensemble model, wherein the ensemble model is constructed by combining the ensemble model.

또한, 앙상블(Ensemble) 모형을 이용한 데이터 마이닝(Data Mining) 모형 구축 방법에 있어서, M 번의 앙상블 모형 구축 단계를 포함하고, 제 1 번째 앙상블모형 구축 단계는, 입력되는 다차원 학습 자료로부터 첫 번째 학습기를 구축하여 상기 구축된 첫 번째 학습기 자체를 첫 번째 앙상블 모형으로 구축하고; 제 k 번째 앙상블 모형 구축 단계는, 제 k-1 번째 앙상블 모형 구축 단계의 결과물인 k-1 번째 학습기를 입력받아, 다차원 공간 상에서 직교인 k 번째 학습기를 구축하고, 상기 구축된 k 번째 학습기와 k-1 번째 앙상블 모형을 결합함으로써, k 번째 앙상블 모형을 구축하는 것을 특징으로 하는 앙상블 모형을 이용한 데이터 마이닝 모형 구축 방법을 제공한다.In addition, in a data mining model construction method using an ensemble model, the method includes constructing an M ensemble model, and the first ensemble model construction step includes a first learner from the input multidimensional training material. Constructing the constructed first learner itself as a first ensemble model; In the k-th ensemble model construction step, the k-1 th learner, which is the result of the k-1 th ensemble model construction step, is input, constructs an orthogonal k th learner in a multidimensional space, and constructs the k th learner and k. By combining the -1 th ensemble model, a method for constructing a data mining model using the ensemble model, characterized in that the k th ensemble model is constructed.

또한, 앙상블(Ensemble) 모형을 이용한 두 그룹 분류에서의 데이터 마이닝(Data Mining) 모형 구축 장치에 있어서, 다차원 데이터인 학습 자료들을 입력받는 입력 수단; 앙상블 모형의 잔차를 이용하여 가중치를 계산하는 가중치 계산 수단; 상기 입력 수단에 의하여 입력된 자료들로부터 주어진 설명 변수 x에 대하여 반응 변수가 1이 될 확률을 기본 학습기 및 상기 가중치를 이용하여 추정하는 확률 추정 수단; 상기 확률 추정 수단에 의하여 추정된 확률에 대하여 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 수정 모수 계산 수단; 상기 계산된 수정 모수를 이용하여 학습기를 갱신하여 재구축하는 학습기 갱신 수단; 및 상기 갱신된 학습기에 기반하여 앙상블 모형을 최종적으로 구축하는 앙상블 모형 구축 수단을 포함하는 것을 특징으로 하는 앙상블 모형을 이용한 두 그룹 분류에서의 데이터 마이닝 모형 구축 장치를 제공한다.In addition, an apparatus for constructing a data mining model in two group classifications using an ensemble model, the apparatus comprising: input means for receiving training materials that are multidimensional data; Weight calculation means for calculating a weight using a residual of the ensemble model; Probability estimating means for estimating a probability that the response variable becomes 1 with respect to the explanatory variable x given from the data input by the input means using a basic learner and the weight; Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function with respect to the probability estimated by the probability estimation means; Learner updating means for updating and rebuilding a learner using the calculated correction parameter; And an ensemble model building means for finally building an ensemble model based on the updated learner.

또한, 앙상블(Ensemble) 모형을 이용한 두 그룹 분류에서의 데이터 마이닝(Data Mining) 모형 구축 방법에 있어서, 다차원 데이터인 학습 자료들을 입력받는 제 1 단계; 앙상블 모형의 잔차를 이용하여 가중치를 계산하는 제 2 단계; 상기 제 1 단계에서 입력된 자료들로부터 주어진 설명 변수 x에 대하여 반응 변수가 1이 될 확률을 기본 학습기 및 상기 가중치를 이용하여 추정하는 제 3 단계; 상기 제 3 단계에서 추정한 확률에 대하여 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 제 4 단계; 및 상기 제 4 단계에서 계산된 수정 모수를 이용하여 학습기를 갱신하여 재구축하는 제 5 단계를 포함하는 것을 특징으로 하는 앙상블 모형을 이용한 두 그룹 분류에서의 데이터 마이닝 모형 구축 방법을 제공한다.A method of constructing a data mining model in two group classifications using an ensemble model, the method comprising: a first step of receiving training materials that are multidimensional data; Calculating a weight using a residual of the ensemble model; A third step of estimating a probability that the response variable becomes 1 with respect to the explanatory variable x given from the data input in the first step using a basic learner and the weight; A fourth step of calculating a correction parameter that minimizes a given loss function with respect to the probability estimated in the third step; And a fifth step of updating and reconstructing the learner by using the correction parameters calculated in the fourth step.

앙상블(Ensemble) 모형을 이용한 멀티 클래스 분류에서의 데이터 마이닝(Data Mining) 모형 구축 장치에 있어서, 다차원 데이터인 학습 자료들을 입력받는 입력 수단; 앙상블 모형의 잔차를 이용하여 가중치를 계산하는 가중치 계산 수단; 상기 입력 수단에 의하여 입력된 자료들로부터 주어진 설명 변수 x에 대하여 반응 변수가 j 그룹에 속할 확률을 기본 학습기 및 상기 가중치를 이용하여 추정하는 확률 추정 수단; 상기 확률 추정 수단에 의하여 추정된 확률에 대한 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 수정 모수 계산 수단; 상기 계산된 수정 모수를 이용하여 학습기를 갱신하여 재구축하는 학습기 갱신 수단; 및 상기 갱신된 학습기에 기반하여 앙상블 모형을 최종적으로 구축하는 앙상블 모형 구축 수단을 포함하는 것을 특징으로 하는 앙상블 모형을 이용한 멀티 클래스 분류에서의 데이터 마이닝 모형 구축 장치를 제공한다.An apparatus for constructing a data mining model in a multi-class classification using an ensemble model, the apparatus comprising: input means for receiving training materials that are multidimensional data; Weight calculation means for calculating a weight using a residual of the ensemble model; Probability estimating means for estimating a probability that the response variable belongs to the j group with respect to the explanatory variable x given from the data input by the input means using a basic learner and the weight; Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function for the probability estimated by the probability estimation means; Learner updating means for updating and rebuilding a learner using the calculated correction parameter; And an ensemble model building means for finally building an ensemble model based on the updated learner.

또한, 앙상블(Ensemble) 모형을 이용한 멀티 클래스 분류에서의 데이터 마이닝(Data Mining) 모형 구축 방법에 있어서, 다차원 데이터인 학습 자료들을 입력받는 제 1 단계; 앙상블 모형의 잔차를 이용하여 가중치를 계산하는 제 2 단계; 상기 제 1 단계에서 입력된 자료들로부터 주어진 설명 변수 x에 대하여 반응 변수가 j 그룹에 속할 확률을 기본 학습기 및 상기 가중치를 이용하여 추정하는 제 3 단계; 상기 제 3 단계에서 추정된 확률에 대한 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 제 4 단계; 상기 제 4 단계에서 계산된 수정 모수를 이용하여 학습기를 갱신하여 재구축하는 제 5 단계; 및 상기 갱신된 학습기에 기반하여 앙상블 모형을 최종적으로 구축하는 제 6 단계를 포함하는 것을 특징으로 하는 앙상블 모형을 이용한 멀티 클래스 분류에서의 데이터 마이닝 모형 구축 방법을 제공한다.A method of constructing a data mining model in a multi-class classification using an ensemble model, the method comprising: a first step of receiving training materials which are multidimensional data; Calculating a weight using a residual of the ensemble model; A third step of estimating, using the basic learner and the weight, a probability that the response variable belongs to the j group with respect to the explanatory variable x given from the data input in the first step; A fourth step of calculating a correction parameter that minimizes a given loss function with respect to the probability estimated in the third step; A fifth step of updating and rebuilding a learner using the correction parameter calculated in the fourth step; And a sixth step of finally constructing an ensemble model based on the updated learner.

또한, 반응 변수가 연속형(Regression)인 경우에 앙상블(Ensemble) 모형을 이용한 데이터 마이닝(Data Mining) 모형 구축 장치에 있어서, 다차원 데이터인 학습 자료들을 입력받는 입력 수단; 앙상블 모형의 잔차를 계산하는 잔차 계산 수단; 상기 입력 수단에 의하여 입력된 자료들로부터 기본 학습기 및 상기 잔차를 이용하여 회귀 모형을 구축하는 회귀 모형 구축 수단; 상기 회귀 모형 구축 수단에 의하여 구축된 회귀 모형에 대한 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 수정 모수 계산 수단; 상기 계산된 수정 모수를 이용하여 반응 변수를 갱신함으로써, 학습기를 재구축하는 학습기 갱신 수단; 및 상기 재구축된 학습기에 기반하여 앙상블 모형을 최종적으로 구축하는 앙상블 모형 구축 수단을 포함하는 것을 특징으로 하는 반응 변수가 연속형인 경우에 앙상블 모형을 이용한 데이터 마이닝 모형 구축 장치를 제공한다.In addition, a data mining model building apparatus using an ensemble model when the response variable is a regression, the apparatus comprising: input means for receiving training materials that are multidimensional data; Residual calculation means for calculating a residual of the ensemble model; Regression model building means for constructing a regression model using a basic learner and the residual from data input by the input means; Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function for the regression model constructed by the regression model construction means; Learner updating means for rebuilding a learner by updating a response variable using the calculated correction parameter; And an ensemble model building means for finally constructing an ensemble model based on the reconstructed learner, and provides a data mining model building apparatus using the ensemble model when the response variable is continuous.

또한, 반응 변수가 연속형(Regression)인 경우에 앙상블(Ensemble) 모형을 이용한 데이터 마이닝(Data Mining) 모형 구축 방법에 있어서, 다차원 데이터인 학습 자료들을 입력받는 제 1 단계; 앙상블 모형의 잔차를 계산하는 제 2 단계; 상기 제 1 단계에서 입력된 자료들로부터 기본 학습기 및 상기 잔차를 이용하여 회귀 모형을 구축하는 제 3 단계; 상기 제 3 단계에서 구축된 회귀 모형에 대한 주어진 손실 함수(Loss Function)를 최소로 하는 수정 모수를 계산하는 제 4 단계; 상기 제 4 단계에서 계산된 수정 모수를 이용하여 반응 변수를 갱신함으로써, 학습기를 재구축하는 제 5 단계; 및 상기 제 5 단계에서 재구축된 학습기에 기반하여 앙상블 모형을 최종적으로 구축하는 제 6 단계를 포함하는 것을 특징으로 하는 반응 변수가 연속형인 경우에 앙상블 모형을 이용한 데이터 마이닝 모형 구축 방법을 제공한다.In addition, a data mining model construction method using an ensemble model when the response variable is a regression, the method comprising: a first step of receiving training materials that are multidimensional data; Calculating a residual of the ensemble model; A third step of constructing a regression model using a basic learner and the residual from data input in the first step; A fourth step of calculating a correction parameter that minimizes a given loss function for the regression model constructed in the third step; A fifth step of rebuilding the learner by updating the response variable using the correction parameter calculated in the fourth step; And a sixth step of finally constructing the ensemble model based on the learner reconstructed in the fifth step. The method provides a data mining model building method using the ensemble model when the response variable is continuous.

보다 더 양호하게는, 상기 입력된 학습 데이터들을 가중치를 이용하여 부스트랩(Boostrap) 자료로 생성한다.Even more preferably, the input training data is generated as boost data by using a weight.

또한, 보다 더 양호하게는, 상기 학습기는 의사 결정 나무(Decision Tree) 또는 신경망 모형 중 어느 하나인 것을 특징으로 한다.Also, more preferably, the learner is either a decision tree or a neural network model.

또한, 보다 더 양호하게는, 현재 앙상블 모형의 손실 함수가 가장 작을 때 최종적으로 앙상블 모형을 구축한다.Also, more preferably, the ensemble model is finally constructed when the loss function of the current ensemble model is the smallest.

이하, 첨부된 도면을 참조하면서 본 발명의 일 실시예에 따른 데이터 마이닝을 위한 앙상블 구축 장치 및 그 방법을 보다 상세하게 설명하기로 한다.Hereinafter, an ensemble construction device and method for data mining according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

1. 서설1. Introduction

본 발명에서 제안하는 데이터 마이닝 모형 구축 알고리즘은 상술한 부스팅 기법의 문제들을 해결하면서, 동시에 그 예측력이 부스팅보다 뛰어난 새로운 알고리즘이다. 또한 해석력에 있어서도 의사 결정 나무를 기본 학습기로 채용하는 경우, 종래의 데이터 마이닝 기법 중 그 해석력이 가장 뛰어나다고 알려진(물론, 예측력은 매우 나쁨.) 의사 결정 나무보다도 더욱 향상된 결과를 가져온다. 특히, 주어진 자료를 설명할 수 있는 다양한 종류의 연관성 규칙을 추출할 수 있고, 이를 통하여 다양한 각도에서 주어진 자료를 분석할 수 있다.The data mining model construction algorithm proposed by the present invention solves the problems of the above-mentioned boosting technique and at the same time, is a new algorithm whose predictive power is superior to boosting. In addition, when the decision tree is adopted as the basic learner in terms of interpretation power, the decision tree is further improved than the decision tree, which is known to have the best interpretation power (of course, the prediction is very poor) among the conventional data mining techniques. In particular, it is possible to extract various kinds of association rules that can explain a given data and to analyze the given data from various angles.

이론적으로 본 발명에서 제안하는 알고리즘은 부스팅 기법의 이론과는 기본이 되는 아이디어가 완전히 다르다. 부스팅 기법은 약한 학습기 여러개를 융합하여 새로운 강한 학습기를 만드는 것임에 비하여, 본 발명에서 제안하는 알고리즘은 강한 학습기 여러 개를 융합하여 보다 더 강한 학습기를 만드는 것이다.Theoretically, the algorithm proposed by the present invention is completely different from the basic idea of the boosting technique. While the boosting technique is to fuse a plurality of weak learners to make a new strong learner, the algorithm proposed in the present invention is to fuse a plurality of strong learners to make a stronger learner.

본 발명에서 제안하는 앙상블 기법에 쓰이는 기본 학습기는 의사 결정 나무를 사용하는 경우를 예로 들어 설명하도록 하겠다. 그 이유는, 의사 결정 나무의 장점인 알고리즘의 단순성과 높은 해석력 때문이다. 그러나, 기본 학습기의 선정은 자료의 종류에 따라 바뀔 수 있는데, 예를 들면, 문자 인식이나 음성 인식의 경우에는 신경망(Neural Network) 모형을 사용할 수 있다.The basic learner used in the ensemble technique proposed by the present invention will be described taking the case of using a decision tree as an example. The reason for this is the simplicity of the algorithm and the high interpretation power, which are the advantages of decision trees. However, the selection of the basic learner can be changed according to the type of material. For example, a neural network model can be used for text recognition or speech recognition.

의사 결정 나무를 기본 학습기로 사용하기 위해서는 빠른 계산이 필수적이다. 일반적으로 현재 널리 알려진 의사 결정 나무 구축 알고리즘으로는 'Breiman'이 제시한 카트 알고리즘(CART Algorithm, Breiman et al., 1986)이 있다. 카트 알고리즘의 경우에는, 나무의 성장, 가지 치기 및 최적 의사 결정 나무 선택의 3 단계로 이루어진다.Fast computations are essential to using decision trees as basic learners. In general, currently known decision tree construction algorithms are the cart algorithm proposed by Breiman (CART Algorithm, Breiman et al., 1986). In the case of the kart algorithm, there are three stages: tree growth, pruning and optimal decision tree selection.

이때, 세 번째 단계인 최적 나무 모형을 선택하는 알고리즘은 교차 확인(Cross Validation) 기법을 사용하는데, 이 기법은 많은 계산량을 요구한다. 하나의 의사 결정 나무를 생성하기 위하여 교차 확인에 필요한 계산량은 그리 부담이 되지 않지만, 앙상블 기법에서는 여러 개의 의사 결정 나무를 생성하기 때문에, 모든 의사 결정 나무에 교차 확인 기법을 적용하는 것은 계산량의 폭증을 필연적으로 수반하게 된다.In this case, the third step, the algorithm for selecting an optimal tree model, uses a cross validation technique, which requires a large amount of computation. The computation required for cross validation to generate a single decision tree is not too burdensome, but because the ensemble technique generates multiple decision trees, applying cross validation to all decision trees can lead to a surge in computation. It is necessarily accompanied.

본 발명에서는 이러한 교차 확인의 문제점을 극복하기 위하여 TIC(Tree Information Criteria)라는 양을 새로 정의하고, 이를 이용함으로써, 보다 빠른 시간에 최적의 의사 결정 나무를 구축하게 된다.In the present invention, in order to overcome the problem of cross-checking, the TIC (Tree Information Criteria) is newly defined and used to construct an optimal decision tree at a faster time.

그러나, 본 발명에서 제안하는 데이터 마이닝 모형 구축 기법은 상기 TIC를 이용한 의사 결정 나무 선택 방법에 제한되는 것은 아니다. 왜냐하면, 본 발명은 기본 학습기로 의사 결정 나무만이 사용되는 것이 아니라 다양한 기본 학습기(예를 들면, 신경망 학습기)를 사용할 수 있기 때문이다. 다만, 설명의 편의상 본 출원에서는 기본 학습기로 의사 결정 나무를 선택하여 설명한다.However, the data mining model construction technique proposed by the present invention is not limited to the decision tree selection method using the TIC. This is because the present invention can use not only a decision tree as a basic learner but also various basic learners (for example, neural network learners). However, for convenience of description, the present invention selects and describes a decision tree as a basic learner.

도 7은 본 발명의 일 실시예에 따른 앙상블 모형 구축 방법을 개략적으로 도시한 흐름도로서, 이를 설명하면 다음과 같다.7 is a flowchart schematically illustrating a method for constructing an ensemble model according to an embodiment of the present invention.

먼저, 스텝 S701에서, 다차원 데이터인 학습 데이터가 입력되면, 스텝 S702에서, 기본 학습기를 구축한다. 이때, 기본 학습기로는 의사 결정 나무, 신경망 모형 등이 사용될 수 있는 바, 본 출원서에서는 설명의 편의상 의사 결정 나무인 것으로 한다. 따라서, 스텝 S702에서는 최적의 의사 결정 나무를 구축하게 된다. 이때, 최적의 의사 결정 나무를 구축하는 방법은 여러 가지가 있는데, 본 발명에서는 TIC(Tree Information Criteria) 알고리즘이라는 새로운 방법을 도입한다.First, when learning data which is multidimensional data is input in step S701, a basic learner is constructed in step S702. In this case, a decision tree, a neural network model, and the like may be used as the basic learner. In the present application, it is assumed that the decision tree is for convenience of explanation. Therefore, the optimal decision tree is constructed in step S702. At this time, there are several methods for constructing an optimal decision tree, and the present invention introduces a new method called a tree information criterion (TIC) algorithm.

이어서, 스텝 S703에서, 구축된 기본 학습기들을 이용하여 앙상블 모형을 구축하고, 스텝 S704에서, 상기 구축된 앙상블 모형이 최적인지 여부를 판단한다.Next, in step S703, an ensemble model is constructed using the constructed basic learners, and in step S704, it is determined whether the constructed ensemble model is optimal.

상기 스텝 S704에서의 판단 결과, 최적이 아닌 것으로 판단되면, 스텝 S705에서, 새로운 반응 변수를 생성한 후, 상기 스텝 S702로 복귀한다.If it is determined in the step S704 that it is not optimal, in step S705 a new response variable is generated, and then the flow returns to the step S702.

상기 스텝 S704에서의 판단 결과, 최적인 것으로 판단되면, 스텝 S706에서, 최종 모형을 구축한 후, 종료한다.If it is determined that the result of determination in step S704 is optimal, then in step S706, the final model is constructed and then ends.

한편, 본 출원서의 구성은 다음과 같다.On the other hand, the configuration of the present application is as follows.

(1) 먼저, 교차 확인 방법을 대치하는 새로운 최적의 의사 결정 나무 구축 알고리즘인 TIC에 대해서 설명함으로써, 최적의 의사 결정 나무 구축 방법을 알아 보고(스텝 S702), (2) 본 발명이 이용하는 부스트랩(Boostrap) 자료 추출 방법을 살펴보며(스텝 S703의 전단부), (3) 이를 이용한 새로운 앙상블 알고리즘을 설명하고,(반드시 TIC 알고리즘을 이용할 필요가 없슴. 즉, 기본 학습기로 의사 결정 나무를 반드시 선택할 필요가 없슴.)(스텝 S703의 후반부 내지 스텝 S706) (3) 연관성 규칙 생성 알고리즘을 살펴보며, (4) 가상 실험을 통하여 여러 가지 알고리즘의 성능을 비교한 후, 마지막으로 실제 자료를 이용하여 각각의 알고리즘의 성능을 비교한다.(1) First, TIC, a new optimal decision tree construction algorithm that replaces the cross-check method, will be described to find an optimal decision tree construction method (step S702), and (2) a boost trap used by the present invention. (Boostrap) Examine the data extraction method (front part of step S703), (3) describe the new ensemble algorithm using it (not necessarily use the TIC algorithm, that is, select the decision tree as the basic learner). (L) from step S703 to step S706. (3) Examine the association rule generation algorithm, and (4) compare the performance of various algorithms through virtual experiments. Compare the performance of the algorithm.

2. TIC(Tree Information Criteria)2.TIC (Tree Information Criteria)

본 절에서는 최적의 의사 결정 나무를 결정하는 문제에서 기존에 사용되는 교차 확인 방법의 문제점을 개선하는 TIC 방법에 대하여 설명한다. 먼저 의사 결정 나무를 생성하는 전반적인 알고리즘을 설명하고, 종래의 교차 확인 알고리즘을 살펴 본 후, 본 발명에서 제안하는 TIC를 이용한 최적 의사 결정 나무 선택 알고리즘을 설명하도록 하겠다.This section describes the TIC method that improves the problem of the existing cross-checking method in the problem of determining the optimal decision tree. First, an overall algorithm for generating a decision tree will be described, and after reviewing a conventional cross-check algorithm, the optimal decision tree selection algorithm using TIC proposed by the present invention will be described.

2-1. 의사 결정 나무 구축 알고리즘(Breiman et al., 1984)2-1. Decision Tree Construction Algorithm (Breiman et al., 1984)

'Breiman'이 제시한 의사 결정 나무 구축 알고리즘은 크게는 삼단계로 나눌 수 있다.The decision tree construction algorithm proposed by Breiman can be largely divided into three stages.

첫째는 성장 알고리즘으로서, 주어진 자료에 대하여 가장 큰 크기의 의사 결정 나무를 생성하는 단계이다.The first is a growth algorithm, which produces the largest decision tree for a given piece of data.

둘째는 가지 치기 알고리즘으로서, 상기 성장 알고리즘을 통하여 구축한 거대한 의사 결정 나무에서 불필요한 가지를 순서대로 삭제함으로써, 내포되는 여러개의 의사 결정 나무들을 생성하는 단계이다. 이때, 구축된 의사 결정 나무들은 점점 그 크기가 작아진다.The second is a pruning algorithm, in which a plurality of nested decision trees are generated by deleting unnecessary branches in order from a huge decision tree constructed through the growth algorithm. At this time, the constructed decision trees become smaller in size.

셋째는 최적 나무 선택 알고리즘으로서, 상기 가지 치기 알고리즘으로 구한 의사 결정 나무 중 최적의 의사 결정 나무를 선택하는 단계이다.Third, an optimal tree selection algorithm is a step of selecting an optimal decision tree among decision trees obtained by the pruning algorithm.

2-2. 최적 의사 결정 나무 선택을 위한 교차 확인 알고리즘(k 폴드 교차 확인)2-2. Cross-check algorithm for choosing optimal decision trees (k-fold cross check)

도 8은 최적의 의사 결정 나무 선택을 위한 교차 확인 과정을 개략적으로 도시한 흐름도로서, 이를 상세히 설명하면, 다음과 같다.8 is a flowchart schematically illustrating a cross-check process for selecting an optimal decision tree, which will be described in detail as follows.

입력되는 다차원 데이터에서 성장 알고리즘과 가지 치기 알고리즘을 이용하여 생성된 의사 결정 나무를 T₁, ..., T_m이라 하고, e_i는 T_i의 교차 확인 에러라고 하자.The decision trees generated using the growth algorithm and the pruning algorithm on the input multidimensional data are called T ₁ , ..., T _m , and e _i is the cross-check error of T _i .

(1) 스텝 S801 : 각종 변수들을 초기화한다. 즉, e_i= 0, i = 1, 2, ..., m 으로 놓는다.(1) Step S801: Initialize various variables. That is, e _i = 0, i = 1, 2, ..., m.

(2) 스텝 S802 : 주어진 n 개의 학습 자료를 k 등분하여 k 개의 상호 배반인 자료 D₁, D₂, ..., D_k를 생성한다.(2) Step S802: Divide the given n learning materials by k equal parts _k to generate _k D mutually betrayed materials D ₁ , D ₂ , ..., D _k .

(3) 스텝 S803 : D_i를 테스트 자료로 하고, 나머지 자료를 학습 자료로 한다.(3) Step S803: D _i is used as test data and the remaining data is used as learning data.

(4) 스텝 S804 : 상기 학습 자료들을 이용하여 내포되는 의사 결정 나무들(성장과 가지 치기 알고리즘을 이용하여)을 구축한다.(4) Step S804: Construct decision trees (using growth and pruning algorithms) nested using the learning materials.

(5) 스텝 S805 : 상기 구축된 의사 결정 나무들 각각에 대하여 테스트자료(D_i)를 이용하여 예측 에러를 구한다.(5) Step S805: A prediction error is obtained for each of the constructed decision trees by using the test data D _i .

(6) 스텝 S806 : 상기 구축된 의사 결정 나무 중 의사 결정 나무 T_j에 가장 근접한 의사 결정 나무를 선택한다. 이때, 선택하는 알고리즘은 'Breiman et al.(1984)'에 상세히 기재되어 있는 바, 여기서는 생략한다.(6) Step S806: From the constructed decision trees, the decision tree closest to the decision tree T _j is selected. At this time, the algorithm to select is described in detail in Breiman et al. (1984), and will be omitted here.

(7) 스텝 S807 : e_j에 상기 스텝 S806에서 구한 의사 결정 나무의 예측 에러를 더한다.(7) Step S807: Add e _j to the prediction error of the decision tree found in step S806.

(8) 스텝 S808 : j = 1, ..., m번 반복한다.(8) Step S808: j = 1, ..., m times repeated.

(9) 스텝 S809 : i = 1, ..., k번 반복한다.(9) Step S809: i = 1, ... are repeated k times.

(10) 스텝 S810 : e₁, ..., e_m을 의사 결정 나무 T₁, ..., T_m각각의 교차 확인 에러라 부르며, 이 교차 확인 에러가 가장 작은 의사 결정 나무를 최적의 의사 결정 나무로 선택한다.(10) Step S810: e ₁ , ..., e _m are called decision trees of the decision trees T ₁ , ..., T _m , and the decision tree with the smallest cross confirmation error is the best decision. Choose as a crystal tree.

한편, 이러한 교차 확인 알고리즘은 k 폴드 교차 확인 알고리즘이라고도 부르는데, 일반적으로 5 폴드 또는 10 폴드 교차 확인 방법이 주로 사용된다.On the other hand, such cross-check algorithm is also called k-fold cross-check algorithm, generally 5 or 10 fold cross-check method is mainly used.

상술한 바와 같은 최적의 의사 결정 나무 구축을 위한 교차 확인 알고리즘은 의사 결정 나무를 여러번 구축해야 한다. 따라서, 자료가 거대한 경우에는 계산 시간이 매우 길어지고, 그 결과가 자료를 어떻게 나누느냐에 따라 임의적으로 변동하는 문제점이 있다.As described above, the cross-checking algorithm for constructing the optimal decision tree must build the decision tree several times. Therefore, when the data is huge, the calculation time becomes very long, and the result varies randomly depending on how the data is divided.

이러한 교차 확인 방법이 가지고 있는 계산량의 폭증과 결과의 불안정성 등의 문제를 해결하기 위하여 본 발명에서는 TIC라는 알고리즘을 새롭게 제안한다.이하에서는 TIC 알고리즘을 소개한다.In order to solve problems such as the increase in the amount of computation and the instability of the result that the cross-checking method has, the present invention newly proposes an algorithm called TIC. Hereinafter, the TIC algorithm will be introduced.

2-3. 최적의 의사 결정 나무 선택을 위한 TIC 알고리즘2-3. TIC Algorithm for Optimal Decision Tree Selection

TIC 알고리즘의 목적은 여러 개의 나무 순열, 즉, T₁, ..., T_m중 최적의 나무를 결정하는 것이다. 이때, 각각의 나무의 사후 확률(Posterior Probability)을 계산하고, 이 사후 확률이 가장 큰 나무를 최적의 나무로 선택하게 된다.The purpose of the TIC algorithm is to determine the optimal tree among several tree permutations, ie, T ₁ , ..., T _m . At this time, the posterior probability of each tree is calculated, and the tree having the largest posterior probability is selected as an optimal tree.

사후 확률이란 주어진 자료에 대하여 각각의 나무의 확률을 의미한다. 즉, 나무 T_i의 사후 확률은 주어진 자료 D_n= {(y₁, x₁), ..., (y_n, x_n)}에 대하여이 된다.Post-probability means the probability of each tree for a given data. That is, the posterior probability of the tree T _i is given for the given data D _n = {(y ₁ , x ₁ ), ..., (y _n , x _n )} Becomes

도 9는 TIC를 이용한 최적의 의사 결정 나무를 선택하는 방법의 전체 개요를 보여주는 흐름도로서, 이를 상세히 설명하면, 다음과 같다.9 is a flowchart showing an overall overview of a method of selecting an optimal decision tree using TIC, which will be described in detail as follows.

먼저, 스텝 S901에서, 다차원 데이터인 학습 데이터가 입력되면, 스텝 S902에서, 상기 자료를 이용하여 최대 크기의 의사 결정 나무를 구축한다. 이어서, 스텝 S903에서, 상기 구축된 최대 크기 의사 결정 나무들을 가지 치기 이론을 이용하여 내포 의사 결정 나무(Nested Trees)들로 새롭게 생성한다.First, when learning data which is multidimensional data is input in step S901, a decision tree of maximum size is constructed using the data in step S902. Then, in step S903, the constructed maximum size decision trees are newly generated as Nested Trees using the pruning theory.

그리고, 스텝 S904에서, 각각의 의사 결정 나무들의 사후 확률을 계산한 후, 스텝 S905에서, 최대 사후 확률을 가지는 의사 결정 나무를 선택하여, 스텝 S906에서, 단일화된 최적 의사 결정 나무를 최종적으로 구한다.After calculating the posterior probabilities of the respective decision trees in step S904, the decision tree having the largest posterior probability is selected in step S905, and finally, in step S906, the unified optimal decision tree is finally obtained.

이하에서는 이러한 최적 의사 결정 나무를 선택하는 방법을 보다 상세하게설명한다.The following describes in more detail how to select such an optimal decision tree.

먼저, 사후 확률을 계산하는 일반적인 방법에 대하여 살펴 본다.First, we look at the general method of calculating the posterior probability.

사후 확률은 베이지안 정리(Bayesian Theorem)에 의하여= cPr(D_n│T_i)Pr(T_i)가 되며, 이때 상기 Pr(D_{n │}T_i)는 모형이 T_i일 때의 자료의 확률, Pr(T_i)는 자료를 보기 전에 사용자가 임의로 정한 확률, 그리고, c는로 만드는 상수이다.The posterior probability is determined by the Bayesian Theorem. = cPr (D _n │ T _i ) Pr (T _i ), where Pr (D _{n │} T _i ) is the probability of the data when the model is T _i , and Pr (T _i ) is the user before viewing the data. Is a randomly determined probability, and c is Is a constant that makes

한편, 사후 확률을 구하는 목적은 사후 확률이 가장 큰 나무를 결정하기 위한 것으로서, 상기 상수 c는 구할 필요가 없으며, 아래의 [수학식 17]을 사용하기 로 한다.On the other hand, the purpose of calculating the posterior probability is to determine the tree with the largest posterior probability, and the constant c need not be obtained, and the following Equation 17 will be used.

Pr(D_n│ T_i)를 구하여 보자.Let's find Pr (D _n | T _i ).

먼저, 자료가 독립이므로, 아래의 [수학식 18]가 성립한다.First, since the data are independent, Equation 18 below holds.

또한, 상기 [수학식 18]은 아래의 [수학식 19]로도 쓸 수 있다.In addition, Equation 18 may also be written as Equation 19 below.

여기서, 나무 모형 T_i는 주어진 입력 x_k에 대하여 y_k의 확률 구조를 나타내는 모형이므로, Pr(x_k│T_i)는 T_i에 의존하지 아니한다. 즉, Pr(x_k│T_i) = Pr(x_k)이다. 따라서, Pr(D_n│T_i)를 구하기 위하여는 Pr(y_k│T_i, x_k)를 구하면 된다.Here, the tree model T _i is a model representing a probability structure of y _{k for} a given input x _k , so Pr (x _k │T _i ) does not depend on T _i . That is, Pr (x _{k |} T _i ) = Pr (x _k ). Therefore, Pr (y _{k |} T _i , x _k ) can be obtained to obtain Pr (D _{n |} T _i ).

한편, 상수 c와 마찬가지로 Pr(x_k)는 모든 나무에 공통으로 적용되는 값으로서, 최대의 사후 확률을 가지는 나무를 찾는데는 필요하지 않다. 따라서, 이를 반영하여 수식으로 표현하면, 아래의 [수학식 20]이 된다.On the other hand, like the constant c, Pr (x _k ) is a value commonly applied to all trees, and is not necessary to find a tree having the largest posterior probability. Therefore, when the expression is reflected by the expression, it is expressed by Equation 20 below.

상기를 구하는 방법은 다음과 같다.remind How to obtain is as follows.

의 최종 노드들의 집합을라 하자. 그리고, 주어진 h 번째 최종 노드에 대하여 각 J 개의 그룹의 확률을라 하자. 그러면, 주어진 입력 변수 x_k가 나무의 h 번째 최종 노드에 속하는 경우, 아래의 [수학식 21]이 성립한다. The final set of nodes Let's do it. Then, for each given h th final node, the probability of each J group Let's do it. Then, given the input variable x _k If it belongs to the h-th final node of, Equation 21 below holds.

이때, y_k는 자료가 속하는 그룹을 나타낸다.Where y _k represents the group to which the data belongs.

상술한 내용들을 이용하면, 아래의 [수학식 22]가 성립한다.Using the foregoing, Equation 22 below holds.

이때, n_jh는 h 번째 최종 노드에 포함되는 자료 중, 그룹 j에 속하는 자료의 수이다.In this case, n _jh is the number of data belonging to the group j among the data included in the h-th last node.

각 최종 노드의 확률가 모르는 변수이므로, 이를 기대값을 이용하여 제거한다. 기대값을 구하기 위하여는의 분포가 필요한데, 이를라 하자. 그러면, 아래의 [수학식 23]이 성립한다.Probability of each final node Is an unknown variable, so remove it using the expected value. To get the expected value We need a distribution of Let's do it. Then, Equation 23 below holds.

여기서,로 여러 가지 분포를 사용할 수 있으며, 일반적인 분포를 사용하면, 아래의 [수학식 24]가 성립한다.here, Various distributions can be used, and using the general distribution, Equation 24 below holds.

또한, 일양 분포를 사용하면, 아래의 [수학식 25]가 성립한다.In addition, using a one-way distribution, the following Equation 25 holds.

이때,이다.At this time, to be.

한편, 상기 일양 분포는 아래의 [수학식 26]과 같이 정의된다.On the other hand, the daily distribution is defined as in Equation 26 below.

이하에서는 나무의 사전 확률(Prior Probability)를 정하는 방법을 살펴 보자.Prior Probability of Trees Let's look at how to determine.

는 자료로부터 구하는 것이 아니라, 사용자가 입력하는 것이다. Is not obtained from the data, but is entered by the user.

TIC를 위한는 다음과 같이 구축한다.For TIC Is constructed as follows:

먼저, 각각의 주어진 h 번째 노드에서 그 노드가 중간 노드(즉, 계속해서 분기가 진행됨.)가 될 확률을 아래의 [수학식 27]과 같이 정의하자.First, let's define the probability that at each given h th node it will be an intermediate node (ie, branching continues) as shown in Equation 27 below.

여기서, f_h는 주어진 노드의 조상 노드들의 수이고, 상수와는 사용자에 의하여 정하여 진다.Where f _h is the number of ancestor nodes of a given node and is a constant Wow Is determined by the user.

그러면, 주어진 노드가 최종 노드가 될 확률은 자연스럽게 아래의 [수학식28]과 같이 결정된다.Then, the probability that a given node becomes the final node is naturally determined as shown in Equation 28 below.

상기 [수학식 28]와 같은 조건하에서 주어진 나무 T_i의 사전 확률은 아래의 [수학식 29]와 같이 표현된다.The prior probability of a given tree T _i under the same condition as Equation 28 is expressed as Equation 29 below.

이 때,는 중간 노드(즉, 최종 노드가 아닌 모든 노드)의 집합이다.At this time, Is the set of intermediate nodes (ie all nodes that are not final nodes).

이제, 상술한 내용들을 이용하여 TIC를 계산해 보도록 하자.Now, let's calculate the TIC using the above description.

상술한 수식을 모두 정리하면, 아래의 [수학식 30]으로 최종 정리된다. 이때, 아래의 [수학식 30]은 일양 분포를 이용한 것이다.When all of the above-described equations are put together, the final equation (30) is given. At this time, Equation 30 below uses a solar distribution.

그리고, 위의 마지막 식에 log를 취한 값을 TIC로 정의한다. 즉, 나무 T_i의 TIC는 아래의 [수학식 31]과 같이 표현된다.Then, the logarithm of the last expression above is defined as TIC. That is, the TIC of the tree T _i is expressed as Equation 31 below.

그룹이 두 개인 경우, 즉, J = 2 인 경우의 TIC는 아래의 [수학식 32]와 같이 표현된다.In the case of two groups, that is, J = 2, the TIC is expressed as Equation 32 below.

이때,는 h 번째 최종 노드에 있는 자료 중, 두 번째 그룹에 속하는 자료의 수가 된다.At this time, Is the number of data in the second group among the data in the h-th last node.

상술한 바와 같이 정의한 TIC를 각각의 의사 결정 나무 T₁, ..., T_m에 적용하여 TIC가 최대가 되는 의사 결정 나무를 최적의 의사 결정 나무로 선택함으로서, 본 알고리즘은 종료된다.The algorithm is terminated by applying the TIC defined as described above to each decision tree T ₁ , ..., T _m to select the decision tree with the maximum TIC as the optimal decision tree.

한편, 종래의 베이지안 정리를 이용하는 방법과 본 발명에서 제시하는 TIC 방법은 사후 확률을 이용한다는 측면에서는 같은 발명이나, 사후 확률을 구할 때 사용되는 사전 확률의 구축에 있어서 차이가 있다. 그리고, 이러한 차이는 사후 확률의 계산에 많은 영향을 미친다. 즉, 종래의 베이지안 정리를 이용하는 방법에서는 사후 확률이 수식으로 계산되지 아니하며, 이를 컴퓨터를 사용하여 계산하는데, 그 계산 시간이 교차 확인을 사용하는 방법보다 훨씬 오래 걸린다.On the other hand, the conventional Bayesian theorem and the TIC method proposed in the present invention have the same invention in terms of the use of posterior probabilities, and there is a difference in the construction of the prior probabilities used in obtaining the posterior probabilities. And this difference greatly affects the calculation of the posterior probability. That is, in the conventional Bayesian theorem, the posterior probabilities are not calculated by the equation, and the calculation is performed using a computer, which takes much longer than the method using the cross check.

종래의 베이지안 정리를 이용하는 방법에서 사전 확률을 구축하는 방법은 가능한 모든 나무에 확률을 할당한다. 그런데, 가능한 모든 의사 결정 나무의 수는 엄청나게 많으므로, 사전 확률을 구축하는 방법 또한 매우 복잡하다. 그리고, 필연적으로, 사후 확률을 구하여야 하는 의사 결정 나무의 수도 크게 증가하게 되고, 이는 곧 계산량의 폭증으로 이어진다.In the conventional Bayesian theorem, the method of building prior probabilities assigns probabilities to all possible trees. However, the total number of possible decision trees is enormous, so how to build prior probabilities is also very complex. And, inevitably, the number of decision trees that need to find the posterior probabilities greatly increases, which leads to a surge in computation.

그러나, TIC 방법은 종래의 베이지안 정리를 이용하는 방법의 문제점을 해결한 것으로서, 사전 확률을 가능한 모든 의사 결정 나무에 할당하는 것이 아니라, 가지 치기 알고리즘으로부터 도출된 내포되는 의사 결정 나무에만 할당한다. 따라서, 사전 확률을 구축하는 방법이 매우 쉽고, 사후 확률의 계산 또한 간단해 진다는 효과가 있다.However, the TIC method solves the problem of using the conventional Bayesian theorem and does not assign prior probabilities to all possible decision trees, but only to nested decision trees derived from pruning algorithms. Therefore, there is an effect that the method of building the prior probabilities is very easy and the calculation of the post probabilities is also simplified.

즉, TIC 방법에서 사용하는 사전 확률 구축 방법은 자료를 이용하여 의사 결정 나무들의 집합을 줄이는 방법으로서, 이 부분이 종래의 베이지안 정리를 이용한 방법과 결정적으로 다른 부분이다.In other words, the prior probability building method used in the TIC method is a method of reducing the set of decision trees by using data, which is crucially different from the conventional Bayesian theorem.

정리하면, TIC를 이용하는 방법은 의사 결정 나무를 한번만 구축하면 되므로, 교차 확인을 이용하는 방법에 비하여 계산 속도가 비약적으로 향상된다. 또한, 그 결과도 같은 자료에는 항상 같게 되므로, 결과에 대한 신뢰도가 교차 확인 방법에 비하여 매우 뛰어나다.In summary, the method of using TIC only needs to build a decision tree once, so the computation speed is significantly improved compared to the method of using cross validation. In addition, the results are always the same for the same data, so the reliability of the results is much better than the cross-checking method.

아래의 [표 1]은 종래의 5 폴드 교차 확인 방법과 본 발명에서 제안하는 TIC를 이용한 최적 의사 결정 나무의 선택 방법의 시뮬레이션 결과를 보여준다.Table 1 below shows the simulation results of the conventional 5-fold cross-checking method and the optimal decision tree selection method using the TIC proposed in the present invention.

즉, 본 실험 데이터는 5 폴드 교차 확인을 통한 단일 나무(Single Tree)의 생성과 본 발명에서 제안하는 TIC를 이용한 싱글 트리의 생성 속도를 비교하기 위한 데이터이다.In other words, the present experimental data is for comparing the generation rate of a single tree using 5-fold cross-check and the generation rate of a single tree using the TIC proposed in the present invention.

각각의 실험 데이터는 평균 동일한 데이터를 반복 횟수 500 번씩 생성할 때의 평균 시간을 나타내며, 컴퓨터의 사양은 펜티엄 3 900 MHz, 메인 메모리 256 메가 바이트, 운영 체제는 윈도우 2000 이다.Each experimental data represents the average time to generate the same number of iterations 500 times, with a computer specification of Pentium 3 900 MHz, 256 megabytes of main memory and Windows 2000 operating system.

아래의 [표 1]에 의하면, 본 발명에서 제안하는 TIC 방법은 종래의 5 폴드 교차 확인 방법에 비하여 대략 1/5의 계산 시간만이 소요됨을 알 수 있다.According to Table 1 below, it can be seen that the TIC method proposed in the present invention takes only about 1/5 calculation time compared to the conventional 5-fold cross-checking method.

한편, 시뮬레이션 자료는 데이터마이닝에서 널리 알려져 있는 표준 자료에 해당하는 바, 각각 'Radius2', 'Interaction', 'Breast Cancer', 'Ionosphere' 및 'Sonar' 자료로서, 본 기술 분야에서는 데이터마이닝의 효율을 가늠하는 가장 유력한 시뮬레이션 자료이다. 본 시뮬레이션 자료는 'UC Irvine'의 'Machine Learning Web Site'(http://www1.ics.uci.edu/~mlearn/MLRepository.html)에 상세하게 나와 있다.Meanwhile, simulation data corresponds to standard data widely known in data mining, and are data of 'Radius2', 'Interaction', 'Breast Cancer', 'Ionosphere' and 'Sonar', respectively. It is the most influential simulation data. This simulation material is detailed in UC Irvine's Machine Learning Web Site (http://www1.ics.uci.edu/~mlearn/MLRepository.html).

[표 1]TABLE 1

데이터data 5폴드 교차확인5-fold cross check TICTIC 평균 소요 시간Average travel time 평균 소요 시간Average travel time radius2radius2 235.4 msec235.4 msec 43.2 msec43.2 msec interactioninteraction 228.6 msec228.6 msec 43.3 msec43.3 msec breast cancerbreast cancer 128.3 msec128.3 msec 25.6 msec25.6 msec ionosphereionosphere 182.9 msec182.9 msec 34.4 msec34.4 msec sonarsonar 259.3 msec259.3 msec 46.6 msec46.6 msec

3. 부스트랩(Boostrap) 자료 추출 방법3. Extracting Boost data

본 장에서는 본 발명이 이용하고 있는 부스트랩 자료의 추출 방법에 대하여 개략적으로 설명한다.This chapter outlines the method of extracting boost data used by the present invention.

먼저, 부스트랩 자료 생성의 전반적인 알고리즘은 다음과 같다.First, the overall algorithm for generating boost data is as follows.

원래의 데이터가 y₁, ..., y_n의 n 개의 데이터라고 하면, 생성하고자 하는 부스트랩 자료의 개수 m을 정한 후, 새로운 데이터 집합 D^B를 공집합으로 초기화한다.If the original data is _n data of y ₁ , ..., y _n , the number m of the boost data to be generated is determined, and then the new data set D ^B is initialized to the empty set.

그리고, 난수 발생기를 이용하여인 정수 k를 생성한 후,를로 할당하고,를에 추가하는데, 이 과정을 i = 1, ..., m 번 반복한다.And, using a random number generator After creating the integer k, To And assign it to To In addition, repeat this process i = 1, ..., m times.

도 10은 본 발명의 일 실시예에 따른 부스트랩 자료의 생성 과정을 나타낸 흐름도로서, 이를 설명하면 다음과 같다.10 is a flowchart illustrating a process of generating boost strap data according to an embodiment of the present invention.

(1) 스텝 S1001 : 입력되는 n 개의 다차원 자료 (x₁, y₁), ..., (x_n, y_n)에 대하여 w₁, ..., w_n의 가중치를 할당한다. 여기서 x_i는 p 차원의 설명 변수이다. 즉, x_i= (x_1i, ..., x_pi)이다.(1) Step S1001: Weights of w ₁ , ..., w _n are assigned to the _n multidimensional data (x ₁ , y ₁ ), ..., (x _n , y _n ) to be input. Where x _i is the explanatory variable in p dimension. That is, x _i = (x _1i , ..., x _pi ).

(2) 스텝 S1002 : y₁, ..., y_n에 대하여 할당된 w₁, ..., w_n의 가중치의 누적 가중치를 계산한다. 즉, 새로운 누적 가중치는 아래의 [수학식 33]과 같이 계산된다.(2) Step S1002: The cumulative weight of weights of w ₁ , ..., w _n assigned to y ₁ , ..., y _n is calculated. That is, the new cumulative weight is calculated as in Equation 33 below.

(3) 스텝 S1003 : 생성하고자 하는 새로운 데이터의 개수 m을 정한다. 본 발명에 따른 앙상블 알고리즘을 위한 부스트랩 자료의 개수 m은 n으로 한다.(3) Step S1003: The number m of new data to be created is determined. The number m of boost strap data for the ensemble algorithm according to the present invention is n.

(4) 스텝 S1004 : 새로운 데이터의 집합, D^B를 공집합으로 초기화한다.(4) Step S1004: Initialize the new data set, D ^B, into an empty set.

(5) 스텝 S1005 : 난수 발생기를 이용하여을 만족하는 실수 난수를 생성한다.(5) Step S1005: Using a random number generator Generates a real random number that satisfies

(6) 스텝 S1006 :, ...,중,를 만족하는 j를 결정한다. 이때, j = 1, ..., n 이다.(6) step S1006: , ..., medium, Determine j satisfying At this time, j = 1, ..., n.

(7) 스텝 S1007 :로 할당한다.(7) step S1007: To be assigned.

(8) 스텝 S1008 :에 해당하는 가중치으로 한다.(8) step S1008: Weight corresponding to It is done.

(9) 스텝 S1009 :를에 추가한다.(9) step S1009: To Add to

(10) 스텝 S1010 : i = 1, ..., m 번 반복한다.(10) Step S1010: i = 1, ... is repeated m times.

4. 본 발명에서 제안하는 앙상블 알고리즘의 배경 및 기본 원리4. Background and Basic Principle of Ensemble Algorithm

4-1. 서언4-1. Preface

본 장에서는 본 출원에서 제안하는 새로운 앙상블 알고리즘의 배경 및 기본 원리를 설명한다. 본 발명에 따른 앙상블 알고리즘은 캠 알고리즘(CHEM : Convex Hull Ensemble Machine)이라고 지칭하겠다.This chapter describes the background and basic principles of the new ensemble algorithm proposed in this application. The ensemble algorithm according to the present invention will be referred to as a cam algorithm (CHEM: Convex Hull Ensemble Machine).

캠 알고리즘은 여러 개의 기본 학습기를 이용하여 새로운 학습기를 생성하는앙상블 알고리즘이다.The cam algorithm is an ensemble algorithm that creates a new learner using several basic learners.

분류 문제(반응 변수가 범주형인 경우)나 회귀 모형(반응 변수가 연속형인 경우)에 있어서, 학습 문제를 함수 추정 문제로 바꿀 수 있다.For classification problems (when the response variable is categorical) or regression models (when the response variable is continuous), the learning problem can be turned into a function estimation problem.

반응 변수가 J 개의 범주를 가지는 분류 문제에서는 J 차원 함수 F = (F₁, ..., F_J)를 추정하는 문제인데, 이때, 상기 함수 F는 아래의 [수학식 34]와 같이 정의된다.In the classification problem where the response variable has J categories, the problem is to estimate the J-dimensional function F = (F ₁ , ..., F _J ), where the function F is defined as in Equation 34 below. .

또한, 회귀 모형 문제인 경우에는를 추정하는 문제로 된다.Also, if the problem is a regression model It becomes a problem of estimating.

함수 F의 참값을(참 학습기라 칭한다.)라고 하면,를 캠 알고리즘에서 사용되는 기본 학습기들의 집합이라고 하자. 즉, 주어진 학습 자료에 대하여 최적의 기본 학습기를 집합중의 하나로 선택한다. 한편, 최적의 기본 학습기를 찾는 방법은 종래에 널리 알려져 있다.The true value of function F If we say (true learner), Let be a set of basic learners used in the cam algorithm. That is, set the optimal basic learner for a given learning material Choose one of the Meanwhile, a method of finding an optimal basic learner is well known in the art.

데이터 마이닝에 사용되는 기본 학습기들의 집합로는 의사 결정 나무, 신경망 모형 등이 사용된다. 그러나, 이러한 기본 학습기들의 큰 문제점으로는 자료의 변화에 매우 민감하게 반응한다는 것이다. 이러한 기본 학습기들의 불안정성의 원인을 규명하고 이를 극복하기 위한 것이 본 발명에서 제안하는 캠 알고리즘인 것이다.Set of basic learners used for data mining Decision trees and neural network models are used. However, a major problem with these basic learners is that they are very sensitive to changes in data. The cam algorithm proposed in the present invention is to identify and overcome the cause of such instability of the basic learners.

종래의 기본 학습기들의 불안정성의 원인을 살펴보면, 다음과 같다.The causes of instability of the conventional basic learners are as follows.

4-2. 종래의 기본 학습기들의 불안정성의 원인4-2. Causes of instability of conventional basic learners

데이터 마이닝에 사용되는 여러 가지 알고리즘들이 매우 불안정하게 움직이는 이유는 기본 학습기 집합내부에 있는 서로 다른 많은 학습기들이 자료를 비슷하게 설명하기 때문이다.The reason why the various algorithms used in data mining are very unstable is the basic set of learners. Many different learners inside explain the material similarly.

전혀 다른 학습기들이 자료를 비슷하게 설명하는 보다 근본적인 이유는가 고려된 기본 학습기 집합에 포함되지 않기 때문이다. 또한,에 포함되는 학습기 중,와 거리(쉽게 말하면 다른 정도)가 작은 학습기들이 여러 개 존재하기 때문이다.A more fundamental reason why different learners describe the material similarly Set of basic learners with consideration Because it is not included. Also, Among the learners included in This is because there are several learners with a small distance from each other.

도 11은 종래의 기본 학습기들의 집합에 포함되는 학습기 중 참 학습기와 거리가 작은 학습기들이 다수개 존재함을 보여 주는 기본 개념도이다.FIG. 11 is a basic conceptual view illustrating that a plurality of learners having a small distance from a true learner are included in a conventional set of basic learners.

도 11에 도시되어 있듯이,의 여러 학습기들이를 둘러싸고 있는 모양이다. 이 경우에는, 자료가 조금만 변해도, 최적의 학습기가 크게 변할 수 있다. 즉, 자료가로부터 어느 방향으로 변하느냐에 따라, 최적의 학습기는,및중에 어느 하나가 될 것이다.As shown in FIG. Many learners It is the shape that surrounds. In this case, even a slight change in the material may significantly change the optimal learner. That is, the material Depending on which direction it is from, the optimal learner , And Will be either.

기본 학습기의 집합가 도 11에 도시된 바와 같이 위치하면, 아무리 최적의 학습기를 잘 구축하여도를 제대로 구축할 수 없다. 하지만, 여러 개의학습기를 결합하면를 구축할 수 있다. 그 이유는,가에 포함되는 학습기들의 컨벡스 헐(Convex Hull) 공간에 위치하기 때문이다. 특히, 도 11에서, 적절한 가중치 w₁, w₂, w₃를 구하면 아래의 [수학식 35]가 성립한다.Set of basic learners Is located as shown in Fig. 11, no matter how well constructed the best learner Cannot be built properly. However, combining multiple learners Can be built. The reason is that, end This is because they are located in the convex hull space of the learners included in the. In particular, in Fig. 11, the following equation (35) holds when the appropriate weights w ₁ , w ₂ and w ₃ are obtained.

상기 [수학식 35]의 의미를 살펴보면,를에 속하는 몇 개의 학습기들의 가중 평균으로 구할 수 있다는 것이다. 본 발명에서 제안하는 캠 알고리즘은 이러한 아이디어를 이용하여 개발된 알고리즘이다.Looking at the meaning of [Equation 35], To It can be obtained as the weighted average of several learners belonging to. The cam algorithm proposed in the present invention is an algorithm developed using this idea.

이하에서는 본 발명에서 제안하는 캠 알고리즘에 따른 학습기 결합 방법의 기본 원리에 대하여 자세히 설명한다.Hereinafter, the basic principle of the learner combining method according to the cam algorithm proposed in the present invention will be described in detail.

4-3. 참 학습기를 알 때, 캠 알고리즘에 따른 학습기 결합 방법의 기본 원리4-3. Knowing the true learner, the basic principle of how to combine learner according to cam algorithm

본 절에서는 참 학습기를 알고 이가에 포함되어지지 않을 때,의 여러 학습기들의 가중 평균으로를 구축하는 방법을 소개한다. 다음 절에서는,가 미지인 경우, 자료를 이용하여를 추정하는 방법을 소개한다.This section is a true learner Know this end When not included in, Weighted average of different learners Introduce how to build it. In the next section, If the image is unknown, Introduce how to estimate.

캠 알고리즘의 기본 가정은가에 포함되는 M 개의 학습기의가중 평균으로 표현된다는 것이다. 이를 나타낸 것이 아래의 [수학식 36]이다.The basic assumption of the cam algorithm is end M learners included in It is expressed as a weighted average. This is shown in Equation 36 below.

캠 알고리즘은 가중 평균에 쓰인 학습기와 가중치를 순차적으로 찾아가는 알고리즘이다. 캠 알고리즘에서 기존에 구축된 k 개의 학습기와 가중치를 이용하여 (k+1)번째 학습기과 가중치을 찾는 알고리즘의 원리를 단계별로 설명하면, 다음과 같다.Cam algorithm is used for weighted average And weights This algorithm searches sequentially. K learners built from Cam algorithm And weights (K + 1) th learner using And weights The principle of the algorithm to find step by step is as follows.

첫 번째로, 상기 [수학식 36]에 따른 현재의 앙상블 모형 F_k와 직교하는 학습기 중, 최적의 학습기를로 하고, 두 번째로, 새로운 앙상블 모형을 아래의 [수학식 37]에 의하여 생성하는데, 이 때, 가중치와은과 참 학습기와의 거리가 최소가 되게 구한다.First, of the learners orthogonal to the current ensemble model F _k according to [Equation 36], the best learner Secondly, a new ensemble model is generated by Equation 37 below, with weights Wow silver And true learner Find the minimum distance between and.

이러한 알고리즘을 보다 상세하게 설명하면, 다음과 같다.This algorithm is described in more detail as follows.

(1) 첫 번째 학습기 구축(1) Build your first learner

도 12a는 본 발명에서 제안하는 캠 알고리즘에 따라 첫 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이다.12A is a diagram conceptually illustrating a method of constructing a first learner according to a cam algorithm proposed in the present invention.

도 12a에 도시한 바와 같이, 최적의 학습기(즉,와 가장 가깝게 위치하는 학습기)을 구한다.As shown in Fig. 12A, an optimal learner (In other words, And the learner that is located closest to.

(2) 두 번째 학습기 구축(2) Build a second learner

도 12b는 본 발명에서 제안하는 캠 알고리즘에 따라 두 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이다.12B is a diagram conceptually illustrating a method for constructing a second learner according to the cam algorithm proposed by the present invention.

과 직교하며 최적인 학습기를 찾고 상기 [수학식 37]에 따른 앙상블 모형 F₂가와 가장 거리가 짧아지는 가중치과를 구한다. 이때,과는 각각과이다. 한편, 상기 [수학식 37]에 따른 앙상블 모형 F₂는 아래의 [수학식 38]과 같이 표현된다. Optimal learner orthogonal to Find the ensemble model F ₂ according to Equation 37 Shortest weight with and Obtain At this time, and Are each and to be. Meanwhile, the ensemble model F ₂ according to Equation 37 is expressed as Equation 38 below.

이때, d₁은와과의 거리이고, d₂는와와의 거리이다.Where d ₁ is Wow Distance from d ₂ Wow Is the distance from

(3) 세 번째 학습기 구축(3) Build a third learner

도 12c는 본 발명에서 제안하는 캠 알고리즘에 따라 세 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이다.12c conceptually illustrates a method of constructing a third learner according to the cam algorithm proposed in the present invention.

와 직교하며 최적인 학습기를 구한다. 그리고, 상기 [수학식 37]에따른 앙상블 모형 F₃중,와 거리가 가장 가깝게 하는 가중치와를 구한다. 한편, 상기 [수학식 37]에 따른 앙상블 모형 F₃는 아래의 [수학식 39]와 같이 표현된다. Optimal learner orthogonal to and Obtain And, in the ensemble model F ₃ according to [Equation 37], Weight closest to distance Wow Obtain Meanwhile, the ensemble model F ₃ according to Equation 37 is expressed as Equation 39 below.

그러면, 아래의 [수학식 40]이 성립한다.Then, Equation 40 below holds.

(4) m 번째 학습기 구축(4) building the m-th learner

도 12d는 본 발명에서 제안하는 캠 알고리즘에 따라 m 번째 학습기를 구축하는 방법을 개념적으로 도시한 도면이다.12D is a diagram conceptually illustrating a method of building an m-th learner according to the cam algorithm proposed in the present invention.

위의 알고리즘을 계속 반복함으로써, m 번째 앙상블 모형을 아래의 [수학식 41]과 같이 구한다.By repeating the above algorithm, the m th ensemble model is obtained as shown in Equation 41 below.

4-4. 참 학습기가 미지일 때, 캠 알고리즘에 따른 학습기 결합 방법의 기본 원리4-4. Fundamental principle of how to combine learner according to cam algorithm when true learner is unknown

모든 학습 문제에서는는 미지이고, 그 대신 n 개의 학습 자료이 주어진다. 학습의 목적은 자료를 이용하여를 효과적으로 추정하는 것이다. 본 절에서는 윗 절에서 설명한 알고리즘이 자료가 주어진 경우 어떻게 구성되는 가를 설명한다.In all learning questions Is unknown, instead n learning resources Is given. The purpose of learning is to use materials Effectively estimating This section describes how the algorithm described in the previous section is constructed given the data.

l을 주어진 손실 함수라 하고, 주어진 학습기의 디비언스를 아래의 [수학식 42]로 정의한다.Let l be the given loss function and given the learner The division of is defined by Equation 42 below.

범주형 자료인 경우에는 두 그룹인 경우만을 고려한다.For categorical data, only two groups are considered.

(1) 첫 번째 학습기 구축(1) Build your first learner

최적의 학습기을 구한다.Optimal learner Obtain

(2) 두 번째 학습기 구축(2) Build a second learner

과 직교이며 최적의 학습기를 찾는다. Orthogonal and find the best learner.

먼저, 직교인 학습기를 구하기 위하여는 잔차를 사용한다. 잔차 r_i는 반응 변수가 범주형인 경우에는 아래의 [수학식 43]으로 구할 수 있다.First, we use the residuals to find orthogonal learners. The residual r _i can be obtained from Equation 43 below when the response variable is categorical.

여기서, P₁은 첫 번째 앙상블 모형을 이용하여 y가 1일 확률이다.Where P ₁ is the probability that y is 1 using the first ensemble model.

반응 변수가 연속형인 경우에는 아래의 [수학식 44]로 구할 수 있다.When the response variable is continuous, it can be obtained by Equation 44 below.

범주형 자료인 경우에는 ||를 가중치로 하여 최적의 학습기를 구축한다. 연속형인 경우에는를 반응 변수로 하여 최적의 학습기를 구한다.For categorical data, | Optimal learner with weight | Build it. If continuous Optimal learner using as a response variable Obtain

한편, 잔차를 이용하여 최적의 학습기를 구하는 이유는, 회귀 모형에서 잔차는 반응 변수와 직교하는 성질이 있기 때문이다. 따라서, 잔차에 최적인 학습기는 반응 변수에 최적인 학습기과 거의 직교한다.On the other hand, the reason for finding the optimal learner using the residual is that the residual in the regression model is orthogonal to the response variable. Therefore, learner which is most suitable for residual Is the best learner for the response variable Is almost orthogonal to

이어서, 잔차를 이용하여 최적의 학습기를 구한 후를 최소로 하는 상수를 구한다. 그리고,=로 놓는다. 이때, 앙상블 모형은 상기 [수학식 37]에 의하여 결정된다. 즉, 아래의 [수학식 45]와 같이 된다.Next, the best learner using the residuals After finding Constant to minimize Obtain And, = Place it. At this time, the ensemble model is determined by Equation 37. That is, the following equation (45) is obtained.

, ,

한편, 통계 이론적으로는 근사적으로와와의 거리의 제곱이 된다.Meanwhile, statistically theoretical Is approximately Wow Squared with.

(3) 세 번째 학습기 구축(3) Build a third learner

세 번째 학습기 구축은 두 번째 학습기 구축 방법에서대신를 사용하여 잔차를 구하는 것 외에는 동일하다. 구하여진 세 번째 학습기에 대하여 앙상블 모형 F₃는 아래의 [수학식 46]과 같이 된다.Building a third learner is how to build a second learner instead The same is true except to find the residual using. Obtained Third Learner The ensemble model F ₃ is given by Equation 46 below.

, ,

(4) 위의 알고리즘을 계속 반복하여 m 번째 앙상블 모형 F_m을 아래의 [수학식 47]과 같이 구한다.(4) The above algorithm is repeated repeatedly to obtain the m th ensemble model F _m as shown in [Equation 47] below.

5. 두 그룹 분류 문제에서의 캠 알고리즘5. Cam Algorithm in Two Group Classification Problems

두 그룹 분류 문제에서 본 발명에서 제안하는 캠 알고리즘을 보다 상세하게 설명하면 다음과 같다.The cam algorithm proposed by the present invention in two group classification problems is described in more detail as follows.

도 13은 본 발명의 일 실시예에 따른 캠 알고리즘의 개요를 보여주는 흐름도이다.13 is a flowchart showing an outline of a cam algorithm according to an embodiment of the present invention.

먼저, 스텝 S1301에서, 각종 변수들을 초기화한 후, 스텝 S1302에서, 입력되는 다차원 자료들을 가중치를 이용하여 부스트랩 자료로 생성한다. 그리고, 스텝 S1303에서, 주어진 설명 변수에 대하여 반응 변수가 1이 될 확률을 기본 학습기를 이용하여 추정한 후, 스텝 S1304에서, 주어진 디비언스를 최소로 하는 수정 모수를 계산한다.First, in step S1301, after initializing various variables, in step S1302, the input multi-dimensional data is generated as a boost strap data using a weight. Then, in step S1303, the probability that the response variable is 1 with respect to the given explanatory variable is estimated using the basic learner, and then in step S1304, a correction parameter for minimizing the given division is calculated.

그리고, 스텝 S1305에서, 상기 수정 모수를 이용하여 수정된 학습기를 구축한 후, 스텝 S1306에서, 수정된 학습기에 기반하여 앙상블 모형을 구축한다.In step S1305, a modified learner is constructed using the corrected parameters, and then in step S1306, an ensemble model is constructed based on the corrected learner.

그리고, 스텝 S1307에서, 후술하는 스탑 규칙을 만족하는지 여부를 판단하여, 만족하지 아니하면, 상기 스텝 1302로 복귀하고, 만족하면, 종료한다.In step S1307, it is determined whether or not the stop rule described later is satisfied. If not, the process returns to the step 1302, and if it is satisfied, the process ends.

캠 알고리즘을 수학식 등을 사용하여 보다 상세하게 설명한다.The cam algorithm will be described in more detail using equations and the like.

(1) 제 1 단계 : 반응 변수 y_i를 자료가 그룹 2에 속하면 1로, 그룹 1에 속하면 0로 놓는다.(1) Step 1: Set the response variable y _i to 1 if the data belongs to group 2 and to 0 if it belongs to group 1.

(2) 제 2 단계 : 각종 변수들을 초기화한다. 즉, n 개의 가중치 w₁, ..., w_n을 w_i= 1/n으로, F(x) = 0로 놓는다.(2) Second step: initialize various variables. That is, n weights w ₁ , ..., w _{n are set} to w _i = 1 / n and F (x) = 0.

(3) 제 3 단계 : 상기 가중치 {w_i}를 이용하여 부스트랩 자료 (x₁ ^B, y₁ ^B), ..., (x_n ^B, y_n ^B)를 생성한다. 부스트랩 자료 생성에 대해서는 이미 상술한 바 있다.(3) Third step: Using the weight {w _i } to generate the boost data (x ₁ ^B , y ₁ ^B ), ..., (x _n ^B , y _n ^B ). The generation of boost strap data has already been described above.

(4) 제 4 단계 : 부스트랩 자료를 이용하여 주어진 설명 변수 x에 대하여 반응 변수가 1이 될 확률을 기본 학습기를 이용하여 추정한다.(4) Stage 4: The probability that the response variable will be 1 for the given explanatory variable x using the boost strap data Is estimated using the basic learner.

(5) 제 5 단계 : 주어진 손실 함수 l에 대하여 아래의 [수학식 48]을 최소로 하는 수정 모수를 계산한다.(5) Step 5: A correction parameter for minimizing [Equation 48] below for a given loss function l Calculate

이때,이다.At this time, to be.

(6) 제 6 단계 : 상기 수정 모수를 이용하여 학습기를 새로 수정하여 재구축한다. 이 과정을 수식으로 나타내면, 아래의 [수학식 49]와 같이 된다.(6) Step 6: Rebuild the learner by using the modified parameters. When this process is expressed by an equation, it is expressed as Equation 49 below.

로 놓는다. Place it.

로 구한다. Obtain as

으로 갱신한다. Update with

가중치를로 갱신한다.Weights Update to.

이 때,이다.At this time, to be.

(7) 제 7 단계 : 앙상블 모형을 최종적으로 구축하기 위하여 상기 제 3 단계 내지 제 6 단계를 m = 1, ..., M 번 반복한다. 그리고, 최종 앙상블 모형을 H(x)=F(x)로 하여, 새로운 반응 변수 x에 대하여 H(x)>0이면 그룹 2에, H(x)<0이면그룹 1에 할당한다.(7) Step 7: Repeating the third to sixth steps m = 1, ..., M times to finally build the ensemble model. The final ensemble model is set to H (x) = F (x) and assigned to group 2 if H (x)> 0 for the new response variable x and to group 1 if H (x) <0.

한편, 상기 손실 함수로는 여러 가지 손실 함수가 사용될 수 있지만, 보다 양호한 결과를 얻기 위해서는 익스포넨셜(Exponential) 손실 함수 또는 로그 우도(Negative Log-Likelihood) 손실 함수가 사용될 수 있다.Meanwhile, various loss functions may be used as the loss function, but an exponential loss function or a negative log-likelihood loss function may be used to obtain a better result.

아래의 [수학식 50]은 익스포넨셜 손실 함수이고, [수학식 51]은 로그 우도 손실 함수를 나타낸 것이다.Equation 50 below is an exponential loss function, and Equation 51 represents a log likelihood loss function.

또한, 본 발명에서는 부스트랩을 사용하지 않고, 가중치를 이용한 기본 학습기를 생성할 수도 있다. 그러나, 대부분의 경우 부스트랩을 이용하는 것이 훨씬 좋은 성능을 나타낸다.In addition, in the present invention, a basic learner using a weight may be generated without using a boost strap. However, in most cases, using boost straps is much better.

또한, 본 발명에서 사용되는 기본 학습기로는 여러 가지가 쓰일 수 있는데, 본 실시예에서는 의사결정나무를 사용하였다. 종래의 부스팅 알고리즘과는 달리, 캠 알고리즘에서는 기본 학습기들이 강한 학습기이다. 따라서 단순한 의사 결정 나무가 아니라 의사 결정 나무 구축의 전 과정을 거친 최적의 의사 결정 나무를 사용한다. 이때, 계산상의 문제점을 극복하기 위하여 TIC를 사용한다. 이는 이미 상술한 바 있다.In addition, a variety of basic learners used in the present invention can be used, in this embodiment a decision tree was used. Unlike conventional boosting algorithms, the basic learners are strong learners in the cam algorithm. Therefore, instead of using a simple decision tree, we use an optimal decision tree that goes through the entire process of constructing a decision tree. At this time, TIC is used to overcome the computational problems. This has already been described above.

캠 알고리즘을 두 개 이상의 분류 문제로 확장하는 알고리즘은 다음과 같다.An algorithm that extends the cam algorithm to two or more classification problems is as follows.

6. 멀티 클래스 분류 문제에서의 캠 알고리즘6. Cam Algorithm in Multiclass Classification Problems

멀티 클래스로 확장된 캠 알고리즘의 경우에도 그 개요는 도 13의 과정을 따른다. 다만, 적용되는 수식 등이 두 그룹 분류 문제에서의 캠 알고리즘과 약간씩의 차이를 보이는 바, 이를 상세하게 설명하면, 다음과 같다.Even in the case of the cam algorithm extended to the multi-class, the overview follows the process of FIG. However, the applied formulas and the like are slightly different from the cam algorithms in the two group classification problems, which will be described in detail as follows.

(1) 제 1 단계 : 각종 변수들을 초기화한다. 즉, 가중치, i = 1, ..., n, j = 1, ..., J, F_j(x) = 0, j = 1, ..., J로 놓는다.(1) Step 1: Initialize various variables. That is, weight , i = 1, ..., n, j = 1, ..., J, F _j (x) = 0, j = 1, ..., J.

(2) 제 2 단계 : i 번째 자료가 j 그룹에 포함되면 y_i ^*를 1로 놓고, 포함되지 아니하면 0로 놓는다.(2) Step 2: If y-th data is included in j group, set y _i ^* to 1, otherwise to 0.

(3) 제 3 단계 : 가중치 {w_1j, ..., w_nj}를 이용하여 부스트랩 자료 (x₁ ^B, y₁ ^*B), ..., (x_n ^B, y_n ^*B)를 생성한다. 부스트랩 자료 생성에 대해서는 이미 상술한 바 있다.(3) Step 3: _Booststrap data (x ₁ ^B , y ₁ ^{* B} ), ..., (x _n ^B , y _n ^{* B} ) using weights {w _1j , ..., w _nj } Create The generation of boost strap data has already been described above.

(5) 제 5 단계 :로 놓는다.(5) 5th step: Place it.

(6) 제 6 단계 : 상기 제 2 단계 내지 제 5 단계를 j = 1, ..., J 번 반복한다.(6) Sixth step: repeating the second to fifth steps j = 1, ..., J times.

(7) 제 7 단계 : 주어진 손실 함수 l에 대하여 아래의 [수학식 52]를 최소로 하는 수정 모수를 계산한다.(7) Step 7: Corrected parameter that minimizes Equation 52 below for a given loss function l Calculate

이때,이고,이다.At this time, ego, to be.

(8) 제 8 단계 : 상기 수정 모수를 이용하여 학습기를 새롭게 수정하여 구축한다. 이 과정을 수식으로 나타내면, 아래의 [수학식 53]과 같이 된다.(8) Eighth Step: The learner is newly modified by using the modified parameters. If this process is expressed by an equation, Equation 53 is given below.

로 놓는다. Place it.

로 구한다. Obtain as

으로 갱신한다. 이때,이다. Update with At this time, to be.

가중치를로 갱신한다.Weights Update to.

이때,이고,는 i 번째 관측치가 j 번째 그룹에 속하면 1이고, 아니면 0이다.At this time, ego, Is 1 if the i th observation belongs to the j th group, or 0 otherwise.

(9) 제 9 단계 : 상기 제 2 단계 내지 제 8 단계를 m = 1, ..., M 번 반복함으로써, 앙상블 모형을 최종적으로 구축한다. 이때, 새로운 설명 변수 x에 대하여그룹에 할당한다.(9) Ninth Step: The second to eighth steps are repeated m = 1, ..., M times to finally construct an ensemble model. In this case, for the new explanatory variable x Assign to a group.

캠 알고리즘을 연속형 변수 문제로 확장하는 알고리즘은 다음과 같다.The algorithm that extends the cam algorithm to the continuous variable problem is as follows.

7. 연속형 변수 문제에서의 캠 알고리즘7. Cam Algorithm for Continuous Variable Problems

본 장에서는 반응 변수가 연속형인 경우에 앙상블 모형을 만드는 알고리즘 (Regression CHEM)을 설명한다. 반응 변수가 연속형인 경우를 회귀 모형이라 하며, 그 기본 모형은 다음과 같다.This chapter describes an algorithm (Regression CHEM) that creates an ensemble model when the response variable is continuous. The case where the response variable is continuous is called a regression model. The basic model is as follows.

도 14는 본 발명의 일 실시예에 따른 연속형 변수를 위한 앙상블 알고리즘을 개략적으로 도시한 흐름도로서, 이를 상세히 설명하면 다음과 같다.14 is a flowchart schematically illustrating an ensemble algorithm for a continuous variable according to an embodiment of the present invention, which will be described in detail as follows.

먼저, 스텝 S1401에서, 각종 변수들을 초기화하고, 반응 변수를 정의한 후, 스텝 S1402에서, 기본 학습기를 이용하여 회귀 모형을 구축한다. 이어서, 스텝 S1403에서, 주어진 디비언스를 최소로 하는 수정 모수를 계산한 후, 스텝 S1404에서, 상기 수정 모수를 기반으로 새로운 반응 변수를 갱신하며, 스텝 S1405에서, 갱신된 반응 변수를 토대로 앙상블 모형을 구축한다.First, in step S1401, various variables are initialized and response variables are defined. In step S1402, a regression model is constructed using a basic learner. Then, in step S1403, after calculating a correction parameter that minimizes the given division, in step S1404, a new response variable is updated based on the correction parameter, and in step S1405, an ensemble model is generated based on the updated response variable. Build.

그리고, 스텝 S1406에서, 후술하는 스탑 규칙을 만족하는지 여부를 판단하여, 만족하지 아니하면, 상기 스텝 S1402로 복귀하고, 만족하면, 종료한다.In step S1406, it is determined whether or not the stop rule described later is satisfied. If not, the process returns to step S1402, and when it is satisfied, the process ends.

이를 보다 상세하게 설명하면 다음과 같다.This will be described in more detail as follows.

먼저, 입력되는 n 개의 다차원 자료인 학습 자료 (x₁, y₁), ..., (x_n, y_n)이 주어졌다고 가정하자. 여기서, x_i는 p 차원의 설명 변수, 즉, x_i= (x_1i, ..., x_pi)이고, 반응 변수는 y_i이다.First, suppose that _n multidimensional data to be input are learning data (x ₁ , y ₁ ), ..., (x _n , y _n ). Where x _i is an explanatory variable of p dimension, that is, x _i = (x _1i , ..., x _pi ) and the response variable is y _i .

본 알고리즘의 목적은 n 개의 학습 자료를 이용하여 설명 변수로 반응 변수를 가장 잘 설명하는 관계를 찾는 것이다. 다시 말하면, 학습 자료들을 이용하여 최적의 함수 H : R^p→ R을 만드는 것이다. 그리고, 새로운 설명 변수 x가 주어지면, 이 자료의 반응 변수를 H(x)로 추정한다.The purpose of this algorithm is to find the relationship that best describes the response variable as the explanatory variable using n learning materials. In other words, we use the training materials to create the optimal function H: R ^p → R. And given the new explanatory variable x, we estimate the response of this data as H (x).

한편, 통계 이론적으로 볼 때,이다. 즉, 조건부 기대값을 추정하는 것이 회귀 분석의 목적이다.On the other hand, in statistical theory, to be. In other words, estimating conditional expectation is the purpose of regression analysis.

연속형 변수를 위한 캠 알고리즘은 다음과 같다.The cam algorithm for continuous variables is

(1) 제 1 단계 : 새로운 반응 변수 z_i= y_i로 놓는다.(1) Step 1: Set a new response variable z _i = y _i .

(2) 제 2 단계 : 반응 변수를 z_i, 설명 변수를 x_i로 하여 회귀 모형를 기본 학습기를 이용하여 구한다.(2) Step 2: Regression model with response variable z _i and explanatory variable x _i Obtain using the basic learner.

(3) 제 3 단계 : 주어진 손실 함수 l에 대하여 아래의 [수학식 54]를 최소로 하는 수정 모수를 찾는다.(3) Step 3: A correction parameter that minimizes Equation 54 below for a given loss function l Find it.

(4) 제 4 단계 : 상기 수정 모수를 이용하여 새로운 반응 변수를 갱신한다. 이를 나타낸 것이 아래의 [수학식 55]이다.(4) fourth step: the correction parameters Use to update the new response variable. This is shown in Equation 55 below.

로 놓는다. Place it.

를 계산한다. Calculate

로 갱신한다. Update to.

새로운 반응 변수로 갱신한다.New response variable Update to.

(5) 제 5 단계 : 상기 제 2 단계 내지 제 4 단계를 m = 1, ..., M 번 반복한다.(5) 5th step: repeating said 2nd step-4th step m = 1, ..., M times.

(6) 최종 앙상블 모형을 H(x) = F(x)로 구축한다.(6) Construct the final ensemble model as H (x) = F (x).

한편, 상기 손실 함수 l은또는등의 알려져 있는 여러 가지 손실 함수등이 사용될 수 있다.Meanwhile, the loss function l is or Various known loss functions can be used.

8. 스탑(Stop) 규칙8. Stop Rules

본 장에서는 최종 앙상블 모형에 필요한 기본 학습기의 개수를 정하는 알고리즘을 설명한다. 본 스탑 규칙의 기본 아이디어는 현재 앙상블 모형의 디비언스가 가장 작을 때, 더 이상 앙상블을 갱신하지 아니하고, 전체 알고리즘을 정지시키는 것이다.This chapter describes the algorithm that determines the number of basic learners required for the final ensemble model. The basic idea behind this stop rule is to stop updating the ensemble and stop the entire algorithm when the current ensemble model's division is the smallest.

스탑 규칙을 설명하면, 다음과 같다.The stop rule is described as follows.

(1) 제 1 단계 : 양의 정수 K 값을 정한다.(1) Step 1: Determine a positive integer K value.

(2) 제 2 단계 : F_m을 처음 m 개의 기본 학습기로 구축된 앙상블 모형이라하고, 주어진 손실 함수 l에 대하여 앙상블 모형 F_m의 디비언스(Deviance)를 아래의 [수학식 56]에 따라 계산한다.(2) Step 2: F _m is called an ensemble model constructed with the first m basic learners, and for the given loss function l, the deviation of the ensemble model F _m is calculated according to Equation 56 below. do.

(3) 제 3 단계 :,로 놓고,를 만족시키는 최초의 m에 대하여 앙상블 모형 F_m을 최종 앙상블 모형으로 정하고, 알고리즘을 정지시킨다.(3) Third step: , Put it on, The ensemble model F _m is defined as the final ensemble model for the first m that satisfies, and the algorithm is stopped.

한편, 양의 정수 K는 사용자가 정의하는 값이다.On the other hand, the positive integer K is a value defined by the user.

아래의 [표 2]는 여러 자료에서 캠 알고리즘에 스탑 규칙을 적용하지 않았을 때와 적용하였을 때의 예측력을 비교한 시뮬레이션 자료이다. 아래의 [표 2]에 도시되어 있듯이 두 예측력은 거의 비슷하게 나온다. 따라서, 스탑 규칙을 적용하면, 적은 수의 기본 학습기를 이용하여 최적의 앙상블 모형을 구축할 수 있으며, 이를 통하여 계산 속도가 크게 향상되는 결과가 도출된다.[Table 2] below is a simulation data comparing the predictive power when the stop rule is not applied to the cam algorithm and when it is applied to various data. As shown in Table 2 below, the two predictive powers are almost the same. Therefore, when the stop rule is applied, an optimal ensemble model can be constructed using a small number of basic learners, and this results in a large improvement in the calculation speed.

[표 2]TABLE 2

9. 연관성 규칙 생성 알고리즘9. Association rule generation algorithm

9-1. 서언9-1. Preface

본 장에서는 최종 앙상블 모형을 만들 때 쓰인 기본 학습기를 이용하여 자료를 설명할 수 있는 다양한 종류의 연관성 규칙을 찾아내는 알고리즘을 설명한다.This chapter describes algorithms to find various kinds of association rules that can be used to describe data using the basic learner used to create the final ensemble model.

한편, 본 장에서 제시하는 연관성 규칙 생성 알고리즘은 본 발명에서 제시하는 캠 알고리즘에 의하여 구축된 앙상블 모형 뿐만 아니라, 종래의 앙상블 모형 구축 방법에도 적용될 수 있다.On the other hand, the association rule generation algorithm presented in this chapter can be applied not only to the ensemble model constructed by the cam algorithm proposed in the present invention but also to a conventional ensemble model construction method.

9-2. 연관성 규칙 생성 알고리즘9-2. Association rule generation algorithm

연관성 규칙을 찾는 알고리즘은 다음과 같다.The algorithm for finding the association rule is

도 15는 연관성 규칙 생성 알고리즘의 개요를 나타내는 흐름도이다.15 is a flowchart showing an outline of an association rule generation algorithm.

(1) 스텝 S1501 : 각종 변수들을 초기화한다. 즉, 반응 변수가 범주형 데이터이면, 관심이 있는 그룹을 g로, 최소 허용 자료수를 m으로, 최소 허용 신뢰도 p로 결정한다. 또한, 반응 변수가 연속형인 경우에는 그룹 g 대신에 관심 영역 (g_L, g_U)을 결정한다.(1) Step S1501: Initialize various variables. That is, if the response variable is categorical data, determine the group of interest as g, the minimum allowable data number as m, and the minimum allowable reliability p. In addition, if the response variable is continuous, the region of interest (g _L , g _U ) is determined instead of the group g.

(2) 스텝 S1502 : 기본 규칙의 총집합 S를 구축한다. 이를 보다 상세히 설명하면 다음과 같다.(2) Step S1502: Construct the total set S of the basic rules. This will be described in more detail as follows.

기본 학습기를 검색하여 앙상블에 사용되었던 모든 기본 학습기의 모든 노드 중 포함하는 자료의 수가 m 보다 크고, 그룹 g의 확률이 p 보다 큰 모든 노드를 선택한다.The basic learner is searched and selects all nodes of all the basic learners used in the ensemble with more than m and more than p probability of group g.

이때, 두 그룹 분류 문제인 경우에는 앙상블 횟수만큼의 기본 학습기가 존재하므로, 모든 기본 학습기를 검색하여 조건에 맞는 모든 노드를 선택한다.At this time, in the case of two group classification problems, since there are as many basic learners as the ensembles, all the basic learners are searched and all nodes matching the conditions are selected.

만일, 두 그룹 이상의 분류 문제인 경우에는 앙상블 횟수와 그룹의 수를 곱한 만큼의 기본 학습기가 존재하며, 이들 중 관심 그룹 g에 해당하는 앙상블 횟수만큼의 기본 학습기를 이용하여 조건에 맞는 모든 노드를 선택한다.If the classification problem is more than two groups, there are basic learners multiplied by the number of ensembles and the number of groups, and among all of them, all nodes satisfying the condition are selected using the basic learners corresponding to the number of ensembles of interest group g. .

상술한 방법으로 선택된 노드를 기본 규칙의 총집합 S로 한다.The node selected in the above-described method is the total set S of the basic rules.

(3) 스텝 S1503 : 조건에 맞게 선택되어진 모든 노드들에 대하여 해당 노드보다 상위 노드들은 규칙의 집합에서 제거한다. 이를 상세히 설명하면 다음과 같다.(3) Step S1503: For all nodes selected according to the condition, nodes higher than the corresponding node are removed from the set of rules. This will be described in detail as follows.

모든 규칙에 대하여(i = 1, ..., N) 다음을 N 번 반복한다.All rules For (i = 1, ..., N), repeat the following N times.

를 선택하고, 선택된 s_i에 대하여 k = 1, ..., N 까지의 s_k노드가 s_i의 상위 노드이면, s_k를 S에서 제거한다. If s _k nodes up to k = 1, ..., N for the selected s _i are upper nodes of s _i , then remove s _k from S.

(4) 스텝 S1504 : S에 포함된 모든 규칙들에 대한 신뢰도를 계산한다. 이때, 각 조건의 신뢰도는 해당 규칙에 속하는 데이터의 전체 개수를 n, 이 중 관심있는 그룹(즉, 그룹 g)에 속하는 자료의 수를 n_g라 하면, 신뢰도는 n_g/n으로 계산된다.(4) Step S1504: The reliability for all the rules included in S is calculated. In this case, when the reliability of each condition is n, the total number of data belonging to the rule is n, and the number of data belonging to the group of interest (that is, group g) is n _g , and the reliability is calculated as n _g / n.

(5) 스텝 S1505 : 계산된 S 집합의 규칙들을 정렬한다. 즉, 계산된 S 집합의 규칙을 신뢰도가 높은 것부터 낮은 것의 순으로 정렬한다. 이때, 정렬되어진규칙들을 o₁, ..., o_H라 한다.(5) Step S1505: Arrange the rules of the calculated S set. In other words, the calculated S set rules are sorted in order of high reliability to low reliability. In this case, the sorted rules are referred to as o ₁ , ..., o _H.

(6) 스텝 S1506 : 연관성 규칙 집합을 R이라 하자. R이 공집합이면, o_h를 집합 R에 추가하고, R이 공집합이 아니면, 집합 R에 포함된 모든 연관성 규칙들에 대하여 o_h와의 유사성을 비교한다. R에 포함된 모든 규칙과 유사하지 아니하면, 노드 o_h를 R에 추가한다.(6) Step S1506: Assume that the association rule set is R. If R is empty, add o _h to set R, and if R is not empty, compare the similarity with o _h for all the association rules contained in set R. If not similar to all the rules contained in R, add node o _h to R.

한편, 유사성 비교 방법은 다음과 같다.Meanwhile, the similarity comparison method is as follows.

설명 변수 x = (x₁, ..., x_p)에 대하여 주어진 두 개의 규칙 o와 r은 아래의 [수학식 57]과 같이 정의된다고 하자.The two rules o and r given for the explanatory variable x = (x ₁ , ..., x _p ) are defined as in Equation 57 below.

이때, x_i가 R_oi에 포함되는 자료들의 집합을 D_o, x_i가 R_ri에 포함되는 자료들의 집합을 D_r이라 하자. 한편, 이 경우 R_oi와 R_ri는 R의 부분 집합이다.At this time, x _i is the set of data D _o, x _i included in R _r D _oi Let the set of data contained in R _ri. In this case, R _oi and R _ri are a subset of R.

먼저, 최대 허용 유사성을 결정한 후,에 포함되는 자료의 수를에 포함되는 자료의 수로 나눈 값이 s 보다 크거나 같으면, x_i에 대하여 두 개의 조건 o와 r은 유사하다고 판정하고, s보다 작으면, x_i에 대하여 두 개의 조건 o와 r은 유사하지 않다고 판정한다. 이러한 과정을 i = 1, ..., p 번 반복한다.First, maximum allowable similarity After determining The number of materials included in If dividing by the number of data contained in is greater than or equal to s, two conditions o and r are determined to be similar for x _i , and if less than s, two conditions o and r are not similar to x _i . Determine. This process is repeated i = 1, ..., p times.

그리고, 모든 x_i에 대하여 유사성 판정의 결과가 모두 유사하다라고 판정되어 지면, 규칙 o와 규칙 r은 유사하다라고 판정하며, 어떠한 x_i라도 유사하지 아니하다라고 판정되면, o와 r은 유사하지 않다고 판정한다.And if it is determined that the results of the similarity determination are all similar for all x _i , it is determined that rule o and r are similar, and if it is determined that no x _i is not similar, o and r are not similar. It is determined that it is not.

(7) 스텝 S1507 : 연관성 규칙 집합 R에 포함되어 있는 모든 규칙들을 사용하여 신뢰도 순으로 자료를 해석한다.(7) Step S1507: Data are interpreted in order of reliability using all the rules included in the association rule set R. FIG.

9-3. 연관성 규칙 생성 알고리즘의 성능 실험9-3. Performance Experiment of Association Rule Generation Algorithm

여기에서는 연관성 규칙 생성 알고리즘의 해석력을 보기 위한 실험 자료를 설명한다. 대비되는 종래 기술로는 CART 알고리즘을 선택한 바, CART에서 구축된 하나의 의사 결정 나무를 연관성 규칙 생성 알고리즘에 적용시켰다. 또한, 실제 자료는 German 데이터를 이용하여 실험하였다.This section describes experimental data to show the interpretation power of the association rule generation algorithm. As a contrasting prior art, the CART algorithm was selected, and one decision tree constructed in CART was applied to the association rule generation algorithm. In addition, the actual data were tested using German data.

실제 데이터의 연관성 규칙 결과는 아래의 [표 3] 및 [표 4]에 정리되어 있다.The results of the association rules of the actual data are summarized in [Table 3] and [Table 4] below.

[표 3] CART를 이용한 연관성 규칙의 검색 결과[Table 3] Search Results of Association Rules Using CART

[표 4] 캠 알고리즘에 연관성 규칙 알고리즘을 적용한 검색 결과[Table 4] Search results applying the association rule algorithm to the cam algorithm

German 데이터는 1,000 명의 신용 거래 현황 자료를 기반으로 700 명의 우량 신용 고객과 300 명의 불량 신용 고객으로 구성된 데이터이며, 연관성 규칙의 정확한 비교를 위하여 동일한 조건의 최소 허용 자료수와 최소 허용 신뢰도를 사용하여 분석하였다. 불량 신용 고객의 자료를 분석하기 위한 최소 허용 자료수는 50 명(5%), 최소 허용 신뢰도는 50 %로 하였으며, 우량 신용 고객의 자료를 분석하기 위한 최소 허용 자료수는 50 명(5 %), 최소 허용 신뢰도는 85 %로 하였다.German data consists of 700 good and 300 bad credit customers, based on 1,000 credit transactions, and is analyzed using the minimum number of allowable data and the minimum acceptable reliability under the same conditions for accurate comparison of association rules. It was. The minimum allowable number of data for analyzing bad credit customers 'data was 50 people (5%) and the minimum allowable reliability was 50%. The minimum allowable data for analyzing good credit customers' data was 50 people (5%). The minimum acceptable reliability was 85%.

위의 조건으로 검색되어진 CART의 연관성 규칙 검색 결과는 1 개의 불량 신용 고객 그룹과 1 개의 우량 신용 고객군으로 검색이 되었으며, 캠 알고리즘에서 연관성 규칙을 적용한 예는 5 개의 불량 신용 고객 그룹과 4 개의 우량 신용 고객 그룹으로 나타났다.The relevance rule search result of CART was searched by 1 bad credit customer group and 1 good credit customer group. Appeared as a customer group.

캠 알고리즘의 연관성 규칙은 CART의 연관성 규칙을 포함하는 광범위한 검색에 해당하며, 캠 알고리즘의 결과는 CART의 결과를 포함하는 연관성 규칙을 찾아냄을 알 수 있다. 또한, CART에 의하여 검색되어진 연관성 규칙에 해당하는 데이터는 하나의 기본 학습기로부터 생성되어 있기 때문에 서로간에 배반적인 데이터로 구성이 되는 반면, 본 발명에서 제시하는 연관성 규칙 알고리즘은 규칙이 포함하는 데이터들이 서로 배반이 아닌 집합으로 나타난다.It can be seen that the relevance rule of the cam algorithm corresponds to an extensive search including the relevance rule of CART, and the result of the cam algorithm finds an association rule that includes the result of the CART. In addition, since the data corresponding to the association rule retrieved by the CART is generated from one basic learner, the data is composed of mutually exclusive data, whereas the association rule algorithm proposed in the present invention includes data included in the rule. It appears as a set, not a betrayal.

다시 말하면, 캠 알고리즘은 여러 개의 기본 학습기를 이용하여 연관성 규칙을 찾아내므로, 하나의 기본 학습기에 의존하는 CART에 비하여 매우 다양한 종류의 연관성 규칙을 찾아 낼 수 있으며, 이를 통하여 자료를 보다 입체적으로 해석할 수 있게 된다.In other words, the cam algorithm finds association rules by using several basic learners, and thus, it is possible to find a wide variety of association rules, compared to CART, which relies on one basic learner. You can do it.

10. 실험을 통한 앙상블 기법의 성능 비교10. Performance Comparison of Ensemble Techniques

본 장에서는 각종 실험을 통하여 여러 가지 앙상블 기법의 성능을 비교하여 본다.This chapter compares the performance of various ensemble techniques through various experiments.

10-1. 가상 실험을 통한 성능 비교10-1. Performance comparison through virtual experiments

가상 실험에는 다음과 같은 모형을 사용한다.The following model is used for the virtual experiment.

(1) 모형 1 : Radius 2(1) Model 1: Radius 2

학습 자료 수는 1,000 개이고, 테스트 자료 수는 5,000 개이며, 그룹의 수는 2이다.The number of learning materials is 1,000, the number of test materials is 5,000, and the number of groups is two.

설명 변수 : x = (x₁, ..., x₁₀)이고, 이들은 각각 독립이며 표준 정규 분포를 따른다.Explanatory variables: x = (x ₁ , ..., x ₁₀ ), each of which is independent and follows a standard normal distribution.

반응 변수 :이면, 확률 0.9로 y = 1이고, 확률 0.1로 y = -1로 한다. 또한,이면, 확률 0.9로 y = -1이고, 확률 0.1로 y = 1로 한다. 여기서 c는를 만족하는 상수이다.Reaction Variables: If so, the probability is 0.9 and y is 1, and the probability is 0.1 and y is -1. Also, If so, the probability is 0.9 and y = -1, and the probability is 0.1 and y = 1. Where c is Constant that satisfies

(2) 모형 2 : Interaction(2) Model 2: Interaction

학습 자료 수는 1,000 개이고, 테스트 자료수는 5,000 개이며, 그룹의 수는 2 이다.The number of learning materials is 1,000, the number of test materials is 5,000, and the number of groups is 2.

반응 변수 : 반응 변수는를 따르는 0-1 변수이고, F(x)는 아래의 [수학식 58]과 같다.Reaction Variable: The response variable Is a 0-1 variable, and F (x) is expressed by Equation 58 below.

(3) 모형 3 : Two Normal(3) Model 3: Two Normal

처음 500 개의 자료는 그룹 1에 속하고, 설명 변수 x = (x₁, ..., x₁₀)이며, 이들은 각각 독립이고 표준 정규 분포를 따른다. 나머지 500 개 자료는 그룹 2에 속하며, 설명 변수 x = (x₁, ..., x₁₀)이며, 이들은 각각 독립이고 평균이 0이며 분산이 2인 정규 분포를 따른다.The first 500 data belong to group 1 and the explanatory variables x = (x ₁ , ..., x ₁₀ ), each of which is independent and follows a standard normal distribution. The remaining 500 data belong to group 2, with explanatory variables x = (x ₁ , ..., x ₁₀ ), each following a normal distribution with independent, average 0, and variance 2.

(4) 모형 4 : Simple Quadratic(4) Model 4: Simple Quadratic

설명 변수 x는 표준 정규 분포를 따른다.The explanatory variable x follows the standard normal distribution.

반응 변수는 아래의 [수학식 59]를 따르는 0-1 변수이고, F(x) = -x²+ 2 로 주어진다.The response variable is a 0-1 variable according to Equation 59 below, and is given by F (x) = -x ² + 2.

도 16a 내지 도 16d는 이러한 4 개의 가상 실험 결과를 보여 주는 그래프이다.16A-16D are graphs showing the results of these four virtual experiments.

도 16a 내지 도 16d를 보면 알 수 있듯이, 본 발명에서 제시하는 캠 알고리즘이 매우 안정적으로 작동함을 볼 수 있다. 특히, 모형 4에서는 다른 모든 앙상블 알고리즘은 나무의 수가 증가하면서, 성능이 나빠지지만, 캠 알고리즘의 경우에는 이러한 문제가 전혀 발생하지 아니한다.As can be seen from Figures 16a to 16d, it can be seen that the cam algorithm proposed in the present invention operates very stably. In particular, in Model 4, all other ensemble algorithms perform poorly as the number of trees increases, but this problem does not occur at all for the cam algorithm.

아래의 [표 5]는 상기 도 16a 내지 도 16d의 결과를 수치로 표현한 도표이다.Table 5 below is a table representing the results of FIGS. 16A to 16D numerically.

[표 5] 가상 실험의 결과[Table 5] Results of the virtual experiment

한편, 대부분의 모형에서 캠 알고리즘의 예측력이 종래의 앙상블 알고리즘에 비하여 보다 우수함을 알 수 있다. 예측력뿐만 아니라 디비언스(Deviance)를 비교하여 보면, 캠 알고리즘이 종래의 알고리즘에 비하여 훨씬 우수함을 알 수 있다.즉, 캠 알고리즘 이외의 종래 알고리즘에서는 디비언스가 계속 증가하는데, 이는 종래의 앙상블 알고리즘들의 함수 추정이 거의 안된다는 것을 의미한다.On the other hand, it can be seen that in most models, the predictive power of the cam algorithm is superior to the conventional ensemble algorithm. Comparing Deviance as well as predictive power, it can be seen that the cam algorithm is much better than the conventional algorithm. That is, in the conventional algorithms other than the cam algorithm, the division continues to increase. This means that the function estimation is rarely done.

그에 반하여 캠 알고리즘에서의 디비언스 값은 안정적으로 출력됨을 알 수 있다. 예측력은 좋으나 디비언스가 증가하는 현상은 캠 알고리즘 이외의 모든 앙상블 알고리즘에서 발견되며, 이러한 현상은 분류 문제에서 두 그룹 간의 경계선을 잘 찾아낼 수 있지만, 그 외의 모든 정보는 상실된다는 것을 의미한다.In contrast, it can be seen that the deviation value in the cam algorithm is stably output. A good predictive power, but an increase in division, is found in all ensemble algorithms other than the Cam algorithm, which means that the boundary between the two groups can be found in the classification problem, but all other information is lost.

예를 들면, 새로운 설명 변수 x에 대하여 반응 변수 y가 k 번째 그룹에 속할 확률은 캠 알고리즘 이외의 모든 앙상블 알고리즘에서 추정이 안된다.For example, for the new explanatory variable x, the probability that the response variable y belongs to the k-th group is not estimated in all ensemble algorithms other than the cam algorithm.

결론적으로 본 발명에서 제안하는 캠 알고리즘은 그 예측력의 우수성 뿐만 아니라 안정성의 우수성도 함께 가지고 있다는 것을 본 가상 실험 결과가 보여 준다.In conclusion, our experimental results show that the cam algorithm proposed in the present invention has not only superiority of predictive power but also stability superiority.

10-2. 실제 자료의 분석을 통한 비교10-2. Comparison through analysis of actual data

여러 개의 실제 자료들의 분석을 통하여 캠 알고리즘과 기존의 알고리즘을 비교한다. 사용된 실제 자료는 상술한 'UC Irvine'의 'Machine Learning Web Site'(http://www1.ics.uci.edu/~mlearn/MLRepository.html)에서 구하였다.The cam algorithm is compared with the existing algorithm by analyzing several real data. The actual data used was obtained from the above-mentioned 'Machine Learning Web Site' (http://www1.ics.uci.edu/~mlearn/MLRepository.html) of UC Irvine.

아래의 [표 6]은 실제 자료에 대한 정보를 보여 주는 도표이다.[Table 6] below is a chart showing information about actual data.

[표 6] 실제 자료에 대한 정보[Table 6] Information on the actual data

도 17a 내지 도 17i는 실제 자료 분석의 결과를 보여주는 그래프이다.17A to 17I are graphs showing the results of actual data analysis.

대부분의 경우에 캠 알고리즘이 매우 안정적으로 작동함을 알 수 있다. 'Ionospher 자료'에서는 조금 나쁘게 움직이지만, 그 차이는 그리 크지 않다. 'German 자료'인 경우에는 캠 알고리즘이 다른 알고리즘보다 예측력이 뛰어나고, 안정적임을 알 수 있다. 상술한 가상 실험과 마찬가지로 디비언스 값은 캠 알고리즘이 가장 작은 값을 가짐을 알 수 있다.In most cases it can be seen that the cam algorithm works very reliably. In Ionospher data, it works a bit badly, but the difference is not so great. In case of 'German data', the cam algorithm is more predictive and stable than other algorithms. As in the above-described virtual experiment, it can be seen that the deviation value has the smallest cam algorithm.

아래의 [표 7]은 상기 도 17a 내지 도 17i의 결과를 수치적으로 표현한 도표이다.Table 7 below is a table numerically expressing the results of FIGS. 17A to 17I.

[표 7] 실제 자료 분석 결과[Table 7] Actual data analysis results

결론적으로 캠 알고리즘은 아주 우수한 예측력과 동시에 매우 안정적이며(어떤 자료에서도 크게 잘못 예측하지는 아니한다.), 함수 추정(즉, 디비언스가 작다.)이 가능하다.In conclusion, the cam algorithm is very stable (not too badly predicted from any data), with very good predictive power, and function estimation (i.e. small division) is possible.

함수 추정은 확률의 추정을 의미하며, 캠 알고리즘은 종래의 앙상블 알고리즘과는 달리 주어진 자료에서 반응 변수의 값을 확률적으로 나타낼 수 있다. 이는 실제 자료의 분석에서 아주 유용하게 사용될 수 있다.Function estimation means estimation of probability, and unlike the conventional ensemble algorithm, the cam algorithm can probabilistically represent the value of a response variable in a given data. This can be very useful in analyzing actual data.

또한, 아래의 [표 8]은 연속형 변수의 실험 결과를 보여 주는 도표이다.In addition, Table 8 below is a chart showing the experimental results of the continuous variable.

[표 8] 연속형 변수 분석 결과[Table 8] Result of continuous variable analysis

상기 [표 8]에 도시된 종래의 알고리즘으로는 LS 부스트 알고리즘을 사용하였다.As the conventional algorithm shown in Table 8, the LS boost algorithm was used.

Friedman 모형은 가상 모형으로서, 학습 자료의 수는 500 개이고, 테스트 자료 수는 5,000 개다. 또한, 설명 변수 x = (x₁, ..., x₁₀)이고, 이들은 각각 독립이며, [0, 1]에서 균등 분포를 따른다. 또한, 반응 변수는이고,이다.The Friedman model is a virtual model with 500 training materials and 5,000 test data. Also, the explanatory variables x = (x ₁ , ..., x ₁₀ ), which are independent of each other and follow an even distribution at [0, 1]. In addition, the response variable ego, to be.

실제 자료로는 보스톤 지역에서 여러 환경 변수가 집값에 미치는 영향을 알아 보기 위한 자료를 이용하였다.(Boston Housing Data) 이 자료도 인터넷에서 쉽게 구할 수 있다.The actual data were used to examine the impact of various environmental variables on house prices in the Boston area (Boston Housing Data). These data are also readily available on the Internet.

자료의 수는 506이며, 테스트 에러를 구하기 위하여 5 폴드 교차 확인 방법을 사용하였다. 가상 실험과 보스톤 집값 자료 분석의 결과, 캠 알고리즘은 분류 문제뿐만 아니라 회귀 모형에도 잘 작동함을 알 수 있다. 특히, 종래의 LS 부스팅 방법에서는 축소 모수를 사용자가 정의하여야 하나, 캠 알고리즘에서는 사용자가지정하여야 하는 것이 거의 없다.The number of data was 506, and 5 fold cross check was used to find the test error. Results of hypothetical experiments and Boston house data analysis show that the cam algorithm works well for regression models as well as classification problems. In particular, in the conventional LS boosting method, the user needs to define a reduction parameter, but there is almost no user specification in the cam algorithm.

위에서 양호한 실시예에 근거하여 이 발명을 설명하였지만, 이러한 실시예는 이 발명을 제한하려는 것이 아니라 예시하려는 것이다. 이 발명이 속하는 분야의 숙련자에게는 이 발명의 기술사상을 벗어남이 없이 위 실시예에 대한 다양한 변화나 변경 또는 조절이 가능함이 자명할 것이다. 그러므로, 이 발명의 보호범위는 첨부된 청구범위에 의해서 한정될 것이며, 위와 같은 변화예나 변경예 또는 조절예를 모두 포함하는 것으로 해석되어야 할 것이다.While the invention has been described above based on the preferred embodiments thereof, these embodiments are intended to illustrate rather than limit the invention. It will be apparent to those skilled in the art that various changes, modifications, or adjustments to the above embodiments can be made without departing from the spirit of the invention. Therefore, the protection scope of the present invention will be limited by the appended claims, and should be construed as including all such changes, modifications or adjustments.

이상과 같이 본 발명에 의하면 다음과 같은 효과가 있다.As described above, the present invention has the following effects.

첫째, 제안된 캠 알고리즘은 그 예측력이 종래의 앙상블 구축 알고리즘보다 훨씬 뛰어나고, 매우 안정적으로 작동한다. 즉, 과적합 문제가 발생할 여지가 현저히 줄어든다.First, the proposed cam algorithm has much better predictive power than the conventional ensemble construction algorithm and works very stably. In other words, the possibility of overfitting problems is significantly reduced.

둘째, 제안된 캠 알고리즘은 기존의 앙상블 구축 방법이 공통적으로 가지는 해석력의 저하를 극복하고, 연관성 규칙 알고리즘을 사용함으로써, 어떤 데이터 마이닝 기법보다 더 우수한 해석력을 보여 준다.Secondly, the proposed cam algorithm overcomes the degradation of the analysis power that the conventional ensemble construction method has in common, and shows the superior analysis power than any data mining technique by using the association rule algorithm.

셋째, 연속형 변수 문제에서도 자연스럽게 캠 알고리즘을 적용함으로써, 일반 산업 분야에서도 쉽게 적용될 수 있다.Third, the cam algorithm is naturally applied even in the continuous variable problem, so that it can be easily applied in general industrial fields.

Claims

In the data mining model building apparatus using an ensemble model,

Includes M ensemble model building means,

The first ensemble model building means constructs the first learner from the input multidimensional training material, and constructs the constructed first learner itself as the first ensemble model;

The k th ensemble model construction means receives a k-1 ensemble model that is the result of the k-1 ensemble model construction means, constructs an orthogonal k th learner in a multidimensional space, and constructs the k th learner and the k-1 th An apparatus for constructing data mining models using an ensemble model, comprising constructing a k-th ensemble model by combining an ensemble model.

At this time, M is an integer of 2 or more, and k is an integer of 2 or more and M or less.

The method of claim 1,

The k th ensemble construction means,

And an orthogonal learner is constructed in a multidimensional space using residuals of the k-1 th ensemble model, and then a k th ensemble model is constructed using the k-th ensemble model.

The method of claim 2,

k + 1th learner building means,

k th learner Take the input, Constant to minimize After obtaining An apparatus for constructing a data mining model using an ensemble model, comprising constructing a k + 1 th learner using a.

At this time, d is division and k is an integer of 2 or more and M-1 or less.

The method of claim 2,

If the response variable is categorical, data mining model building apparatus using the ensemble model characterized in that the residual is determined by the following [Equation 1].

[Equation 1]

Here, the input training materials are (x ₁ , y ₁ ), ..., (x _n , y _n ), r _i is the i th residual, and P _k-1 is the k-1 th ensemble model. is the probability that y will be 1.

The method of claim 4, wherein

An apparatus for constructing data mining models using an ensemble model, comprising: building an optimal learner using the absolute value of the residual as a weight.

The method of claim 2,

If the response variable is continuous, the data mining model building apparatus using the ensemble model characterized in that the residual is determined by the following [Equation 2].

[Equation 2]

Where F _k-1 is the k-1 th ensemble model.

The method of claim 6,

An apparatus for constructing a data mining model using an ensemble model, wherein the optimal learner is constructed using the residual as a response variable.

The method of claim 1,

The k-th ensemble model building means,

An apparatus for constructing a data mining model using an ensemble model, wherein the division is calculated using a loss function, and then the k th ensemble model is constructed using the loss function.

The method of claim 8,

The k-th ensemble model building means,

An apparatus for constructing data mining models using an ensemble model, wherein the division is calculated according to Equation 3 below.

[Equation 3]

here, Is the division and l is the loss function.

The method of claim 9,

The k-th ensemble model building means,

k-1st ensemble model F _k-1 and kth learner Data mining model building apparatus using the ensemble model, characterized in that by constructing the k th ensemble model F _k by [Equation 4] below.

[Equation 4]

Where w _k is And u _k-1 is to be.

The method of claim 1,

and a stop means for constructing a final ensemble model with m learners, when m learners are constructed, and if the given loss function is the smallest, the apparatus for constructing a data mining model using the ensemble model.

The method of claim 11,

The stop means,

An apparatus for constructing a data mining model using an ensemble model, wherein the ensemble model F _m is constructed as a final ensemble model for the first m satisfying Equation 5 below.

[Equation 5]

, Put it on, Find the m.

At this time, Where l is a loss function and K is a predetermined constant.

In the data mining model building method using an ensemble model,

Including the M ensemble model building phase,

The first ensemble model building step may include: constructing the first learner from the input multidimensional learning material, and constructing the constructed first learner itself as the first ensemble model;

In the k-th ensemble model construction step, the k-1 th learner, which is the result of the k-1 th ensemble model construction step, is input, constructs an orthogonal k th learner in a multidimensional space, and constructs the k th learner and k. A method of constructing a data mining model using an ensemble model, comprising constructing a k-th ensemble model by combining a -1 th ensemble model.

The method of claim 13,

The k-th ensemble model building step,

A method of constructing a data mining model using an ensemble model, comprising constructing an orthogonal learner in a multidimensional space using residuals of a k-1 th ensemble model, and then constructing a k th ensemble model using the same.

The method of claim 14,

The k + 1th ensemble model building phase

k th learner Take the input, Constant to minimize After obtaining A method of constructing a data mining model using an ensemble model, comprising constructing a k-th learner using a.

At this time, d is division and k is an integer of 2 or more and M-1 or less.

The method of claim 14,

If the response variable is categorical, the data mining model construction method using an ensemble model characterized in that the residual is determined by the following [Equation 6].

[Equation 6]

The method of claim 16,

A data mining model building method using an ensemble model, characterized in that to construct an optimal learner using the absolute value of the residual as a weight.

The method of claim 14,

If the response variable is continuous, the data mining model construction method using an ensemble model characterized in that the residual is determined by the following equation (7).

[Equation 7]

Where F _k-1 is the k-1 th ensemble model.

The method of claim 18,

A data mining model building method using an ensemble model, characterized in that to construct an optimal learner using the residual as a response variable.

The method of claim 13,

The k-th ensemble model building step,

A method of constructing a data mining model using an ensemble model, wherein the division is calculated using a loss function, and then the k th ensemble model is constructed using the loss function.

The method of claim 20,

The k-th ensemble model building step,

A method of constructing a data mining model using an ensemble model, wherein the division is calculated according to Equation 8 below.

[Equation 8]

here, Is the division and l is the loss function.

The method of claim 21,

The k-th ensemble model building step,

k-1st ensemble model F _k-1 and kth learner Data mining model building apparatus using the ensemble model, characterized in that by constructing the k th ensemble model F _k by the following equation (9).

[Equation 9]

Where w _k is And u _k-1 is to be.

The method of claim 13,

When the m learners are constructed, if the given loss function is the smallest, the method further includes terminating the entire algorithm in the m th step and constructing a final ensemble model with the m learners. How to Build a Mining Model.

The method of claim 23,

A method for constructing a data mining model using an ensemble model, comprising constructing an ensemble model F _m as a final ensemble model for the first m satisfying Equation 10 below.

[Equation 10]

, Put it on, Find the m.

At this time, Where l is a loss function and K is a predetermined constant.

In the data mining model building apparatus in two groups classification using an ensemble model,

Input means for receiving learning materials that are multidimensional data;

Weight calculation means for calculating a weight using a residual of the ensemble model;

Probability estimating means for estimating a probability that the response variable becomes 1 with respect to the explanatory variable x given from the data input by the input means using a basic learner and the weight;

Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function with respect to the probability estimated by the probability estimation means;

Learner updating means for updating and rebuilding a learner using the calculated correction parameter; And

And an ensemble model building means for finally building an ensemble model based on the updated learner.

The method of claim 25,

Two group classification using the ensemble model further comprises generating the input training data as boost data by using the weight and then transmitting the boost data to the probability estimating means. Mining model building device in

The method of claim 25,

The correction parameter calculating means is a correction parameter that minimizes the following Equation 11 for a given loss function l. Apparatus for constructing a data mining model in two group classifications using an ensemble model, characterized in that for calculating.

[Equation 11]

At this time, ego, Is the probability that the response variable will be 1 for the given explanatory variable x.

The method of claim 27,

And the learner updating means updates the learner according to the following [Equation 12].

[Equation 12]

here, Is the correction parameter.

The method of claim 28,

And the ensemble model building means constructs an ensemble model according to Equation 13 below.

[Equation 13]

Obtain as

Update with

The method of claim 29,

The weight calculation means is a data mining model building apparatus for classification of two groups using the ensemble model, characterized in that by updating the weight according to the following equation (14).

[Equation 14]

Weights Update to.

At this time, to be.

The method of claim 25,

The learner is a data mining model building apparatus in two groups classification using an ensemble model, characterized in that the decision tree (Decision Tree).

The method of claim 25,

The learner is a data mining model building apparatus in two groups classification using an ensemble model, characterized in that the neural network model.

The method of claim 25,

The ensemble model building means,

Apparatus for constructing a data mining model in two group classifications using an ensemble model, wherein the ensemble model is finally constructed when the division of the current ensemble model is smallest.

The method of claim 33, wherein

An apparatus for constructing data mining models in two group classifications using an ensemble model, comprising constructing an ensemble model F _m as a final ensemble model for the first m satisfying Equation 15 below.

[Equation 15]

, Put it on, Find the m.

At this time, , L is a loss function, K is a preset constant, Is the m-th ensemble model.

In the method of building a data mining model in two group classifications using an ensemble model,

A first step of receiving learning materials that are multidimensional data;

Calculating a weight using a residual of the ensemble model;

A third step of estimating a probability that the response variable becomes 1 with respect to the explanatory variable x given from the data input in the first step using a basic learner and the weight;

A fourth step of calculating a correction parameter that minimizes a given loss function with respect to the probability estimated in the third step; And

And a fifth step of updating and reconstructing the learner using the modified parameters calculated in the fourth step.

36. The method of claim 35 wherein

The third step,

After generating the input learning data as a boost (Boostrap) data by using a weight, using the basic learner to estimate the probability that the response variable is 1 for the given explanatory variable x using this Data mining model construction method in two group classification using ensemble model.

36. The method of claim 35 wherein

The fourth step,

Modification parameter that minimizes Equation 16 below for a given loss function l A method of constructing a data mining model in two group classifications using an ensemble model, comprising

[Equation 16]

The method of claim 37,

The fifth step,

A method of constructing a data mining model in two group classifications using an ensemble model characterized by updating a learner according to Equation 17 below.

Formula 17

here, Is the correction parameter.

The method of claim 38,

And a sixth step of constructing the ensemble model according to Equation 18 below.

[Equation 18]

Obtain as

Update with

The method of claim 39,

The second step,

A method of constructing a data mining model in two group classifications using an ensemble model, wherein the weight is updated and calculated according to Equation 19 below.

[Equation 19]

Weights Update to.

At this time, to be.

36. The method of claim 35 wherein

The learner is a decision tree (Decision Tree) characterized in that the data mining model building method in two groups classification using an ensemble model.

36. The method of claim 35 wherein

And the learner is a neural network model.

36. The method of claim 35 wherein

A method of constructing a data mining model in two group classifications using an ensemble model, wherein the ensemble model is finally constructed when the division of the current ensemble model is smallest.

The method of claim 43,

A method of constructing a data mining model in two group classifications using an ensemble model, comprising constructing an ensemble model F _m as a final ensemble model for the first m that satisfies Equation 20 below.

[Equation 20]

, Put it on, Find the m.

An apparatus for constructing a data mining model in a multi-class classification using an ensemble model,

Input means for receiving learning materials that are multidimensional data;

Probability estimating means for estimating a probability that the response variable belongs to the j group with respect to the explanatory variable x given from the data input by the input means using a basic learner and the weight;

Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function for the probability estimated by the probability estimation means;

The method of claim 45,

In the multi-class classification using the ensemble model, further comprising boosting data generation means for generating the input learning data as a boost (Boostrap) data by using a weight, and then transmitting it to the probability estimation means Data mining model building device.

The method of claim 45,

The correction parameter calculating means is a correction parameter that minimizes the following [Equation 21] for a given loss function l. Apparatus for constructing a data mining model in a multiclass classification using an ensemble model, characterized in that for calculating a.

Formula 21

Where p _mj is the probability that the response variable belongs to j group for the given explanatory variable x.

The method of claim 47,

The learner updating means is a data mining model building apparatus in a multi-class classification using an ensemble model characterized in that the learner is updated according to Equation 22 below.

Formula 22

here, Is the correction parameter.

49. The method of claim 48 wherein

And the ensemble model building means updates the ensemble model according to the following [Equation 23].

Formula 23

Obtain as

Update with At this time, to be.

The method of claim 49,

The weight calculation means is a data mining model building apparatus in the multi-class classification using the ensemble model, characterized in that for updating the weight according to the following formula (24).

Formula 24

Weights Update to.

At this time, to be.

Is 1 if the i th observation belongs to the j th group, or 0 otherwise.

The method of claim 45,

The learner is a data mining model building apparatus in a multi-class classification using an ensemble model, characterized in that the decision tree (Decision Tree).

The method of claim 45,

The learner is a data mining model building apparatus in a multi-class classification using an ensemble model, characterized in that the neural network model.

The method of claim 45,

The ensemble model building means,

An apparatus for constructing a data mining model in a multi-class classification using an ensemble model, wherein the ensemble model is finally constructed when the division of the current ensemble model is smallest.

The method of claim 53, wherein

An apparatus for constructing data mining models in a multi-class classification using an ensemble model, wherein the ensemble model F _m is constructed as the final ensemble model for the first m satisfying the following Equation 25.

[Equation 25]

, Put it on, Find the m.

In the data mining model construction method in the multi-class classification using the ensemble model,

A first step of receiving learning materials that are multidimensional data;

Calculating a weight using a residual of the ensemble model;

A third step of estimating, using the basic learner and the weight, a probability that the response variable belongs to the j group with respect to the explanatory variable x given from the data input in the first step;

A fourth step of calculating a correction parameter that minimizes a given loss function with respect to the probability estimated in the third step;

A fifth step of updating and rebuilding a learner using the correction parameter calculated in the fourth step; And

And a sixth step of finally constructing an ensemble model based on the updated learner.

The method of claim 55,

The third step,

After generating the input learning data as a boost (Boostrap) data by using a weight, using the basic learner to estimate the probability that the response variable is 1 for the given explanatory variable x using this Data Mining Model Construction in Multi-Class Classification Using Ensemble Model.

The method of claim 55,

The fourth step,

Correction parameter that minimizes Equation 26 below for a given loss function l A method of constructing a data mining model in a multiclass classification using an ensemble model, comprising:

Formula 26

The method of claim 57,

The fifth step,

A data mining model building method in a multi-class classification using an ensemble model, characterized in that the learner is updated by the following [Equation 27].

[Equation 27]

here, Is the correction parameter.

The method of claim 58,

The sixth step,

A method of constructing a data mining model in a multi-glass classification using an ensemble model, characterized by updating the ensemble model according to Equation 28 below.

[Equation 28]

Obtain as

Update with At this time, to be.

The method of claim 59,

The second step,

A data mining model construction method in a multi-class classification using an ensemble model, wherein the weight is updated by Equation 29 below.

Formula 29

Weights Update to.

At this time, to be.

Is 1 if the i th observation belongs to the j th group, or 0 otherwise.

The method of claim 55,

The learner is a decision tree (Decision Tree) characterized in that the data mining model building method in a multi-class classification using an ensemble model.

The method of claim 55,

The learner is a neural network model, data mining model building method in a multi-class classification using an ensemble model.

The method of claim 55,

The sixth step,

A method of constructing a data mining model in a multi-class classification using an ensemble model, wherein the ensemble model is finally constructed when the division of the current ensemble model is smallest.

The method of claim 63, wherein

A method for constructing a data mining model in a multi-class classification using an ensemble model, comprising constructing an ensemble model F _m as a final ensemble model for the first m satisfying Equation 30 below.

Formula 30

, Put it on, Find the m.

In the data mining model building apparatus using an ensemble model when the response variable is a regression,

Input means for receiving learning materials that are multidimensional data;

Residual calculation means for calculating a residual of the ensemble model;

Regression model building means for constructing a regression model using a basic learner and the residual from data input by the input means;

Correction parameter calculation means for calculating a correction parameter that minimizes a given loss function for the regression model constructed by the regression model construction means;

Learner updating means for rebuilding a learner by updating a response variable using the calculated correction parameter; And

And an ensemble model building means for finally constructing an ensemble model based on the reconstructed learner.

66. The method of claim 65,

The correction parameter calculation means,

Modification parameter that minimizes Equation 31 below for a given loss function l Apparatus for constructing a data mining model using an ensemble model when the response variable is continuous.

Formula 31

66. The method of claim 65,

The learner updating means,

The above correction parameters An apparatus for constructing a data mining model using an ensemble model when the response variable is continuous by updating a new learner by using Equation 32 below and then constructing an ensemble model using the same.

Formula 32

Place it.

Calculate

Update to.

New response variable Update to.

In this case, the input training data is (x ₁ , y ₁ ), ..., (x _n , y _n ), x _i is an explanatory variable of p dimension, y _i is a response variable, Is a regression model.

66. The method of claim 65,

The learner is a decision tree (Decision Tree) data mining model building apparatus using an ensemble model when the response variable is continuous.

66. The method of claim 65,

The learner is a neural network model, the data mining model building apparatus using the ensemble model when the response variable is continuous.

66. The method of claim 65,

The ensemble model building means,

An apparatus for constructing a data mining model using an ensemble model when the response variable is continuous, characterized in that the ensemble model is finally constructed when the deviation of the current ensemble model is smallest.

The method of claim 70,

An apparatus for constructing a data mining model using an ensemble model when the response variable is continuous, wherein the ensemble model F _m is constructed as the final ensemble model for the first m that satisfies Equation 33 below.

[Equation 33]

, Put it on, Find the m.

In the method of constructing a data mining model using an ensemble model when the response variable is a regression,

A first step of receiving learning materials that are multidimensional data;

Calculating a residual of the ensemble model;

A third step of constructing a regression model using a basic learner and the residual from data input in the first step;

A fourth step of calculating a correction parameter that minimizes a given loss function for the regression model constructed in the third step;

A fifth step of rebuilding the learner by updating the response variable using the correction parameter calculated in the fourth step; And

And a sixth step of finally constructing the ensemble model based on the learner reconstructed in the fifth step.

The method of claim 72,

The fourth step,

Correction parameter that minimizes Equation 34 below for a given loss function l A method of constructing a data mining model using an ensemble model when the response variable is continuous.

[Equation 34]

The method of claim 72,

The fifth step,

The above correction parameters A method of constructing a data mining model using an ensemble model when the response variable is a continuous form, after which the learner is updated by using Equation 35 below and the ensemble model is constructed using the learner.

Formula 35

Place it.

Calculate

Update to.

New response variable Update to.

The method of claim 72,

The learner is a decision tree (Decision Tree) characterized in that the data mining model building method using an ensemble model when the response variable is continuous.

The method of claim 72,

The learner is a neural network model, when the response variable is a continuous type data mining model building method using an ensemble model.

The method of claim 72,

The sixth step,

A method of constructing a data mining model using an ensemble model when the response variable is continuous, characterized in that the ensemble model is finally constructed when the deviation of the current ensemble model is smallest.

76. The method of claim 75 wherein

A method for constructing a data mining model using an ensemble model when the response variable is continuous, wherein the ensemble model F _m is constructed as a final ensemble model for the first m satisfying the following Equation 36.

Formula 36

, Put it on, Find the m.