APPARATUS AND METHOD FOR CONSTRUCTING DATA MINING MODEL
USING ENSEMBLE MACHINES
Technical Field
The present invention relates to an apparatus and method for constructing a data mining model using ensemble machines, and more particularly, to an apparatus and method for constructing a data mining model whereby faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained, as compared with conventional methods, when analyzing and predicting huge multidimensional data.
Background Art
1. Introduction
In data mining, many ensemble machines based on the Bagging algorithm first proposed by Breiman have been studied to date.
That is, studies have been performed to develop the Bagging algorithm further. As results of such studies, there are a Boosting algorithm proposed by Freund and Schapire, an Arcing algorithm proposed by Breiman, a random forest algorithm proposed by Breiman, and the like.
Among these ensemble machines, various improved algorithms are based on the boosting algorithm proposed by Freund and Schapire because of its superior predictability. As for such improved algorithms, there are a Real Boosting algorithm proposed by Schapire and Singer and a Gradient boosting algorithm proposed by Friedman. That is, the conventional ensemble algorithms for use in data mining are mostly based on the boosting algorithm.
Hereinafter, various conventional boosting algorithms which have been applied to classification problems will be briefly explained. In particular, the real boosting algorithm, a Logit boosting algorithm and the gradient boosting algorithm, which are currently used most widely, will be explained.
2. Conventional boosting algorithms applicable to two-class classification problem
The boosting algorithm is a method which is generally used in a classification problem. A basic model of the classification problem is as follows:
Assume that n training data (x_1, y_1), ..., (x_n, y_n) are given. Here, x_i is a p-dimensional explanatory variable, i.e. x_i = (x_1i, ..., x_pi); and a response variable y_i designates the group in which the data are included. That is, if there are J groups, y_i takes one integer value from 1 to J.
An objective of the classification problem is to find out a relationship in which the response variable can be best explained based on the explanatory variable using the n training data. In other words, the objective is to obtain an optimal function H: R^p → {1, 2, ..., J} using the training data. Then, when a new explanatory variable x is given, the data is assigned to Group H(x).
First, a case where there are provided two groups, that is, J = 2, is considered. A case where there are provided multiple groups will be explained later. Various boosting algorithms will now be explained below.
2-1. Real boosting algorithm (Schapire and Singer, 1999)
The real boosting algorithm is most representative of the algorithms for constructing an ensemble machine for data mining. This algorithm is described in detail in United States Patent No. 5,819,247 entitled
"Apparatus and methods for machine learning hypotheses," and will be specifically explained as follows.
FIG. 1 is a flowchart illustrating the conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 1.
(1) Step S101: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S102: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n.
(3) Step S103: A probability p_m that a value of the response variable with respect to the explanatory variable x will be 1 is estimated through basic learning machines using the n training data (x_1, y_1), ..., (x_n, y_n) and the weights w_1, ..., w_n. The probability p_m is determined according to the following formula (1):
p_m(x) = P_w(y = 1 | x)    (1)
(4) Step S104: f_m(x) is obtained by transforming the probability p_m(x). The f_m(x) is determined according to the following formula (2):
f_m(x) = (1/2) log( p_m(x) / (1 - p_m(x)) )    (2)
(5) Step S105: New weights are obtained from the following formula (3) and then normalized to satisfy the relationship Σ_{i=1}^{n} w_i = 1:
w_i,new = w_i,old × exp( -y_i f_m(x_i) ),  i = 1, ..., n    (3)
(6) Step S106: M basic learning machines are generated by repeating the steps S103 to S105 M times (m = 1, ..., M).
(7) Step S107: A final ensemble machine is determined according to the following formula (4). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
H(x) = Σ_{m=1}^{M} f_m(x)    (4)
Thus, the final ensemble machine is constructed through the foregoing processes.
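For illustration only, the loop of steps S101 to S107 can be sketched in code. The following is a minimal Python sketch, assuming scikit-learn decision trees as the basic learning machines and y given as a NumPy array of -1/1 labels; the tree depth, the probability clipping constant eps, and the function names are illustrative choices, not part of the algorithm described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_boost(X, y, M=50, eps=1e-5):
    """Two-class real boosting (steps S101 to S107); y must hold the labels -1 and 1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step S102: uniform initial weights
    trees = []
    for _ in range(M):                           # step S106: repeat M times
        tree = DecisionTreeClassifier(max_depth=2)
        tree.fit(X, y, sample_weight=w)          # step S103: weighted basic learning machine
        p = tree.predict_proba(X)[:, list(tree.classes_).index(1)]
        p = np.clip(p, eps, 1 - eps)             # formula (1): estimated P_w(y = 1 | x)
        f = 0.5 * np.log(p / (1 - p))            # formula (2): half log-odds transform
        w = w * np.exp(-y * f)                   # formula (3): reweighting
        w = w / w.sum()                          # normalization so the weights sum to 1
        trees.append(tree)
    return trees

def real_boost_predict(trees, X, eps=1e-5):
    """Formula (4): H(x) is the sum of the f_m(x); assign Group 2 when H(x) > 0."""
    H = np.zeros(len(X))
    for tree in trees:
        p = np.clip(tree.predict_proba(X)[:, list(tree.classes_).index(1)], eps, 1 - eps)
        H += 0.5 * np.log(p / (1 - p))
    return np.where(H > 0, 2, 1)
```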
2-2. Logit boosting algorithm (Friedman et al., 2000)
FIG. 2 is a flowchart illustrating the conventional Logit boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 2.
(1) Step S201: A response variable y_i is set either to 1 if the data belong to Group 2 or to 0 if the data belong to Group 1.
(2) Step S202: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n, and F(x) and p(x) are also initialized to 0 and 1/2, respectively.
(3) Step S203: A new response variable z_i and a new weight w_i with respect to the given explanatory variable are obtained from the following formula (5):
z_i = ( y_i - p(x_i) ) / ( p(x_i)(1 - p(x_i)) ),  w_i = p(x_i)(1 - p(x_i))    (5)
(4) Step S204: A regression model f_m(x) is constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S205: A probability is updated using the f_m(x) constructed in step S204 and is expressed as the following formula (6):
F(x) = F(x) + (1/2) f_m(x),  p(x) = exp(F(x)) / ( exp(F(x)) + exp(-F(x)) )    (6)
(6) Step S206: The probability updating processes of steps S203 to S205 are repeated M times.
(7) Step S207: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
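For illustration, one way to code the Logit boosting loop of steps S201 to S207 is sketched below, assuming a scikit-learn regression tree as the basic learning machine and y given as a NumPy array of 0/1 labels; the depth, the clipping constant, and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit_boost(X, y, M=50, eps=1e-5):
    """Two-class Logit boosting (steps S201 to S207); y must hold 0/1 labels."""
    n = len(y)
    F = np.zeros(n)                              # step S202: F(x) = 0
    p = np.full(n, 0.5)                          # step S202: p(x) = 1/2
    trees = []
    for _ in range(M):                           # step S206: repeat M times
        p = np.clip(p, eps, 1 - eps)
        z = (y - p) / (p * (1 - p))              # formula (5): working response
        w = p * (1 - p)                          # formula (5): working weight
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)          # step S204: weighted regression of z on x
        f = tree.predict(X)
        F = F + 0.5 * f                          # formula (6): update F
        p = np.exp(F) / (np.exp(F) + np.exp(-F)) # formula (6): update the probability
        trees.append(tree)
    return trees

def logit_boost_predict(trees, X):
    """Step S207: H(x) = F(x); assign Group 2 when H(x) > 0, Group 1 otherwise."""
    F = 0.5 * sum(tree.predict(X) for tree in trees)
    return np.where(F > 0, 2, 1)
```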
2-3. Gradient boosting algorithm (Friedman, 2001)
FIG. 3 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 3.
(1) Step S301: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S302: Various kinds of variables are initialized according to the following formula (7):
F(x) = (1/2) log( (1 + ȳ) / (1 - ȳ) ),  where ȳ is the mean of y_1, ..., y_n    (7)
(3) Step S303: A new response variable z_i with respect to the given explanatory variable is calculated from the following formula (8):
z_i = 2 y_i / ( 1 + exp( 2 y_i F(x_i) ) )    (8)
(4) Step S304: f_m(x) is estimated using the response variable z_i, the explanatory variable x_i, and the basic learning machines (regression decision trees).
(5) Step S305: An estimated value γ_l of the l-th terminal node of the f_m(x) is estimated based on the following formula (9):
γ_l = Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (2 - |z_i|)    (9)
where R_l is the set of the data belonging to the l-th terminal node.
(6) Step S306: The F(x) is updated in accordance with the formula F(x) = F(x) + f_m(x).
(7) Step S307: The F(x) is updated M times by repeating steps S303 to S306 M times.
(8) Step S308: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
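A minimal sketch of the loop of steps S301 to S308 is given below, assuming regression decision trees from scikit-learn, the log-odds initialization of formula (7), and y given as a NumPy array of -1/1 labels; only the in-sample F(x) is tracked and numerical safeguards are omitted to keep the sketch short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50):
    """Two-class gradient boosting (steps S301 to S308); y must hold -1/1 labels."""
    n = len(y)
    ybar = y.mean()
    F = np.full(n, 0.5 * np.log((1 + ybar) / (1 - ybar)))   # formula (7): initialization
    stages = []
    for _ in range(M):                                       # step S307: repeat M times
        z = 2 * y / (1 + np.exp(2 * y * F))                  # formula (8): pseudo-response
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z)                                       # step S304: regression tree on z
        leaf = tree.apply(X)                                 # terminal node of each datum
        gamma = {}
        for l in np.unique(leaf):                            # formula (9): per-node estimate
            zl = z[leaf == l]
            gamma[l] = zl.sum() / np.sum(np.abs(zl) * (2 - np.abs(zl)))
        F = F + np.array([gamma[l] for l in leaf])           # step S306: F(x) = F(x) + f_m(x)
        stages.append((tree, gamma))
    return stages, F                                         # step S308: Group 2 when F(x) > 0
```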
In the meantime, various kinds of basic learning machines can be utilized in the aforementioned real boosting algorithm and Logit boosting algorithm, whereas the gradient boosting algorithm should utilize the decision trees as basic learning machines.
Further, when the basic learning machine is to be constructed in the algorithms, the complexity of the learning machine should be determined beforehand. In a case where the basic learning machine is the decision tree, the complexity of the learning machine can be controlled by adjusting the number of the terminal nodes. However, since the basic learning machines used in the boosting algorithms are weak ones, the complexity thereof should be minimized. In general, a decision tree with two to eight terminal nodes is widely used as a basic learning machine.
3. Conventional boosting algorithms applicable to multi-class classification
The aforementioned three boosting algorithms have been applicable to cases where there are two groups. However, in a case where there are three or more groups, the algorithms should be expanded and modified accordingly. Hereinafter, the real boosting algorithm, Logit boosting algorithm, and gradient boosting algorithm applicable to the multi-class classification will be explained.
3-1. Real boosting algorithm applicable to multi-class classification
FIG. 4 shows a flowchart illustrating the conventional real boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 4.
(1) Step S401: J new data ((x_i, 1), y_i1), ..., ((x_i, J), y_iJ) are generated from each given training datum (x_i, y_i). At this time, y_ik is 1 if y_i is k, whereas otherwise, y_ik is -1.
(2) Step S402: A final ensemble machine is constructed according to the following formula (10) by applying the two-class real boosting algorithm to the (n × J) new data (i.e., by transforming the multi-class classification into the two-class classification):
H(x, j): R^p × {1, ..., J} → R    (10)
(3) Step S403: A new explanatory variable x is assigned to Group argmax_j H(x, j). However, according to the real boosting algorithm applicable to the above multi-class classification, the number of executions, i.e. the number of calculations, corresponds to the number of boosting iterations. Thus, there is a problem in that the real boosting algorithm is very time-consuming.
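The data expansion of step S401 is simple enough to show directly. The sketch below builds the (n × J) two-class data set; a two-class booster such as the one sketched in section 2-1 would then be applied to the expanded data, and prediction follows step S403. The function name and the label coding are illustrative assumptions.

```python
import numpy as np

def expand_multiclass(X, y, J):
    """Step S401: turn n multi-class data (x_i, y_i) into n*J two-class data ((x_i, k), y_ik)."""
    rows, labels = [], []
    for x_i, y_i in zip(X, y):
        for k in range(1, J + 1):
            rows.append(np.append(x_i, k))          # augmented explanatory variable (x_i, k)
            labels.append(1 if y_i == k else -1)    # y_ik = 1 if y_i = k, otherwise -1
    return np.array(rows), np.array(labels)

# Step S403: for a new x, evaluate H on (x, 1), ..., (x, J) and assign x to the
# group j that maximizes H(x, j).
```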
3-2. Logit boosting algorithm applicable to multi-class classification
FIG. 5 shows a flowchart illustrating the conventional Logit boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 5.
(1) Step S501: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S502: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S503: A new response variable z_i and a weight w_i are determined according to the following formula (11):
z_i = ( y_i* - p_j(x_i) ) / ( p_j(x_i)(1 - p_j(x_i)) ),  w_i = p_j(x_i)(1 - p_j(x_i))    (11)
(4) Step S504: A regression model f_mj(x) is calculated and constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S505: Steps S502 to S504 are repeated J times.
(6) Step S506: The j-th F_j(x) is updated according to the following formula (12):
f_mj(x) = ((J - 1)/J) ( f_mj(x) - (1/J) Σ_{k=1}^{J} f_mk(x) ),  F_j(x) = F_j(x) + f_mj(x)    (12)
(7) Step S507: The j-th probability p_j(x) is updated according to the following formula (13):
p_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x))    (13)
(8) Step S508: Steps S502 to S507 are repeated M times.
(9) Step S509: A new explanatory variable x is assigned to Group argmax_j F_j(x).
3-3. Gradient boosting algorithm applicable to multi-class classification
FIG. 6 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 6.
(1) Step S601: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S602: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S603: A new response variable z_i is determined according to the following formula (14):
z_i = y_i* - p_j(x_i)    (14)
(4) Step S604: A regression model f_mj(x) is constructed with reference to the basic learning machines (regression decision trees) and using the response variable z_i and the explanatory variable x_i.
(5) Step S605: An estimated value γ_lj of the l-th terminal node in the regression model is estimated based on the following formula (15):
γ_lj = ((J - 1)/J) Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (1 - |z_i|)    (15)
(6) Step S606: The F_j(x) is determined according to the following formula (16):
F_j(x) = F_j(x) + Σ_l γ_lj 1(x ∈ R_l)    (16)
(7) Step S607: Steps S602 to S606 are repeated J times (j = 1, ..., J).
(8) Step S608: Steps S602 to S607 are repeated M times (m = 1, ..., M).
(9) Step S609: A new explanatory variable x is assigned to Group argmax_j F_j(x).
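One way to code the multi-class loop of steps S601 to S609 is sketched below, assuming y is coded 1, ..., J, scikit-learn regression trees as the basic learning machines, and the softmax relationship of formula (13) for recovering p_j from F_j (the probability update is not restated in this subsection, so that choice is an assumption); only the in-sample F_j values are tracked and numerical safeguards are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def multiclass_gradient_boost(X, y, J, M=50):
    """Multi-class gradient boosting (steps S601 to S609); y must hold labels 1, ..., J."""
    n = len(y)
    F = np.zeros((n, J))                            # step S601: F_j(x) = 0
    for _ in range(M):                              # step S608: repeat M times
        p = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)   # p_j(x) via the softmax of formula (13)
        for j in range(J):                          # step S607: repeat for j = 1, ..., J
            ystar = (y == j + 1).astype(float)      # step S602: y_i* for Group j
            z = ystar - p[:, j]                     # formula (14)
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, z)                          # step S604: regression tree on z
            leaf = tree.apply(X)
            for l in np.unique(leaf):               # formula (15): terminal-node estimate
                zl = z[leaf == l]
                gamma = ((J - 1) / J) * zl.sum() / np.sum(np.abs(zl) * (1 - np.abs(zl)))
                F[leaf == l, j] += gamma            # formula (16): update F_j on that node
    return F                                        # step S609: assign x to argmax_j F_j(x)
```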
4. Summary of problems in prior arts
A basic idea of the aforementioned boosting algorithms is to develop a new strong learning machine by constructing several weak learning machines and then combining them. However, such an idea involves several problems which are summarized as follows.
First, it is difficult to analyze the constructed models. Second, it is not easy for a general user to define a number of tuning parameters, such as the size of a decision tree, the number of the decision trees and the like. Third, there is a tendency toward overfitting of data (Ridgeway, 2000). Fourth, if a response variable is continuous, it is not easy to use the algorithms. Fifth, it is not easy to estimate probabilities in a classification problem. That is, when new data are inputted, classification probability to be taken by the inputted new data
cannot be estimated.
Sixth, it is not easy to verify stability of the boosting algorithm, due to lack of a general (statistical) theorem of the boosting algorithm.
Disclosure of the Invention
The present invention is conceived to solve the above problems in the prior art. Accordingly, an object of the present invention is to provide an apparatus and method for constructing an ensemble machine wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines, comprising: M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines, comprising: a step of constructing M ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine constructing step constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine constructing step receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine constructing step, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided an apparatus for constructing a data mining model in two-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weight, probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model in two-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; and a fifth step for updating and reconstructing learning machines by using the correction parameter calculated from the fourth step.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model in multi-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an
explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model in multi-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; a fifth step for updating and reconstructing learning machines by using the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: an input means for receiving multidimensional training data; a residual calculation means for calculating residuals of the ensemble machines; a regression model construction means for constructing regression models from the data inputted by the input means through basic learning machines and the residuals; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the regression model construction means; a learning machine update means for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the reconstructed learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: a first step for receiving multidimensional training data; a second step for calculating residuals of the ensemble machines; a third step for constructing regression models from the data inputted by the first step through basic learning machines and the residuals; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the third step; a fifth step for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the reconstructed learning machines. More specifically, the inputted training data are generated into Bootstrap data by using the weights.
More specifically, the learning machines are decision trees or neural network models.
More specifically, the final ensemble machine is constructed when a deviance of a current ensemble machine is smallest.
Brief Description of the Drawings
The above and other objects and features of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating a conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem;
FIG. 2 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case where there are provided two groups in the classification problem; FIG. 3 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case where there are provided two groups in the classification problem;
FIG. 4 is a flowchart illustrating a conventional real boosting algorithm applicable to a case of a multi-class classification problem;
FIG. 5 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case of the multi-class classification problem;
FIG. 6 is a flowchart illustrating a conventional gradient boosting algorithm
applicable to a case of the multi-class classification problem;
FIG. 7 is a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention;
FIG. 8 is a flowchart schematically illustrating a cross validation process for selecting an optimal decision tree;
FIG. 9 is a flowchart illustrating an entire outline of a process for selecting the optimal decision tree through tree information criteria (TIC);
FIG. 10 is a flowchart showing a process for generating Bootstrap data according to another preferred embodiment of the present invention; FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from a true learning machine among the learning machines included in the conventional set of basic learning machines;
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention; FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention;
FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention; FIG. 15 is a flowchart illustrating an outline of an algorithm for generating association rules;
FIGS. 16a to 16d are graphs showing the results of virtual tests for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention; and FIGS. 17a to 17i are graphs showing the results of actual data analysis for validating performances of the conventional ensemble methods and the ensemble method
proposed by the present invention.
Best Mode for Carrying Out the Invention
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
1. Introduction
An algorithm for constructing a data mining model proposed by the present invention is a novel algorithm whose predictability is superior to that of the boosting algorithms while solving the aforementioned problems of the boosting algorithms. Further, in a case where a decision tree is employed as a basic learning machine, the algorithm of the present invention has further improved analyzability as compared with the decision tree, which is known to have the most superior analyzability among conventional data mining techniques (although the decision tree has very poor predictability). In particular, according to the algorithm of the present invention, various kinds of association rules capable of describing given data can be extracted, and accordingly, the given data can be analyzed in various respects.
Theoretically, the algorithm proposed by the present invention is completely different from the boosting algorithm in view of their basic ideas. The boosting algorithm is to develop a strong learning machine by combining several weak learning machines, whereas the algorithm of the present invention is to develop a stronger learning machine by combining several strong learning machines.
A basic learning machine used in an ensemble scheme of the present invention will be explained by way of example of the decision tree used as the learning machine. The reason is that the decision tree has advantages such as high analyzability and simplicity of the algorithm. However, the basic learning machine can be differently selected depending on data types. For example, a neural network model may be utilized in case of character or voice recognition.
In order to utilize the decision tree as the basic learning machine, fast computational speed is essentially required. In general, the CART algorithm proposed by Breiman et al. in 1984 is currently well-known as a decision tree construction algorithm.
The CART algorithm comprises three steps of growing the decision tree, pruning, and selecting the optimal decision tree.
At this time, the third step of selecting the optimal decision tree utilizes a cross validation method which requires a large amount of computation. The amount of calculation required for the cross validation is not a burden when generating a single decision tree. However, the ensemble scheme requires generating several decision trees.
Thus, if the cross validation method is to be applied to all the decision trees, an enormous increase of the amount of calculation will be inevitable.
In order to overcome the problem of the cross validation, the present invention defines and utilizes a new parameter of tree information criteria (TIC) so that the optimal decision tree can be constructed in a short time.
However, the algorithm for constructing the data mining model proposed by the present invention is not limited to a method of selecting the decision tree using the TIC.
In other words, according to the present invention, various basic learning machines (e.g., a neural network learning machine) as well as the decision tree can be utilized as the basic learning machine. For convenience of explanation, the decision tree will be merely selected and explained herein as the basic learning machine.
FIG. 7 shows a flowchart schematically illustrating a method for constructing an ensemble machine according to a prefeπed embodiment of the present invention. Referring to FIG. 7, if multidimensional training data are inputted in step S701, a basic learning machine is constructed in step S702. At this time, a decision tree, a neural network model or the like can be utilized as the basic learning machine. For convenience of explanation, however, the decision tree is employed herein as the basic learning machine.
Therefore, the optimal decision tree is constructed in step S702. In such a case, the optimal decision tree can be constructed in various manners. A new scheme referred to as a tree information criteria (TIC) algorithm will be introduced into the present invention.
Then, the ensemble machine is constructed by using the constructed basic learning machines in step S703, and it is determined whether the constructed ensemble machine is optimal. If a result of determination in step S704 is that the ensemble machine is not optimal, a new response variable is generated in step S705, and then the procedure is returned to step S702.
If the result of determination in step S704 is that the ensemble machine is optimal, a final ensemble machine is constructed in step S706, and then the procedure is terminated. The present invention will be explained herein in the following order. (1) First, the TIC, i.e. the novel algorithm for constructing the optimal decision tree, which is substituted for the cross validation method, will be described in order to explain how to construct the optimal decision tree (refer to step S702). (2) A method of extracting Bootstrap data for use in the present invention will be examined (refer to a first half of step S703). (3) A new algorithm using the Bootstrap data extraction method will be described (refer to a second half of step S703 and steps S704 to S706). (In such a case, it is not necessary to utilize the TIC algorithm. That is, it is not necessary to select the decision tree as the basic learning machine.) (4) An algorithm for generating association rules will be examined. (5) Performances of various algorithms will be compared with one another through virtual tests, and then the performances will be finally compared against actual data.
2. Tree Information Criteria (TIC)
A TIC method for solving the problems of the cross validation method which has been generally utilized when determining the optimal decision tree will be explained. First, general algorithms for generating the decision trees will be described. Then, a conventional cross validation algorithm will be examined. Finally, an algorithm for selecting the optimal decision tree using the TIC proposed by the present invention will be explained.
2-1. Algorithm for constructing decision tree (Breiman et al., 1984)
The algorithm for constructing the decision tree, which was proposed by Breiman, can be roughly divided into three steps.
A first step is a growth algorithm for generating the largest decision tree with respect to given data. A second step is a pruning algorithm for generating a plurality of nested decision trees by pruning unnecessary branches from the huge decision tree constructed through the growth algorithm. At this time, the constructed decision trees become smaller and smaller.
A third step is an algorithm for selecting the optimal decision tree among the decision trees which have been obtained from the pruning algorithm.
2-2. Cross validation algorithm for selecting optimal decision tree (k-fold cross validation)
FIG. 8 shows a flowchart schematically illustrating the process of a cross validation method for selecting the optimal decision tree. The process will be explained in detail below.
Assume that the decision trees generated from the multidimensional data through the growth algorithm and the pruning algorithm are T_1, ..., T_m, and that e_i is a cross validation error of the decision tree T_i.
(1) Step S801: Various kinds of variables are initialized. That is, it is set that e_i = 0, where i = 1, 2, ..., m.
(2) Step S802: Given n training data are equally divided into k parts so that mutually exclusive data D_1, D_2, ..., D_k can be generated.
(3) Step S803: D_i is set to test data, and the other data are set to training data.
(4) Step S804: By using the training data, the nested decision trees are constructed (through the growth and pruning algorithms).
(5) Step S805: By using the test data D_i, a prediction error is calculated with respect to each of the nested decision trees.
(6) Step S806: A decision tree which is closest to the decision tree T_j is selected from among the constructed decision trees. At this time, since the decision tree selection algorithm was fully disclosed by Breiman et al. (1984), the description thereof will be omitted herein.
(7) Step S807: The prediction error of the decision tree obtained from step S806 is added to the e_j.
(8) Step S808: Steps S806 and S807 are repeated m times (j = 1, ..., m).
(9) Step S809: Steps S803 to S808 are repeated k times (i = 1, ..., k).
(10) Step S810: e_1, ..., e_m are referred to as the cross validation errors of the decision trees T_1, ..., T_m, respectively. A decision tree whose cross validation error is smallest is selected as the optimal decision tree.
In the meantime, such a cross validation algorithm is also referred to as the k-fold cross validation algorithm, in which a 5-fold or 10-fold cross validation method is generally utilized.
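For reference, the cross validation selection of steps S801 to S810 can be sketched as follows; grow_and_prune, closest_tree, and prediction_error are hypothetical placeholders for the growth/pruning algorithm, the tree-matching rule of Breiman et al. (1984), and the test-error computation, none of which are spelled out here.

```python
import numpy as np

def cv_select(X, y, trees, grow_and_prune, closest_tree, prediction_error, k=5, seed=0):
    """k-fold cross validation for selecting the optimal tree among T_1, ..., T_m
    (steps S801 to S810)."""
    m = len(trees)
    e = np.zeros(m)                                          # step S801: e_j = 0
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                           # step S802: D_1, ..., D_k
    for i in range(k):
        test = folds[i]                                      # step S803: D_i is the test data
        train = np.concatenate([folds[r] for r in range(k) if r != i])
        nested = grow_and_prune(X[train], y[train])          # step S804: nested trees
        err = [prediction_error(t, X[test], y[test]) for t in nested]   # step S805
        for j in range(m):                                   # steps S806 to S808
            e[j] += err[closest_tree(trees[j], nested)]      # step S807: accumulate error
    return trees[int(np.argmin(e))]                          # step S810: smallest error wins
```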
In the aforementioned cross validation algorithm for constructing the optimal decision tree, the decision trees should be constructed many times. Thus, if the data are huge, there are problems in that a very long calculation time is required and the calculation result varies arbitrarily depending on how the data are divided.
In order to solve the problems, such as enormous increase of the amount of calculation and instability of the calculation result, which are inherent in the cross validation method, the TIC algorithm of the present invention is newly proposed and will be hereinafter described.
2-3. TIC algorithm for selecting optimal decision tree
An objective of the TIC algorithm is to select the optimal decision tree among several decision trees T_1, ..., T_m. At this time, the posterior probability of each of the respective decision trees is calculated, and then the decision tree whose posterior probability is largest is selected as the optimal decision tree.
The posterior probability means a probability of each tree with respect to the given data. That is, the posterior probability of the tree T_i will be Pr(T_i | D_n) with respect to the given data D_n = {(y_1, x_1), ..., (y_n, x_n)}.
FIG. 9 shows a flowchart illustrating an entire outline of a method for selecting the optimal decision tree using the TIC. The method will be explained in detail below.
If the multidimensional training data are first inputted in step S901, the largest decision tree is constructed based on the training data in step S902. Then, in step S903, the constructed largest decision tree is newly generated into nested decision trees through the pruning technique. Thereafter, the posterior probability of each of the respective decision trees is calculated in step S904, the decision tree whose posterior probability is largest is then selected in
step S905, and a single optimal decision tree is finally selected in step S906.
Hereinafter, the method for selecting such an optimal decision tree will be more specifically explained.
First, a general method for calculating the posterior probability will be described. The posterior probability is defined as Pr(T_i | D_n) = c Pr(D_n | T_i) Pr(T_i) according to the Bayesian theorem, where Pr(D_n | T_i) is the probability of the data in a case where the model is T_i, Pr(T_i) is a probability arbitrarily determined by a user when the user examines the data, and c is a constant for establishing the relationship Σ_{i=1}^{m} Pr(T_i | D_n) = 1.
In the meantime, since the objective of calculating the posterior probability is to determine the decision tree whose posterior probability is largest, the constant c need not be obtained. Thus, the following formula (17) can be simply utilized to calculate the posterior probability:
Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i)    (17)
Now, Pr(D_n | T_i) will be calculated.
Since the data are independent of one another, the following formula (18) is established:
Pr(D_n | T_i) = Π_{k=1}^{n} Pr(y_k, x_k | T_i)    (18)
Further, the formula (18) may be rewritten as the following formula (19):
Pr(y_k, x_k | T_i) = Pr(y_k | T_i, x_k) Pr(x_k | T_i)    (19)
Here, since the tree T_i is used to represent a probability structure of y_k with respect to a given input x_k, Pr(x_k | T_i) is not dependent on the T_i. That is, Pr(x_k | T_i) = Pr(x_k). Therefore, once Pr(y_k | T_i, x_k) is calculated, Pr(D_n | T_i) is obtained accordingly.
Furthermore, similarly to the constant c, since Pr(x_k) is a value which is commonly applied to all the trees, it is not used to find out the tree whose posterior probability is largest. Thus, in consideration of the foregoing, the above formula (17) is rewritten and expressed as the following formula (20):
Pr(T_i | D_n) ∝ Π_{k=1}^{n} Pr(y_k | T_i, x_k) Pr(T_i)    (20)
The Pr(y_k | T_i, x_k) is calculated in the following manner.
Assume that the set of terminal nodes of T_i is expressed as T̃_i and the probabilities of the J groups with respect to a given h-th terminal node are expressed as (p_1h, ..., p_Jh). Then, if the given input variable x_k belongs to the h-th terminal node of the tree T_i, the following formula (21) is established:
Pr(y_k | T_i, x_k) = p_{y_k h}    (21)
where y_k represents the group to which the data belong.
Using the foregoing, the following formula (22) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} Π_{j=1}^{J} p_jh^(n_jh)    (22)
where n_jh is the number of the data belonging to Group j among the data included in the h-th terminal node.
Since the probabilities (p_1h, ..., p_Jh) of the respective terminal nodes are unknown variables, they are removed by using expected values. A distribution of the probabilities (p_1h, ..., p_Jh) is needed for obtaining the expected values. Assume that the distribution is π(p_1h, ..., p_Jh). Then, the following formula (23) is established:
∫ Π_{j=1}^{J} p_jh^(n_jh) π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (23)
Here, various types of distributions can be utilized for π(p_1h, ..., p_Jh). If a general distribution is utilized, the following formula (24) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} ∫ Π_{j=1}^{J} p_jh^(n_jh) π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (24)
Further, if a uniform distribution is utilized, the following formula (25) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} ( Π_{j=1}^{J} Γ(n_jh + 1) ) / Γ(n_h + J)    (25)
where n_h = Σ_{j=1}^{J} n_jh.
Furthermore, the uniform distribution is defined as the following formula (26):
π(p_1h, ..., p_Jh) = Γ(J),  p_jh ≥ 0,  Σ_{j=1}^{J} p_jh = 1    (26)
Hereinafter, a method for determining the prior probability Pr(T_i) of a tree will be explained.
Pr(T_i) is not obtained from data but is inputted by the user. Pr(T_i) for the TIC is constructed as follows.
First, the probability that a given h-th node will be an intermediate node (that is, a node from which successive branching occurs) is defined as the following formula (27):
α (1 + f_h)^(-β)    (27)
where f_h is the number of ancestor nodes of the given node, and the constants α and β are determined by the user. Then, the probability that a given node will be a terminal node is naturally determined as the following formula (28):
1 - α (1 + f_h)^(-β)    (28)
The prior probability of a given tree T_i under the conditions of the formula (28) is expressed as the following formula (29):
Pr(T_i) = Π_{h ∈ T_i - T̃_i} α (1 + f_h)^(-β) × Π_{h ∈ T̃_i} ( 1 - α (1 + f_h)^(-β) )    (29)
where T_i - T̃_i is the set of the intermediate nodes (i.e., all the nodes which are not terminal nodes). Next, the TIC will be calculated using the aforementioned matters.
When the above formulas are combined and rearranged, the following formula (30) is finally obtained. At this time, the formula (30) uses the uniform distribution:
Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i)
∝ Π_{h ∈ T̃_i} ( Π_{j=1}^{J} Γ(n_jh + 1) / Γ(n_h + J) ) × Π_{h ∈ T_i - T̃_i} α (1 + f_h)^(-β) × Π_{h ∈ T̃_i} ( 1 - α (1 + f_h)^(-β) )    (30)
Further, the TIC is defined as a logarithmic value of the final right side of the above formula. That is, the TIC of the tree T_i is expressed as the following formula (31):
TIC(T_i) = Σ_{h ∈ T̃_i} ( Σ_{j=1}^{J} log Γ(n_jh + 1) - log Γ(n_h + J) ) + Σ_{h ∈ T_i - T̃_i} ( log α - β log(1 + f_h) ) + Σ_{h ∈ T̃_i} log( 1 - α (1 + f_h)^(-β) )    (31)
In a case where there are two groups, i.e. J = 2, the TIC is expressed as the following formula (32):
TIC(T_i) = Σ_{h ∈ T̃_i} ( log Γ(d_h + 1) + log Γ(n_h - d_h + 1) - log Γ(n_h + 2) ) + Σ_{h ∈ T_i - T̃_i} ( log α - β log(1 + f_h) ) + Σ_{h ∈ T̃_i} log( 1 - α (1 + f_h)^(-β) )    (32)
where d_h is the number of data belonging to the second group among the data of the h-th terminal node. After the TIC defined as such has been applied to the respective decision trees T_1, ..., T_m, the decision tree having the maximum TIC is selected as the optimal decision tree. Then, the algorithm is terminated.
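A minimal sketch of evaluating the TIC of formula (31) for a single pruned tree is given below, assuming the tree has already been summarized by its per-terminal-node counts n_jh and by the numbers of ancestors f_h of its terminal and intermediate nodes; alpha and beta are the user-chosen constants of formula (27), and the default values used here are only illustrative.

```python
import numpy as np
from scipy.special import gammaln   # log of the Gamma function

def tic(terminal_counts, terminal_depths, intermediate_depths, alpha=0.5, beta=2.0):
    """Tree information criterion of formula (31) for one tree.

    terminal_counts     : one length-J vector (n_1h, ..., n_Jh) per terminal node
    terminal_depths     : number of ancestors f_h of each terminal node
    intermediate_depths : number of ancestors f_h of each intermediate node"""
    value = 0.0
    for counts in terminal_counts:                 # data term of formula (31)
        counts = np.asarray(counts, dtype=float)
        value += gammaln(counts + 1).sum() - gammaln(counts.sum() + len(counts))
    for fh in intermediate_depths:                 # prior term for intermediate nodes
        value += np.log(alpha) - beta * np.log(1 + fh)
    for fh in terminal_depths:                     # prior term for terminal nodes
        value += np.log(1 - alpha * (1 + fh) ** (-beta))
    return value

# The nested tree T_i with the largest TIC value is selected as the optimal decision tree.
```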
Meanwhile, the conventional method using the Bayesian theorem and the TIC method proposed by the present invention are the same as each other in that they utilize a posterior probability. However, they are a little different from each other in view of the posterior probability construction which will be used when calculating their posterior probabilities. Such a difference has a great influence on calculation of the posterior probabilities. That is, in the conventional method using the Bayesian theorem, the posterior probability cannot be calculated by means of numerical formulas. Thus, since the probabilities should be calculated through a computer, it takes much more time to perform the calculation as compared with the method using the cross validation.
The method for constructing the posterior probability according to the conventional method using the Bayesian theorem assigns probabilities to all possible decision trees.
However, since the number of all the possible decision trees is very large, the method for constructing the posterior probability is also greatly complicated. Accordingly, since the number of decision trees which require the calculation of the posterior probabilities is
inevitably greatly increased, this leads to an enormous increase in the amount of the calculation.
However, the TIC method solves the problem of the conventional method using the
Bayesian theorem by assigning the posterior probabilities not to all the possible decision trees but only to nested decision trees induced through the pruning algorithm. Thus, there are advantages in that it is very easy to construct the prior probability and the calculation of the posterior probability is simplified.
That is, the prior probability constructing method for use in the TIC method is a method for reducing sets of the decision trees by using data. This point is the very sharp difference between the TIC method and the conventional method using the Bayesian theorem.
In summary, since it is sufficient to construct the decision tree only once in the TIC method, the computational speed is rapidly increased over the cross validation method.
Furthermore, since the results obtained with respect to the same data are always the same in the TIC method, reliability of the results is greatly superior to that of the cross validation method.
Table 1 below shows simulation results of the selection of an optimal decision tree by using the conventional 5-fold cross validation method and the TIC method proposed by the present invention, respectively. That is, these experiment data are used to compare the generation rate of a single tree through the 5-fold cross validation with that of the single tree through the TIC proposed by the present invention.
Respective pieces of the experiment data indicate the average time when the same data are repeatedly generated 500 times. The specification of the computer was a Pentium III 900 MHz processor, a 256 Mbyte main memory, and the Windows 2000 OS.
According to Table 1 below, it can be seen that the calculation time of the TIC method proposed by the present invention was about one fifth of that of the conventional 5-fold cross validation method.
Meanwhile, the simulation data correspond to standard data which have been widely known in the field of data mining, i.e. the 'Radius2,' 'Interaction,' 'Breast Cancer,' 'Ionosphere,' and 'Sonar' data. These data are the most powerful simulation data for estimating the efficiency of data mining in the relevant art. These simulation data are shown in detail in the 'Machine Learning Web Site' (http://www1.ics.uci.edu/~mlearn/MLRepository.html) of 'UC Irvine.'
Table 1
3. Method for extracting Bootstrap data
Hereinafter, a method for extracting Bootstrap data used in the present invention will be briefly described.
First, a general algorithm for generating the Bootstrap data will be explained below. Assuming that the original data are n data, i.e. y_1, ..., y_n, the number m of the Bootstrap data to be generated is determined and a new data set D_B is initialized to an empty set.
Then, a constant k satisfying the relationship k ∈ [1, n] is generated using a random number generator, and y_i^B is assigned as y_k and then added to D_B. This process is repeated m times (i = 1, ..., m). FIG. 10 is a flowchart illustrating the process of generating the Bootstrap data according to one embodiment of the present invention, which will be explained below.
(1) Step S1001: Weights w_1, ..., w_n are assigned to the n multidimensional data (x_1, y_1), ..., (x_n, y_n) to be inputted. Here, x_i is a p-dimensional explanatory variable. That is, x_i = (x_1i, ..., x_pi).
(2) Step S1002: Cumulative weights of the weights w_1, ..., w_n assigned to y_1, ..., y_n are calculated. That is, the cumulative weights are calculated according to the following formula (33):
w_i* = Σ_{j=1}^{i} w_j,  i = 1, ..., n    (33)
(3) Step S1003: The number m of new data to be generated is determined. The number m of the Bootstrap data for use in the ensemble algorithm according to the present invention is set to n.
(4) Step S1004: The new data set D_B is initialized to the empty set.
(5) Step S1005: A random real number k satisfying the relationship k ∈ [0, w_n*] is generated using the random number generator.
(6) Step S1006: An integer j satisfying the relationship w_{j-1}* < k ≤ w_j* among w_1*, ..., w_n* is determined. Here, j = 1, ..., n.
(7) Step S1007: (x_i^B, y_i^B) is assigned as (x_j, y_j).
(8) Step S1008: The weight w_i^B corresponding to (x_i^B, y_i^B) is set to 1/m.
(9) Step S1009: (x_i^B, y_i^B) is added to D_B.
(10) Step S1010: The process is repeated m times (i = 1, ..., m).
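The weighted Bootstrap extraction of steps S1001 to S1010 can be written compactly with cumulative weights; numpy's searchsorted plays the role of the selection rule of step S1006. This is a sketch only, and the random number generator and the array layout are implementation choices.

```python
import numpy as np

def weighted_bootstrap(X, y, w, m=None, seed=None):
    """Generate Bootstrap data D_B from (x_i, y_i) with weights w_i (steps S1001 to S1010)."""
    rng = np.random.default_rng(seed)
    m = len(y) if m is None else m            # step S1003: m is set to n in the ensemble algorithm
    w_star = np.cumsum(w)                     # formula (33): cumulative weights w_i*
    XB, yB = [], []                           # step S1004: D_B starts empty
    for _ in range(m):                        # step S1010: repeat m times
        k = rng.uniform(0.0, w_star[-1])      # step S1005: random real number in [0, w_n*]
        j = int(np.searchsorted(w_star, k))   # step S1006: j with w_{j-1}* < k <= w_j*
        XB.append(X[j])                       # step S1007: copy (x_j, y_j) into D_B
        yB.append(y[j])
    wB = np.full(m, 1.0 / m)                  # step S1008: each Bootstrap datum gets weight 1/m
    return np.array(XB), np.array(yB), wB     # step S1009: the accumulated data set D_B
```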
4. Background and basic principle of ensemble algorithm proposed by the present invention
4-1. Introduction
Hereinafter, the background and basic principle of the novel ensemble algorithm proposed by the present invention will be described. The ensemble algorithm proposed by the present invention will be referred to as the Convex Hull Ensemble Machine (CHEM) algorithm.
The CHEM algorithm is an ensemble algorithm for generating a new learning machine by using a plurality of basic learning machines. In a classification problem (response variables are categorical) or a regression model problem (response variables are continuous), a learning problem can be changed to a function estimation problem.
A classification problem in which the response variables have J categories is a problem of estimating a J-dimensional function F = (F_1, ..., F_J). At this time, the function F is defined as the following formula (34):
P(y = k | x) = exp(F_k(x)) / Σ_{l=1}^{J} exp(F_l(x))    (34)
Meanwhile, the regression model problem becomes a problem of estimating a function F(x) = E(y | x). Assume that the true value of the function F is defined as Ψ* (referred to as the "true learning machine"), and that Ψ = {Ψ_θ, θ ∈ Θ} is the set of the basic learning machines for use in the CHEM algorithm. That is, an optimal basic learning machine for given training data is selected as one of the set Ψ. Meanwhile, a method for finding the optimal learning machine has been widely known in the art. As for the set of basic learning machines Ψ for use in data mining, there are a decision tree, a neural network model, and the like. However, such basic learning machines have a serious problem in that they respond very sensitively to changes in the data.
It is the CHEM algorithm proposed by the present invention that finds out and overcomes causes of instability of such basic learning machines. Next, the causes of the instability of the conventional basic learning machines will be discussed below.
4-2. Causes of instability of conventional basic learning machines
The reason why various algorithms used for data mining operate unstably is that various different learning machines within the set of basic learning machines Ψ describe data in similar manners to one another.
The more basic reason why the completely different learning machines describe the data in similar manners is that the true learning machine Ψ* is not included in the considered set of basic learning machines Ψ. Moreover, this is because there are a plurality of learning machines, which have small distances (i.e., degrees of difference) from the true learning machine Ψ*, among the basic learning machines included in the set Ψ.
FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from the true learning machine among the learning machines included in the conventional set of basic learning machines.
As shown in FIG. 11, various learning machines Ψ surround the true learning machine Ψ*. In such a case, even though the data are changed merely to a slight degree, the optimal learning machine may be greatly changed. That is, the optimal learning machine will be one of Ψ_1, Ψ_2 and Ψ_3 depending on the direction in which the data are shifted from the true learning machine Ψ*.
If the set of basic learning machines Ψ is constructed as shown in FIG. 11, it is impossible to properly construct the true learning machine Ψ* even though the optimal learning machine has been constructed well. However, the true learning machine Ψ* can be constructed by combining a plurality of learning machines. This is because the true learning machine Ψ* is located in the convex hull space of the learning machines included in the set Ψ. Particularly, if proper weights w_1, w_2 and w_3 are obtained in FIG. 11, the following formula (35) is established:
Ψ* = w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3    (35)
When examining the meaning of the formula (35), it is noted that the true learning machine Ψ * can be obtained as a weighted average of several learning machines included in the set ψ . The CHEM algorithm proposed by the present invention has been developed by using such an idea.
Hereinafter, the basic principle of a method for combining learning machines according to the CHEM algorithm proposed by the present invention will be explained in detail.
4-3. Basic principle of learning-machine combining method according to CHEM algorithm when true learning machine is known
A method for constructing the true learning machine Ψ* by using the weighted average of several learning machines included in the set Ψ, when the true learning machine Ψ* is known and not included in the set Ψ, will now be introduced.
Thereafter, a method for estimating the true learning machine Ψ* using data when the true learning machine Ψ * is not known will be introduced.
A basic assumption of the CHEM algorithm is that the true learning machine Ψ* is expressed as a weighted average of M learning machines Ψ_1, ..., Ψ_M included in the set Ψ. This is expressed as the following formula (36):
Ψ* = Σ_{i=1}^{M} w_i Ψ_i    (36)
The CHEM algorithm is an algorithm in which the learning machines Ψ_i used for obtaining the weighted average and their weights w_i are sequentially found out. The principle of the algorithm for finding out a (k+1)-th learning machine Ψ_{k+1} and a (k+1)-th weight w_{k+1} by using the k learning machines Ψ_1, ..., Ψ_k and k weights w_1, ..., w_k, which have already been constructed, will be explained below step by step.
First, an optimal learning machine among the learning machines orthogonal to the current ensemble machine F_k according to the formula (36) is set to Ψ_{k+1}. Second, a new ensemble machine F_{k+1} is generated according to the following formula (37). At this time, the weights u_k and w_{k+1} are obtained such that the distance between the ensemble machine F_{k+1} and the true learning machine Ψ* is minimized:
F_{k+1} = ( u_k F_k + w_{k+1} Ψ_{k+1} ) / ( u_k + w_{k+1} )    (37)
This algorithm will be explained in more detail below.
(1) Construction of first learning machine
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention.
As shown in FIG. 12a, an optimal learning machine Ψ_1 (i.e., the learning machine located closest to the true learning machine Ψ*) is obtained.
(2) Construction of second learning machine
FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention.
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out, and weights w_1, w_2 by which the distance between the ensemble machine F_2 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. At this time, w_1 and w_2 are 1/d_1^2 and 1/d_2^2, respectively. Meanwhile, the ensemble machine F_2 according to the formula (37) is expressed as the following formula (38):
F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 )    (38)
where d_1 is the distance between Ψ* and Ψ_1, and d_2 is the distance between Ψ* and Ψ_2.
(3) Construction of third learning machine
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention. An optimal learning machine Ψ_3 orthogonal to F_2 is found out. Then, weights u_2, w_3 by which the distance between the ensemble machine F_3 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. Meanwhile, the ensemble machine F_3 according to the formula (37) is expressed as the following formula (39):
F_3 = ( u_2 F_2 + w_3 Ψ_3 ) / ( u_2 + w_3 )    (39)
Then, the following formula (40) is established:
F_3 = ( w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3 ) / ( w_1 + w_2 + w_3 ),  where u_2 = w_1 + w_2    (40)
(4) Construction of m-th learning machine
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention.
By continuously repeating the above algorithm, the m-th ensemble machine F_m is obtained from the following formula (41):
F_m = Σ_{i=1}^{m} w_i Ψ_i / Σ_{i=1}^{m} w_i    (41)
4-4. Basic principle of learning-machine combining method according to CHEM
algorithm when true learning machine is not known
In all the learning problems, the true learning machine Ψ* is not known. Instead, n training data (x_1, y_1), ..., (x_n, y_n) are given. The objective of the learning is to effectively estimate the true learning machine Ψ* by using the data. Hereinafter, how the aforementioned algorithm is constructed when the data are given will be explained.
A loss function is indicated by l, and the deviance of a given learning machine Φ is defined as the following formula (42):
d(Φ) = Σ_{i=1}^{n} l( y_i, Φ(x_i) )    (42)
In case of categorical data, only two groups are considered.
(1) Construction of first learning machine
An optimal learning machine Ψ_1 is obtained.
(2) Construction of second learning machine
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out. First, a residual is used for obtaining the orthogonal learning machine. The residual r_i can be obtained from the formula (43) when the response variables are categorical:
r_i = y_i - P_1(x_i),  P_1(x) = exp(Ψ_1(x)) / ( 1 + exp(Ψ_1(x)) )    (43)
where P_1 is the probability that y will be 1 when the first ensemble machine is used. Alternatively, the residual r_i can be obtained from the formula (44) when the response variables are continuous:
r_i = y_i - Ψ_1(x_i)    (44)
In case of the categorical data, an optimal learning machine Ψ̃ is constructed by adopting |r_i| as a weight. In case of the continuous data, the optimal learning machine Ψ̃ is obtained by adopting r_i as a response variable.
Meanwhile, the reason for obtaining the optimal learning machine by using the residual is that the residual has a property of being orthogonal to the response variable in a regression model. Therefore, the optimal learning machine Ψ̃ for the residual is nearly orthogonal to the optimal learning machine Ψ_1 for the response variable.
Then, after the optimal learning machine Ψ̃ is obtained by using the residual, a constant η for minimizing d(η Ψ̃) is obtained. In addition, it is set that Ψ_2 = η Ψ̃.
At this time, an ensemble machine is determined from the formula (37). That is, the following formula (45) is established:
F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 ),  where w_1 = 1/d(Ψ_1) and w_2 = 1/d(Ψ_2)    (45)
Meanwhile, d(Ψ) is approximately the square of the distance between Ψ and Ψ* according to the statistical theory.
(3) Construction of third learning machine
The construction of a third learning machine is identical to the construction of the second learning machine except that the residual is obtained by using F2 instead of Ψ1. The ensemble machine F3 for the obtained third learning machine Ψ3 is expressed as the following formula (46):
F_3 = \frac{w_1 \Psi_1 + w_2 \Psi_2 + w_3 \Psi_3}{w_1 + w_2 + w_3}, \quad w_3 = 1/d(\Psi_3)        (46)
(4) By repeating the above algorithm, the m-th ensemble machine Fm is obtained from the following formula (47):

F_m = \frac{\sum_{i=1}^{m} w_i \Psi_i}{\sum_{i=1}^{m} w_i}        (47)
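The data-driven combining principle described above may be sketched roughly as follows: fit a first regression tree to the response, fit a second tree to the residuals, rescale the second tree by a correction constant η, and average the two machines with weights 1/d(Ψi). The use of scikit-learn and SciPy, the squared-error loss, and the tree depth are assumptions of this illustration and not requirements of the method.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + np.sin(2 * X[:, 1]) + 0.3 * rng.normal(size=500)   # toy continuous response

def deviance(pred, y):
    """d(Phi) = sum_i l(y_i, Phi(x_i)) with squared-error loss, cf. formula (42)."""
    return np.sum((y - pred) ** 2)

# (1) First learning machine Psi_1: fit the response directly.
psi1 = DecisionTreeRegressor(max_depth=4).fit(X, y)
f1 = psi1.predict(X)

# (2) Second learning machine: fit the residual r_i = y_i - Psi_1(x_i), cf. formula (44).
r = y - f1
psi2_tilde = DecisionTreeRegressor(max_depth=4).fit(X, r)
g2 = psi2_tilde.predict(X)

# Correction constant eta minimizing d(eta * Psi~_2); then Psi_2 = eta * Psi~_2.
eta = minimize_scalar(lambda e: deviance(e * g2, y), bounds=(-5, 5), method="bounded").x
f2 = eta * g2

# Deviance-based weights w_i = 1/d(Psi_i) and the ensemble of formulas (45) and (47).
w1, w2 = 1.0 / deviance(f1, y), 1.0 / deviance(f2, y)
F2 = (w1 * f1 + w2 * f2) / (w1 + w2)

print("deviance of Psi_1:", deviance(f1, y))
print("deviance of F_2  :", deviance(F2, y))
```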
5. CHEM algorithm in two-class classification problem
Hereinafter, the CHEM algorithm proposed by the present invention will be explained in detail with respect to a two-class classification problem.
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention.
First, various kinds of variables are initialized in step S1301. Bootstrap data are generated by applying weights to the multidimensional data in step S1302. Then, a probability that the response variable will be 1 for a given explanatory variable is estimated by using the basic learning machines in step S1303. In step S1304, a correction parameter for minimizing a given deviance is calculated. Then, a corrected learning machine is constructed by using the correction parameter in step S1305. Subsequently, an ensemble machine is constructed based on the corrected learning machine in step S1306.
Then, in step S1307, it is determined whether the ensemble machine satisfies stop rules to be described later. If it does not satisfy the stop rules, the procedure returns to step S1302. If it does, the procedure is terminated.
The CHEM algorithm will be explained in detail by means of formulas and the like.
(1) First step: The response variable yi is set to 1 if the data belong to Group 2, whereas it is set to 0 if the data belong to Group 1.
(2) Second step: Various kinds of variables are initialized. That is, wi = 1/n is set for the n weights w1, ..., wn, and F(x) = 0 is set.
(3) Third step: The Bootstrap data (x1B, y1B), ..., (xnB, ynB) are generated using the weights {wi}. The generation of the Bootstrap data has been already explained above.
(4) Fourth step: A probability p̂m(x) = P(y = 1|x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: The correction parameter η for minimizing a value of the following formula (48) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i, \eta f_m(x_i))        (48)

where f_m(x) = \log(\hat{p}_m(x)/(1 - \hat{p}_m(x))).
(6) Sixth step: The learning machine is newly corrected and the ensemble machine is reconstructed by using the correction parameter. This process is expressed as the following formula (49):

F(x) = \frac{\sum_{k=1}^{m} u_k \tilde{f}_k(x)}{\sum_{k=1}^{m} u_k}        (49)

where \tilde{f}_m(x) = \eta f_m(x), u_m = 1/\sum_{i=1}^{n} l(y_i, \tilde{f}_m(x_i)), and the weight wi is updated according to w_i = \left| (y_i - P(x_i)) / \sqrt{P(x_i)(1 - P(x_i))} \right|, where P(x) = \exp(F(x)) / (1 + \exp(F(x))).
(7) Seventh step: In order to construct a final ensemble machine, the third to sixth steps are repeated M times (m = 1, ..., M). Then, the final ensemble machine is set to H(x) = F(x), and for a new explanatory variable x, x is assigned to Group 2 if H(x) > 0 and to Group 1 if H(x) < 0.
Meanwhile, although various loss functions may be used for the above loss function, an exponential loss function or a negative log-likelihood loss function can be used in order to obtain more preferable results. The exponential loss function and the negative log-likelihood loss function are expressed as the following formulas (50) and (51), respectively:

l(y, f) = \exp(-(2y - 1) f / 2)        (50)

l(y, f) = -(y f - \log(1 + \exp(f)))        (51)
Alternatively, the present invention may generate the basic learning machines by using the weights instead of the Bootstrap data. However, in most cases, the use of the Bootstrap data results in better performance.
Further, although various basic learning machines may be employed in the present invention, this embodiment utilizes the decision tree. Contrary to the conventional boosting algorithms, the basic learning machines in the CHEM algorithm are strong learning machines. Therefore, the optimal decision tree obtained through the entire process of constructing the decision tree is used instead of a simple decision tree. At this time, the TIC is used to overcome the computational problem, which has already been described above.
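A rough Python sketch of the two-class procedure described above is given below. It assumes scikit-learn decision trees as the basic learning machines, the negative log-likelihood loss of formula (51), and a simple bounded search for the correction parameter η; these choices are illustrative rather than prescribed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy.optimize import minimize_scalar

def nll_loss(y, f):
    """Negative log-likelihood loss of formula (51): l(y, f) = -(y*f - log(1 + exp(f)))."""
    return -(y * f - np.logaddexp(0.0, f))

def chem_two_class(X, y, M=20, max_depth=3, seed=0):
    """Sketch of the two-class CHEM loop (first to seventh steps); y must be coded 0/1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)            # second step: initial weights w_i = 1/n
    num, den = np.zeros(n), 0.0        # running numerator / denominator of formula (49)
    F = np.zeros(n)                    # ensemble score F(x) on the training points

    for m in range(M):
        # Third step: bootstrap sample drawn according to the current weights.
        idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])
        if len(tree.classes_) < 2:     # degenerate bootstrap sample; skip this round
            continue

        # Fourth step: estimated probability p_m(x), kept away from 0 and 1.
        p = np.clip(tree.predict_proba(X)[:, 1], 1e-6, 1 - 1e-6)
        f = np.log(p / (1.0 - p))      # f_m(x) = log(p_m(x) / (1 - p_m(x)))

        # Fifth step: correction parameter eta minimizing formula (48).
        eta = minimize_scalar(lambda e: nll_loss(y, e * f).sum(),
                              bounds=(0.0, 5.0), method="bounded").x
        f_tilde = eta * f

        # Sixth step: deviance weight u_m and updated ensemble F of formula (49).
        u = 1.0 / nll_loss(y, f_tilde).sum()
        num, den = num + u * f_tilde, den + u
        F = num / den

        # Weight update w_i = |(y_i - P(x_i)) / sqrt(P(x_i)(1 - P(x_i)))|.
        P = np.clip(1.0 / (1.0 + np.exp(-F)), 1e-6, 1.0 - 1e-6)
        w = np.clip(np.abs((y - P) / np.sqrt(P * (1.0 - P))), 1e-12, None)

    return F                            # seventh step: H(x) = F(x); Group 2 if F(x) > 0
```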
An algorithm for expanding the CHEM algorithm up to a more-than-two-class classification problem will be described below.
6. CHEM algorithm in multi-class classification problem
An outline of a CHEM algorithm expanded up to a multi-class classification problem also follows the procedure of FIG. 13. However, the CHEM algorithm for the multi-class classification problem is slightly different from the CHEM algorithm for the two-class classification problem in view of the formulas etc. applied to them. This algorithm will be described in detail below.
(1) First step: Various variables are initialized. That is, the weight wij is set to wij = 1/n, where i = 1, ..., n and j = 1, ..., J; and it is set that Fj(x) = 0, where j = 1, ..., J.
(2) Second step: yij* is set to 1 if the i-th data belong to the j-th group and to 0 if not.
(3) Third step: The Bootstrap data (x1B, y1j*B), ..., (xnB, ynj*B) are generated using the weights {w1j, ..., wnj}. The generation of the Bootstrap data has been already explained above.
(4) Fourth step: A probability p̂mj(x) = P(yj* = 1|x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: It is set that fmj(x) = log(p̂mj(x)/(1 - p̂mj(x))).
(6) Sixth step: The second to fifth steps are repeated J times (j = 1, ..., J).
(7) Seventh step: A correction parameter η for minimizing a value of the following formula (52) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i^*, \eta \bar{f}_m(x_i))        (52)

where \bar{f}_{mj}(x) = f_{mj}(x) - \frac{1}{J} \sum_{k=1}^{J} f_{mk}(x), and \bar{f}_m(x) = (\bar{f}_{m1}(x), ..., \bar{f}_{mJ}(x)).
(8) Eighth step: A learning machine is newly corrected and the ensemble machine is constructed using the correction parameter. This process is expressed as the following formula (53):

F_j(x) = \frac{\sum_{k=1}^{m} u_k \tilde{f}_{kj}(x)}{\sum_{k=1}^{m} u_k}, \quad j = 1, ..., J        (53)

where \tilde{f}_{mj}(x) = \eta \bar{f}_{mj}(x), u_m = 1/\sum_{i=1}^{n} l(y_i^*, \tilde{f}_m(x_i)), F(x) = (F_1(x), ..., F_J(x)), the weight wij is updated according to w_{ij} = \left| (y_{ij}^* - P_j(x_i)) / \sqrt{P_j(x_i)(1 - P_j(x_i))} \right|, where P_j(x) = \exp(F_j(x)) / \sum_{k=1}^{J} \exp(F_k(x)), and yij* is 1 if the i-th observed value belongs to the j-th group and 0 if not.
(9) Ninth step: A final ensemble machine is constructed by repeating the second to eighth steps M times (m = 1, ..., M). At this time, a new explanatory variable x is assigned to the group argmaxj Fj(x).
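In the multi-class case, the two pieces that differ from the two-class case are the centering of the per-group scores in the seventh step and the softmax form of Pj(x) used in the weight update. The fragment below isolates those pieces (NumPy only; all names are illustrative).

```python
import numpy as np

def center_scores(f):
    """Centered scores of the seventh step: f_bar_mj(x) = f_mj(x) - mean_k f_mk(x)."""
    return f - f.mean(axis=1, keepdims=True)            # f has shape (n, J)

def softmax_probs(F):
    """P_j(x) = exp(F_j(x)) / sum_k exp(F_k(x)), used in the weight update of formula (53)."""
    e = np.exp(F - F.max(axis=1, keepdims=True))         # subtract the max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def update_weights(Ystar, F):
    """w_ij = |(y_ij* - P_j(x_i)) / sqrt(P_j(x_i)(1 - P_j(x_i)))| for 0/1 indicators Ystar."""
    P = softmax_probs(F)
    return np.abs((Ystar - P) / np.sqrt(np.clip(P * (1.0 - P), 1e-12, None)))

# Tiny usage example with n = 4 observations and J = 3 groups.
f_m = np.array([[ 0.2, -0.1,  0.0],
                [ 1.0,  0.3, -0.5],
                [-0.4,  0.8,  0.1],
                [ 0.0,  0.0,  0.0]])
f_bar = center_scores(f_m)                # centered scores entering formula (52)
Ystar = np.eye(3)[[0, 1, 2, 1]]           # one-hot group membership indicators y_ij*
print(update_weights(Ystar, f_bar))       # here f_bar stands in for F after one round, for illustration
```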
An algorithm for expanding the CHEM algorithm up to a continuous variable problem will be described below.
7. CHEM algorithm in continuous variable problem
Hereinafter, an algorithm (regression CHEM algorithm) for constructing an ensemble machine in a case where the response variables are continuous will be explained. Such a case is called a regression model, whose basic model is as follows. FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention, which will be explained in detail below.
First, various variables are initialized and the response variables are defined in step S1401. A regression model is constructed using the basic learning machines in step S1402. Then, a correction parameter for minimizing a given deviance is calculated in step S1403.
In step S1404, the response variables are updated based on the correction parameter. An ensemble machine is constructed based on the updated response variables in step S1405.
Then, in step S1406, it is determined whether the ensemble machine satisfies stop rules to be described later. If not, the procedure returns to step S1402. If so, the procedure is terminated.
This procedure will be described in more detail below.
First, assume that n multidimensional training data (x1, y1), ..., (xn, yn) are given, where xi is a p-dimensional explanatory variable, i.e. xi = (x1i, ..., xpi), and yi is the response variable. An objective of this algorithm is to find out a relationship that can best explain the response variables by using the explanatory variables based on the n training data. In other words, it is to construct an optimal function H : R^p → R by using the training data. Further, if a new explanatory variable x is given, a response variable for the given explanatory variable is estimated from H(x). Meanwhile, H(x) = E(y|x) is established in statistics. That is, the objective of the regression analysis is to estimate a conditional expected value.
The CHEM algorithm for the continuous variables is as follows.
(1) First step: It is set that zi = yi for a new response variable.
(2) Second step: A regression model ψm(x) is obtained through the basic learning machines by setting the response variable to zi and the explanatory variable to xi.
(3) Third step: A correction parameter η for minimizing a value of the following formula (54) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i, \eta \psi_m(x_i))        (54)

(4) Fourth step: The response variable zi is updated using the correction parameter η. This process is expressed as the following formula (55):

z_i = y_i - F(x_i)        (55)

where \tilde{\psi}_m(x) = \eta \psi_m(x), u_m = 1/\sum_{i=1}^{n} l(y_i, \tilde{\psi}_m(x_i)), and F(x) is updated according to F(x) = \sum_{k=1}^{m} u_k \tilde{\psi}_k(x) / \sum_{k=1}^{m} u_k.
(5) Fifth step: The second to fourth steps are repeated M times (m = 1, ..., M).
(6) Sixth step: A final ensemble machine is constructed such that H(x) = F(x).
Meanwhile, as for the loss function l, various known loss functions such as l(y, T) = (y - T)^2, l(y, T) = |y - T| and the like may be used.
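A minimal sketch of the regression CHEM loop described above is given below, assuming scikit-learn regression trees as the basic learning machines, the squared-error loss l(y, T) = (y - T)^2, and a bounded search for η; these are illustration choices, not requirements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar

def chem_regression(X, y, M=20, max_depth=4):
    """Sketch of the regression CHEM loop (first to sixth steps) with squared-error loss."""
    def loss(y_true, pred):
        return np.sum((y_true - pred) ** 2)    # l(y, T) = (y - T)^2 summed over the data

    n = len(y)
    z = y.copy()                               # first step: z_i = y_i
    num, den = np.zeros(n), 0.0                # running weighted sum of corrected machines
    F = np.zeros(n)

    for m in range(M):
        # Second step: regression model psi_m fitted with z as the response.
        psi = DecisionTreeRegressor(max_depth=max_depth).fit(X, z).predict(X)

        # Third step: correction parameter eta minimizing sum_i l(y_i, eta * psi_m(x_i)).
        eta = minimize_scalar(lambda e: loss(y, e * psi),
                              bounds=(-10.0, 10.0), method="bounded").x
        psi_tilde = eta * psi

        # Fourth step: deviance weight u_m, updated ensemble F, and updated response z = y - F.
        u = 1.0 / max(loss(y, psi_tilde), 1e-12)
        num, den = num + u * psi_tilde, den + u
        F = num / den
        z = y - F

    return F                                   # sixth step: final ensemble H(x) = F(x)
```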
8. Stop rules
Hereinafter, an algorithm for determining the number of basic learning machines required for constructing a final ensemble machine will be described. A basic idea of the stop rules is to terminate the entire algorithm without further updating the current ensemble machine when the deviance of the ensemble machine is minimum. The stop rules will be described below.
(1) First step: A positive integer K is determined.
(2) Second step: Assume that Fm is an ensemble machine constructed by using the first m basic learning machines. A deviance of the ensemble machine Fm for a given loss function l is calculated according to the following formula (56):
d_m = \sum_{i=1}^{n} l(y_i, F_m(x_i))        (56)
(3) Third step: o_m = argmin_{j=1,...,m} d_j and o_{m+K} = argmin_{j=1,...,m+K} d_j are set. The ensemble machine Fm for the first m satisfying the relationship o_m = o_{m+K} is determined as the final ensemble machine. Then, the algorithm is terminated.
Meanwhile, the positive integer K is a value defined by a user.
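The stop rules above can be phrased as a small helper that records where the minimum deviance occurred and stops once K additional machines have not moved that minimum. A NumPy sketch (with a user-chosen K) follows.

```python
import numpy as np

def stopping_index(deviances, K):
    """Return the first m (1-based) with the argmin over the first m equal to the argmin
    over the first m+K, i.e. the minimum deviance did not move for K further machines;
    return None if the rule is never satisfied, cf. formula (56) and the third step."""
    d = np.asarray(deviances, dtype=float)
    for m in range(1, len(d) - K + 1):
        o_m = int(np.argmin(d[:m]))            # argmin over d_1 ... d_m
        o_mK = int(np.argmin(d[:m + K]))       # argmin over d_1 ... d_(m+K)
        if o_m == o_mK:
            return m
    return None

# Usage: deviances d_m of successive ensemble machines F_m, with K = 3.
d = [10.0, 8.5, 7.9, 8.1, 8.3, 8.6, 8.2, 8.4]
print(stopping_index(d, K=3))   # the minimum at the third machine is not improved upon for K more machines
```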
Table 2 below shows simulation data for comparing the predictabilities of a case where the stop rules are not applied to the CHEM algorithm and a case where the stop rules are applied to the CHEM algorithm, with respect to various data. As shown in Table 2 below, both predictabilities are almost similar to each other. Thus, when the stop rules are applied to the algorithm, an optimal ensemble machine can be constructed using a smaller number of basic learning machines, resulting in a great increase in computational speed.
Table 2
9. Algorithm for generating association rules
9-1. Introduction
An algorithm for searching various kinds of association rules for describing data, using the basic learning machines that generate the final ensemble machine, will be explained.
Meanwhile, the association rule generating algorithm of the present invention may be applied to any ensemble machines constructed by various conventional methods.
9-2. Algorithm for generating association rules
The association rule generating algorithm is as follows.
Referring to FIG. 15, there is provided a flowchart schematically illustrating the association rule generating algorithm according to one embodiment of the present invention. It will be described in detail below.
(1) step S1501: Initialize all kinds of variables. That is, if the response variables are categorical data, a group of interest, the minimum permissible number of data and the minimum permissible confidence are defined as g, m, and p, respectively. Further, if the response variables are of a continuous type, the range of interest is defined as (gL, gU) instead of the group g.
(2) step S1502: Construct a universal set S of fundamental rules. It will be described in detail below.
By searching the tree architecture, all nodes that have a number of data larger than m and a probability of the group g larger than p are selected from all nodes of the basic learning machines, i.e., all basic learning machines that are used in the ensemble (in case of continuous-type variables, the probability of the range is applied thereto).
On the other hand, in order to obtain more desirable results, since there are as many basic learning machines as ensembles in a two-class classification problem, all of the basic learning machines are searched so as to select all nodes that meet the conditions. In a more-than-two-class classification problem (i.e., a multi-class classification problem), there are as many basic learning machines as the product of the number of ensembles and the number of groups, and all nodes that meet the conditions are selected by using only the basic learning machines corresponding to the group of interest g, i.e., as many machines as the number of ensembles.
The rules of the nodes that are selected as described above are referred to as the universal set S of fundamental rules.
(3) step S1503: For all rules that are included in the universal set S, cancel from the universal set S the rules of nodes higher than the node that generates a relevant rule. It will be described in detail below.
For all rules si ∈ S (i = 1, 2, ..., N), the following is repeated N times. After selection of si ∈ S, if the node that generates sk (k = 1, ..., N) is a node higher than the node that generates the selected si, then sk is cancelled from the universal set S.
(4) step S1504: Calculate confidences for all rules that are included in the set S. At this time, the confidence for each rule is calculated by ng/n, where the total number of data that belong to the relevant rule is n and the number of data that belong to the group of interest (i.e., the group g) is ng.
(5) step S1505: Arrange the calculated rules of the set S. That is, the calculated rules of the set S are arranged in decreasing order of confidence. The arranged rules are then referred to as o1, ..., oN.
(6) step S1506: Assume that an association rule set is R. For each arranged rule oh (h = 1, ..., N), if R is an empty set, the rule oh is added to the set R. If the set R is not an empty set, all of the association rules that are included in the set R are compared with the rule oh to determine similarity. If the rule oh is similar to none of the association rules that are included in the set R, the rule oh is added to the set R.
Meanwhile, the similarity comparing method is as follows. Suppose that two given rules o and r for the explanatory variables x = (x1, ..., xp) are defined as the formula (57) below:

o: x_i \in R_{oi} \; (i = 1, ..., p), \qquad r: x_i \in R_{ri} \; (i = 1, ..., p)        (57)

Suppose that the set of data in which xi belongs to R_oi is defined as a set D_oi, and the set of data in which xi belongs to R_ri is defined as a set D_ri. At this time, R_oi and R_ri are subsets of the real line R.
First, the maximum permissible similarity s ∈ [0, 1] is determined, and it is then determined that the two rules o and r have similarity for xi if the ratio of the number of data belonging to D_oi ∩ D_ri to the number of data belonging to D_oi ∪ D_ri is equal to or larger than s. If the ratio is smaller than s, it is determined that the two rules o and r do not have similarity for xi. Such a procedure is repeated for i = 1, ..., p.
Then, if it is determined that there is similarity for all xi, it is determined that the rules o and r have similarity. Otherwise, if there is any xi for which similarity does not hold, it is determined that the rules o and r do not have similarity.
(7) step S1507: Using all rules that are included in the association rule set R, the data are analyzed in order of confidence.
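The filtering, confidence, and similarity steps above can be sketched as follows. For brevity, a rule is represented simply as a boolean mask over the data indicating which observations satisfy it, and the per-variable similarity test of step S1506 is collapsed into a single whole-rule comparison; these simplifications and all names are assumptions made for illustration.

```python
import numpy as np

def confidence(mask, y, g):
    """Confidence n_g / n of a rule: fraction of covered data belonging to group g."""
    n = int(mask.sum())
    return (y[mask] == g).sum() / n if n > 0 else 0.0

def similar(mask_a, mask_b, s=0.7):
    """Simplified similarity test: |D_a ∩ D_b| / |D_a ∪ D_b| >= s.
    (The text applies this test variable by variable; here whole rules are compared.)"""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return union > 0 and inter / union >= s

def select_rules(candidate_masks, y, g, m_min, p_min, s=0.7):
    """Roughly steps S1502 and S1504-S1506: filter candidate rules, sort by confidence,
    and keep only rules not similar to an already accepted rule."""
    kept = [mask for mask in candidate_masks
            if mask.sum() >= m_min and confidence(mask, y, g) >= p_min]
    kept.sort(key=lambda mask: confidence(mask, y, g), reverse=True)
    R = []
    for mask in kept:
        if not any(similar(mask, r, s) for r in R):
            R.append(mask)
    return R

# Usage: three overlapping candidate rules over 10 observations, group of interest g = 1.
y = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 1])
rules = [np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], bool),
         np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], bool),
         np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], bool)]
print(len(select_rules(rules, y, g=1, m_min=3, p_min=0.5)))
```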
9-3. Performance experiment of algorithm for generating association rules
Hereinafter, test data for showing the analysis capacity of the association rule generating algorithm will be explained. As the ensemble constructing scheme applied to this test, the well-known CART algorithm was used. Further, a decision tree that is constructed in the CART algorithm is applied to the association rule generating algorithm. Further, German data are used as actual data in this test.
Association results of actual data are shown in Tables 3 and 4 below.
Table 3
Search results of association rules using the CART algorithm
Table 4
Search results of the association rule algorithm applied to the CHEM algorithm
The German data include 700 good credit customers and 300 bad credit customers among 1,000 pieces of data on credit transaction status, which were analyzed by using the same minimum permissible number of data and minimum permissible confidence for an accurate comparison of the association rules. For analyzing the bad credit customers, the minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 50%. For analyzing the good credit customers, the minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 85%.
The search results for the association rules of the CART algorithm under the above conditions were one bad credit customer group and one good credit customer group. In the example where the association rule algorithm is applied to the CHEM algorithm, the search results were 5 bad credit customer groups and 4 good credit customer groups.
The search of the association rules of the CHEM algorithm corresponds to a wide-range search of the association rules including the association rules of the CART algorithm. It will be readily noted that the results of the CHEM algorithm provide the association rules including the results of the CART algorithm. Further, since the data corresponding to the association rules searched by the CART algorithm are generated from one basic learning machine, they consist of data which are mutually exclusive from one another. On the contrary, the association rules of the CHEM algorithm are represented as a set including data which are non-exclusive from one another.
10. Comparison of performances of ensemble methods through tests
Hereinafter, the performances of various ensemble methods will be compared through tests.
10-1. Comparison of performances through virtual tests
The following models were used in the virtual tests.
(1) Model 1: Radius 2
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution.
Response variables: It was set that if \sum_{i=1}^{10} x_i^2 > c, then y = 1 with a probability of 0.9 and y = -1 with a probability of 0.1. Further, it was set that if \sum_{i=1}^{10} x_i^2 < c, then y = -1 with a probability of 0.9 and y = 1 with a probability of 0.1. Here, c is a constant satisfying the relationship Pr(\sum_{i=1}^{10} x_i^2 > c) = 0.5.
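For reference, data following Model 1 as described above could be generated roughly as follows (a NumPy/SciPy sketch; the cutoff c is taken as the median of the chi-squared distribution with 10 degrees of freedom, since the sum of squares of 10 standard normal variables follows that distribution).

```python
import numpy as np
from scipy.stats import chi2

def make_model1(n, seed=0):
    """Generate Model 1 data: y = 1 w.p. 0.9 where sum x_i^2 > c, and y = -1 w.p. 0.9 otherwise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))                    # 10 independent standard normal variables
    c = chi2.ppf(0.5, df=10)                        # c with Pr(sum x_i^2 > c) = 0.5
    inside = (X ** 2).sum(axis=1) > c
    keep = rng.random(n) < 0.9                      # keep the 'correct' label with probability 0.9
    y = np.where(inside == keep, 1, -1)
    return X, y

X_train, y_train = make_model1(1000)
X_test, y_test = make_model1(5000, seed=1)
```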
(2) Model 2: Interaction
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution.
Response variables: The response variables were 0-1 variables following Pr(y = 1|x) = exp(F(x))/(1 + exp(F(x))), and F(x) is expressed as the following formula (58):

F(x) = \Big( \sum^{6}(\,\cdot\,) + \sum^{6}(\,\cdot\,) \Big)        (58)
(3) Model 3: Two Normal
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
The first 500 training data belonged to Group 1, and the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution. The remaining 500 training data belonged to Group 2, and the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a normal distribution with a mean of 0 and a variance of 2.
(4) Model 4: Simple Quadratic
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variable x followed a standard normal distribution.
Response variables were 0-1 variables following the following formula (59), with F(x) = -x^2 + 2:

Pr(y = 1 | x) = \frac{\exp(F(x))}{1 + \exp(F(x))}        (59)
FIGS. 16a to 16d are graphs showing the results of the 4 virtual tests.
As can be seen from FIGS. 16a to 16d, the CHEM algorithm proposed by the present invention operates very stably. Particularly, in the case of Model 4, all the other algorithms show poorer performance as the number of trees is increased, whereas there is no such problem in the CHEM algorithm.
Table 5 below shows numerical values representing the results of FIGS. 16a to 16d.
Table 5. Results of virtual tests
Meanwhile, it can be understood that the predictability of the CHEM algorithm is superior to those of the conventional ensemble algorithms in most of the models. It can also be understood that the CHEM algorithm is greatly superior to the conventional algorithms in view of the deviance in addition to the predictability. That is, the deviances of the conventional algorithms other than the CHEM algorithm increase continuously. This means that the conventional ensemble algorithms can hardly perform function estimation.
On the contrary, it can be seen that the deviance of the CHEM algorithm remains stable. The phenomenon that the predictability is good but the deviance increases is found in all the ensemble algorithms except the CHEM algorithm. Such a phenomenon means that the boundary between the two groups can be found well in the classification problem but the other information is lost.
For example, a probability that a response variable y will belong to a k-th group for a new explanatory variable x cannot be estimated in all the ensemble algorithms except for the CHEM algorithm.
Consequently, the virtual test results show that the CHEM algorithm proposed by the present invention has both excellent predictability and stability.
10-2. Comparison through analysis of actual data
The CHEM algorithm and the existing algorithms were compared with one another through analyses of various actual data. The actual data used were obtained from the 'Machine Learning Web Site' (http://wwwl.ics.uci.edu/~mlearn MLRepository.html) of 'UC Irvine' described above.
Table 6 below shows information on the actual data.
Table 6. Information on actual data
FIGS. 17a to 17i are graphs showing the results of the actual data analysis.
It can be understood that the CHEM algorithm operates very stably in most cases. Although the CHEM algorithm operates slightly unstably in the case of the 'Ionosphere data,' the difference in stability between these data and the other data is not large. In the case of the 'German data,' it can be understood that the CHEM algorithm is superior to the other algorithms in view of predictability and stability. It can also be seen that, similarly to the aforementioned virtual tests, the CHEM algorithm has the smallest deviance value.
Table 7 below numerically shows the results of FIGS. 17a to 17i.
Table 7. Analysis results of actual data
In conclusion, the CHEM algorithm has very excellent predictability, is at the same time very stable (it does not produce greatly wrong estimation results for any data), and enables function estimation (i.e., the deviance is small).
The function estimation means probability estimation. The CHEM algorithm can represent the values of response variables for given data as values with a discrete probability distribution, contrary to the conventional ensemble algorithms. This can be very usefully utilized in the analysis of the actual data.
Moreover, Table 8 below shows test results of continuous variables.
Table 8. Analysis results of continuous variables
In Table 8, the LS boost algorithm was used as the conventional algorithm.
The Friedman model was a virtual model in which the number of training data was 500 and the number of test data was 5,000. Further, the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a uniform distribution on [0, 1]. Moreover, the response variables were y = 10 sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon, where \epsilon ~ N(0, 1).
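Data following the Friedman model above could be generated, for example, as follows (a NumPy sketch).

```python
import numpy as np

def make_friedman(n, seed=0):
    """Friedman model: y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + eps."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 10))         # 10 independent Uniform[0, 1] variables
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4]
         + rng.normal(size=n))                      # eps ~ N(0, 1)
    return X, y

X_train, y_train = make_friedman(500)
X_test, y_test = make_friedman(5000, seed=1)
```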
Data for finding out influences of various environmental variables on prices of houses in Boston were used as the actual data (Boston Housing Data). These data are easily available from the Internet.
The number of these data was 506, and the 5-fold cross validation method was used for obtaining a test error. It can be understood from the analysis results of the virtual tests and the data on the prices of houses in Boston that the CHEM algorithm operates well in the regression model as well as in the classification problem. Particularly, reduction parameters should be defined by a user in the conventional LS boosting method, whereas it is hardly necessary for the user to define anything in the CHEM algorithm.
According to the present invention, there are the following advantages.
First, the proposed CHEM algorithm is far superior to the conventional ensemble construction algorithms in view of predictability and operates very stably. That is, there is a remarkably reduced chance that the overfitting problem occurs. Second, the proposed CHEM algorithm overcomes the drop in analyzability which is commonly produced by the existing ensemble construction methods, and shows better analyzability than other data mining methods by using the association rule algorithm.
Third, the CHEM algorithm can be naturally applied to the continuous variable problem and thus easily applied to general industrial fields.
Although the invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made thereto without departing from the spirit and scope of the invention defined by the appended claims.