APPARATUS AND METHOD FOR CONSTRUCTING DATA MINING MODEL
USING ENSEMBLE MACHINES
Technical Field
The present invention relates to an apparatus and method for constructing a data mining model using ensemble machines, and more particularly, to an apparatus and method for constructing a data mining model whereby faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained, as compared with conventional methods, when analyzing and predicting huge multidimensional data.
Background Art
1. Introduction
In data mining, many ensemble machines based on the Bagging algorithm first proposed by Breiman have been studied to date.
That is, studies have been performed to develop the Bagging algorithm further. As results of such studies, there are a Boosting algorithm proposed by Freund and Schapire, an Arcing algorithm proposed by Breiman, a random forest algorithm proposed by Breiman, and the like.
Among these ensemble machines, various improved algorithms are based on the boosting algorithm proposed by Freund and Schapire because of its superior predictability. As for such improved algorithms, there are a Real Boosting algorithm proposed by Schapire and Singer and a Gradient boosting algorithm proposed by Friedman. That is, the conventional ensemble algorithms for use in data mining are mostly based on the boosting algorithm.
Hereinafter, various conventional boosting algorithms which have been applied to classification problems will be briefly explained. In particular, the real boosting algorithm, a Logit boosting algorithm and the gradient boosting algorithm, which are currently used most widely, will be explained.
2. Conventional boosting algorithms applicable to two-class classification problem
The boosting algorithm is a method which is generally used in a classification problem. A basic model of the classification problem is as follows:
Assume that n training data (x_1, y_1), ..., (x_n, y_n) are given. Here, x_i is a p-dimensional explanatory variable, i.e. x_i = (x_1i, ..., x_pi); and a response variable y_i designates the group in which the data are included. That is, if there are J groups, y_i takes one integer value from 1 to J.
An objective of the classification problem is to find out a relationship in which the response variable can be best explained based on the explanatory variable using the n training data. In other words, the objective is to obtain an optimal function H: R^p → {1, 2, ..., J} using the training data. Then, when a new explanatory variable x is given, the data is assigned to Group H(x).
First, a case where there are provided two groups, that is, J = 2, is considered. A case where there are provided multiple groups will be explained later. Various boosting algorithms will now be explained below.
2-1. Real boosting algorithm (Schapire and Singer, 1999)
The real boosting algorithm is most representative of the algorithms for constructing an ensemble machine for data mining. This algorithm is described in detail in United States Patent No. 5,819,247 entitled
"Apparatus and methods for machine learning hypotheses," and will be specifically explained as follows.
FIG. 1 is a flowchart illustrating the conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 1.
(1) Step S101: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S102: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n.
(3) Step S103: A probability p_m that a value of the response variable with respect to the explanatory variable x will be 1 is estimated through basic learning machines using the n training data (x_1, y_1), ..., (x_n, y_n) and the weights w_1, ..., w_n. The probability p_m is determined according to the following formula (1):
p_m(x) = P_w(y = 1 | x)    (1)
(4) Step S104: f_m(x) is obtained by transforming the probability p_m(x). The f_m(x) is determined according to the following formula (2):
f_m(x) = (1/2) log( p_m(x) / (1 - p_m(x)) )    (2)
(5) Step S105: New weights are obtained from the following formula (3) and then normalized to satisfy the relationship Σ_{i=1}^{n} w_i = 1:
w_i,new = w_i,old × exp( -y_i f_m(x_i) ),  i = 1, ..., n    (3)
(6) Step S106: M basic learning machines are generated by repeating the steps S103 to S105 M times (m = 1, ..., M).
(7) Step S107: A final ensemble machine is determined according to the following formula (4). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
H(x) = Σ_{m=1}^{M} f_m(x)    (4)
Thus, the final ensemble machine is constructed through the foregoing processes.
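For illustration only, the loop of steps S101 to S107 can be sketched in code. The following is a minimal Python sketch, assuming scikit-learn decision trees as the basic learning machines and y given as a NumPy array of -1/1 labels; the tree depth, the probability clipping constant eps, and the function names are illustrative choices, not part of the algorithm described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_boost(X, y, M=50, eps=1e-5):
    """Two-class real boosting (steps S101 to S107); y must hold the labels -1 and 1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step S102: uniform initial weights
    trees = []
    for _ in range(M):                           # step S106: repeat M times
        tree = DecisionTreeClassifier(max_depth=2)
        tree.fit(X, y, sample_weight=w)          # step S103: weighted basic learning machine
        p = tree.predict_proba(X)[:, list(tree.classes_).index(1)]
        p = np.clip(p, eps, 1 - eps)             # formula (1): estimated P_w(y = 1 | x)
        f = 0.5 * np.log(p / (1 - p))            # formula (2): half log-odds transform
        w = w * np.exp(-y * f)                   # formula (3): reweighting
        w = w / w.sum()                          # normalization so the weights sum to 1
        trees.append(tree)
    return trees

def real_boost_predict(trees, X, eps=1e-5):
    """Formula (4): H(x) is the sum of the f_m(x); assign Group 2 when H(x) > 0."""
    H = np.zeros(len(X))
    for tree in trees:
        p = np.clip(tree.predict_proba(X)[:, list(tree.classes_).index(1)], eps, 1 - eps)
        H += 0.5 * np.log(p / (1 - p))
    return np.where(H > 0, 2, 1)
```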
2-2. Logit boosting algorithm (Friedman et al., 2000)
FIG. 2 is a flowchart illustrating the conventional Logit boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 2.
(1) Step S201: A response variable y_i is set either to 1 if the data belong to Group 2 or to 0 if the data belong to Group 1.
(2) Step S202: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n, and F(x) and p(x) are also initialized to 0 and 1/2, respectively.
(3) Step S203: A new response variable z_i and a new weight w_i with respect to the given explanatory variable are obtained from the following formula (5):
z_i = ( y_i - p(x_i) ) / ( p(x_i)(1 - p(x_i)) ),  w_i = p(x_i)(1 - p(x_i))    (5)
(4) Step S204: A regression model f_m(x) is constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S205: A probability is updated using the f_m(x) constructed in step S204 and is expressed as the following formula (6):
F(x) = F(x) + (1/2) f_m(x),  p(x) = exp(F(x)) / ( exp(F(x)) + exp(-F(x)) )    (6)
(6) Step S206: The probability updating processes of steps S203 to S205 are repeated M times.
(7) Step S207: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
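For illustration, one way to code the Logit boosting loop of steps S201 to S207 is sketched below, assuming a scikit-learn regression tree as the basic learning machine and y given as a NumPy array of 0/1 labels; the depth, the clipping constant, and the function names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit_boost(X, y, M=50, eps=1e-5):
    """Two-class Logit boosting (steps S201 to S207); y must hold 0/1 labels."""
    n = len(y)
    F = np.zeros(n)                              # step S202: F(x) = 0
    p = np.full(n, 0.5)                          # step S202: p(x) = 1/2
    trees = []
    for _ in range(M):                           # step S206: repeat M times
        p = np.clip(p, eps, 1 - eps)
        z = (y - p) / (p * (1 - p))              # formula (5): working response
        w = p * (1 - p)                          # formula (5): working weight
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)          # step S204: weighted regression of z on x
        f = tree.predict(X)
        F = F + 0.5 * f                          # formula (6): update F
        p = np.exp(F) / (np.exp(F) + np.exp(-F)) # formula (6): update the probability
        trees.append(tree)
    return trees

def logit_boost_predict(trees, X):
    """Step S207: H(x) = F(x); assign Group 2 when H(x) > 0, Group 1 otherwise."""
    F = 0.5 * sum(tree.predict(X) for tree in trees)
    return np.where(F > 0, 2, 1)
```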
2-3. Gradient boosting algorithm (Friedman, 2001)
FIG. 3 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 3.
(1) Step S301: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S302: Various kinds of variables are initialized according to the following formula (7):
F(x) = (1/2) log( (1 + ȳ) / (1 - ȳ) ),  where ȳ is the mean of y_1, ..., y_n    (7)
(3) Step S303: A new response variable z_i with respect to the given explanatory variable is calculated from the following formula (8):
z_i = 2 y_i / ( 1 + exp( 2 y_i F(x_i) ) )    (8)
(4) Step S304: f_m(x) is estimated using the response variable z_i, the explanatory variable x_i, and the basic learning machines (regression decision trees).
(5) Step S305: An estimated value γ_l of the l-th terminal node of the f_m(x) is estimated based on the following formula (9):
γ_l = Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (2 - |z_i|)    (9)
where R_l is the set of the data belonging to the l-th terminal node.
(6) Step S306: The F(x) is updated in accordance with the formula F(x) = F(x) + f_m(x).
(7) Step S307: The F(x) is updated M times by repeating steps S303 to S306 M times.
(8) Step S308: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
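A minimal sketch of the loop of steps S301 to S308 is given below, assuming regression decision trees from scikit-learn, the log-odds initialization of formula (7), and y given as a NumPy array of -1/1 labels; only the in-sample F(x) is tracked and numerical safeguards are omitted to keep the sketch short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50):
    """Two-class gradient boosting (steps S301 to S308); y must hold -1/1 labels."""
    n = len(y)
    ybar = y.mean()
    F = np.full(n, 0.5 * np.log((1 + ybar) / (1 - ybar)))   # formula (7): initialization
    stages = []
    for _ in range(M):                                       # step S307: repeat M times
        z = 2 * y / (1 + np.exp(2 * y * F))                  # formula (8): pseudo-response
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z)                                       # step S304: regression tree on z
        leaf = tree.apply(X)                                 # terminal node of each datum
        gamma = {}
        for l in np.unique(leaf):                            # formula (9): per-node estimate
            zl = z[leaf == l]
            gamma[l] = zl.sum() / np.sum(np.abs(zl) * (2 - np.abs(zl)))
        F = F + np.array([gamma[l] for l in leaf])           # step S306: F(x) = F(x) + f_m(x)
        stages.append((tree, gamma))
    return stages, F                                         # step S308: Group 2 when F(x) > 0
```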
In the meantime, various kinds of basic learning machines can be utilized in the aforementioned real boosting algorithm and Logit boosting algorithm, whereas the gradient boosting algorithm should utilize the decision trees as basic learning machines.
Further, when the basic learning machine is to be constructed in the algorithms, the complexity of the learning machine should be determined beforehand. In a case where the basic learning machine is the decision tree, the complexity of the learning machine can be controlled by adjusting the number of the terminal nodes. However, since the basic learning machines used in the boosting algorithms are weak ones, the complexity thereof should be minimized. In general, a decision tree with two to eight terminal nodes is widely used as a basic learning machine.
3. Conventional boosting algorithms applicable to multi-class classification
The aforementioned three boosting algorithms have been applicable to cases where there are two groups. However, in a case where there are three or more groups, the algorithms should be expanded and modified accordingly. Hereinafter, the real boosting algorithm, Logit boosting algorithm, and gradient boosting algorithm applicable to the multi-class classification will be explained.
3-1. Real boosting algorithm applicable to multi-class classification
FIG. 4 shows a flowchart illustrating the conventional real boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 4.
(1) Step S401: J new data ((x_i, 1), y_i1), ..., ((x_i, J), y_iJ) are generated from each given training datum (x_i, y_i). At this time, y_ik is 1 if y_i is k, whereas otherwise, y_ik is -1.
(2) Step S402: A final ensemble machine is constructed according to the following formula (10) by applying the two-class real boosting algorithm to the (n × J) new data (i.e., by transforming the multi-class classification into the two-class classification):
H(x, j): R^p × {1, ..., J} → R    (10)
(3) Step S403: A new explanatory variable x is assigned to Group argmax_j H(x, j). However, according to the real boosting algorithm applicable to the above multi-class classification, the number of executions, i.e. the number of calculations, corresponds to the number of boosting iterations. Thus, there is a problem in that the real boosting algorithm is very time-consuming.
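The data expansion of step S401 is simple enough to show directly. The sketch below builds the (n × J) two-class data set; a two-class booster such as the one sketched in section 2-1 would then be applied to the expanded data, and prediction follows step S403. The function name and the label coding are illustrative assumptions.

```python
import numpy as np

def expand_multiclass(X, y, J):
    """Step S401: turn n multi-class data (x_i, y_i) into n*J two-class data ((x_i, k), y_ik)."""
    rows, labels = [], []
    for x_i, y_i in zip(X, y):
        for k in range(1, J + 1):
            rows.append(np.append(x_i, k))          # augmented explanatory variable (x_i, k)
            labels.append(1 if y_i == k else -1)    # y_ik = 1 if y_i = k, otherwise -1
    return np.array(rows), np.array(labels)

# Step S403: for a new x, evaluate H on (x, 1), ..., (x, J) and assign x to the
# group j that maximizes H(x, j).
```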
3-2. Logit boosting algorithm applicable to multi-class classification
FIG. 5 shows a flowchart illustrating the conventional Logit boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 5.
(1) Step S501: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S502: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S503: A new response variable z_i and a weight w_i are determined according to the following formula (11):
z_i = ( y_i* - p_j(x_i) ) / ( p_j(x_i)(1 - p_j(x_i)) ),  w_i = p_j(x_i)(1 - p_j(x_i))    (11)
(4) Step S504: A regression model f_mj(x) is calculated and constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S505: Steps S502 to S504 are repeated J times.
(6) Step S506: The j-th F_j(x) is updated according to the following formula (12):
f_mj(x) = ((J - 1)/J) ( f_mj(x) - (1/J) Σ_{k=1}^{J} f_mk(x) ),  F_j(x) = F_j(x) + f_mj(x)    (12)
(7) Step S507: The j-th probability p_j(x) is updated according to the following formula (13):
p_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x))    (13)
(8) Step S508: Steps S502 to S507 are repeated M times.
(9) Step S509: A new explanatory variable x is assigned to Group argmax_j F_j(x).
3-3. Gradient boosting algorithm applicable to multi-class classification
FIG. 6 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 6.
(1) Step S601: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S602: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S603: A new response variable z_i is determined according to the following formula (14):
z_i = y_i* - p_j(x_i)    (14)
(4) Step S604: A regression model f_mj(x) is constructed with reference to the basic learning machines (regression decision trees) and using the response variable z_i and the explanatory variable x_i.
(5) Step S605: An estimated value γ_lj of the l-th terminal node in the regression model is estimated based on the following formula (15):
γ_lj = ((J - 1)/J) Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (1 - |z_i|)    (15)
(6) Step S606: The F_j(x) is determined according to the following formula (16):
F_j(x) = F_j(x) + Σ_l γ_lj 1(x ∈ R_l)    (16)
(7) Step S607: Steps S602 to S606 are repeated J times (j = 1, ..., J).
(8) Step S608: Steps S602 to S607 are repeated M times (m = 1, ..., M).
(9) Step S609: A new explanatory variable x is assigned to Group argmax_j F_j(x).
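One way to code the multi-class loop of steps S601 to S609 is sketched below, assuming y is coded 1, ..., J, scikit-learn regression trees as the basic learning machines, and the softmax relationship of formula (13) for recovering p_j from F_j (the probability update is not restated in this subsection, so that choice is an assumption); only the in-sample F_j values are tracked and numerical safeguards are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def multiclass_gradient_boost(X, y, J, M=50):
    """Multi-class gradient boosting (steps S601 to S609); y must hold labels 1, ..., J."""
    n = len(y)
    F = np.zeros((n, J))                            # step S601: F_j(x) = 0
    for _ in range(M):                              # step S608: repeat M times
        p = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)   # p_j(x) via the softmax of formula (13)
        for j in range(J):                          # step S607: repeat for j = 1, ..., J
            ystar = (y == j + 1).astype(float)      # step S602: y_i* for Group j
            z = ystar - p[:, j]                     # formula (14)
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, z)                          # step S604: regression tree on z
            leaf = tree.apply(X)
            for l in np.unique(leaf):               # formula (15): terminal-node estimate
                zl = z[leaf == l]
                gamma = ((J - 1) / J) * zl.sum() / np.sum(np.abs(zl) * (1 - np.abs(zl)))
                F[leaf == l, j] += gamma            # formula (16): update F_j on that node
    return F                                        # step S609: assign x to argmax_j F_j(x)
```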
4. Summary of problems in prior arts
A basic idea of the aforementioned boosting algorithms is to develop a new strong learning machine by constructing several weak learning machines and then combining them. However, such an idea involves several problems which are summarized as follows.
First, it is difficult to analyze the constructed models. Second, it is not easy for a general user to define a number of tuning parameters, such as the size of a decision tree, the number of the decision trees and the like. Third, there is a tendency toward overfitting of data (Ridgeway, 2000). Fourth, if a response variable is continuous, it is not easy to use the algorithms. Fifth, it is not easy to estimate probabilities in a classification problem. That is, when new data are inputted, classification probability to be taken by the inputted new data
cannot be estimated.
Sixth, it is not easy to verify stability of the boosting algorithm, due to lack of a general (statistical) theorem of the boosting algorithm.
Disclosure of the Invention
The present invention is conceived to solve the above problems in the prior art. Accordingly, an object of the present invention is to provide an apparatus and method for constructing an ensemble machine wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines, comprising: M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines, comprising: a step of constructing M ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine constructing step constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine constructing step receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine constructing step, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided an apparatus for constructing a data mining model in two-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weight, probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model in two-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; and a fifth step for updating and reconstructing learning machines by using the correction parameter calculated from the fourth step.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model in multi-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an
explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model in multi-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; a fifth step for updating and reconstructing learning machines by using the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: an input means for receiving multidimensional training data; a residual calculation means for calculating residuals of the ensemble machines; a regression model construction means for constructing regression models from the data inputted by the input means through basic learning machines and the residuals; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the regression model construction means; a learning machine update means for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the reconstructed learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: a first step for receiving multidimensional training data; a second step for calculating residuals of the ensemble machines; a third step for constructing regression models from the data inputted by the first step through basic learning machines and the residuals; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the third step; a fifth step for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the reconstructed learning machines. More specifically, the inputted training data are generated into Bootstrap data by using the weights.
More specifically, the learning machines are decision trees or neural network models.
More specifically, the final ensemble machine is constructed when a deviance of a current ensemble machine is smallest.
Brief Description of the Drawings
The above and other objects and features of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating a conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem;
FIG. 2 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case where there are provided two groups in the classification problem; FIG. 3 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case where there are provided two groups in the classification problem;
FIG. 4 is a flowchart illustrating a conventional real boosting algorithm applicable to a case of a multi-class classification problem;
FIG. 5 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case of the multi-class classification problem;
FIG. 6 is a flowchart illustrating a conventional gradient boosting algorithm
applicable to a case of the multi-class classification problem;
FIG. 7 is a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention;
FIG. 8 is a flowchart schematically illustrating a cross validation process for selecting an optimal decision tree;
FIG. 9 is a flowchart illustrating an entire outline of a process for selecting the optimal decision tree through tree information criteria (TIC);
FIG. 10 is a flowchart showing a process for generating Bootstrap data according to another preferred embodiment of the present invention; FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from a true learning machine among the learning machines included in the conventional set of basic learning machines;
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention; FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention;
FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention; FIG. 15 is a flowchart illustrating an outline of an algorithm for generating association rules;
FIGS. 16a to 16d are graphs showing the results of virtual tests for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention; and FIGS. 17a to 17i are graphs showing the results of actual data analysis for validating performances of the conventional ensemble methods and the ensemble method
proposed by the present invention.
Best Mode for Carrying Out the Invention
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
1. Introduction
An algorithm for constructing a data mining model proposed by the present invention is a novel algorithm whose predictability is superior to that of the boosting algorithms while solving the aforementioned problems of the boosting algorithms. Further, in a case where a decision tree is employed as a basic learning machine, the algorithm of the present invention has further improved analyzability as compared with the decision tree, which is known to have the most superior analyzability among conventional data mining techniques (although the decision tree has very poor predictability). In particular, according to the algorithm of the present invention, various kinds of association rules capable of describing given data can be extracted, and accordingly, the given data can be analyzed in various respects.
Theoretically, the algorithm proposed by the present invention is completely different from the boosting algorithm in view of their basic ideas. The boosting algorithm is to develop a strong learning machine by combining several weak learning machines, whereas the algorithm of the present invention is to develop a stronger learning machine by combining several strong learning machines.
A basic learning machine used in an ensemble scheme of the present invention will be explained by way of example of the decision tree used as the learning machine. The reason is that the decision tree has advantages such as high analyzability and simplicity of the algorithm. However, the basic learning machine can be differently selected depending on data types. For example, a neural network model may be utilized in case of character or voice recognition.
In order to utilize the decision tree as the basic learning machine, fast computational speed is essentially required. In general, the CART algorithm proposed by Breiman et al. in 1984 is currently well-known as a decision tree construction algorithm.
The CART algorithm comprises three steps of growing the decision tree, pruning, and selecting the optimal decision tree.
At this time, the third step of selecting the optimal decision tree utilizes a cross validation method which requires a large amount of computation. The amount of calculation required for the cross validation is not a burden when generating a single decision tree. However, the ensemble scheme requires generating several decision trees.
Thus, if the cross validation method is to be applied to all the decision trees, an enormous increase of the amount of calculation will be inevitable.
In order to overcome the problem of the cross validation, the present invention defines and utilizes a new parameter of tree information criteria (TIC) so that the optimal decision tree can be constructed in a short time.
However, the algorithm for constructing the data mining model proposed by the present invention is not limited to a method of selecting the decision tree using the TIC.
In other words, according to the present invention, various basic learning machines (e.g., a neural network learning machine) as well as the decision tree can be utilized as the basic learning machine. For convenience of explanation, the decision tree will be merely selected and explained herein as the basic learning machine.
FIG. 7 shows a flowchart schematically illustrating a method for constructing an ensemble machine according to a prefeπed embodiment of the present invention. Referring to FIG. 7, if multidimensional training data are inputted in step S701, a basic learning machine is constructed in step S702. At this time, a decision tree, a neural network model or the like can be utilized as the basic learning machine. For convenience of explanation, however, the decision tree is employed herein as the basic learning machine.
Therefore, the optimal decision tree is constructed in step S702. In such a case, the optimal decision tree can be constructed in various manners. A new scheme referred to as a tree information criteria (TIC) algorithm will be introduced into the present invention.
Then, the ensemble machine is constructed by using the constructed basic learning machines in step S703, and it is determined whether the constructed ensemble machine is optimal. If a result of determination in step S704 is that the ensemble machine is not optimal, a new response variable is generated in step S705, and then the procedure is returned to step S702.
If the result of determination in step S704 is that the ensemble machine is optimal, a final ensemble machine is constructed in step S706, and then the procedure is terminated. The present invention will be explained herein in the following order. (1) First, the TIC, i.e. the novel algorithm for constructing the optimal decision tree, which is substituted for the cross validation method, will be described in order to explain how to construct the optimal decision tree (refer to step S702). (2) A method of extracting Bootstrap data for use in the present invention will be examined (refer to a first half of step S703). (3) A new algorithm using the Bootstrap data extraction method will be described (refer to a second half of step S703 and steps S704 to S706). (In such a case, it is not necessary to utilize the TIC algorithm. That is, it is not necessary to select the decision tree as the basic learning machine.) (4) An algorithm for generating association rules will be examined. (5) Performances of various algorithms will be compared with one another through virtual tests, and then the performances will be finally compared against actual data.
2. Tree Information Criteria (TIC)
A TIC method for solving the problems of the cross validation method which has been generally utilized when determining the optimal decision tree will be explained. First, general algorithms for generating the decision trees will be described. Then, a conventional cross validation algorithm will be examined. Finally, an algorithm for selecting the optimal decision tree using the TIC proposed by the present invention will be explained.
2-1. Algorithm for constructing decision tree (Breiman et al., 1984)
The algorithm for constructing the decision tree, which was proposed by Breiman, can be roughly divided into three steps.
A first step is a growth algorithm for generating the largest decision tree with respect to given data. A second step is a pruning algorithm for generating a plurality of nested decision trees by pruning unnecessary branches from the huge decision tree constructed through the growth algorithm. At this time, the constructed decision trees become smaller and smaller.
A third step is an algorithm for selecting the optimal decision tree among the decision trees which have been obtained from the pruning algorithm.
2-2. Cross validation algorithm for selecting optimal decision tree (k-fold cross validation)
FIG. 8 shows a flowchart schematically illustrating the process of a cross validation method for selecting the optimal decision tree. The process will be explained in detail below.
Assume that the decision trees generated from the multidimensional data through the growth algorithm and the pruning algorithm are T_1, ..., T_m, and that e_i is a cross validation error of the decision tree T_i.
(1) Step S801: Various kinds of variables are initialized. That is, it is set that e_i = 0, where i = 1, 2, ..., m.
(2) Step S802: Given n training data are equally divided into k parts so that mutually exclusive data D_1, D_2, ..., D_k can be generated.
(3) Step S803: D_i is set to test data, and the other data are set to training data.
(4) Step S804: By using the training data, the nested decision trees are constructed (through the growth and pruning algorithms).
(5) Step S805: By using the test data D_i, a prediction error is calculated with respect to each of the nested decision trees.
(6) Step S806: A decision tree which is closest to the decision tree T_j is selected from among the constructed decision trees. At this time, since the decision tree selection algorithm was fully disclosed by Breiman et al. (1984), the description thereof will be omitted herein.
(7) Step S807: The prediction error of the decision tree obtained from step S806 is added to the e_j.
(8) Step S808: Steps S806 and S807 are repeated m times (j = 1, ..., m).
(9) Step S809: Steps S803 to S808 are repeated k times (i = 1, ..., k).
(10) Step S810: e_1, ..., e_m are referred to as the cross validation errors of the decision trees T_1, ..., T_m, respectively. A decision tree whose cross validation error is smallest is selected as the optimal decision tree.
In the meantime, such a cross validation algorithm is also referred to as the k-fold cross validation algorithm, in which a 5-fold or 10-fold cross validation method is generally utilized.
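For reference, the cross validation selection of steps S801 to S810 can be sketched as follows; grow_and_prune, closest_tree, and prediction_error are hypothetical placeholders for the growth/pruning algorithm, the tree-matching rule of Breiman et al. (1984), and the test-error computation, none of which are spelled out here.

```python
import numpy as np

def cv_select(X, y, trees, grow_and_prune, closest_tree, prediction_error, k=5, seed=0):
    """k-fold cross validation for selecting the optimal tree among T_1, ..., T_m
    (steps S801 to S810)."""
    m = len(trees)
    e = np.zeros(m)                                          # step S801: e_j = 0
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                           # step S802: D_1, ..., D_k
    for i in range(k):
        test = folds[i]                                      # step S803: D_i is the test data
        train = np.concatenate([folds[r] for r in range(k) if r != i])
        nested = grow_and_prune(X[train], y[train])          # step S804: nested trees
        err = [prediction_error(t, X[test], y[test]) for t in nested]   # step S805
        for j in range(m):                                   # steps S806 to S808
            e[j] += err[closest_tree(trees[j], nested)]      # step S807: accumulate error
    return trees[int(np.argmin(e))]                          # step S810: smallest error wins
```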
In the aforementioned cross validation algorithm for constructing the optimal decision tree, the decision trees should be constructed many times. Thus, if the data are huge, there are problems in that a very long calculation time is required and the calculation result varies arbitrarily depending on how the data are divided.
In order to solve the problems, such as enormous increase of the amount of calculation and instability of the calculation result, which are inherent in the cross validation method, the TIC algorithm of the present invention is newly proposed and will be hereinafter described.
2-3. TIC algorithm for selecting optimal decision tree
An objective of the TIC algorithm is to select the optimal decision tree among several decision trees T_1, ..., T_m. At this time, the posterior probability of each of the respective decision trees is calculated, and then the decision tree whose posterior probability is largest is selected as the optimal decision tree.
The posterior probability means a probability of each tree with respect to the given data. That is, the posterior probability of the tree T_i will be Pr(T_i | D_n) with respect to the given data D_n = {(y_1, x_1), ..., (y_n, x_n)}.
FIG. 9 shows a flowchart illustrating an entire outline of a method for selecting the optimal decision tree using the TIC. The method will be explained in detail below.
If the multidimensional training data are first inputted in step S901, the largest decision tree is constructed based on the training data in step S902. Then, in step S903, the constructed largest decision tree is newly generated into nested decision trees through the pruning technique. Thereafter, the posterior probability of each of the respective decision trees is calculated in step S904, the decision tree whose posterior probability is largest is then selected in
step S905, and a single optimal decision tree is finally selected in step S906.
Hereinafter, the method for selecting such an optimal decision tree will be more specifically explained.
First, a general method for calculating the posterior probability will be described. The posterior probability is defined as Pr(T_i | D_n) = c Pr(D_n | T_i) Pr(T_i) according to the Bayesian theorem, where Pr(D_n | T_i) is the probability of the data in a case where the model is T_i, Pr(T_i) is a probability arbitrarily determined by a user when the user examines the data, and c is a constant for establishing the relationship Σ_{i=1}^{m} Pr(T_i | D_n) = 1.
In the meantime, since the objective of calculating the posterior probability is to determine the decision tree whose posterior probability is largest, the constant c need not be obtained. Thus, the following formula (17) can be simply utilized to calculate the posterior probability:
Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i)    (17)
Now, Pr(D_n | T_i) will be calculated.
Since the data are independent of one another, the following formula (18) is established:
Pr(D_n | T_i) = Π_{k=1}^{n} Pr(y_k, x_k | T_i)    (18)
Further, the formula (18) may be rewritten as the following formula (19):
Pr(y_k, x_k | T_i) = Pr(y_k | T_i, x_k) Pr(x_k | T_i)    (19)
Here, since the tree T_i is used to represent a probability structure of y_k with respect to a given input x_k, Pr(x_k | T_i) is not dependent on the T_i. That is, Pr(x_k | T_i) = Pr(x_k). Therefore, once Pr(y_k | T_i, x_k) is calculated, Pr(D_n | T_i) is obtained accordingly.
Furthermore, similarly to the constant c, since Pr(x_k) is a value which is commonly applied to all the trees, it is not used to find out the tree whose posterior probability is largest. Thus, in consideration of the foregoing, the above formula (17) is rewritten and expressed as the following formula (20):
Pr(T_i | D_n) ∝ Π_{k=1}^{n} Pr(y_k | T_i, x_k) Pr(T_i)    (20)
The Pr(y_k | T_i, x_k) is calculated in the following manner.
Assume that the set of terminal nodes of T_i is expressed as T̃_i and the probabilities of the J groups with respect to a given h-th terminal node are expressed as (p_1h, ..., p_Jh). Then, if the given input variable x_k belongs to the h-th terminal node of the tree T_i, the following formula (21) is established:
Pr(y_k | T_i, x_k) = p_{y_k h}    (21)
where y_k represents the group to which the data belong.
Using the foregoing, the following formula (22) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} Π_{j=1}^{J} p_jh^(n_jh)    (22)
where n_jh is the number of the data belonging to Group j among the data included in the h-th terminal node.
Since the probabilities (p_1h, ..., p_Jh) of the respective terminal nodes are unknown variables, they are removed by using expected values. A distribution of the probabilities (p_1h, ..., p_Jh) is needed for obtaining the expected values. Assume that the distribution is π(p_1h, ..., p_Jh). Then, the following formula (23) is established:
∫ Π_{j=1}^{J} p_jh^(n_jh) π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (23)
Here, various types of distributions can be utilized for π(p_1h, ..., p_Jh). If a general distribution is utilized, the following formula (24) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} ∫ Π_{j=1}^{J} p_jh^(n_jh) π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (24)
Further, if a uniform distribution is utilized, the following formula (25) is established:
Pr(D_n | T_i) ∝ Π_{h ∈ T̃_i} ( Π_{j=1}^{J} Γ(n_jh + 1) ) / Γ(n_h + J)    (25)
where n_h = Σ_{j=1}^{J} n_jh.
Furthermore, the uniform distribution is defined as the following formula (26):
π(p_1h, ..., p_Jh) = Γ(J),  p_jh ≥ 0,  Σ_{j=1}^{J} p_jh = 1    (26)
Hereinafter, a method for determining the prior probability Pr(T_i) of a tree will be explained.
Pr(T_i) is not obtained from data but is inputted by the user. Pr(T_i) for the TIC is constructed as follows.
First, the probability that a given h-th node will be an intermediate node (that is, a node from which successive branching occurs) is defined as the following formula (27):
α (1 + f_h)^(-β)    (27)
where f_h is the number of ancestor nodes of the given node, and the constants α and β are determined by the user. Then, the probability that a given node will be a terminal node is naturally determined as the following formula (28):
1 - α (1 + f_h)^(-β)    (28)
The prior probability of a given tree T_i under the conditions of the formula (28) is expressed as the following formula (29):
Pr(T_i) = Π_{h ∈ T_i - T̃_i} α (1 + f_h)^(-β) × Π_{h ∈ T̃_i} ( 1 - α (1 + f_h)^(-β) )    (29)
where T_i - T̃_i is the set of the intermediate nodes (i.e., all the nodes which are not terminal nodes). Next, the TIC will be calculated using the aforementioned matters.
When the above formulas are combined and rearranged, the following formula (30) is finally obtained. At this time, the formula (30) uses the uniform distribution:
Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i)
∝ Π_{h ∈ T̃_i} ( Π_{j=1}^{J} Γ(n_jh + 1) / Γ(n_h + J) ) × Π_{h ∈ T_i - T̃_i} α (1 + f_h)^(-β) × Π_{h ∈ T̃_i} ( 1 - α (1 + f_h)^(-β) )    (30)
Further, the TIC is defined as a logarithmic value of the final right side of the above formula. That is, the TIC of the tree T_i is expressed as the following formula (31):
TIC(T_i) = Σ_{h ∈ T̃_i} ( Σ_{j=1}^{J} log Γ(n_jh + 1) - log Γ(n_h + J) ) + Σ_{h ∈ T_i - T̃_i} ( log α - β log(1 + f_h) ) + Σ_{h ∈ T̃_i} log( 1 - α (1 + f_h)^(-β) )    (31)
In a case where there are two groups, i.e. J = 2, the TIC is expressed as the following formula (32):
TIC(T_i) = Σ_{h ∈ T̃_i} ( log Γ(d_h + 1) + log Γ(n_h - d_h + 1) - log Γ(n_h + 2) ) + Σ_{h ∈ T_i - T̃_i} ( log α - β log(1 + f_h) ) + Σ_{h ∈ T̃_i} log( 1 - α (1 + f_h)^(-β) )    (32)
where d_h is the number of data belonging to the second group among the data of the h-th terminal node. After the TIC defined as such has been applied to the respective decision trees T_1, ..., T_m, the decision tree having the maximum TIC is selected as the optimal decision tree. Then, the algorithm is terminated.
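A minimal sketch of evaluating the TIC of formula (31) for a single pruned tree is given below, assuming the tree has already been summarized by its per-terminal-node counts n_jh and by the numbers of ancestors f_h of its terminal and intermediate nodes; alpha and beta are the user-chosen constants of formula (27), and the default values used here are only illustrative.

```python
import numpy as np
from scipy.special import gammaln   # log of the Gamma function

def tic(terminal_counts, terminal_depths, intermediate_depths, alpha=0.5, beta=2.0):
    """Tree information criterion of formula (31) for one tree.

    terminal_counts     : one length-J vector (n_1h, ..., n_Jh) per terminal node
    terminal_depths     : number of ancestors f_h of each terminal node
    intermediate_depths : number of ancestors f_h of each intermediate node"""
    value = 0.0
    for counts in terminal_counts:                 # data term of formula (31)
        counts = np.asarray(counts, dtype=float)
        value += gammaln(counts + 1).sum() - gammaln(counts.sum() + len(counts))
    for fh in intermediate_depths:                 # prior term for intermediate nodes
        value += np.log(alpha) - beta * np.log(1 + fh)
    for fh in terminal_depths:                     # prior term for terminal nodes
        value += np.log(1 - alpha * (1 + fh) ** (-beta))
    return value

# The nested tree T_i with the largest TIC value is selected as the optimal decision tree.
```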
Meanwhile, the conventional method using the Bayesian theorem and the TIC method proposed by the present invention are the same as each other in that they utilize a posterior probability. However, they are a little different from each other in view of the posterior probability construction which will be used when calculating their posterior probabilities. Such a difference has a great influence on calculation of the posterior probabilities. That is, in the conventional method using the Bayesian theorem, the posterior probability cannot be calculated by means of numerical formulas. Thus, since the probabilities should be calculated through a computer, it takes much more time to perform the calculation as compared with the method using the cross validation.
The method for constructing the posterior probability according to the conventional method using the Bayesian theorem assigns probabilities to all possible decision trees.
However, since the number of all the possible decision trees is very large, the method for constructing the posterior probability is also greatly complicated. Accordingly, since the number of decision trees which require the calculation of the posterior probabilities is
inevitably greatly increased, this leads to an enormous increase in the amount of the calculation.
However, the TIC method solves the problem of the conventional method using the
Bayesian theorem by assigning the posterior probabilities not to all the possible decision trees but only to nested decision trees induced through the pruning algorithm. Thus, there are advantages in that it is very easy to construct the prior probability and the calculation of the posterior probability is simplified.
That is, the prior probability constructing method for use in the TIC method is a method for reducing sets of the decision trees by using data. This point is the very sharp difference between the TIC method and the conventional method using the Bayesian theorem.
In summary, since it is sufficient to construct the decision tree only once in the TIC method, the computational speed is rapidly increased over the cross validation method.
Furthermore, since the results obtained with respect to the same data are always the same in the TIC method, reliability of the results is greatly superior to that of the cross validation method.
Table 1 below shows simulation results of the selection of an optimal decision tree by using the conventional 5-fold cross validation method and the TIC method proposed by the present invention, respectively. That is, these experiment data are used to compare the generation rate of a single tree through the 5-fold cross validation with that of the single tree through the TIC proposed by the present invention.
Respective pieces of the experiment data indicate the average time when the same data are repeatedly generated 500 times. The specification of the computer was a Pentium III 900 MHz processor, a 256 Mbyte main memory, and the Windows 2000 OS.
According to Table 1 below, it can be seen that the calculation time of the TIC method proposed by the present invention was about one fifth of that of the conventional 5-fold cross validation method.
Meanwhile, the simulation data correspond to standard data which have been widely known in the field of data mining, i.e. the 'Radius2,' 'Interaction,' 'Breast Cancer,' 'Ionosphere,' and 'Sonar' data. These data are the most powerful simulation data for estimating the efficiency of data mining in the relevant art. These simulation data are shown in detail in the 'Machine Learning Web Site' (http://www1.ics.uci.edu/~mlearn/MLRepository.html) of 'UC Irvine.'
Table 1
3. Method for extracting Bootstrap data
Hereinafter, a method for extracting Bootstrap data used in the present invention will be briefly described.
First, a general algorithm for generating the Bootstrap data will be explained below. Assuming that the original data are n data, i.e. y_1, ..., y_n, the number m of the Bootstrap data to be generated is determined and a new data set D_B is initialized to an empty set.
Then, a constant k satisfying the relationship k ∈ [1, n] is generated using a random number generator, and y_i^B is assigned as y_k and then added to D_B. This process is repeated m times (i = 1, ..., m). FIG. 10 is a flowchart illustrating the process of generating the Bootstrap data according to one embodiment of the present invention, which will be explained below.
(1) Step S1001: Weights w_1, ..., w_n are assigned to the n multidimensional data (x_1, y_1), ..., (x_n, y_n) to be inputted. Here, x_i is a p-dimensional explanatory variable. That is, x_i = (x_1i, ..., x_pi).
(2) Step S1002: Cumulative weights of the weights w_1, ..., w_n assigned to y_1, ..., y_n are calculated. That is, the cumulative weights are calculated according to the following formula (33):
w_i* = Σ_{j=1}^{i} w_j,  i = 1, ..., n    (33)
(3) Step S1003: The number m of new data to be generated is determined. The number m of the Bootstrap data for use in the ensemble algorithm according to the present invention is set to n.
(4) Step S1004: The new data set D_B is initialized to the empty set.
(5) Step S1005: A random real number k satisfying the relationship k ∈ [0, w_n*] is generated using the random number generator.
(6) Step S1006: An integer j satisfying the relationship w_{j-1}* < k ≤ w_j* among w_1*, ..., w_n* is determined. Here, j = 1, ..., n.
(7) Step S1007: (x_i^B, y_i^B) is assigned as (x_j, y_j).
(8) Step S1008: The weight w_i^B corresponding to (x_i^B, y_i^B) is set to 1/m.
(9) Step S1009: (x_i^B, y_i^B) is added to D_B.
(10) Step S1010: The process is repeated m times (i = 1, ..., m).
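The weighted Bootstrap extraction of steps S1001 to S1010 can be written compactly with cumulative weights; numpy's searchsorted plays the role of the selection rule of step S1006. This is a sketch only, and the random number generator and the array layout are implementation choices.

```python
import numpy as np

def weighted_bootstrap(X, y, w, m=None, seed=None):
    """Generate Bootstrap data D_B from (x_i, y_i) with weights w_i (steps S1001 to S1010)."""
    rng = np.random.default_rng(seed)
    m = len(y) if m is None else m            # step S1003: m is set to n in the ensemble algorithm
    w_star = np.cumsum(w)                     # formula (33): cumulative weights w_i*
    XB, yB = [], []                           # step S1004: D_B starts empty
    for _ in range(m):                        # step S1010: repeat m times
        k = rng.uniform(0.0, w_star[-1])      # step S1005: random real number in [0, w_n*]
        j = int(np.searchsorted(w_star, k))   # step S1006: j with w_{j-1}* < k <= w_j*
        XB.append(X[j])                       # step S1007: copy (x_j, y_j) into D_B
        yB.append(y[j])
    wB = np.full(m, 1.0 / m)                  # step S1008: each Bootstrap datum gets weight 1/m
    return np.array(XB), np.array(yB), wB     # step S1009: the accumulated data set D_B
```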
4. Background and basic principle of ensemble algorithm proposed by the present invention
4-1. Introduction
Hereinafter, the background and basic principle of the novel ensemble algorithm proposed by the present invention will be described. The ensemble algorithm proposed by the present invention will be referred to as the Convex Hull Ensemble Machine (CHEM) algorithm.
The CHEM algorithm is an ensemble algorithm for generating a new learning machine by using a plurality of basic learning machines. In a classification problem (response variables are categorical) or a regression model problem (response variables are continuous), a learning problem can be changed to a function estimation problem.
A classification problem in which the response variables have J categories is a problem of estimating a J-dimensional function F = (F_1, ..., F_J). At this time, the function F is defined as the following formula (34):
P(y = k | x) = exp(F_k(x)) / Σ_{l=1}^{J} exp(F_l(x))    (34)
Meanwhile, the regression model problem becomes a problem of estimating a function F(x) = E(y | x). Assume that the true value of the function F is defined as Ψ* (referred to as the "true learning machine"), and that Ψ = {Ψ_θ, θ ∈ Θ} is the set of the basic learning machines for use in the CHEM algorithm. That is, an optimal basic learning machine for given training data is selected as one of the set Ψ. Meanwhile, a method for finding the optimal learning machine has been widely known in the art. As for the set of basic learning machines Ψ for use in data mining, there are a decision tree, a neural network model, and the like. However, such basic learning machines have a serious problem in that they respond very sensitively to changes in the data.
It is the CHEM algorithm proposed by the present invention that finds out and overcomes causes of instability of such basic learning machines. Next, the causes of the instability of the conventional basic learning machines will be discussed below.
4-2. Causes of instability of conventional basic learning machines
The reason why various algorithms used for data mining operate unstably is that various different learning machines within the set of basic learning machines Ψ describe data in similar manners to one another.
The more basic reason why the completely different learning machines describe the data in similar manners is that the true learning machine Ψ* is not included in the considered set of basic learning machines Ψ. Moreover, this is because there are a plurality of learning machines, which have small distances (i.e., degrees of difference) from the true learning machine Ψ*, among the basic learning machines included in the set Ψ.
FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from the true learning machine among the learning machines included in the conventional set of basic learning machines.
As shown in FIG. 11, various learning machines Ψ surround the true learning machine Ψ*. In such a case, even though the data are changed merely to a slight degree, the optimal learning machine may be greatly changed. That is, the optimal learning machine will be one of Ψ_1, Ψ_2 and Ψ_3 depending on the direction in which the data are shifted from the true learning machine Ψ*.
If the set of basic learning machines Ψ is constructed as shown in FIG. 11, it is impossible to properly construct the true learning machine Ψ* even though the optimal learning machine has been constructed well. However, the true learning machine Ψ* can be constructed by combining a plurality of learning machines. This is because the true learning machine Ψ* is located in the convex hull space of the learning machines included in the set Ψ. Particularly, if proper weights w_1, w_2 and w_3 are obtained in FIG. 11, the following formula (35) is established:
Ψ* = w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3    (35)
When examining the meaning of the formula (35), it is noted that the true learning machine Ψ * can be obtained as a weighted average of several learning machines included in the set ψ . The CHEM algorithm proposed by the present invention has been developed by using such an idea.
Hereinafter, the basic principle of a method for combining learning machines according to the CHEM algorithm proposed by the present invention will be explained in detail.
4-3. Basic principle of learning-machine combining method according to CHEM algorithm when true learning machine is known
A method for constructing the true learning machine Ψ* by using the weighted average of several learning machines included in the set Ψ, when the true learning machine Ψ* is known and not included in the set Ψ, will now be introduced.
Thereafter, a method for estimating the true learning machine Ψ* using data when the true learning machine Ψ * is not known will be introduced.
A basic assumption of the CHEM algorithm is that the true learning machine Ψ* is expressed as a weighted average of M learning machines Ψ_1, ..., Ψ_M included in the set Ψ. This is expressed as the following formula (36):
Ψ* = Σ_{i=1}^{M} w_i Ψ_i    (36)
The CHEM algorithm is an algorithm in which the learning machines Ψ_i used for obtaining the weighted average and their weights w_i are sequentially found out. The principle of the algorithm for finding out a (k+1)-th learning machine Ψ_{k+1} and a (k+1)-th weight w_{k+1} by using the k learning machines Ψ_1, ..., Ψ_k and k weights w_1, ..., w_k, which have already been constructed, will be explained below step by step.
First, an optimal learning machine among the learning machines orthogonal to the current ensemble machine F_k according to the formula (36) is set to Ψ_{k+1}. Second, a new ensemble machine F_{k+1} is generated according to the following formula (37). At this time, the weights u_k and w_{k+1} are obtained such that the distance between the ensemble machine F_{k+1} and the true learning machine Ψ* is minimized:
F_{k+1} = ( u_k F_k + w_{k+1} Ψ_{k+1} ) / ( u_k + w_{k+1} )    (37)
This algorithm will be explained in more detail below.
(1) Construction of first learning machine
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention.
As shown in FIG. 12a, an optimal learning machine Ψ_1 (i.e., the learning machine located closest to the true learning machine Ψ*) is obtained.
(2) Construction of second learning machine
FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention.
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out, and weights w_1, w_2 by which the distance between the ensemble machine F_2 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. At this time, w_1 and w_2 are 1/d_1^2 and 1/d_2^2, respectively. Meanwhile, the ensemble machine F_2 according to the formula (37) is expressed as the following formula (38):
F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 )    (38)
where d_1 is the distance between Ψ* and Ψ_1, and d_2 is the distance between Ψ* and Ψ_2.
(3) Construction of third learning machine
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention. An optimal learning machine Ψ_3 orthogonal to F_2 is found out. Then, weights u_2, w_3 by which the distance between the ensemble machine F_3 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. Meanwhile, the ensemble machine F_3 according to the formula (37) is expressed as the following formula (39):
F_3 = ( u_2 F_2 + w_3 Ψ_3 ) / ( u_2 + w_3 )    (39)
Then, the following formula (40) is established:
F_3 = ( w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3 ) / ( w_1 + w_2 + w_3 ),  where u_2 = w_1 + w_2    (40)
(4) Construction of m-th learning machine
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention.
By continuously repeating the above algorithm, the m-th ensemble machine F_m is obtained from the following formula (41):
F_m = Σ_{i=1}^{m} w_i Ψ_i / Σ_{i=1}^{m} w_i    (41)
4-4. Basic principle of learning-machine combining method according to CHEM
algorithm when true learning machine is not known
In all the learning problems, the true learning machine Ψ* is not known. Instead, n training data (x_1, y_1), ..., (x_n, y_n) are given. The objective of the learning is to effectively estimate the true learning machine Ψ* by using the data. Hereinafter, how the aforementioned algorithm is constructed when the data are given will be explained.
A loss function is indicated by l, and the deviance of a given learning machine Φ is defined as the following formula (42):
d(Φ) = Σ_{i=1}^{n} l( y_i, Φ(x_i) )    (42)
In case of categorical data, only two groups are considered.
(1) Construction of first learning machine
An optimal learning machine Ψ_1 is obtained.
(2) Construction of second learning machine
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out. First, a residual is used for obtaining the orthogonal learning machine. The residual r_i can be obtained from the formula (43) when the response variables are categorical:
r_i = y_i - P_1(x_i),  P_1(x) = exp(Ψ_1(x)) / ( 1 + exp(Ψ_1(x)) )    (43)
where P_1 is the probability that y will be 1 when the first ensemble machine is used. Alternatively, the residual r_i can be obtained from the formula (44) when the response variables are continuous:
r_i = y_i - Ψ_1(x_i)    (44)
In case of the categorical data, an optimal learning machine Ψ̃ is constructed by adopting |r_i| as a weight. In case of the continuous data, the optimal learning machine Ψ̃ is obtained by adopting r_i as a response variable.
Meanwhile, the reason for obtaining the optimal learning machine by using the residual is that the residual has a property of being orthogonal to the response variable in a regression model. Therefore, the optimal learning machine Ψ̃ for the residual is nearly orthogonal to the optimal learning machine Ψ_1 for the response variable.
Then, after the optimal learning machine Ψ̃ is obtained by using the residual, a constant η for minimizing d(η Ψ̃) is obtained. In addition, it is set that Ψ_2 = η Ψ̃.
At this time, an ensemble machine is determined from the formula (37). That is, the following formula (45) is established:
F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 ),  where w_1 = 1/d(Ψ_1) and w_2 = 1/d(Ψ_2)    (45)
Meanwhile, d(Ψ) is approximately the square of the distance between Ψ and Ψ* according to the statistical theory.
(3) Construction of third learning machine
The construction of a third learning machine is identical to the construction of the second learning machine except that the residual is obtained by using F2 instead of Ψ1. The ensemble machine F3 for the obtained third learning machine Ψ3 is expressed as the following formula (46):
F_3 = \frac{w_1 \Psi_1 + w_2 \Psi_2 + w_3 \Psi_3}{w_1 + w_2 + w_3}, \quad w_3 = 1/d(\Psi_3)        (46)
(4) By repeating the above algorithm, the m-th ensemble machine Fm is obtained from the following formula (47):

F_m = \frac{\sum_{i=1}^{m} w_i \Psi_i}{\sum_{i=1}^{m} w_i}        (47)
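The data-driven combining principle described above may be sketched roughly as follows: fit a first regression tree to the response, fit a second tree to the residuals, rescale the second tree by a correction constant η, and average the two machines with weights 1/d(Ψi). The use of scikit-learn and SciPy, the squared-error loss, and the tree depth are assumptions of this illustration and not requirements of the method.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + np.sin(2 * X[:, 1]) + 0.3 * rng.normal(size=500)   # toy continuous response

def deviance(pred, y):
    """d(Phi) = sum_i l(y_i, Phi(x_i)) with squared-error loss, cf. formula (42)."""
    return np.sum((y - pred) ** 2)

# (1) First learning machine Psi_1: fit the response directly.
psi1 = DecisionTreeRegressor(max_depth=4).fit(X, y)
f1 = psi1.predict(X)

# (2) Second learning machine: fit the residual r_i = y_i - Psi_1(x_i), cf. formula (44).
r = y - f1
psi2_tilde = DecisionTreeRegressor(max_depth=4).fit(X, r)
g2 = psi2_tilde.predict(X)

# Correction constant eta minimizing d(eta * Psi~_2); then Psi_2 = eta * Psi~_2.
eta = minimize_scalar(lambda e: deviance(e * g2, y), bounds=(-5, 5), method="bounded").x
f2 = eta * g2

# Deviance-based weights w_i = 1/d(Psi_i) and the ensemble of formulas (45) and (47).
w1, w2 = 1.0 / deviance(f1, y), 1.0 / deviance(f2, y)
F2 = (w1 * f1 + w2 * f2) / (w1 + w2)

print("deviance of Psi_1:", deviance(f1, y))
print("deviance of F_2  :", deviance(F2, y))
```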
5. CHEM algorithm in two-class classification problem
Hereinafter, the CHEM algorithm proposed by the present invention will be explained in detail with respect to a two-class classification problem.
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention.
First, various kinds of variables are initialized in step S1301. Bootstrap data are generated by applying weights to the multidimensional data in step S1302. Then, a probability that the response variable will be 1 for a given explanatory variable is estimated by using the basic learning machines in step S1303. In step S1304, a correction parameter for minimizing a given deviance is calculated. Then, a corrected learning machine is constructed by using the correction parameter in step S1305. Subsequently, an ensemble machine is constructed based on the corrected learning machine in step S1306.
Then, in step S1307, it is determined whether the ensemble machine satisfies stop rules to be described later. If it does not satisfy the stop rules, the procedure returns to step S1302. If it does, the procedure is terminated.
The CHEM algorithm will be explained in detail by means of formulas and the like.
(1) First step: The response variable yi is set to 1 if the data belong to Group 2, whereas it is set to 0 if the data belong to Group 1.
(2) Second step: Various kinds of variables are initialized. That is, wi = 1/n is set for the n weights w1, ..., wn, and F(x) = 0 is set.
(3) Third step: The Bootstrap data (x1B, y1B), ..., (xnB, ynB) are generated using the weights {wi}. The generation of the Bootstrap data has been already explained above.
(4) Fourth step: A probability p̂m(x) = P(y = 1|x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: The correction parameter η for minimizing a value of the following formula (48) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i, \eta f_m(x_i))        (48)

where f_m(x) = \log(\hat{p}_m(x)/(1 - \hat{p}_m(x))).
(6) Sixth step: The learning machine is newly corrected and the ensemble machine is reconstructed by using the correction parameter. This process is expressed as the following formula (49):

F(x) = \frac{\sum_{k=1}^{m} u_k \tilde{f}_k(x)}{\sum_{k=1}^{m} u_k}        (49)

where \tilde{f}_m(x) = \eta f_m(x), u_m = 1/\sum_{i=1}^{n} l(y_i, \tilde{f}_m(x_i)), and the weight wi is updated according to w_i = \left| (y_i - P(x_i)) / \sqrt{P(x_i)(1 - P(x_i))} \right|, where P(x) = \exp(F(x)) / (1 + \exp(F(x))).
(7) Seventh step: In order to construct a final ensemble machine, the third to sixth steps are repeated M times (m = 1, ..., M). Then, the final ensemble machine is set to H(x) = F(x), and for a new explanatory variable x, x is assigned to Group 2 if H(x) > 0 and to Group 1 if H(x) < 0.
Meanwhile, although various loss functions may be used for the above loss function, an exponential loss function or a negative log-likelihood loss function can be used in order to obtain more preferable results. The exponential loss function and the negative log-likelihood loss function are expressed as the following formulas (50) and (51), respectively:

l(y, f) = \exp(-(2y - 1) f / 2)        (50)

l(y, f) = -(y f - \log(1 + \exp(f)))        (51)
Alternatively, the present invention may generate the basic learning machines by using the weights instead of the Bootstrap data. However, in most cases, the use of the Bootstrap data results in better performance.
Further, although various basic learning machines may be employed in the present invention, this embodiment utilizes the decision tree. Contrary to the conventional boosting algorithms, the basic learning machines in the CHEM algorithm are strong learning machines. Therefore, the optimal decision tree obtained through the entire process of constructing the decision tree is used instead of a simple decision tree. At this time, the TIC is used to overcome the computational problem, which has already been described above.
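A rough Python sketch of the two-class procedure described above is given below. It assumes scikit-learn decision trees as the basic learning machines, the negative log-likelihood loss of formula (51), and a simple bounded search for the correction parameter η; these choices are illustrative rather than prescribed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy.optimize import minimize_scalar

def nll_loss(y, f):
    """Negative log-likelihood loss of formula (51): l(y, f) = -(y*f - log(1 + exp(f)))."""
    return -(y * f - np.logaddexp(0.0, f))

def chem_two_class(X, y, M=20, max_depth=3, seed=0):
    """Sketch of the two-class CHEM loop (first to seventh steps); y must be coded 0/1."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)            # second step: initial weights w_i = 1/n
    num, den = np.zeros(n), 0.0        # running numerator / denominator of formula (49)
    F = np.zeros(n)                    # ensemble score F(x) on the training points

    for m in range(M):
        # Third step: bootstrap sample drawn according to the current weights.
        idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx])
        if len(tree.classes_) < 2:     # degenerate bootstrap sample; skip this round
            continue

        # Fourth step: estimated probability p_m(x), kept away from 0 and 1.
        p = np.clip(tree.predict_proba(X)[:, 1], 1e-6, 1 - 1e-6)
        f = np.log(p / (1.0 - p))      # f_m(x) = log(p_m(x) / (1 - p_m(x)))

        # Fifth step: correction parameter eta minimizing formula (48).
        eta = minimize_scalar(lambda e: nll_loss(y, e * f).sum(),
                              bounds=(0.0, 5.0), method="bounded").x
        f_tilde = eta * f

        # Sixth step: deviance weight u_m and updated ensemble F of formula (49).
        u = 1.0 / nll_loss(y, f_tilde).sum()
        num, den = num + u * f_tilde, den + u
        F = num / den

        # Weight update w_i = |(y_i - P(x_i)) / sqrt(P(x_i)(1 - P(x_i)))|.
        P = np.clip(1.0 / (1.0 + np.exp(-F)), 1e-6, 1.0 - 1e-6)
        w = np.clip(np.abs((y - P) / np.sqrt(P * (1.0 - P))), 1e-12, None)

    return F                            # seventh step: H(x) = F(x); Group 2 if F(x) > 0
```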
An algorithm for expanding the CHEM algorithm up to a more-than-two-class classification problem will be described below.
6. CHEM algorithm in multi-class classification problem
An outline of a CHEM algorithm expanded up to a multi-class classification problem also follows the procedure of FIG. 13. However, the CHEM algorithm for the multi-class classification problem is slightly different from the CHEM algorithm for the two-class classification problem in view of the formulas etc. applied to them. This algorithm will be described in detail below.
(1) First step: Various variables are initialized. That is, the weight wij is set to wij = 1/n, where i = 1, ..., n and j = 1, ..., J; and it is set that Fj(x) = 0, where j = 1, ..., J.
(2) Second step: yij* is set to 1 if the i-th data belong to the j-th group and to 0 if not.
(3) Third step: The Bootstrap data (x1B, y1j*B), ..., (xnB, ynj*B) are generated using the weights {w1j, ..., wnj}. The generation of the Bootstrap data has been already explained above.
(4) Fourth step: A probability p̂mj(x) = P(yj* = 1|x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: It is set that fmj(x) = log(p̂mj(x)/(1 - p̂mj(x))).
(6) Sixth step: The second to fifth steps are repeated J times (j = 1, ..., J).
(7) Seventh step: A correction parameter η for minimizing a value of the following formula (52) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i^*, \eta \bar{f}_m(x_i))        (52)

where \bar{f}_{mj}(x) = f_{mj}(x) - \frac{1}{J} \sum_{k=1}^{J} f_{mk}(x), and \bar{f}_m(x) = (\bar{f}_{m1}(x), ..., \bar{f}_{mJ}(x)).
(8) Eighth step: A learning machine is newly corrected and the ensemble machine is constructed using the correction parameter. This process is expressed as the following formula (53):

F_j(x) = \frac{\sum_{k=1}^{m} u_k \tilde{f}_{kj}(x)}{\sum_{k=1}^{m} u_k}, \quad j = 1, ..., J        (53)

where \tilde{f}_{mj}(x) = \eta \bar{f}_{mj}(x), u_m = 1/\sum_{i=1}^{n} l(y_i^*, \tilde{f}_m(x_i)), F(x) = (F_1(x), ..., F_J(x)), the weight wij is updated according to w_{ij} = \left| (y_{ij}^* - P_j(x_i)) / \sqrt{P_j(x_i)(1 - P_j(x_i))} \right|, where P_j(x) = \exp(F_j(x)) / \sum_{k=1}^{J} \exp(F_k(x)), and yij* is 1 if the i-th observed value belongs to the j-th group and 0 if not.
(9) Ninth step: A final ensemble machine is constructed by repeating the second to eighth steps M times (m = 1, ..., M). At this time, a new explanatory variable x is assigned to the group argmaxj Fj(x).
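In the multi-class case, the two pieces that differ from the two-class case are the centering of the per-group scores in the seventh step and the softmax form of Pj(x) used in the weight update. The fragment below isolates those pieces (NumPy only; all names are illustrative).

```python
import numpy as np

def center_scores(f):
    """Centered scores of the seventh step: f_bar_mj(x) = f_mj(x) - mean_k f_mk(x)."""
    return f - f.mean(axis=1, keepdims=True)            # f has shape (n, J)

def softmax_probs(F):
    """P_j(x) = exp(F_j(x)) / sum_k exp(F_k(x)), used in the weight update of formula (53)."""
    e = np.exp(F - F.max(axis=1, keepdims=True))         # subtract the max for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def update_weights(Ystar, F):
    """w_ij = |(y_ij* - P_j(x_i)) / sqrt(P_j(x_i)(1 - P_j(x_i)))| for 0/1 indicators Ystar."""
    P = softmax_probs(F)
    return np.abs((Ystar - P) / np.sqrt(np.clip(P * (1.0 - P), 1e-12, None)))

# Tiny usage example with n = 4 observations and J = 3 groups.
f_m = np.array([[ 0.2, -0.1,  0.0],
                [ 1.0,  0.3, -0.5],
                [-0.4,  0.8,  0.1],
                [ 0.0,  0.0,  0.0]])
f_bar = center_scores(f_m)                # centered scores entering formula (52)
Ystar = np.eye(3)[[0, 1, 2, 1]]           # one-hot group membership indicators y_ij*
print(update_weights(Ystar, f_bar))       # here f_bar stands in for F after one round, for illustration
```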
An algorithm for expanding the CHEM algorithm up to a continuous variable problem will be described below.
7. CHEM algorithm in continuous variable problem
Hereinafter, an algorithm (regression CHEM algorithm) for constructing an ensemble machine in a case where the response variables are continuous will be explained. Such a case is called a regression model, whose basic model is as follows. FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention, which will be explained in detail below.
First, various variables are initialized and the response variables are defined in step S1401. A regression model is constructed using the basic learning machines in step S1402. Then, a correction parameter for minimizing a given deviance is calculated in step S1403.
In step S1404, the response variables are updated based on the correction parameter. An ensemble machine is constructed based on the updated response variables in step S1405.
Then, in step S1406, it is determined whether the ensemble machine satisfies stop rules to be described later. If not, the procedure returns to step S1402. If so, the procedure is terminated.
This procedure will be described in more detail below.
First, assume that n multidimensional training data (x1, y1), ..., (xn, yn) are given, where xi is a p-dimensional explanatory variable, i.e. xi = (x1i, ..., xpi), and yi is the response variable. An objective of this algorithm is to find out a relationship that can best explain the response variables by using the explanatory variables based on the n training data. In other words, it is to construct an optimal function H : R^p → R by using the training data. Further, if a new explanatory variable x is given, a response variable for the given explanatory variable is estimated from H(x). Meanwhile, H(x) = E(y|x) is established in statistics. That is, the objective of the regression analysis is to estimate a conditional expected value.
The CHEM algorithm for the continuous variables is as follows.
(1) First step: It is set that zi = yi for a new response variable.
(2) Second step: A regression model ψm(x) is obtained through the basic learning machines by setting the response variable to zi and the explanatory variable to xi.
(3) Third step: A correction parameter η for minimizing a value of the following formula (54) with respect to a given loss function l is calculated:

\sum_{i=1}^{n} l(y_i, \eta \psi_m(x_i))        (54)

(4) Fourth step: The response variable zi is updated using the correction parameter η. This process is expressed as the following formula (55):

z_i = y_i - F(x_i)        (55)

where \tilde{\psi}_m(x) = \eta \psi_m(x), u_m = 1/\sum_{i=1}^{n} l(y_i, \tilde{\psi}_m(x_i)), and F(x) is updated according to F(x) = \sum_{k=1}^{m} u_k \tilde{\psi}_k(x) / \sum_{k=1}^{m} u_k.
(5) Fifth step: The second to fourth steps are repeated M times (m = 1, ..., M).
(6) Sixth step: A final ensemble machine is constructed such that H(x) = F(x).
Meanwhile, as for the loss function l, various known loss functions such as l(y, T) = (y - T)^2, l(y, T) = |y - T| and the like may be used.
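A minimal sketch of the regression CHEM loop described above is given below, assuming scikit-learn regression trees as the basic learning machines, the squared-error loss l(y, T) = (y - T)^2, and a bounded search for η; these are illustration choices, not requirements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize_scalar

def chem_regression(X, y, M=20, max_depth=4):
    """Sketch of the regression CHEM loop (first to sixth steps) with squared-error loss."""
    def loss(y_true, pred):
        return np.sum((y_true - pred) ** 2)    # l(y, T) = (y - T)^2 summed over the data

    n = len(y)
    z = y.copy()                               # first step: z_i = y_i
    num, den = np.zeros(n), 0.0                # running weighted sum of corrected machines
    F = np.zeros(n)

    for m in range(M):
        # Second step: regression model psi_m fitted with z as the response.
        psi = DecisionTreeRegressor(max_depth=max_depth).fit(X, z).predict(X)

        # Third step: correction parameter eta minimizing sum_i l(y_i, eta * psi_m(x_i)).
        eta = minimize_scalar(lambda e: loss(y, e * psi),
                              bounds=(-10.0, 10.0), method="bounded").x
        psi_tilde = eta * psi

        # Fourth step: deviance weight u_m, updated ensemble F, and updated response z = y - F.
        u = 1.0 / max(loss(y, psi_tilde), 1e-12)
        num, den = num + u * psi_tilde, den + u
        F = num / den
        z = y - F

    return F                                   # sixth step: final ensemble H(x) = F(x)
```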
8. Stop rules
Hereinafter, an algorithm for determining the number of basic learning machines required for constructing a final ensemble machine will be described. A basic idea of the stop rules is to terminate the entire algorithm without further updating the current ensemble machine when the deviance of the ensemble machine is minimum. The stop rules will be described below.
(1) First step: A positive integer K is determined.
(2) Second step: Assume that Fm is an ensemble machine constructed by using the first m basic learning machines. A deviance of the ensemble machine Fm for a given loss function l is calculated according to the following formula (56):
d_m = \sum_{i=1}^{n} l(y_i, F_m(x_i))        (56)
(3) Third step: o_m = argmin_{j=1,...,m} d_j and o_{m+K} = argmin_{j=1,...,m+K} d_j are set. The ensemble machine Fm for the first m satisfying the relationship o_m = o_{m+K} is determined as the final ensemble machine. Then, the algorithm is terminated.
Meanwhile, the positive integer K is a value defined by a user.
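The stop rules above can be phrased as a small helper that records where the minimum deviance occurred and stops once K additional machines have not moved that minimum. A NumPy sketch (with a user-chosen K) follows.

```python
import numpy as np

def stopping_index(deviances, K):
    """Return the first m (1-based) with the argmin over the first m equal to the argmin
    over the first m+K, i.e. the minimum deviance did not move for K further machines;
    return None if the rule is never satisfied, cf. formula (56) and the third step."""
    d = np.asarray(deviances, dtype=float)
    for m in range(1, len(d) - K + 1):
        o_m = int(np.argmin(d[:m]))            # argmin over d_1 ... d_m
        o_mK = int(np.argmin(d[:m + K]))       # argmin over d_1 ... d_(m+K)
        if o_m == o_mK:
            return m
    return None

# Usage: deviances d_m of successive ensemble machines F_m, with K = 3.
d = [10.0, 8.5, 7.9, 8.1, 8.3, 8.6, 8.2, 8.4]
print(stopping_index(d, K=3))   # the minimum at the third machine is not improved upon for K more machines
```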
Table 2 below shows simulation data for comparing the predictabilities of a case where the stop rules are not applied to the CHEM algorithm and a case where the stop rules are applied to the CHEM algorithm, with respect to various data. As shown in Table 2 below, both predictabilities are almost similar to each other. Thus, when the stop rules are applied to the algorithm, an optimal ensemble machine can be constructed using a smaller number of basic learning machines, resulting in a great increase in computational speed.
Table 2
9. Algorithm for generating association rules
9-1. Introduction
An algorithm for searching various kinds of association rules for describing data, using the basic learning machines that generate the final ensemble machine, will be explained.
Meanwhile, the association rule generating algorithm of the present invention may be applied to any ensemble machines constructed by various conventional methods.
9-2. Algorithm for generating association rules
The association rule generating algorithm is as follows.
Referring to FIG. 15, there is provided a flowchart schematically illustrating the association rule generating algorithm according to one embodiment of the present invention. It will be described in detail below.
(1) step S1501: Initialize all kinds of variables. That is, if the response variables are categorical data, a group of interest, the minimum permissible number of data and the minimum permissible confidence are defined as g, m, and p, respectively. Further, if the response variables are of a continuous type, the range of interest is defined as (gL, gU) instead of the group g.
(2) step S1502: Construct a universal set S of fundamental rules. It will be described in detail below.
By searching the tree architecture, all nodes that have a number of data larger than m and a probability of the group g larger than p are selected from all nodes of the basic learning machines, i.e., all basic learning machines that are used in the ensemble (in case of continuous-type variables, the probability of the range is applied thereto).
On the other hand, in order to obtain more desirable results, since there are as many basic learning machines as ensembles in a two-class classification problem, all of the basic learning machines are searched so as to select all nodes that meet the conditions. In a more-than-two-class classification problem (i.e., a multi-class classification problem), there are as many basic learning machines as the product of the number of ensembles and the number of groups, and all nodes that meet the conditions are selected by using only the basic learning machines corresponding to the group of interest g, i.e., as many machines as the number of ensembles.
The rules of the nodes that are selected as described above are referred to as the universal set S of fundamental rules.
(3) step S1503: For all rules that are included in the universal set S, cancel from the universal set S the rules of nodes higher than the node that generates a relevant rule. It will be described in detail below.
For all rules si ∈ S (i = 1, 2, ..., N), the following is repeated N times. After selection of si ∈ S, if the node that generates sk (k = 1, ..., N) is a node higher than the node that generates the selected si, then sk is cancelled from the universal set S.
(4) step S1504: Calculate confidences for all rules that are included in the set S. At this time, the confidence for each rule is calculated by ng/n, where the total number of data that belong to the relevant rule is n and the number of data that belong to the group of interest (i.e., the group g) is ng.
(5) step S1505: Arrange the calculated rules of the set S. That is, the calculated rules of the set S are arranged in decreasing order of confidence. The arranged rules are then referred to as o1, ..., oN.
(6) step S1506: Assume that an association rule set is R. For each arranged rule oh (h = 1, ..., N), if R is an empty set, the rule oh is added to the set R. If the set R is not an empty set, all of the association rules that are included in the set R are compared with the rule oh to determine similarity. If the rule oh is similar to none of the association rules that are included in the set R, the rule oh is added to the set R.
Meanwhile, the similarity comparing method is as follows. Suppose that two given rules o and r for the explanatory variables x = (x1, ..., xp) are defined as the formula (57) below:

o: x_i \in R_{oi} \; (i = 1, ..., p), \qquad r: x_i \in R_{ri} \; (i = 1, ..., p)        (57)

Suppose that the set of data in which xi belongs to R_oi is defined as a set D_oi, and the set of data in which xi belongs to R_ri is defined as a set D_ri. At this time, R_oi and R_ri are subsets of the real line R.
First, the maximum permissible similarity s ∈ [0, 1] is determined, and it is then determined that the two rules o and r have similarity for xi if the ratio of the number of data belonging to D_oi ∩ D_ri to the number of data belonging to D_oi ∪ D_ri is equal to or larger than s. If the ratio is smaller than s, it is determined that the two rules o and r do not have similarity for xi. Such a procedure is repeated for i = 1, ..., p.
Then, if it is determined that there is similarity for all xi, it is determined that the rules o and r have similarity. Otherwise, if there is any xi for which similarity does not hold, it is determined that the rules o and r do not have similarity.
(7) step S1507: Using all rules that are included in the association rule set R, the data are analyzed in order of confidence.
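The filtering, confidence, and similarity steps above can be sketched as follows. For brevity, a rule is represented simply as a boolean mask over the data indicating which observations satisfy it, and the per-variable similarity test of step S1506 is collapsed into a single whole-rule comparison; these simplifications and all names are assumptions made for illustration.

```python
import numpy as np

def confidence(mask, y, g):
    """Confidence n_g / n of a rule: fraction of covered data belonging to group g."""
    n = int(mask.sum())
    return (y[mask] == g).sum() / n if n > 0 else 0.0

def similar(mask_a, mask_b, s=0.7):
    """Simplified similarity test: |D_a ∩ D_b| / |D_a ∪ D_b| >= s.
    (The text applies this test variable by variable; here whole rules are compared.)"""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return union > 0 and inter / union >= s

def select_rules(candidate_masks, y, g, m_min, p_min, s=0.7):
    """Roughly steps S1502 and S1504-S1506: filter candidate rules, sort by confidence,
    and keep only rules not similar to an already accepted rule."""
    kept = [mask for mask in candidate_masks
            if mask.sum() >= m_min and confidence(mask, y, g) >= p_min]
    kept.sort(key=lambda mask: confidence(mask, y, g), reverse=True)
    R = []
    for mask in kept:
        if not any(similar(mask, r, s) for r in R):
            R.append(mask)
    return R

# Usage: three overlapping candidate rules over 10 observations, group of interest g = 1.
y = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 1])
rules = [np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], bool),
         np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], bool),
         np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], bool)]
print(len(select_rules(rules, y, g=1, m_min=3, p_min=0.5)))
```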
9-3. Performance experiment of algorithm for generating association rules
Hereinafter, test data for showing the analysis capacity of the association rule generating algorithm will be explained. As the ensemble constructing scheme applied to this test, the well-known CART algorithm was used. Further, a decision tree that is constructed in the CART algorithm is applied to the association rule generating algorithm. Further, German data are used as actual data in this test.
Association results of actual data are shown in Tables 3 and 4 below.
Table 3
Search results of association rules using the CART algorithm
Table 4
Search results of the association rule algorithm applied to the CHEM algorithm
The German data include 700 good credit customers and 300 bad credit customers among 1,000 pieces of data on credit transaction status, which were analyzed by using the same minimum permissible number of data and minimum permissible confidence for an accurate comparison of the association rules. For analyzing the bad credit customers, the minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 50%. For analyzing the good credit customers, the minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 85%.
The search results for the association rules of the CART algorithm under the above conditions were one bad credit customer group and one good credit customer group. In the example where the association rule algorithm is applied to the CHEM algorithm, the search results were 5 bad credit customer groups and 4 good credit customer groups.
The search of the association rules of the CHEM algorithm corresponds to a wide-range search of the association rules including the association rules of the CART algorithm. It will be readily noted that the results of the CHEM algorithm provide the association rules including the results of the CART algorithm. Further, since the data corresponding to the association rules searched by the CART algorithm are generated from one basic learning machine, they consist of data which are mutually exclusive from one another. On the contrary, the association rules of the CHEM algorithm are represented as a set including data which are non-exclusive from one another.
10. Comparison of performances of ensemble methods through tests
Hereinafter, the performances of various ensemble methods will be compared through tests.
10-1. Comparison of performances through virtual tests
The following models were used in the virtual tests.
(1) Model 1: Radius 2
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution.
Response variables: It was set that if \sum_{i=1}^{10} x_i^2 > c, then y = 1 with a probability of 0.9 and y = -1 with a probability of 0.1. Further, it was set that if \sum_{i=1}^{10} x_i^2 < c, then y = -1 with a probability of 0.9 and y = 1 with a probability of 0.1. Here, c is a constant satisfying the relationship Pr(\sum_{i=1}^{10} x_i^2 > c) = 0.5.
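For reference, data following Model 1 as described above could be generated roughly as follows (a NumPy/SciPy sketch; the cutoff c is taken as the median of the chi-squared distribution with 10 degrees of freedom, since the sum of squares of 10 standard normal variables follows that distribution).

```python
import numpy as np
from scipy.stats import chi2

def make_model1(n, seed=0):
    """Generate Model 1 data: y = 1 w.p. 0.9 where sum x_i^2 > c, and y = -1 w.p. 0.9 otherwise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))                    # 10 independent standard normal variables
    c = chi2.ppf(0.5, df=10)                        # c with Pr(sum x_i^2 > c) = 0.5
    inside = (X ** 2).sum(axis=1) > c
    keep = rng.random(n) < 0.9                      # keep the 'correct' label with probability 0.9
    y = np.where(inside == keep, 1, -1)
    return X, y

X_train, y_train = make_model1(1000)
X_test, y_test = make_model1(5000, seed=1)
```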
(2) Model 2: Interaction
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution.
Response variables: The response variables were 0-1 variables following Pr(y = 1|x) = exp(F(x))/(1 + exp(F(x))), and F(x) is expressed as the following formula (58):

F(x) = \Big( \sum^{6}(\,\cdot\,) + \sum^{6}(\,\cdot\,) \Big)        (58)
(3) Model 3: Two Normal
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
The first 500 training data belonged to Group 1, and the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a standard normal distribution. The remaining 500 training data belonged to Group 2, and the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a normal distribution with a mean of 0 and a variance of 2.
(4) Model 4: Simple Quadratic
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variable x followed a standard normal distribution.
Response variables were 0-1 variables following the following formula (59), with F(x) = -x^2 + 2:

Pr(y = 1 | x) = \frac{\exp(F(x))}{1 + \exp(F(x))}        (59)
FIGS. 16a to 16d are graphs showing the results of the 4 virtual tests.
As can be seen from FIGS. 16a to 16d, the CHEM algorithm proposed by the present invention operates very stably. Particularly, in the case of Model 4, all the other algorithms show poorer performance as the number of trees is increased, whereas there is no such problem in the CHEM algorithm.
Table 5 below shows numerical values representing the results of FIGS. 16a to 16d.
Table 5. Results of virtual tests
Meanwhile, it can be understood that the predictability of the CHEM algorithm is superior to those of the conventional ensemble algorithms in most of the models. It can also be understood that the CHEM algorithm is greatly superior to the conventional algorithms in view of the deviance in addition to the predictability. That is, the deviances of the conventional algorithms other than the CHEM algorithm increase continuously. This means that the conventional ensemble algorithms can hardly perform function estimation.
On the contrary, it can be seen that the deviance of the CHEM algorithm remains stable. The phenomenon that the predictability is good but the deviance increases is found in all the ensemble algorithms except the CHEM algorithm. Such a phenomenon means that the boundary between the two groups can be found well in the classification problem but the other information is lost.
For example, a probability that a response variable y will belong to a k-th group for a new explanatory variable x cannot be estimated in all the ensemble algorithms except for the CHEM algorithm.
Consequently, the virtual test results show that the CHEM algorithm proposed by the present invention has both excellent predictability and stability.
10-2. Comparison through analysis of actual data
The CHEM algorithm and the existing algorithms were compared with one another through analyses of various actual data. The actual data used were obtained from the 'Machine Learning Web Site' (http://wwwl.ics.uci.edu/~mlearn MLRepository.html) of 'UC Irvine' described above.
Table 6 below shows information on the actual data.
Table 6. Information on actual data
FIGS. 17a to 17i are graphs showing the results of the actual data analysis.
It can be understood that the CHEM algorithm operates very stably in most cases. Although the CHEM algorithm operates slightly unstably in the case of the 'Ionosphere data,' the difference in stability between these data and the other data is not large. In the case of the 'German data,' it can be understood that the CHEM algorithm is superior to the other algorithms in view of predictability and stability. It can also be seen that, similarly to the aforementioned virtual tests, the CHEM algorithm has the smallest deviance value.
Table 7 below numerically shows the results of FIGS. 17a to 17i.
Table 7. Analysis results of actual data
In conclusion, the CHEM algorithm has very excellent predictability, is at the same time very stable (it does not produce greatly wrong estimation results for any data), and enables function estimation (i.e., the deviance is small).
The function estimation means probability estimation. The CHEM algorithm can represent the values of response variables for given data as values with a discrete probability distribution, contrary to the conventional ensemble algorithms. This can be very usefully utilized in the analysis of the actual data.
Moreover, Table 8 below shows test results of continuous variables.
Table 8. Analysis results of continuous variables
In Table 8, the LS boost algorithm was used as the conventional algorithm.
The Friedman model was a virtual model in which the number of training data was 500 and the number of test data was 5,000. Further, the explanatory variables were x = (x1, ..., x10), which were independent of one another and followed a uniform distribution on [0, 1]. Moreover, the response variables were y = 10 sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon, where \epsilon ~ N(0, 1).
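Data following the Friedman model above could be generated, for example, as follows (a NumPy sketch).

```python
import numpy as np

def make_friedman(n, seed=0):
    """Friedman model: y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + eps."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 10))         # 10 independent Uniform[0, 1] variables
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4]
         + rng.normal(size=n))                      # eps ~ N(0, 1)
    return X, y

X_train, y_train = make_friedman(500)
X_test, y_test = make_friedman(5000, seed=1)
```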
Data for finding out influences of various environmental variables on prices of houses in Boston were used as the actual data (Boston Housing Data). These data are easily available from the Internet.
The number of these data was 506, and the 5-fold cross validation method was used for obtaining a test error. It can be understood from the analysis results of the virtual tests and the data on the prices of houses in Boston that the CHEM algorithm operates well in the regression model as well as in the classification problem. Particularly, reduction parameters should be defined by a user in the conventional LS boosting method, whereas it is hardly necessary for the user to define anything in the CHEM algorithm.
According to the present invention, there are the following advantages.
First, the proposed CHEM algorithm is far superior to the conventional ensemble construction algorithms in view of predictability and operates very stably. That is, there is a remarkably reduced chance that the overfitting problem occurs. Second, the proposed CHEM algorithm overcomes the drop in analyzability which is commonly produced by the existing ensemble construction methods, and shows better analyzability than other data mining methods by using the association rule algorithm.
Third, the CHEM algorithm can be naturally applied to the continuous variable problem and thus easily applied to general industrial fields.
Although the invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made thereto without departing from the spirit and scope of the invention defined by the appended claims.