
WO2003075187A1 - Apparatus and method for constructing data mining model using ensemble machines - Google Patents


Info

Publication number
WO2003075187A1
Authority
WO
WIPO (PCT)
Prior art keywords
ensemble
machine
learning
machines
ensemble machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/KR2003/000409
Other languages
French (fr)
Inventor
Yong-Dai Kim
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BL SYSTEMS Inc
Original Assignee
BL SYSTEMS Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BL SYSTEMS Inc filed Critical BL SYSTEMS Inc
Priority to AU2003212671A priority Critical patent/AU2003212671A1/en
Publication of WO2003075187A1 publication Critical patent/WO2003075187A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/40 - Data acquisition and logging
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G06N20/20 - Ensemble learning

Definitions

  • the present invention relates to an apparatus and method for constructing a data mining model using ensemble machines, and more particularly, to an apparatus and method for constructing a data mining model wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
  • the boosting algorithm is a method which is generally used in a classification problem.
  • a basic model of the classification problem is as follows:
  • n training data (x_1, y_1), ..., (x_n, y_n) are given.
  • Here, x_i is a p-dimensional explanatory variable, i.e., x_i = (x_i1, ..., x_ip); and a response variable y_i designates the group in which the data are included. That is, if there are J groups, y_i has one integer value of 1 to J.
  • An objective of the classification problem is to find out a relationship in which the response variable can be best explained based on the explanatory variable using the n training data.
  • In other words, the objective is to obtain an optimal function H: R^p → {1, 2, ..., J} using the training data. Then, when a new explanatory variable x is given, the data are assigned to Group H(x).
  • The real boosting algorithm is most representative of the algorithms for constructing an ensemble machine for data mining. This algorithm is described in detail in United States Patent No. 5,819,247 entitled "Apparatus and methods for machine learning hypotheses."
  • FIG. 1 is a flowchart illustrating the conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 1.
  • Step S101: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
  • Step S102: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n.
  • Step S103: A probability p_m(x) that a value of the response variable with respect to the explanatory variable x will be 1 is estimated through basic learning machines using the n training data (x_1, y_1), ..., (x_n, y_n) and the weights w_1, ..., w_n.
  • Step S104: f_m(x) is obtained by transforming the probability p_m(x). The f_m(x) is determined according to the following formula (2):
  • Step S107: A final ensemble machine is determined according to the following formula (4). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
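  • As an illustration only, the following Python sketch shows the general shape of such a two-class real boosting loop. Steps S105 and S106 are not reproduced in the excerpt above, so the weight update and the number of rounds M follow the standard Real AdaBoost choices of Schapire and Singer rather than the patent's own formulas (2) and (4); the scikit-learn calls and the tree size are likewise illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_boosting_fit(X, y, M=50, eps=1e-6):
    """Two-class real boosting sketch; y must be coded -1/+1 (step S101)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                                     # step S102: w_i = 1/n
    trees = []
    for _ in range(M):
        tree = DecisionTreeClassifier(max_leaf_nodes=4)         # weak base learner
        tree.fit(X, y, sample_weight=w)
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)  # step S103: P_w(y = 1 | x)
        f = 0.5 * np.log(p / (1 - p))                           # step S104: one common choice for f_m
        w *= np.exp(-y * f)                                     # standard Real AdaBoost re-weighting (assumed)
        w /= w.sum()
        trees.append(tree)
    return trees

def real_boosting_predict(trees, X, eps=1e-6):
    """Step S107: H(x) = sum_m f_m(x); assign Group 2 if H(x) > 0, else Group 1."""
    H = np.zeros(len(X))
    for tree in trees:
        p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)
        H += 0.5 * np.log(p / (1 - p))
    return np.where(H > 0, 2, 1)
```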
  • FIG. 2 is a flowchart illustrating the conventional Logit boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 2.
  • Step S201: A response variable y_i is set either to 1 if the data belong to Group 2 or to 0 if the data belong to Group 1.
  • Step S202: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n, and F(x) and p(x) are also initialized to 0 and 1/2, respectively.
  • Step S203: A new response variable z_i and a new weight w_i with respect to the given explanatory variable are obtained from the following formula (5):
  • Step S204: A regression model f_m(x) is constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
  • Step S205: The probability is updated using the f_m(x) constructed in step S204 and is expressed as the following formula (6): p(x) = exp(F(x)) / (exp(F(x)) + exp(-F(x)))
  • Step S206: The probability update is performed M times by repeating steps S203 to S205.
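  • The following Python sketch mirrors this two-class Logit boosting loop. Formulas (5) and (6) are only partially reproduced above, so the working response, the working weight, and the 1/2 step size follow the usual LogitBoost formulation; the regression tree and its size are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit_boosting_fit(X, y01, M=50, eps=1e-5):
    """Two-class Logit boosting sketch; y01 must be coded 0/1 (step S201)."""
    n = len(y01)
    F = np.zeros(n)                                   # step S202: F(x) = 0
    p = np.full(n, 0.5)                               # step S202: p(x) = 1/2
    trees = []
    for _ in range(M):                                # step S206: repeat M times
        w = np.clip(p * (1 - p), eps, None)           # step S203: working weight w_i
        z = (y01 - p) / w                             # step S203: working response z_i (assumed form)
        tree = DecisionTreeRegressor(max_leaf_nodes=4)
        tree.fit(X, z, sample_weight=w)               # step S204: weighted regression model f_m
        F += 0.5 * tree.predict(X)                    # accumulate F (the 1/2 factor is an assumption)
        p = np.exp(F) / (np.exp(F) + np.exp(-F))      # step S205: formula (6)
        trees.append(tree)
    return trees
```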
  • FIG. 3 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 3.
  • Step S301: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
  • Step S302: Various kinds of variables are initialized according to the following formula (7):
  • Step S303: A new response variable z_i with respect to the given explanatory variable is calculated from the following formula (8):
  • Step S304: f_m(x) is estimated using the response variable z_i, the explanatory variable x_i, and the basic learning machines (regression decision trees).
  • Step S305: An estimated value γ_i of an i-th terminal node of the f_m(x) is estimated based on the following formula (9):
  • R_i is a set of the data belonging to the i-th terminal node.
  • Step S306: The F(x) is updated in accordance with the formula.
  • Step S307: The F(x) is updated M times by repeating steps S303 to S306 M times.
  • Step S308: A final ensemble machine is constructed by using the relationship H(x) = F(x).
  • A new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
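  • A Python sketch of this two-class gradient boosting loop is given below. Formulas (7) to (9) are not reproduced in the excerpt, so the initialization, the pseudo-residual, and the terminal-node estimate γ_i follow Friedman's usual two-class gradient boosting with the binomial deviance; treat all of these as assumptions rather than the patent's exact expressions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, M=50):
    """Two-class gradient boosting sketch; y must be coded -1/+1 (step S301)."""
    n = len(y)
    ybar = np.clip(np.mean(y == 1), 1e-6, 1 - 1e-6)
    F = np.full(n, 0.5 * np.log(ybar / (1 - ybar)))          # step S302 (a common initialization)
    models = []
    for _ in range(M):                                       # step S307: repeat M times
        z = 2 * y / (1 + np.exp(2 * y * F))                  # step S303: pseudo-residual (assumed loss)
        tree = DecisionTreeRegressor(max_leaf_nodes=8)
        tree.fit(X, z)                                       # step S304: regression decision tree
        leaf = tree.apply(X)
        gamma = {}
        for node in np.unique(leaf):                         # step S305: gamma_i per terminal node R_i
            zi = z[leaf == node]
            gamma[node] = zi.sum() / (np.sum(np.abs(zi) * (2 - np.abs(zi))) + 1e-12)
        F += np.array([gamma[v] for v in leaf])              # step S306: update F(x)
        models.append((tree, gamma))
    return models                                            # step S308: H(x) = F(x); Group 2 if H(x) > 0
```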
  • When the basic learning machine is to be constructed in these algorithms, the complexity of the learning machine should be determined beforehand. In a case where the basic learning machine is the decision tree, the complexity of the learning machine can be controlled by adjusting the number of terminal nodes. However, since the basic learning machines used in the boosting algorithms are weak ones, their complexity should be minimized. In general, a decision tree with two to eight terminal nodes is widely used as a basic learning machine. 3. Conventional boosting algorithms applicable to multi-class classification
  • the aforementioned three boosting algorithms have been applicable to cases where there are two groups. However, in a case where there are three or more groups, the algorithms should be expanded and modified accordingly.
  • the real boosting algorithm, Logit boosting algorithm, and gradient boosting algorithm applicable to the multi-class classification will be explained.
  • FIG. 4 shows a flowchart illustrating the conventional real boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 4.
  • Step S401: J new data ((x_i, 1), y_i1), ..., ((x_i, J), y_iJ) are generated from the given training data (x_i, y_i). At this time, y_ij is 1 if y_i is j; otherwise, y_ij is -1.
  • Step S402: A final ensemble machine is constructed according to the following formula (10) by applying the two-class real boosting algorithm to the (n x J) new data (i.e., by transforming the multi-class classification into the two-class classification):
  • Step S403: A new explanatory variable x is assigned to Group argmax_j H(x, j).
  • Because the number of executions, i.e., the amount of calculation, grows with the number of groups J, the real boosting algorithm is very time-consuming.
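  • The following sketch illustrates this expansion by reusing the real_boosting_fit sketch given earlier: one two-class booster is trained per group (steps S401 and S402) and a new point is assigned to the group with the largest score (step S403). The J-fold growth in training work is exactly why the text above calls the approach time-consuming.

```python
import numpy as np

def real_boosting_multiclass_fit(X, y, J, M=50):
    """One-vs-rest expansion: one two-class real booster per group (steps S401-S402)."""
    return [real_boosting_fit(X, np.where(y == j, 1, -1), M=M) for j in range(1, J + 1)]

def real_boosting_multiclass_predict(boosters, X, eps=1e-6):
    """Step S403: assign x to Group argmax_j H(x, j)."""
    scores = np.zeros((len(X), len(boosters)))
    for j, trees in enumerate(boosters):
        for tree in trees:
            p = np.clip(tree.predict_proba(X)[:, 1], eps, 1 - eps)
            scores[:, j] += 0.5 * np.log(p / (1 - p))
    return scores.argmax(axis=1) + 1
```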
  • FIG. 5 shows a flowchart illustrating the conventional Logit boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 5.
  • Step S503: A new response variable z_j and a weight w_j are determined according to the following formula (11):
  • Step S504: A regression model f_mj(x) is calculated and constructed with reference to the basic learning machines and using the response variable z_j, the explanatory variable x_i, and the weight w_j.
  • Step S505: Steps S502 to S504 are repeated J times.
  • Step S507: A j-th probability p_j(x) is updated according to the following formula (13): p_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x))
  • Step S508: Steps S502 to S507 are repeated M times.
  • Step S509: A new explanatory variable x is assigned to Group argmax_j F_j(x).
  • FIG. 6 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 6.
  • Step S603: A new response variable z_j is determined according to the following formula (14):
  • Step S604: A regression model f_mj(x) is constructed with reference to the basic learning machines (regression decision trees) and using the response variable z_j and the explanatory variable x_i.
  • Step S605: An estimated value γ of an i-th terminal node in the regression model is determined according to the following formula (15):
  • Step S606: F_j(x) is determined according to the following formula (16):
  • Step S609: A new explanatory variable x is assigned to Group argmax_j F_j(x).
  • an object of the present invention is to provide an apparatus and method for constructing an ensemble machine wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
  • an apparatus for constructing a data mining model using ensemble machines comprising: M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
  • a method for constructing a data mining model using ensemble machines comprising: a step of constructing M ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine constructing step constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine constructing step receives a (k-1)-th ensemble machine constructed in a (k-1)-th ensemble machine constructing step, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
  • an apparatus for constructing a data mining model in two-class classification using ensemble machines comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
  • a method for constructing a data mining model in two-class classification using ensemble machines comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; and a fifth step for updating and reconstructing learning machines by using the correction parameter calculated from the fourth step.
  • an apparatus for constructing a data mining model in multi-class classification using ensemble machines comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
  • a method for constructing a data mining model in multi-class classification using ensemble machines comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; a fifth step for updating and reconstructing learning machines by using the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the updated learning machines.
  • an apparatus for constructing a data mining model using ensemble machines in a case where a response variable is regressive comprising: an input means for receiving multidimensional training data; a residual calculation means for calculating residuals of the ensemble machines; a regression model construction means for constructing regression models from the data inputted by the input means through basic learning machines and the residuals; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the regression model construction means; a learning machine update means for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the reconstructed learning machines.
  • a method for constructing a data mining model using ensemble machines in a case where a response variable is regressive comprising: a first step for receiving multidimensional training data; a second step for calculating residuals of the ensemble machines; a third step for constructing regression models from the data inputted by the first step through basic learning machines and the residuals; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the third step; a fifth step for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the reconstructed learning machines. More specifically, the inputted training data is generated into Bootstrap data by using the weights.
  • the learning machines are decision trees or neural network models.
  • the final ensemble machine is constructed when a deviance of a current ensemble machine is smallest.
  • FIG. 1 is a flowchart illustrating a conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem
  • FIG. 2 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case where there are provided two groups in the classification problem
  • FIG. 3 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case where there are provided two groups in the classification problem
  • FIG. 4 is a flowchart illustrating a conventional real boosting algorithm applicable to a case of a multi-class classification problem
  • FIG. 5 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case of the multi-class classification problem
  • FIG. 6 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case of the multi-class classification problem
  • FIG. 7 is a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention.
  • FIG. 8 is a flowchart schematically illustrating a cross validation process for selecting an optimal decision tree
  • FIG. 9 is a flowchart illustrating an entire outline of a process for selecting the optimal decision tree through tree information criteria (TIC);
  • FIG. 10 is a flowchart showing a process for generating Bootstrap data according to another preferred embodiment of the present invention
  • FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from a true learning machine among the learning machines included in the conventional set of basic learning machines;
  • FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention
  • FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention
  • FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention.
  • FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention.
  • FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention.
  • FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention
  • FIG. 15 is a flowchart illustrating an outline of an algorithm for generating association rules
  • FIGS. 16a to 16d are graphs showing the results of virtual tests for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention
  • FIGS. 17a to 17i are graphs showing the results of actual data analysis for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention.
  • An algorithm for constructing a data mining model proposed by the present invention is a novel algorithm of which predictability is superior to the boosting algorithms while solving the aforementioned problems in the boosting algorithms. Further, in a case where a decision tree is employed as a basic learning machine, the algorithm of the present invention has a further improved analyzability as compared with a decision tree, which is known to have the most superior analyzability among conventional data mining techniques (of course, the decision tree has very poor predictability). In particular, according to the algorithm of the present invention, various kinds of association rules capable of describing given data can be extracted, and accordingly, the given data can be analyzed in various respects.
  • the algorithm proposed by the present invention is completely different from the boosting algorithm in view of their basic ideas.
  • the boosting algorithm is to develop a strong learning machine by combining several weak learning machines, whereas the algorithm of the present invention is to develop a stronger learning machine by combining several strong learning machines.
  • a basic learning machine used in an ensemble scheme of the present invention will be explained by way of example of the decision tree used as the learning machine.
  • the reason is that the decision tree has advantages such as high analyzability and simplicity of the algorithm.
  • the basic learning machine can be differently selected depending on data types.
  • a neural network model may be utilized in case of character or voice recognition.
  • The CART algorithm proposed by Breiman et al. in 1984 is currently well known as a decision tree construction algorithm.
  • the CART algorithm comprises three steps of growing the decision tree, pruning, and selecting the optimal decision tree.
  • the third step of selecting the optimal decision tree utilizes a cross validation method which requires a large amount of computation.
  • the amount of calculation required for the cross validation is not a burden when generating a single decision tree.
  • the ensemble scheme requires generating several decision trees.
  • the present invention defines and utilizes a new parameter of tree information criteria (TIC) so that the optimal decision tree can be constructed in a short time.
  • the algorithm for constructing the data mining model proposed by the present invention is not limited to a method of selecting the decision tree using the TIC.
  • various basic learning machines e.g., a neural network learning machine
  • the decision tree will be merely selected and explained herein as the basic learning machine.
  • FIG. 7 shows a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention.
  • a basic learning machine is constructed in step S702.
  • a decision tree, a neural network model or the like can be utilized as the basic learning machine.
  • the decision tree is employed herein as the basic learning machine.
  • the optimal decision tree is constructed in step S702.
  • the optimal decision tree can be constructed in various manners.
  • a new scheme referred to as a tree information criteria (TIC) algorithm will be introduced into the present invention.
  • In step S703, the ensemble machine is constructed by using the constructed basic learning machines, and it is determined in step S704 whether the constructed ensemble machine is optimal. If the result of the determination in step S704 is that the ensemble machine is not optimal, a new response variable is generated in step S705, and then the procedure returns to step S702.
  • If the result of the determination in step S704 is that the ensemble machine is optimal, a final ensemble machine is constructed in step S706, and then the procedure is terminated.
  • Hereinafter, the TIC, i.e., the novel algorithm for constructing the optimal decision tree which is substituted for the cross validation method, will be described to show how to construct the optimal decision tree (refer to step S702).
  • Then, a method of extracting Bootstrap data for use in the present invention will be examined (refer to a first half of step S703).
  • Then, a new algorithm using the Bootstrap data extraction method will be described (refer to a second half of step S703 and steps S704 to S706).
  • a TIC method for solving the problems of the cross validation method which has been generally utilized when determining the optimal decision tree will be explained.
  • general algorithms for generating the decision trees will be described.
  • a conventional cross validation algorithm will be examined.
  • an algorithm for selecting the optimal decision tree using the TIC proposed by the present invention will be explained.
  • The algorithm for constructing the decision tree, which was proposed by Breiman, can be roughly divided into three steps.
  • a first step is a growth algorithm for generating a largest decision tree with respect to given data.
  • a second step is a pruning algorithm for generating a plurality of nested decision trees by pruning unnecessary branches from the huge decision tree constructed through the growth algorithm. At this time, the constructed decision trees become smaller and smaller.
  • a third step is an algorithm for selecting the optimal decision tree among the decision trees which have been obtained from the pruning algorithm.
  • FIG. 8 shows a flowchart schematically illustrating the process of a cross validation method for selecting the optimal decision tree. The process will be explained in detail below.
  • Step S802: Given n training data are equally divided into k parts so that mutually exclusive data D_1, D_2, ..., D_k can be generated.
  • Step S803: D_i is set to test data, and the other data are set to training data.
  • Step S804: By using the training data, the nested decision trees are constructed (through the growth and pruning algorithms).
  • Step S805: By using the test data D_i, a prediction error is calculated with respect to each of the nested decision trees.
  • Step S806: A decision tree, which is closest to one decision tree T_j among the constructed decision trees, is selected. At this time, since the decision tree selection algorithm was fully disclosed in Breiman et al. (1984), the description thereof will be omitted herein.
  • Step S807: The prediction error of the decision tree obtained from step S806 is added to e_j.
  • Step S810: e_1, ..., e_m are referred to as the cross validation errors of the decision trees T_1, ..., T_m, respectively.
  • A decision tree of which cross validation error is smallest is selected as an optimal decision tree.
  • Since the decision trees should be constructed many times, if the data are huge, there are problems in that a very long calculation time is required and the calculation result varies arbitrarily depending on how the data are divided.
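  • A Python sketch of this cross validation selection is given below. It uses scikit-learn's cost-complexity pruning path in place of the original growth and pruning algorithms and assumes X and y are NumPy arrays; the nested loops make the computational burden described above visible.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def select_tree_by_cv(X, y, k=5):
    """Pick the pruned tree whose k-fold cross validation error is smallest (steps S802-S810)."""
    path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
    alphas = path.ccp_alphas                                       # one alpha per nested (pruned) tree
    cv_err = np.zeros(len(alphas))
    for train, test in KFold(n_splits=k, shuffle=True).split(X):   # steps S802-S803
        for i, a in enumerate(alphas):                             # steps S804-S807
            tree = DecisionTreeClassifier(ccp_alpha=a).fit(X[train], y[train])
            cv_err[i] += np.mean(tree.predict(X[test]) != y[test])
    best_alpha = alphas[np.argmin(cv_err)]                         # step S810: smallest CV error
    return DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X, y)
```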
  • the TIC algorithm of the present invention is newly proposed and will be hereinafter described.
  • An objective of the TIC algorithm is to select the optimal decision tree among several decision trees T_1, ..., T_m. At this time, each posterior probability of the respective decision trees is calculated, and then the decision tree of which posterior probability is largest is selected as the optimal decision tree.
  • The posterior probability means a probability of each tree with respect to the given data. That is, the posterior probability of the tree T_i will be Pr(T_i | D_n) with respect to the given data D_n = {(y_1, x_1), ..., (y_n, x_n)}.
  • FIG. 9 shows a flowchart illustrating an entire outline of a method for selecting the optimal decision tree using the TIC. The method will be explained in detail below.
  • After the training data are inputted in step S901, the largest decision tree is constructed based on the training data in step S902. Then, in step S903, the constructed largest decision tree is newly generated into nested decision trees through the pruning technique. Thereafter, each posterior probability of the respective decision trees is calculated in step S904, the decision tree of which posterior probability is largest is then selected in step S905, and a single optimal decision tree is finally selected in step S906.
  • The posterior probability is defined as Pr(T_i | D_n) = c Pr(D_n | T_i) Pr(T_i), where Pr(D_n | T_i) is a probability of the data in a case where the model is T_i, Pr(T_i) is a probability arbitrarily determined by a user when the user examines the data, and c is a constant for establishing the relationship Σ_{i=1}^{m} Pr(T_i | D_n) = 1.
  • Pr(x_k | T_i) is not dependent on the T_i; that is, Pr(x_k | T_i) = Pr(x_k). Therefore, once Pr(y_k | T_i, x_k) is obtained, the likelihood can be written as the following formula (20): Pr(D_n | T_i) = Π_{k=1}^{n} Pr(y_k | T_i, x_k) Pr(x_k). Pr(y_k | T_i, x_k) is calculated in the following manner.
  • n_jh is the number of data belonging to Group j among the data included in the h-th terminal node.
  • Pr(T_i) is not obtained from data but is inputted by the user.
  • Pr(T_i) for the TIC is constructed as follows.
  • A probability that a given h-th node will be an intermediate node is defined as the following formula (27): α(1 + f_h)^(-β), where f_h is the number of ancestor nodes of the given node, and the constants α and β are determined by the user. Then, a probability that a given node will be a terminal node is naturally determined as the following formula (28): 1 - α(1 + f_h)^(-β).
  • Pr(T_i) = Π_{h ∈ T_i - T̃_i} α(1 + f_h)^(-β) × Π_{h ∈ T̃_i} (1 - α(1 + f_h)^(-β)), where T_i - T̃_i is the set of the intermediate nodes (i.e., all the nodes which are not terminal nodes) and T̃_i denotes the set of terminal nodes of the tree T_i.
  • the TIC will be calculated using the aforementioned matters.
  • TIC of the tree T is expressed as the following formula (31):
  • d_h is the number of data belonging to a second group among data of an h-th terminal node.
  • a decision tree having the maximum TIC is selected as an optimal decision tree.
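  • Purely as an illustration, the sketch below scores one pruned tree with a TIC-like posterior. The node prior follows formulas (27) and (28); formula (31) itself is not reproduced in the excerpt, so a Beta-Binomial marginal likelihood over the counts n_h and d_h of each terminal node is assumed in its place, and the default constants are arbitrary.

```python
import numpy as np
from scipy.special import betaln

def tic_score(intermediate_depths, terminal_nodes, alpha=0.95, beta=1.0, a=1.0, b=1.0):
    """Hedged TIC-style posterior score for one pruned tree.
    intermediate_depths: list of f_h (number of ancestors) for the intermediate nodes.
    terminal_nodes: list of (f_h, n_h, d_h) with n_h data in the node, d_h of them in Group 2."""
    log_prior = 0.0
    log_lik = 0.0
    for f_h in intermediate_depths:
        log_prior += np.log(alpha * (1.0 + f_h) ** (-beta))           # formula (27)
    for f_h, n_h, d_h in terminal_nodes:
        log_prior += np.log(1.0 - alpha * (1.0 + f_h) ** (-beta))     # formula (28)
        log_lik += betaln(a + d_h, b + n_h - d_h) - betaln(a, b)      # assumed marginal likelihood
    return log_prior + log_lik

# The pruned tree with the largest score plays the role of step S905
# (candidate_trees and its attributes below are hypothetical stand-ins):
# best_tree = max(candidate_trees, key=lambda t: tic_score(t.intermediate_depths, t.terminal_nodes))
```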
  • The conventional method using the Bayesian theorem and the TIC method proposed by the present invention are the same as each other in that they utilize a posterior probability. However, they are a little different from each other in view of the prior probability construction which is used when calculating their posterior probabilities. Such a difference has a great influence on calculation of the posterior probabilities. That is, in the conventional method using the Bayesian theorem, the posterior probability cannot be calculated by means of numerical formulas. Thus, since the probabilities should be calculated through a computer, it takes much more time to perform the calculation as compared with the method using the cross validation.
  • The method for constructing the prior probability according to the conventional method using the Bayesian theorem assigns probabilities to all possible decision trees.
  • The TIC method solves the problem of the conventional method using the Bayesian theorem by assigning the prior probabilities not to all the possible decision trees but only to the nested decision trees induced through the pruning algorithm.
  • the prior probability constructing method for use in the TIC method is a method for reducing sets of the decision trees by using data. This point is the very sharp difference between the TIC method and the conventional method using the Bayesian theorem.
  • Table 1 shows simulation results of the selection of an optimal decision tree by using the conventional 5-fold cross validation method and the TIC method proposed by the present invention, respectively. That is, these experiment data are used to compare the generation rate of a single tree through the 5-fold cross validation with that of the single tree through the TIC proposed by the present invention.
  • Respective pieces of the experiment data indicate average time when the same data are repeatedly generated 500 times.
  • The specification of the computer was a Pentium III 900 MHz processor, a 256 Mbyte main memory, and a Windows 2000 OS.
  • The simulation data correspond to standard data which have been widely known in the field of data mining, i.e., 'Radius2,' 'Interaction,' 'Breast Cancer,'
  • A general algorithm for generating the Bootstrap data will be explained below. Assuming that the original data are n data, i.e., y_1, ..., y_n, the number of the Bootstrap data to be generated, m, is determined and a new data set D_B is initialized to an empty set.
  • FIG. 10 is a flowchart illustrating the process of generating the Bootstrap data according to one embodiment of the present invention, which will be explained below.
  • Step S1001: Weights w_1, ..., w_n are assigned to the n multidimensional data (x_1, y_1), ..., (x_n, y_n) to be inputted.
  • Here, x_i is a p-dimensional explanatory variable; that is, x_i = (x_i1, ..., x_ip).
  • Step S1002: Cumulative weights w_1*, ..., w_n* of the weights w_1, ..., w_n assigned to y_1, ..., y_n are calculated.
  • Step S1003: The number of new data to be generated, m, is determined.
  • The number of the Bootstrap data, m, for use in the ensemble algorithm according to the present invention is set to n.
  • Step S1004: The new data set D_B is initialized to the empty set.
  • Step S1005: A random real number satisfying the relationship k ∈ [0, w_n*] is generated using the random number generator.
  • Step S1006: An integer j satisfying the relationship w_{j-1}* < k ≤ w_j* among w_1*, ..., w_n* is determined, where j = 1, ..., n.
  • Step S1007: (x_t^B, y_t^B) is assigned as (x_j, y_j).
  • Step S1008: The weight w_t^B corresponding to (x_t^B, y_t^B) is set to 1/m.
  • Step S1009: (x_t^B, y_t^B) is added to D_B.
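  • A short Python sketch of steps S1001 to S1009 follows; np.searchsorted plays the role of the cumulative-weight lookup in step S1006, NumPy's default generator stands in for the random number generator, and X and y are assumed to be NumPy arrays.

```python
import numpy as np

def weighted_bootstrap(X, y, w, m=None, rng=None):
    """Generate Bootstrap data D_B by sampling proportionally to the weights (steps S1001-S1009)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    m = n if m is None else m                   # step S1003: the CHEM algorithm sets m = n
    w_star = np.cumsum(w)                       # step S1002: cumulative weights w_1*, ..., w_n*
    idx = np.empty(m, dtype=int)
    for t in range(m):
        k = rng.uniform(0.0, w_star[-1])        # step S1005: k in [0, w_n*]
        idx[t] = np.searchsorted(w_star, k)     # step S1006: j with w_{j-1}* < k <= w_j*
    w_B = np.full(m, 1.0 / m)                   # step S1008: every bootstrap point gets weight 1/m
    return X[idx], y[idx], w_B                  # step S1009: (x_t^B, y_t^B) collected into D_B
```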
  • The ensemble algorithm proposed by the present invention will be referred to as the Convex Hull Ensemble Machine (CHEM) algorithm.
  • the CHEM algorithm is an ensemble algorithm for generating a new learning machine by using a plurality of basic learning machines.
  • In a classification problem, the response variables are categorical; in a regression model problem, the response variables are continuous.
  • In either case, a learning problem can be changed to a function estimation problem.
  • ⁇ * (refe ⁇ ed to as "true learning machine")
  • ⁇ ⁇
  • ⁇ ⁇ ⁇ ⁇ is a set of the basic learning machines for use in the CHEM algorithm. That is, an optimal basic learning machine for given training data is selected as one of the set ⁇ .
  • a method for finding the optimal learning machine has been widely known in the art.
  • the set of basic learning machines ⁇ for use in the data mining there are a decision tree, a neural network model, and the like. However, such basic learning machines have a serious problem in that they very sensitively respond to changes in data.
  • The more basic reason why the completely different learning machines describe the data in similar manners is that the true learning machine ψ* is not included in the considered set of basic learning machines Ψ. Moreover, this is because there are a plurality of learning machines, which have small distances (i.e., the degree of difference) from the true learning machine ψ*, among the basic learning machines included in the set Ψ.
  • FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from the true learning machine among the learning machines included in the conventional set of basic learning machines.
  • Various learning machines ψ surround the true learning machine ψ*.
  • Depending on changes in the data, the optimal learning machine may be greatly changed. That is, the optimal learning machine will be one of ψ_1, ψ_2 and ψ_3 depending on a direction in which the data are shifted from the true learning machine ψ*.
  • However, the true learning machine ψ* can be constructed by combining a plurality of learning machines. This is because the true learning machine ψ* is located in a convex hull space of the learning machines included in the set Ψ. Particularly, if proper weights w_1, w_2 and w_3 are obtained in FIG. 11, the true learning machine ψ* can be obtained as a weighted average of several learning machines included in the set Ψ.
  • the CHEM algorithm proposed by the present invention has been developed by using such an idea.
  • A basic assumption of the CHEM algorithm is that the true learning machine ψ* is expressed as a weighted average of M learning machines ψ_1, ..., ψ_M included in the set Ψ. This is expressed as the following formula (36):
  • The CHEM algorithm is an algorithm in which the learning machines ψ_k used for obtaining the weighted average and the weights w_k thereof are sequentially found out.
  • The principle of an algorithm for finding out a (k+1)-th learning machine ψ_(k+1) and a (k+1)-th weight w_(k+1) thereof by using the k learning machines ψ_1, ..., ψ_k and the k weights w_1, ..., w_k, which have been already constructed, in the CHEM algorithm will be explained below by steps.
  • An optimal learning machine among the learning machines orthogonal to the current ensemble machine F_k according to the formula (36) is set to ψ_(k+1).
  • A new ensemble machine F_(k+1) is generated according to the following formula (37). At this time, weights u_k and w_(k+1) are obtained such that the distance between the ensemble machine F_(k+1) and the true learning machine ψ* is minimized.
  • FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention.
  • an optimal learning machine ⁇ ⁇ (i.e., a learning machine located closest to the true learning machine ⁇ *) is obtained.
  • FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention.
  • FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention.
  • An optimal learning machine ⁇ j orthogonal to F 2 is found out.
  • weights u 2 , w 3 by which the distance between an ensemble machine F 3 according to the formula (37) and the true learning machine ⁇ * is shortest are obtained.
  • the ensemble machine F 3 according to the formula (37) is expressed as the following formula (39):
  • FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention.
  • A loss function is indicated by l, and a deviance of a given learning machine ψ is defined as the following formula (42): d(ψ) = Σ_{i=1}^{n} l(y_i, ψ(x_i)).
  • An optimal learning machine ψ_2 orthogonal to ψ_1 is found out.
  • a residual is used for obtaining the orthogonal learning machine.
  • The residual r_i can be obtained from the formula (43) when the response variables are categorical:
  • An optimal learning machine ψ_2 is then constructed by adopting the residual r_i as a response variable.
  • The reason for obtaining the optimal learning machine by using the residual is that the residual has a property of being orthogonal to the response variable in a regression model. Therefore, the optimal learning machine ψ_2 for the residual is nearly orthogonal to the optimal learning machine ψ_1 for the response variable.
  • d(ψ) is approximately a square of the distance between ψ and ψ* according to statistical theory.
  • The construction of a third learning machine is identical to the construction of the second learning machine except that the residual is obtained by using F_2 instead of ψ_1.
  • The ensemble machine F_3 for the obtained third learning machine ψ_3 is expressed as the following formula (46):
  • FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention.
  • In step S1301, various kinds of variables are initialized.
  • Bootstrap data are generated by applying weights to the multidimensional data in step S1302.
  • A probability that the response variable will be 1 for a given explanatory variable is estimated by using the basic learning machines in step S1303.
  • Then, in step S1304, a correction parameter for minimizing a given deviance is calculated.
  • A corrected learning machine is constructed by using the correction parameter in step S1305.
  • An ensemble machine is constructed based on the corrected learning machine in step S1306.
  • In step S1307, it is determined whether the ensemble machine satisfies the stop rules to be described later. If it does not satisfy the stop rules, the procedure returns to step S1302. If so, the procedure is terminated.
  • The response variable y_i is set to 1 if the data belong to Group 2, whereas it is set to 0 if the data belong to Group 1.
  • The present invention may generate the basic learning machines by using the weights instead of the Bootstrap data.
  • However, the use of the Bootstrap data results in better performance.
  • For the basic learning machines, this embodiment utilizes the decision tree. Contrary to the conventional boosting algorithm, the basic learning machines in the CHEM algorithm are strong learning machines. Therefore, the optimal decision tree which has been obtained through the entire processes of constructing the decision tree is used instead of a simple decision tree. At this time, the TIC is used to overcome the calculation problem which has been already described above.
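  • The sketch below puts the pieces of FIG. 13 together for the two-class case. Because the patent's correction-parameter and weight formulas are not reproduced in the excerpt, the residual-based weights and the deviance-minimizing line search over the convex combination are stand-ins for them, ccp_alpha stands in for the TIC-based tree selection, and weighted_bootstrap refers to the earlier sketch; all of these are assumptions, not the patent's exact method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def binomial_deviance(y01, p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

def chem_two_class_fit(X, y01, M=25, rng=None):
    """Hedged sketch of the two-class CHEM loop (steps S1301-S1307); y01 coded 0/1."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y01)
    F = np.full(n, 0.5)                              # current ensemble probability estimate
    machines, lambdas = [], []
    best_dev = np.inf
    for _ in range(M):
        w = np.abs(y01 - F) + 1e-6                   # residual-based weights (assumption)
        Xb, yb, _ = weighted_bootstrap(X, y01, w / w.sum(), rng=rng)   # step S1302
        tree = DecisionTreeClassifier(ccp_alpha=0.005).fit(Xb, yb)     # step S1303: strong learner
        if len(tree.classes_) < 2:
            continue                                 # degenerate bootstrap sample; resample
        p_new = tree.predict_proba(X)[:, 1]
        grid = np.linspace(0.0, 1.0, 101) if machines else np.array([1.0])
        devs = [binomial_deviance(y01, (1 - lam) * F + lam * p_new) for lam in grid]
        lam = float(grid[int(np.argmin(devs))])      # steps S1304-S1306: convex-combination update
        F = (1 - lam) * F + lam * p_new
        machines.append(tree)
        lambdas.append(lam)
        dev = binomial_deviance(y01, F)
        if dev > best_dev:                           # step S1307: stop once the deviance stops improving
            break
        best_dev = dev
    return machines, lambdas
```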
  • CHEM algorithm in multi-class classification problem: An outline of the CHEM algorithm expanded to a multi-class classification also follows the procedure of FIG. 13. However, the CHEM algorithm for the multi-class classification problem is slightly different from the CHEM algorithm for the two-class classification problem in view of the formulas etc. applied to them. This algorithm will be described in detail below.
  • Second step: y_ij is set to 1 if the i-th data belong to the j-th group and to 0 if not.
  • A probability p_j(x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines,
  • where p_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x)), and y_ij is 1 if an i-th observed value belongs to the j-th group and 0 otherwise.
  • FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention, which will be explained in detail below.
  • After initialization in step S1401, a regression model is constructed using the basic learning machines in step S1402. Then, a correction parameter for minimizing a given deviance is calculated in step S1403.
  • In step S1404, the response variables are updated based on the correction parameter.
  • An ensemble machine is constructed based on the updated response variables in step S1405.
  • In step S1406, it is determined whether the ensemble machine satisfies the stop rules to be described later. If not, the procedure returns to step S1402. If so, the procedure is terminated.
  • An objective of this algorithm is to find out a relationship that can best explain the response variables by using the explanatory variables based on the n training data. In other words, it is to construct an optimal function H : R p — > R by using the training data.
  • The relationship H(x) = E(y | x) is established in statistics. That is, the objective of the regression analysis is to estimate a conditional expected value.
  • the CHEM algorithm for the continuous variables is as follows.
  • Second step: A regression model ψ_m(x) is obtained through the basic learning machines by setting the response variable to z_i and the explanatory variable to x_i.
  • Third step: A correction parameter γ for minimizing a value of the following formula (54) with respect to a given loss function l is calculated:
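  • A compact Python sketch of this continuous-variable loop follows. The loss l is taken to be squared error, so the correction parameter γ of formula (54) reduces to a closed-form least-squares coefficient; that choice, the tree settings, and the fixed number of rounds are assumptions made only for illustration, with X and y as NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def chem_regression_fit(X, y, M=25):
    """Sketch of the CHEM algorithm for continuous response variables (FIG. 14)."""
    n = len(y)
    F = np.full(n, y.mean())                                  # initialize the ensemble (step S1401)
    models = []
    for _ in range(M):
        z = y - F                                             # residuals of the current ensemble
        tree = DecisionTreeRegressor(ccp_alpha=0.01).fit(X, z)   # second step / step S1402
        psi = tree.predict(X)
        gamma = float(z @ psi) / (float(psi @ psi) + 1e-12)   # third step: minimizes sum (z - gamma*psi)^2
        F += gamma * psi                                      # steps S1404-S1405: update response and ensemble
        models.append((tree, gamma))
    return models
```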
  • Stop rules: Hereinafter, an algorithm for determining the number of basic learning machines required for constructing a final ensemble machine will be described. A basic idea of the stop rules is to terminate the entire algorithm without further updating a current ensemble machine when a deviance of the ensemble machine is minimum. The stop rules will be described below.
  • the positive integer K is a value defined by a user.
  • Table 2 below shows simulation data for comparing predictabilities of a case where the stop rules are not applied to the CHEM algorithm and a case where the stop rules are applied to the CHEM algorithm, with respect to various data. As shown in Table 2 below, both predictabilities are almost similar to each other. Thus, when the stop rules are applied to the algorithm, an optimal ensemble machine can be constructed using a smaller number of basic learning machines, resulting in a great increase of the computational speed.
  • association rule generating algorithm of the present invention may be applied to any ensemble machines constructed by various conventional methods.
  • In step S1501, all kinds of variables are initialized. That is, if the response variables are categorical data, a group of interest, the minimum permissible number of data, and the minimum permissible confidence are defined as g, m, and p, respectively. Further, if the response variables are of a continuous type, the range of interest is defined as (g_L, g_U) instead of the group g.
  • In step S1502, a universal set S of fundamental rules is constructed. It will be described in detail below.
  • All nodes that have a number of data larger than m and a probability of the group g larger than p are selected from all the nodes of the basic learning machines, i.e., all the basic learning machines that are used in the ensemble (in the case of continuous-type variables, the probability of the range is applied thereto).
  • The rules of the nodes selected as described above are referred to as the universal set S of fundamental rules.
  • In step S1503, for all rules that are included in the universal set S, the rules of nodes higher than a node that generates a relevant rule are cancelled from the universal set S. It will be described in detail below.
  • In step S1504, confidences are calculated for all rules that are included in the set S. At this time, the confidence for each condition is calculated by n_g/n, where the total number of data that belong to the relevant rule is n and the number of data that belong to the group of interest (i.e., the group g) is n_g.
  • In step S1505, the calculated rules of the set S are arranged. That is, the calculated rules of the set S are arranged in decreasing order of confidences. The arranged rules are referred to as o_1, ..., o_h.
  • In step S1506, assume that an association rule set is R. If R is an empty set, the rule o_h is added to the set R. If the set R is not an empty set, all of the association rules that are included in the set R are compared with the rule o_h to determine similarity. If the rule o_h is similar to none of the association rules that are included in the set R, the rule o_h is added to the set R.
  • In step S1507, using all rules that are included in the association rule set R, data are analyzed in order of confidences.
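  • The rule-extraction pass can be sketched as below. Each candidate node is represented here as a small dict with keys 'condition', 'n', 'n_g' and 'ancestors', and the similarity test of step S1506 is left to a user-supplied function, since neither the data structure nor the similarity criterion is specified in the excerpt.

```python
def extract_association_rules(candidate_rules, min_support, min_conf, similar):
    """Sketch of steps S1501-S1507 over the nodes of all basic learning machines in the ensemble."""
    def qualifies(r):                                # step S1502: enough data and high enough confidence
        return r['n'] >= min_support and r['n_g'] / r['n'] >= min_conf

    # Step S1503: drop a qualifying rule if it is an ancestor of another qualifying rule.
    ancestors_of_kept = set()
    for r in candidate_rules:
        if qualifies(r):
            ancestors_of_kept.update(r['ancestors'])
    S = [r for r in candidate_rules
         if qualifies(r) and r['condition'] not in ancestors_of_kept]

    # Steps S1504-S1505: confidence n_g / n, arranged in decreasing order.
    S.sort(key=lambda r: r['n_g'] / r['n'], reverse=True)

    # Step S1506: keep a rule only if it is similar to none of the rules already accepted.
    R = []
    for r in S:
        if not any(similar(r, kept) for kept in R):
            R.append(r)
    return R                                         # step S1507: analyze data with these rules in order
```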
  • The German credit data include 700 good credit customers and 300 bad credit customers based on 1,000 pieces of data on credit transaction status, which are analyzed by using the minimum permissible number of data and the minimum permissible confidence under the same conditions for an accurate comparison of the association rules.
  • The minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 50%.
  • The minimum permissible number of data was set to 50 persons (5%) and the minimum permissible confidence was set to 85%.
  • the search results for the association rules of the CART algorithm under the above conditions were one bad credit customer group and one good credit customer group.
  • search results were 5 bad credit customer groups and 4 good credit customer groups.
  • The search of the association rules of the CHEM algorithm corresponds to a wide-range search of the association rules including the association rules of the CART algorithm.
  • The results of the CHEM algorithm provide the association rules including the results of the CART algorithm.
  • The association rules of the CHEM algorithm are represented as a set including data which are non-exclusive from one another.
  • In each of the four virtual tests, the number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
  • FIGS. 16a to 16d are graphs showing the results of the 4 virtual tests.
  • the CHEM algorithm proposed by the present invention operates very stably.
  • all the other algorithms have poorer performances as the number of trees is increased, whereas there are no such problems in the CHEM algorithm.
  • Table 5 below shows numerical values representing the results of FIGS. 16a to 16d.
  • The predictability of the CHEM algorithm is superior to those of the conventional ensemble algorithms in most models. It can also be understood that the CHEM algorithm is greatly superior to the conventional algorithms in view of the deviance in addition to the predictability. That is, the deviances of the conventional algorithms, unlike the CHEM algorithm, are continuously increased. This means that the conventional ensemble algorithms can hardly perform function estimation.
  • a probability that a response variable y will belong to a k-th group for a new explanatory variable x cannot be estimated in all the ensemble algorithms except for the CHEM algorithm.
  • Table 6 shows information on the actual data. Table 6. Information on actual data
  • FIGS. 17a to 17i are graphs showing the results of the actual data analysis.
  • The CHEM algorithm operates very stably in most cases.
  • Although the CHEM algorithm operates slightly unstably in the case of the 'Ionosphere' data, the difference in operational stability between these data and the other data is not large.
  • The CHEM algorithm is superior to the other algorithms in view of their predictabilities and stabilities. It can also be seen that, similarly to the aforementioned virtual tests, the CHEM algorithm has the smallest deviance value.
  • the CHEM algorithm has a very excellent predictability, simultaneously is very stable (it does not produce greatly wrong estimation results in any data), and enables the function estimation (i.e., the deviance is small).
  • the function estimation means probability estimation.
  • the CHEM algorithm can represent the values of response variables for given data as values with a discrete probability distribution, contrary to the conventional ensemble algorithms. This can be very usefully utilized in the analysis of the actual data.
  • Table 8 shows test results of continuous variables.
  • the Friedman model was a virtual model in which the number of training data was
  • The number of these data was 506, and the 5-fold cross validation method was used for obtaining a test error. It can be understood from the analysis results of the virtual tests and the data on the prices of houses in Boston that the CHEM algorithm operates well in the classification problem as well as the regression model. Particularly, reduction parameters should be defined by a user in the conventional LS boosting method, whereas it is hardly necessary for the user to define anything in the CHEM algorithm.
  • the proposed CHEM algorithm is very superior to the conventional ensemble construction algorithms in view of their predictabilities and operates very stably. That is, there is a remarkably reduced chance that the overfitting problem occurs.
  • The proposed CHEM algorithm overcomes a drop in analyzability which may be commonly produced in the existing ensemble construction methods, and shows better analyzability than any other data mining method by using the association rule algorithm.
  • the CHEM algorithm can be naturally applied to the continuous variable problem and thus easily applied to general industrial fields.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The apparatus for constructing a data mining model using ensemble machines comprises M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.

Description

APPARATUS AND METHOD FOR CONSTRUCTING DATA MINING MODEL
USING ENSEMBLE MACHINES
Technical Field
The present invention relates to an apparatus and method for constructing a data mining model using ensemble machines, and more particularly, to an apparatus and method for constructing a data mining model wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
Background Art
1. Introduction
In data mining, many ensemble machines based on the Bagging algorithm first proposed by Breiman have been studied until now.
That is, studies have been performed to develop the Bagging algorithm further. As results of such studies, there are the Boosting algorithm proposed by Freund and Schapire, the Arcing algorithm proposed by Breiman, the random forest algorithm proposed by Breiman, and the like.
Various improved algorithms are based on the boosting algorithm proposed by Freund and Schapire among these ensemble machines because of its superior predictability. As for such improved algorithms, there are a Real Boosting algorithm proposed by Schapire and Singer and a Gradient boosting algorithm proposed by Friedman. That is, the conventional ensemble algorithms for use in the data mining are mostly based on the boosting algorithm.
Hereinafter, various conventional boosting algorithms which have been applied to classification problems will be briefly explained. In particular, the real boosting algorithm, a Logit boosting algorithm and the gradient boosting algorithm, which are currently used most widely, will be explained.
2. Conventional boosting algorithms applicable to two-class classification problem
The boosting algorithm is a method which is generally used in a classification problem. A basic model of the classification problem is as follows:
Assume that n training data (x_1, y_1), ..., (x_n, y_n) are given. Here, x_i is a p-dimensional explanatory variable, i.e. x_i = (x_1i, ..., x_pi); and a response variable y_i designates the group in which the data are included. That is, if there are J groups, y_i has one integer value from 1 to J.
An objective of the classification problem is to find out a relationship in which the response variable can be best explained based on the explanatory variable using the n training data. In other words, the objective is to obtain an optimal function H: R^p → {1, 2, ..., J} using the training data. Then, when a new explanatory variable x is given, the data is assigned to Group H(x).
First, a case where there are provided two groups, that is, J=2 is considered. A case where there are provided multiple groups will be explained later. Various boosting algorithms will be now explained below.
2-1. Real boosting algorithm (Schapire and Singer, 1999)
The real boosting algorithm is most representative of the algorithms for constructing an ensemble machine for data mining. This algorithm is described in detail in United States Patent No. 5,819,247 entitled
"Apparatus and methods for machine learning hypotheses," and will be specifically explained as follows.
FIG. 1 is a flowchart illustrating the conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 1.
(1) Step S101: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S102: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n.
(3) Step S103: A probability p_m that the value of the response variable will be 1 with respect to the explanatory variable x is estimated through basic learning machines using the n training data (x_1, y_1), ..., (x_n, y_n) and the weights w_1, ..., w_n. The probability p_m is determined according to the following formula (1):

p_m(x) = P_w( y = 1 | x )    (1)

(4) Step S104: f_m(x) is obtained by transforming the probability p_m(x). The f_m(x) is determined according to the following formula (2):

f_m(x) = (1/2) log( p_m(x) / (1 − p_m(x)) )    (2)
(5) Step S105: New weights are obtained from the following formula (3) and then normalized to satisfy the relationship Σ_{i=1}^{n} w_i = 1:

w_{i,new} = w_{i,old} exp( −y_i f_m(x_i) ),   i = 1, ..., n    (3)
(6) Step S106: M basic learning machines are generated by repeating steps S103 to S105 M times (m = 1, ..., M).
(7) Step S107: A final ensemble machine is determined according to the following formula (4). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
H(x) = Σ_{m=1}^{M} f_m(x)    (4)

Thus, the final ensemble machine is constructed through the foregoing processes.
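For illustration only, the two-class real boosting loop of steps S101 to S107 can be sketched in Python as below. The use of shallow scikit-learn decision trees as the basic learning machines, the function names, and the number of rounds M are assumptions of this sketch, not part of the original description.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_boost(X, y, M=50, eps=1e-6):
    """Sketch of two-class real boosting; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step S102: w_i = 1/n
    machines = []
    for _ in range(M):
        # step S103: estimate p_m(x) = P_w(y = 1 | x) with a weighted basic learner
        stump = DecisionTreeClassifier(max_depth=2)
        stump.fit(X, y, sample_weight=w)
        p = stump.predict_proba(X)[:, list(stump.classes_).index(1)]
        p = np.clip(p, eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))            # step S104: formula (2)
        machines.append(stump)
        w = w * np.exp(-y * f)                   # step S105: formula (3)
        w = w / w.sum()
    return machines

def real_boost_predict(machines, X, eps=1e-6):
    """Step S107: H(x) = sum_m f_m(x); Group 2 if H(x) > 0, else Group 1."""
    H = np.zeros(len(X))
    for stump in machines:
        p = np.clip(stump.predict_proba(X)[:, list(stump.classes_).index(1)], eps, 1 - eps)
        H += 0.5 * np.log(p / (1 - p))
    return np.where(H > 0, 2, 1)
```

For example, real_boost_predict(real_boost(X, y), X_new) would assign each row of X_new to Group 1 or Group 2.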
2-2. Logit boosting algorithm (Friedman et al., 2000)
FIG. 2 is a flowchart illustrating the conventional Logit boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 2.
(1) Step S201: A response variable y_i is set either to 1 if the data belong to Group 2 or to 0 if the data belong to Group 1.
(2) Step S202: n weights w_1, ..., w_n are initialized by setting the weights in accordance with the relationship w_i = 1/n, and F(x) and p(x) are also initialized to 0 and 1/2, respectively.
(3) Step S203: A new response variable z_i and a new weight w_i with respect to the given explanatory variable are obtained from the following formula (5):

z_i = ( y_i − p(x_i) ) / ( p(x_i) (1 − p(x_i)) ),   w_i = p(x_i) (1 − p(x_i))    (5)
(4) Step S204: A regression model f_m(x) is constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S205: The probability is updated using the f_m(x) constructed in step S204, as expressed in the following formula (6):

F(x) = F(x) + (1/2) f_m(x),   p(x) = exp( F(x) ) / ( exp( F(x) ) + exp( −F(x) ) )    (6)
(6) Step S206: Processes of updating the probability are performed M times by repeating steps S203 to S205 M times.
(7) Step S207: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0.
Otherwise, the new explanatory variable x is assigned to Group 1.
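Under the same caveats (a rough sketch, with shallow scikit-learn regression trees standing in for the basic learning machines and hypothetical function names), steps S201 to S207 can be written as:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logit_boost(X, y, M=50, eps=1e-6):
    """Sketch of two-class Logit boosting; y takes values in {0, 1}."""
    n = len(y)
    F = np.zeros(n)                              # step S202: F(x) = 0
    p = np.full(n, 0.5)                          # step S202: p(x) = 1/2
    trees = []
    for _ in range(M):
        w = np.clip(p * (1 - p), eps, None)      # step S203: weights, formula (5)
        z = (y - p) / w                          # step S203: working response, formula (5)
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)          # step S204: weighted regression model f_m
        f = tree.predict(X)
        F = F + 0.5 * f                          # step S205: formula (6)
        p = np.exp(F) / (np.exp(F) + np.exp(-F))
        trees.append(tree)
    return trees

def logit_boost_predict(trees, X):
    """Step S207: H(x) = F(x); Group 2 if H(x) > 0, else Group 1."""
    F = 0.5 * sum(tree.predict(X) for tree in trees)
    return np.where(F > 0, 2, 1)
```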
2-3. Gradient boosting algorithm (Friedman, 2001)
FIG. 3 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to a case where there are provided two groups in a classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 3.
(1) Step S301: A response variable y_i is set either to 1 if the data belong to Group 2 or to -1 if the data belong to Group 1.
(2) Step S302: Various kinds of variables are initialized according to the following formula (7):

F(x) = (1/2) log( (1 + ȳ) / (1 − ȳ) ),  where ȳ is the mean of y_1, ..., y_n    (7)
(3) Step S303: A new response variable z_i with respect to the given explanatory variable is calculated from the following formula (8):

z_i = 2 y_i / ( 1 + exp( 2 y_i F(x_i) ) )    (8)
(4) Step S304: f_m(x) is estimated using the response variable z_i, the explanatory variable x_i, and the basic learning machines (regression decision trees).
(5) Step S305: An estimated value γ_l of the l-th terminal node of f_m(x) is estimated based on the following formula (9):

γ_l = Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (2 − |z_i|)    (9)

where R_l is the set of the data belonging to the l-th terminal node.
(6) Step S306: The F(x) is updated in accordance with the formula F(x) = F(x) + f_m(x).
(7) Step S307: The F(x) is updated M times by repeating steps S303 to S306 M times.
(8) Step S308: A final ensemble machine is constructed by using the relationship H(x) = F(x). At this time, a new explanatory variable x is assigned to Group 2 if H(x) > 0. Otherwise, the new explanatory variable x is assigned to Group 1.
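A similar hedged sketch of steps S301 to S308, again assuming small scikit-learn regression trees as the basic learning machines and illustrative function names:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50):
    """Sketch of two-class gradient boosting; y takes values in {-1, +1}."""
    y_bar = y.mean()
    F0 = 0.5 * np.log((1 + y_bar) / (1 - y_bar))           # step S302: formula (7)
    F = np.full(len(y), F0)
    stages = []
    for _ in range(M):
        z = 2 * y / (1 + np.exp(2 * y * F))                # step S303: formula (8)
        tree = DecisionTreeRegressor(max_depth=2)          # step S304
        tree.fit(X, z)
        leaf = tree.apply(X)
        gamma = {}
        for l in np.unique(leaf):                          # step S305: formula (9)
            zl = z[leaf == l]
            gamma[l] = zl.sum() / (np.sum(np.abs(zl) * (2 - np.abs(zl))) + 1e-12)
        F = F + np.array([gamma[l] for l in leaf])         # step S306
        stages.append((tree, gamma))
    return F0, stages

def gradient_boost_predict(F0, stages, X):
    """Step S308: H(x) = F(x); Group 2 if H(x) > 0, else Group 1."""
    F = np.full(len(X), F0)
    for tree, gamma in stages:
        F += np.array([gamma[l] for l in tree.apply(X)])
    return np.where(F > 0, 2, 1)
```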
In the meantime, various kinds of basic learning machines can be utilized in the aforementioned real boosting algorithm and Logit boosting algorithm, whereas the gradient boosting algorithm should utilize the decision trees as basic learning machines.
Further, when the basic learning machine is to be constructed in the algorithms, the complexity of the learning machine should be determined beforehand. In a case where the basic learning machine is the decision tree, the complexity of the learning machine can be controlled by adjusting the number of the terminal nodes. However, since the basic learning machines used in the boosting algorithms are weak ones, the complexity thereof should be minimized. In general, the decision tree with two to eight terminal nodes is widely used as a basic learning machine.
3. Conventional boosting algorithms applicable to multi-class classification
The aforementioned three boosting algorithms have been applicable to cases where there are two groups. However, in a case where there are three or more groups, the algorithms should be expanded and modified accordingly. Hereinafter, the real boosting algorithm, Logit boosting algorithm, and gradient boosting algorithm applicable to the multi-class classification will be explained.
3-1. Real boosting algorithm applicable to multi-class classification
FIG. 4 shows a flowchart illustrating the conventional real boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 4.
(1) Step S401: J new data ((x_i, 1), y_i1), ..., ((x_i, J), y_iJ) are generated from the given training data (x_i, y_i). At this time, y_ik is 1 if y_i is k; otherwise, y_ik is -1.
(2) Step S402: A final ensemble machine is constructed according to the following formula (10) by applying the two-class real boosting algorithm to the (n × J) new data (i.e., by transforming the multi-class classification into the two-class classification):

H(x, j) : R^p × {1, ..., J} → R    (10)
(3) Step S403: A new explanatory variable x is assigned to Group argmax_j H(x, j). However, according to the real boosting algorithm applicable to the above multi-class classification, the number of executions, i.e. the amount of calculation, corresponds to the number of boosting iterations over the expanded data. Thus, there is a problem in that the real boosting algorithm is very time-consuming.
3-2. Logit boosting algorithm applicable to multi-class classification
FIG. 5 shows a flowchart illustrating the conventional Logit boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 5.
(1) Step S501: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S502: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S503: A new response variable z_i and a weight w_i are determined according to the following formula (11):

z_i = ( y_i* − p_j(x_i) ) / ( p_j(x_i) (1 − p_j(x_i)) ),   w_i = p_j(x_i) (1 − p_j(x_i))    (11)
(4) Step S504: A regression model f_mj(x) is calculated and constructed with reference to the basic learning machines and using the response variable z_i, the explanatory variable x_i, and the weight w_i.
(5) Step S505: Steps S502 to S504 are repeated J times.
(6) Step S506: The j-th F_j(x) is updated according to the following formula (12):

f_mj(x) ← ((J − 1)/J) ( f_mj(x) − (1/J) Σ_{k=1}^{J} f_mk(x) ),   F_j(x) = F_j(x) + f_mj(x)    (12)
(7) Step S507: The j-th probability p_j(x) is updated according to the following formula (13):

p_j(x) = exp( F_j(x) ) / Σ_{k=1}^{J} exp( F_k(x) )    (13)

(8) Step S508: Steps S502 to S507 are repeated M times.
(9) Step S509: A new explanatory variable x is assigned to Group argmax_j F_j(x).
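An illustrative sketch of steps S501 to S509 under the same assumptions as the earlier sketches (shallow scikit-learn regression trees as the basic learning machines; y coded as 1, ..., J):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def multiclass_logit_boost(X, y, J, M=20, eps=1e-6):
    """Sketch of J-class Logit boosting; y takes values in {1, ..., J}."""
    n = len(y)
    F = np.zeros((n, J))                                        # step S501: F_j(x) = 0
    rounds = []
    for _ in range(M):
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)    # formula (13)
        f = np.zeros((n, J))
        stage = []
        for j in range(J):
            y_star = (y == j + 1).astype(float)                 # step S502
            w = np.clip(P[:, j] * (1 - P[:, j]), eps, None)
            z = (y_star - P[:, j]) / w                          # step S503: formula (11)
            tree = DecisionTreeRegressor(max_depth=2)           # step S504
            tree.fit(X, z, sample_weight=w)
            f[:, j] = tree.predict(X)
            stage.append(tree)
        # step S506: formula (12) -- center the stage functions and update F_j
        f = ((J - 1) / J) * (f - f.mean(axis=1, keepdims=True))
        F = F + f
        rounds.append(stage)
    # step S509: a new x is assigned to Group argmax_j F_j(x)
    return rounds
```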
3-3. Gradient boosting algorithm applicable to multi-class classification
FIG. 6 shows a flowchart illustrating the conventional gradient boosting algorithm applicable to the multi-class classification problem. The algorithm will be hereinafter explained in detail with reference to FIG. 6.
(1) Step S601: Various kinds of variables are initialized. That is, it is set that F_j(x) = 0 and p_j(x) = 1/J, where j = 1, 2, ..., J.
(2) Step S602: It is set that y_i* = 1 if the i-th data belong to Group j, whereas y_i* = 0 if the i-th data do not belong to Group j.
(3) Step S603: A new response variable z_i is determined according to the following formula (14):
z_i = y_i* − p_j(x_i)    (14)
(4) Step S604: A regression model f_mj(x) is constructed with reference to the basic learning machines (regression decision trees) and using the response variable z_i and the explanatory variable x_i.
(5) Step S605: An estimated value γ_l of the l-th terminal node in the regression model is estimated based on the following formula (15):

γ_l = ((J − 1)/J) Σ_{x_i ∈ R_l} z_i / Σ_{x_i ∈ R_l} |z_i| (1 − |z_i|)    (15)

where R_l is the set of the data belonging to the l-th terminal node.
(6) Step S606: F_j(x) is determined according to the following formula (16):

F_j(x) = F_j(x) + Σ_l γ_l 1( x ∈ R_l )    (16)

(7) Step S607: Steps S602 to S606 are repeated J times (j = 1, ..., J).
(8) Step S608: Steps S602 to S607 are repeated M times (m = 1, ..., M).
(9) Step S609: A new explanatory variable x is assigned to Group argmax_j F_j(x).
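Steps S601 to S609 admit an analogous sketch; the tree depth, the small guard added to the denominator of formula (15), and the function name are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def multiclass_gradient_boost(X, y, J, M=20):
    """Sketch of J-class gradient boosting; y takes values in {1, ..., J}."""
    n = len(y)
    F = np.zeros((n, J))                                        # step S601: F_j(x) = 0
    rounds = []
    for _ in range(M):
        P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)    # p_j(x) from the current F_j
        stage = []
        for j in range(J):
            y_star = (y == j + 1).astype(float)                 # step S602
            z = y_star - P[:, j]                                # step S603: formula (14)
            tree = DecisionTreeRegressor(max_depth=2)           # step S604
            tree.fit(X, z)
            leaf = tree.apply(X)
            gamma = {}
            for l in np.unique(leaf):                           # step S605: formula (15)
                zl = z[leaf == l]
                gamma[l] = ((J - 1) / J) * zl.sum() / (np.sum(np.abs(zl) * (1 - np.abs(zl))) + 1e-12)
            F[:, j] += np.array([gamma[l] for l in leaf])       # step S606: formula (16)
            stage.append((tree, gamma))
        rounds.append(stage)
    # step S609: a new x is assigned to Group argmax_j F_j(x)
    return rounds
```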
4. Summary of problems in the prior art
A basic idea of the aforementioned boosting algorithms is to develop a new strong learning machine by constructing several weak learning machines and then combining them. However, such an idea involves several problems which are summarized as follows.
First, it is difficult to analyze the constructed models. Second, it is not easy for a general user to define a number of tuning parameters, such as the size of a decision tree, the number of the decision trees and the like. Third, there is a tendency toward overfitting of data (Ridgeway, 2000). Fourth, if a response variable is continuous, it is not easy to use the algorithms. Fifth, it is not easy to estimate probabilities in a classification problem. That is, when new data are inputted, the classification probability of the inputted new data cannot be estimated.
Sixth, it is not easy to verify the stability of the boosting algorithm, due to the lack of a general (statistical) theory of the boosting algorithm.
Disclosure of the Invention
The present invention is conceived to solve the above problems in the prior art. Accordingly, an object of the present invention is to provide an apparatus and method for constructing an ensemble machine wherein faster computational speed, more stable model construction, and superior analyzability and predictability can be easily obtained as compared with a conventional method when analyzing and predicting multidimensional huge data.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines, comprising: M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then, constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines, comprising: a step of constructing M ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine constructing step constructs a first learning machine from multidimensional training data to be inputted, and then, constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine constructing step receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
According to another aspect of the present invention, there is provided an apparatus for constructing a data mining model in two-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weight, probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided method for constructing a data mining model in two-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weight, probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; and a fifth step for updating and reconstructing learning machines by using the correction parameter calculated from the fourth step.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model in multi-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model in multi-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; a fifth step for updating and reconstructing learning machines by using the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the updated learning machines.
According to an aspect of the present invention, there is provided an apparatus for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: an input means for receiving multidimensional training data; a residual calculation means for calculating residuals of the ensemble machines; a regression model construction means for constructing regression models from the data inputted by the input means through basic learning machines and the residuals; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the regression model construction means; a learning machine update means for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the reconstructed learning machines.
According to an aspect of the present invention, there is provided a method for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: a first step for receiving multidimensional training data; a second step for calculating residuals of the ensemble machines; a third step for constructing regression models from the data inputted by the first step through basic learning machines and the residuals; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the third step; a fifth step for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the reconstructed learning machines. More specifically, the inputted training data is generated into Bootstrap data by using the weights.
More specifically, the learning machines are decision trees or neural network models.
More specifically, the final ensemble machine is constructed when a deviance of a current ensemble machine is smallest.
Brief Description of the Drawings
The above and other objects and features of the present invention will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart illustrating a conventional real boosting algorithm applicable to a case where there are provided two groups in a classification problem;
FIG. 2 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case where there are provided two groups in the classification problem; FIG. 3 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case where there are provided two groups in the classification problem;
FIG. 4 is a flowchart illustrating a conventional real boosting algorithm applicable to a case of a multi-class classification problem;
FIG. 5 is a flowchart illustrating a conventional Logit boosting algorithm applicable to a case of the multi-class classification problem;
FIG. 6 is a flowchart illustrating a conventional gradient boosting algorithm applicable to a case of the multi-class classification problem;
FIG. 7 is a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention;
FIG. 8 is a flowchart schematically illustrating a cross validation process for selecting an optimal decision tree;
FIG. 9 is a flowchart illustrating an entire outline of a process for selecting the optimal decision tree through tree information criteria (TIC);
FIG. 10 is a flowchart showing a process for generating Bootstrap data according to another preferred embodiment of the present invention; FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from a true learning machine among the learning machines included in the conventional set of basic learning machines;
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention; FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention;
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention;
FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention; FIG. 15 is a flowchart illustrating an outline of an algorithm for generating association rules;
FIGS. 16a to 16d are graphs showing the results of virtual tests for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention; and FIGS. 17a to 17i are graphs showing the results of actual data analysis for validating performances of the conventional ensemble methods and the ensemble method proposed by the present invention.
Best Mode for Carrying Out the Invention
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
1. Introduction
An algorithm for constructing a data mining model proposed by the present invention is a novel algorithm of which predictability is superior to the boosting algorithms while solving the aforementioned problems in the boosting algorithms. Further, in a case where a decision tree is employed as a basic learning machine, the algorithm of the present invention has a further improved analyzability as compared with a decision tree, which is known to have the most superior analyzability among conventional data mining techniques (of course, the decision tree has very poor predictability). In particular, according to the algorithm of the present invention, various kinds of association rules capable of describing given data can be extracted, and accordingly, the given data can be analyzed in various respects.
Theoretically, the algorithm proposed by the present invention is completely different from the boosting algorithm in view of their basic ideas. The boosting algorithm is to develop a strong learning machine by combining several weak learning machines, whereas the algorithm of the present invention is to develop a stronger learning machine by combining several strong learning machines.
A basic learning machine used in an ensemble scheme of the present invention will be explained by way of example of the decision tree used as the learning machine. The reason is that the decision tree has advantages such as high analyzability and simplicity of the algorithm. However, the basic learning machine can be differently selected depending on data types. For example, a neural network model may be utilized in case of character or voice recognition.
In order to utilize the decision tree as the basic learning machine, fast computational speed is essentially required. In general, a CART algorithm proposed by
Breiman et al. in 1984 is currently well-known as a decision tree construction algorithm. The CART algorithm comprises three steps of growing the decision tree, pruning, and selecting the optimal decision tree.
At this time, the third step of selecting the optimal decision tree utilizes a cross validation method which requires a large amount of computation. The amount of calculation required for the cross validation is not a burden when generating a single decision tree. However, the ensemble scheme requires generating several decision trees.
Thus, if the cross validation method is to be applied to all the decision trees, an enormous increase of the amount of calculation will be inevitable.
In order to overcome the problem of the cross validation, the present invention defines and utilizes a new parameter of tree information criteria (TIC) so that the optimal decision tree can be constructed in a short time.
However, the algorithm for constructing the data mining model proposed by the present invention is not limited to a method of selecting the decision tree using the TIC.
In other words, according to the present invention, various basic learning machines (e.g., a neural network learning machine) as well as the decision tree can be utilized as the basic learning machine. For convenience of explanation, the decision tree will be merely selected and explained herein as the basic learning machine.
FIG. 7 shows a flowchart schematically illustrating a method for constructing an ensemble machine according to a preferred embodiment of the present invention. Referring to FIG. 7, if multidimensional training data are inputted in step S701, a basic learning machine is constructed in step S702. At this time, a decision tree, a neural network model or the like can be utilized as the basic learning machine. For convenience of explanation, however, the decision tree is employed herein as the basic learning machine.
Therefore, the optimal decision tree is constructed in step S702. In such a case, the optimal decision tree can be constructed in various manners. A new scheme referred to as a tree information criteria (TIC) algorithm will be introduced into the present invention.
Then, the ensemble machine is constructed by using the constructed basic learning machines in step S703, and it is determined whether the constructed ensemble machine is optimal. If the result of determination in step S704 is that the ensemble machine is not optimal, a new response variable is generated in step S705, and then the procedure is returned to step S702.
If the result of determination in step S704 is that the ensemble machine is optimal, a final ensemble machine is constructed in step S706, and then the procedure is terminated. The present invention will be explained herein in the following order. (1) First, the TIC, i.e. the novel algorithm for constructing the optimal decision tree, which is substituted for the cross validation method, will be described in order to know how to construct the optimal decision tree (refer to step S702). (2) A method of extracting Bootstrap data for use in the present invention will be examined (refer to the first half of step S703). (3) A new algorithm using the Bootstrap data extraction method will be described (refer to the second half of step S703 and steps S704 to S706). (In such a case, it is not necessary to utilize the TIC algorithm. That is, it is not necessary to select the decision tree as the basic learning machine.) (4) An algorithm for generating association rules will be examined. (5) Performances of various algorithms will be compared with one another through virtual tests, and then, the performances will be finally compared against actual data.
2. Tree Information Criteria (TIC)
A TIC method for solving the problems of the cross validation method which has been generally utilized when determining the optimal decision tree will be explained. First, general algorithms for generating the decision trees will be described. Then, a conventional cross validation algorithm will be examined. Finally, an algorithm for selecting the optimal decision tree using the TIC proposed by the present invention will be explained.
2-1. Algorithm for constructing decision tree (Breiman et al., 1984)
The algorithm for constructing the decision tree, which was proposed by Breiman, can be roughly divided into three steps.
A first step is a growth algorithm for generating a largest decision tree with respect to given data. A second step is a pruning algorithm for generating a plurality of nested decision trees by pruning unnecessary branches from the huge decision tree constructed through the growth algorithm. At this time, the constructed decision trees become smaller and smaller.
A third step is an algorithm for selecting the optimal decision tree among the decision trees which have been obtained from the pruning algorithm.
2-2. Cross validation algorithm for selecting optimal decision tree (k-fold cross validation)
FIG. 8 shows a flowchart schematically illustrating the process of a cross validation method for selecting the optimal decision tree. The process will be explained in detail below.
Assume that the decision trees generated from the multidimensional data through the growth algorithm and the pruning algorithm are T_1, ..., T_m, and that e_i is a cross validation error of T_i.
(1) Step S801: Various kinds of variables are initialized. That is, it is set that e_i = 0, where i = 1, 2, ..., m.
(2) Step S802: The given n training data are equally divided into k parts so that mutually exclusive data D_1, D_2, ..., D_k can be generated.
(3) Step S803: D_i is set as the test data, and the other data are set as the training data.
(4) Step S804: By using the training data, the nested decision trees are constructed (through the growth and pruning algorithms).
(5) Step S805: By using the test data D_i, a prediction error is calculated with respect to each of the nested decision trees.
(6) Step S806: A decision tree, which is closest to one decision tree T_j among the constructed decision trees, is selected. At this time, since the decision tree selection algorithm was fully disclosed by Breiman et al. (1984), the description thereof will be omitted herein.
(7) Step S807: The prediction error of the decision tree obtained from step S806 is added to e_j.
(8) Step S808: Steps S806 and S807 are repeated m times (j = 1, ..., m).
(9) Step S809: Steps S803 to S808 are repeated k times (i = 1, ..., k).
(10) Step S810: e_1, ..., e_m are referred to as the cross validation errors of the decision trees T_1, ..., T_m, respectively. The decision tree whose cross validation error is smallest is selected as the optimal decision tree.
In the meantime, such a cross validation algorithm is also referred to as the k-fold cross validation algorithm, in which the 5-fold or 10-fold cross validation method is generally utilized.
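For orientation only, the following sketch selects a pruned decision tree by k-fold cross validation using scikit-learn's cost-complexity pruning path, which yields the same kind of nested subtree sequence T_1, ..., T_m; indexing the subtrees by the pruning parameter alpha approximates the tree-matching of step S806 and is an assumption of this sketch rather than the exact procedure of Breiman et al.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def select_tree_by_cv(X, y, k=5):
    """Grow a large tree, derive its nested pruned subtrees, and pick one by k-fold CV."""
    # Growth and pruning steps: the pruning path gives the nested subtree sequence.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in np.unique(path.ccp_alphas):
        # Each alpha corresponds to one nested subtree.
        model = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        # Steps S802-S809: k-fold cross validation accuracy of this subtree.
        score = cross_val_score(model, X, y, cv=k).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    # Step S810: refit the subtree with the smallest cross validation error on all the data.
    return DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```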
In the aforementioned cross validation algorithm for constructing the optimal decision tree, the decision trees should be constructed many times. Thus, if the data are huge, there are problems in that a very long calculation time is required and the calculation result varies depending on how the data are divided.
In order to solve the problems, such as enormous increase of the amount of calculation and instability of the calculation result, which are inherent in the cross validation method, the TIC algorithm of the present invention is newly proposed and will be hereinafter described.
2-3. TIC algorithm for selecting optimal decision tree
An objective of the TIC algorithm is to select the optimal decision tree among several candidate decision trees T_1, ..., T_m. At this time, the posterior probability of each of the respective decision trees is calculated, and then the decision tree of which posterior probability is largest is selected as the optimal decision tree.
The posterior probability means a probability of each tree with respect to the given data. That is, the posterior probability of the tree T_i will be Pr(T_i | D_n) with respect to the given data D_n = {(y_1, x_1), ..., (y_n, x_n)}.
FIG. 9 shows a flowchart illustrating an entire outline of a method for selecting the optimal decision tree using the TIC. The method will be explained in detail below.
If the multidimensional training data are first inputted in step S901, the largest decision tree is constructed based on the training data in step S902. Then, in step S903, the constructed largest decision tree is newly generated into nested decision trees through the pruning technique. Thereafter, each posterior probability of the respective decision trees is calculated in step S904, the decision tree of which posterior probability is largest is then selected in step S905, and a single optimal decision tree is finally selected in step S906.
Hereinafter, the method for selecting such an optimal decision tree will be more specifically explained.
First, a general method for calculating the posterior probability will be described. The posterior probability is defined as Pr(T_i | D_n) = c Pr(D_n | T_i) Pr(T_i) according to the Bayesian theorem, where Pr(D_n | T_i) is the probability of the data in a case where the model is T_i, Pr(T_i) is a probability arbitrarily determined by a user when the user examines the data, and c is a constant for establishing the relationship Σ_{i=1}^{m} Pr(T_i | D_n) = 1.
In the meantime, since an objective of calculating the posterior probability is to determine the decision tree of which posterior probability is largest, the constant c need not be obtained. Thus, the following formula (17) can be simply utilized to calculate the posterior probability:

Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i)    (17)

Now, Pr(D_n | T_i) will be calculated.
Since the data are independent of one another, the following formula (18) is established:
Pr(D_n | T_i) = Π_{k=1}^{n} Pr( y_k, x_k | T_i )    (18)

Further, the formula (18) may be rewritten as the following formula (19):
Pr( y_k, x_k | T_i ) = Pr( y_k | T_i, x_k ) Pr( x_k | T_i )    (19)
Here, since the tree T_i is used to represent the probability structure of y_k with respect to a given input x_k, Pr(x_k | T_i) is not dependent on T_i. That is, Pr(x_k | T_i) = Pr(x_k). Therefore, once Pr(y_k | T_i, x_k) is calculated, Pr(D_n | T_i) is obtained accordingly.
Furthermore, similarly to the constant c, since Pr(x_k) is a value which is commonly applied to all the trees, it is not used to find out the tree of which posterior probability is largest. Thus, in consideration of the foregoing, the above formula (17) is rewritten and expressed as the following formula (20):

Pr(T_i | D_n) ∝ ( Π_{k=1}^{n} Pr( y_k | T_i, x_k ) ) Pr(T_i)    (20)

Pr(y_k | T_i, x_k) is calculated in the following manner.
Assume that the set of terminal nodes of T_i is expressed as T̃_i and the probabilities of the J groups with respect to the given h-th terminal node are expressed as (p_1h, ..., p_Jh). Then, if the given input variable x_k belongs to the h-th terminal node of the tree T_i, the following formula (21) is established:

Pr( y_k | T_i, x_k ) = p_{y_k h}    (21)

where y_k represents the group to which the data belong.
Using the foregoing, the following formula (22) is established:

Pr( D_n | T_i ) ∝ Π_{h ∈ T̃_i} Π_{j=1}^{J} p_jh^{n_jh}    (22)

where n_jh is the number of the data, belonging to Group j, among the data included in the h-th terminal node.
Since the probabilities (p_1h, ..., p_Jh) of the respective terminal nodes are unknown variables, they are removed by using expected values. A distribution of the probabilities (p_1h, ..., p_Jh) is needed for obtaining the expected values. Assume that the distribution is π(p_1h, ..., p_Jh). Then, the following formula (23) is established:

∫ Π_{j=1}^{J} p_jh^{n_jh} π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (23)

Here, various types of distributions can be utilized for π(p_1h, ..., p_Jh). If a general distribution is utilized, the following formula (24) is established:
Pr( D_n | T_i ) ∝ Π_{h ∈ T̃_i} ∫ Π_{j=1}^{J} p_jh^{n_jh} π(p_1h, ..., p_Jh) dp_1h ... dp_Jh    (24)
Further, if a uniform distribution is utilized, the following formula (25) is established:
Pr( D_n | T_i ) ∝ Π_{h ∈ T̃_i} ( Π_{j=1}^{J} Γ(n_jh + 1) ) / Γ(n_h + J)    (25)

where n_h = Σ_{j=1}^{J} n_jh.
Furthermore, the uniform distribution is defined as the following formula (26):
π(p_1h, ..., p_Jh) = (J − 1)!   for p_jh ≥ 0 and Σ_{j=1}^{J} p_jh = 1    (26)

Hereinafter, a method for determining the prior probability of a tree, Pr(T_i), will be explained.
Pr(T_i) is not obtained from the data but is inputted by the user. Pr(T_i) for the TIC is constructed as follows.
First, a probability that a given h-th node will be an intermediate node (that is, a node from which successive branching occurs) is defined as the following formula (27):

α (1 + f_h)^(−β)    (27)

where f_h is the number of ancestor nodes of the given node, and the constants α and β are determined by the user. Then, a probability that a given node will be a terminal node is naturally determined as the following formula (28):
1 − α (1 + f_h)^(−β)    (28)
The prior probability of a given tree Ti under the conditions of the formula (28) is expressed as the following formula (29):
Pr(T_i) = Π_{h ∈ T_i − T̃_i} α (1 + f_h)^(−β) · Π_{h ∈ T̃_i} ( 1 − α (1 + f_h)^(−β) )    (29)

where T_i − T̃_i is the set of the intermediate nodes (i.e., all the nodes which are not terminal nodes). Next, the TIC will be calculated using the aforementioned matters.
When the above formulas are combined and rearranged, the following formula (30) is finally obtained. At this time, the formula (30) uses the uniform distribution.

Pr(T_i | D_n) ∝ Pr(D_n | T_i) Pr(T_i) = Π_{h ∈ T̃_i} [ ( Π_{j=1}^{J} Γ(n_jh + 1) ) / Γ(n_h + J) ] · Π_{h ∈ T_i − T̃_i} α (1 + f_h)^(−β) · Π_{h ∈ T̃_i} ( 1 − α (1 + f_h)^(−β) )    (30)
Further, the TIC is defined as a logarithmic value of the final right side of the above formula. That is, the TIC of the tree T_i is expressed as the following formula (31):

TIC(T_i) = Σ_{h ∈ T̃_i} [ ( Σ_{j=1}^{J} log Γ(n_jh + 1) ) − log Γ(n_h + J) ] + Σ_{h ∈ T_i − T̃_i} [ log α − β log(1 + f_h) ] + Σ_{h ∈ T̃_i} log( 1 − α (1 + f_h)^(−β) )    (31)

In a case where there are two groups, i.e. J = 2, the TIC is expressed as the following formula (32):
TIC(T_i) = Σ_{h ∈ T̃_i} [ log Γ(d_h + 1) + log Γ(n_h − d_h + 1) − log Γ(n_h + 2) ] + Σ_{h ∈ T_i − T̃_i} [ log α − β log(1 + f_h) ] + Σ_{h ∈ T̃_i} log( 1 − α (1 + f_h)^(−β) )    (32)

where d_h is the number of data belonging to the second group among the data of the h-th terminal node. After the TIC defined as such has been applied to the respective decision trees T_1, ..., T_m, the decision tree having the maximum TIC is selected as the optimal decision tree.
Then, the algorithm is terminated.
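Assuming the per-node group counts n_jh and depths f_h of each pruned tree are available, formula (31) can be evaluated as in the sketch below; the values of α and β shown are illustrative defaults (the patent leaves them to the user), and the tree data structures in the final comment are hypothetical.

```python
import math

def tic(terminal_nodes, internal_nodes, alpha=0.5, beta=1.0):
    """Sketch of the TIC score of one pruned tree (formula (31)).

    terminal_nodes: list of (counts, f_h), where counts = [n_1h, ..., n_Jh] and
                    f_h is the number of ancestor nodes of the terminal node.
    internal_nodes: list of f_h values of the intermediate (non-terminal) nodes.
    """
    score = 0.0
    for counts, f_h in terminal_nodes:
        n_h = sum(counts)
        # data term: sum_j log Gamma(n_jh + 1) - log Gamma(n_h + J)
        score += sum(math.lgamma(n_jh + 1) for n_jh in counts)
        score -= math.lgamma(n_h + len(counts))
        # prior term of a terminal node: log(1 - alpha * (1 + f_h)^(-beta))
        score += math.log(1 - alpha * (1 + f_h) ** (-beta))
    for f_h in internal_nodes:
        # prior term of an intermediate node: log(alpha) - beta * log(1 + f_h)
        score += math.log(alpha) - beta * math.log(1 + f_h)
    return score

# The pruned tree with the largest TIC is selected, e.g.:
# best = max(candidate_trees, key=lambda t: tic(t.terminal_nodes, t.internal_nodes))
```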
Meanwhile, the conventional method using the Bayesian theorem and the TIC method proposed by the present invention are the same in that they utilize a posterior probability. However, they are a little different from each other in how the probabilities used when calculating the posterior probability are constructed. Such a difference has a great influence on the calculation of the posterior probabilities. That is, in the conventional method using the Bayesian theorem, the posterior probability cannot be calculated by means of numerical formulas. Thus, since the probabilities should be calculated through a computer, it takes much more time to perform the calculation as compared with the method using the cross validation.
The method for constructing the posterior probability according to the conventional method using the Bayesian theorem assigns probabilities to all possible decision trees.
However, since the number of all the possible decision trees is very large, the method for constructing the posterior probability is also greatly complicated. Accordingly, since the number of decision trees which require the calculation of the posterior probabilities is inevitably greatly increased, this leads to an enormous increase in the amount of the calculation.
However, the TIC method solves the problem of the conventional method using the
Bayesian theorem by assigning the posterior probabilities not to all the possible decision trees but only to nested decision trees induced through the pruning algorithm. Thus, there are advantages in that it is very easy to construct the prior probability and the calculation of the posterior probability is simplified.
That is, the prior probability constructing method for use in the TIC method is a method for reducing sets of the decision trees by using data. This point is the very sharp difference between the TIC method and the conventional method using the Bayesian theorem.
In summary, since it is sufficient to construct the decision tree only once in the TIC method, the computational speed is rapidly increased over the cross validation method.
Furthermore, since the results obtained with respect to the same data are always the same in the TIC method, reliability of the results is greatly superior to that of the cross validation method.
Table 1 below shows simulation results of the selection of an optimal decision tree by using the conventional 5-fold cross validation method and the TIC method proposed by the present invention, respectively. That is, these experiment data are used to compare the time required to generate a single tree through the 5-fold cross validation with that required through the TIC proposed by the present invention.
Respective pieces of the experiment data indicate the average time when the same data are repeatedly generated 500 times. The specification of the computer was a Pentium III 900 MHz processor, a 256 Mbyte main memory, and a Windows 2000 OS.
According to Table 1 below, it can be seen that the calculation time of the TIC method proposed by the present invention was about one fifth of that of the conventional 5-fold cross validation method.
Meanwhile, the simulation data correspond to standard data which have been widely known in the field of the data mining, i.e. 'Radius2,' 'Interaction,' 'Breast Cancer,' 'Ionosphere,' and 'Sonar' data. These data are the most powerful simulation data for estimating the efficiency of the data mining in the relevant art. These simulation data are shown in detail in the 'Machine Learning Web Site' (http://www1.ics.uci.edu/~mlearn/MLRepository.html) of 'UC Irvine.'
Table 1
3. Method for extracting Bootstrap data
Hereinafter, a method for extracting Bootstrap data used in the present invention will be briefly described.
First, a general algorithm for generating the Bootstrap data will be explained below. Assuming that the original data are n data, i.e. y_1, ..., y_n, the number of the Bootstrap data to be generated, m, is determined and a new data set D_B is initialized to an empty set.
Then, a constant k satisfying the relationship k ∈ [1, n] is generated using a random number generator, and y_i^B is assigned as y_k and then added to D_B. This process is repeated m times (i = 1, ..., m). FIG. 10 is a flowchart illustrating the process of generating the Bootstrap data according to one embodiment of the present invention, which will be explained below.
(1) Step S1001: Weights w_1, ..., w_n are assigned to the n multidimensional data (x_1, y_1), ..., (x_n, y_n) to be inputted. Here, x_i is a p-dimensional explanatory variable. That is, x_i = (x_1i, ..., x_pi).
(2) Step S1002: Cumulative weights of the weights w_1, ..., w_n assigned to y_1, ..., y_n are calculated. That is, the cumulative weights are calculated according to the following formula (33):

w_i* = Σ_{j=1}^{i} w_j,   i = 1, ..., n    (33)
(3) Step S1003: The number of new data to be generated, m, is determined. The number of the Bootstrap data, m, for use in the ensemble algorithm according to the present invention is set to n.
(4) Step S1004: The new data set D_B is initialized to the empty set.
(5) Step S1005: A random real number k satisfying the relationship k ∈ [0, w_n*] is generated using the random number generator.
(6) Step S1006: An integer j satisfying the relationship w_{j−1}* < k ≤ w_j* among w_1*, ..., w_n* is determined. Here, j = 1, ..., n.
(7) Step S1007: (x_i^B, y_i^B) is assigned as (x_j, y_j).
(8) Step S1008: The weight w_i^B corresponding to (x_i^B, y_i^B) is set to 1/m.
(9) Step S1009: (x_i^B, y_i^B) is added to D_B.
(10) Step S1010: The process is repeated m times (i = 1, ..., m).
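A minimal NumPy sketch of the weighted Bootstrap extraction of steps S1001 to S1010; the function name and interface are assumptions of this sketch.

```python
import numpy as np

def weighted_bootstrap(X, y, w, m=None, rng=None):
    """Draw m Bootstrap samples (x_i^B, y_i^B) according to the weights w_1, ..., w_n."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(y) if m is None else m                 # step S1003: m is set to n by default
    w_star = np.cumsum(w)                          # step S1002: cumulative weights, formula (33)
    idx = []
    for _ in range(m):                             # step S1010: repeat m times
        k = rng.uniform(0.0, w_star[-1])           # step S1005: random real number in [0, w_n*]
        j = int(np.searchsorted(w_star, k))        # step S1006: w_{j-1}* < k <= w_j*
        idx.append(j)                              # steps S1007, S1009: copy (x_j, y_j) into D_B
    idx = np.array(idx)
    w_b = np.full(m, 1.0 / m)                      # step S1008: each Bootstrap weight is 1/m
    return X[idx], y[idx], w_b
```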
4. Background and basic principle of ensemble algorithm proposed by the present invention
4-1. Introduction
Hereinafter, the background and basic principle of the novel ensemble algorithm proposed by the present invention will be described. The ensemble algorithm proposed by the present invention will be referred to as the Convex Hull Ensemble Machine (CHEM) algorithm.
The CHEM algorithm is an ensemble algorithm for generating a new learning machine by using a plurality of basic learning machines. In a classification problem (response variables are categorical) or a regression model problem (response variables are continuous), a learning problem can be changed to a function estimation problem.
A classification problem in which the response variables have J categories is a problem of estimating a J-dimensional function F = (F_1, ..., F_J). At this time, the function F is defined as the following formula (34):

P( y = k | x ) = exp( F_k(x) ) / Σ_{l=1}^{J} exp( F_l(x) )    (34)
Meanwhile, the regression model problem becomes a problem of estimating a function F(x) = E(y|x). Assume that the true value of the function F is defined as Ψ* (referred to as the "true learning machine"), and that Ψ = {Ψ_θ, θ ∈ Θ} is the set of the basic learning machines for use in the CHEM algorithm. That is, an optimal basic learning machine for given training data is selected as one of the set Ψ. Meanwhile, a method for finding the optimal learning machine has been widely known in the art. As for the set of basic learning machines Ψ for use in the data mining, there are a decision tree, a neural network model, and the like. However, such basic learning machines have a serious problem in that they respond very sensitively to changes in the data.
It is the CHEM algorithm proposed by the present invention that finds out and overcomes causes of instability of such basic learning machines. Next, the causes of the instability of the conventional basic learning machines will be discussed below.
4-2. Causes of instability of conventional basic learning machines
The reason why various algorithms used for the data mining operate unstably is that various different learning machines within the set of basic learning machines Ψ describe data in similar manners to one another.
The further basic reason why the completely different learning machines describe the data in the similar manners is that the true learning machine Ψ * is not included in the considered set of basic learning machines Ψ . Moreover, this is because there are a plurality of learning machines, which have small distances (i.e., the degree of difference) from the true learning machine ψ *, among the basic learning machines included in the set Ψ .
FIG. 11 is a basic conceptual view showing existence of the plurality of learning machines having small distances from the true learning machine among the learning machines included in the conventional set of basic learning machines. As shown in FIG. 11, various learning machines Ψ surround the true learning machine Ψ*. In such a case, even though the data change only slightly, the optimal learning machine may be greatly changed. That is, the optimal learning machine will be one of Ψ_1, Ψ_2 and Ψ_3 depending on the direction in which the data are shifted from the true learning machine Ψ*.
If the set of basic learning machines Ψ is constructed as shown in FIG. 11, it is impossible to properly construct the true learning machine Ψ* even though the optimal learning machine has been constructed well. However, the true learning machine Ψ* can be constructed by combining a plurality of learning machines. This is because the true learning machine Ψ* is located in the convex hull space of the learning machines included in the set Ψ. Particularly, if proper weights w_1, w_2 and w_3 are obtained in FIG. 11, the following formula (35) is established:

Ψ* = ( w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3 ) / ( w_1 + w_2 + w_3 )    (35)
When examining the meaning of the formula (35), it is noted that the true learning machine Ψ * can be obtained as a weighted average of several learning machines included in the set ψ . The CHEM algorithm proposed by the present invention has been developed by using such an idea.
Hereinafter, the basic principle of a method for combining learning machines according to the CHEM algorithm proposed by the present invention will be explained in detail.
4-3. Basic principle of learning-machine combining method according to CHEM algorithm when true learning machine is known
A method for constructing the true learning machine Ψ* by using the weighted average of several learning machines included in the set Ψ when the true learning machine ψ* is known and not included in the set Ψ will be now introduced.
Thereafter, a method for estimating the true learning machine Ψ* using data when the true learning machine Ψ * is not known will be introduced.
A basic assumption of the CHEM algorithm is that the true learning machine Ψ* is expressed as a weighted average of M learning machines Ψ_1, ..., Ψ_M included in the set Ψ. This is expressed as the following formula (36):

Ψ* = Σ_{i=1}^{M} w_i Ψ_i / Σ_{i=1}^{M} w_i    (36)

The CHEM algorithm is an algorithm in which the learning machines Ψ_i used for obtaining the weighted average and their weights w_i are sequentially found out. The principle of the algorithm for finding out the (k+1)-th learning machine Ψ_{k+1} and the (k+1)-th weight w_{k+1} by using the k learning machines Ψ_1, ..., Ψ_k and the k weights w_1, ..., w_k, which have already been constructed, will be explained below step by step.
First, an optimal learning machine among the learning machines orthogonal to the current ensemble machine F_k according to the formula (36) is set to Ψ_{k+1}. Second, a new ensemble machine F_{k+1} is generated according to the following formula (37). At this time, the weights u_k and w_{k+1} are obtained such that the distance between the ensemble machine F_{k+1} and the true learning machine Ψ* is minimized.

F_{k+1} = ( u_k F_k + w_{k+1} Ψ_{k+1} ) / ( u_k + w_{k+1} )    (37)
This algorithm will be explained in more detail below.
(1) Construction of first learning machine
FIG. 12a is a conceptual view showing a method for constructing a first learning machine according to the CHEM algorithm proposed by the present invention.
As shown in FIG. 12a, an optimal learning machine Ψ_1 (i.e., the learning machine located closest to the true learning machine Ψ*) is obtained.
(2) Construction of second learning machine
FIG. 12b is a conceptual view showing a method for constructing a second learning machine according to the CHEM algorithm proposed by the present invention.
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out, and the weights w_1, w_2 by which the distance between the ensemble machine F_2 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. At this time, w_1 and w_2 are 1/d_1 and 1/d_2, respectively. Meanwhile, the ensemble machine F_2 according to the formula (37) is expressed as the following formula (38):

F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 )    (38)

where d_1 is the distance between Ψ* and Ψ_1, and d_2 is the distance between Ψ* and Ψ_2.
(3) Construction of third learning machine
FIG. 12c is a conceptual view showing a method for constructing a third learning machine according to the CHEM algorithm proposed by the present invention. An optimal learning machine Ψ_3 orthogonal to F_2 is found out. Then, the weights u_2, w_3 by which the distance between the ensemble machine F_3 according to the formula (37) and the true learning machine Ψ* is shortest are obtained. Meanwhile, the ensemble machine F_3 according to the formula (37) is expressed as the following formula (39):

F_3 = ( u_2 F_2 + w_3 Ψ_3 ) / ( u_2 + w_3 )    (39)

Then, the following formula (40) is established:

F_3 = ( w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3 ) / ( w_1 + w_2 + w_3 )    (40)
(4) Construction of m-th learning machine
FIG. 12d is a conceptual view showing a method for constructing an m-th learning machine according to the CHEM algorithm proposed by the present invention.
By continuously repeating the above algorithm, the m-th ensemble machine F_m is obtained from the following formula (41):

F_m = Σ_{i=1}^{m} w_i Ψ_i / Σ_{i=1}^{m} w_i    (41)
4-4. Basic principle of learning-machine combining method according to CHEM algorithm when true learning machine is not known
In all the learning problems, the true learning machine Ψ* is not known. Instead, n training data (x_1, y_1), ..., (x_n, y_n) are given. The objective of the learning is to effectively estimate the true learning machine Ψ* by using the data. Hereinafter, how the aforementioned algorithm is constructed when the data are given will be explained.
A loss function is indicated by l, and the deviance of a given learning machine Φ is defined as the following formula (42):

d(Φ) = Σ_{i=1}^{n} l( y_i, Φ(x_i) )    (42)

In the case of categorical data, only two groups are considered.
(1) Construction of first learning machine
An optimal learning machine Ψ_1 is obtained.
(2) Construction of second learning machine
An optimal learning machine Ψ_2 orthogonal to Ψ_1 is found out. First, a residual is used for obtaining the orthogonal learning machine. The residual r_i can be obtained from the formula (43) when the response variables are categorical:

r_i = y_i − p_1(x_i),   p_1(x) = exp( Ψ_1(x) ) / ( 1 + exp( Ψ_1(x) ) )    (43)

where p_1 is the probability that y will be 1 when the first ensemble machine is used. Alternatively, the residual r_i can be obtained from the formula (44) when the response variables are continuous:

r_i = y_i − Ψ_1(x_i)    (44)
In the case of the categorical data, an optimal learning machine Ψ̃ is constructed by adopting |r_i| as a weight. In the case of the continuous data, the optimal learning machine Ψ̃ is obtained by adopting r_i as the response variable.
Meanwhile, the reason for obtaining the optimal learning machine by using the residual is that the residual has the property of being orthogonal to the response variable in a regression model. Therefore, the optimal learning machine Ψ̃ for the residual is nearly orthogonal to the optimal learning machine Ψ_1 for the response variable.
Then, after the optimal learning machine Ψ̃ is obtained by using the residual, a constant η minimizing d(η Ψ̃) is obtained. In addition, it is set that Ψ_2 = η Ψ̃.
At this time, an ensemble machine is determined from the formula (37). That is, the following formula (45) is established:

F_2 = ( w_1 Ψ_1 + w_2 Ψ_2 ) / ( w_1 + w_2 ),   w_1 = 1/d(Ψ_1),   w_2 = 1/d(Ψ_2)    (45)
Meanwhile, d(Ψ) is approximately a square of the distance between Ψ and Ψ* according to the statistical theory.
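For illustration only, the residual computation of formulas (43) and (44) can be sketched as follows in Python; the names psi1 and residuals, and the use of a logistic transform of the real-valued score, are assumptions made solely for this sketch.

```python
import math

def residuals(psi1, X, y, categorical=True):
    """Residuals of formulas (43) and (44) with respect to a fitted machine psi1.

    Categorical: r_i = (y_i - p(x_i)) / sqrt(p(x_i)(1 - p(x_i))), p(x) = exp(psi1(x)) / (1 + exp(psi1(x))).
    Continuous:  r_i = y_i - psi1(x_i).
    """
    out = []
    for x_i, y_i in zip(X, y):
        if categorical:
            p = 1.0 / (1.0 + math.exp(-psi1(x_i)))     # logistic transform of the score psi1(x_i)
            out.append((y_i - p) / math.sqrt(p * (1.0 - p)))
        else:
            out.append(y_i - psi1(x_i))
    return out
```

In the categorical case the absolute residuals |r_i| would then serve as weights for fitting the next machine, and in the continuous case the residuals themselves serve as the new response.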
(3) Construction of third learning machine
The construction of a third learning machine is identical to the construction of the second learning machine except that the residual is obtained by using F_2 instead of Ψ_1. The ensemble machine F_3 for the obtained third learning machine Ψ_3 is expressed as the following formula (46):

F_3 = (w_1 Ψ_1 + w_2 Ψ_2 + w_3 Ψ_3) / (w_1 + w_2 + w_3), where w_3 = 1/d(Ψ_3)   (46)
(4) By repeating the above algorithm, the m-th ensemble machine F_m is obtained from the following formula (47):

F_m = Σ_{i=1}^{m} w_i Ψ_i / Σ_{i=1}^{m} w_i   (47)
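A minimal sketch of the combination rule of formula (47), assuming that each learning machine is available as a callable and that its weight w_i = 1/d(Ψ_i) has already been computed, may read as follows (the names are illustrative only):

```python
def combine(machines, weights):
    """Formula (47): F_m(x) = sum_i w_i Psi_i(x) / sum_i w_i, with w_i = 1 / d(Psi_i)."""
    total = sum(weights)
    return lambda x: sum(w * psi(x) for w, psi in zip(weights, machines)) / total
```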
5. CHEM algorithm in two-class classification problem
Hereinafter, the CHEM algorithm proposed by the present invention will be explained in detail with respect to a two-class classification problem.
FIG. 13 is a flowchart illustrating an outline of a CHEM algorithm according to one embodiment of the present invention.
First, various kinds of variables are initialized in step S1301. Bootstrap data are generated by applying weights to multidimensional data in step S1302. Then, a probability that the response variable will be 1 for a given explanatory variable is estimated by using the basic learning machines in step S1303. In step S1304, a correction parameter for minimizing a given deviance is calculated. Then, a corrected learning machine is constructed by using the correction parameter in step S1305. Subsequently, an ensemble machine is constructed based on the corrected learning machine in step S1306.
Then, in step S1307, it is determined whether the ensemble machine satisfies stop rules to be described later. If it does not satisfy the stop rules, the procedure returns to step S1302. If it does, the procedure is terminated.
The CHEM algorithm will be explained in detail by means of formulas and the like.
(1) First step: The response variable y_i is set to 1 if the data belong to Group 2, whereas it is set to 0 if the data belong to Group 1.
(2) Second step: Various kinds of variables are initialized. That is, w_i = 1/n for the n weights w_1, ..., w_n, and F(x) = 0 are set.
(3) Third step: The Bootstrap data (x_1^B, y_1^B), ..., (x_n^B, y_n^B) are generated using the weights {w_i}. The generation of the Bootstrap data has already been explained above.
(4) Fourth step: A probability p_m(x) = P(y = 1 | x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: The correction parameter η for minimizing a value of the following formula (48) with respect to a given loss function l is calculated:

Σ_{i=1}^{n} l(y_i, η f_m(x_i))   (48)

where f_m(x) = log(p_m(x) / (1 - p_m(x))).
(6) Sixth step: The learning machine is newly corrected and reconstructed by using the correction parameter. This process is expressed as the following formula (49):

F(x) = Σ_{i=1}^{m} u_i f_i(x) / Σ_{i=1}^{m} u_i   (49)

where f_m(x) = η f_m(x), u_m = 1 / Σ_{i=1}^{n} l(y_i, f_m(x_i)), and the weight w_i is updated according to w_i = |(y_i - P(x_i)) / sqrt(P(x_i)(1 - P(x_i)))|, where P(x) = exp(F(x)) / (1 + exp(F(x))).
(7) Seventh step: In order to construct a final ensemble machine, the third to sixth steps are repeated M times (m = 1, ..., M). Then, a final ensemble machine is set to H(x) = F(x), and a new explanatory variable x is assigned to Group 2 if H(x) > 0 and to Group 1 if H(x) < 0.
Meanwhile, although various loss functions may be used for the above loss function, an exponential loss function or a negative log-likelihood loss function can be used in order to obtain more preferable results. The exponential loss function and the negative log-likelihood loss function are expressed as the following formulas (50) and (51), respectively:

l(y, f) = exp(-(2y - 1) f / 2)   (50)

l(y, f) = -(y f - log(1 + exp(f)))   (51)
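For illustration, the two loss functions of formulas (50) and (51) may be written as follows; the function names are placeholders chosen only for this sketch.

```python
import math

def exponential_loss(y, f):
    """Formula (50): l(y, f) = exp(-(2y - 1) f / 2) for y in {0, 1}."""
    return math.exp(-(2 * y - 1) * f / 2.0)

def neg_loglik_loss(y, f):
    """Formula (51): l(y, f) = -(y f - log(1 + exp(f)))."""
    return -(y * f - math.log(1.0 + math.exp(f)))
```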
Alternatively, the present invention may generate the basic learning machines by using the weights instead of the Bootstrap data. However, in most cases, the use of the Bootstrap data results in a more excellent performance.
Further, although various basic learning machines may be employed in the present invention, this embodiment utilizes the decision tree. Contrary to the conventional boosting algorithm, the basic learning machines in the CHEM algorithm are strong learning machines. Therefore, the optimal decision tree which has been obtained through the entire process of constructing the decision tree is used instead of a simple decision tree. At this time, the TIC is used to overcome a problem of the calculation, which has already been described above.
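The following is a simplified, non-authoritative Python sketch of the third to sixth steps for the two-class case, assuming scikit-learn's DecisionTreeClassifier as the strong base learner and a crude grid search for the correction parameter η; it only tracks the ensemble score on the training points and omits the TIC and the stop rules, so it should be read as an illustration rather than the patented implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def chem_two_class(X, y, M=50, rng=np.random.default_rng(0)):
    """Sketch of steps 3-6 on numpy arrays X (n, p) and y in {0, 1}; training scores only."""
    n = len(y)
    loss = lambda yy, ff: -(yy * ff - np.log1p(np.exp(ff)))            # negative log-likelihood, formula (51)
    w = np.full(n, 1.0 / n)                                            # step 2: initial weights
    num, den = np.zeros(n), 0.0
    F = np.zeros(n)
    for m in range(M):
        idx = rng.choice(n, size=n, replace=True, p=w / w.sum())       # step 3: weighted bootstrap sample
        tree = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])
        proba = tree.predict_proba(X)
        classes = list(tree.classes_)
        p = proba[:, classes.index(1)] if 1 in classes else np.full(n, 0.5)
        p = np.clip(p, 1e-6, 1 - 1e-6)                                 # step 4: estimated P(y = 1 | x)
        f = np.log(p / (1 - p))
        etas = np.linspace(0.01, 2.0, 50)                              # step 5: crude grid for eta
        eta = min(etas, key=lambda e: loss(y, e * f).sum())
        f_corr = eta * f                                               # step 6: corrected machine
        u = 1.0 / loss(y, f_corr).sum()
        num, den = num + u * f_corr, den + u
        F = num / den                                                  # current ensemble, formula (49)
        P = np.clip(1.0 / (1.0 + np.exp(-F)), 1e-6, 1 - 1e-6)
        w = np.abs((y - P) / np.sqrt(P * (1 - P)))                     # weight update
    return F                                                           # assign Group 2 where F > 0
```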
An algorithm for expanding the CHEM algorithm up to a more-than-two-class classification problem will be described below.
6. CHEM algorithm in multi-class classification problem

An outline of the CHEM algorithm expanded up to a multi-class classification also follows the procedure of FIG. 13. However, the CHEM algorithm for the multi-class classification problem is slightly different from the CHEM algorithm for the two-class classification problem in view of the formulas etc. applied to them. This algorithm will be described in detail below.
(1) First step: Various variables are initialized. That is, the weight w_ij is set to w_ij = 1/n, where i = 1, ..., n and j = 1, ..., J; and it is set that F_j(x) = 0, where j = 1, ..., J.
(2) Second step: y_i* is set to 1 if the i-th data belong to the j-th group and to 0 otherwise.
(3) Third step: The Bootstrap data (x_1^B, y_1*^B), ..., (x_n^B, y_n*^B) are generated using the weights {w_1j, ..., w_nj}. The generation of the Bootstrap data has already been explained above.
(4) Fourth step: A probability p_mj(x) = P(y* = 1 | x) that the response variable will be 1 for a given explanatory variable x is estimated using the Bootstrap data through the basic learning machines.
(5) Fifth step: It is set that f_mj(x) = log(p_mj(x) / (1 - p_mj(x))).
(6) Sixth step: The second to fifth steps are repeated J times (j = 1, ..., J).

(7) Seventh step: A correction parameter η for minimizing a value of the following formula (52) with respect to a given loss function l is calculated:
Σ_{i=1}^{n} l(y_i*, η f̃_m(x_i))   (52)

where f̃_mj(x) = f_mj(x) - (1/J) Σ_{k=1}^{J} f_mk(x), and f̃_m(x) = (f̃_m1(x), ..., f̃_mJ(x)).

(8) Eighth step: A learning machine is newly corrected and constructed using the correction parameter. This process is expressed as the following formula (53):

F(x) = Σ_{i=1}^{m} u_i f̃_i(x) / Σ_{i=1}^{m} u_i   (53)

where f̃_m(x) = η f̃_m(x), u_m = 1 / Σ_{i=1}^{n} l(y_i*, f̃_m(x_i)), F(x) = (F_1(x), ..., F_J(x)), and the weight w_ij is updated according to w_ij = |(y_ij* - P_j(x_i)) / sqrt(P_j(x_i)(1 - P_j(x_i)))|, where P_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x)), and y_ij* is 1 if the i-th observed value belongs to the j-th group and 0 otherwise.
(9) Ninth step: A final ensemble machine is constructed by repeating the second to eighth steps M times (m = 1, ..., M). At this time, a new explanatory variable x is assigned to the group argmax_j F_j(x).
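As an illustration of the multi-class bookkeeping only, the centering of the group scores in the seventh step and the group assignment of the ninth step may be sketched as follows; the array shapes and function names are assumptions of this sketch.

```python
import numpy as np

def center_scores(f_hat):
    """Seventh-step centering: f~_mj(x) = f_mj(x) - (1/J) sum_k f_mk(x); f_hat has shape (n, J)."""
    return f_hat - f_hat.mean(axis=1, keepdims=True)

def assign_groups(F):
    """Ninth step: a new observation is assigned to the group argmax_j F_j(x); F has shape (n, J)."""
    return np.argmax(F, axis=1)
```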
An algorithm for expanding the CHEM algorithm up to a continuous variable problem will be described below.
7. CHEM algorithm in continuous variable problem
Hereinafter, an algorithm (the regression CHEM algorithm) for constructing an ensemble machine in a case where the response variables are continuous will be explained. Such a case is called a regression model, whose basic model is as follows. FIG. 14 is a flowchart schematically illustrating an ensemble algorithm for the continuous variables according to one embodiment of the present invention, which will be explained in detail below.
First, various variables are initialized and the response variables are defined in step S1401. A regression model is constructed using basic learning machines in step S1402. Then, a correction parameter for minimizing a given deviance is calculated in step S1403. In step S1404, the response variables are updated based on the correction parameter. An ensemble machine is constructed based on the updated response variables in step S1405.
Then, in step S1406, it is determined whether the ensemble machine satisfies stop rules to be described later. If not, the procedure is returned to step S1402. If so, the procedure is terminated.
This procedure will be described in more detail below.
First, assume that n multidimensional training data (x_1, y_1), ..., (x_n, y_n) are given, where x_i is a p-dimensional explanatory variable, i.e. x_i = (x_1i, ..., x_pi), and y_i is the response variable. An objective of this algorithm is to find out a relationship that can best explain the response variables by using the explanatory variables based on the n training data. In other words, it is to construct an optimal function H : R^p -> R by using the training data. Further, if a new explanatory variable x is given, a response variable for the given explanatory variable is estimated from H(x). Meanwhile, H(x) = E(y | x) is established in the statistics. That is, the objective of the regression analysis is to estimate a conditional expected value. The CHEM algorithm for the continuous variables is as follows.
(1) First step: It is set that z_i = y_i for a new response variable.
(2) Second step: A regression model ψ_m(x) is obtained through the basic learning machines by setting the response variable to z_i and the explanatory variable to x_i.

(3) Third step: A correction parameter η for minimizing a value of the following formula (54) with respect to a given loss function l is calculated:

Σ_{i=1}^{n} l(z_i, η ψ_m(x_i))   (54)

(4) Fourth step: The response variable z_i is updated using the correction parameter η. This process is expressed as the following formula (55):

z_i = y_i - F(x_i)   (55)

where ψ_m(x) = η ψ_m(x), u_m = 1 / Σ_{i=1}^{n} l(y_i, ψ_m(x_i)), and F(x) is updated according to F(x) = Σ_{i=1}^{m} u_i ψ_i(x) / Σ_{i=1}^{m} u_i.
(5) Fifth step: The second to fourth steps are repeated M times (m = 1, ..., M).

(6) Sixth step: A final ensemble machine is constructed such that H(x) = F(x).
Meanwhile, as for the loss function l, various known loss functions such as l(y, T) = (y - T)^2, l(y, T) = |y - T| and the like may be used.
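A simplified sketch of the regression loop (second to sixth steps) under a squared-error loss is given below; scikit-learn's DecisionTreeRegressor, the η grid, and the depth limit are assumptions made only for illustration, and the sketch keeps the fitted values on the training data rather than a predictor for new points.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def chem_regression(X, y, M=50):
    """Sketch of the regression loop on numpy arrays X (n, p) and y; training fits only."""
    n = len(y)
    z = y.astype(float).copy()                                         # step 1: working response z_i = y_i
    num, den = np.zeros(n), 0.0
    F = np.zeros(n)
    for m in range(M):
        psi = DecisionTreeRegressor(max_depth=5).fit(X, z)             # step 2: base regression model
        pred = psi.predict(X)
        etas = np.linspace(0.01, 2.0, 50)                              # step 3: crude grid for eta
        eta = min(etas, key=lambda e: np.sum((z - e * pred) ** 2))
        corrected = eta * pred                                         # corrected machine psi_m
        u = 1.0 / np.sum((y - corrected) ** 2)                         # u_m = 1 / sum l(y_i, psi_m(x_i))
        num, den = num + u * corrected, den + u
        F = num / den                                                  # current ensemble F(x)
        z = y - F                                                      # step 4: update the response
    return F                                                           # step 6: H(x) = F(x)
```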
8. Stop rules

Hereinafter, an algorithm for determining the number of basic learning machines required for constructing a final ensemble machine will be described. A basic idea of the stop rules is to terminate the entire algorithm without further updating a current ensemble machine when the deviance of the ensemble machine is minimum. The stop rules are described below.

(1) First step: A positive integer K is determined.
(2) Second step: Assume that F_m is an ensemble machine constructed by using the first m basic learning machines. A deviance of the ensemble machine F_m for a given loss function l is calculated according to the following formula (56):

d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i))   (56)
(3) Third step: o_m = argmin_{j=1,...,m} d_j and o_{m+K} = argmin_{j=1,...,m+K} d_j are set. The ensemble machine F_m for the first m satisfying the relationship o_m = o_{m+K} is determined as the final ensemble machine. Then, the algorithm is terminated.
Meanwhile, the positive integer K is a value defined by a user.
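For illustration, the stop rule can be sketched as follows, assuming that the deviances d_1, d_2, ... of the successive ensemble machines have already been computed and are passed in as a list; the function name is a placeholder.

```python
def stopping_index(deviances, K):
    """Return the first m with o_m = o_{m+K}: the running argmin of the deviance sequence
    (formula (56)) has not moved after K further basic learning machines."""
    for m in range(1, len(deviances) - K + 1):
        o_m = min(range(m), key=lambda j: deviances[j])
        o_mK = min(range(m + K), key=lambda j: deviances[j])
        if o_m == o_mK:
            return m              # F_m is taken as the final ensemble machine
    return len(deviances)         # no early stop within the machines computed so far
```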
Table 2 below shows simulation data for comparing the predictabilities of a case where the stop rules are not applied to the CHEM algorithm and a case where the stop rules are applied to the CHEM algorithm, with respect to various data. As shown in Table 2 below, both predictabilities are almost similar to each other. Thus, when the stop rules are applied to the algorithm, an optimal ensemble machine can be constructed using a smaller number of basic learning machines, resulting in a great increase of the computational speed.
Table 2
9. Algorithm for generating association rules
9-1. Introduction
An algorithm for searching various kinds of association rules for describing data by using the basic learning machines that generate a final ensemble machine will be explained.
Meanwhile, the association rule generating algorithm of the present invention may be applied to any ensemble machines constructed by various conventional methods.
9-2. Algorithm for generating association rules

The association rule generating algorithm is as follows.
Referring to FIG. 15, there is provided a flowchart schematically illustrating the association rule generating algorithm according to one embodiment of the present invention. It will be described in detail below. (1) step SI 501: Initialize all kinds of variables. That is, if the response variables are categorical data, a group of interest, the minimum permissible number of data and the minimum permissible confidence are defined as g, m, and p, respectively. Further, if the response variables are of a continuous type, the range of interest is defined as (gL, gu) instead of the group g. (2) step S1502: Construct a universal set S of fundamental rales. It will be described in detail below.
By searching the tree architecture, all nodes that have the number of data larger than m and the probability of the group g larger than p are selected from all of nodes of basic learning machines, i.e., all basic learning machines that are used in the ensemble (in case of continuous type variables, the probability of the range is applied thereto).
On the other hand, in order to obtain more desirable results, since there exist as many basic learning machines as the number of the ensembles in a two-class classification problem, all of the basic learning machines are searched so as to select all nodes that meet the conditions. In a more-than-two-class classification problem (i.e., a multi-class classification problem), there are as many basic learning machines as the product of the number of the ensembles and the number of the groups, and all nodes that meet the conditions are selected by using the basic learning machines, as many as the number of the ensembles, corresponding to the group of interest g.
The rules of the nodes that are selected as described above are referred to as the universal set S of fundamental rules.
(3) Step S1503: For all rules that are included in the universal set S, cancel from the universal set S the rules of nodes higher than the node that generates a relevant rule. It will be described in detail below.
For all rules s_i ∈ S (i = 1, 2, ..., N), the following is repeated N times. After selection of s_i ∈ S, if the node that generates s_k (k = 1, ..., N) is a node higher than the node that generates the selected s_i, then s_k is cancelled from the universal set S.
(4) Step S1504: Calculate confidences for all rules that are included in the set S. At this time, the confidence for each rule is calculated as n_g/n, where the total number of data that belong to the relevant rule is n and the number of those data that belong to the group of interest (i.e., the group g) is n_g.
(5) Step S1505: Arrange the rules of the set S. That is, the rules of the set S are arranged in decreasing order of confidence. The arranged rules are referred to as o_1, ..., o_h.
(6) Step S1506: Assume that the association rule set is R. If R is an empty set, the rule o_h is added to the set R. If the set R is not an empty set, all of the association rules that are included in the set R are compared with the rule o_h to determine similarity. If the rule o_h is similar to none of the association rules that are included in the set R, the rule o_h is added to the set R.
Meanwhile, the similarity comparing method is as follows. Suppose that two given rules o and r for the explanatory variables x = (x_1, ..., x_p) are defined as formula (57) below:

o : x_i ∈ R_oi (i = 1, ..., p),  r : x_i ∈ R_ri (i = 1, ..., p)   (57)

Suppose that the set of data in which x_i belongs to R_oi is defined as a set D_oi, and the set of data in which x_i belongs to R_ri is defined as a set D_ri. At this time, R_oi and R_ri are subsets of the real line.
First, the maximum permissible similarity s ∈ [0, 1] is determined, and it is then determined that the two rules o and r have similarity for x_i if the ratio of the number of data belonging to D_oi ∩ D_ri to the number of data belonging to D_oi ∪ D_ri is equal to or larger than s. If the ratio is smaller than s, it is determined that the two rules o and r do not have similarity for x_i. Such a procedure is repeated for i = 1, ..., p.
Then, if it is determined that there is similarity for all x_i, it is determined that the rules o and r have similarity. Otherwise, if it is determined that there is no similarity for any x_i, it is determined that the rules o and r do not have similarity.
(7) Step S1507: Using all rules that are included in the association rule set R, the data are analyzed in order of confidence.
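A minimal sketch of the confidence calculation of step S1504 and the per-variable similarity test of step S1506 is given below; rules are represented here simply by the sets of data indices they cover, which is an assumption made only for this sketch.

```python
def confidence(covered, group):
    """Step S1504: confidence of a rule = n_g / n over the data indices it covers."""
    covered, group = set(covered), set(group)
    return len(covered & group) / len(covered) if covered else 0.0

def similar_for_variable(D_o, D_r, s):
    """Step S1506 similarity test for one variable: |D_o ∩ D_r| / |D_o ∪ D_r| >= s."""
    union = set(D_o) | set(D_r)
    return bool(union) and len(set(D_o) & set(D_r)) / len(union) >= s
```

Two rules would then be declared similar only when this per-variable test succeeds for every one of the p explanatory variables.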
9-3. Performance experiment of the algorithm for generating association rules

Hereinafter, test data showing the analysis capacity of the association rule generating algorithm will be explained. As an ensemble constructing scheme applied to this test, the well-known CART algorithm was used. Further, the decision tree constructed by the CART algorithm is applied to the association rule generating algorithm. Further, German data are used as actual data in this test.
Association results of actual data are shown in Tables 3 and 4 below.
Table 3
Search results of association rules using the CART algorithm
Table 4
Search results of the association rule algorithm applied to the CHEM algorithm
The German data include 700 good credit customers and 300 bad credit customers based on 1,000 pieces of data on credit transaction status, which are analyzed by using the minimum permissible number of data and the minimum permissible confidence under the same conditions for an accurate comparison of the association rules. For analyzing the bad credit customers, the minimum permissible number of data was set as 50 persons (5%) and the minimum permissible confidence was set as 50%. For analyzing the good credit customers, the minimum permissible number of data was set as 50 persons (5%) and the minimum permissible confidence was set as 85%.
The search results for the association rules of the CART algorithm under the above conditions were one bad credit customer group and one good credit customer group. In the example where the association rule algorithm is applied to the CHEM algorithm, the search results were 5 bad credit customer groups and 4 good credit customer groups. The search of the association rules by the CHEM algorithm corresponds to a wide-range search of association rules including the association rules of the CART algorithm. It will be readily noted that the results of the CHEM algorithm provide association rules including the results of the CART algorithm. Further, since the data corresponding to the association rules searched by the CART algorithm are generated from one basic learning machine, the data consist of data which are mutually exclusive from one another. On the contrary, the association rules of the CHEM algorithm are represented as a set including data which are non-exclusive from one another.
10. Comparison of performances of ensemble methods through tests
Hereinafter, the performances of various ensemble methods will be compared through tests.
10-1. Comparison of performances through virtual tests

The following models were used in the virtual tests.
(1) Model 1: Radius 2
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x_1, ..., x_10), which were independent of one another and followed a standard normal distribution.
Response variables: It was set that if Σ_{i=1}^{10} x_i^2 > c, then y = 1 with a probability of 0.9 and y = -1 with a probability of 0.1. Further, it was set that if Σ_{i=1}^{10} x_i^2 ≤ c, then y = -1 with a probability of 0.9 and y = 1 with a probability of 0.1. Here, c is a constant satisfying the relationship Pr(Σ_{i=1}^{10} x_i^2 > c) = 0.5.
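For illustration, data for Model 1 can be generated along the following lines; the constant 9.34 is used here as an approximation of the median of a chi-squared distribution with 10 degrees of freedom, and the function name is hypothetical.

```python
import numpy as np

def simulate_radius2(n, rng=np.random.default_rng(0)):
    """Model 1 ('Radius 2'): the sign of y follows whether sum_i x_i^2 exceeds c,
    with the label flipped with probability 0.1."""
    X = rng.standard_normal((n, 10))
    c = 9.34                                   # roughly the median of a chi-squared(10) variable
    signal = np.where((X ** 2).sum(axis=1) > c, 1, -1)
    flip = rng.random(n) < 0.1
    y = np.where(flip, -signal, signal)
    return X, y
```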
(2) Model 2: Interaction
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variables: x = (x_1, ..., x_10), which were independent of one another and followed a standard normal distribution.

Response variables: The response variables were 0-1 variables following Pr(y = 1 | x) = exp(F(x)) / (1 + exp(F(x))), and F(x) is expressed as the following formula (58):
6 6
Ft* = ( ∑OO + ∑(i)'-*-)
(58)
(3) Model 3: Two Normal
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
The first 500 training data belonged to Group 1, and their explanatory variables were x = (x_1, ..., x_10), which were independent of one another and followed a standard normal distribution. The remaining 500 training data belonged to Group 2, and their explanatory variables were x = (x_1, ..., x_10), which were independent of one another and followed a normal distribution with an average of 0 and a variance of 2.
(4) Model 4: Simple Quadratic
The number of training data was 1,000, the number of test data was 5,000, and the number of groups was 2.
Explanatory variable x followed a standard normal distribution.
Response variables were 0-1 variables following the following formula (59), and F(x) = -x^2 + 2:

Pr(y = 1 | x) = exp(F(x)) / (1 + exp(F(x)))   (59)
FIGS. 16a to 16d are graphs showing the results of the 4 virtual tests.
As can be seen from FIGS. 16a to 16d, the CHEM algorithm proposed by the present invention operates very stably. Particularly, in the case of Model 4, all the other algorithms have poorer performances as the number of trees is increased, whereas there is no such problem in the CHEM algorithm.
Table 5 below shows numerical values representing the results of FIGS. 16a to 16d.
Table 5. Results of virtual tests
Meanwhile, it can be understood that the predictability of the CHEM algorithm is superior to those of the conventional ensemble algorithms in most of the models. It can also be understood that the CHEM algorithm is greatly superior to the conventional algorithms in view of the deviance in addition to the predictability. That is, the deviances of the conventional algorithms except for the CHEM algorithm are continuously increased. This means that function estimation by the conventional ensemble algorithms is hardly possible.

On the contrary, it can be seen that the deviance of the CHEM algorithm remains stable. The phenomenon that the predictability is good but the deviance is increased is found in all the ensemble algorithms except for the CHEM algorithm. Such a phenomenon means that a boundary between the two groups can be found out well in the classification problem, but the other information is lost.
For example, a probability that a response variable y will belong to a k-th group for a new explanatory variable x cannot be estimated in all the ensemble algorithms except for the CHEM algorithm.
Consequently, the virtual test results show that the CHEM algorithm proposed by the present invention has both excellent predictability and stability.
10-2. Comparison through analysis of actual data

The CHEM algorithm and the existing algorithms were compared with one another through analyses of various actual data. The actual data used were obtained from the 'Machine Learning Web Site' (http://www1.ics.uci.edu/~mlearn/MLRepository.html) of 'UC Irvine' described above.
Table 6 below shows information on the actual data. Table 6. Information on actual data
FIGS. 17a to 17i are graphs showing the results of the actual data analysis.
It can be understood that the CHEM algorithm operates very stably in most cases. Although the CHEM algorithm operates slightly unstably in the case of the 'Ionosphere data,' the difference in the stability of the operation between these data and the other data is not large. In the case of the 'German data,' it can be understood that the CHEM algorithm is superior to the other algorithms in view of both predictability and stability. It can also be seen that, similarly to the aforementioned virtual tests, the CHEM algorithm has the smallest deviance value.
Table 7 below numerically shows the results of FIGS. 17a to 17i.
Table 7. Analysis results of actual data
In conclusion, the CHEM algorithm has a very excellent predictability, simultaneously is very stable (it does not produce greatly wrong estimation results in any data), and enables the function estimation (i.e., the deviance is small).
The function estimation means probability estimation. The CHEM algorithm can represent the values of response variables for given data as values with a discrete probability distribution, contrary to the conventional ensemble algorithms. This can be very usefully utilized in the analysis of the actual data.
Moreover, Table 8 below shows test results of continuous variables.
Table 8. Analysis results of continuous variables
In Table 8, the LS boost algorithm was used as a conventional algorithm.
The Friedman model was a virtual model in which the number of training data was 500 and the number of test data was 5,000. Further, the explanatory variables were x = (x_1, ..., x_10), which were independent of one another and followed a uniform distribution on [0, 1]. Moreover, the response variables were y = 10 sin(π x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + ε, where ε ~ N(0,1).
Data for finding out influences of various environmental variables on prices of houses in Boston were used as the actual data (Boston Housing Data). These data are easily available from the Internet.
The number of these data was 506, and the 5-fold cross validation method was used for obtaining a test error. It can be understood from the analysis results of the virtual tests and the data on the prices of houses in Boston that the CHEM algorithm operates well in the regression model as well as in the classification problem. Particularly, reduction parameters must be defined by the user in the conventional LS boosting method, whereas it is hardly necessary for the user to define anything in the CHEM algorithm.
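As an illustration of the evaluation protocol only, a 5-fold cross validation estimate of the test error can be sketched as follows, with `fit` and `predict` standing in for any learner; these names and the shuffling seed are placeholders of this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_test_error(fit, predict, X, y):
    """Average squared test error over a 5-fold split; `fit` and `predict` are placeholders."""
    errors = []
    for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))
```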
According to the present invention, there are the following advantages.
First, the proposed CHEM algorithm is far superior to the conventional ensemble construction algorithms in view of predictability and operates very stably. That is, there is a remarkably reduced chance that the overfitting problem occurs. Second, the proposed CHEM algorithm overcomes the drop in analyzability which is commonly produced by the existing ensemble construction methods, and shows a more excellent analyzability than any other data mining method by using the association rule algorithm.
Third, the CHEM algorithm can be naturally applied to the continuous variable problem and thus easily applied to general industrial fields.
Although the invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made thereto without departing from the spirit and scope of the invention defined by the appended claims.

Claims

1. An apparatus for constructing a data mining model using ensemble machines, comprising: M ensemble machine construction means for constructing the ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine construction means constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine construction means receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
2. The apparatus as claimed in claim 1, wherein the k-th ensemble machine construction means constructs a learning machine orthogonal to the (k-1)-th ensemble machine in the multidimensional space by using a residual of the (k-1)-th ensemble machine, and then constructs the k-th ensemble machine by using the learning machine.
3. The apparatus as claimed in claim 2, wherein the k-th ensemble machine construction means receives the learning machine Ψ_k constructed from the residual of the (k-1)-th ensemble machine, calculates a constant η for minimizing d(η Ψ_k), where d is a deviance, and then constructs the k-th ensemble machine by using η Ψ_k, where k is an integer between 2 and M-1.
4. The apparatus as claimed in claim 2, wherein if a response variable is categorical, the residual is determined according to the following formula (1):

r_i = (y_i - P_{k-1}(x_i)) / sqrt(P_{k-1}(x_i)(1 - P_{k-1}(x_i)))   (1)

where the inputted training data are (x_1, y_1), ..., (x_n, y_n), r_i is an i-th residual, and P_{k-1} is a probability that y will be 1 by using the (k-1)-th ensemble machine.
5. The apparatus as claimed in claim 4, wherein an optimal learning machine is constructed by using an absolute value of the residual as a weight.
6. The apparatus as claimed in claim 2, wherein if a response variable is continuous, the residual is determined according to the following formula (2): r_i = y_i - F_{k-1}(x_i), (2), where F_{k-1} is the (k-1)-th ensemble machine.
7. The apparatus as claimed in claim 6, wherein an optimal learning machine is constructed by using the residual as the response variable.
8. The apparatus as claimed in claim 1, wherein the k-th ensemble machine construction means calculates a deviance by using a loss function, and then constructs the k-th ensemble machine by using the calculated deviance.
9. The apparatus as claimed in claim 8, wherein the k-th ensemble machine construction means calculates the deviance according to the following formula (3):

d(Ψ) = Σ_{i=1}^{n} l(y_i, Ψ(x_i))   (3)

where d(Ψ) is the deviance and l is the loss function.
10. The apparatus as claimed in claim 9, wherein the k-th ensemble machine construction means constructs the k-th ensemble machine F_k by combining the (k-1)-th ensemble machine F_{k-1} and the k-th learning machine Ψ_k according to the following formula (4):

F_k = (u_{k-1} F_{k-1} + w_k Ψ_k) / (u_{k-1} + w_k)   (4)

where w_k = 1/d(Ψ_k) and u_{k-1} = 1/d(F_{k-1}).
11. The apparatus as claimed in claim 1, further comprising a stop means for constructing a final ensemble machine by using m learning machines, if a given loss function is smallest when the m learning machines are constructed.
12. The apparatus as claimed in claim 11, wherein the stop means constructs the m-th ensemble machine F_m for the first m satisfying the formula (5) into the final ensemble machine:

o_m = o_{m+K}   (5)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, and d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)).
13. Method for constructing a data mining model using ensemble machines, comprising: a step of constructing M ensemble machines, said M being an integer equal to or greater than 2; wherein a first ensemble machine constructing step constructs a first learning machine from multidimensional training data to be inputted, and then constructs the constructed first learning machine itself into a first ensemble machine, and a k-th ensemble machine constructing step receives a (k-1)-th ensemble machine constructed from a (k-1)-th ensemble machine construction means, constructs a k-th learning machine orthogonal to the (k-1)-th ensemble machine in a multidimensional space, and then constructs a k-th ensemble machine by combining the constructed k-th learning machine and the (k-1)-th ensemble machine, said k being an integer between 2 and M.
14. The method as claimed in claim 13, wherein the k-th ensemble machine constructing step constructs a learning machine orthogonal to the (k-1)-th ensemble machine in the multidimensional space by using a residual of the (k-1)-th ensemble machine, and then constructs the k-th ensemble machine by using the learning machine.
15. The method as claimed in claim 14, wherein the k-th ensemble machine constructing step receives the learning machine Ψ_k constructed from the residual of the (k-1)-th ensemble machine, calculates a constant η for minimizing d(η Ψ_k), where d is a deviance, and then constructs the k-th ensemble machine by using η Ψ_k, where k is an integer between 2 and M-1.
16. The method as claimed in claim 14, wherein if a response variable is categorical, the residual is determined according to the following formula (6):
r_i = (y_i - P_{k-1}(x_i)) / sqrt(P_{k-1}(x_i)(1 - P_{k-1}(x_i)))   (6)

where the inputted training data are (x_1, y_1), ..., (x_n, y_n), r_i is an i-th residual, and P_{k-1} is a probability that y will be 1 by using the (k-1)-th ensemble machine.
17. The method as claimed in claim 16, wherein an optimal learning machine is constructed by using an absolute value of the residual as a weight.
18. The method as claimed in claim 14, wherein if a response variable is continuous, the residual is determined according to the following formula (7):
r_i = y_i - F_{k-1}(x_i)   (7)

where F_{k-1} is the (k-1)-th ensemble machine.
19. The method as claimed in claim 18, wherein an optimal learning machine is constructed by using the residual as the response variable.
20. The method as claimed in claim 13, wherein the k-th ensemble machine constructing step calculates a deviance by using a loss function, and then constructs the k-th ensemble machine by using the calculated deviance.
21. The method as claimed in claim 20, wherein the k-th ensemble machine constructing step calculates the deviance according to the following formula (8):

d(Ψ) = Σ_{i=1}^{n} l(y_i, Ψ(x_i))   (8)

where d(Ψ) is the deviance and l is the loss function.
22. The method as claimed in claim 21, wherein the k-th ensemble machine constructing step constructs the k-th ensemble machine F_k by combining the (k-1)-th ensemble machine F_{k-1} and the k-th learning machine Ψ_k according to the following formula (9):

F_k = (u_{k-1} F_{k-1} + w_k Ψ_k) / (u_{k-1} + w_k)   (9)

where w_k = 1/d(Ψ_k) and u_{k-1} = 1/d(F_{k-1}).
23. The method as claimed in claim 13, further comprising a step for constructing a final ensemble machine by using m learning machines, if a given loss function is smallest when the m learning machines are constructed.
24. The method as claimed in claim 23, wherein the constructing step constructs the m-th ensemble machine F_m for the first m satisfying the formula (10) into the final ensemble machine:

o_m = o_{m+K}   (10)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, and d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)).
25. An apparatus for constructing a data mining model in two-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the input means; a correction parameter calculation means for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated correction parameter; and an ensemble machine construction means for constructing a final ensemble machine based on the updated learning machines.
26. The apparatus as claimed in claim 25, further comprising a Bootstrap data generation means for generating the inputted training data into Bootstrap data by using the weights, and then transmitting the Bootstrap data to the probability estimation means.
27. The apparatus as claimed in claim 25, wherein the correction parameter calculation means calculates the correction parameter η which minimizes a value of the following formula (11) with respect to the given loss function l:

Σ_{i=1}^{n} l(y_i, η f_m(x_i))   (11)

where f_m(x) = log(p_m(x) / (1 - p_m(x))) and p_m is a probability that the response variable will be 1 with respect to the given explanatory variable x.
28. The apparatus as claimed in claim 27, wherein the learning machine update means updates the learning machines according to the following formula (12):

f_{m,new}(x) = η f_{m,old}(x)   (12)

where η is the correction parameter.
29. The apparatus as claimed in claim 28, wherein the ensemble machine construction means constructs the ensemble machines according to the following formula (13):

F(x) = Σ_{i=1}^{m} u_i f_i(x) / Σ_{i=1}^{m} u_i   (13)

where u_m = 1 / Σ_{i=1}^{n} l(y_i, f_m(x_i)).
30. The apparatus as claimed in claim 29, wherein the weight calculation means updates and calculates the weights according to the following formula (14):

w_i = |(y_i - P(x_i)) / sqrt(P(x_i)(1 - P(x_i)))|   (14)

where P(x) = exp(F(x)) / (1 + exp(F(x))).
31. The apparatus as claimed in claim 25, wherein the learning machines are decision trees.
32. The apparatus as claimed in claim 25, wherein the learning machines are neural network models.
33. The apparatus as claimed in claim 25, wherein the ensemble machine construction means constructs the final ensemble machine when a deviance of a current ensemble machine is smallest.
34. The apparatus as claimed in claim 33, wherein the ensemble machine construction means constructs the m-th ensemble machine F_m for the first m satisfying the following formula (15) into the final ensemble machine:

o_m = o_{m+K}   (15)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is the m-th ensemble machine.
35. Method for constructing a data mining model in two-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, a probability that a response variable will be 1 with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; and a fifth step for updating and reconstructing learning machines by using the correction parameter calculated in the fourth step.
36. The method as claimed in claim 35, wherein the third step generates the inputted training data as Bootstrap data by using the weights, and then estimates a probability that the response variable will be 1 for a given explanatory variable x by using the Bootstrap data through the basic learning machines.
37. The method as claimed in claim 35, wherein the fourth step calculates the correction parameter η which minimizes a value of the following formula (16) with respect to the given loss function l:

Σ_{i=1}^{n} l(y_i, η f_m(x_i))   (16)

where f_m(x) = log(p_m(x) / (1 - p_m(x))) and p_m is a probability that the response variable will be 1 with respect to the given explanatory variable x.
38. The method as claimed in claim 37, wherein the fifth step updates the learning machines according to the following formula (17):

f_{m,new}(x) = η f_{m,old}(x)   (17)

where η is the correction parameter.
39. The method as claimed in claim 38, further comprising a step of constructing the ensemble machines according to the following formula (18):

F(x) = Σ_{i=1}^{m} u_i f_i(x) / Σ_{i=1}^{m} u_i   (18)

where u_m = 1 / Σ_{i=1}^{n} l(y_i, f_m(x_i)).
40. The method as claimed in claim 39, wherein the second step updates and calculates the weights according to the following formula (19):

w_i = |(y_i - P(x_i)) / sqrt(P(x_i)(1 - P(x_i)))|   (19)

where P(x) = exp(F(x)) / (1 + exp(F(x))).
41. The method as claimed in claim 35, wherein the learning machines are decision trees.
42. The method as claimed in claim 35, wherein the learning machines are neural network models.
43. The method as claimed in claim 35, wherein the final ensemble machine is constructed when a deviance of a current ensemble machine is smallest.
44. The method as claimed in claim 43, wherein the m-th ensemble machine F_m for the first m satisfying the following formula (20) is constructed as the final ensemble machine:

o_m = o_{m+K}   (20)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is the m-th ensemble machine.
45. An apparatus for constructing a data mining model in multi-class classification using ensemble machines, comprising: an input means for receiving multidimensional training data; a weight calculation means for calculating weights by using residuals of the ensemble machines; a probability estimation means for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the input means; a coπection parameter calculation means for calculating a coπection parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the probability estimation means; a learning machine update means for updating and reconstructing learning machines by using the calculated coπection parameter; and an ensemble machine construction means for constracting a final ensemble machine based on the updated learning machines.
46. The apparatus as claimed in claim 45, further comprising a Bootstrap data generation means for generating the inputted training data into Bootstrap data by using the weights, and then transmitting the Bootstrap data to the probability estimation means.
47. The apparatus as claimed in claim 45, wherein the correction parameter calculation means calculates the correction parameter η which minimizes a value of the following formula (21) with respect to the given loss function l:

Σ_{i=1}^{n} l(y_i*, η f_m(x_i))   (21)

where f_m(x) = (f_m1(x), ..., f_mJ(x)), f_mj(x) = log(p_mj(x) / (1 - p_mj(x))), and p_mj is a probability that the response variable will belong to Group j with respect to the given explanatory variable x.
48. The apparatus as claimed in claim 47, wherein the learning machine update means updates the learning machines according to the following formula (22):

f_{m,new}(x) = η f_{m,old}(x)   (22)

where η is the correction parameter.
49. The apparatus as claimed in claim 48, wherein the ensemble machine construction means updates the ensemble machines according to the following formula (23):

F(x) = Σ_{i=1}^{m} u_i f_i(x) / Σ_{i=1}^{m} u_i   (23)

where u_m = 1 / Σ_{i=1}^{n} l(y_i*, f_m(x_i)) and F(x) = (F_1(x), ..., F_J(x)).
50. The apparatus as claimed in claim 49, wherein the weight calculation means updates the weights according to the following formula (24):

w_ij = |(y_ij* - P_j(x_i)) / sqrt(P_j(x_i)(1 - P_j(x_i)))|   (24)

where P_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x)), and y_ij* is either 1 if an i-th observed value belongs to Group j or 0 if not.
51. The apparatus as claimed in claim 45, wherein the learning machines are decision trees.
52. The apparatus as claimed in claim 45, wherein the learning machines are neural network models.
53. The apparatus as claimed in claim 45, wherein the ensemble machine construction means constructs a final ensemble machine when a deviance of a current ensemble machine is smallest.
54. The apparatus as claimed in claim 53, wherein the ensemble machine construction means constructs the m-th ensemble machine F_m for the first m satisfying the following formula (25) into the final ensemble machine:

o_m = o_{m+K}   (25)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is an m-th ensemble machine.
55. Method for constructing a data mining model in multi-class classification using ensemble machines, comprising: a first step for receiving multidimensional training data; a second step for calculating weights by using residuals of the ensemble machines; a third step for estimating, by using basic learning machines and the weights, probabilities that a response variable will belong to Group j with respect to an explanatory variable x obtained from the data inputted by the first step; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the estimated probabilities obtained from the third step; a fifth step for updating and reconstructing learning machines by using the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the updated learning machines.
56. The method as claimed in claim 55, wherein the third step generates the inputted training data as Bootstrap data by using the weights, and then estimates a probability that the response variable will be 1 for a given explanatory variable x by using the Bootstrap data through the basic learning machines.
57. The method as claimed in claim 55, wherein the fourth step calculates the correction parameter η which minimizes a value of the following formula (26) with respect to the given loss function l:

Σ_{i=1}^{n} l(y_i*, η f_m(x_i))   (26)

where f_m(x) = (f_m1(x), ..., f_mJ(x)), f_mj(x) = log(p_mj(x) / (1 - p_mj(x))), and p_mj is a probability that the response variable will belong to Group j with respect to the given explanatory variable x.
58. The method as claimed in claim 57, wherein the fifth step updates the learning machines according to the following formula (27):

f_{m,new}(x) = η f_{m,old}(x)   (27)

where η is the correction parameter.
59. The method as claimed in claim 58, wherein the sixth step updates the ensemble machines according to the following formula (28):

F(x) = Σ_{i=1}^{m} u_i f_i(x) / Σ_{i=1}^{m} u_i   (28)

where u_m = 1 / Σ_{i=1}^{n} l(y_i*, f_m(x_i)) and F(x) = (F_1(x), ..., F_J(x)).
60. The method as claimed in claim 59, wherein the second step updates the weights according to the following formula (29):

w_ij = |(y_ij* - P_j(x_i)) / sqrt(P_j(x_i)(1 - P_j(x_i)))|   (29)

where P_j(x) = exp(F_j(x)) / Σ_{k=1}^{J} exp(F_k(x)), and y_ij* is either 1 if an i-th observed value belongs to Group j or 0 if not.
61. The method as claimed in claim 55, wherein the learning machines are decision trees.
62. The method as claimed in claim 55, wherein the learning machines are neural network models.
63. The method as claimed in claim 55, wherein the sixth step constructs a final ensemble machine when a deviance of a current ensemble machine is smallest.
64. The method as claimed in claim 63, wherein the ensemble machine construction means constructs the m-th ensemble machine F_m for the first m satisfying the following formula (30) into the final ensemble machine:

o_m = o_{m+K}   (30)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is an m-th ensemble machine.
65. An apparatus for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: an input means for receiving multidimensional training data; a residual calculation means for calculating residuals of the ensemble machines; a regression model construction means for constracting regression models from the data inputted by the input means through basic learning machines and the residuals; a coπection parameter calculation means for calculating a coπection parameter which minimizes a given loss function with respect to the constructed regression models obtained from the regression model constraction means; a learning machine update means for reconstructing learning machines by updating the response variable based on the calculated coπection parameter; and an ensemble machine constraction means for constracting a final ensemble machine based on the reconstructed learning machines.
66. The apparatus as claimed in claim 65, wherein the correction parameter calculation means calculates the correction parameter η which minimizes a value of the following formula (31) with respect to the given loss function l:

Σ_{i=1}^{n} l(z_i, η ψ_m(x_i))   (31)
67. The apparatus as claimed in claim 65, wherein the learning machine update means updates the learning machines based on the correction parameter η according to the following formula (32):

F(x) = Σ_{i=1}^{m} u_i ψ_i(x) / Σ_{i=1}^{m} u_i   (32)

where ψ_m(x) is updated in accordance with ψ_m(x) = η ψ_m(x) and u_m = 1 / Σ_{i=1}^{n} l(y_i, ψ_m(x_i)), z_i is updated in accordance with the formula z_i = y_i - F(x_i), the multidimensional training data to be inputted are (x_1, y_1), ..., (x_n, y_n), x_i is a p-dimensional explanatory variable, y_i is a response variable, and ψ_m is a regression model.
68. The apparatus as claimed in claim 65, wherein the learning machines are decision trees.
69. The apparatus as claimed in claim 65, wherein the learning machines are neural network models.
70. The apparatus as claimed in claim 65, wherein the ensemble machine construction means constructs a final ensemble machine when a deviance of a current ensemble machine is smallest.
71. The apparatus as claimed in claim 70, wherein the ensemble machine construction means constructs the m-th ensemble machine F_m for the first m satisfying the following formula (33) into the final ensemble machine:

o_m = o_{m+K}   (33)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is an m-th ensemble machine.
72. Method for constructing a data mining model using ensemble machines in a case where a response variable is regressive, comprising: a first step for receiving multidimensional training data; a second step for calculating residuals of the ensemble machines; a third step for constructing regression models from the data inputted by the first step through basic learning machines and the residuals; a fourth step for calculating a correction parameter which minimizes a given loss function with respect to the constructed regression models obtained from the third step; a fifth step for reconstructing learning machines by updating the response variable based on the calculated correction parameter; and a sixth step for constructing a final ensemble machine based on the reconstructed learning machines.
73. The method as claimed in claim 72, wherein the fourth step calculates the correction parameter η which minimizes a value of the following formula (34) with respect to the given loss function l:

Σ_{i=1}^{n} l(z_i, η ψ_m(x_i))   (34)
74. The method as claimed in claim 72, wherein the fifth step updates the learning machines based on the correction parameter η according to the following formula (35):

F(x) = Σ_{i=1}^{m} u_i ψ_i(x) / Σ_{i=1}^{m} u_i   (35)

where ψ_m(x) is updated in accordance with ψ_m(x) = η ψ_m(x) and u_m = 1 / Σ_{i=1}^{n} l(y_i, ψ_m(x_i)), z_i is updated in accordance with the formula z_i = y_i - F(x_i), the multidimensional training data to be inputted are (x_1, y_1), ..., (x_n, y_n), x_i is a p-dimensional explanatory variable, y_i is a response variable, and ψ_m is a regression model.
75. The method as claimed in claim 72, wherein the learning machines are decision trees.
76. The method as claimed in claim 72, wherein the learning machines are neural network models.
77. The method as claimed in claim 72, wherein the sixth step constructs a final ensemble machine when a deviance of a current ensemble machine is smallest.
78. The method as claimed in claim 77, wherein the m-th ensemble machine F_m for the first m satisfying the following formula (36) is constructed as the final ensemble machine:

o_m = o_{m+K}   (36)

where o_m = argmin_{j=1,...,m} d_j, o_{m+K} = argmin_{j=1,...,m+K} d_j, l is a loss function, K is a predetermined constant, d_m = Σ_{i=1}^{n} l(y_i, F_m(x_i)), and F_m is an m-th ensemble machine.
PCT/KR2003/000409 2002-03-02 2003-03-03 Apparatus and method for constructing data mining model using ensemble machines Ceased WO2003075187A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003212671A AU2003212671A1 (en) 2002-03-02 2003-03-03 Apparatus and method for constructing data mining model using ensemble machines

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2002-0011208 2002-03-02
KR1020020011208A KR100640264B1 (en) 2002-03-02 2002-03-02 Apparatus and Method for Constructing Data Mining Model Using Ensemble Model

Publications (1)

Publication Number Publication Date
WO2003075187A1 true WO2003075187A1 (en) 2003-09-12

Family

ID=27785964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2003/000409 Ceased WO2003075187A1 (en) 2002-03-02 2003-03-03 Apparatus and method for constructing data mining model using ensemble machines

Country Status (3)

Country Link
KR (1) KR100640264B1 (en)
AU (1) AU2003212671A1 (en)
WO (1) WO2003075187A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920551A (en) * 2019-01-24 2019-06-21 华东师范大学 A system for analyzing the characteristics of social behavior of children with autism based on machine learning
WO2020178843A1 (en) * 2019-03-05 2020-09-10 Telefonaktiebolaget Lm Ericsson (Publ) System and method for managing resources
US10832162B2 (en) 2016-09-08 2020-11-10 International Business Machines Corporation Model based data processing
CN112378619A (en) * 2020-11-06 2021-02-19 东北财经大学 Application of FER-FSE with ReMD-OSELM in total pressure real-time modeling in wind tunnel test stamping stage
CN113051827A (en) * 2021-03-30 2021-06-29 南华大学 Mine ventilation friction wind resistance prediction method based on classification and regression tree
TWI740896B (en) * 2016-03-04 2021-10-01 香港商阿里巴巴集團服務有限公司 Training method and training system of machine learning system

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180070103A (en) 2016-12-16 2018-06-26 삼성전자주식회사 Method and apparatus for recognition
KR102078289B1 (en) * 2017-12-18 2020-02-19 기술보증기금 Ip valuation method using ip valuation model and apparatus thereof
KR102038703B1 (en) * 2017-12-27 2019-11-26 (주)가디엘 Method for estimation on online multivariate time series using ensemble dynamic transfer models and system thereof
KR20210060146A (en) 2019-11-18 2021-05-26 삼성전자주식회사 Method and apparatus for processing data using deep neural network model, method and apparatus for trining deep neural network model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5819247A (en) * 1995-02-09 1998-10-06 Lucent Technologies, Inc. Apparatus and methods for machine learning hypotheses
EP0994423A2 (en) * 1998-10-16 2000-04-19 Mitsubishi Denki Kabushiki Kaisha Smoothing algorithm for bayesian classifier

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000024514A (en) * 2000-02-17 2000-05-06 전성현 Agency web browser
KR20010105126A (en) * 2000-05-19 2001-11-28 권영주 A estavlishment of relation DBMS by using sample and advertizing system and method using the same
KR20010105967A (en) * 2000-05-19 2001-11-29 이기영 Method of web data mining using on-line web data acquisition and analysis and consulting system using the same
KR20010082412A (en) * 2001-06-05 2001-08-30 안종배 On and Off Digital Marketing Business Model using both the technology of digital message coding/decoding and the method of automatic electronic mailing system including promotion and survey question(s).
KR100462298B1 (en) * 2001-06-08 2004-12-17 삼성에스디에스 주식회사 Apparatus for providing information using PPL in moving pictures and method thereof
KR20010083846A (en) * 2001-07-04 2001-09-03 김병기 An Online Estimation System for Examination Pass Probability with Data Mining Technology

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5819247A (en) * 1995-02-09 1998-10-06 Lucent Technologies, Inc. Apparatus and methods for machine learning hypotheses
EP0994423A2 (en) * 1998-10-16 2000-04-19 Mitsubishi Denki Kabushiki Kaisha Smoothing algorithm for bayesian classifier

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AGRAWAL R., IMIELINSKI T., SWAMI A.: "Database mining: a performance perspective", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, vol. 5, no. 6, December 1993 (1993-12-01), pages 914 - 925, XP002060682, DOI: doi:10.1109/69.250074 *
ALEXANDRE L.A., CAMPILHO A.C., KAMEL M.: "Combining independent and unbiased classifiers using weighted average", 15TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, 2000. PROCEEDINGS, vol. 2, 3 September 2000 (2000-09-03) - 7 September 2000 (2000-09-07), pages 495 - 498, XP010533862, DOI: doi:10.1109/ICPR.2000.906120 *
MILLER D.J., UYAR H.S., LIAN YAN: "Combined learning and use for classificatio and regression models", VII. PROCEEDINGS OF THE 1997 IEEE WORKSHOP ON NEURAL NETWORKS FOR SIGNAL PROCESSING, 24 September 1997 (1997-09-24) - 26 September 1997 (1997-09-26), pages 102 - 111, XP010245411, DOI: doi:10.1109/NNSP.1997.622388 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI740896B (en) * 2016-03-04 2021-10-01 香港商阿里巴巴集團服務有限公司 Training method and training system of machine learning system
US10832162B2 (en) 2016-09-08 2020-11-10 International Business Machines Corporation Model based data processing
CN109920551A (en) * 2019-01-24 2019-06-21 华东师范大学 A system for analyzing the characteristics of social behavior of children with autism based on machine learning
WO2020178843A1 (en) * 2019-03-05 2020-09-10 Telefonaktiebolaget Lm Ericsson (Publ) System and method for managing resources
CN112378619A (en) * 2020-11-06 2021-02-19 东北财经大学 Application of FER-FSE with ReMD-OSELM in total pressure real-time modeling in wind tunnel test stamping stage
CN112378619B (en) * 2020-11-06 2022-08-19 东北财经大学 Application of FER-FSE with ReMD-OSELM in total pressure real-time modeling in wind tunnel test stamping stage
CN113051827A (en) * 2021-03-30 2021-06-29 南华大学 Mine ventilation friction wind resistance prediction method based on classification and regression tree

Also Published As

Publication number Publication date
KR100640264B1 (en) 2007-02-28
KR20030071939A (en) 2003-09-13
AU2003212671A1 (en) 2003-09-16

Similar Documents

Publication Publication Date Title
Zhang et al. Attribute and instance weighted naive Bayes
Chen et al. Survey and taxonomy of feature selection algorithms in intrusion detection system
Shekhar et al. Gaussian process bandits with adaptive discretization
Denoeux Maximum likelihood estimation from uncertain data in the belief function framework
Zhang et al. Kernel mixture model for probability density estimation in Bayesian classifiers
Li et al. Forest-type regression with general losses and robust forest
WO2003075187A1 (en) Apparatus and method for constructing data mining model using ensemble machines
Dinakaran et al. Ensemble method of effective AdaBoost algorithm for decision tree classifiers
Gama Iterative Bayes
Pimentel et al. A generalized multivariate approach for possibilistic fuzzy C-means clustering
Akhtar et al. Zeroth and first order stochastic Frank-Wolfe algorithms for constrained optimization
Ding et al. Gradient information for representation and modeling
Labroche Online fuzzy medoid based clustering algorithms
Doring et al. Fuzzy clustering of quantitative and qualitative data
Miao et al. Informative core identification in complex networks
Liang et al. A normalizing flow-based co-embedding model for attributed networks
Song et al. Provably accelerated decentralized gradient methods over unbalanced directed graphs
Yu et al. Learning Uncertainty for Unknown Domains with Zero-Target-Assumption.
Khalid et al. Scalable and practical One-Pass clustering algorithm for recommender system
Ruiz et al. A principled approach to network-based classification and data representation
Mahnig et al. Optimal mutation rate using Bayesian priors for estimation of distribution algorithms
Qi Extending expectation propagation for graphical models
Keriven Entropic optimal transport on random graphs
Jin et al. Modeling regularity to improve scalability of model-based multiobjective optimization algorithms
Thirunavukkarasu et al. Analysis of classification techniques in data mining

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the EPO has been informed by WIPO that EP was designated in this application
122 Ep: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP