
1 Mining Relational Model Trees. Annalisa Appice, Knowledge Acquisition & Machine Learning Lab, Department of Computer Science (Dipartimento di Informatica), University of Bari.

2 Regression problem in classical data mining. Given: m independent (or predictor) variables X_i (both continuous and discrete); a continuous dependent (or response) variable Y to be predicted; a set of n training cases (x_1, x_2, …, x_m, y). Build a function y = g(x) such that it correctly predicts the value of the response variable for each m-tuple (x_1, x_2, …, x_m).

3 Regression trees and model trees. Regression trees approximate Y by a piecewise constant function: each leaf stores a single value (e.g. splits X_1 ≤ 0.1 and X_2 ≤ 0.1 with leaf values Y = 0.9, Y = 0.5, Y = 1.9). Model trees approximate Y by a piecewise multiple (linear) function: each leaf stores a regression model (e.g. splits X_1 ≤ 0.3 and X_2 ≤ 2.1 with leaf models Y = 0.9, Y = 3 + 1.1X_1, Y = 3X_1 + 1.1X_2). In both cases, partitioning of the observations plus local regression models yields regression or model trees. A sketch of the two prediction schemes follows.
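The minimal sketch below contrasts the two schemes using the thresholds and leaf models shown on this slide; the exact pairing of leaves and models is partly reconstructed from the slide, and the function names are illustrative, not SMOTI code.

```python
# Minimal sketch: piecewise-constant vs. piecewise-linear prediction.

def regression_tree_predict(x1: float, x2: float) -> float:
    # Regression tree: every leaf stores a single constant.
    if x1 <= 0.1:
        return 0.9
    return 0.5 if x2 <= 0.1 else 1.9

def model_tree_predict(x1: float, x2: float) -> float:
    # Model tree: every leaf stores a (multiple) linear model.
    if x1 <= 0.3:
        return 0.9
    return 3 + 1.1 * x1 if x2 <= 2.1 else 3 * x1 + 1.1 * x2

print(regression_tree_predict(0.05, 0.5))  # 0.9
print(model_tree_predict(0.5, 3.0))        # 3*0.5 + 1.1*3.0 = 4.8
```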

4 Model trees: state of the art. Statistics: Ciampi (1991): RECPAM; Siciliano & Mola (1994). Data mining: Karalic (1992): RETIS; Quinlan (1992): M5; Wang & Witten (1997): M5'; Lubinsky (1994): TSIR; Torgo (1997): HTL; … The tree structure is generated according to a top-down strategy: Phase 1, partitioning of the training set (e.g. test X_1 ≤ 3); Phase 2, association of models to the leaves (e.g. Y = 3 + 2X_1).

5 Model trees: state of the art. Models in the leaves have only a "local" validity: they are built on the basis of the training cases falling in the corresponding partition of the feature space. "Global" effects can be represented by variables that are introduced in the regression models at higher levels of the model tree ⇒ a different tree structure is required! Internal nodes can either define a further partitioning of the feature space or introduce some regression variables into the models to be associated with the leaves.

6 Two types of nodes. Splitting nodes perform a Boolean test and have two children: on a continuous variable the test is X_i ≤ α, on a discrete variable it is X_i ∈ {x_i1, …, x_ih}; each child (e.g. with models Y = a + bX_u and Y = c + dX_w) receives one of the two resulting subgroups. Regression nodes compute only a straight-line regression, Y = a + bX_i, and have only one child. [Diagram: a splitting node t with children t_L and t_R; a regression node t followed by a splitting node t' with children t'_L and t'_R.]

7 What is passed down? Splitting nodes pass down to each child only a subgroup of the training cases, without any change to the variables. Regression nodes pass down to their unique child all training cases; the values of the variables not yet included in the model are transformed to remove the linear effect of the variables already included. Example: after the regression node Y = a_1 + b_1X_1 on (Y, X_1, X_2), the child receives Y' = Y - (a_1 + b_1X_1) and X'_2 = X_2 - (a_2 + b_2X_1), and a later node may fit Y' = a_3 + b_3X'_2. A sketch follows.
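The sketch below shows this transformation on synthetic data (the data and coefficients are assumed for illustration; this is not SMOTI code):

```python
import numpy as np

# Sketch: what a regression node on X1 passes down to its child.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.7 * x1 + rng.normal(size=200)                 # X2 correlated with X1
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=200, scale=0.1)

b1, a1 = np.polyfit(x1, y, 1)     # fit Y  = a1 + b1*X1 at the node
b2, a2 = np.polyfit(x1, x2, 1)    # fit X2 = a2 + b2*X1
y_child = y - (a1 + b1 * x1)      # Y'  = Y  - (a1 + b1*X1)
x2_child = x2 - (a2 + b2 * x1)    # X2' = X2 - (a2 + b2*X1)
# The child works on (Y', X2'): the linear effect of X1 is gone from both.
```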

8 An example of model tree. [Tree T: the root (node 0) is a regression node Y = a + bX_1; below it, splitting nodes on X_2, X_3 and X_4 lead to leaves with the straight-line functions Y = c + dX_3, Y = e + fX_2, Y = g + hX_3 and Y = i + lX_4.] Leaves are associated with a straight-line regression function. The multiple regression model associated with a leaf is the composition of the straight-line regression functions found along the path from the root to that leaf.

9 Building a regression model stepwise: some tricks. Example: build a multiple regression model with two independent variables, Y = a + bX_1 + cX_2, through a sequence of straight-line regressions:
1. Build Y = a_1 + b_1X_1.
2. Build X_2 = a_2 + b_2X_1.
3. Compute the residuals on X_2: X'_2 = X_2 - (a_2 + b_2X_1).
4. Compute the residuals on Y: Y' = Y - (a_1 + b_1X_1).
5. Regress Y' on X'_2 alone: Y' = a_3 + b_3X'_2.
Substituting the equation of X'_2 into the last equation gives Y = (a_1 + a_3 - a_2b_3) + (b_1 - b_2b_3)X_1 + b_3X_2, so it can be proven that a = a_1 + a_3 - a_2b_3, b = b_1 - b_2b_3 and c = b_3. The numerical check below illustrates this.
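A numerical check of the identity on synthetic (assumed) data; the five fitting steps mirror the list above, and the composed coefficients match a direct multiple regression:

```python
import numpy as np

# Verify: composing the straight-line regressions equals direct multiple OLS.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + rng.normal(size=500)
y = 1.0 - 2.0 * x1 + 3.0 * x2 + rng.normal(size=500, scale=0.2)

b1, a1 = np.polyfit(x1, y, 1)     # 1. Y  = a1 + b1*X1
b2, a2 = np.polyfit(x1, x2, 1)    # 2. X2 = a2 + b2*X1
x2r = x2 - (a2 + b2 * x1)         # 3. residuals on X2
yr = y - (a1 + b1 * x1)           # 4. residuals on Y
b3, a3 = np.polyfit(x2r, yr, 1)   # 5. Y' = a3 + b3*X2'

composed = np.array([a1 + a3 - a2 * b3,   # a
                     b1 - b2 * b3,        # b
                     b3])                 # c

X = np.column_stack([np.ones_like(x1), x1, x2])
direct, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS of Y on (1, X1, X2)
print(np.allclose(composed, direct))            # True
```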

10 The global effect of regression nodes. When a regression node computing Y = a + bX_i precedes a split on X_j, both regression models associated with the leaves include X_i. The contribution of X_i to Y can be different for each leaf, but it can be reliably estimated on the whole region R, before the split partitions R into R_1 and R_2.

11 An example of model tree. [Same tree as slide 8.] The regression node at the root (Y = a + bX_1) introduces a variable into the regression models at all descendant leaves: the variable X_1 captures a "global" effect in the underlying multiple regression model. The variables X_2, X_3 and X_4, introduced lower in the tree, capture "local" effects.

12 Advantages of the proposed tree structure.
1. It captures both the "global" and the "local" effects of regression variables.
2. Multiple regression models at the leaves can be built efficiently, stepwise.
3. The multiple regression model at a leaf can be computed easily ⇒ the heuristic function for the selection of regression and splitting nodes can take it into account.

13 Evaluating splitting and regression nodes. Splitting node t on X_i ≤ α, with children t_L and t_R (leaf models Y = a + bX_u and Y = c + dX_v): σ(X_i, Y) = (n_L/(n_L + n_R)) R(t_L) + (n_R/(n_L + n_R)) R(t_R), where R(t_L) (R(t_R)) is the resubstitution error associated with the left (right) child and n_L (n_R) its number of cases. Regression node t computing Y = a + bX_i, followed by a candidate split t': ρ(X_i, Y) = min{R(t), σ(X_j, Y) for all possible variables X_j}. A sketch of the split score follows.
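A sketch of the split score under this reading, with R(t) taken as the mean squared residual of the leaf's straight-line regression (an assumption; the paper may use a different error measure), and illustrative helper names:

```python
import numpy as np

# Illustrative split evaluation: size-weighted resubstitution errors of the
# children, each fitted with its own straight-line regression.
def resubstitution_error(x: np.ndarray, y: np.ndarray) -> float:
    b, a = np.polyfit(x, y, 1)          # straight-line model at the leaf
    return float(np.mean((y - (a + b * x)) ** 2))

def sigma(x_left, y_left, x_right, y_right) -> float:
    n_l, n_r = len(y_left), len(y_right)
    return (n_l * resubstitution_error(x_left, y_left)
            + n_r * resubstitution_error(x_right, y_right)) / (n_l + n_r)
```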

14 Filtering useless splitting nodes. Problem: a splitting node whose children have (nearly) identical straight-line regressions is really modelling a regression step, not a split. How can this be recognized? Solution: compare the two regression lines associated with the children of the splitting node according to a statistical test for coincident regression lines (Weisberg, 1985); a sketch follows.
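A sketch of such a test, assuming the standard Chow-type F-test for coincident lines (the exact statistic in Weisberg, 1985, and in SMOTI, may be stated differently):

```python
import numpy as np
from scipy import stats

# H0: one regression line fits both children (the split is useless).
def rss_line(x: np.ndarray, y: np.ndarray) -> float:
    b, a = np.polyfit(x, y, 1)
    return float(np.sum((y - (a + b * x)) ** 2))

def coincident_lines_pvalue(xl, yl, xr, yr) -> float:
    n = len(yl) + len(yr)
    rss_sep = rss_line(xl, yl) + rss_line(xr, yr)   # two separate lines
    rss_pool = rss_line(np.concatenate([xl, xr]),
                        np.concatenate([yl, yr]))   # one pooled line
    # 2 extra parameters in the separate fit; n - 4 residual df.
    f = ((rss_pool - rss_sep) / 2) / (rss_sep / (n - 4))
    return float(stats.f.sf(f, 2, n - 4))           # small p => keep the split
```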

15 Stopping criteria.
1. The first performs the partial F-test to evaluate the contribution of a new independent variable to the model (a sketch follows this list).
2. The second requires the number of cases in each node to be greater than a minimum value.
3. The third operates when all continuous variables along the path from the root to the current node have been used in regression steps and there are no discrete variables in the training set.
4. The fourth creates a leaf if the error in the current node is below a fraction of the error in the root node.
5. The fifth stops the growth when the coefficient of determination is greater than a minimum value.
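A sketch of the partial F-test behind the first criterion, in its assumed standard form (X_full has exactly one more column than X_reduced, both including an intercept column):

```python
import numpy as np
from scipy import stats

# Does adding one variable significantly reduce the residual sum of squares?
def partial_f_pvalue(X_reduced: np.ndarray, X_full: np.ndarray,
                     y: np.ndarray) -> float:
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))
    n, p_full = X_full.shape
    f = (rss(X_reduced) - rss(X_full)) / (rss(X_full) / (n - p_full))
    return float(stats.f.sf(f, 1, n - p_full))  # large p => stop adding
```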

16 Related works … and problems. In principle, the optimal split should be chosen on the basis of the fit of each regression model to the data. Problem: in some systems (M5, M5' and HTL) the heuristic function does not take into account the models associated with the leaves of the tree ⇒ the evaluation function is incoherent with respect to the model tree being built ⇒ some simple regression models are not correctly discovered.

17 Related works … and problems. Example: Cubist splits the data at -0.1 and builds the following models: X ≤ -0.1: Y = 0.78 + 0.175X; X > -0.1: Y = 1.143 - 0.281X. The model tree that actually fits the data splits at x ≤ 0.4, with y = 0.963 + 0.851x on the true branch and y = 1.909 - 0.868x on the false branch. [Plot: the training cases (x in [-1.5, 2.5], y in [0, 1.8]) with the two regression lines.]

18 Related works … and problems. Retis solves this problem by computing the best multiple regression model at the leaves for each candidate splitting node. The problem is theoretically solved, but:
1. the approach is computationally expensive, since a multiple regression model is computed for each possible test; the choice of the first split is O(m³N²);
2. all continuous variables are involved in the multiple linear models associated with the leaves, so when some of the independent variables are linearly related to each other, several problems may occur (collinearity).

19 Related works … and problems. TSIR induces model trees with regression nodes and splitting nodes, but the effect of the regressed variable in a regression node is not removed when cases are passed down ⇒ the multiple regression model associated with each leaf cannot be correctly interpreted from a statistical viewpoint.

20 Computational complexity. It can be proved that SMOTI has an O(m³n²) worst-case complexity for the selection of any node (splitting or regression). RETIS has the same complexity for node selection, although RETIS does not select a subset of variables to solve collinearity problems.

21 Empirical evaluation. For pairwise comparison with Retis and M5', which are the state-of-the-art model tree induction systems, the non-parametric Wilcoxon two-sample paired signed rank test is used. Experiments (Malerba et al., 2004): laboratory-sized data sets and UCI datasets.

22 …Empirical evaluation on laboratory-sized data… [Chart comparing M5, Retis and SMOTI.]

23 …Empirical evaluation on laboratory-sized data. [Chart: running time in seconds versus number of examples for M5, Retis and SMOTI.]

24 … Empirical Evaluation on UCI data… The comparison columns report the sign of the difference and the Wilcoxon p-value.

Dataset | Avg. MSE SMOTI | Avg. MSE Retis | Avg. MSE M5' | SMOTI vs. Retis | SMOTI vs. M5'
Abalone | 2.53 | 6.03 | 2.77 | (+) 0.0019 | (+) 0.19
AutoMpg | 3.14 | NA | 3.20 | NA | (+) 0.55
AutoPrice | 2246.03 | NA | 2358.81 | NA | (+) 0.69
Bank8FM | 0.03 | 0.46 | 0.04 | (+) 0.0019 | (+) 0.064
Cleveland | 1.31 | 2.97 | 1.24 | (+) 0.0097 | (-) 0.23
DeltaAilerons | 0.0002 | 0.001 | 0.0002 | (+) 0.0273 | (-) 0.64
DeltaElevators | 0.004 | 0.005 | 0.0016 | (+) 0.13 | (-) 0.19
Housing | 3.58 | 36.36 | 4.27 | (+) 0.0019 | (+) 0.04
Kinematics | 0.15 | 1.98 | 0.19 | (+) 0.0019 | (+) 0.0039
MachineCPU | 55.31 | 305.60 | 57.35 | (+) 0.0039 | (+) 0.55
Pyrimidines | 0.10 | 0.07 | 0.09 | (-) 0.4316 | (-) 0.84
Stock | 1.82 | 1.59 | 1.10 | (-) 0.4375 | (-) 0.03
Triazines | 0.20 | NA | 0.15 | NA | (-) 0.02
WisconsinCancer | 51.41 | NA | 45.40 | NA | (-) 0.625

25 … Empirical Evaluation on UCI data. For some datasets SMOTI mines interesting patterns that no previous study on model trees has ever revealed; this demonstrates the easy interpretability of the model trees induced by SMOTI. For example, on Abalone (a marine mollusc) the goal is to predict the age (number of rings). SMOTI builds a model tree with a regression node in the root. The straight-line regression selected at the root is almost invariant across all model trees and expresses a linear dependence between the number of rings (dependent variable) and the shucked weight (independent variable). This is a clear example of a global effect.

26 SMOTI: open issues. The DM system KDB2000 (http://www.di.uniba.it/~malerba/software/kdb2000/index.htm), which implements SMOTI, is not tightly integrated with the DBMS ⇒ a tighter integration with a DBMS is needed. SMOTI cannot be applied directly to multi-relational data mining tasks ⇒ its unit of analysis is an individual described by a set of random variables, each of which results in just one single value.

27 From classical to relational data mining. …while in most real-world applications complex objects are described in terms of properties and relations. Example: in spatial domains the effect of a predictor variable at any site may not be limited to that site (spatial autocorrelation). E.g.: there is no communal establishment (school, hospital) in an ED (enumeration district), but many of them are located in the nearby EDs.

28 Multi-relational representation. Augment the data table with information about neighbouring units. [Table 1, target: one row per ED with attributes #MigrInWard, #Establishments, #Employees on 10% sample population, #Migrants (rows for EDs 03BSFA04, 03BSFA18, 03BSFN01, …).] [Table 2, relevant objects: one row per (reference ED, neighbouring ED) pair, e.g. 03BSFA04 paired with 03BSFA05, 03BSFB18 and 03BSFQ01, carrying the same attributes for the neighbouring ED.]

29 Regression problem in relational data mining. Given: a training set O stored in relational tables S = {T_0, T_1, …, T_h} of a relational database D; a set of v primary key constraints PK on the relations in S; a set of w foreign key constraints FK on the relations in S; a target relation T(X_1, …, X_n, Y) ∈ S; a target continuous attribute Y in T, different from the primary key and the foreign keys of T. Find: a multi-relational regression model which predicts the value of Y for an object represented as a tuple in T and its related tuples in S, reached according to foreign key paths.

30 How to work with (multi-)relational data? Either mould the relational database into a single table so that traditional attribute-value algorithms can work on it, by creating a single relation that derives attributes from other joined tables, or by constructing a single relation that summarizes and/or aggregates information found in the other tables (a sketch of this approach follows); or solve mining problems in their original representation, as in FORS (Karalic, 1997), SRT (Kramer, 1996), S-CART (Kramer, 1999) and TILDE-RT (Blockeel, 1998).
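A sketch of the single-table route on two hypothetical toy tables (table and attribute names invented for illustration):

```python
import pandas as pd

# Propositionalization: join the target relation with an aggregated view of
# a related table, so an attribute-value learner can run on one table.
customer = pd.DataFrame({"id": [1, 2], "sale": [100, 250]})
order = pd.DataFrame({"client": [1, 1, 2], "pieces": [5, 7, 3]})

aggregated = (order.groupby("client")["pieces"]
                   .agg(n_orders="count", tot_pieces="sum")
                   .reset_index())
single_table = customer.merge(aggregated, left_on="id",
                              right_on="client", how="left")
print(single_table)   # one row per customer, with derived attributes
```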

31 Strengths and weaknesses of current multi-relational regression methods. Strengths: they solve relational regression problems in their original representation; they are able to exploit background knowledge in the mining process; they learn multi-relational patterns. Weaknesses: knowledge of the data model is not used to guide the search process; data is stored as Prolog facts, not integrated with the database; they do not differentiate global vs. local effects of variables in a regression model. Idea: combine the achievements of the KDD field on the integration of data mining with database systems with the results reported in the ILP field on how to upgrade propositional data mining algorithms to multi-relational representations.

32 Global/local effect + multi-relational model = Mr-SMOTI. Mr-SMOTI upgrades SMOTI to multi-relational representations and tightly integrates the data mining engine with a relational DBMS. It is the relational extension of SMOTI and outputs relational model trees such that:
- each node corresponds to a subset of the training data and is associated with a portion of D intensionally described by a relational pattern;
- each leaf is associated with a (multiple) regression function which may involve predictor variables from several tables in D;
- each variable that is eventually introduced in the left branch of a node must not occur in the right branch of that node;
- relational patterns associated with nodes are represented with regression selection graphs, which extend the selection graph definition (Knobbe, 1999);
- regression selection graphs are translated into SQL expressions stored in XML format.

33 What is a regression selection graph? It corresponds to the tuples describing a subset of the instances from the database, possibly modified by removing the effect of regression steps. Nodes correspond to tables from the database, whose attributes are replaced by the corresponding residuals; arcs correspond to foreign key associations between tables. Open arcs mean "have at least one"; closed arcs mean "have none". [Diagram: a graph over the Customer, Order, Detail and Article tables, e.g. customers with at least one order with Date in {02/09/02} and no order detail with Quantity ≤ 70; Customer carries the CreditLine and Agent attributes.] An illustrative data structure follows.
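A hypothetical encoding of such a graph (not Mr-SMOTI's actual data structure; names and condition strings are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import List

# Regression selection graph: table nodes with attribute conditions, and
# open/closed foreign-key arcs.
@dataclass
class RSGNode:
    table: str
    conditions: List[str] = field(default_factory=list)  # e.g. "Quantity <= 70"

@dataclass
class RSGArc:
    source: RSGNode
    target: RSGNode
    open: bool   # True: "have at least one"; False: "have none"

customer = RSGNode("Customer")
order = RSGNode("Order", ["Date in {02/09/02}"])
arcs = [RSGArc(customer, order, open=True)]   # customers with such an order
```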

34 Relational splitting nodes. A splitting refinement either adds a condition and, in the complementary child, the corresponding negative condition (split condition), or adds a present arc with an open node and, in the complementary child, an absent arc with a closed node (join condition). 1st case: a split condition on the target table, e.g. the Customer-Order graph is split into Sale ≤ 120 and Sale > 120.

35 Relational splitting nodes. 2nd case: a split condition on a non-target node, e.g. the Customer-Order-Detail graph is split on Quantity ≤ 22 in Detail, with the complementary branch holding the negated condition.

36 Relational splitting nodes. [Diagram: a join condition, e.g. the Customer-Order graph refined by adding a present arc and an open Detail node, and the complementary graph with an absent arc and a closed Detail node.]

37 Relational splitting nodes with look-ahead. [Diagram: a join refinement combined in a single step with a split condition on the new node, e.g. adding the Detail node together with the condition Quantity ≤ 22.]

38 Relational regression nodes. A regression refinement adds a regression condition, removing the regressed effect from the remaining attributes of all nodes. E.g. with CreditLine' = CreditLine - (5Sale - 0.5) and Pieces' = Pieces - (-2.5Sale - 3.2), the schemas Customer(Id, Sale, CreditLine, Agent) and Order(Id, Date, Client, Pieces) become Customer(Id, Sale, CreditLine - 5Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5Sale + 3.2).

39 Relational model trees: an example. [Tree: the root graph Customer-Order is refined by the split Date in {02/09/02}; along one path, regression nodes transform the attributes into Customer(Id, Sale, CreditLine - 5Sale + 0.5, Agent) and Order(Id, Date, Client, Pieces + 2.5Sale + 3.2), and then Customer(Id, Sale, CreditLine - 5Sale + 0.5 - 0.1(Pieces + 2.5Sale + 3.2) + 2, Agent).] The model associated with a leaf is expressed in SQL, e.g.:

SELECT Id, AVG(5.25*Sale + 0.1*Pieces - 2.18) AS CreditLine
FROM Customer, Order
WHERE Customer.Id = Order.Client
GROUP BY Customer.Id

40 How to choose the best relational node? Start with a root node associated with the selection graph containing only the target node. Then choose among regression selection graph refinements with a greedy heuristic: use binary splits for simplicity; for each refinement, get the complementary refinement; store the regression coefficients in order to compute residuals on continuous attributes; choose the best refinement based on the evaluation functions.

41 Evaluating relational splitting nodes. [Diagram: a candidate splitting refinement and its complementary refinement on the Customer-Order graph.]

42 Evaluating relational regression nodes. ρ(t) = min{R(t), σ(t')}, where R(t) is the resubstitution error computed on the tuples extracted by the regression selection graph associated with t, and t' is the best splitting node following t.

43 Stopping criteria.
1. The first requires the number of target objects in each node to be greater than a minimum value.
2. The second operates when all continuous attributes along the path from the root to the current node have been used in regression steps and there is no "add open node and present arc" refinement that would include new continuous attributes.
3. The third stops the growth when the coefficient of determination is greater than a minimum value.

44 Mr-SMOTI: some details. Mr-SMOTI has been implemented as a component of the KDD system MURENA. MURENA is implemented in Java and interfaces an Oracle database. http://www.di.uniba.it/%7Ececi/micFiles/systems/The%20MURENA%20project.html

45 Empirical evaluation on laboratory-sized data. [Chart.]

46 Empirical evaluation on laboratory-sized data. Wilcoxon test (alpha = 0.05) … [Table.]

47 Empirical evaluation on real data. [Table.]

48 Improving efficiency by materializing intermediate results. [Chart.]

49 Questions?

