

1 A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments. Anna Atramentov. Major: Computer Science. Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia. Iowa State University, Ames, Iowa, 2003

2 KDD and Relational Data Mining

The term KDD stands for Knowledge Discovery in Databases. Traditional KDD techniques work with instances represented by a single table. Relational Data Mining is a subfield of KDD in which the instances are represented by several tables.

Single-table example (PlayTennis):

Day  Outlook   Temp-re  Humidity  Wind    Play Tennis
d1   Sunny     Hot      High      Weak    No
d2   Sunny     Hot      High      Strong  No
d3   Overcast  Hot      High      Weak    Yes
d4   Overcast  Cold     Normal    Weak    No

Multi-table (relational) example:

Department (ID, Specialization, #Students)
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Staff (ID, Name, Department, Position, Salary)
p1  Dale    d1  Professor          70-80k
p2  Martin  d3  Postdoc            30-40k
p3  Victor  d2  Visitor Scientist  40-50k
p4  David   d3  Professor          80-100k

Graduate Student (ID, Name, GPA, #Publications, Advisor, Department)
s1  John    2.0  4   p1  d3
s2  Lisa    3.5  10  p4  d3
s3  Michel  3.9  3   p4  d4

3 Motivation

Importance of relational learning:
- Growth of data stored in MRDBs (multi-relational databases)
- Techniques for learning from unstructured data often extract the data into an MRDB

Promising approach to relational learning:
- MRDM (Multi-Relational Data Mining) framework, developed by Knobbe et al. (1999)
- MRDTL (Multi-Relational Decision Tree Learning) algorithm, implemented by Leiva (2002)

Goals:
- Speed up the MRDM framework, and in particular the MRDTL algorithm
- Incorporate handling of missing values
- Perform a more extensive experimental evaluation of the algorithm

4 Relational Learning Literature

- Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
- First-order extensions of probabilistic models, combining first-order logic and probability theory:
  - Relational Bayesian Networks (Jaeger, 1997)
  - Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
  - Bayesian Logic Programs (Kersting et al., 2000)
- Multi-Relational Data Mining (Knobbe et al., 1999)
- Propositionalization methods (Krogel and Wrobel, 2001)
- PRM extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
- Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)

5 Problem Formulation

Given: data stored in a relational database.
Goal: build a decision tree for predicting the target attribute in the target table.

Example of a multi-relational database (the Department, Staff and Graduate Student instance tables from slide 2), with schema:

Department: ID, Specialization, #Students
Staff: ID, Name, Department, Position, Salary
Grad. Student: ID, Name, GPA, #Publications, Advisor, Department

6 Propositional decision tree algorithm. Construction phase

Tree_induction(D: data)
    A = optimal_attribute(D)
    if stopping_criterion(D)
        return leaf(D)
    else
        D_left := split(D, A)
        D_right := split_complement(D, A)
        child_left := Tree_induction(D_left)
        child_right := Tree_induction(D_right)
        return node(A, child_left, child_right)

The slide traces the algorithm on the PlayTennis table: the root {d1, d2, d3, d4} splits on Outlook (sunny vs. not sunny) into {d1, d2} (leaf: No) and {d3, d4}; the latter then splits on Temperature (hot vs. not hot) into leaves Yes {d3} and No {d4}.
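As a concrete sketch, the pseudocode above can be rendered in Python and run on the slide's PlayTennis table (an illustration only: binary equality splits, entropy-based attribute choice, ties broken by sorted value order):

```python
import math
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def best_split(rows, attrs, target):
    # optimal_attribute(D): the binary test (attr == value) with max info gain
    base, best = entropy(rows, target), None
    for a in attrs:
        for v in sorted({r[a] for r in rows}):
            left = [r for r in rows if r[a] == v]
            right = [r for r in rows if r[a] != v]
            if not left or not right:
                continue
            rem = (len(left) * entropy(left, target)
                   + len(right) * entropy(right, target)) / len(rows)
            if best is None or base - rem > best[0]:
                best = (base - rem, a, v)
    return best

def tree_induction(rows, attrs, target):
    if len({r[target] for r in rows}) == 1:        # stopping_criterion(D)
        return rows[0][target]                      # leaf(D)
    split = best_split(rows, attrs, target)
    if split is None:                               # no useful split remains
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    _, a, v = split
    left = [r for r in rows if r[a] == v]           # split(D, A)
    right = [r for r in rows if r[a] != v]          # split_complement(D, A)
    return (a, v,
            tree_induction(left, attrs, target),    # node(A, child_left,
            tree_induction(right, attrs, target))   #         child_right)

# the PlayTennis table from the slide
data = [
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High",
     "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High",
     "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High",
     "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cold", "Humidity": "Normal",
     "Wind": "Weak", "Play": "No"},
]
tree = tree_induction(data, ["Outlook", "Temp", "Humidity", "Wind"], "Play")
```

On this data the root splits on Outlook and the non-sunny branch on Temperature, matching the tree drawn on the slide.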

7 MR setting. Splitting data with Selection Graphs

In the multi-relational setting, the instances (the Department, Staff and Graduate Student tables from slide 2) are split by a pair of complementary selection graphs. The slide shows the graph Staff – Grad.Student with condition GPA > 2.0, which selects the staff members who have at least one graduate student with GPA > 2.0, and its complement, which selects the remaining staff members; together the two graphs partition the Staff table.

8 What is a selection graph?

- It corresponds to a subset of the instances from the target table
- Nodes correspond to the tables in the database
- Edges correspond to the associations between tables
- Open edge = "have at least one"
- Closed edge = "have none of"

Examples on the slide: Staff with a Grad.Student node and condition GPA > 3.9, and Department with Staff and Grad.Student nodes and condition Specialization = math.

9 Transforming selection graphs into SQL queries

Generic query:

    select distinct T0.primary_key
    from table_list
    where join_list and condition_list

Examples:

Staff node with condition Position = Professor:

    select distinct T0.id
    from Staff T0
    where T0.Position = 'Professor'

Staff with an open edge to Graduate_Student (at least one advisee):

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor

Staff with a closed edge to Graduate_Student (no advisees):

    select distinct T0.id
    from Staff T0
    where T0.id not in (select T1.id from Graduate_Student T1)

Staff with at least one advisee but none with GPA > 3.9:

    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
      and T0.id not in (select T1.id
                        from Graduate_Student T1
                        where T1.GPA > 3.9)
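The translation can be sketched as a small generator that walks the graph and emits the generic query (a simplified illustration: the node/edge tuple representation is an assumption, and only open edges and node conditions are handled; closed edges would add the `not in` subqueries shown above):

```python
def selection_graph_to_sql(nodes, edges, target_key="id"):
    """nodes: list of (alias, table, [condition strings]);
    edges: list of ((parent_alias, col), (child_alias, col)) open associations.
    Emits: select distinct T0.<key> from table_list where join_list and condition_list
    """
    table_list = ", ".join(f"{t} {a}" for a, t, _ in nodes)
    join_list = [f"{pa}.{pc} = {ca}.{cc}" for (pa, pc), (ca, cc) in edges]
    cond_list = [c for _, _, conds in nodes for c in conds]
    where = " and ".join(join_list + cond_list) or "1 = 1"
    return f"select distinct T0.{target_key} from {table_list} where {where}"

# Staff with at least one graduate student whose GPA exceeds 3.9
sql = selection_graph_to_sql(
    nodes=[("T0", "Staff", []), ("T1", "Graduate_Student", ["T1.GPA > 3.9"])],
    edges=[(("T0", "id"), ("T1", "Advisor"))],
)
```

The result is the open-edge query from the slide with the GPA condition appended to the condition list.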

10 MR decision tree

- Each node contains a selection graph
- Each child's selection graph is a supergraph of the parent's selection graph

(The slide shows a tree whose root holds the Staff selection graph and whose children refine it, e.g. with a Grad.Student node and the condition GPA > 3.9.)

11 How to choose selection graphs in nodes?

Problem: there are too many supergraph selection graphs to choose from at each node.
Solution:
- start with an initial selection graph
- use a greedy heuristic to choose supergraph selection graphs: refinements
- use binary splits for simplicity
- for each refinement, get the complement refinement
- choose the best refinement based on the information gain criterion

Problem: some potentially good refinements may give no immediate benefit.
Solution: a look-ahead capability.
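The greedy choice can be sketched as follows, given class-count dictionaries for each candidate refinement and its complement (the counts and refinement names here are hypothetical, for illustration only):

```python
import math

def entropy(counts):
    # counts: {class label: number of target-table instances}
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

def gain(parent, refined, complement):
    # information gain of splitting parent into refinement + complement
    n = sum(parent.values())
    return entropy(parent) - (
        sum(refined.values()) * entropy(refined)
        + sum(complement.values()) * entropy(complement)) / n

# hypothetical salary-class counts for two candidate refinements
parent = {"70-80k": 2, "80-100k": 2}
candidates = {
    "Position = Professor": ({"70-80k": 2, "80-100k": 1}, {"80-100k": 1}),
    "GPA > 2.0":            ({"70-80k": 2}, {"80-100k": 2}),
}
best = max(candidates, key=lambda r: gain(parent, *candidates[r]))
```

Here the GPA refinement separates the two classes perfectly, so it achieves the maximal gain and is chosen.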

12 Refinements of selection graph

Two kinds of refinement of the current selection graph:
- add a condition to a node – explores attribute information in the tables
- add a present edge and open node – explores relational properties between the tables

13 Refinements of selection graph. Add the condition Position = Professor to the Staff node; the complement refinement adds Position != Professor.

14 Refinements of selection graph. Add the condition GPA > 2.0 to the Grad.Student node, together with its complement refinement.

15 Refinements of selection graph. Add the condition #Students > 200 to the Department node, together with its complement refinement.

16 Refinements of selection graph. Add a present edge and open node, together with its complement refinement. Note: information gain = 0 for this refinement.

17 Refinements of selection graph. Add a present edge and open node (another association), together with its complement refinement.

18 Refinements of selection graph. Add a present edge and open node, together with its complement refinement.

19 Refinements of selection graph. Add a present edge and open Grad.Student node, together with its complement refinement.

20 Look ahead capability. A refinement that adds a present edge and an open Department node to the selection graph, with its complement.

21 Look ahead capability. A two-step (look-ahead) refinement: add the Department edge and, in the same step, the condition #Students > 200, with its complement refinement.

22 MRDTL algorithm. Construction phase

For each non-leaf node:
- consider all possible refinements of the node's selection graph, and their complements
- choose the best pair based on the information gain criterion
- create the children nodes

23 MRDTL algorithm. Classification phase

For each leaf:
- apply the selection graph of the leaf to the test data
- classify the resulting instances with the classification of the leaf

(In the slide's example tree, one leaf, with Department Spec = math, is labeled 70-80k, and another, with Spec = physics, is labeled 80-100k.)

24 The most time consuming operations of MRDTL

Entropy associated with a selection graph, where n_i is the number of selected target-table instances in class c_i and N = Σ n_i:

    E = − Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i:

    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Grad_Student, Department
    where join_list and condition_list
    group by Staff.Salary

The result of the query is the list of pairs (c_i, n_i). (The slide shows a Staff table whose Salary values have been mapped to classes c1, c2, ....)
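The grouped-count query and the entropy computation can be illustrated end-to-end with an in-memory SQLite table (a toy single-table stand-in: the joins and condition list are omitted, and the Salary column already holds the class labels c1/c2 as on the slide):

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Staff (ID text primary key, Salary text);
insert into Staff values
  ('p1','c1'), ('p2','c1'), ('p3','c1'), ('p4','c1'),
  ('p5','c2'), ('p6','c2');
""")

# the grouped-count query from the slide (join_list/condition_list omitted)
rows = con.execute(
    "select Salary, count(distinct ID) from Staff group by Salary").fetchall()

# E = -sum_i (n_i / N) log(n_i / N) over the (c_i, n_i) pairs
N = sum(n for _, n in rows)
E = -sum(n / N * math.log2(n / N) for _, n in rows)
```

One such query per candidate refinement is what makes this the dominant cost of the algorithm.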

25 The most time consuming operations of MRDTL

The entropy associated with each candidate refinement requires a query of the same form:

    select distinct Staff.Salary, count(distinct Staff.ID)
    from table_list
    where join_list and condition_list
    group by Staff.Salary

26 A way to speed up – eliminate redundant calculations

Problem: for a selection graph with 162 nodes, the time to execute a single query is more than 3 minutes!

Redundancy in the calculation: for this selection graph, the tables Staff and Grad.Student would be joined over and over for all the children refinements in the tree.

A way to fix it: perform the join only once and save the result for all further calculations.

27 Speed Up Method. Sufficient tables

The sufficient table S materialized for the working selection graph (over Staff, Grad.Student with GPA > 3.9, and Department with Specialization = math):

Staff_ID  Grad.Student_ID  Dep_ID  Salary
p1        s1               d1      c1
p2        s1               d1      c1
p3        s6               d4      c1
p4        s3               d3      c1
p5        s1               d2      c2
p6        s9               d3      c2
...

28 Speed Up Method. Sufficient tables

Entropy associated with this selection graph (same formula as before):

    E = − Σ_i (n_i / N) log (n_i / N)

Query associated with the counts n_i, now run against the sufficient table S from slide 27:

    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary

The result of the query is the list of pairs (c_i, n_i).

29 Speed Up Method. Sufficient tables

Query associated with the "add condition" refinement (a condition on attribute A of table X):

    select S.Salary, X.A, count(distinct S.Staff_ID)
    from S, X
    where S.X_ID = X.ID
    group by S.Salary, X.A

Calculation for the complement refinement:

    count(c_i, R_comp(S)) = count(c_i, S) − count(c_i, R(S))

30 Speed Up Method. Sufficient tables

Query associated with the "add edge" refinement (an edge e between tables X and Y):

    select S.Salary, count(distinct S.Staff_ID)
    from S, X, Y
    where S.X_ID = X.ID and e.cond
    group by S.Salary

Calculation for the complement refinement:

    count(c_i, R_comp(S)) = count(c_i, S) − count(c_i, R(S))
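The complement-count identity means the complement refinement needs no extra database query; a minimal sketch, with toy counts assumed for illustration:

```python
def complement_counts(parent, refined):
    # count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S)):
    # subtract the refinement's per-class counts from the parent's
    return {c: parent[c] - refined.get(c, 0) for c in parent}

# hypothetical per-class counts obtained from the sufficient table S
parent = {"70-80k": 3, "80-100k": 2, "30-40k": 1}   # count(ci, S)
refined = {"70-80k": 1, "80-100k": 2}               # count(ci, R(S))
comp = complement_counts(parent, refined)           # count(ci, Rcomp(S))
```

Only one query per refinement is issued; the complement's counts, and hence its entropy, come from this subtraction.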

31 Speed Up Method

- Significant speed-up in obtaining the counts needed for the calculation of entropy and information gain
- The speed-up is achieved at the cost of the additional space used by the algorithm

32 Handling Missing Values

For each attribute that has missing values, we build a Naïve Bayes model. For example, to fill in the missing Staff.Position values (the slide shows '?' in the Position column for p1, p2 and p4), Staff.Position is predicted from Staff.Name, Staff.Dep and Department.Spec, using conditional probability tables P(a | b) estimated from the records where Position = b is observed.

33 Handling Missing Values

The most probable value for the missing attribute is then calculated by the formula:

    P(v_i | X1.A1, X2.A2, X3.A3, ...)
      = P(X1.A1, X2.A2, X3.A3, ... | v_i) P(v_i) / P(X1.A1, X2.A2, X3.A3, ...)
      = P(X1.A1 | v_i) P(X2.A2 | v_i) P(X3.A3 | v_i) ... P(v_i) / P(X1.A1, X2.A2, X3.A3, ...)

(using the Naïve Bayes independence assumption in the second step).
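A minimal sketch of this imputation, assuming hypothetical complete Staff rows and Laplace-smoothed probability estimates (the helper name, toy data and smoothing choice are illustrative, not from the thesis; the constant denominator is dropped, as only the argmax matters):

```python
def impute(candidates, evidence, rows, target):
    # argmax over v of P(v) * prod_a P(evidence[a] | v), Laplace-smoothed
    def score(v):
        sub = [r for r in rows if r[target] == v]
        s = (len(sub) + 1) / (len(rows) + len(candidates))      # P(v)
        for a, obs in evidence.items():
            hits = sum(1 for r in sub if r[a] == obs)
            vals = len({r[a] for r in rows})                    # domain size
            s *= (hits + 1) / (len(sub) + vals)                 # P(a=obs | v)
        return s
    return max(candidates, key=score)

# hypothetical complete Staff rows used to fill in a missing Position
rows = [
    {"Dep": "d1", "Spec": "Math", "Position": "Professor"},
    {"Dep": "d1", "Spec": "Math", "Position": "Professor"},
    {"Dep": "d2", "Spec": "Physics", "Position": "Postdoc"},
]
guess = impute({"Professor", "Postdoc"},
               {"Dep": "d1", "Spec": "Math"}, rows, "Position")
```

With this evidence both conditional factors favor "Professor", so it is chosen as the most probable value.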

34 Experimental results. Mutagenesis

- The most widely used database in ILP. It describes molecules of certain nitroaromatic compounds.
- Goal: predict their mutagenic activity (the label attribute) – the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
- Two subsets: regression-friendly (188 molecules) and regression-unfriendly (42 molecules). We used only the regression-friendly subset.
- 5 levels of background knowledge, B0 through B4, providing increasingly rich descriptions of the examples. We used level B2.

35 Experimental results. Mutagenesis

Results of 10-fold cross-validation for the regression-friendly set (best-known reported accuracy: 86%):

Data set     Accuracy  Sel. graph size (max) / Tree size / Time with and without speed-up
mutagenesis  87.5%     3928.4552.15 (the remaining column values are run together in the transcript)

(Figure: schema of the mutagenesis database.)

36 Experimental results. KDD Cup 2001

- The database consists of a variety of details about the various genes of one particular type of organism. Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
- 2 tasks: prediction of gene/protein localization and function
- 862 training genes, 381 test genes
- Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table

37 Experimental results. KDD Cup 2001

Localization task (best-known reported accuracy: 72.1%):

                                 Accuracy  Sel. graph size (max) / Tree size  Time with speed-up  Time without speed-up
With handling missing values     76.11%    19213                              202.9 secs          1256.38 secs
Without handling missing values  50.14%    33575                              550.76 secs         2257.20 secs

Function task (best-known reported accuracy: 93.6%):

                                 Accuracy  Sel. graph size (max) / Tree size (max)  Time with speed-up  Time without speed-up
With handling missing values     91.44%    963                                      151.19 secs         307.83 secs
Without handling missing values  88.56%    919                                      61.29 secs          118.41 secs

(In both tables the selection-graph-size and tree-size values are run together in the transcript.)

38 Experimental results. PKDD 2001 Discovery Challenge

- The database consists of 5 tables: PATIENT_INFO, DIAGNOSIS, THROMBOSIS, ANTIBODY_EXAM, ANA_PATTERN
- The target table consists of 1239 records
- The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table

Results for 5×2 cross-validation (best-known reported accuracy: 99.28%):

Data set    Accuracy  Sel. graph size (max) / Tree size / Time with and without speed-up
thrombosis  98.1%     3171127.75198.22 (the remaining column values are run together in the transcript)

39 Summary

- The algorithm significantly outperforms MRDTL in terms of running time
- The accuracy results are comparable with the best reported results obtained using different data mining algorithms

Future work:
- Incorporation of more sophisticated techniques for handling missing values
- Incorporation of more sophisticated pruning techniques or complexity regularization
- More extensive evaluation of MRDTL on real-world data sets
- Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
- Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration

40 Thanks to

- Dr. Honavar, for providing guidance, help and support throughout this research
- Colleagues from the Artificial Intelligence Lab, for various helpful discussions
- My committee members, Drena Dobbs and Yan-Bin Jia, for their help
- Professors and lecturers of the Computer Science department, for the knowledge they gave me through lectures and discussions
- Iowa State University and the Computer Science department, for funding this research in part

