Mining databases with different schema: Integrating incompatible classifiers Andreas L Prodromidis Salvatore Stolfo Dept of Computer Science Columbia University.


1 Mining databases with different schema: Integrating incompatible classifiers. Andreas L Prodromidis, Salvatore Stolfo. Dept of Computer Science, Columbia University. andreas@cs.columbia.edu, sol@cs.columbia.edu

2 Reference material
- http://www.cs.columbia.edu/~sal/JAM/PROJECT/recent-project-papers.htm
- http://www.cs.columbia.edu/~andreas/publications/publications.html
- Information has been updated in Andreas’ Ph.D. thesis, "Management of Intelligent Learning Agents in Distributed Data Mining Systems", October 1999.

3 Data Mining and Data Schema Mismatch
- Introduction
- Database Compatibility
- Meta-learning
- Bridging Methods
- Experiments and Evaluations
- Conclusion

4 Introduction
- The myth of an entirely local database
- Can one algorithm give you everything?
- There is distributed DM (JAM), and then there is distributed DM with different schemas
- Prediction, machine learning and meta-learning

5 Compatibility
- Data about the same topic, for example credit card transactions
- Different banks record and store information differently
- The same bank’s database will change over time
- The “incompatible-schema” problem

6 Compatibility
- Similar but different data yield different classifiers
- Classifiers depend on the structure of the data
- This has recently been recognized as a problem hampering company mergers

7 Meta-Learning
- Why? A way to deal with the scaling problem of distributed data sources
- What? Deriving a higher level of information from already learned classifiers
  – Meta-classifiers are defined recursively as collections of classifiers structured in multi-level trees; determining the optimal set of classifiers is a combinatorial problem
  – Meta-classifiers must be pruned to be efficient
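The recursive definition above can be illustrated with a minimal Python sketch. It is not the JAM implementation: the class name `MetaClassifier`, the plurality-vote combiner, and the three toy rules are all assumptions made for illustration. It only shows how a node whose children are classifiers (base or meta) forms a multi-level tree.

```python
from collections import Counter
from typing import Callable, Sequence, Union

Record = dict
Label = str
BaseClassifier = Callable[[Record], Label]


class MetaClassifier:
    """A tree node whose children are base classifiers or other meta-classifiers."""

    def __init__(self, children: Sequence[Union["MetaClassifier", BaseClassifier]]):
        if not children:
            raise ValueError("a meta-classifier needs at least one child")
        self.children = list(children)

    def __call__(self, record: Record) -> Label:
        # Ask every child (base or meta) for a prediction, then combine.
        votes = [child(record) for child in self.children]
        # Plurality vote is an assumed combiner; ties go to the label seen first.
        return Counter(votes).most_common(1)[0][0]

    def size(self) -> int:
        """Count base-classifier leaves in this tree (a leaf shared by two nodes counts twice)."""
        return sum(
            child.size() if isinstance(child, MetaClassifier) else 1
            for child in self.children
        )


# Hypothetical base classifiers over hypothetical attributes.
def by_amount(r: Record) -> Label:
    return "fraud" if r["amount"] > 1000 else "legit"

def by_hour(r: Record) -> Label:
    return "fraud" if r["hour"] < 5 else "legit"

def always_legit(r: Record) -> Label:
    return "legit"

local = MetaClassifier([by_amount, by_hour])           # lower level of the tree
top = MetaClassifier([local, by_amount, always_legit])  # upper level
```

Pruning, in this picture, amounts to dropping children that do not improve the combined prediction. With n candidate classifiers there are 2^n possible subsets, which is why the choice is combinatorial.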

8 Meta-learning

9 Meta-learning
- Methods:
  – voting
  – stacking
  – SCANN: stacking, correspondence analysis and nearest neighbor
  – Other methods: bagging, boosting, refereeing, arbitrating
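Of the methods listed, voting and stacking are the simplest to contrast. The sketch below is illustrative only. The function names, the lookup-table meta-learner, and the toy rules `c1`, `c2`, `c3` are assumptions, not taken from the presentation. Voting counts base predictions. Stacking learns, from a validation set, which combinations of base predictions correspond to which true label.

```python
from collections import Counter, defaultdict


def vote(predictions):
    """Voting: return the label predicted by the most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def train_stacker(base_classifiers, validation_set):
    """Stacking: learn a meta-level rule from base predictions to true labels.

    validation_set is a list of (record, true_label) pairs. The meta-learner
    here is a plain lookup table, chosen only to keep the sketch short.
    """
    table = defaultdict(Counter)
    for record, label in validation_set:
        key = tuple(clf(record) for clf in base_classifiers)
        table[key][label] += 1

    def stacked(record):
        key = tuple(clf(record) for clf in base_classifiers)
        if key in table:
            return table[key].most_common(1)[0][0]
        return vote(key)  # unseen combination: fall back to voting

    return stacked


# Hypothetical base classifiers on a hypothetical "amount" attribute.
def c1(r): return "fraud" if r["amount"] > 100 else "legit"
def c2(r): return "fraud" if r["amount"] > 500 else "legit"
def c3(r): return "legit"

validation = [
    ({"amount": 50}, "legit"),
    ({"amount": 200}, "fraud"),
    ({"amount": 300}, "fraud"),
    ({"amount": 800}, "fraud"),
]
stacked = train_stacker([c1, c2, c3], validation)
```

On a record with amount 250, the three base predictions are fraud, legit, legit. Voting returns "legit", while the stacked classifier returns "fraud" because the validation data showed that this combination of predictions goes with fraud.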

10 Bridging Methods
- Databases with the same schema
- Schema(DBa) = {A1, A2, A3, …, An, C}
- Schema(DBb) = {B1, B2, B3, …, Bn, C}
- Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)

11 Bridging Methods
- Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)
  – Missing data and attribute value predictions
- Schema(DBa) = {A1, A2, A3, …, An, An+1, C}
- Schema(DBb) = {B1, B2, B3, …, Bn, Bn+1, C}
  – If the value cannot be predicted: use null, the average for the column, or the most likely value based on recurring attribute combinations in the other columns
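The fallback options on this slide (null, column average, most likely value) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `fill_missing` and its `strategy` parameter are hypothetical, and the "mode" strategy uses the most frequent value in the column alone rather than conditioning on the other columns as the slide suggests.

```python
from collections import Counter
from statistics import mean


def fill_missing(records, attribute, strategy="mean"):
    """Return copies of records with a missing `attribute` filled in.

    A value counts as missing when the key is absent or maps to None.
    """
    known = [r[attribute] for r in records if r.get(attribute) is not None]
    if strategy == "null" or not known:
        fill = None                                  # leave the value unknown
    elif strategy == "mean":
        fill = mean(known)                           # average for the column
    elif strategy == "mode":
        fill = Counter(known).most_common(1)[0][0]   # most frequent value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        {**r, attribute: fill if r.get(attribute) is None else r[attribute]}
        for r in records
    ]


# Hypothetical records: the second one lacks the extra attribute.
rows = [{"a": 10, "kind": "x"}, {"a": None, "kind": "y"}, {"a": 30, "kind": "x"}]
by_mean = fill_missing(rows, "a", "mean")
by_mode = fill_missing(rows, "kind", "mode")
```

The input records are not modified. Each strategy returns new dictionaries so the original data remains available for comparison.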

12 Bridging Methods
- Databases with similar but different attributes
  – Differing sizes and ranges of data values (for example, normalization of time increments)
  – An+1 ≈ Bn+1
  – Bridging methods must translate based on input from data experts or other normalization efforts and probabilities
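As a rough illustration of translating a similar-but-different attribute, the sketch below wraps a classifier with a bridge that converts one attribute into the semantics the classifier expects. The attribute name `elapsed`, the seconds-to-minutes conversion, and the toy classifier are assumptions chosen to match the time-increment example. In practice the translation would come from data experts.

```python
def make_bridge(attribute, translate):
    """Build a function that rewrites one attribute into the target semantics."""
    def bridge(record):
        converted = dict(record)  # copy, so the caller's record is untouched
        if converted.get(attribute) is not None:
            converted[attribute] = translate(converted[attribute])
        return converted
    return bridge


def bridged(classifier, bridge):
    """Wrap a classifier so it accepts records in the foreign schema."""
    return lambda record: classifier(bridge(record))


# Hypothetical: DB_A stores elapsed time in seconds, DB_B in whole minutes.
seconds_to_minutes = make_bridge("elapsed", lambda s: s // 60)

# Hypothetical classifier learned on DB_B (minutes).
def classifier_b(record):
    return "fraud" if record["elapsed"] < 2 else "legit"

# The same classifier, now usable on DB_A records (seconds).
classifier_b_on_a = bridged(classifier_b, seconds_to_minutes)
```

Keeping the bridge separate from the classifier means the classifier itself never has to be retrained when it is imported into a database with different value ranges.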

13 Bridging Methods

14 Experiments and Evaluations
- Working with credit card (CC) data from 2 different banking institutions
  – First Union and Chase
- Using 5 mining algorithms to derive classifiers
  – Decision trees (DT): CART, ID3 and C4.5
  – Naive Bayes (NB): Bayes
  – Rule induction: Ripper, based on IREP
- Using predictive algorithms to fill in missing data with regression methods
  – CART, MARS, locally weighted and linear regression
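Of the regression methods named for filling in missing data, linear regression is the easiest to show in a few lines. The sketch below is a single-predictor least-squares fit, not the experimental setup from the paper. The function names and the attributes `x` and `y` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variance; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(records, target, predictor):
    """Fill missing `target` values using a line fitted on complete records."""
    complete = [r for r in records if r.get(target) is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    return [
        {**r, target: slope * r[predictor] + intercept}
        if r.get(target) is None else dict(r)
        for r in records
    ]


# Hypothetical data in which y = 2x + 1; the last record lacks y.
data = [{"x": 1, "y": 3}, {"x": 2, "y": 5}, {"x": 3, "y": 7}, {"x": 4, "y": None}]
filled = impute_by_regression(data, target="y", predictor="x")
```

The predicted value then stands in for the missing attribute when a classifier trained on the richer schema is applied to the poorer one.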

15 About the data
- Chase Credit Card
  – 500,000 records spanning one year
  – Evenly distributed
  – 20% fraud, 80% non-fraud
- First Union Credit Card
  – 500,000 records spanning one year
  – Unevenly distributed
  – 15% fraud, 85% non-fraud

16 About the differences
- Chase includes 2 attributes not present in First Union data
  – Add two fictitious fields
  – Classifier agents support unknown values
- Chase and First Union define an attribute with different semantics
  – Project Chase values on First Union semantics
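The "fictitious fields" approach can be sketched as padding each record out to the richer schema and marking the added fields as unknown. Everything named here is hypothetical: the `UNKNOWN` marker, the `conform` helper, the attribute names, and the toy rule standing in for a classifier agent that tolerates unknown values.

```python
UNKNOWN = None  # assumed marker for "value not recorded"


def conform(record, target_schema):
    """Project a record onto target_schema.

    Attributes absent from the record become UNKNOWN; attributes not in
    target_schema are dropped.
    """
    return {name: record.get(name, UNKNOWN) for name in target_schema}


def tolerant_rule(record):
    """Hypothetical classifier that skips tests on unknown attributes."""
    extra = record.get("extra_1")
    if extra is not UNKNOWN and extra > 0.9:
        return "fraud"
    return "fraud" if record["amount"] > 1000 else "legit"


# Hypothetical schemas: the richer one has two extra attributes.
rich_schema = ["amount", "merchant", "extra_1", "extra_2"]
poor_record = {"amount": 1500, "merchant": "m1"}
padded = conform(poor_record, rich_schema)
```

After padding, a classifier trained on the richer schema can be run on records from the poorer one, provided it handles the unknown marker without failing.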

17 Charts of Results
- Started out with an estimated saving of $325K to $550K
- 86% to 90% total accuracy with base-level classifiers

18 Other results
- With meta-classifiers used on First Union data, composed of Chase base classifiers and bridging classifiers
  – Accuracy improved to 95%
  – Estimated savings went up to $800K

19 Other results
- With meta-classifiers composed of base classifiers from both Chase and First Union
  – Accuracy improved to almost 98%
  – Estimated savings went up to $900K

20 Conclusion
- There is a lot of ground to cover
- Distributed DM is a viable option for addressing scalability and performance issues
- This paper investigated the idea of using databases with differing schemas and bridging those differences so that classifiers could be built and combined into meta-classifiers

21 Conclusion
- The authors conducted experiments showing that meta-classifiers can be built using real credit card transaction data with different schemas
- These meta-classifiers proved to be reasonably accurate in testing

22 Questions???

