Published by Oswald Trevor Shepherd. Modified about 1 year ago.

1
Mining databases with different schema: Integrating incompatible classifiers
Andreas L. Prodromidis, Salvatore Stolfo
Dept. of Computer Science, Columbia University

2
Reference material
• PROJECT/recent-project-papers.htm
• blications/publications.html
• Information has been updated in Andreas' Ph.D. thesis, "Management of Intelligent Learning Agents in Distributed Data Mining Systems", October 1999.

3
Data Mining and Data Schema Mismatch
• Introduction
• Database Compatibility
• Meta-learning
• Bridging Methods
• Experiments and Evaluations
• Conclusion

4
Introduction
• The myth of an entirely local database
• Can one algorithm give you everything?
• There is distributed DM (JAM), and then there is distributed DM with different schemas
• Prediction, machine learning and meta-learning

5
Compatibility
• Data about the same topic, for example credit card transactions
• Different banks record and store information differently
• The same bank's database will change over time
• The "incompatible-schema" problem

6
Compatibility
• Similar but different data yield different classifiers
• Classifiers depend on the structure of the data
• Lately this has been discovered to be a problem hampering company mergers

7
Meta-Learning
• Why? A way to deal with the scaling problem of distributed data sources.
• What? A way of deriving higher-level information from already learned classifiers.
  – Meta-classifiers are defined recursively as collections of classifiers structured in multi-level trees. Determining the optimal set of classifiers is a combinatorial problem.
  – The trees must be pruned to be efficient.
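The recursive definition on this slide can be sketched as a small tree structure. This is only an illustration, not the JAM implementation: the names `Base`, `Meta`, `predict` and `count_base`, the record attributes, and the choice of majority voting at each internal node are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Base:
    """Leaf node: a base classifier, here any function from record to label."""
    fn: Callable[[dict], str]


@dataclass
class Meta:
    """Internal node: combines its children, which may be Base or Meta nodes."""
    children: List[Union[Base, "Meta"]]


def predict(node, record):
    """Evaluate the tree bottom-up; Meta nodes take a majority vote (assumption)."""
    if isinstance(node, Base):
        return node.fn(record)
    votes = Counter(predict(child, record) for child in node.children)
    return votes.most_common(1)[0][0]


def count_base(node):
    """Number of base classifier evaluations needed for one prediction."""
    if isinstance(node, Base):
        return 1
    return sum(count_base(child) for child in node.children)


# Hypothetical base classifiers over made-up attributes.
by_amount = Base(lambda r: "fraud" if r["amount"] > 1000 else "legit")
by_hour = Base(lambda r: "fraud" if r["hour"] < 6 else "legit")
always_legit = Base(lambda r: "legit")

# A two-level tree: one meta-classifier nested inside another.
tree = Meta([Meta([by_amount, by_hour, always_legit]), by_amount, always_legit])
print(predict(tree, {"amount": 5000, "hour": 3}))
print(count_base(tree))
```

With n candidate classifiers there are 2^n - 1 non-empty subsets to choose from, which is why the selection is combinatorial and why pruning, rather than exhaustive search, is the practical route.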

8
Meta-learning

9
Meta-learning
• Methods:
  – voting
  – stacking
  – SCANN: stacking, correspondence analysis and nearest neighbor
  – Other methods
    • bagging
    • boosting
    • refereeing
    • arbitrating
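To make the difference between the first two methods concrete, here is a minimal sketch. Voting simply counts base predictions; stacking trains a meta-level learner on those predictions using a separate validation set. The meta-learner below is a deliberately simple lookup table, chosen for brevity and not taken from the presentation; all function names, attributes and data are hypothetical.

```python
from collections import Counter, defaultdict


def vote(predictions):
    """Majority vote over a sequence of labels."""
    return Counter(predictions).most_common(1)[0][0]


def train_stacker(base_classifiers, validation):
    """Learn a table: tuple of base predictions -> most frequent true label."""
    table = defaultdict(Counter)
    for record, label in validation:
        key = tuple(clf(record) for clf in base_classifiers)
        table[key][label] += 1
    learned = {key: counts.most_common(1)[0][0] for key, counts in table.items()}

    def stacked(record):
        preds = tuple(clf(record) for clf in base_classifiers)
        # Fall back to voting for prediction patterns never seen in validation.
        return learned.get(preds, vote(preds))

    return stacked


# Hypothetical base classifiers; the third one is deliberately biased.
base = [
    lambda r: "fraud" if r["amount"] > 1000 else "legit",
    lambda r: "fraud" if r["hour"] < 6 else "legit",
    lambda r: "fraud",
]

# Made-up validation data: fraud only when the amount is high AND the hour is early.
validation = [
    ({"amount": 5000, "hour": 3}, "fraud"),
    ({"amount": 5000, "hour": 14}, "legit"),
    ({"amount": 20, "hour": 3}, "legit"),
    ({"amount": 20, "hour": 14}, "legit"),
]

stacked = train_stacker(base, validation)
rec = {"amount": 5000, "hour": 14}
print(vote([clf(rec) for clf in base]))  # voting is swayed by the biased classifier
print(stacked(rec))                      # stacking has learned to discount it
```

The point of the example is that a stacked meta-classifier can learn which combinations of base predictions are trustworthy, while plain voting treats every base classifier equally.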

10
Bridging Methods
• Databases with the same schema
• Schema(DBa) = {A1, A2, A3, …, An, C}
• Schema(DBb) = {B1, B2, B3, …, Bn, C}
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)
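A quick way to see which of these cases applies is to compare the two attribute sets directly. The sketch below assumes that corresponding attributes already carry the same name in both databases, which in practice would require a mapping from a data expert; the function names and the example attributes are hypothetical, and "C" stands for the class label as on the slide.

```python
def compare_schemas(schema_a, schema_b, class_attr="C"):
    """Split the attributes into shared ones and those unique to each side."""
    a = set(schema_a) - {class_attr}
    b = set(schema_b) - {class_attr}
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


def is_compatible(schema_a, schema_b, class_attr="C"):
    """True when both databases expose exactly the same attributes."""
    diff = compare_schemas(schema_a, schema_b, class_attr)
    return not diff["only_a"] and not diff["only_b"]


# Made-up schemas: db_a has one attribute (zip) that db_b lacks.
db_a = ["amount", "hour", "merchant", "zip", "C"]
db_b = ["amount", "hour", "merchant", "C"]

diff = compare_schemas(db_a, db_b)
print(diff)
print(is_compatible(db_a, db_b))
```

Any attribute reported under `only_a` or `only_b` is one that a classifier trained on the other database cannot read without some form of bridging.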

11
Bridging Methods
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)
  – Missing data and attribute value prediction
• Schema(DBa) = {A1, A2, A3, …, An, An+1, C}
• Schema(DBb) = {B1, B2, B3, …, Bn, Bn+1, C}
  – If the value cannot be predicted: use null, the column average, or the most likely value based on recurring combinations of attributes in other columns.
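The three fallbacks named on this slide (null, column average, most likely value given recurring attribute combinations) can be sketched as follows. The helper names, attribute names and data are invented for illustration and are not from the presentation.

```python
from collections import Counter, defaultdict
from statistics import mean


def column_average(rows, target):
    """Average of the known values in one numeric column."""
    return mean(r[target] for r in rows if r.get(target) is not None)


def most_likely_by_combination(rows, keys, target):
    """Map each combination of `keys` values to its most common `target` value."""
    seen = defaultdict(Counter)
    for r in rows:
        if r.get(target) is not None:
            seen[tuple(r[k] for k in keys)][r[target]] += 1
    return {combo: counts.most_common(1)[0][0] for combo, counts in seen.items()}


def fill(row, target, lookup, keys, default=None):
    """Fill a missing `target`: use the lookup first, then `default` (None = null)."""
    if row.get(target) is not None:
        return row
    combo = tuple(row[k] for k in keys)
    return {**row, target: lookup.get(combo, default)}


# Made-up rows from the database that does record the extra attribute.
rows_a = [
    {"merchant_type": "online", "country": "US", "risk_code": 3},
    {"merchant_type": "online", "country": "US", "risk_code": 3},
    {"merchant_type": "online", "country": "US", "risk_code": 1},
    {"merchant_type": "store", "country": "US", "risk_code": 1},
]
keys = ["merchant_type", "country"]
lookup = most_likely_by_combination(rows_a, keys, "risk_code")
avg = column_average(rows_a, "risk_code")

# Rows from the other database, which lacks risk_code entirely.
known_combo = fill({"merchant_type": "online", "country": "US"}, "risk_code", lookup, keys, avg)
new_combo = fill({"merchant_type": "kiosk", "country": "FR"}, "risk_code", lookup, keys, avg)
left_null = fill({"merchant_type": "kiosk", "country": "FR"}, "risk_code", lookup, keys)
print(known_combo, new_combo, left_null)
```

Leaving the value as null only works if the downstream classifier tolerates unknown values; otherwise the average or the combination-based guess is the safer default.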

12
Bridging Methods
• Databases with similar but different attributes
  – Changing sizes and ranges of data values (for example, normalization of time increments)
  – An+1 ≈ Bn+1
  – Bridging methods must translate values based on input from data experts, or on other normalization efforts and probabilities
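As a sketch of what such a translation might look like, the code below applies per-attribute conversion functions to a record. It assumes one database stores time as seconds since midnight and a score on a 0 to 100 scale, while the other expects an hour of day and a 0 to 1 score. These conventions, and every name in the code, are made up for the example; real mappings would come from the data experts the slide mentions.

```python
def seconds_to_hour(seconds_since_midnight):
    """Coarsen a time value from seconds to whole hours."""
    return seconds_since_midnight // 3600


def rescale(value, src, dst):
    """Linearly map `value` from the range `src` to the range `dst`."""
    (s_lo, s_hi), (d_lo, d_hi) = src, dst
    return d_lo + (value - s_lo) * (d_hi - d_lo) / (s_hi - s_lo)


def bridge(record, translators):
    """Apply a translator to each attribute that has one; pass the rest through."""
    return {
        name: translators[name](value) if name in translators else value
        for name, value in record.items()
    }


# Hypothetical expert-supplied translators, one per mismatched attribute.
translators = {
    "time": seconds_to_hour,
    "score": lambda v: rescale(v, (0, 100), (0.0, 1.0)),
}

record = {"amount": 120.0, "time": 52200, "score": 50}
translated = bridge(record, translators)
print(translated)
```

Keeping the translators in a plain dictionary makes it easy to review each mapping separately, which matters when the mappings encode a domain expert's judgment.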

13
Bridging Methods

14
Experiments and Evaluations
• Working with credit card data from 2 different banking institutions
  – First Union and Chase
• Using 5 mining algorithms to derive classifiers
  – Decision trees: CART, ID3 and C4.5
  – Naive Bayes: Bayes
  – Rule induction: Ripper (based on IREP)
• Using predictive algorithms to fill in missing data with regression methods
  – CART, MARS, locally weighted regression and linear regression
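To illustrate the idea of filling in a missing attribute by regression, here is a minimal single-variable least-squares sketch. It covers only the simplest of the methods listed on this slide (linear regression) and is not the experimental setup from the paper; the function names and numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_bridge(xs, ys):
    """Return a function that predicts the missing attribute from a known one."""
    slope, intercept = fit_line(xs, ys)
    return lambda x: slope * x + intercept


# Train where both attributes are recorded (made-up values).
known_attr = [1, 2, 3, 4]
extra_attr = [3, 5, 7, 9]
predict_extra = make_bridge(known_attr, extra_attr)

# Apply to a record from the database that lacks the extra attribute.
estimate = predict_extra(5)
print(estimate)
```

The fitted function plays the role of a bridging agent: it is trained at the site that has the attribute and applied at the site that does not.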

15
About the data
• Chase credit card
  – 500,000 records spanning one year
  – Evenly distributed
  – 20% fraud, 80% non-fraud
• First Union credit card
  – 500,000 records spanning one year
  – Unevenly distributed
  – 15% fraud, 85% non-fraud

16
About the differences
• Chase includes 2 attributes not present in First Union data
  – Add two fictitious fields
  – Classifier agents support unknown values
• Chase and First Union define an attribute with different semantics
  – Project Chase values on First Union semantics
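The first fix on this slide, padding records with fictitious fields and relying on classifiers that tolerate unknown values, might look like the sketch below. The attribute names `extra_1` and `extra_2`, the `UNKNOWN` marker and the toy rule are placeholders invented for the example; the actual Chase attributes are not named in the slides.

```python
UNKNOWN = None  # assumed marker for "value not available"


def pad_record(record, target_schema):
    """Reshape a record to `target_schema`, marking absent attributes UNKNOWN.

    Attributes that are not part of `target_schema` are dropped.
    """
    return {name: record.get(name, UNKNOWN) for name in target_schema}


def tolerant_rule(record):
    """Toy classifier that skips any test whose attribute is UNKNOWN."""
    score = 0
    if record["amount"] is not UNKNOWN and record["amount"] > 1000:
        score += 1
    if record["extra_1"] is not UNKNOWN and record["extra_1"] == "flagged":
        score += 1
    return "fraud" if score >= 1 else "legit"


# Hypothetical schema with two attributes the other bank does not record.
richer_schema = ["amount", "hour", "extra_1", "extra_2"]

poorer_record = {"amount": 2500, "hour": 2}
padded = pad_record(poorer_record, richer_schema)
print(padded)
print(tolerant_rule(padded))
```

Padding lets a classifier trained on the richer schema run unchanged on the poorer data, at the cost of losing whatever signal the missing attributes carried.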

17
Charts of Results
• Started out with an estimated saving of $325K to $550K
• 86% to 90% total accuracy with base-level classifiers

18
Other results
• With meta-classifiers used on First Union data, composed of Chase base classifiers and bridging classifiers
  – Accuracy improved to 95%
  – Estimated savings went up to $800K

19
Other results
• With meta-classifiers composed of base classifiers from both Chase and First Union
  – Accuracy improved to almost 98%
  – Estimated savings went up to $900K

20
Conclusion
• There is a lot of ground to cover
• Distributed DM is a viable option for addressing scalability and performance issues
• This paper investigated the idea of using databases with differing schemas and bridging those differences so that classifiers could be built and combined into meta-classifiers

21
Conclusion
• The authors conducted experiments and showed that meta-classifiers can be built using real credit card transaction data with different schemas
• These meta-classifiers proved to be reasonably accurate in testing

22
Questions???
