Published by Oswald Trevor Shepherd. Modified over 3 years ago.

1
Mining databases with different schema: Integrating incompatible classifiers
Andreas L. Prodromidis, Salvatore Stolfo
Dept of Computer Science, Columbia University
andreas@cs.columbia.edu, sol@cs.columbia.edu

2
Reference material
• http://www.cs.columbia.edu/~sal/JAM/PROJECT/recent-project-papers.htm
• http://www.cs.columbia.edu/~andreas/publications/publications.html
• Information has been updated in Andreas’ Ph.D. thesis: "Management of Intelligent Learning Agents in Distributed Data Mining Systems", October 1999.

3
Data Mining and Data Schema Mismatch
• Introduction
• Database Compatibility
• Meta-learning
• Bridging Methods
• Experiments and Evaluations
• Conclusion

4
Introduction
• The myth of an entirely local database
• Can one algorithm give you everything?
• There is distributed DM (JAM), and then there is distributed DM with different schemas
• Prediction, machine learning and meta-learning

5
Compatibility
• Data about the same topic, for example credit card transactions
• Different banks record and store information differently
• The same bank’s database will change over time
• The “incompatible-schema” problem

6
Compatibility
• Similar but different data yield different classifiers
• Classifiers depend on the structure of the data
• This has recently been recognized as a problem that hampers company mergers

7
Meta-Learning
• Why? A way to deal with the scaling problem of distributed data sources.
• What? Deriving a higher level of information from already learned classifiers
  – Meta-classifiers are defined recursively as collections of classifiers structured in multi-level trees; determining the optimal set of classifiers is a combinatorial problem.
  – The trees must be pruned to be efficient
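
The recursive definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the `MetaClassifier` class, the majority-vote combiner, and the toy base classifiers are hypothetical and are not taken from JAM or from the paper. The sketch shows a meta-classifier whose children may themselves be meta-classifiers, and why picking the best subset of base classifiers is combinatorial (n classifiers give 2^n - 1 non-empty subsets).

```python
from collections import Counter


class MetaClassifier:
    """A node in a multi-level tree; children are base or meta classifiers.

    Assumption: children are combined by simple majority vote. The slides
    do not fix a particular combining rule at this point.
    """

    def __init__(self, children):
        self.children = list(children)

    def __call__(self, features):
        votes = [child(features) for child in self.children]
        return Counter(votes).most_common(1)[0][0]

    def depth(self):
        """Number of meta-levels below and including this node."""
        return 1 + max(
            child.depth() if isinstance(child, MetaClassifier) else 0
            for child in self.children
        )

    def base_count(self):
        """Number of base-classifier leaves, counting repeated use."""
        return sum(
            child.base_count() if isinstance(child, MetaClassifier) else 1
            for child in self.children
        )


def candidate_subsets(n):
    """Non-empty subsets of n classifiers that a pruning search could consider."""
    return 2 ** n - 1


# Hypothetical toy base classifiers over a transaction dict.
big_amount = lambda t: "fraud" if t["amount"] > 500 else "legit"
night = lambda t: "fraud" if t["hour"] < 6 else "legit"
always_legit = lambda t: "legit"

site_a = MetaClassifier([big_amount, night, always_legit])
root = MetaClassifier([site_a, big_amount, night])
```

Because the number of candidate subsets doubles with every added classifier, exhaustive search quickly becomes impractical, which is the motivation for pruning.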

8
Meta-learning

9
Meta-learning
• Methods:
  – voting
  – stacking
  – SCANN: stacking, correspondence analysis and nearest neighbor
  – Other methods
    • bagging
    • boosting
    • refereeing
    • arbitrating
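
To make the first two methods concrete, here is a minimal sketch of voting and stacking. It is an assumption-laden toy, not the implementation used in the experiments: the function names, the lookup-table stacker, and the three rule-based base classifiers are all hypothetical. Voting takes the most common base prediction; stacking learns, from a separate validation set, which true label usually goes with each combination of base predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def train_stacker(base_classifiers, validation_set):
    """Map each tuple of base predictions to its most frequent true label.

    Assumption: a lookup table stands in for the meta-level learner; any
    learning algorithm could be trained on the same (predictions, label) pairs.
    """
    table = {}
    for features, label in validation_set:
        key = tuple(clf(features) for clf in base_classifiers)
        table.setdefault(key, Counter())[label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}


def stacked_predict(base_classifiers, stacker, features):
    """Use the learned table; fall back to voting for unseen combinations."""
    key = tuple(clf(features) for clf in base_classifiers)
    if key in stacker:
        return stacker[key]
    return majority_vote(key)


# Hypothetical base classifiers and validation data.
base = [
    lambda t: "fraud" if t["amount"] > 500 else "legit",
    lambda t: "fraud" if t["hour"] < 6 else "legit",
    lambda t: "fraud" if t["foreign"] else "legit",
]
validation = [
    ({"amount": 900, "hour": 3, "foreign": False}, "fraud"),
    ({"amount": 900, "hour": 14, "foreign": False}, "legit"),
    ({"amount": 20, "hour": 2, "foreign": True}, "fraud"),
    ({"amount": 20, "hour": 3, "foreign": False}, "fraud"),
]
stacker = train_stacker(base, validation)
```

In this toy data the stacker learns that a night-time flag alone is associated with fraud, so it can overrule a two-to-one vote for "legit". That is the kind of correlation between base predictions and true labels that stacking can exploit and plain voting cannot.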

10
Bridging Methods
• Databases with the same schema
• Schema(DBa) = {A1, A2, A3, …, An, C}
• Schema(DBb) = {B1, B2, B3, …, Bn, C}
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 ≠ Bn+1)

11
Bridging Methods
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 ≠ Bn+1)
  – missing data and attribute value predictions
• Schema(DBa) = {A1, A2, A3, …, An, An+1, C}
• Schema(DBb) = {B1, B2, B3, …, Bn, Bn+1, C}
  – If the value cannot be predicted: use null, the average for the column, or the most likely value based on recurring attribute combinations in the other columns.
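
A bridging step of this kind might look like the sketch below. Everything named here is an assumption for illustration: the helper functions `fit_bridge` and `bridge`, the attribute names, and the numeric `risk` column standing in for An+1. The sketch fills the missing attribute with the value most often seen for the same combination of common attributes in DBa and falls back to the column average when the combination was never seen.

```python
from collections import Counter, defaultdict
from statistics import mean


def fit_bridge(records_a, common, target):
    """Learn a fill-in rule for `target` from DBa records.

    Returns (lookup, fallback): lookup maps each combination of common
    attribute values to its most frequent target value; fallback is the
    column average. Assumes the target attribute is numeric.
    """
    by_combo = defaultdict(Counter)
    for record in records_a:
        by_combo[tuple(record[c] for c in common)][record[target]] += 1
    lookup = {combo: counts.most_common(1)[0][0]
              for combo, counts in by_combo.items()}
    fallback = mean(record[target] for record in records_a)
    return lookup, fallback


def bridge(record_b, common, target, lookup, fallback):
    """Return a copy of a DBb record extended with a predicted target value."""
    combo = tuple(record_b[c] for c in common)
    return {**record_b, target: lookup.get(combo, fallback)}


# Hypothetical data: "risk" exists only in DBa.
db_a = [
    {"merchant": "gas", "country": "US", "risk": 2},
    {"merchant": "gas", "country": "US", "risk": 2},
    {"merchant": "gas", "country": "US", "risk": 4},
    {"merchant": "jewelry", "country": "FR", "risk": 8},
]
common_attrs = ["merchant", "country"]
lookup, fallback = fit_bridge(db_a, common_attrs, "risk")
seen = bridge({"merchant": "gas", "country": "US"},
              common_attrs, "risk", lookup, fallback)
unseen = bridge({"merchant": "hotel", "country": "JP"},
                common_attrs, "risk", lookup, fallback)
```

Once DBb records carry a value for the extra attribute, classifiers trained on DBa can be applied to them. A regression model could replace the lookup table without changing the surrounding logic.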

12
Bridging Methods
• Databases with similar but different attributes
  – changing sizes and ranges of data values (normalization of time increments)
  – An+1 ≈ Bn+1
  – Bridging methods must translate based on input from data experts or other normalization efforts and probabilities
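
For the "similar but different" case, a translation layer can be sketched as a table of expert-supplied rules. The attribute names and conversion factors below are hypothetical; the slides only say that time increments and value ranges need normalizing. Each rule gives the target attribute name and a conversion function, and attributes without a rule pass through unchanged.

```python
def make_translator(rules):
    """Build a function that rewrites records from one schema to another.

    `rules` maps a source attribute name to (target_name, convert), where
    convert is a one-argument function. Assumption: the rules come from a
    data expert; nothing here infers them automatically.
    """
    def translate(record):
        translated = {}
        for name, value in record.items():
            target, convert = rules.get(name, (name, lambda v: v))
            translated[target] = convert(value)
        return translated
    return translate


# Hypothetical rules: seconds -> minutes, cents -> currency units.
to_target_schema = make_translator({
    "secs_since_last": ("mins_since_last", lambda s: s / 60),
    "amount_cents": ("amount", lambda c: c / 100),
})

example = to_target_schema(
    {"secs_since_last": 120, "amount_cents": 2500, "label": "fraud"}
)
```

Keeping the rules as data rather than code makes it straightforward for a domain expert to review or extend them when a schema changes.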

13
Bridging Methods

14
Experiments and Evaluations
• Working with credit card (CC) data from 2 different banking institutions
  – First Union and Chase
• Using 5 mining algorithms to derive classifiers
  – DT: CART, ID3 and C4.5
  – NB: Bayes
  – Rule induction: Ripper, based on IREP
• Using predictive algorithms to fill in missing data with regression methods
  – CART, MARS, locally weighted and linear

15
About the data
• Chase Credit Card
  – 500,000 records spanning one year
  – Evenly distributed
  – 20% fraud, 80% non-fraud
• First Union Credit Card
  – 500,000 records spanning one year
  – Unevenly distributed
  – 15% fraud, 85% non-fraud

16
About the differences
• Chase includes 2 attributes not present in First Union data
  – Add two fictitious fields
  – Classifier agents support unknown values
• Chase and First Union define an attribute with different semantics
  – Project Chase values on First Union semantics
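
The "fictitious fields" idea can be illustrated with a short sketch. The attribute names `chase_only_1` and `chase_only_2` are placeholders, since the slides do not name the two extra Chase attributes, and the toy classifier is hypothetical. The sketch pads a First Union style record with unknown (`None`) values so that a classifier written against the Chase schema can still run, skipping any test on an unknown attribute.

```python
# Placeholder schema; the real attribute names are not given in the slides.
CHASE_SCHEMA = ["amount", "hour", "chase_only_1", "chase_only_2"]


def pad_to_schema(record, schema):
    """Return a record with exactly the schema's attributes.

    Attributes missing from the input become None (unknown); attributes
    not listed in the schema are dropped.
    """
    return {name: record.get(name) for name in schema}


def chase_style_classifier(record):
    """Toy classifier that ignores tests on unknown (None) attributes."""
    score = 0
    if record["amount"] is not None and record["amount"] > 500:
        score += 1
    if record["chase_only_1"] is not None and record["chase_only_1"] > 0.8:
        score += 1
    return "fraud" if score >= 1 else "legit"


first_union_record = {"amount": 900, "hour": 3}
padded = pad_to_schema(first_union_record, CHASE_SCHEMA)
prediction = chase_style_classifier(padded)
```

Padding with unknowns is the simplest bridge; it relies on the classifier tolerating missing values rather than on predicting them.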

17
Charts of Results
• Started out with estimated savings of $325K to $550K
• 86% to 90% total accuracy with base-level classifiers

18
Other results
• With meta-classifiers used on First Union data, composed of Chase base classifiers and bridging classifiers
  – Accuracy improved to 95%
  – Estimated savings went up to $800K

19
Other results
• With meta-classifiers composed of base classifiers from both Chase and First Union
  – Accuracy improved to almost 98%
  – Estimated savings went up to $900K

20
Conclusion
• There is a lot of ground to cover
• Distributed DM is a viable option for addressing scalability and performance issues.
• This paper investigated the idea of using databases with differing schemas and bridging those differences so that classifiers could be built and combined into meta-classifiers.

21
Conclusion
• The authors conducted experiments showing that meta-classifiers can be built from real credit card transaction data with different schemas
• These meta-classifiers proved to be reasonably accurate in testing.

22
Questions???
