Mining databases with different schema: Integrating incompatible classifiers Andreas L Prodromidis Salvatore Stolfo Dept of Computer Science Columbia University.

Presentation transcript:


Reference material
• PROJECT/recent-project-papers.htm blications/publications.html
• Information has been updated in Andreas’ Ph.D. thesis, "Management of Intelligent Learning Agents in Distributed Data Mining Systems", October 1999.

Data Mining and Data Schema Mismatch
• Introduction
• Database Compatibility
• Meta-learning
• Bridging Methods
• Experiments and Evaluations
• Conclusion

Introduction
• The myth of an entirely local database
• Can one algorithm give you everything?
• There is distributed DM (JAM), and then there is distributed DM with different schemas
• Prediction, machine learning and meta-learning

Compatibility
• Data about the same topic, for example credit card transactions
• Different banks record and store information differently
• The same bank’s database will change over time
• The “incompatible-schema” problem

Compatibility
• Similar but different data yield different classifiers
• Classifiers depend on the structure of the data
• This has recently been recognized as a problem that hampers company mergers

Meta-Learning
• Why? A way to deal with the scaling problem of distributed data sources
• What? Deriving higher-level information from already learned classifiers
  – Meta-classifiers are defined recursively as collections of classifiers structured in multi-level trees; determining the optimal set of classifiers is a combinatorial problem
  – Meta-classifiers must be pruned to be efficient

Meta-learning

Meta-learning
• Methods:
  – voting
  – stacking
  – SCANN: stacking, correspondence analysis and nearest neighbor
  – Other methods
    • bagging
    • boosting
    • refereeing
    • arbitrating
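The first two methods on this slide can be sketched in a few lines. This is only an illustrative sketch, not the JAM implementation: the base classifiers, the record fields (`amount`, `hour`, `foreign`) and the lookup-table meta-learner are all assumptions made up for the example. Voting returns the most common base prediction; stacking learns, from a validation set, which true label usually goes with each combination of base predictions.

```python
from collections import Counter, defaultdict


def majority_vote(classifiers, x):
    """Return the label predicted by most base classifiers for record x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def train_stacker(classifiers, validation):
    """Learn a meta-classifier from base predictions on a validation set.

    The meta-learner is a plain lookup table (an assumption for this sketch):
    for each tuple of base predictions it remembers the most frequent true label.
    """
    table = defaultdict(Counter)
    for x, label in validation:
        key = tuple(clf(x) for clf in classifiers)
        table[key][label] += 1
    # Fallback for prediction combinations never seen during training.
    default = Counter(label for _, label in validation).most_common(1)[0][0]

    def meta(x):
        key = tuple(clf(x) for clf in classifiers)
        if key in table:
            return table[key].most_common(1)[0][0]
        return default

    return meta


# Hypothetical base classifiers over hypothetical transaction fields.
base = [
    lambda x: "fraud" if x["amount"] > 1000 else "legit",
    lambda x: "fraud" if x["hour"] < 6 else "legit",
    lambda x: "fraud" if x["foreign"] else "legit",
]

validation = [
    ({"amount": 2000, "hour": 12, "foreign": False}, "legit"),
    ({"amount": 2000, "hour": 3, "foreign": True}, "fraud"),
    ({"amount": 50, "hour": 3, "foreign": False}, "legit"),
    ({"amount": 50, "hour": 14, "foreign": True}, "fraud"),
    ({"amount": 10, "hour": 13, "foreign": False}, "legit"),
]

meta = train_stacker(base, validation)
record = {"amount": 20, "hour": 15, "foreign": True}
voted = majority_vote(base, record)   # two of three base classifiers say "legit"
stacked = meta(record)                # the table has learned that this pattern was fraud
```

On this toy data the two methods disagree for `record`, which illustrates why a learned combiner can outperform a fixed voting rule when one base classifier is more reliable than the others.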

Bridging Methods
• Databases with the same schema
• Schema(DBa)={A1,A2,A3,…,An,C}
• Schema(DBb)={B1,B2,B3,…,Bn,C}
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)

Bridging Methods
• Databases with one more attribute than the other, where An+1 does not relate to Bn+1 (An+1 != Bn+1)
  – Missing data and attribute value predictions
• Schema(DBa)={A1,A2,A3,…,An,An+1,C}
• Schema(DBb)={B1,B2,B3,…,Bn,Bn+1,C}
  – If the value cannot be predicted, use Null, the column average, or the most likely value based on recurring attribute combinations in the other columns
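The fallback options on this slide can be sketched as follows. This is a minimal sketch under stated assumptions, not the bridging agents from the paper: the attribute names (`merchant`, `country`, `card_present`, `amount`) are hypothetical, and the "most likely value" is taken to be the most frequent value seen in DBa for the same combination of common attributes.

```python
from collections import Counter, defaultdict
from statistics import mean


def column_mean(records, attr):
    """Column average over the records that actually have a value for attr."""
    return mean(r[attr] for r in records if r.get(attr) is not None)


def fit_bridge(records_a, common_attrs, target):
    """Learn from DBa how to fill in `target`, which DBb does not record."""
    by_combo = defaultdict(Counter)
    overall = Counter()
    for r in records_a:
        combo = tuple(r[a] for a in common_attrs)
        by_combo[combo][r[target]] += 1
        overall[r[target]] += 1
    fallback = overall.most_common(1)[0][0]  # most frequent value in the whole column

    def bridge(record_b):
        combo = tuple(record_b[a] for a in common_attrs)
        counts = by_combo.get(combo)
        value = counts.most_common(1)[0][0] if counts else fallback
        return {**record_b, target: value}   # DBb record extended to DBa's schema

    return bridge


# Hypothetical DBa sample: the extra attribute is `card_present`.
db_a = [
    {"merchant": "online", "country": "US", "card_present": False},
    {"merchant": "online", "country": "US", "card_present": False},
    {"merchant": "store", "country": "US", "card_present": True},
    {"merchant": "store", "country": "UK", "card_present": True},
    {"merchant": "online", "country": "UK", "card_present": True},
]

bridge = fit_bridge(db_a, ["merchant", "country"], "card_present")
filled = bridge({"merchant": "online", "country": "US"})
unseen = bridge({"merchant": "atm", "country": "FR"})
avg = column_mean([{"amount": 10.0}, {"amount": None}, {"amount": 30.0}], "amount")
```

Once the DBb records carry the filled-in attribute, a classifier trained on DBa's schema can be applied to them unchanged.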

Bridging Methods
• Databases with similar but different attributes
  – Changing sizes and ranges of data values (normalization of time increments)
  – An+1 ≈ Bn+1
  – Bridging methods must translate based on input from data experts or other normalization efforts and probabilities
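A translation of this kind might look like the sketch below. The units and ranges are assumptions chosen for illustration (one database recording a time gap in seconds and the other in minutes, or a score on a different scale); in practice the mapping would come from data experts, as the slide notes.

```python
def seconds_to_minutes(seconds):
    """Normalize a time increment recorded in seconds to minutes."""
    return seconds / 60.0


def rescale(value, src_range, dst_range):
    """Linearly map value from the source range onto the destination range."""
    (s_lo, s_hi), (d_lo, d_hi) = src_range, dst_range
    if s_hi == s_lo:
        raise ValueError("source range must not be empty")
    fraction = (value - s_lo) / (s_hi - s_lo)
    return d_lo + fraction * (d_hi - d_lo)


gap_minutes = seconds_to_minutes(90)       # hypothetical DBa value in seconds
score = rescale(5, (0, 10), (0, 100))      # hypothetical 0-10 scale mapped to 0-100
```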

Bridging Methods

Experiments and Evaluations
• Working with credit card (CC) data from 2 different banking institutions
  – First Union and Chase
• Using 5 mining algorithms to derive classifiers
  – DT: CART, ID3 and C4.5
  – NB: naive Bayes
  – Rule induction: Ripper, based on IREP
• Using predictive algorithms to fill in missing data with regression methods
  – CART, MARS, locally weighted and linear
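Of the four regression methods listed, linear regression is the simplest to illustrate. The sketch below fits a one-variable least-squares line on records where both attributes are known and uses it to fill in a missing numeric value. The data and the choice of a single predictor are assumptions for the example, not the setup used in the experiments.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Predict the missing attribute value for a record whose known attribute is x."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training pairs: known attribute -> attribute missing elsewhere.
model = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])
filled_value = predict(model, 5)
```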

About the data
• Chase Credit Card
  – 500,000 records spanning one year
  – Evenly distributed
  – 20% fraud, 80% non-fraud
• First Union Credit Card
  – 500,000 records spanning one year
  – Unevenly distributed
  – 15% fraud, 85% non-fraud

About the differences
• Chase includes 2 attributes not present in First Union data
  – Add two fictitious fields
  – Classifier agents support unknown values
• Chase and First Union define an attribute with different semantics
  – Project Chase values on First Union semantics

Charts of Results
• Started out with an estimated saving of $325K to $550K
• 86 to 90% total accuracy with base-level classifiers

Other results
• With meta-classifiers used on First Union, composed of Chase base classifiers and bridging classifiers
  – Accuracy improved to 95%
  – Estimated savings went up to $800K

Other results
• With meta-classifiers composed of base classifiers from both Chase and First Union
  – Accuracy improved to almost 98%
  – Estimated savings went up to $900K

Conclusion
• There is a lot of ground to cover
• Distributed DM is a viable option for addressing scalability and performance issues
• This paper investigated the idea of using databases with differing schemas and bridging those differences so that classifiers could be built and combined into meta-classifiers

Conclusion
• The authors conducted experiments showing that meta-classifiers can be built from real credit card transaction data with different schemas
• These meta-classifiers proved to be reasonably accurate in testing

Questions?