Data Reduction via Instance Selection (Chapter 1)

Background
 KDD: the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
 Instance selection: choosing a subset of the data that achieves the original purpose of the data mining application as if the whole data set were used.
 The ideal outcome of instance selection is model independent.

Instance Selection
Instance selection has three prominent functions:
 Enabling: when a data set is too large, it may be impossible to run a data mining algorithm on it at all; a selected subset makes mining feasible (a minimal sketch follows this slide).
 Focusing: one application is normally concerned with only one aspect of the domain, so only the relevant portion of the data needs to be kept.
 Cleaning: irrelevant instances are removed, along with noisy and/or redundant data.
There is a tradeoff between sample size and mining quality. Instance selection methods are associated with data mining tasks such as classification and clustering.
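The simplest instance selector for the enabling function is plain random sampling. The sketch below is a minimal illustration, not a method from the chapter; the 10% fraction, the synthetic data, and the use of stratification are assumptions.

```python
# A minimal sketch of the "enabling" function: stratified random sampling
# shrinks a data set so that a mining algorithm can run on it at all.
# The 10% fraction and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Keep 10% of the instances, preserving the class distribution (stratified).
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0)

print(X_small.shape)  # (10000, 20): the reduced set handed to the miner
```

Stratifying on the class label keeps the subset's class distribution close to the original, which matters when the reduced set must still serve the application's purpose.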

Methods for Classification (1)
Four types of selected instances (the fourth appears on the next slide):
 Critical points: the points that matter most to a classifier. The idea originated with nearest-neighbor (NN) learning: keeping only the critical points removes noisy data points and reduces the data set at the same time (see the sketch below).
 Boundary points: instances near the decision boundary, e.g., support vectors.
 Prototypes: representatives of groups of instances, obtained via averaging.
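The classic critical-point selector from nearest-neighbor learning is Hart's Condensed Nearest Neighbor (CNN), which keeps only the instances a 1-NN rule needs in order to classify the rest of the training set correctly. A minimal NumPy sketch, assuming Euclidean distance; the random scan order and random seeding are also assumptions:

```python
# A minimal sketch of Condensed Nearest Neighbor (CNN) selection, one
# classic way to keep only the "critical points" for a 1-NN classifier.
import numpy as np

def cnn_select(X, y, seed=0):
    """Return indices of a condensed subset S such that 1-NN on S
    classifies every instance in (X, y) correctly."""
    rng = np.random.default_rng(seed)
    n = len(X)
    store = [rng.integers(n)]           # seed the store with one random instance
    changed = True
    while changed:                      # repeat until a full pass adds nothing
        changed = False
        for i in rng.permutation(n):
            # classify instance i by its nearest neighbor in the store
            d = np.linalg.norm(X[store] - X[i], axis=1)
            if y[store[int(np.argmin(d))]] != y[i]:
                store.append(int(i))    # misclassified -> it is a critical point
                changed = True
    return np.array(store)
```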

Methods for Classification (2)
 Tree-based sampling: instance selection can be done via a decision tree built on the data.
 Delegate sampling:
   Construct a decision tree such that the instances at its leaves are approximately uniformly distributed.
   Sample instances from the leaves in inverse proportion to the density at each leaf, and assign each sampled point a weight proportional to its leaf's density.
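A rough sketch of the delegate-sampling idea: a scikit-learn tree stands in for the uniform-leaf tree described above, per-leaf quotas split a fixed budget in inverse proportion to leaf size, and weights are set proportional to leaf density. The quota formula, the budget, and `min_samples_leaf=20` are illustrative assumptions, not the chapter's specification.

```python
# A rough sketch of delegate sampling: sample from decision-tree leaves in
# inverse proportion to leaf density, weighting samples by that density.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def delegate_sample(X, y, budget=100, seed=0):
    rng = np.random.default_rng(seed)
    tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
    leaf = tree.apply(X)                      # leaf id of every instance
    ids, counts = np.unique(leaf, return_counts=True)

    inv = 1.0 / counts                        # inverse leaf density
    quota = np.maximum(1, np.round(budget * inv / inv.sum())).astype(int)

    idx, w = [], []
    for leaf_id, c, q in zip(ids, counts, quota):
        members = np.flatnonzero(leaf == leaf_id)
        pick = rng.choice(members, size=min(q, c), replace=False)
        idx.extend(pick)
        w.extend([c] * len(pick))             # weight proportional to leaf density
    return np.array(idx), np.asarray(w, dtype=float)
```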

Methods for Clustering
 Prototypes: pseudo data points generated from the formed clusters.
 Prototypes & sufficient statistics: representing a cluster using both actual data points and pseudo points.
 Data description in a hierarchy: when the clustering produces a hierarchy, the prototype approach to instance selection will not work, as there are many ways of choosing the appropriate clusters.
 Squashed data:
   Each data point carries a weight, and the weights sum to the number of instances in the original data set.
   Squashed data can be obtained in a model-free or a model-dependent (likelihood-based) way.
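A minimal sketch in the spirit of prototypes and squashed data: k-means centers act as pseudo data points, each weighted by its cluster size so that the weights sum to the number of original instances. The choice of k-means and of k = 50 are assumptions for illustration.

```python
# A minimal sketch of clustering-based prototypes in the "squashed data"
# spirit: cluster centers become weighted pseudo points.
import numpy as np
from sklearn.cluster import KMeans

def squash(X, k=50, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    pseudo = km.cluster_centers_                      # pseudo data points
    weights = np.bincount(km.labels_, minlength=k)    # cluster sizes
    assert weights.sum() == len(X)                    # weights sum to n
    return pseudo, weights
```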

Instance Labeling
 The problem of selecting which unlabeled instances should be sent for labeling (a sketch follows this slide).
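The slide names the problem but no method; uncertainty sampling is the standard baseline: label the unlabeled instances the current model is least sure about. A sketch, assuming a logistic-regression model and the least-confidence criterion:

```python
# A sketch of uncertainty sampling for instance labeling: query the
# unlabeled instances the current model is least confident about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_indices(X_labeled, y_labeled, X_unlabeled, n_queries=10):
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_unlabeled)
    confidence = proba.max(axis=1)             # probability of predicted class
    return np.argsort(confidence)[:n_queries]  # least confident first
```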

Towards the Next Step
 A universal model of instance selection would allow the selected instances to be useful to a whole group of mining algorithms in solving real-world problems.
 No such model has been found yet: different groups of learning algorithms need different instance selectors to suit their learning/search biases.

Evaluation Issues
 Sampling: performance is judged by sufficient statistics and can depend on the data distribution. If the data follow a normal distribution, the mean and variance are the two major measures.
 Classification: performance is mainly about predictive accuracy, along with aspects such as comprehensibility and simplicity.
 Clustering: performance is naturally about the clusters themselves: inter- and intra-cluster similarity, cluster shapes, the number of clusters, etc.
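For the sampling case above, a direct check is to compare the sufficient statistics of the selected subset with those of the full data. A toy sketch with assumed normal data and an assumed 5% subset:

```python
# A minimal sketch of the sampling evaluation above: compare the sufficient
# statistics (mean and variance) of a selected subset against the full data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=100_000)      # "original" data
sample = rng.choice(X, size=5_000, replace=False)     # selected subset

print(f"mean: {X.mean():.3f} vs {sample.mean():.3f}")
print(f"var:  {X.var():.3f} vs {sample.var():.3f}")
```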

Evaluation Measures
 Direct measures: keep as much resemblance as possible between the selected data and the original data. Ex) entropy, moments, and histograms.
 Indirect measures: evaluate the selection through a downstream task. For example, a classifier can be used to check whether instance selection results in better, equal, or worse predictive accuracy. The conventional evaluation methods of sampling, classification, and clustering can all be used to assess instance selection. Ex) precision and recall.
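An indirect measure in the sense above: train the same classifier on the full training data and on a selected subset, then compare held-out accuracy. The random 10% selector and the 1-NN classifier below are placeholders for whatever selector and miner are actually used.

```python
# A sketch of an indirect measure: train the same classifier on the full
# data and on a selected subset, and compare held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The "instance selector" here is plain random sampling, for illustration.
X_sel, _, y_sel, _ = train_test_split(X_tr, y_tr, train_size=0.1,
                                      stratify=y_tr, random_state=0)

for name, (Xf, yf) in {"full": (X_tr, y_tr), "selected": (X_sel, y_sel)}.items():
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xf, yf)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```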

Platform Compatibility & Database Protectability
 Platform compatibility: the subset of selected instances should be compatible with the mining algorithm used in the application.
 Database protectability: under any circumstances, the original data should be kept intact.

Related Work
 Feature selection: an instance selection problem can be transposed into an attribute selection problem, so feature selection techniques carry over (see the sketch below).
 Boosting: reweights instances rather than selecting them outright.
 Active learning: chooses which instances to label, as in the instance-labeling problem above.
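The transposition mentioned above is literal: flipping the data matrix turns instances into columns, so feature-selection machinery can score them. A trivial sketch:

```python
# Transposing the data matrix turns rows (instances) into columns
# (attributes), so a feature selector applied to the transposed matrix
# is, in effect, scoring and selecting the original instances.
import numpy as np

X = np.arange(12).reshape(4, 3)   # 4 instances x 3 attributes
Xt = X.T                          # 3 "instances" x 4 "attributes"
print(Xt.shape)                   # (3, 4): original instances are now columns
```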

Conclusion and Future Work
 The central point of instance selection is approximation: the selected subset should let the mining application behave as if the whole data set were used.
 Finding a universal, model-independent instance selector remains open future work.