Why preprocessing?
– The learning method needs a specific data type: numerical, nominal, ...
– The learning method cannot deal well enough with noisy / incomplete data
– Too much data (memory, time): too many examples, attributes, or values
– The data violate assumptions of the method, e.g. correlated attributes
– Bias in the learning method, e.g. an assumption of linearity

Preprocessing as part of DM/ML
– Ideally, methods should include their own transformations
– In practice, preprocessing takes most of the time in the DM process
– Transformations × learning methods → a large search space
– Some preprocessing is useful for all learning methods, some is method-specific
– Main types: removing bad examples / features, and discretisation

Attribute selection
– Also known as "feature subset selection"
– Features that cannot contribute to prediction/classification at all cause problems for (some) learners
– Redundant attributes can also be harmful
– "Wrapper approach": evaluate feature subsets by learning with them
– "Filter approach": try to identify bad attributes without learning, e.g. via association with the target class and association between attributes (see the sketch below)
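To make the two approaches concrete, here is a minimal sketch (assuming scikit-learn and a decision tree as the learner; the slides name neither): the filter scores attributes by mutual information with the class without ever running a learner, while the wrapper grows a subset greedily, judging each candidate by the cross-validated accuracy of the learner itself.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Filter approach: score each attribute against the class, no learner involved.
scores = mutual_info_classif(X, y, random_state=0)
filter_top = np.argsort(scores)[::-1][:5]   # 5 best attributes by mutual information
print("filter picks:", filter_top)

# Wrapper approach: greedy forward selection driven by the learner itself.
learner = DecisionTreeClassifier(random_state=0)
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    trial = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
             for f in remaining}
    f, score = max(trial.items(), key=lambda kv: kv[1])
    if score <= best_score:       # stop when no attribute improves CV accuracy
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score
print("wrapper picks:", selected, "cv accuracy:", round(best_score, 3))
```

Note the trade-off this illustrates: the filter is cheap and learner-independent, while the wrapper is expensive but tailored to the learner that will actually be used.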

Many combinations …
– The optimal attribute subset depends on the learner
– Redundant attributes: combine them, do not remove them
– E.g. "thermometer value" and "subjective temperature" → their average is more reliable than either one alone!
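Why combining beats removing: for two independent measurements of the same quantity with equal noise, the average has a standard deviation 1/√2 that of either single measurement. A quick numeric check (the noise levels are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
true_temp = 20.0
n = 100_000
thermometer = true_temp + rng.normal(0, 1.0, n)   # assumed noise sd = 1.0
subjective  = true_temp + rng.normal(0, 1.0, n)   # independent, same sd
combined = (thermometer + subjective) / 2

print(thermometer.std())   # ~1.0
print(combined.std())      # ~0.71 = 1/sqrt(2): the average is more reliable
```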

Discretisation
– Supervised / unsupervised
– Fixed-size "bins" / fixed number of "bins" / flexible bins
– Supervised discretisation ~ learning on a single attribute with intervals
– So: information gain, MDL (!?); maybe chi-square as a stopping criterion
– Recursive splitting vs. one-pass splitting
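A minimal sketch of these options (pandas/NumPy and the toy data are assumptions; the slides name no tool): equal-width and equal-frequency binning are unsupervised, while the supervised variant scores candidate cut points by information gain, exactly as if learning a one-attribute classifier.

```python
import numpy as np
import pandas as pd

ages = pd.Series([19, 22, 23, 25, 31, 38, 41, 55, 62, 70])

# Unsupervised, equal-width bins: each bin spans the same value range.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Unsupervised, equal-frequency bins: each bin holds (roughly) the same count.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Supervised: pick the cut point with the highest information gain,
# i.e. treat discretisation as 1-attribute learning.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # hypothetical class labels

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_cut(x, y):
    cuts = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2    # midpoints as candidates
    def gain(c):
        left, right = y[x <= c], y[x > c]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        return entropy(y) - weighted
    return max(cuts, key=gain)

print("best supervised cut:", best_cut(ages.to_numpy(), labels))
```

Applying `best_cut` recursively to each resulting interval, with MDL or chi-square as a stopping criterion, gives the recursive-splitting scheme the slide refers to.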

Discrete → numerical
– Each attribute-value combination as a separate binary attribute (1/0)
– Or: "scaling", replacing each value by a number, e.g.:
    red 10
    yellow 7
    red 9
    green 5
    yellow 6
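Both options in a minimal pandas sketch (the library choice and the exact mapping values are assumptions; the slide's numbers above are only an illustration):

```python
import pandas as pd

colors = pd.Series(["red", "yellow", "red", "green", "yellow"])

# Option 1: one binary (1/0) attribute per attribute-value combination.
print(pd.get_dummies(colors, prefix="color", dtype=int))

# Option 2: "scaling" — replace each value by a single number on one scale
# (the mapping below is a hypothetical choice, not from the slides).
print(colors.map({"red": 3, "yellow": 2, "green": 1}))
```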

More transformations
– Principal component analysis (PCA):
– Find the principal components (~ combinations of correlated attributes)
– Remove components with little variance
– Use the remaining components as attributes for learning
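A minimal sketch with scikit-learn's PCA (an assumption; the slides name no tool), where passing a fraction as `n_components` keeps just enough components to explain that share of the variance and drops the rest:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three attributes, the second strongly correlated with the first.
X = np.hstack([base,
               base * 2 + rng.normal(0, 0.1, (200, 1)),
               rng.normal(size=(200, 1))])

# Keep the components that explain 95% of the variance, drop the rest.
pca = PCA(n_components=0.95)
X_new = pca.fit_transform(X)
print("kept components:", pca.n_components_)
print("explained variance ratios:", pca.explained_variance_ratio_.round(3))
```

Here the two correlated attributes collapse into (roughly) one component, so the learner sees fewer, decorrelated inputs.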

Data cleansing
– Impossible values
– Outliers (relative to the distribution, e.g. far from the median/mean)
– Outliers (relative to model predictions)
– Risk: throwing away unexpected but correct data: anomalies
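A minimal sketch of the first two checks (pandas is assumed, and the thresholds are conventional choices, not from the slides), using the median and MAD because they stay robust against the very outliers being hunted:

```python
import pandas as pd

ages = pd.Series([23, 25, 31, 38, 41, 55, 62, 70, 240, -5])

# Impossible values: violate hard domain constraints.
impossible = (ages < 0) | (ages > 130)

# Distribution-based outliers: far from the median in robust units
# (1.4826 * MAD estimates the standard deviation under normality).
mad = (ages - ages.median()).abs().median()
robust_z = (ages - ages.median()).abs() / (1.4826 * mad)
outliers = robust_z > 3.5

print(pd.DataFrame({"age": ages, "impossible": impossible, "outlier": outliers}))
# Caution: flag and inspect rather than delete — an "outlier" may be a
# correct but unexpected value, i.e. exactly the anomaly worth keeping.
```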