Data Preprocessing
Copyright, 1996 © Dale Carnegie & Associates, Inc.


CSE 572: Data Mining by H. Liu

Data preprocessing
- A necessary step for serious, effective, real-world data mining
- Often omitted in "academic" DM, but it cannot be over-stressed in practical DM
- The need for preprocessing in DM:
  - Data reduction: too much data
  - Data cleaning: noise in the data
  - Data integration and transformation

Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling: random sampling and others
- Instance selection (search-based)
- Data compression: PCA, wavelet transformation
- Data discretization
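As a minimal sketch of the sampling item above, simple random sampling without replacement can be done with the standard library; the `records` list here is a hypothetical stand-in for a dataset:

```python
import random

def random_sample(records, n, seed=0):
    """Simple random sampling without replacement (SRSWOR)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(records, n)      # n distinct records, chosen uniformly

records = list(range(100))             # hypothetical data set of 100 records
subset = random_sample(records, 10)    # reduced representation: 10 records
```

Fixing the seed makes the reduced data set reproducible across runs, which matters when comparing mining algorithms on the same sample.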

Feature selection
- The basic problem: finding a subset of the original features that allows the learner to model the domain equally well or better
- What are the advantages of doing so?
- The curse of dimensionality: an illustration going from 1-d and 2-d to 3-d
- Another example of the wonders of reducing the number of features: the number of instances needed for learning grows with the number of features

Illustration of the difficulty of the problem
- Search space of feature subsets (an example with 4 features): 2^4 = 16 candidate subsets
- [Figure omitted: the subset lattice, from the Weka book]

Reducing the chance of overfitting
- Examples: from 2-D to 3-D
- Are the selected features really good? If they are, they may help mitigate overfitting
- How do we know? Experiments
- A standard procedure for feature selection:
  - Search: SFS, SBS, beam search, branch & bound
  - Optimality of a selected set of features
  - Evaluation measures on the goodness of selected features: accuracy, distance, inconsistency

Quality (goodness) metrics
- Some example metrics:
  - Dependency: dependence on the classes
  - Distance: how well the features separate the classes
  - Information: entropy; how?
  - Consistency: 1 - #inconsistencies/N
    - Example: over a small table of features F1, F2, F3 and class C, both (F1, F2, F3) and (F1, F3) have a 2/6 inconsistency rate
  - Accuracy (classifier-based): 1 - errorRate
- Which algorithm is better? Comparisons: time complexity, number of features, removing redundancy
- A dilemma of feature selection evaluation: if we know what the relevant features are, there is no need for FS; if we don't, how do we know what the relevant features are?
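The consistency metric above can be sketched as follows. The 6-row dataset is hypothetical (the slide's own table values did not survive extraction), constructed so that both (F1, F2, F3) and (F1, F3) yield the 2/6 inconsistency rate the slide mentions:

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, feature_idx):
    """Group instances by their values on the selected features; each
    group contributes (group size - majority class count) inconsistencies.
    Rate = total inconsistencies / N."""
    groups = defaultdict(list)
    for row, c in zip(rows, labels):
        key = tuple(row[i] for i in feature_idx)
        groups[key].append(c)
    inconsistent = sum(len(cs) - Counter(cs).most_common(1)[0][1]
                       for cs in groups.values())
    return inconsistent / len(rows)

# Hypothetical data: columns F1, F2, F3; labels are the class C
rows = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
labels = [0, 1, 0, 1, 0, 1]
```

Because dropping F2 does not merge any of the groups here, the smaller subset (F1, F3) is equally consistent, which is exactly why the metric would prefer it.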

Normalization
- Decimal scaling: v'(i) = v(i)/10^k for the smallest k such that max(|v'(i)|) < 1
  - For the range -991 to 99, k is 3 (divide by 1000): -991 → -0.991
- Min-max normalization into a new min/max range:
  v' = (v - min_A)/(max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - v = 73600 in [12000, 98000] → v' ≈ 0.716 in the new range [0, 1]
- Zero-mean normalization: v' = (v - mean_A) / std_dev_A
  - (1, 2, 3), with mean 2 and std_dev 1, becomes (-1, 0, 1)
  - If mean_income = 54000 and std_dev_income = 16000, then v = 73600 → 1.225
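The three normalizations above are each a one-liner; a minimal sketch that reproduces the slide's worked numbers:

```python
def decimal_scaling(values):
    """v' = v / 10^k for the smallest k with max(|v'|) < 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map v from [lo, hi] linearly onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Zero-mean (z-score) normalization."""
    return (v - mean) / std
```

For example, `min_max(73600, 12000, 98000)` gives about 0.716 and `z_score(73600, 54000, 16000)` gives 1.225, matching the income example above.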

Discretization
- Motivation from decision tree induction
- The concept of discretization:
  - Sort the values of a feature
  - Group continuous values together
  - Reassign a value to each group
- The methods: equal-width, equal-frequency, entropy-based
- A possible problem: still too many intervals; so, when to stop?
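The entropy-based method named above picks cut points that keep class labels pure. A minimal sketch of choosing a single binary split by minimum weighted entropy (the data at the bottom is hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Sort the values, try each boundary between distinct adjacent
    values, and return the cut point minimizing the class-weighted
    entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    best_w, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if w < best_w:
            best_w = w
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut

# Hypothetical feature values and class labels
cut = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

Applying the split recursively to each interval, with a stopping criterion such as MDL, is what full entropy-based discretization does; that recursion is the answer to "when to stop?".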

Binning
- Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1, [-∞, 10): 0, 4
  - Bin 2, [10, 20): 12, 16, 16, 18
  - Bin 3, [20, +∞): 24, 26, 28
  (we use -∞ for negative infinity, +∞ for positive infinity)
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1, [-∞, 14): 0, 4, 12
  - Bin 2, [14, 21): 16, 16, 18
  - Bin 3, [21, +∞): 24, 26, 28
- Any problems with the above methods?
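Both binning schemes above can be sketched in a few lines; this reproduces the age example from the slide:

```python
def equi_width_bins(values, width):
    """Assign each value to a fixed-width interval [k*width, (k+1)*width)."""
    bins = {}
    for v in values:
        bins.setdefault(v // width, []).append(v)
    return [sorted(bins[k]) for k in sorted(bins)]

def equi_frequency_bins(values, per_bin):
    """Sort, then cut into consecutive groups of (up to) per_bin values."""
    vs = sorted(values)
    return [vs[i:i + per_bin] for i in range(0, len(vs), per_bin)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
width_bins = equi_width_bins(ages, 10)    # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
freq_bins = equi_frequency_bins(ages, 3)  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

One problem the slide's question hints at: equi-width bins can be badly skewed by outliers, and equi-frequency bins may put identical values into different bins.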

Data cleaning
- Missing values:
  - ignore it
  - fill in manually
  - use a global value, the mean, or the most frequent value
- Noise:
  - smoothing (binning)
  - outlier removal
- Inconsistency:
  - domain knowledge, domain constraints
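The three automatic fill-in strategies above (global value / mean / most frequent) can be sketched as one helper; `None` marks a missing entry here, and the strategy names are this sketch's own convention:

```python
from statistics import mean
from collections import Counter

def fill_missing(values, strategy="mean"):
    """Replace None entries: 'mean' for numeric attributes, 'mode' for
    the most frequent value, or ('global', constant) for a fixed fill."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "mode":
        fill = Counter(known).most_common(1)[0][0]
    else:                                  # ('global', constant)
        fill = strategy[1]
    return [fill if v is None else v for v in values]
```

Note that mean-filling biases the attribute's variance downward, which is one reason the slide lists several alternatives rather than a single rule.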

Data integration
- Data integration combines data from multiple sources into a coherent data store
- Schema integration: the entity identification problem
- Redundancy:
  - an attribute may be derivable from another table
  - detectable via correlation analysis
- Data value conflicts
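Correlation analysis for redundancy detection reduces to the Pearson coefficient: a value near ±1 suggests one attribute is (linearly) derivable from the other. A minimal sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes from two integrated tables: the second is
# exactly double the first, so correlation is 1 and one is redundant.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```

In practice one would flag attribute pairs with |r| above some threshold for review rather than dropping them blindly, since high correlation does not prove one column was derived from the other.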

Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include:
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction, e.g., using neural networks
- Traditional transformation methods

Feature extraction
- The basic problem: creating new features that are combinations of the original features
- Feature selection and extraction complement each other
- A common approach: PCA
  - http://en.wikipedia.org/wiki/Principal_components_analysis
  - http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Dimensionality reduction via feature extraction (or transformation):
  D' = DA, where D is mean-centered (N x n) and A is (n x m), so D' is (N x m)
- Its variants (SVD, LSI) are used widely in text mining and web mining

Transformation: PCA
- D' = DA, with D mean-centered (N x n)
- Calculate and rank the eigenvalues λ of the covariance matrix
- Select the largest λ's such that r = (Σ_{i=1}^{m} λ_i) / (Σ_{i=1}^{n} λ_i) > threshold (e.g., 0.95); the corresponding eigenvectors form A (n x m)
- Example on the Iris data:

  Eigenvalues
    #   E-value   Diff      Prop      Cumulative
    1   2.91082   1.98960   0.72771   0.72770
    2   0.92122   0.77387   0.23031   0.95801
    3   0.14735   0.12675   0.03684   0.99485
    4   0.02061             0.00515   1.00000

  Eigenvectors
          V1        V2        V3        V4
    F1   0.522372  0.372318  -.721017  -.261996
    F2  -.263355   0.925556  0.242033  0.124135
    F3   0.581254  0.021095  0.140892  0.801154
    F4   0.565611  0.065416  0.633801  -.523546
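The D' = DA recipe above can be sketched directly with NumPy; the threshold test on the cumulative eigenvalue ratio r is exactly the rule from the slide:

```python
import numpy as np

def pca_reduce(D, threshold=0.95):
    """Mean-center D (N x n), rank the eigenvalues of its covariance
    matrix, keep the top m eigenvectors whose cumulative variance ratio
    reaches the threshold, and return D' = D A (N x m) plus the ratios."""
    Dc = D - D.mean(axis=0)
    cov = np.cov(Dc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)       # eigh: ascending eigenvalues
    order = np.argsort(evals)[::-1]          # re-rank: largest first
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / evals.sum()   # r for m = 1, 2, ..., n
    m = int(np.searchsorted(ratio, threshold)) + 1
    A = evecs[:, :m]                         # n x m projection matrix
    return Dc @ A, evals / evals.sum()
```

On the Iris data this would keep m = 2 components at threshold 0.95, matching the 0.95801 cumulative proportion in the table above (the toy matrix in the test below is this sketch's own example, not Iris).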

Summary
- The importance of data preprocessing cannot be overstressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- Preprocessing should be considered together with the mining algorithms to improve data mining effectiveness