1 Data Preprocessing

2 Data preprocessing
- A necessary step for serious, effective, real-world data mining
- Often omitted in "academic" DM, but its importance cannot be over-stressed in practical DM
- The need for preprocessing in DM:
  - Data reduction - too much data
  - Data cleaning - noise in the data
  - Data integration and transformation

3 Data reduction
- Data cube aggregation
- Feature selection and dimensionality reduction
- Sampling: random sampling and others
- Instance selection (search based)
- Data compression: PCA, wavelet transformation
- Data discretization

4 Feature selection
- The basic problem: finding a subset of the original features that can learn the domain equally well or better
- What are the advantages of doing so?
- Curse of dimensionality
  - From 1-d, 2-d, to 3-d: an illustration
  - Another example - the wonders of reducing the number of features, since the number of instances needed for learning depends on the number of features

5 Illustration of the difficulty of the problem
- Search space (an example with 4 features): n features yield 2^n candidate subsets, so exhaustive search quickly becomes infeasible
- (Figure: the lattice of all 16 subsets of 4 features, from the Weka book; not reproduced here)

6 Reduce the chance of data overfitting
- Examples: from 2-D to 3-D
- Are the selected features really good? If they are, they may help mitigate overfitting. How do we know? Experiments
- A standard procedure of feature selection
  - Search: SFS, SBS, beam search, branch & bound (a sketch of SFS follows below)
  - Optimality of a selected set of features
  - Evaluation measures for the goodness of selected features: accuracy, distance, inconsistency, etc.
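The search strategies listed above can be made concrete. Below is a minimal Python sketch of sequential forward selection (SFS), assuming a caller-supplied evaluate function that scores a candidate subset by any of the goodness measures above (accuracy, distance, consistency); the toy relevance scorer in the usage lines is purely hypothetical.

    def sfs(all_features, evaluate):
        """Greedily grow a feature subset: at each step add the feature that
        most improves the score, and stop when no addition helps."""
        selected, best_score = [], float("-inf")
        remaining = list(all_features)
        while remaining:
            # Score every one-feature extension of the current subset.
            score, best_f = max((evaluate(selected + [f]), f) for f in remaining)
            if score <= best_score:      # no improvement: stop searching
                break
            selected.append(best_f)
            remaining.remove(best_f)
            best_score = score
        return selected, best_score

    # Hypothetical scorer: reward relevant features, penalize subset size.
    relevance = {"age": 0.5, "income": 0.3, "zip": 0.0, "id": -0.1}
    evaluate = lambda s: sum(relevance[f] for f in s) - 0.05 * len(s)
    print(sfs(list(relevance), evaluate))   # (['age', 'income'], 0.7)

SBS would run the same loop in reverse, starting from the full feature set and greedily removing features.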

7 Quality (goodness) metrics
(Example table with features F1, F2, F3 and class C; values not reproduced here)
- Some example metrics:
  - Dependency: depending on classes
  - Distance: separating classes
  - Information: entropy - how?
  - Consistency: 1 - #inconsistencies/N
    - Example: (F1, F2, F3) and (F1, F3) - both sets have a 2/6 inconsistency rate (see the sketch below)
  - Accuracy (classifier based): 1 - errorRate
- Which algorithm is better? Comparisons: time complexity, number of features, removing redundancy
- A dilemma of feature selection evaluation: if we know what the relevant features are, there is no need for FS; if we don't, how do we know that the selected features are the relevant ones?
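The consistency metric can be computed directly: instances that agree on the selected features but disagree on the class are inconsistent. A minimal sketch, with a hypothetical six-row table constructed to reproduce the slide's 2/6 rate:

    from collections import Counter, defaultdict

    def inconsistency_rate(rows, labels, feature_idx):
        """#inconsistencies / N, where each group of instances matching on
        the selected features contributes (group size - majority class count)."""
        groups = defaultdict(list)
        for row, y in zip(rows, labels):
            groups[tuple(row[i] for i in feature_idx)].append(y)
        bad = sum(len(ys) - max(Counter(ys).values()) for ys in groups.values())
        return bad / len(rows)

    # Hypothetical data chosen so both subsets give a 2/6 rate, as on the slide.
    rows   = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]
    labels = [1, 0, 1, 0, 1, 0]
    print(inconsistency_rate(rows, labels, [0, 1, 2]))  # (F1,F2,F3) -> 0.333...
    print(inconsistency_rate(rows, labels, [0, 2]))     # (F1,F3)    -> 0.333...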

8 Normalization
- Decimal scaling: v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
  - For the range between -991 and 99, k is 3 (divide by 1000): -991 -> -0.991
- Min-max normalization into a new max/min range: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - e.g., v in [12000, 98000] is mapped to v' in [0, 1] (the new range)
- Zero-mean normalization: v' = (v - mean_A) / std_dev_A
  - For (1, 2, 3), the mean and std_dev are 2 and 1, giving (-1, 0, 1)
  - If mean_Income = 54,000 and std_dev_Income = 16,000, then v = 73,600 -> 1.225
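A sketch of the three methods in plain Python; the middle income value in the min-max call is a hypothetical sample within the slide's [12000, 98000] range.

    import math

    def decimal_scaling(values):
        """v' = v / 10^k for the smallest k such that max |v'| < 1."""
        m = max(abs(v) for v in values)
        k = math.floor(math.log10(m)) + 1 if m > 0 else 0
        return [v / 10**k for v in values]

    def min_max(values, new_min=0.0, new_max=1.0):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

    def zero_mean(values):
        """z-score, with the sample std dev (n-1) used in the slide's example."""
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        return [(v - mean) / std for v in values]

    print(decimal_scaling([-991, 99]))      # [-0.991, 0.099], k = 3
    print(min_max([12000, 55000, 98000]))   # [0.0, 0.5, 1.0]
    print(zero_mean([1, 2, 3]))             # [-1.0, 0.0, 1.0]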

9 Discretization
- Motivation: decision tree induction
- The concept of discretization:
  - Sort the values of a feature
  - Group continuous values together
  - Reassign a value to each group
- The methods: equi-width, equi-frequency, entropy-based (a split-point sketch follows below)
- A possible problem: still too many intervals - so, when to stop?
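To illustrate the entropy-based method, here is a sketch that picks the single cut point minimizing the weighted class entropy of the two resulting intervals (the criterion decision-tree induction uses); the class labels paired with the ages are hypothetical.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        """Cut point (midpoint between adjacent distinct values) that
        minimizes the weighted entropy of the two resulting intervals."""
        pairs = sorted(zip(values, labels))
        best = (float("inf"), None)
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                 # no cut between equal values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [y for v, y in pairs if v <= cut]
            right = [y for v, y in pairs if v > cut]
            w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            best = min(best, (w, cut))
        return best[1]

    ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
    labels = ['y', 'y', 'y', 'n', 'n', 'n', 'n', 'y', 'y']
    print(best_split(ages, labels))   # 14.0: separates the pure 'y' run

Applying the split recursively within each interval, with a stopping test, answers the "when to stop" question above.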

10 Binning
- Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1: 0, 4            [-, 10) bin
  - Bin 2: 12, 16, 16, 18  [10, 20) bin
  - Bin 3: 24, 26, 28      [20, +) bin
  - (We use - to denote negative infinity and + positive infinity)
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1: 0, 4, 12    [-, 14) bin
  - Bin 2: 16, 16, 18  [14, 21) bin
  - Bin 3: 24, 26, 28  [21, +) bin
- Any problems with the above methods? (a sketch of both methods follows below)
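A minimal sketch of both binning methods applied to the age values above:

    def equi_width(values, width):
        """Bins [0, width), [width, 2*width), ... over the value range."""
        bins = {}
        for v in sorted(values):
            bins.setdefault(v // width, []).append(v)
        return list(bins.values())

    def equi_frequency(values, density):
        """Consecutive runs of `density` sorted values per bin."""
        vs = sorted(values)
        return [vs[i:i + density] for i in range(0, len(vs), density)]

    ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
    print(equi_width(ages, 10))     # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
    print(equi_frequency(ages, 3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]

One answer to the rhetorical question: skewed data can leave equi-width bins empty, while equi-frequency binning can split identical values across bin boundaries.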

11 Data cleaning
- Missing values:
  - ignore the instance
  - fill in manually
  - use a global constant, the mean, or the most frequent value (see the sketch below)
- Noise: smoothing (binning), outlier removal
- Inconsistency: domain knowledge, domain constraints
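A sketch of the fill-in strategies, using the column mean for numeric attributes and the most frequent value for categorical ones (None marks a missing entry):

    from collections import Counter

    def fill_missing(column, numeric=True):
        """Replace None with the mean (numeric) or the mode (categorical)."""
        present = [v for v in column if v is not None]
        if numeric:
            fill = sum(present) / len(present)
        else:
            fill = Counter(present).most_common(1)[0][0]
        return [fill if v is None else v for v in column]

    print(fill_missing([3.0, None, 5.0, 4.0]))                 # fills with 4.0
    print(fill_missing(['a', 'b', None, 'b'], numeric=False))  # fills with 'b'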

12 Data integration
- Data integration combines data from multiple sources into a coherent data store
- Schema integration: the entity identification problem
- Redundancy: an attribute may be derivable from another table; detect it with correlation analysis (see the sketch below)
- Data value conflicts
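For the redundancy point, a sketch of correlation analysis: a Pearson coefficient near +/-1 flags an attribute that may be derivable from another. The price and price-with-tax columns are hypothetical.

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient between two attributes."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    price = [100, 200, 300, 400]
    price_tax = [108, 216, 324, 432]   # exactly price * 1.08
    print(pearson(price, price_tax))   # 1.0 -> redundant attribute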

13 Data transformation
- Data is transformed or consolidated into forms appropriate for mining
- Methods include:
  - smoothing
  - aggregation
  - generalization
  - normalization (min-max)
  - feature construction, e.g., using neural networks
- Traditional transformation methods

14 Feature extraction
- The basic problem: creating new features that are combinations of the original features
- Feature selection and feature extraction complement each other
- A common approach - PCA
- Dimensionality reduction via feature extraction (or transformation): D' = DA, where D is mean-centered (N x n) and A is (n x m), so D' is (N x m)
- Its variants (SVD, LSI) are used widely in text mining and web mining

15 Transformation: PCA
- D' = DA, where D is mean-centered (N x n)
- Calculate and rank the eigenvalues λ_i of the covariance matrix
- Select the largest eigenvalues such that r > threshold (e.g., 0.95), where r = (λ_1 + ... + λ_m) / (λ_1 + ... + λ_n), i.e., the m selected eigenvalues' share of the total; the corresponding eigenvectors V1, ..., Vm form the columns of A (n x m)
- (Example: eigenvalue table for the Iris data with columns E-value, Diff, Prop, Cumu over features F1-F4; values not reproduced here)
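A sketch of the whole transformation with numpy. Since the Iris eigenvalue table is not reproduced above, the demo uses random data with one deliberately redundant feature; the shapes mirror the 150 x 4 Iris matrix.

    import numpy as np

    def pca(D, threshold=0.95):
        D = D - D.mean(axis=0)                  # mean-center (N x n)
        eigvals, eigvecs = np.linalg.eigh(np.cov(D, rowvar=False))
        order = np.argsort(eigvals)[::-1]       # rank eigenvalues, largest first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        r = np.cumsum(eigvals) / eigvals.sum()  # cumulative proportion r
        m = int(np.searchsorted(r, threshold)) + 1  # smallest m with r >= threshold
        A = eigvecs[:, :m]                      # A is (n x m)
        return D @ A                            # D' = DA is (N x m)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4))                    # stand-in for the Iris data
    X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=150)  # make F4 redundant with F1
    print(pca(X).shape)                              # (150, 3): one dimension dropped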

16 Summary
- Data preprocessing cannot be over-stressed in real-world applications
- It is an important, difficult, and low-profile task
- There are different types of approaches for different preprocessing problems
- It should be considered together with the mining algorithms to improve data mining effectiveness

