Published by Phoebe Byrd. Modified over 9 years ago.
1
Data preprocessing before classification
In Kennedy et al.: “Solving data mining problems”
2
Outline
Ch.7 Collecting data
Ch.8 Preparing data
Ch.9 Data preprocessing
3
Ch.7 Collecting data
4
Collecting data
Collecting “example patterns”
–Inputs (vectors of independent variables)
–Outputs (vectors of dependent variables)
More data is better
Begin with an elementary set of data
5
Collecting data
Choose an appropriate sampling rate for time-series data.
Make sure the measurement units of the data are consistent.
Keep non-essential variables out of the input vector.
Make sure no major structural (systemic) changes have occurred during collection.
6
Collecting data
How much data is enough?
–Train and test using a subset of the data
–If performance does not improve when the full data set is used, the data is enough
–There are statistical validation methods (Ch.11)
Using simulated data
–When it is difficult to collect (sufficient) data
–The simulated data must be realistic and representative
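The "how much data is enough" check can be sketched as a learning curve: train on growing subsets and see whether test accuracy still improves. This is a minimal illustration, not the method from the book. The 1-D nearest-mean classifier, the synthetic two-class data, and all function names are assumptions made for the example.

```python
import random

def nearest_mean_classifier(train):
    """Fit per-class means on 1-D data and return a predict function."""
    means = {}
    for label in {y for _, y in train}:
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs)
    # Predict the class whose mean is closest to x.
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def accuracy(predict, test):
    """Fraction of test patterns classified correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)

def learning_curve(train, test, sizes):
    """Test accuracy after training on the first n patterns, for each n."""
    return [(n, accuracy(nearest_mean_classifier(train[:n]), test))
            for n in sizes]

# Synthetic (simulated) data: class 0 centred at 0.0, class 1 centred at 3.0.
rng = random.Random(0)

def sample(n):
    data = []
    for _ in range(n):
        label = rng.choice([0, 1])
        data.append((rng.gauss(3.0 * label, 1.0), label))
    return data

train, test = sample(400), sample(200)
curve = learning_curve(train, test, [25, 50, 100, 200, 400])
for n, acc in curve:
    print(f"{n:4d} training patterns -> test accuracy {acc:.3f}")
```

If the accuracy at the largest sizes is essentially flat, adding more data of the same kind is unlikely to help much.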
7
Ch.8 Preparing data
8
Preparing data
Handling
–Missing data
–Categorical data
–Inconsistent data and outliers
9
Missing data
Discard incomplete example patterns
Manually enter a reasonable, probable, or expected value
Use a statistic generated from the example patterns that have that value
–Mean, mode
Encode missing values explicitly by creating new indicator variables
Generate a predictive model to predict each missing data value
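Two of these options, filling with a statistic and adding an indicator variable, can be combined in a short sketch. The function name `impute_column`, the use of `None` to mark a missing value, and the sample columns are assumptions for illustration.

```python
from statistics import mean, mode

def impute_column(values, strategy="mean"):
    """Fill None entries with the mean (numeric) or mode (categorical)
    of the observed values. Also return a 0/1 indicator per entry that
    records which values were originally missing."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

ages = [30, None, 50, 40]                  # numeric column
colors = ["red", "blue", None, "red"]      # categorical column

filled_ages, age_missing = impute_column(ages, "mean")
filled_colors, color_missing = impute_column(colors, "mode")
```

Keeping the indicator alongside the filled column lets the model learn whether "missing" itself carries information.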
10
Categorical data
Ordinal:
–Convert to a numerical representation in a straightforward manner
–“Low”, “medium”, “high” => 0, 1, 2
Nominal:
–“One of n” representation
–When there are n distinct categories, encode the input variable as n different binary inputs.
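Both encodings fit in a few lines. The sketch below assumes the ordinal levels from the slide and a hypothetical three-colour nominal variable; the helper names are made up for the example.

```python
# Ordinal: the categories have a natural order, so map them to integers.
ORDINAL_LEVELS = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(value):
    """Map an ordinal label to its rank (case-insensitive)."""
    return ORDINAL_LEVELS[value.lower()]

def one_of_n(value, categories):
    """Nominal: one binary input per category, exactly one of them set."""
    return [1 if value == c else 0 for c in categories]

colour_categories = ["blue", "green", "red"]   # hypothetical nominal variable

print(encode_ordinal("Medium"))                # rank of "medium"
print(one_of_n("green", colour_categories))    # n = 3 binary inputs
```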
11
Further processing of “one of n”
When n is too large, reduce the number of inputs in the new encoding.
–Manually
–PCA-based reduction
Reduce the one-of-n representation to a one-of-m representation, where m is less than n.
–Eigenvalue-based reduction
–Output variable-based reduction
12
Inconsistent data and outliers
Removing erroneous data
Identifying inconsistent data
–Thresholding, filtering
Outliers
–Data points that lie outside the normal region of interest in the input space. They may be unusual situations that are “correct”, or misleading or incorrect measurements.
13
Outliers
Ways to spot outliers
–Plot: box plot, histogram…
–Number of standard deviations (S.D.) from the mean
Handling outliers
–Remove them
  Assumption: the region of the input space where the outliers reside is not of interest
–“Winsorize” them
  Convert the values of outliers into the values of the upper or lower thresholds.
Outliers can always be reintroduced into the satisfactory model to study the changes in the performance of the model.
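The standard-deviation test and Winsorizing can be sketched as follows. The cutoff of 2 standard deviations, the thresholds 8 and 13, and the sample readings are assumptions chosen for the example; in practice they come from knowledge of the data.

```python
from statistics import mean, stdev

def find_outliers(values, max_sd=2.0):
    """Return values lying more than max_sd sample standard deviations
    from the mean."""
    mu, sd = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > max_sd * sd]

def winsorize(values, lower, upper):
    """Clamp every value into [lower, upper] instead of discarding it."""
    return [min(max(v, lower), upper) for v in values]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

outliers = find_outliers(readings)      # flags the reading of 50
clamped = winsorize(readings, 8, 13)    # 50 becomes the upper threshold 13
```

Note that an extreme value inflates both the mean and the standard deviation, so this test can miss outliers in small samples; a box plot is a useful cross-check.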
14
Ben Shabad
15
Ch.9 Data preprocessing
16
Reasons to preprocess data
Reducing noise
Enhancing the signal
Reducing input space
Feature extraction
Normalizing data
Modifying prior probabilities (specific to classification)
17
Reducing noise
Averaging data values
Thresholding data
–Convert numeric data into categorical data
–E.g. grey-scale => black-and-white (binary) image
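Both noise-reduction ideas are easy to sketch. The window size of 3 and the grey-level cutoff of 128 are assumptions for illustration, as are the function names.

```python
def moving_average(signal, window=3):
    """Average each run of `window` consecutive values to smooth noise."""
    return [sum(signal[i:i + window]) / window
            for i in range(len(signal) - window + 1)]

def threshold(pixels, cutoff=128):
    """Turn grey levels (0-255) into binary values: 1 at or above the
    cutoff, 0 below it."""
    return [1 if p >= cutoff else 0 for p in pixels]

smoothed = moving_average([1, 2, 3, 4, 5])
binary = threshold([0, 100, 128, 255])
```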
18
Reducing input space
Principal component analysis (PCA)
–Identify an m-dimensional subspace of the n-dimensional input space
–The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated)
Eliminating correlated input variables
–Identify highly correlated input variables by
  Statistical correlation tests
  Visual inspection of graphed data variables
  Seeing if a data variable can be modeled using one or more others
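As one possible correlation test, the sketch below computes the Pearson correlation coefficient for every pair of input variables and reports the pairs above a cutoff. The cutoff of 0.95, the column names, and the sample data are assumptions for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, cutoff=0.95):
    """List pairs of variables whose absolute correlation reaches cutoff."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= cutoff:
                pairs.append((a, b))
    return pairs

data = {
    "temp_c": [10, 20, 30, 40],
    "temp_f": [50, 68, 86, 104],     # same quantity in different units
    "humidity": [30, 80, 20, 60],
}
redundant = correlated_pairs(data)   # one variable of each pair can be dropped
```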
19
Reducing input space
Combining non-correlated input variables
Sensitivity analysis
–If variations of a particular input variable cause large changes in the output of the estimation model, the variable is very significant.
–Sensitivity analysis prunes input variables based on information provided by both input and output data.
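A simple form of sensitivity analysis perturbs one input at a time and measures the change in the model output. The sketch below is one way to do this under stated assumptions: `toy_model` is a hypothetical stand-in for a trained model, and the step size and baseline point are arbitrary.

```python
def sensitivity(model, baseline, delta=0.01):
    """Absolute change in model output per unit change of each input,
    estimated by perturbing one input at a time around `baseline`."""
    base_out = model(baseline)
    scores = []
    for i in range(len(baseline)):
        perturbed = list(baseline)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - base_out) / delta)
    return scores

def toy_model(x):
    """Hypothetical stand-in for a trained model: input 0 matters a lot,
    input 1 hardly at all."""
    return 5.0 * x[0] + 0.01 * x[1]

scores = sensitivity(toy_model, [1.0, 1.0])
# Inputs with very small scores are candidates for pruning.
```

For a nonlinear model the scores depend on the baseline point, so it is safer to average them over many example patterns before pruning anything.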
20
Normalizing data
Not “transform to normal distribution”
For models that perform better on normalized data
–Non-parametric algorithms implicitly assume distances in different directions carry the same weight (e.g. K-nearest neighbor, “KNN”)
–Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized
Avoiding numerical problems
21
Types of normalization
Min-max normalization
–It preserves all relationships among the data values exactly
–It compresses the normal range if extreme values or outliers exist
Z-score normalization
Sigmoidal normalization
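The three normalizations can be sketched as below. The slide does not give formulas, so these are common definitions and should be read as assumptions: min-max maps to [0, 1], z-score uses the population standard deviation, and the sigmoidal form applies a logistic function to the z-score.

```python
from math import exp
from statistics import mean, pstdev

def min_max(values):
    """Linearly rescale so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def sigmoidal(values):
    """Squash z-scores into (0, 1) with a logistic function, so extreme
    values are compressed instead of stretching the whole scale."""
    return [1.0 / (1.0 + exp(-z)) for z in z_score(values)]

print(min_max([0, 5, 10]))
# One outlier squeezes the ordinary values into a narrow band near 0:
print(min_max([1, 2, 3, 100]))
print(sigmoidal([1, 2, 3, 100]))
```

The second `min_max` call shows the compression effect named on the slide: the three ordinary values all land below 0.03.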
22
Other considerations
Preprocess according to the characteristics of the specific classifiers being used for modeling
–E.g. CHAID uses categorical data directly
Input variables produce the best modeling accuracy when they exhibit a uniform or Gaussian distribution
Add expert knowledge when preprocessing data
23
Get prepared and then go!