Lecture 7: Data Preprocessing


1 Lecture 7: Data Preprocessing
CSE 482 Lecture 7: Data Preprocessing

2 Overview
Previous lecture: data quality issues
Today's lecture: data preprocessing, i.e., transforming the raw data into a more "useful" representation for subsequent analysis. This includes data cleaning, aggregation, feature extraction, etc.

3 Data Preprocessing Tasks
Data cleaning: noise, outliers, missing values, duplicate data
Sampling
Aggregation
Discretization
Feature extraction

4 Sampling
Sampling is a technique for data reduction.
The key principle for effective sampling is to find a representative sample: a sample is representative if it has approximately the same property (of interest) as the original set of data.
[Figure: scatter plots comparing the original 8000 points with progressively smaller random samples]

5 Types of Sampling
Simple random sampling: there is an equal probability of selecting any particular item
Stratified sampling: split the data into several partitions, then draw random samples from each partition
Sampling without replacement: as each item is selected, it is removed from the population
Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once

6 Python Example
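The code on this slide is a screenshot that isn't reproduced in the transcript. As a minimal sketch (assuming Python's standard random module; the population values below are made up for illustration):

import random

population = list(range(100))          # hypothetical data: the integers 0..99

# Simple random sampling without replacement:
# each item can appear at most once in the sample
sample_no_repl = random.sample(population, k=10)

# Simple random sampling with replacement:
# the same item may be picked more than once
sample_with_repl = random.choices(population, k=10)

print(sample_no_repl)
print(sample_with_repl)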

7 DataFrame.sample()
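The slide shows only the output of DataFrame.sample(). A sketch of how it supports the sampling schemes above (the DataFrame contents are hypothetical; groupby(...).sample() requires pandas 1.1 or later):

import pandas as pd

# Hypothetical data set (column names are illustrative only)
df = pd.DataFrame({'age': [10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64],
                   'buy': ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes',
                           'Yes', 'Yes', 'No', 'No', 'No', 'No']})

# Simple random sample of 4 rows, without replacement
print(df.sample(n=4, random_state=1))

# Sample with replacement: the same row may appear more than once
print(df.sample(n=4, replace=True, random_state=1))

# Stratified sampling: draw 2 rows from each class of 'buy'
print(df.groupby('buy').sample(n=2, random_state=1))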

8 Aggregation
Sometimes, less is more: aggregation combines two or more observations into a single observation.
Purpose:
Data reduction: a smaller data set means less memory and processing time
Change of scale: aggregation generates a coarser-level view of the data
More "stable" data: aggregated data tends to have less variability (less noisy)

9 Aggregation
Precipitation at Maple City, Michigan: in the daily precipitation series it is harder to detect trends/patterns; in the monthly and annual precipitation, long-term trends and cycles are easier to discern.
[Figure: daily, monthly, and annual precipitation time series]
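As a sketch of this kind of aggregation in pandas (the precipitation values below are synthetic, not the Maple City data), daily observations can be rolled up to monthly and annual totals with resample():

import numpy as np
import pandas as pd

# Synthetic daily precipitation series
rng = np.random.default_rng(0)
days = pd.date_range('2015-01-01', '2016-12-31', freq='D')
daily = pd.Series(rng.gamma(shape=0.5, scale=2.0, size=len(days)), index=days)

# Aggregate to coarser time scales: monthly and annual totals
monthly = daily.resample('M').sum()
annual = daily.resample('Y').sum()

print(monthly.head())
print(annual)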

10 Discretization
Ordinal attribute: shirt size (small/medium/large), hurricane (category 1-5)
Numeric attribute: weight, height, salary, # days since Jan
Discretization is used to split the range of a numeric attribute into a discrete number of intervals.
For example, age can be discretized into [child, young adult, adult, senior]. There may be no apparent relationship between the age attribute and the tendency to buy a particular product, but the relationship may exist among certain age groups only (e.g., young adults).

11 Unsupervised Discretization
Equal interval width: split the range of the numeric attribute into equal-length intervals (bins)
Pros: cheap and easy to implement
Cons: susceptible to outliers
Equal frequency: split the range of the numeric attribute so that each interval (bin) has the same number of points
Pros: robust to outliers
Cons: more expensive (must sort the data); may not be consistent with the inherent structure of the data

12 Python Example
Discretize into 5 equal-width bins.
Discretize into 5 equal-frequency bins (the bin boundaries are the quantiles of the data).
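The transcript omits the code for this slide; assuming it uses pandas' cut() and qcut() (a reasonable guess, not confirmed by the transcript), a minimal version, reusing the Age values from the later example, looks like this:

import pandas as pd

ages = pd.Series([10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64])

# Equal-width discretization: 5 bins of equal length
equal_width = pd.cut(ages, bins=5)
print(equal_width.value_counts().sort_index())

# Equal-frequency discretization: 5 bins with (roughly) the same number
# of points; the bin boundaries are the quantiles of the data
equal_freq = pd.qcut(ages, q=5)
print(equal_freq.value_counts().sort_index())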

13 Supervised Discretization Example
Age: 10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64
Buy: No, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No
Let "Buy" be the class attribute.
Suppose we're interested in discretizing the Age attribute.
We also want the intervals (bins) to contain data points from the same class (i.e., we want the bins to be as close to homogeneous as possible).

14 Example
(Using the same Age/Buy table as on the previous slide.)
Equal width: interval width = (64 - 10)/3 = 54/3 = 18
Equal frequency: each bin contains the same number of points
Both approaches can produce intervals that contain non-homogeneous classes.

15 Supervised Discretization
(Same Age/Buy table as before; the class labels form three runs: No, Yes, No.)
In supervised discretization, our goal is to ensure that each bin contains data points from one class.

16 Entropy-based Discretization
A widely-used supervised discretization method.
Entropy is a measure of impurity:
Higher entropy implies the data points are from a large number of classes (heterogeneous)
Lower entropy implies most of the data points are from the same class
Entropy = – Σj pj log2 pj, where pj is the proportion of data points belonging to class j

17 Entropy
Suppose you want to discretize the age of users based on whether they buy or don't buy a product. The class here is Yes or No (whether the user buys the product).
(Same Age/Buy table as before.)
For each bin, calculate the proportion of data points belonging to each class.

18 Entropy
Entropy = – Σj pj log2 pj, where pj is the fraction of data objects belonging to class j.
Bin 1: P(Yes) = 0/6 = 0, P(No) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
Bin 2: P(Yes) = 1/6, P(No) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
Bin 3: P(Yes) = 2/6, P(No) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
As the bin becomes less homogeneous, entropy increases.
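A small helper reproduces these numbers (a sketch, not code from the lecture):

import math

def entropy(counts):
    # Entropy of a bin, given the number of points in each class
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c > 0:                  # 0 * log(0) is taken to be 0
            p = c / total
            ent -= p * math.log2(p)
    return ent

print(entropy([0, 6]))   # 0.0  (perfectly homogeneous bin)
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92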

19 Entropy-based Discretization
Recursively find the best partition that minimizes entropy. First split point = 35.5.

20 Entropy-based Discretization
Find the next best partition that minimizes entropy. Split points so far: 16.5 and 35.5.
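One step of this recursive procedure can be sketched as follows (assuming the split candidates are the midpoints between consecutive Age values; this is an illustration, not the lecture's exact code):

import math

def entropy(labels):
    ent = 0.0
    for cls in set(labels):
        p = labels.count(cls) / len(labels)
        ent -= p * math.log2(p)
    return ent

def best_split(values, labels):
    # Try every midpoint between consecutive values and return the split
    # that minimizes the weighted entropy of the two resulting bins.
    pairs = sorted(zip(values, labels))
    values = [v for v, _ in pairs]
    labels = [l for _, l in pairs]
    n = len(values)
    best_point, best_entropy = None, float('inf')
    for i in range(1, n):
        split = (values[i - 1] + values[i]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best_entropy:
            best_point, best_entropy = split, weighted
    return best_point, best_entropy

ages = [10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64]
buy = ['No', 'No', 'Yes', 'Yes', 'Yes', 'Yes',
       'Yes', 'Yes', 'No', 'No', 'No', 'No']
print(best_split(ages, buy))   # first split point: 35.5
# Applying the same procedure to the left bin yields the next split point, 16.5.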

21 Curse of Dimensionality
Suppose you want to build a model to predict whether a user will buy an item at an online store.
Simple model: predict whether users will buy based on their age.
This model is likely to perform poorly. Can we improve it?

22 Curse of Dimensionality
Suppose you want to build a model to predict whether a user will buy an item at an online store.
A more complicated model is likely to be more accurate, since we can use the two attributes to separate the ones who buy from those who don't.
Can we do even better?

23 Curse of Dimensionality
Suppose you want to build a model to predict whether a user will buy an item at an online store.
Can we keep improving the model by adding more features?

24 Curse of Dimensionality
Given a data set with a fixed number of objects, increasing the number of attributes (i.e., the dimensionality of the data) may actually degrade model performance.
As the number of dimensions increases:
There is a higher chance for the model to overfit noisy observations
More examples are needed to figure out which attributes are most relevant for predicting the different classes

25 Overcoming Curse of Dimensionality
Feature subset selection: pick a subset of attributes to build your prediction model; eliminate the irrelevant and highly correlated ones.
Feature extraction: construct a new set of attributes based on a (linear or nonlinear) combination of the original attributes.

26 Feature Selection Example
Select the non-correlated features for your analysis.
Correlation matrix: [not reproduced in the transcript; see the sketch below]
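A sketch of this idea with pandas (the data and column names are synthetic, loosely echoing the membership years / amount spent / number of purchases features mentioned on the later example slide):

import numpy as np
import pandas as pd

# Synthetic customer data: two features are deliberately correlated
# with membership_years
rng = np.random.default_rng(0)
n = 200
years = rng.uniform(1, 10, n)
df = pd.DataFrame({
    'membership_years': years,
    'amount_spent': 50 * years + rng.normal(0, 20, n),
    'num_purchases': 5 * years + rng.normal(0, 5, n),
    'age': rng.uniform(18, 70, n),
})

# Correlation matrix of the features
corr = df.corr()
print(corr.round(2))

# Drop one feature from every highly correlated pair (|correlation| > 0.9)
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print('Dropped:', to_drop)
reduced = df.drop(columns=to_drop)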

27 Feature Extraction
Creation of a new set of attributes from the original raw data.
Example: face detection in images. Raw pixels are too fine-grained to enable accurate detection of a face. Generating higher-level features, such as those representing the presence or absence of certain facial features (e.g., mouth, eyebrows), can help improve detection accuracy.

28 Principal Component Analysis
A widely-used (classical) approach for feature extraction.
The goal of PCA is to construct a new set of dimensions (attributes) that better captures the variability of the data.
The first dimension is chosen to capture as much of the variability as possible.
The second dimension is orthogonal to the first and captures as much of the remaining variability as possible, and so on.

29 Principal Component Analysis
[Diagram: the N x d data frame (table) is multiplied by the d x k matrix of principal components, k << d, to give the N x k projected data]

30 Example

31 Example Note: membership years, amount spent, and number of purchases are quite correlated

32 Computing Principal Components
Given a data set D, and suppose we want to project the data from 5 features to k = 2 features (principal components):
Calculate the covariance matrix C; the PCs are the eigenvectors of the covariance matrix.
To calculate the projected data, first center each column in the data to obtain D'.
Then calculate the projections (T denotes the transpose operation):
projectedT = (PC)T x (D')T, with dimensions (k x N) = (k x d) x (d x N)

33 Example
We can use numpy linear algebra functions to calculate the eigenvectors and perform the matrix multiplication:
data.cov() – calculate the covariance matrix
data.as_matrix() – convert the DataFrame to a NumPy array (in newer pandas, use data.to_numpy())
linalg.eig(cov) – calculate the eigenvalues and eigenvectors (the eigenvectors are the PCs)
(A – mean(A.T, axis=1)) – center the columns of the data matrix
dot(pc.T, M).T – multiply the PCs with the centered data matrix
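Putting the steps from slides 32-33 together as a runnable sketch (the data here is synthetic, with d = 5 features and k = 2; note that modern pandas uses to_numpy() instead of as_matrix()):

import numpy as np
import pandas as pd

# Synthetic data set with N = 100 rows and d = 5 features
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 5)),
                    columns=['f1', 'f2', 'f3', 'f4', 'f5'])

# 1. Covariance matrix of the features (d x d)
cov = data.cov()

# 2. Eigenvectors of the covariance matrix are the principal components;
#    keep the k = 2 with the largest eigenvalues
eig_vals, eig_vecs = np.linalg.eig(cov.to_numpy())
order = np.argsort(eig_vals)[::-1]
pc = eig_vecs[:, order[:2]]              # d x k

# 3. Center each column of the data matrix
M = data.to_numpy()
M_centered = M - M.mean(axis=0)          # N x d

# 4. Project: projected^T = PC^T x (D')^T, i.e. (k x d)(d x N) = k x N
projected = (pc.T @ M_centered.T).T      # back to N x k
print(projected.shape)                   # (100, 2)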

34 Example

35 Example
[Scatter plot of the data projected onto the 1st and 2nd PCs]

36 Summary
In this lecture, we discussed:
Data preprocessing approaches
Examples of using Python to do data preprocessing
Next lecture: data summarization and visualization

