
Slide 1: 2. Data Preparation and Preprocessing
- Data and Its Forms
- Preparation
- Preprocessing and Data Reduction

Slide 2: Data Types and Forms
- Attribute-vector data; data types:
  - numeric, categorical (see the hierarchy for their relationship)
  - static, dynamic (temporal)
- Other data forms:
  - distributed data
  - text, Web, metadata
  - images, audio/video
- You have seen most of them in the invited talks.

Slide 3: Data Preparation
- An important and time-consuming task in KDD
- Raw data issues:
  - High-dimensional data (20, 100, 1000 features)
  - Huge data size
  - Missing data
  - Outliers
  - Erroneous data (inconsistent, misrecorded, distorted)

Slide 4: Data Preparation Methods
- Data annotation (as in driving data analysis; image mining is another example)
- Data normalization
- Dealing with sequential or temporal data: transform it to tabular form
- Removing outliers (of different types)

Slide 5: Normalization
- Decimal scaling: v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1.
  Example: for values ranging from -991 to 99, k = 3 (divide by 1000), so -991 becomes -0.991.
- Min-max normalization into a new range:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  Example: v = 73600 in [12000, 98000] maps to v' = 0.716 in the new range [0, 1].
- Zero-mean normalization: v' = (v - mean_A) / std_dev_A
  Example: (1, 2, 3) has mean 2 and std_dev 1, giving (-1, 0, 1).
  If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 maps to 1.225.
- (A short sketch of these three normalizations follows.)
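A minimal Python sketch of the three normalizations, assuming NumPy; the input values are the ones used in the slide's examples:

```python
import numpy as np

def decimal_scaling(v):
    """Divide by the smallest power of 10 that brings all |values| below 1."""
    k = 0
    while np.max(np.abs(v)) / (10 ** k) >= 1:
        k += 1
    return v / (10 ** k)

def min_max(v, new_min=0.0, new_max=1.0):
    """Rescale into [new_min, new_max] using the observed min and max."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def zero_mean(v):
    """Standardize to zero mean and unit (sample) standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

print(decimal_scaling(np.array([-991.0, 99.0])))       # [-0.991  0.099]
print(min_max(np.array([12000.0, 73600.0, 98000.0])))  # 73600 -> ~0.716
print(zero_mean(np.array([1.0, 2.0, 3.0])))            # [-1.  0.  1.]
print((73600 - 54000) / 16000)                         # 1.225
```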

Slide 6: Temporal Data
- The goal is to forecast t(n+1) from the previous values X = {t(1), t(2), ..., t(n)}
- An example with two features and window size 3 (a windowing sketch follows this slide):

  Time   A    B
  1      7    215
  2      10   211
  3      6    214
  4      11   221
  5      12   210
  6      14   218

  Inst   A(n-2)  A(n-1)  A(n)   B(n-2)  B(n-1)  B(n)
  1      7       10      6      215     211     214
  2      10      6       11     211     214     221
  3      6       11      12     214     221     210
  4      11      12      14     221     210     218

- How to determine the window size?
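One possible way to turn a multivariate series into the tabular form above with a sliding window, assuming pandas; the frame `ts` and the column names A and B mirror the slide's example:

```python
import pandas as pd

# Series matching the slide's example.
ts = pd.DataFrame({"A": [7, 10, 6, 11, 12, 14],
                   "B": [215, 211, 214, 221, 210, 218]})

def to_windows(df, window=3):
    """Flatten a multivariate time series into overlapping windows of fixed size."""
    rows = []
    for start in range(len(df) - window + 1):
        chunk = df.iloc[start:start + window]
        row = {}
        for col in df.columns:
            for offset, value in enumerate(chunk[col]):
                # Name columns A(n-2), A(n-1), A(n), etc.
                name = f"{col}(n)" if offset == window - 1 else f"{col}(n-{window - 1 - offset})"
                row[name] = value
        rows.append(row)
    return pd.DataFrame(rows)

print(to_windows(ts, window=3))  # reproduces the 4 windowed instances on the slide
```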

Slide 7: Outlier Removal
- Outliers: data points inconsistent with the majority of the data
- Different kinds of outliers:
  - Valid: a CEO's salary
  - Noisy: a person's age = 200; widely deviated points
- Removal methods (see the sketch after this list):
  - Clustering
  - Curve fitting
  - Hypothesis testing with a given model
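A hedged sketch of clustering-based removal, assuming scikit-learn's KMeans; the distance threshold `factor` and the tiny age data are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_outliers(X, n_clusters=2, factor=2.0):
    """Flag points whose distance to their cluster centroid exceeds
    `factor` times the mean distance within that cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    flags = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        mask = km.labels_ == c
        flags[mask] = dists[mask] > factor * dists[mask].mean()
    return flags

X = np.array([[25.0], [30.0], [28.0], [27.0], [200.0]])  # age = 200 is the noisy point
print(cluster_outliers(X, n_clusters=1))                 # [False False False False  True]
```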

Slide 8: Data Preprocessing
- Data cleaning:
  - missing data
  - noisy data
  - inconsistent data
- Data reduction:
  - Dimensionality reduction
  - Instance selection
  - Value discretization

Slide 9: Missing Data
- Many types of missing data:
  - not measured
  - truly missed
  - wrongly placed, ...
- Some methods (see the sketch after this list):
  - leave as is
  - ignore/remove the instance with the missing value
  - manual fix (assign a value with an implicit meaning)
  - statistical methods (majority, most likely, mean, nearest neighbor, ...)
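One possible treatment of the removal and statistical methods above, assuming pandas; the toy frame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical table with missing values.
df = pd.DataFrame({"age": [23, None, 31, 27],
                   "city": ["NY", "LA", None, "NY"]})

# Remove instances with missing values.
dropped = df.dropna()

# Statistical fill: mean for numeric columns, majority (mode) for categorical ones.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```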

Slide 10: Noisy Data
- Random error or variance in a measured variable:
  - inconsistent values for features or classes (process)
  - measuring errors (source)
- Noise is normally a minority in the data set. Why?
- Removing noise (a smoothing sketch follows this list):
  - Clustering/merging
  - Smoothing (rounding, averaging within a window)
  - Outlier detection (deviation-based or distance-based)
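A small sketch of smoothing by averaging within a window (bin means), using NumPy; the sample values are hypothetical:

```python
import numpy as np

def smooth_by_bin_means(values, bin_size=3):
    """Sort the values, group them into fixed-size bins, and replace each
    value with its bin mean (a simple window-averaging smoother)."""
    v = np.sort(np.asarray(values, dtype=float))
    for start in range(0, len(v), bin_size):
        v[start:start + bin_size] = v[start:start + bin_size].mean()
    return v

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34]))
# [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```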

Slide 11: Inconsistent Data
- Inconsistent with our models or common sense
- Examples:
  - The same name occurs differently in an application
  - Different names appear the same (Dennis vs. Denis)
  - Inappropriate values (Male-Pregnant, negative age)
  - One bank's database shows that 5% of its customers were born on 11/11/11
  - ...

Slide 12: Dimensionality Reduction
- Feature selection:
  - select m from n features, m <= n
  - remove irrelevant, redundant features
  - the saving is in search space
- Feature transformation (e.g., PCA):
  - form new features (a) in a new domain from the original features (f)
  - many uses, but it does not reduce the original dimensionality
  - often used in visualization of data

Slide 13: Feature Selection
- Problem illustration: search the space of feature subsets, from the full set down to the empty set
- Search strategies:
  - Exhaustive/Complete (Enumeration/B&B)
  - Heuristic (Sequential forward/backward)
  - Stochastic (generate/evaluate)
- Generation/evaluation of individual features or of subsets

Slide 14: Feature Selection (2)
- Goodness metrics:
  - Dependency: dependence on classes
  - Distance: separating classes
  - Information: entropy
  - Consistency: 1 - #inconsistencies/N
    Example: (F1, F2, F3) and (F1, F3) both have a 2/6 inconsistency rate on the data below (see the sketch after this slide)
  - Accuracy (classifier based): 1 - errorRate
- Comparing the metrics: time complexity, number of features, removing redundancy

  F1   F2   F3   C
  0    0    1    1
  0    0    1    0
  0    0    1    1
  1    0    0    1
  1    0    0    0
  1    0    0    0
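A sketch of the consistency measure on the slide's six instances, assuming pandas; it reproduces the 2/6 inconsistency rate for both subsets:

```python
from collections import Counter

import pandas as pd

# The six instances from the slide.
data = pd.DataFrame(
    [[0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1],
     [1, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0]],
    columns=["F1", "F2", "F3", "C"])

def inconsistency_rate(df, features, label="C"):
    """#inconsistencies / N: for each repeated feature pattern, count the
    instances that do not belong to the pattern's majority class."""
    inconsistencies = 0
    for _, group in df.groupby(features):
        counts = Counter(group[label])
        inconsistencies += len(group) - counts.most_common(1)[0][1]
    return inconsistencies / len(df)

print(inconsistency_rate(data, ["F1", "F2", "F3"]))  # 2/6 ≈ 0.333
print(inconsistency_rate(data, ["F1", "F3"]))        # 2/6 ≈ 0.333
```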

Slide 15: Feature Selection (3)
- Filter vs. Wrapper model
  - Pros and cons: time, generality, performance (such as accuracy)
- Stopping criteria:
  - thresholding (number of iterations, a target accuracy, ...)
  - anytime algorithms:
    - provide approximate solutions
    - solutions improve over time

Slide 16: Feature Selection (Examples)
- SFS using consistency (cRate):
  - select 1 from n features, then 1 from the remaining n-1, n-2, ...
  - increase the number of selected features until the pre-specified cRate is reached
- LVF using consistency (cRate), see the sketch after this slide:
  1. randomly generate a subset S from the full set
  2. if it satisfies the pre-specified cRate, keep the S with minimum |S|
  3. go back to 1 until a stopping criterion is met
- LVF is an anytime algorithm
- Many other algorithms: SBS, B&B, ...
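A rough sketch of LVF under stated assumptions: it reuses the hypothetical inconsistency_rate helper and the `data` frame from the Slide 14 sketch, and the stopping criterion is simply a fixed number of iterations:

```python
import random

def lvf(df, label, max_rate, max_iters=1000, seed=0):
    """Las Vegas Filter sketch: randomly sample feature subsets and keep the
    smallest one whose inconsistency rate stays within max_rate."""
    rng = random.Random(seed)
    features = [c for c in df.columns if c != label]
    best = features[:]                       # start from the full set
    for _ in range(max_iters):
        k = rng.randint(1, len(features))
        subset = rng.sample(features, k)
        if len(subset) < len(best) and inconsistency_rate(df, subset, label) <= max_rate:
            best = subset
    return best

print(lvf(data, "C", max_rate=2 / 6))  # e.g. a minimal subset such as ['F1'] or ['F3']
```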

Slide 17: Transformation: PCA
- D' = D A, where D is the mean-centered data matrix (N x n)
- Calculate and rank the eigenvalues (lambda) of the covariance matrix
- Select the m largest eigenvalues such that r = (lambda_1 + ... + lambda_m) / (lambda_1 + ... + lambda_n) > threshold (e.g., 0.95); the corresponding eigenvectors form A (n x m)
- Example with the Iris data (a sketch of the computation follows):

  Eigenvalue     Diff      Prop      Cumu
  1  2.91082   1.98960   0.72771   0.72770
  2  0.92122   0.77387   0.23031   0.95801
  3  0.14735   0.12675   0.03684   0.99485
  4  0.02061             0.00515   1.00000

         V1          V2          V3          V4
  F1   0.522372    0.372318   -0.721017   -0.261996
  F2  -0.263355    0.925556    0.242033    0.124135
  F3   0.581254    0.021095    0.140892    0.801154
  F4   0.565611    0.065416    0.633801   -0.523546
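A sketch of the PCA steps with NumPy, using scikit-learn only to load Iris; the slide's eigenvalues sum to 4, which suggests the data were standardized before computing the covariance, so the sketch does the same (an assumption, and the decimals will only roughly match the table):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                        # 150 x 4
D = (X - X.mean(axis=0)) / X.std(axis=0)    # mean-center (and standardize, per the assumption above)

cov = np.cov(D, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
order = np.argsort(eigvals)[::-1]           # rank eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratios = np.cumsum(eigvals) / eigvals.sum() # cumulative proportion r
m = int(np.searchsorted(ratios, 0.95) + 1)  # smallest m whose cumulative proportion reaches 0.95
A = eigvecs[:, :m]                          # n x m projection matrix
D_prime = D @ A                             # N x m transformed data

print(eigvals)   # roughly [2.92, 0.91, 0.15, 0.02], close to the slide's table
print(m)         # 2 components reach ~0.958 cumulative proportion
```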

Slide 18: Instance Selection
- Sampling methods (see the sketch after this list):
  - random sampling
  - stratified sampling
- Search-based methods:
  - Representatives
  - Prototypes
  - Sufficient statistics (N, mean, stdDev)
  - Support vectors
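A brief illustration of random vs. stratified sampling with pandas; the toy frame, its column names, and the 0.5 sampling fraction are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10),
                   "label": ["A"] * 8 + ["B"] * 2})

# Random sampling: every instance has the same chance of selection.
random_sample = df.sample(frac=0.5, random_state=0)

# Stratified sampling: sample within each class so the class ratio is preserved.
stratified_sample = (df.groupby("label", group_keys=False)
                       .apply(lambda g: g.sample(frac=0.5, random_state=0)))

print(stratified_sample["label"].value_counts())  # 4 A's and 1 B: the 4:1 ratio is kept
```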

Slide 19: Value Discretization
- Binning methods:
  - Equal-width
  - Equal-frequency
  - Class information is not used
- Entropy-based
- ChiMerge
- Chi2

Slide 20: Binning
- Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equi-width binning, for a bin width of e.g. 10:
  - Bin 1: 0, 4            [-inf, 10) bin
  - Bin 2: 12, 16, 16, 18  [10, 20) bin
  - Bin 3: 24, 26, 28      [20, +inf) bin
- Equi-frequency binning, for a bin density of e.g. 3:
  - Bin 1: 0, 4, 12        [-inf, 14) bin
  - Bin 2: 16, 16, 18      [14, 21) bin
  - Bin 3: 24, 26, 28      [21, +inf) bin
- Any problems with the above methods? (A sketch of both methods follows.)
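A sketch of both binning methods on the slide's age values, assuming NumPy; it returns bin labels per value rather than interval boundaries:

```python
import numpy as np

values = np.array([0, 4, 12, 16, 16, 18, 24, 26, 28])

def equal_width_bins(v, width=10):
    """Assign each value to a bin of fixed width: bin k covers [k*width, (k+1)*width)."""
    return v // width

def equal_frequency_bins(v, per_bin=3):
    """Sort the values and put the same number of values into each bin."""
    order = np.argsort(v, kind="stable")
    bins = np.empty(len(v), dtype=int)
    bins[order] = np.arange(len(v)) // per_bin
    return bins

print(equal_width_bins(values))      # [0 0 1 1 1 1 2 2 2] -> the slide's three equi-width bins
print(equal_frequency_bins(values))  # [0 0 0 1 1 1 2 2 2] -> three bins of three values each
```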

Slide 21: Entropy-based
- Given attribute-value/class pairs: (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization: intuitively, find the best split so that the bins are as pure as possible
- Formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs
- Entropy(S) = - p log p - n log n (base-2 log)
- Smaller entropy: the set is relatively pure; the smallest value is 0
- Larger entropy: the set is mixed; the largest value is 1

Slide 22: Entropy-based (2)
- Let v be a possible split. Then S is divided into two sets: S1 (values < v) and S2 (values >= v)
- Information of the split: I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split: Gain(v, S) = Entropy(S) - I(S1, S2)
- Goal: the split with maximal information gain (maximum Gain means minimum I)
- Possible splits: midpoints between any two consecutive values
- For v = 14: I(S1, S2) = 0 + (6/9) * Entropy(S2) = (6/9) * 0.65 = 0.433, so Gain(14, S) = Entropy(S) - 0.433
- The best split is found after examining all possible split points (a sketch follows)
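A sketch that evaluates every midpoint split on the slide's nine pairs and confirms that v = 14 maximizes the gain (plain Python, base-2 logs):

```python
import math

# The attribute-value/class pairs from Slide 21.
pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]

def entropy(labels):
    """Entropy of a list of class labels (0 for an empty or pure set)."""
    total = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * math.log2(p)
    return result

def best_split(pairs):
    """Try the midpoint between every two consecutive distinct values and
    return the split with maximal information gain."""
    values = sorted({v for v, _ in pairs})
    labels = [c for _, c in pairs]
    base = entropy(labels)
    best = None
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2
        s1 = [c for val, c in pairs if val < v]
        s2 = [c for val, c in pairs if val >= v]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        gain = base - info
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

print(best_split(pairs))  # v = 14.0 gives the maximal gain (about 0.558)
```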

Slide 23: ChiMerge and Chi2
- Given attribute-value/class pairs, e.g.:
  F = 12: P, N, P
  F = 16: N, N, P
  F = 24: N, N, N
- Build a contingency table for every pair of adjacent intervals (I):

         C1    C2    Sum
  I-1    A11   A12   R1
  I-2    A21   A22   R2
  Sum    C1    C2    N

- Chi-squared test (goodness of fit): chi^2 = sum_{i=1..2} sum_{j=1..k} (A_ij - E_ij)^2 / E_ij, where E_ij = R_i * C_j / N is the expected frequency
- Parameters: df = k - 1 and a p% level of significance
- The Chi2 algorithm provides an automatic way to adjust p
- (A sketch of the chi-squared computation follows.)
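A small sketch of the chi-squared statistic for a 2 x k contingency table of two adjacent intervals, assuming NumPy; the example counts are hypothetical, purely to show how ChiMerge would score a pair of intervals:

```python
import numpy as np

def chi2_statistic(table):
    """Chi-squared statistic for a 2 x k table of two adjacent intervals (rows)
    against the k classes (columns)."""
    table = np.asarray(table, dtype=float)
    row_sums = table.sum(axis=1, keepdims=True)   # R_i
    col_sums = table.sum(axis=0, keepdims=True)   # C_j
    expected = row_sums * col_sums / table.sum()  # E_ij = R_i * C_j / N
    return ((table - expected) ** 2 / expected).sum()

# Hypothetical adjacent intervals with class counts [P, N]:
# interval 1 holds 2 P and 1 N, interval 2 holds 1 P and 2 N.
print(chi2_statistic([[2, 1], [1, 2]]))  # ~0.667: a low value, so ChiMerge would tend to merge them
```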

Slide 24: Summary
- Data have many forms; attribute-vectors are the most common form
- Raw data need to be prepared and preprocessed for data mining:
  - Data miners have to work on the data provided
  - Domain expertise is important in DPP
- Data preparation: normalization, transformation
- Data preprocessing: cleaning and reduction
- DPP is a critical and time-consuming task. Why?

Slide 25: Bibliography
- H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley-Interscience.
- H. Liu & H. Motoda, editors, 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6:393-423.

