
Slide 1: Data Mining: Data Preparation

Slide 2: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 3: Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

Slide 4: Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility

Slide 5: Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, of particular importance for numerical data

Slide 6: Forms of Data Preprocessing (figure)

Slide 7: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 8: Data Cleaning
- Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Slide 9: Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - failure to register the history or changes of the data
- Missing data may need to be inferred.

Slide 10: How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assumes a classification task; not effective when the percentage of missing values per attribute varies considerably)
- Fill in the missing value manually: tedious and often infeasible
- Use a global constant to fill in the missing value: e.g., "unknown" (which may effectively form a new class!)
- Use the attribute mean to fill in the missing value
- Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree
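The attribute-mean strategy above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides; the income values and the `None`-for-missing convention are my own assumptions.

```python
def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical customer incomes with two missing entries:
incomes = [30_000, None, 50_000, None, 40_000]
print(fill_with_mean(incomes))  # missing entries become the mean, 40000.0
```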

Slide 11: Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

Slide 12: How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering:
  - detect and remove outliers
- Combined computer and human inspection:
  - detect suspicious values and have a human check them
- Regression:
  - smooth by fitting the data to regression functions

Slide 13: Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size: a uniform grid
  - if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
  - the most straightforward method
  - but outliers may dominate the presentation
  - skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - divides the range into N intervals, each containing approximately the same number of samples
  - good data scaling
  - managing categorical attributes can be tricky
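The two partitioning schemes can be sketched in Python. This is a minimal illustration (the function names are my own); it reuses the price data from the worked example on the next slide.

```python
def equal_width_bins(values, n):
    """Partition values into n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # clamp so the maximum value B falls in the last bin
        bins[min(int((v - a) / w), n - 1)].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition sorted values into n bins of (roughly) equal frequency."""
    s = sorted(values)
    size = len(s) // n
    return [s[i * size:(i + 1) * size] for i in range(n - 1)] + [s[(n - 1) * size:]]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # width W = (34 - 4) / 3 = 10
print(equal_depth_bins(prices, 3))  # four values per bin
```

Note how the equal-width bins are dominated by the spread of the data, while the equal-depth bins each hold the same number of samples.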

Slide 14: Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
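The two smoothing rules above can be sketched as follows (a minimal illustration with function names of my own; note the slide rounds the bin means 22.75 and 29.25 to 23 and 29):

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min and max (ties go to the min)."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9.0, ...], [22.75, ...], [29.25, ...]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```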

Slide 15: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 16: Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration:
  - integrate metadata from different sources
  - entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Slide 17: Handling Redundant Data
- Redundant data often arise when multiple databases are integrated
  - the same attribute may have different names in different databases
- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Slide 18: Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Slide 19: Data Transformation: Normalization
- Min-max normalization: v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
- Z-score normalization: v' = (v - mean_A) / stand_dev_A
- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
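The three normalizations can be sketched directly from the formulas. This is a minimal illustration; the numeric values in the examples are my own assumptions, not from the slides.

```python
def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [mn, mx] to a new range (default [0, 1])."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j for the smallest j such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Illustrative values (assumed, not from the slides):
print(min_max(73_600, 12_000, 98_000))  # about 0.716
print(z_score(73_600, 54_000, 16_000))  # 1.225
print(decimal_scaling([-986, 917]))     # ([-0.986, 0.917], 3)
```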

Slide 20: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 21: Data Reduction Strategies
- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies:
  - data cube aggregation
  - dimensionality reduction
  - numerosity reduction
  - discretization and concept hierarchy generation

Slide 22: Data Cube Aggregation
- The lowest level of a data cube:
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone-call data warehouse
- Multiple levels of aggregation in data cubes:
  - further reduce the size of the data to deal with
- Reference appropriate levels:
  - use the smallest representation sufficient to solve the task

Slide 23: Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
  - reduces the number of attributes appearing in the discovered patterns, making them easier to understand

Slide 24: Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- (Tree figure: A4 is tested at the root, then A1 and A6; leaves are labeled Class 1 and Class 2)
- Reduced attribute set: {A1, A4, A6}

Slide 25: Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - best single features under the feature independence assumption: choose by significance tests
  - best step-wise feature selection:
    - the best single feature is picked first
    - then the next best feature conditioned on the first, ...
  - step-wise feature elimination:
    - repeatedly eliminate the worst feature
  - best combined feature selection and elimination
  - optimal branch and bound:
    - use feature elimination and backtracking
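Best step-wise (forward) selection can be sketched as a greedy loop. This is a minimal sketch: the caller-supplied `score` evaluation function and the toy weights are my own assumptions, standing in for a real significance test or classifier accuracy.

```python
def forward_selection(features, score, k):
    """Best step-wise feature selection: greedily add the feature that most
    improves score(subset) until k features are chosen."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each feature contributes a fixed amount (purely illustrative).
weights = {"A1": 3, "A2": 1, "A4": 5, "A6": 2}
score = lambda subset: sum(weights.get(f, 0) for f in subset)
print(forward_selection(["A1", "A2", "A4", "A6"], score, 3))  # ['A4', 'A1', 'A6']
```

Step-wise elimination is the mirror image: start from all d features and repeatedly drop the one whose removal hurts the score least.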

Slide 26: Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Slide 27: Regression Analysis and Log-Linear Models
- Linear regression: Y = α + βX
  - the two parameters α and β specify the line and are estimated from the data at hand
  - fitted by applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1·X1 + b2·X2
  - many nonlinear functions can be transformed into the above
- Log-linear models:
  - the multi-way table of joint probabilities is approximated by a product of lower-order tables
  - probability: p(a, b, c, d) = αab βac χad δbcd
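The least-squares estimates of α and β have a closed form, which can be sketched directly (a minimal illustration; the sample points are my own, chosen to lie exactly on Y = 1 + 2X):

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha (intercept) and beta (slope) for Y = alpha + beta*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Points lying exactly on Y = 1 + 2X (illustrative data):
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0)
```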

Slide 28: Histograms
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

Slide 29: Clustering
- Partition the data set into clusters, then store only the cluster representations
- Can be very effective if the data is clustered, but not if it is "smeared"
- Hierarchical clusterings can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, detailed further in Chapter 8

Slide 30: Sampling
- Allows a mining algorithm to run in complexity potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods
  - stratified sampling:
    - approximate the percentage of each class (or subpopulation of interest) in the overall database
    - used in conjunction with skewed data
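The sampling schemes above can be sketched with the standard library. A minimal illustration: the function names, the `key` argument for identifying a record's stratum, and the fixed seed are my own assumptions.

```python
import random

def srswor(data, n, seed=0):
    """Simple random sample without replacement (SRSWOR)."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=0):
    """Simple random sample with replacement (SRSWR)."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(records, key, frac, seed=0):
    """Sample each stratum (class given by key) at rate frac,
    approximating the class proportions of the full data set."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    out = []
    for group in strata.values():
        out.extend(rng.sample(group, max(1, round(len(group) * frac))))
    return out

# A 90/10 skewed data set: stratified sampling keeps the 90/10 ratio.
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
print(len(stratified_sample(records, key=lambda r: r[0], frac=0.1)))  # 10
```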

Slide 31: Sampling (figure: raw data sampled by SRSWOR, simple random sampling without replacement, and SRSWR, simple random sampling with replacement)

Slide 32: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 33: Discretization
- Three types of attributes:
  - nominal: values from an unordered set
  - ordinal: values from an ordered set
  - continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - some classification algorithms accept only categorical attributes
  - reduces data size
  - prepares data for further analysis

Slide 34: Discretization and Concept Hierarchy
- Discretization:
  - reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then replace actual data values
- Concept hierarchies:
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
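Climbing a concept hierarchy for the age attribute can be sketched as a simple mapping. The cut points (<30, 30-59, ≥60) are my own illustrative assumptions; the slides name only the labels.

```python
def age_concept(age):
    """Replace a numeric age by a higher-level concept label.
    Cut points are illustrative assumptions, not from the slides."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 35, 47, 61, 19]
print([age_concept(a) for a in ages])
# ['young', 'middle-aged', 'middle-aged', 'senior', 'young']
```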

Slide 35: Discretization for Numeric Data
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)

Slide 36: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Slide 37: Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - data cleaning and data integration
  - data reduction and feature selection
  - discretization
- Many methods have been developed, but this remains an active area of research

Slide 38: References
- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78.
- Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December.
- D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann.
- T. Redman. Data Quality: Management and Technology. Bantam Books, New York.
- Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95.
- R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7, 1995.
