Data Preprocessing.


Data Preprocessing

Data Preprocessing An important issue for data warehousing and data mining: real-world data tend to be incomplete, noisy, and inconsistent. Preprocessing includes data cleaning, data integration, data transformation, and data reduction.

Forms of Data Preprocessing: data cleaning; data integration; data transformation (e.g., -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48); data reduction (e.g., attributes A1, A2, A3, ..., A126 reduced to A1, A2, A3, ..., A115, and tuples T1, T2, ..., T2000 reduced to T1, T4, ..., T1456).

Data Preprocessing Data cleaning: fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.

Data Preprocessing Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity all contribute towards smooth data integration.

Data Preprocessing Data transformation converts the data into forms appropriate for mining; e.g., attribute data may be normalized to fall within a small range such as 0.0 to 1.0.

Data Preprocessing Data reduction obtains a reduced representation of the data while minimizing the loss of information content; techniques include data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization.

Data Preprocessing Automatic generation of concept hierarchies: for numeric data, binning, histogram analysis, cluster analysis, entropy-based discretization, and segmentation by natural partitioning; for categorical data, concept hierarchies may be generated based on the number of distinct values of the attributes defining the hierarchy.


Data Cleaning Handling data that are incomplete, noisy and inconsistent It is an imperfect world

Data Cleaning: Missing Values Methods of filling in missing values: ignore the tuple; fill in the missing value manually; use a global constant; use the attribute mean; use the attribute mean for all samples belonging to the same class; use the most probable value.
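As a rough illustration, the two mean-based strategies above can be sketched in Python. The column values and class labels below are made up, and missing entries are represented as None:

```python
# Sketch of two missing-value strategies on a toy attribute column.

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_missing_with_class_mean(values, labels):
    """Replace None with the mean of samples sharing the same class label."""
    means = {}
    for label in set(labels):
        obs = [v for v, l in zip(values, labels) if l == label and v is not None]
        means[label] = sum(obs) / len(obs)
    return [means[l] if v is None else v for v, l in zip(values, labels)]

income = [30, None, 50, 40, None]
print(fill_missing_with_mean(income))   # mean of 30, 50, 40 is 40.0
```

A global constant or the most probable value would slot into the same shape; the per-class variant usually gives a better estimate when the attribute correlates with the class.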

Data Cleaning:Noisy Data Noise - random error or variance in a measured variable smooth out the data to remove the noise

Data Cleaning: Noisy Data Data smoothing techniques: Binning smooths a sorted data value by consulting its neighborhood; the sorted values are distributed into a number of buckets, or bins. Variants: smoothing by bin means, smoothing by bin medians, smoothing by bin boundaries.

Simple Discretization Methods: Binning Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid); if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N. The most straightforward method, but outliers may dominate the presentation, and skewed data are not handled well. Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples; gives good data scaling, but managing categorical attributes can be tricky.

Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
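The worked example above can be reproduced with a short Python sketch. The function names are ours, and bin means are rounded to the nearest integer to match the slide:

```python
# Equi-depth binning, then smoothing by bin means and by bin boundaries.

def equi_depth_bins(sorted_data, n_bins):
    """Split sorted data into n_bins bins of (roughly) equal size."""
    size = len(sorted_data) // n_bins
    return [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of the bin's min or max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```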

Cluster Analysis Outliers may be detected by clustering, where similar values are organized into groups, or clusters; values that fall outside the clusters may be considered outliers. Related smoothing approaches include combined computer and human inspection, and regression.

Cluster Analysis

Regression Smooth the data by fitting it to a function, e.g., the line y = x + 1; a value Y1 is then replaced by its fitted value Y1' on the line.
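Smoothing by regression can be sketched with a closed-form least-squares fit. The data points below are invented and chosen to lie exactly on y = x + 1, so the fitted values equal the originals:

```python
# Ordinary least squares for a line y = a*x + b, used for smoothing:
# each observed y is replaced by its fitted value on the line.

def fit_line(xs, ys):
    """Closed-form OLS estimates of slope a and intercept b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]            # lies exactly on y = x + 1
a, b = fit_line(xs, ys)
smoothed = [a * x + b for x in xs]
print(a, b)                      # slope 1.0, intercept 1.0
```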

Data Smoothing Techniques Binning Example sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34 Partition into equidepth bins Bin 1: 4, 8, 15 Bin 2: 21, 21, 24 Bin 3: 25, 28, 34

Data Smoothing Techniques : Binning smoothing by bin means Bin 1: 9, 9, 9 Bin 2: 22, 22, 22 Bin 3: 29, 29, 29 smoothing by bin boundaries Bin 1: 4, 4, 15 Bin 2: 21, 21, 24 Bin 3: 25, 25, 34

Data Cleaning: Inconsistent Data Inconsistent data can be corrected manually using external references. A common source of inconsistency is error at data entry, which can be corrected by consulting the paper trail.


Data Integration and Transformation Data integration combines data from multiple sources into a coherent data store, e.g., a data warehouse; sources may include multiple databases, data cubes, or flat files. Issues in data integration: schema integration, redundancy, and detection and resolution of data value conflicts. Data transformation: data are transformed or consolidated into forms appropriate for mining; it involves smoothing, aggregation, generalization, normalization, and attribute construction.

Data Integration Schema integration: integrate metadata from different sources. Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#. Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons include different representations and different scales, e.g., metric vs. British units.

Data Integration Redundant data often occur when multiple databases are integrated: the same attribute may have different names in different databases, or one attribute may be a "derived" attribute in another table, e.g., annual revenue. Redundant data can often be detected by correlation analysis. Careful integration of data from multiple sources helps reduce or avoid redundancies and inconsistencies and improves mining speed and quality.
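Correlation analysis for redundancy detection can be sketched as follows. The attribute names and values are hypothetical, and annual_revenue is deliberately a derived multiple of monthly_revenue, so the Pearson coefficient comes out at 1:

```python
# Pearson correlation: a coefficient near +1 or -1 suggests one attribute
# is (nearly) a linear function of the other, i.e., redundant.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]    # derived: 12 * monthly
print(pearson(monthly_revenue, annual_revenue))   # approximately 1.0 (redundant)
```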

Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling Attribute/feature construction New attributes constructed from the given ones

Data Transformation: Normalization Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. Z-score normalization: v' = (v - mean_A) / stddev_A. Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
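A minimal Python sketch of the three normalization methods, using made-up values (min_max assumes the attribute is not constant, and z_score uses the population standard deviation):

```python
# Min-max, z-score, and decimal-scaling normalization.

import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) std deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j for the smallest j with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30]))           # [0.0, 0.5, 1.0]
print(decimal_scaling([-48, 97, 318])) # [-0.048, 0.097, 0.318]
```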


Data Reduction To obtain a reduced representation of the data set that is much smaller in volume but closely maintains the integrity of the original data mining on the reduced dataset should be more efficient yet produce the same analytical results.

Data Reduction Outline: data cube aggregation; dimensionality reduction; data compression; numerosity reduction; discretization and concept hierarchy generation.

Data Cube Aggregation The lowest level of a data cube the aggregated data for an individual entity of interest e.g., a customer in a phone calling data warehouse. Multiple levels of aggregation in data cubes Further reduce the size of data to deal with Reference appropriate levels Use the smallest representation which is enough to solve the task Queries regarding aggregated information should be answered using data cube, when possible

Data Cube Aggregation Sales data for company AllElectronics, 1997-1999. Quarterly sales for 1997: Q1 $224,000; Q2 $408,000; Q3 $350,000; Q4 $586,000. Aggregated annual sales: 1997 $1,568,000; 1998 $2,356,000; 1999 $3,594,000.
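The roll-up from quarter level to year level can be sketched directly; only the 1997 quarters from the example above are shown, and the dictionary layout is our choice:

```python
# Data cube roll-up: aggregate (year, quarter) cells to year totals.

quarterly_sales = {
    (1997, "Q1"): 224_000, (1997, "Q2"): 408_000,
    (1997, "Q3"): 350_000, (1997, "Q4"): 586_000,
}

annual = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual[year] = annual.get(year, 0) + amount

print(annual)   # {1997: 1568000}, matching the table
```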


Dimensionality Reduction The role of dimension reduction in data mining (figure): data in standard form passes through data preparation and dimension reduction, after which prediction methods are applied and the resulting data subset is evaluated.

Dimensionality Reduction Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Dimensionality reduction reduces the data set size by removing such attributes.

Dimensionality Reduction How can we find a good subset of the original attributes? The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

Dimensionality Reduction Attribute subset selection techniques Forward selection start with empty set of attributes the best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set. Stepwise backward elimination starts with the full set of attributes At each step, it removes the worst attribute remaining in the set.
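Greedy forward selection can be sketched as below. The score function is a toy stand-in we invented for illustration; in practice it would be something like a model's cross-validated accuracy on the candidate attribute subset:

```python
# Greedy forward selection: repeatedly add the attribute that most
# improves a caller-supplied subset score.

def forward_selection(attributes, score, k):
    """Pick k attributes, one per step, maximizing score greedily."""
    selected = []
    remaining = list(attributes)
    for _ in range(k):
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each attribute has a fixed "usefulness"; a subset scores
# the sum. A real task would evaluate a model on the subset instead.
usefulness = {"A1": 5, "A2": 1, "A3": 3}
toy_score = lambda subset: sum(usefulness[a] for a in subset)
print(forward_selection(["A1", "A2", "A3"], toy_score, 2))   # ['A1', 'A3']
```

Stepwise backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.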

Dimensionality Reduction Attribute subset selection techniques Combination of forward selection and backward elimination: at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.

Dimensionality Reduction Attribute subset selection techniques Decision tree induction ID3, C4.5 intended for classification construct a flow chart like structure where each internal (nonleaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external node denotes a class prediction At each node the algorithm chooses the best attribute to partition the data into individual classes.

Example of Decision Tree Induction Initial attribute set: {A1, A2, A3, A4, A5, A6}. The induced tree (figure) tests A4 at the root, then A1 and A6 at the internal nodes, with each leaf predicting Class 1 or Class 2. Reduced attribute set: {A1, A4, A6}.

Dimensionality Reduction Attribute subset selection techniques Reduct computation by rough set theory: attribute subsets (reducts) are identified via the discernibility relations of the classes in the data set. To be discussed in the next class.


Data Compression Apply data encoding or transformation to obtain a reduced or compressed representation of the original data. Lossless compression allows the original data to be reconstructed exactly, although typically only limited manipulation of the compressed data is possible; lossy compression reconstructs only an approximation of the original data.

Data Compression Two methods of lossy data compression: wavelet transforms and principal component analysis.

Data Compression The wavelet transform is a linear signal processing technique that, when applied to a data vector D, transforms it into a numerically different vector D' of wavelet coefficients.

Data Compression Principal component analysis: suppose the data to be compressed consist of N tuples from k dimensions. PCA searches for c k-dimensional orthogonal vectors that can best be used to represent the data, where c ≤ k; the original data are then projected onto a much smaller space.
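A minimal PCA sketch for two-dimensional data (k = 2, c = 1): compute the covariance matrix, take its leading eigenvector (closed form for a symmetric 2x2 matrix), and project each point onto it. Function names and the sample points are ours:

```python
# PCA compression of 2-D points down to 1 coordinate each.

import math

def top_eigenvector_2x2(a, b, d):
    """Leading (unit) eigenvector of the symmetric matrix [[a, b], [b, d]]."""
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    vx, vy = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= d else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def pca_1d(points):
    """Project centered 2-D points onto the principal axis."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    ex, ey = top_eigenvector_2x2(cxx, cxy, cyy)
    return [(p[0] - mx) * ex + (p[1] - my) * ey for p in points]

pts = [(1, 1), (2, 2), (3, 3), (4, 4)]   # perfectly correlated dimensions
print(pca_1d(pts))                       # evenly spaced 1-D coordinates, no loss
```

Because the toy points are perfectly correlated, one component captures all the variance; real data would lose a little information in the discarded components.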


Numerosity Reduction Numerosity reduction techniques reduce the data volume by choosing alternative, smaller forms of data representation: regression and log-linear models, histograms, clustering, and sampling.
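Sampling, the simplest of these, can be sketched as a simple random sample without replacement (SRSWOR); the 2000-tuple data set is invented and the fixed seed is only for reproducibility:

```python
# Numerosity reduction by sampling: keep a random subset of the tuples.

import random

def srswor(tuples, n, seed=0):
    """Simple random sample of n tuples without replacement."""
    rng = random.Random(seed)
    return rng.sample(tuples, n)

data = list(range(1, 2001))    # stand-in for 2000 tuples
sample = srswor(data, 100)
print(len(sample))             # 100
```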


Discretization Three types of attributes: nominal (values from an unordered set), ordinal (values from an ordered set), and continuous (real numbers). Discretization divides the range of a continuous attribute into intervals. Motivation: some classification algorithms only accept categorical attributes; discretization reduces data size and prepares the data for further analysis.

Discretization and Concept hierarchy reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values Concept hierarchies reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)

Discretization Example : Manual discretization of AUS data set

Discretization and Concept Hierarchy Generation for Numeric Data Binning (see sections before) Histogram analysis (see sections before) Clustering analysis (see sections before) Entropy-based discretization Segmentation by natural partitioning

Entropy-Based Discretization Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1|/|S|) * Ent(S1) + (|S2|/|S|) * Ent(S2). The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

Entropy-Based Discretization The process is recursively applied to the partitions obtained until some stopping criterion is met. Experiments show that it may reduce data size and improve classification accuracy.
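A single entropy-based split can be sketched as follows: try every midpoint boundary T and keep the one minimizing E(S, T) as defined above. The ages and class labels are invented for illustration:

```python
# One step of entropy-based discretization: find the boundary T that
# minimizes E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2).

import math

def entropy(labels):
    """Class-information entropy of a list of class labels."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_split(values, labels):
    """Return the midpoint boundary with minimal post-split entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

ages = [20, 22, 25, 40, 45, 50]
labels = ["young", "young", "young", "senior", "senior", "senior"]
print(best_split(ages, labels))   # 32.5: a pure split, entropy 0
```

The recursive version would apply best_split again to each resulting interval until a stopping criterion (e.g., MDL or a minimum entropy gain) is met.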

Segmentation by Natural Partitioning A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals. If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals. If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals. If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals. (See Fig. 3.16, p. 137.)

Concept Hierarchy Generation Many techniques can be applied recursively to provide a hierarchical partitioning of the attribute, known as a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.

Concept Hierarchy Generation for Categorical Data Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country. Specification of a portion of a hierarchy by explicit data grouping, e.g., {Urbana, Champaign, Chicago} < Illinois. Specification of a set of attributes, where the system automatically generates the partial ordering by analyzing the number of distinct values, e.g., street < city < state < country. Specification of only a partial set of attributes, e.g., only street < city, not the others.

Automatic Concept Hierarchy Generation Some concept hierarchies can be automatically generated based on the number of distinct values per attribute in the given data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. Example: country (15 distinct values) sits above province_or_state (365 distinct values), which sits above city (3,567 distinct values) and street (674,339 distinct values). Note the exceptions: weekday, month, quarter, year.
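The distinct-value heuristic can be sketched directly, using the counts from the slide above:

```python
# Order attributes by distinct-value count: fewest values => highest level.

def hierarchy_order(distinct_counts):
    """Return attribute names from highest hierarchy level down."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"street": 674_339, "city": 3_567,
          "province_or_state": 365, "country": 15}
print(hierarchy_order(counts))
# ['country', 'province_or_state', 'city', 'street']
```

As the slide notes, the heuristic fails for attributes like weekday (7 values) vs. month (12) vs. year (many), where the semantic ordering disagrees with the counts.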

Discretization and Concept Hierarchy Generation Manual discretization: the information needed to convert the continuous values into discrete values is obtained from an expert in the domain area. Example: refer to the UCI machine learning data banks.

Data Discretization

Data Discretization Table 5: The invariance features for mathematical symbols. [The table lists nine invariance features (h02, h03, h11, h12, h13, h21, h22, h30, h31) for each symbol, with two feature rows per symbol; the symbol glyphs did not survive transcription, so the numeric rows are omitted here.]

Data Discretization Table 6: Discretization of the mathematical symbols. [The table maps each orientation's feature values (h02 through h31) to discretized results; its contents did not survive transcription.]

Summary Data preparation is a big issue for both warehousing and mining. Data preparation includes data cleaning and data integration, data reduction and feature selection, and discretization. Many methods have been developed, but data preparation remains an active area of research.