Data Preprocessing
Compiled by: Umair Yaqub, Lecturer, Govt. Murray College, Sialkot

2 Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- Data may not be normalized
- Data may be huge
- No quality data, no quality mining results! Quality decisions must be based on quality data.

3 Why Is Data Dirty?
- Incomplete data may come from:
  - attributes of interest not being available, e.g. customer information for sales transaction data
  - certain data not being considered important at the time of entry
  - equipment malfunction
  - data not entered due to misunderstanding
  - data inconsistent with other recorded data and thus deleted
  - failure to register history or changes of the data

4 Why Is Data Dirty? (contd…)
- Noisy data (incorrect values) may come from:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Inconsistent data may come from:
  - different data sources
- Duplicate records also need data cleaning

5 Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration
  - Integration of multiple databases, data cubes, or files.
  - Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies.
  - Having a large amount of redundant data may slow down or confuse the knowledge discovery process.

6 Major Tasks in Data Preprocessing…
- Data transformation
  - Distance-based mining algorithms provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0].
  - Aggregation is another transformation: it may be useful for your analysis to obtain aggregate information such as the sales per customer region, something that is not part of any pre-computed data cube in your data warehouse.
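As a concrete illustration (not part of the original slides), here is a minimal Python sketch of min-max normalization to [0.0, 1.0]; the function name and the sample income values are made up for this example.

```python
# Illustrative sketch: min-max normalization of a numeric attribute
# to the range [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale each v to new_min + (v - old_min) / (old_max - old_min) * (new_max - new_min)."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                 # constant attribute: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12000, 73600, 98000, 54000]     # hypothetical attribute values
print(min_max_normalize(incomes))          # every result now lies in [0.0, 1.0]
```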

7 Major Tasks in Data Preprocessing…
- Data reduction
  - Obtains a reduced representation of the data that is much smaller in volume yet produces the same or similar analytical results.
  - Strategies include data aggregation (e.g., building a data cube), attribute subset selection (e.g., removing irrelevant attributes through correlation analysis), dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets), and numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models).
  - Data can also be "reduced" by generalization with the use of concept hierarchies, where low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as region or province or state.
- Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data.
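To make the discretization idea concrete, here is a minimal Python sketch (not from the slides) of equal-width binning; the function name and the sample ages are hypothetical.

```python
# Illustrative sketch: equal-width discretization of a numeric attribute
# into k buckets, one simple form of data discretization.
def equal_width_bins(values, k=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:                          # constant attribute: everything in one bin
        return ["bin0" for _ in values]
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        labels.append("bin" + str(idx))
    return labels

ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 52, 70]
print(equal_width_bins(ages, k=3))
```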

8 Forms of Data Preprocessing

9 Descriptive Data Summarization
- Motivation: to better understand the data, get an overall picture of it, and identify its typical properties.

10 Measuring the Central Tendency
- Mean
  - An algebraic measure: it can be computed by applying an algebraic function to one or more distributive measures, here sum() / count().
  - Weighted arithmetic mean: each value is given a weight reflecting its importance.
  - The mean is sensitive to extreme (outlier) values.
  - Trimmed mean: the mean obtained after chopping off the extreme values at the high and low ends.
- Median
  - A better measure of center for skewed data.
  - A holistic measure: it can only be computed on the entire data set, not merged from partial results.
  - The middle value if there is an odd number of values, or the average of the middle two values otherwise.
  - Can be estimated by interpolation for grouped data.
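A brief Python sketch (not from the slides) of the measures named above, using the standard library; the data values and weights are hypothetical.

```python
# Illustrative sketch: mean, weighted mean, trimmed mean, and median.
from statistics import mean, median

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data))                       # arithmetic mean
print(median(data))                     # average of the two middle values here

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)
weights = [1] * (len(data) - 1) + [5]   # hypothetical weights for the example
print(sum(w * x for w, x in zip(weights, data)) / sum(weights))

# Trimmed mean: drop a fraction p of the values at each extreme, then average the rest.
def trimmed_mean(values, p=0.1):
    values = sorted(values)
    k = int(len(values) * p)
    return mean(values[k:len(values) - k])

print(trimmed_mean(data, p=0.1))        # less sensitive to the extreme value 110
```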

11 Measuring the Central Tendency (contd…)
- Mode
  - The value that occurs most frequently in the data.
  - A data set may be unimodal, bimodal, or trimodal (one, two, or three modes).
  - Empirical formula for unimodal frequency curves that are moderately skewed: mean - mode ≈ 3 × (mean - median).
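A tiny Python sketch (not from the slides) showing how the mode(s) can be found with the standard library (Python 3.8 or later); the data values are hypothetical.

```python
# Illustrative sketch: finding the mode(s) of a data set.
from statistics import multimode

print(multimode([1, 1, 2, 2, 3]))   # [1, 2]: the data set is bimodal
print(multimode([1, 1, 2, 3]))      # [1]:    the data set is unimodal
```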

12 Measuring the Dispersion of Data
- The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
- The kth percentile of a set of data in numerical order is the value x_i having the property that k percent of the data entries lie at or below x_i.
- The median (discussed in the previous subsection) is the 50th percentile.
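For illustration (not from the slides), percentiles can be computed with NumPy, assuming it is available; the data values are hypothetical.

```python
# Illustrative sketch: the kth percentile of a data set.
import numpy as np

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(np.percentile(data, 25))   # 25th percentile (first quartile, Q1)
print(np.percentile(data, 50))   # 50th percentile, which is the median
print(np.percentile(data, 90))   # 90 percent of the entries lie at or below this value
```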

13 Measuring the Dispersion of Data
- Range, quartiles, outliers, and boxplots
  - Range: the difference between the largest and smallest values.
  - Quartiles: Q1 (the 25th percentile) and Q3 (the 75th percentile).
  - Inter-quartile range: IQR = Q3 - Q1.
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1.
  - Five-number summary: min, Q1, median, Q3, max.
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend to the smallest and largest non-outlier values, and outliers are plotted individually.
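A small Python sketch (not from the slides) of the five-number summary and the 1.5 × IQR outlier rule, assuming NumPy is available; the data values are hypothetical.

```python
# Illustrative sketch: five-number summary and IQR-based outlier detection.
import numpy as np

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)

lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)   # only 110 falls outside the fences for this sample
```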

14 Measuring the Dispersion of Data (contd…)
- Variance and standard deviation
  - Variance (an algebraic, scalable computation): σ² = (1/N) Σ (x_i - μ)² = (1/N) Σ x_i² - μ², where μ is the mean of the N values.
  - The standard deviation σ is the square root of the variance σ².

15 Measuring the Dispersion of Data (contd…)
- The basic properties of the standard deviation, σ, as a measure of spread are:
  - σ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
  - σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise σ > 0.
- The variance and standard deviation are algebraic measures because they can be computed from distributive measures: N (count() in SQL), Σ x_i (the sum() of the x_i), and Σ x_i² (the sum() of the x_i²) can each be computed in any partition and then merged to feed into the algebraic equation.
- Thus the computation of the variance and standard deviation is scalable to large databases.
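As an illustration of that scalability argument (not from the slides), the following Python sketch computes the variance by merging the distributive measures N, Σ x_i, and Σ x_i² across partitions; the partition contents are hypothetical.

```python
# Illustrative sketch: variance from distributive measures merged across partitions.
def partial_measures(partition):
    """Return (count, sum, sum of squares) for one partition of the data."""
    return (len(partition), sum(partition), sum(x * x for x in partition))

def merged_variance(partitions):
    n = s = s2 = 0
    for part in partitions:
        pn, ps, ps2 = partial_measures(part)
        n, s, s2 = n + pn, s + ps, s2 + ps2   # merge the distributive measures
    mu = s / n
    return s2 / n - mu * mu                   # sigma^2 = (1/N) * sum(x^2) - mu^2

# The merged result equals a single-pass computation over all the data.
partitions = [[30, 36, 47, 50], [52, 52, 56, 60], [63, 70, 70, 110]]
print(merged_variance(partitions))
```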

16 Histogram Analysis
- Graphical display of basic statistical class descriptions.
- Frequency histograms are a univariate graphical method: a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
- A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform. Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket.
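For illustration (not from the slides), a frequency histogram with uniform-width buckets can be drawn with matplotlib, assuming it is available; the price values are hypothetical.

```python
# Illustrative sketch: frequency histogram with equal-width buckets.
import matplotlib.pyplot as plt

prices = [40, 43, 47, 52, 55, 55, 56, 60, 63, 65, 70, 74, 75, 78, 80, 85, 88, 90, 93, 97]

plt.hist(prices, bins=5, edgecolor="black")   # 5 uniform-width buckets
plt.xlabel("price")
plt.ylabel("count")
plt.title("Frequency histogram")
plt.show()
```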

17 Histogram Analysis…

18 Quantile Plot
- Displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences.
- Plots quantile information: for data x_i sorted in increasing order, f_i indicates that approximately 100 × f_i percent of the data are below or equal to the value x_i.
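A minimal quantile-plot sketch in Python (not from the slides), plotting each sorted value x_i against f_i = (i - 0.5) / N; matplotlib is assumed to be available and the data are hypothetical.

```python
# Illustrative sketch: quantile plot of a single attribute.
import matplotlib.pyplot as plt

prices = [40, 43, 47, 52, 55, 55, 56, 60, 63, 65, 70, 74, 75, 78, 80, 85, 88, 90, 93, 97]
x = sorted(prices)
n = len(x)
f = [(i - 0.5) / n for i in range(1, n + 1)]   # roughly the fraction of data <= x_i

plt.plot(f, x, marker="o", linestyle="none")
plt.xlabel("f-value")
plt.ylabel("price")
plt.title("Quantile plot")
plt.show()
```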

19 Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
- Allows the user to view whether there is a shift in going from one distribution to another.
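A small Python sketch (not from the slides) of a Q-Q plot comparing two hypothetical samples at equally spaced quantile levels; NumPy and matplotlib are assumed to be available.

```python
# Illustrative sketch: Q-Q plot of two samples (e.g. unit prices at two branches).
import numpy as np
import matplotlib.pyplot as plt

branch1 = [40, 43, 47, 52, 55, 55, 56, 60, 63, 65, 70, 74]
branch2 = [38, 41, 45, 50, 52, 53, 57, 58, 61, 64, 68, 71]

levels = np.linspace(0, 100, 25)            # quantile levels in percent
q1 = np.percentile(branch1, levels)
q2 = np.percentile(branch2, levels)

plt.plot(q1, q2, marker="o", linestyle="none")
plt.plot(q1, q1)                            # y = x reference line: no shift between distributions
plt.xlabel("branch 1 quantiles")
plt.ylabel("branch 2 quantiles")
plt.title("Q-Q plot")
plt.show()
```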

20 Scatter Plot
- Provides a first look at bivariate data, for example to see clusters of points or outliers.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
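A minimal scatter-plot sketch in Python (not from the slides); the (items sold, price) pairs are hypothetical and matplotlib is assumed to be available.

```python
# Illustrative sketch: scatter plot of bivariate data.
import matplotlib.pyplot as plt

items_sold = [6, 8, 9, 11, 13, 15, 17, 19, 21, 24]
price = [80, 74, 70, 65, 60, 55, 52, 48, 44, 40]   # hypothetical paired values

plt.scatter(items_sold, price)                     # each (x, y) pair becomes one point
plt.xlabel("items sold")
plt.ylabel("unit price")
plt.title("Scatter plot")
plt.show()
```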

21 Positively and Negatively Correlated Data

22 Not Correlated Data

23 Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence.
- A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.
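A small Python sketch (not from the slides) that overlays a lowess curve on a scatter plot, assuming the statsmodels package is available; the data values and the frac setting are hypothetical.

```python
# Illustrative sketch: scatter plot with a loess/lowess smoothing curve.
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

items_sold = [6, 8, 9, 11, 13, 15, 17, 19, 21, 24]
price = [80, 74, 72, 65, 62, 55, 54, 48, 44, 40]

smoothed = lowess(price, items_sold, frac=0.6)   # frac plays the role of the smoothing parameter

plt.scatter(items_sold, price)
plt.plot(smoothed[:, 0], smoothed[:, 1])         # the fitted smooth curve
plt.xlabel("items sold")
plt.ylabel("unit price")
plt.title("Scatter plot with loess curve")
plt.show()
```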