Data Mining: Concepts and Techniques

Presentation transcript:

Data Mining: Concepts and Techniques, Chapter 2
Dr. Maher Abuhamdeh
Statistical
June 8, 2018

Mining Data Descriptive Characteristics
* Motivation
  - To better understand the data: central tendency, variation and spread
* Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
* Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
* Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Mean
Consider a sample of 6 values: 34, 43, 81, 106, 106 and 115.
To compute the mean, add the values and divide by 6:
(34 + 43 + 81 + 106 + 106 + 115) / 6 = 80.83
The population mean is the average of the entire population and is usually hard to compute. We use the Greek letter μ for the population mean.

Mode
The mode of a set of data is the value with the highest frequency. In the sample 34, 43, 81, 106, 106, 115 the mode is 106, since it occurs twice and the other values occur only once.

Median
A problem with the mean is that a single outcome very far from the rest of the data can distort it. The median is the middle score of the sorted data. If there is an even number of values, we take the average of the two middle values.
Assume a sample of 10 house prices. In units of $100,000, the prices are:
2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
mean = 74.4 / 10 = 7.44, that is $744,000. It does not reflect prices in the area. The value 40.8 x $100,000 = $4.08 million skews the data; it is an outlier.
median = (3.7 + 4.1) / 2 = 3.9, that is $390,000. This is a better representative of the data.
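The three measures of central tendency above can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
import statistics

# Sample of 6 values from the mean/mode slides.
sample = [34, 43, 81, 106, 106, 115]
sample_mean = statistics.mean(sample)   # (34 + 43 + ... + 115) / 6
sample_mode = statistics.mode(sample)   # value with the highest frequency

# House prices in units of $100,000 from the median slide.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]
price_mean = statistics.mean(prices)      # pulled upward by the 40.8 outlier
price_median = statistics.median(prices)  # average of the two middle values

print(round(sample_mean, 2), sample_mode)
print(round(price_mean, 2), round(price_median, 2))
```

The mean of the house prices comes out at 7.44 (about $744,000), while the median stays at 3.9 ($390,000), which illustrates how a single outlier moves the mean but not the median.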

Variance and Standard Deviation
Variance of a sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation of a sample: $s = \sqrt{s^2}$

Example
44, 50, 38, 96, 42, 47, 40, 39, 46, 50
mean = x̄ = 49.2
* Calculate the mean, x̄.
* Write a table that subtracts the mean from each observed value.
* Square each of the differences.
* Add this column.
* Divide by n - 1, where n is the number of items in the sample. This is the variance.
* To get the standard deviation, take the square root of the variance.

Example Cont.
    x     x - 49.2    (x - 49.2)²
   44       -5.2         27.04
   50        0.8          0.64
   38      -11.2        125.44
   96       46.8       2190.24
   42       -7.2         51.84
   47       -2.2          4.84
   40       -9.2         84.64
   39      -10.2        104.04
   46       -3.2         10.24
   50        0.8          0.64
  Total                2599.60
Variance = 2599.6 / (10 - 1) = 288.8
Standard deviation = square root of 288.8 ≈ 17 = s
This means that most of the numbers probably fall between $32.20 and $66.20, that is, within one standard deviation of the mean.
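The step-by-step calculation can be reproduced with the following sketch. It follows the listed steps by hand and then cross-checks against `statistics.variance`, which also divides by n - 1; the variable names are illustrative.

```python
import math
import statistics

data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
n = len(data)

mean = sum(data) / n                               # step 1: the mean
squared_devs = [(x - mean) ** 2 for x in data]     # steps 2-3: squared differences
total = sum(squared_devs)                          # step 4: add the column
sample_variance = total / (n - 1)                  # step 5: divide by n - 1
sample_std = math.sqrt(sample_variance)            # step 6: square root

# Cross-check with the standard library (sample variance, n - 1 denominator).
library_variance = statistics.variance(data)

print(round(mean, 1), round(total, 1), round(sample_variance, 1), round(sample_std, 1))
```

With all ten values included, the squared differences sum to 2599.6, giving a sample variance of about 288.8 and a standard deviation of about 17.0.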

Properties of Normal Distribution Curve
The normal (distribution) curve (μ: mean, σ: standard deviation):
* From μ–σ to μ+σ: contains about 68% of the measurements
* From μ–2σ to μ+2σ: contains about 95% of the measurements
* From μ–3σ to μ+3σ: contains about 99.7% of the measurements
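The three percentages can be verified numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `coverage` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of measurements between mu - k*sigma and mu + k*sigma."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

Because the rule depends only on the distance from the mean in units of σ, the same fractions hold for any normal curve, not just the standard one.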

Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed and negatively skewed data
(Figure: distribution curves for symmetric, negatively skewed and positively skewed data.)

Measuring the Dispersion of Data
* Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 – Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
* Variance and standard deviation (sample: s, population: σ)
  - Variance: (algebraic, scalable computation)
  - Standard deviation s (or σ) is the square root of variance s² (or σ²)

Boxplot Analysis
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
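The five-number summary and the 1.5 x IQR outlier rule can be sketched as follows, reusing the ten values from the variance example. Quartile conventions differ between textbooks and libraries; this sketch assumes the "median of each half" convention, which is one common choice and not necessarily the one the slides intend. The function name is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]            # values below the median position
    upper = s[half + n % 2:]    # values above it (middle value skipped for odd n)
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Fences: anything beyond 1.5 x IQR from the quartiles is flagged as an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print((minimum, q1, median, q3, maximum), iqr, outliers)
```

Under this convention Q1 = 40 and Q3 = 50, so the fences are 25 and 65 and only the value 96 is flagged.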

Relation between Mean and Standard Deviation
The heights of five students (in cm) are: 200, 147, 173, 185, 160.
The mean equals 173.
Calculate the difference between each height and the mean: 27, -26, 0, 12, -13.
Calculate the variance, which equals 343.60 (the sum of the squared differences, 1718, divided by the 5 values).
Calculate the standard deviation, which equals about 18.54.
* The first student (200 cm) is unusually tall.
* The second student (147 cm) is short.
* The others are considered to be of normal height, since they lie within one standard deviation of the mean.
* If the values lie close to the mean, the standard deviation is small and the mean describes the data more accurately (homogeneity).
* If the values lie far from the mean, the standard deviation is large and the mean describes the data less accurately (non-homogeneity).
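The height example can be reproduced with the sketch below. Note that the figure 343.60 corresponds to dividing by the number of values (population variance, `statistics.pvariance`), not by n - 1 as in the earlier sample-variance example. Flagging values that lie more than one standard deviation from the mean is an assumed reading of "unusually tall" and "short", chosen here for illustration.

```python
import statistics

heights = [200, 147, 173, 185, 160]   # in cm

mean = statistics.mean(heights)
differences = [h - mean for h in heights]
pop_variance = statistics.pvariance(heights)   # divides by N = 5
pop_std = statistics.pstdev(heights)

# Assumed rule: a height is "unusual" if it is more than one
# standard deviation away from the mean.
unusual = [h for h in heights if abs(h - mean) > pop_std]
normal = [h for h in heights if abs(h - mean) <= pop_std]

print(mean, differences, pop_variance, round(pop_std, 2))
print(unusual, normal)
```

Using the sample formula instead would give 1718 / 4 = 429.5, so it matters which denominator a reported variance is based on.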

How to Handle Noisy Data?
1. Binning
* First sort the data and partition it into (equal-frequency) bins.
* Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

Simple Discretization Methods: Binning
* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A) / N.
  - The most straightforward method, but outliers may dominate the presentation
  - Skewed data is not handled well
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
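The difference between the two partitioning schemes can be seen in a small sketch. It uses the sorted price values from the binning example in this deck; the function names are illustrative, and the sketch assumes the attribute has at least two distinct values (otherwise the width W would be zero).

```python
def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land in bin N, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (about) equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = []
    for i in range(n_bins):
        start = i * size
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)  # last bin takes any remainder
        bins.append(ordered[start:end])
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
print(width_bins)
print(depth_bins)
```

With three bins, equal-width partitioning puts six of the twelve values into the last interval, while equal-depth partitioning keeps four values in every bin.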

Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
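Both smoothing methods can be sketched in a few lines. The bin means are rounded to whole dollars to match the values listed above (the exact means of bins 2 and 3 are 22.75 and 29.25). Sending ties to the lower boundary is an assumption of this sketch; no tie occurs in this data.

```python
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]   # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_boundaries)
```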

How to Handle Noisy Data?
2. Regression
* Smooth by fitting the data to regression functions.
* Regression is a technique that conforms data values to a function.
* Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
(Figure: plot of Y against X.)

Regression
Error of prediction: to get the best-fitting line, we need to minimize the sum of the squared errors of prediction.
(Figure: the line y = x + 1, with an observed value Y1 and its predicted value Y1' at X1.)
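Minimizing the sum of squared prediction errors for a straight line has a closed-form solution, sketched below. The data points are hypothetical values scattered around the line y = x + 1 and are not taken from the slides; the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy observations around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
sse_fit = sum_squared_error(xs, ys, slope, intercept)
sse_reference = sum_squared_error(xs, ys, 1.0, 1.0)   # the line y = x + 1

# Smoothing replaces each noisy y by its predicted value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2), round(sse_fit, 3), round(sse_reference, 3))
```

The fitted line has a slightly smaller squared error than y = x + 1 on these points, which is what "best-fitting" means here: no other straight line gives a lower sum.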

How to Handle Noisy Data?
3. Clustering
* Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
* Intuitively, values that fall outside of the set of clusters may be considered outliers, which we then need to remove.
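The idea can be illustrated with a deliberately simple one-dimensional stand-in for a clustering algorithm: sorted values are grouped together as long as neighbouring values are close, and values that end up alone are treated as outliers. The data, the gap threshold and the minimum cluster size are all hypothetical choices for this sketch; the slides do not prescribe a particular clustering method.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

values = [21, 22, 24, 25, 60, 98, 99, 101]   # hypothetical attribute values
clusters = gap_clusters(values, max_gap=5)

# Assumed rule: a cluster with fewer than 2 members counts as "outside the clusters".
outliers = [v for c in clusters if len(c) < 2 for v in c]
cleaned = [v for v in values if v not in outliers]

print(clusters, outliers, cleaned)
```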

Cluster Analysis

Normalization
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
Normalization: the attribute data are scaled so as to fall within a small specified range, such as [-1.0, 1.0] or [0.0, 1.0].
We study three methods for normalization:
* Min-max normalization
* Z-score normalization
* Decimal scaling

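Data Transformation: Normalization
* Min-max normalization to [new_min_A, new_max_A]:
  $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
  - Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
* Z-score normalization (μ: mean, σ: standard deviation):
  $v' = \frac{v - \mu}{\sigma}$
  - Ex. Let μ = 54,000, σ = 16,000. Then, for the same income value $73,600, $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
* Normalization by decimal scaling:
  $v' = \frac{v}{10^j}$, where j is the smallest integer such that Max(|v'|) < 1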
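Min-max and z-score normalization can be sketched as two small functions. The income figures ($12,000 to $98,000, μ = 54,000, σ = 16,000) come from the slide; the value $73,600 is the min-max example value and is reused here for the z-score as an assumption. Function and parameter names are illustrative.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

income = 73_600
min_max_value = min_max_normalize(income, 12_000, 98_000)
z_value = z_score_normalize(income, 54_000, 16_000)

print(round(min_max_value, 3), round(z_value, 3))
```

The same income therefore becomes about 0.716 on the [0.0, 1.0] scale and 1.225 as a z-score.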

Normalization by decimal scaling
Example: Suppose the values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling we divide each value by 1000 (j = 3), so that -986 normalizes to -0.986.
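Finding j can be done with a short loop, as in the sketch below. The function name is illustrative, and only the two extreme values of A from the example are used as input.

```python
def decimal_scale(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scale([-986, 917])
print(j, scaled)
```

Note that the loop keeps going when the maximum absolute value is an exact power of ten (for example 1000 needs j = 4), because the condition requires the scaled maximum to be strictly less than 1.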

Remarks on the three normalization methods
* Min-max normalization problem: an out-of-bound error occurs if a future input case for normalization falls outside the original data range.
* Z-score normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
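The out-of-bound problem can be made concrete with a small sketch. It reuses the income range $12,000 to $98,000 from the min-max example; the future input of $110,000 and the mean and standard deviation passed to the z-score are hypothetical values chosen for illustration.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

future_income = 110_000   # hypothetical value outside the original range

out_of_range = min_max_normalize(future_income, 12_000, 98_000)
z_value = z_score_normalize(future_income, 54_000, 16_000)

print(round(out_of_range, 3), round(z_value, 3))
```

The min-max result is greater than 1.0, outside the target range [0.0, 1.0], whereas the z-score has no fixed range to violate and simply reports a larger distance from the mean.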