Exploratory Data Analysis

Presentation transcript:

Exploratory Data Analysis | Chapter 3 | Exploratory Data Analysis: 3.1 Graphical Displays of Data; 3.2 Measures of Central Tendency; 3.3 Measures of Dispersion

3.1 Graphical Displays of Data Most of the statistical information in newspapers, magazines, company reports and other publications consists of data that are summarized and presented in a form that is easy for the reader to understand.

3.1 Graphical Displays of Data Presentation of Qualitative Data A graphic display can reveal at a glance the main characteristics of a data set. The choice of display depends on the nature of the data: quantitative (e.g. income and CGPA) or qualitative (e.g. gender and ethnic group). Three types of graphs are used to display qualitative data: bar graph (column chart), pie chart, and line chart.

3.1 Graphical Displays of Data Presentation of Qualitative Data

3.1 Graphical Displays of Data Bar Chart A bar chart displays a frequency distribution in graphical form. It has two orthogonal axes: one represents the observations and the other represents their frequencies. The frequency of each observation is represented by a bar.

3.1 Graphical Displays of Data Pie Chart A pie chart displays a frequency distribution as proportions of the whole. It is a circle divided into sectors. Each sector represents an observation, and the area of the sector is proportional to the frequency of that observation.

3.1 Graphical Displays of Data Line Chart A line chart displays the trend of observations. It has two orthogonal axes: one represents the observations and the other represents their frequencies. The frequencies of the observations are joined by lines. Example: The table below shows the number of sandpipers recorded from January 1989 to December 1989.

3.1 Graphical Displays of Data Presentation of Quantitative Data There are a few graphs available for the graphical presentation of quantitative data: frequency polygon, histogram, ogive, and boxplot (the boxplot will be our focus in this chapter).

3.1 Graphical Displays of Data Presentation of Quantitative Data Histogram A histogram looks like a bar chart, except that the horizontal axis represents quantitative data. There are no gaps between the bars.

3.1 Graphical Displays of Data Presentation of Quantitative Data Frequency Polygon A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks of quantitative data.

3.1 Graphical Displays of Data Presentation of Quantitative Data Ogive An ogive is a line graph in which the horizontal axis represents the upper limits of the class intervals and the vertical axis represents the cumulative frequencies.

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum.

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot Quartiles divide a data set into fourths, or four equal parts.

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot How to obtain quartiles? Q2: the median. Q1: the median of the values between the lowest value and Q2. Q3: the median of the values between Q2 and the largest value. Examples: Odd set of numbers: (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27), so Q1 = 5, Q2 = 9, Q3 = 18. Even set of numbers: (3, 5, 7, 8, 9), (11, 15, 16, 20, 21), so Q1 = 7, Q2 = (9 + 11)/2 = 10, Q3 = 16.
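The quartile rule on this slide can be sketched in Python as follows. This is a minimal illustration that assumes the median-of-halves reading of the slide, where the median itself is excluded from both halves when the number of observations is odd. The function name `quartiles` is hypothetical, and other textbooks and software use slightly different quartile conventions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    Assumes at least two observations.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                 # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]   # values above the median position
    return median(lower), median(data), median(upper)

# The two data sets from the slide.
odd_set = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
even_set = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

print(quartiles(odd_set))   # (5, 9, 18)
print(quartiles(even_set))  # (7, 10.0, 16)
```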

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot Outlier Outliers are extreme observations. They can occur because of errors in measuring a variable, errors during data entry, or errors in sampling.

3.1 Graphical Displays of Data Presentation of Quantitative Data Boxplot Outlier Checking for outliers by using quartiles. Step 1: Determine the first and third quartiles of the data. Step 2: Compute the interquartile range (IQR = Q3 - Q1). Step 3: Determine the fences. Fences serve as cut-off points for determining outliers. Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
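The four steps can be sketched as below. The slide does not say how the fences are computed, so this sketch assumes the common convention of Q1 - 1.5 × IQR for the lower fence and Q3 + 1.5 × IQR for the upper fence. The function name `find_outliers`, the parameter `k`, and the sample data are all hypothetical.

```python
from statistics import median

def find_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers).

    Assumption: fences are Q1 - k*IQR and Q3 + k*IQR with k = 1.5,
    a common convention that the slide itself does not state.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Step 1: first and third quartiles (median-of-halves rule).
    q1 = median(data[:half])
    q3 = median(data[half + 1:] if n % 2 else data[half:])
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: fences.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Step 4: values outside the fences are flagged as outliers.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical sample with one suspiciously large value.
sample = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 60]
print(find_outliers(sample))  # (-14.5, 37.5, [60])
```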

How to: Boxplot Interpretation | Chapter 3.1

The Basics A boxplot splits the data set into quartiles. It shows a minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3), and a maximum value. Outliers are plotted separately as points on the chart.

Interpreting Boxplot Things that can be described from a boxplot: the five-number summary, the range, the IQR, and the shape of the data. With more than one boxplot: compare their shapes and positions.

Interpreting Boxplot The five-number summary: Minimum = -25; First Quartile = 300; Second Quartile (Median) = 400; Third Quartile = 600; Maximum = 1000.

Interpreting Boxplot Range In the boxplot above, data values ranged from about -700 (the smallest outlier) to 1700 (the largest outlier), so the range is 2400. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers - about 1000 in the boxplot above.

Interpreting Boxplot Interquartile Range (IQR) In the boxplot above, the range between the quartiles is 600 - 300 = 300. Based on Q1, we know that 25% of the data have a value below 300 (equivalently, 75% of the data have a value above 300). Based on Q2, we know that half of the data have a value less than 400. Based on Q3, we know that 75% of the data have a value below 600 (equivalently, 25% of the data have a value above 600).
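The arithmetic behind the range and IQR readings can be checked with a short sketch. The five-number summary and the two outlier values are the ones quoted on the preceding slides; the variable names are hypothetical.

```python
# Five-number summary read off the boxplot (values quoted on the slides).
summary = {"min": -25, "q1": 300, "median": 400, "q3": 600, "max": 1000}

# Approximate outlier extremes quoted on the Range slide.
smallest_outlier, largest_outlier = -700, 1700

full_range = largest_outlier - smallest_outlier   # range including outliers
whisker_range = summary["max"] - summary["min"]   # range ignoring outliers
iqr = summary["q3"] - summary["q1"]               # spread of the middle 50%

print(full_range)     # 2400
print(whisker_range)  # 1025, which the slide rounds to "about 1000"
print(iqr)            # 300
```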

Interpreting Boxplot Shape of the data Boxplots often provide information about the shape of a data set. The examples below show some common patterns

Interpreting Boxplot Shape of the data For our case, the boxplot is skewed to the right

Interpreting More Than One Boxplot The second boxplot is comparatively short. This suggests that the data in the second boxplot have a small variance (most of the data have similar values).

Interpreting More Than One Boxplot The first and third boxplots are comparatively tall. This suggests that the variance for these boxplots is high (most of the data do not have similar values).

Interpreting More Than One Boxplot The third boxplot is much higher than the fourth boxplot. This could suggest a difference in values between the groups. As can be seen, almost 75% of the data in the third boxplot have higher values than the data in the fourth boxplot.

Interpreting More Than One Boxplot There are obvious differences in variance between the first and second boxplots, and between the second and third boxplots.

Interpreting More Than One Boxplot Same median, different distribution. Look at the first, second and third boxplots. Their medians are all at the same place. We know that for each of the three boxplots, half of the data fall below Q2, which is 287.5. However, they show differences in variance.

Exercise Describe each boxplot. Compare the boxplots: what can you say?

3.2 Measures of Central Tendency A measure of central tendency is a summary statistic used to summarize a set of observations. The common measures of central tendency are the mean, the median, and the mode.

3.2 Measures of Central Tendency Mean The mean of a sample is the sum of the measurements divided by the number of measurements in the set. The sample mean is denoted by $\bar{x}$ and defined by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of measurements.
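The definition of the sample mean can be sketched directly in Python. The function name `sample_mean` is hypothetical, and the data set is borrowed from the median example later in this transcript rather than from this slide, whose own data did not survive in the transcript.

```python
def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)

# Assumption: data borrowed from the median example in this transcript.
observations = [4, 6, 3, 1, 2, 5, 7, 3]
print(sample_mean(observations))  # 3.875
```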

3.2 Measures of Central Tendency Example The mean for this case is

3.2 Measures of Central Tendency Median The median is the middle value of a set of observations arranged in order of magnitude and is normally denoted by $\tilde{x}$. The median depends on the number of observations in the data, $n$. If $n$ is odd, then the median is the $\frac{n+1}{2}$th observation of the ordered observations. If $n$ is even, then the median is the arithmetic mean of the $\frac{n}{2}$th observation and the $\left(\frac{n}{2}+1\right)$th observation.

3.2 Measures of Central Tendency Example The median of the data (4, 6, 3, 1, 2, 5, 7, 3) is 3.5. Rearranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is, (3 + 4)/2 = 3.5.
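The odd/even rule for the median can be sketched as below, using the data from the example. The function name `sample_median` is hypothetical, and Python's 0-based indexing shifts the positions by one relative to the 1-based positions used on the slide.

```python
def sample_median(values):
    """Median by the odd/even rule on the ordered observations."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation, which is index mid (0-based).
        return data[mid]
    # Even n: mean of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (data[mid - 1] + data[mid]) / 2

print(sample_median([4, 6, 3, 1, 2, 5, 7, 3]))  # 3.5
```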

3.2 Measures of Central Tendency Mode The mode of a set of observations is the observation with the highest frequency and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data. The mode has the advantage that it is easy to calculate and is not affected by extreme values. However, a mode may not exist, and even if it does exist, it may not be unique.

3.2 Measures of Central Tendency Mode If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data are taken to have no mode. Example: The mode for the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.
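The mode rule on this slide can be sketched with a frequency count. The function name `modes` is hypothetical, and the "no mode" outcome for more than two tied values follows this slide's convention, which is not universal.

```python
from collections import Counter

def modes(values):
    """Return the list of modes following the slide's convention.

    One value with the highest frequency -> [mode]
    Two values tied for highest          -> both (bimodal)
    More than two tied                   -> [] (treated as no mode)
    """
    counts = Counter(values)
    top = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == top)
    if len(winners) > 2:
        return []
    return winners

print(modes([4, 6, 3, 1, 2, 5, 7, 3]))  # [3]
print(modes([1, 1, 2, 2, 3]))           # [1, 2]
print(modes([1, 2, 3]))                 # []
```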


3.3 Measures of Dispersion A measure of dispersion, or spread, is the degree to which a set of data tends to spread around the average value. It shows whether the data set is concentrated around the mean or scattered. The common measures of dispersion are the variance and the standard deviation. The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.

3.3 Measures of Dispersion Range The range is the simplest measure of dispersion to calculate. Range = Largest value – Smallest value. Example: Range = 267,277 – 49,651 = 217,626 square miles.

3.3 Measures of Dispersion Variance The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Example (using the previous data from the Range example).

3.3 Measures of Dispersion Standard deviation It is simply the square root of the variance: $s = \sqrt{s^2}$. Example
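The sample variance and standard deviation can be sketched as below. This assumes the usual sample definition with divisor n - 1, and it uses a hypothetical data set because the data for the slide's own example did not survive in the transcript. The function names are also hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical data: mean is 5, squared deviations sum to 32, so s^2 = 32 / 7.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))
print(sample_std(data))
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor and can serve as a cross-check.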

How to interpret standard deviation

[Figure: dogs grouped by height: standard height, too high, too short]