Exploratory Data Analysis
The goal of data analysis is to gain information from the data. Exploratory data analysis is a set of methods to display and summarize the data. For data on just one variable, the distribution of the observations is analyzed by:
I. Displaying the data in a graph that shows overall patterns and unusual observations (bar chart, histogram, density curve).
II. Computing descriptive statistics that summarize specific aspects of the data (center and spread).

Review of Histograms
A histogram represents percent by area. The height of each block represents the frequency or percentage of the observations falling in the interval.
The total area under a histogram is ______ if the height is in frequencies.
The total area under a histogram is ______ if the height is in percentages.
There is no fixed choice for the number of classes in a histogram. If the class intervals are too small, the histogram will have spikes; if the class intervals are too large, some information will be missed. Use your judgment! Typically statistical software will choose the class intervals for you, but you can modify them.

Center and Spread

Measuring Centers
The most common measures of center are the mean (or average) and the median.
1. The Mean or Average
To calculate the average of a set of observations, add their values and divide by the number of observations.
Data: number of home runs hit per year by Babe Ruth as a Yankee:
54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
The mean number of home runs hit in a year is:
$\bar{x} = \frac{54 + 59 + \dots + 22}{15} = \frac{659}{15} \approx 43.9$
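The mean calculation described above can be sketched in a few lines of Python. The list is the Babe Ruth data from the slide; the helper name `mean_of` is only illustrative.

```python
# Babe Ruth's home runs per year as a Yankee (data from the slide).
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

def mean_of(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

ruth_mean = mean_of(ruth)
print(f"n = {len(ruth)}, sum = {sum(ruth)}, mean = {ruth_mean:.1f}")
```

The standard library's `statistics.mean` gives the same result; the explicit version is shown only to mirror the "add, then divide" wording.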

2. The Median
The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median:
1. Sort all the observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list, i.e. the observation in position (n+1)/2.
3. If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
Example 1: Ordered list of home runs hit by Babe Ruth: n = 15, so the median is the 8th observation, M = 46.
Example 2: Ordered list of home runs hit by Roger Maris: n = 10, so the median is the mean of the two center observations, M = (23 + 26)/2 = 24.5.
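As a minimal sketch, the three-step recipe for the median translates directly into code. The Babe Ruth list is the data from the slides; the short even-length list is a made-up example, since the Maris data are not reproduced in this transcript.

```python
def median_of(values):
    """Median by the sort-and-pick-the-middle recipe."""
    ordered = sorted(values)              # step 1: sort smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # step 2: n odd -> center observation, position (n+1)/2 (1-based)
        return ordered[mid]
    # step 3: n even -> mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
toy_even = [9, 2, 7, 4]                   # hypothetical even-length data

print(median_of(ruth))      # 8th of 15 ordered values
print(median_of(toy_even))  # mean of 4 and 7
```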

Mean versus Median
1. The mean and median of a symmetric distribution are close together.
2. In skewed distributions, the mean is farther out in the long tail than the median is. The mean is more sensitive to extreme values.
[Figure: a symmetric distribution, a right-skewed distribution, and a left-skewed distribution, each with the mean and median marked.]

Mean or Median?
The mean is a good measure of the center of a symmetric distribution.
The median is a resistant measure and should be used for skewed distributions. Its value is only slightly affected by the presence of extreme observations, no matter how large these observations are.
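A small experiment illustrates what "resistant" means. The sketch below takes the Babe Ruth data from the slides and replaces the largest value with a deliberately extreme, made-up one (600); that replacement is an assumption for illustration only.

```python
from statistics import mean, median

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# Hypothetical contamination: turn the largest value (60) into 600.
with_outlier = [600 if x == 60 else x for x in ruth]

print(f"original:     mean = {mean(ruth):.1f}, median = {median(ruth)}")
print(f"with outlier: mean = {mean(with_outlier):.1f}, median = {median(with_outlier)}")
```

The mean is pulled far toward the extreme value, while the median stays where it was.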

The Mode
On average, the cars under study get 18.9 miles per gallon, and 50% of the cars under study get at least 18 miles per gallon.
The mode is the observation value with the highest frequency.
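The mode can be found by counting how often each value occurs. The car data are not included in this transcript, so this sketch reuses the Babe Ruth home-run data from the earlier slides; it returns every value tied for the highest frequency.

```python
from collections import Counter

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

counts = Counter(ruth)                    # value -> frequency
top_frequency = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top_frequency)

print(f"mode(s): {modes}, each occurring {top_frequency} times")
```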

Spread of a Distribution
Two measures of spread:
1. The Quartiles
The first quartile Q1 is the value such that 25% of the observations fall at or below it (Q1 is often called the 25th percentile). The third quartile Q3 is the value such that 75% of the observations fall at or below it (Q3 is often called the 75th percentile). The quartiles are typically used if the distribution of the observations is skewed.
[Figure: a distribution with Q1, M, and Q3 marked.]

First quartile (Q1) = 16, third quartile (Q3) = 21 What does this mean in terms of the data?

Percentiles (also called Quantiles)
In general, the n-th percentile is a value such that n% of the observations fall at or below it.
In the example before: 5th percentile = …, …th percentile = …, …th percentile = …, …th percentile = 22.
Hence about 80% of the cars get between 11 and 22 miles per gallon.
[Figure: a distribution with the n-th percentile marked and n% of the observations at or below it.]
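One way to turn the "at or below" definition into code is the nearest-rank rule: take the smallest observation that has at least p% of the data at or below it. This is a sketch of one convention among several; statistical software often interpolates between observations and may return slightly different values. The data are the Babe Ruth home runs from the earlier slides, since the car data are not reproduced here.

```python
import math

def percentile(values, p):
    """Smallest observation with at least p% of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based position in the sorted list
    return ordered[rank - 1]

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

for p in (10, 25, 50, 75, 90):
    print(f"{p}th percentile = {percentile(ruth, p)}")
```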

Descriptive measures for skewed distributions
If the histogram of the data is skewed, use the following descriptive statistics to describe the distribution of the observed variable: Min, Q1, Median, Q3, Max.
In our example, Min = 8, Q1 = 16, Median = 18, Q3 = 21, Max = 61.
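The five-number summary can be sketched with Python's standard library. The car data behind the example values are not in this transcript, so the sketch uses the Babe Ruth home-run data from the earlier slides. Note that `statistics.quantiles` uses its own interpolation rule (the default "exclusive" method), which is an assumption about convention; other software may give slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, med, q3 = quantiles(values, n=4)   # three cut points splitting the data into quarters
    return (min(values), q1, med, q3, max(values))

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

summary = five_number_summary(ruth)
print("Min, Q1, Median, Q3, Max =", summary)
```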

The Standard Deviation
If a distribution is symmetric, use the average to measure the center and the standard deviation to measure the spread. The standard deviation s (or SD) measures how far the observations are from the average.
Example: A person's metabolic rate is the rate at which the body consumes energy. Rates of 7 men in a study on dieting: 1792, 1666, 1614, 1460, 1867, 1439, …
The mean is 1600 and the s.d. is s = …
[Figure: metabolic rates plotted around the mean, with two deviations marked: 1867 - 1600 = 267 and 1600 - 1439 = 161.]

Formula for the SD
In symbols, the standard deviation s of n observations $x_1, \dots, x_n$ with mean $\bar{x}$ is
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
The variance of an observed variable is defined as the square of the standard deviation: Variance = $s^2$.
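The following sketch computes a standard deviation step by step, using the usual sample formula with n - 1 in the denominator. It uses the metabolic-rate example, with one loud assumption: only six of the seven rates survive in this transcript, so the seventh is taken to be 1362, the value that makes the mean exactly 1600 (the mean implied by the deviations 1867 - 1600 and 1600 - 1439 on the slide).

```python
import math
from statistics import stdev

# Six rates from the slide plus an ASSUMED seventh value (1362),
# chosen so that the mean equals 1600 as the slide's deviations imply.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = sum((x - xbar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

s = sample_sd(rates)
print(f"mean = {sum(rates) / len(rates):.0f}, s = {s:.2f}, variance = {s ** 2:.1f}")
print(f"statistics.stdev agrees: {stdev(rates):.2f}")
```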

Properties of the SD
It measures the spread about the mean, so it is used only in association with the mean.
It is a good descriptive measure for symmetric distributions.
If s = 0, all the observations have the same value.
It is never negative; the larger s is, the more spread out the observations are around the mean.
It is NOT a resistant measure: a few extreme observations may affect its value (make it very large).
The variance is the square of the s.d.

Interpreting the SD
For many lists of observations, especially if their histogram is bell-shaped:
1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average.
2. Roughly 95% of the observations lie within 2 standard deviations of the average.
[Figure: bell-shaped curve with 68% of the area between Ave - s.d. and Ave + s.d., and 95% between Ave - 2 s.d. and Ave + 2 s.d.]
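The rule of thumb can be checked on any list by counting how many observations fall within k standard deviations of the average. The sketch below does this for the Babe Ruth home-run data from the earlier slides; with only 15 observations the shares are rough, so they should be read as an illustration, not a confirmation of the 68% and 95% figures.

```python
from statistics import mean, stdev

def share_within(values, k):
    """Fraction of observations within k standard deviations of the average."""
    center, s = mean(values), stdev(values)
    inside = [x for x in values if abs(x - center) <= k * s]
    return len(inside) / len(values)

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

for k in (1, 2):
    print(f"within {k} s.d.: {share_within(ruth, k):.0%}")
```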

Example
In a large university, data were collected to study the academic achievements of computer science majors. We'll consider the SAT math (SATM) scores of 224 first-year CS students. The average SATM score is … with s.d. s = …
[Figure: histogram of the SATM scores.]
Are the average and s.d. good descriptions of the distribution of the SATM scores?
Roughly 68% of the students have scores between 510 and 680.
Roughly 95% of the students have scores between 422 and 768.

CS students example: Descriptive statistics
Mean = …
Std Deviation = …
Max = 800
Min = 300
Q1 = 540
Median = …
Q3 = 650
IQR = Q3 - Q1 = 110
1.5 × IQR = 165
5th percentile = …
…th percentile = 750
[Figure: histogram of the SATM scores; vertical axis: % of scores.]
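The IQR and the 1.5 × IQR figure on this slide follow directly from the quartiles. The sketch below reproduces them and, as an assumption about what the slide intends, applies the common 1.5 × IQR convention for flagging suspected outliers: values more than 1.5 × IQR below Q1 or above Q3.

```python
# Values taken from the slide.
q1, q3 = 540, 650
minimum, maximum = 300, 800

iqr = q3 - q1                 # interquartile range
step = 1.5 * iqr              # the "1.5 x IQR" figure on the slide

# Assumed convention: observations outside these fences are suspected outliers.
lower_fence = q1 - step
upper_fence = q3 + step

print(f"IQR = {iqr}, 1.5 x IQR = {step:.0f}")
print(f"fences: {lower_fence:.0f} to {upper_fence:.0f}")
print(f"minimum {minimum} below lower fence: {minimum < lower_fence}")
print(f"maximum {maximum} above upper fence: {maximum > upper_fence}")
```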

Analysis of the scores for male and female students:
[Figure: two panels, SATM scores for men and SATM scores for women.]

Exploratory Data Analysis:
1. Always plot your data.
2. Look for overall patterns and striking deviations such as outliers.
3. Calculate a numerical summary to describe the center and the spread.
4. NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve.

Computing descriptive statistics in Excel
There are two ways:
1. Use the formula palette (click on the fx button), OR
2. Use the Data Analysis ToolPak and select Descriptive Statistics.

The descriptive statistics tool
Input range: the sequence of cells containing the data.
Labels in first row.
Output range: tell Excel where to put the output.
Summary statistics: to be checked.

Formulas for 5-number summary
Select an empty cell and type the name of the function you want to compute, or use the function palette for the list of available functions. For instance, to compute the minimum of the city fuel consumption data, type =min(b2:b31)

Normal distributions
Normal curves provide a simple, compact way to describe symmetric, bell-shaped distributions.
[Figure: histogram of SAT math scores for CS students with a normal curve overlaid.]

Money spent in a supermarket
Is the normal curve a good approximation?

The area under the histogram, i.e. the percentage of the observations, can be approximated by the corresponding area under the normal curve. If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed). In that case we need to know only the average and the standard deviation of the observations!
[Figure: SAT math scores for CS students.]
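To illustrate how an average and a standard deviation are enough to approximate percentages, the sketch below computes areas under a normal curve with `statistics.NormalDist`. The mean of 100 and s.d. of 15 are hypothetical values chosen for the example; they are not the SATM figures, which are missing from this transcript.

```python
from statistics import NormalDist

def normal_share(mean, sd, low, high):
    """Area under the normal curve between low and high."""
    curve = NormalDist(mu=mean, sigma=sd)
    return curve.cdf(high) - curve.cdf(low)

# Hypothetical average and standard deviation, for illustration only.
avg, sd = 100, 15

within_1 = normal_share(avg, sd, avg - sd, avg + sd)
within_2 = normal_share(avg, sd, avg - 2 * sd, avg + 2 * sd)
print(f"within 1 s.d.: {within_1:.1%}")
print(f"within 2 s.d.: {within_2:.1%}")
```

The two areas come out close to the 68% and 95% figures quoted for bell-shaped histograms, whatever average and s.d. are plugged in.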