Announcements
First quiz next Monday (Week 3) at 6:15-6:45
Summary:
- Recap first lecture: Descriptive statistics, measures of center and spread
- Normal density (Section 1.3)
- SAS procedures for analyzing univariate data: Proc MEANS, Proc UNIVARIATE
CSC 323 Data analysis and Statistical software I

Describing distributions with numbers
The distribution of the data is described through its center and its spread.
- For symmetric distributions, use the mean and the standard deviation.
- For skewed distributions, use the five-number summary: Min, Q1, Median, Q3, Max.
The median is the midpoint of a distribution: the number such that half the observations are below it and the other half are above it.
Q1 is the first quartile, or 25th percentile: the point such that 25% of the observations are below it.
Q3 is the third quartile, or 75th percentile: the point such that 25% of the observations are above it.
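As a rough illustration of the five-number summary, here is a minimal Python sketch. The function name `five_number_summary` and the sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: textbooks and software packages use slightly different rules for Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile rule. Other conventions can
    # give slightly different Q1 and Q3 for the same data.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical observations, chosen only for illustration.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample)
print("Min, Q1, Median, Q3, Max =", summary)
```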

Example: Highway fuel economy (miles per gallon) for model year 2001 cars, N = 13. Median = ? Q1 = ? Q3 = ? Possible outliers?

Example: SAT Math scores of 224 Computer Science students
In a large university, data were collected to study the academic achievements of computer science majors. We'll consider the SAT math (SATM) scores of 224 first-year CS students. The average SATM score is 595.28 with s.d. s = 86.40.
[Figure: histogram of the SATM scores]
Are the average and s.d. good descriptions of the SATM score distribution?
Roughly 68% of the students have scores between 510 and 680.
Roughly 95% of the students have scores between 422 and 768.
How did I compute these intervals?

Interpreting the s.d. value
For many lists of observations, especially if their histogram is bell-shaped:
1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average (between Ave - s.d. and Ave + s.d.).
2. And 95% of the observations lie within 2 standard deviations of the average (between Ave - 2 s.d. and Ave + 2 s.d.).
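For the SAT math example, intervals of this kind can be computed directly from the average and the s.d. The sketch below uses the values 595.28 and 86.40 quoted in the worked example later in the lecture; the helper name `interval` is hypothetical.

```python
mean = 595.28  # sample average of the SATM scores
sd = 86.40     # sample standard deviation of the SATM scores

def interval(mean, sd, k):
    """Return the interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

one_sd = interval(mean, sd, 1)  # holds roughly 68% of a bell-shaped list
two_sd = interval(mean, sd, 2)  # holds roughly 95% of a bell-shaped list

print("within 1 s.d.: %.2f to %.2f" % one_sd)
print("within 2 s.d.: %.2f to %.2f" % two_sd)
```

Rounded, these endpoints agree with the intervals 510 to 680 and 422 to 768 quoted for the SATM scores.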

CS students example: Descriptive statistics
Mean = 595.28, Std Deviation = 86.40
Max = 800, Min = 300
Q1 = 540, Median = …, Q3 = 650
IQR = 110, 1.5 x IQR = 165
5th percentile = …, …th percentile = 750
[Figure: histogram of the SATM scores; vertical axis: % of scores]
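The quartiles on this slide can be used to screen for possible outliers. The sketch below assumes that the slide's value 165 is 1.5 x IQR and applies the common 1.5 x IQR rule (an assumption here, since the slide does not spell the rule out): observations below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are flagged as suspected outliers.

```python
q1, q3 = 540, 650            # quartiles from the slide
minimum, maximum = 300, 800  # extremes from the slide

iqr = q3 - q1                # interquartile range
step = 1.5 * iqr             # assumed 1.5 x IQR rule
lower_fence = q1 - step
upper_fence = q3 + step

# If the minimum (maximum) lies beyond a fence, at least one observation
# on that side is a suspected outlier.
low_outliers_possible = minimum < lower_fence
high_outliers_possible = maximum > upper_fence

print("IQR =", iqr, " 1.5 x IQR =", step)
print("fences:", lower_fence, "to", upper_fence)
print("suspected low outliers:", low_outliers_possible)
print("suspected high outliers:", high_outliers_possible)
```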

Analysis of the scores for male and female students
[Figure: box plots of SATM scores for men and SATM scores for women]

Exploratory Data Analysis:
1. Always plot your data.
2. Look for overall patterns and striking deviations such as outliers.
3. Calculate a numerical summary to describe the center and the spread:
   - Symmetric distributions: mean and standard deviation
   - Asymmetric distributions: five-number summary {Min, Q1, Median, Q3, Max}
4. NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve.

Density curves
Density curves describe the overall pattern of distributions. A density curve:
- is always on or above the horizontal axis;
- has area exactly 100% underneath it.
The density curve is a mathematical model that can be used to describe empirical distributions.
[Figure: SAT math scores for CS students]

Normal distribution
Normal curves provide a simple, compact way to describe symmetric, bell-shaped distributions.
[Figure: SAT math scores for CS students, with normal curve]

The normal curve has the following expression:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
It is centered on the mean μ and has spread equal to the standard deviation σ.
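To make the formula concrete, the following sketch evaluates the usual normal density with mean mu and standard deviation sigma, and checks numerically that the area under the curve is 1. The function names and the trapezoid-rule integration are illustrative choices; the mean and s.d. are the SATM values used elsewhere in the lecture.

```python
import math

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and s.d. sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area_under_curve(mu, sigma, lo, hi, steps=10_000):
    """Approximate the area under the curve on [lo, hi] (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (normal_density(lo, mu, sigma) + normal_density(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_density(lo + i * width, mu, sigma)
    return total * width

mu, sigma = 595.28, 86.40
# Integrate far enough into both tails that the missing area is negligible.
total_area = area_under_curve(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
print("area under the curve is approximately", round(total_area, 6))
```

The curve is highest at x = mu and takes the same height at equal distances on either side of mu, which is what "centered on the mean" means.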

Two normal curves with the same mean but different standard deviation.

Money spent in a supermarket Is the normal curve a good approximation?

The area under the histogram, i.e. the percentages of the observations, can be approximated by the corresponding area under the normal curve. If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed). The approximating normal density curve is uniquely defined by the average and the standard deviation of the observations!
[Figure: SAT math scores for CS students]

The variable SAT math scores is normally distributed with mean μ = 595.28 (sample average) and std deviation σ = 86.40 (sample standard deviation).
[Figure: SAT math scores for CS students]

The standard normal curve
The normal approximation is commonly used in statistics. There is a special normal curve that is well known: the standard normal distribution, which has mean μ = 0 and standard deviation σ = 1.
Simple mathematical formula:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
The curve is perfectly symmetric around 0.

Benchmarks under the standard normal curve
In the normal distribution N(μ, σ):
- Approximately 68% of the observations are between μ - σ and μ + σ (within 1 standard deviation of the mean)
- Approximately 95% of the observations are between μ - 2σ and μ + 2σ (within 2 standard deviations of the mean)
- Approximately 99.7% of the observations are between μ - 3σ and μ + 3σ (within 3 standard deviations of the mean)
By symmetry, 50% of the observations lie on each side of the mean.
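These benchmarks can be checked numerically. The sketch below uses the standard identity P(-k < Z < k) = erf(k / sqrt(2)) for a standard normal Z; the helper name `prob_within` is hypothetical.

```python
import math

def prob_within(k):
    """P(-k < Z < k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2.0))

benchmarks = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in benchmarks.items():
    print("within %d s.d.: %.2f%%" % (k, 100 * p))
```

Because standardizing maps any N(μ, σ) onto the standard normal, the same percentages hold for every normal distribution.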

Normal distribution function F(z)
It is defined as the area under the standard normal curve to the left of z, that is, F(z) = P(Z <= z).
The values of F(z) are tabulated; see Table A in your book appendix.
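When a printed table is not at hand, F(z) can also be evaluated in software. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the wrapper name `F` simply mirrors the slide's notation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def F(z):
    """Area under the standard normal curve to the left of z."""
    return standard_normal.cdf(z)

for z in (-1.0, 0.0, 1.0):
    print("F(%+.1f) = %.4f" % (z, F(z)))
```

By symmetry of the curve around 0, F(-z) = 1 - F(z), so a table of values for z >= 0 is enough in principle.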

Standard normal probabilities F(z)=P(Z<=z)

Application of the normal distribution to the data
The normal distribution can be used to approximate the distribution of the data when the data have a symmetric histogram!
Result: If X is normally distributed N(m, s) with mean m and standard deviation s, then the standardized value of X, given by Z = (X - m)/s, is a standard normal variable N(0, 1) with mean 0 and standard deviation 1.
Thus, we can compute the relative frequencies for any normal distribution by standardizing and using the probability Table A.
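The standardization result can be checked numerically: the area to the left of x under N(m, s) should equal the area to the left of z = (x - m)/s under N(0, 1). The values of m, s and x below are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def standardize(x, m, s):
    """Convert x to standard units: how many s.d. it lies from the mean."""
    return (x - m) / s

m, s = 100.0, 15.0   # hypothetical mean and standard deviation
x = 120.0            # hypothetical cutoff

direct = NormalDist(m, s).cdf(x)               # area under N(m, s)
via_z = NormalDist().cdf(standardize(x, m, s)) # area under N(0, 1)

print("direct: %.6f   via standardizing: %.6f" % (direct, via_z))
```

This is why a single table for the standard normal curve is enough for every normal distribution.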

Example
Mean = 595.28, Std Dev. s = 86.40
The distribution of the SATM scores for the CS students is approximately normal with mean 595.28 and s.d. 86.40: N(595.28, 86.40).
Problem: What is the percentage of CS students that had SAT math scores less than 700?
Answer: Use the normal approximation: X is N(595.28, 86.40). The answer is the area under the normal density curve for X < 700.
Standardize: subtract the average and divide by the standard deviation.
X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212

Answer: The answer is the area under the normal density curve for X < 700.
Standardize: subtract the mean, then divide by the standard deviation.
X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212
Look at Table A: we need to find the area to the left of Z = 1.212. Rounding to z = 1.21, the table gives F(z) = .8869.
Result: about 88.7% of the CS students have an SATM score of 700 or lower.
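The same calculation can be sketched in Python, using `statistics.NormalDist` in place of Table A. The result should be close to the table lookup; small differences arise because the table requires rounding z to two decimals.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40  # SATM sample average and s.d.
cutoff = 700

z = (cutoff - mean) / sd            # standardize the cutoff
share_below = NormalDist().cdf(z)   # area to the left of z under N(0, 1)

print("z = %.3f" % z)
print("share of scores below %d: %.1f%%" % (cutoff, 100 * share_below))
```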

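Problem: What is the percentage of CS students that had SAT math scores between 600 and 750?
How do we compute it? We use the values of the standard normal distribution function F(z) = P(Z <= z).
Approximate answer:
1) Standardize: 600 < X < 750 is equivalent to (600 - 595.28)/86.40 < Z < (750 - 595.28)/86.40, that is, 0.05 < Z < 1.79.
2) Subtract the areas: F(1.79) - F(0.05) = .9633 - .5199 = .4434, so about 44% of the CS students had scores between 600 and 750.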
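For an interval, the area is the difference of two values of F. A minimal sketch for the 600 to 750 question, with the SATM mean 595.28 and s.d. 86.40; software keeps the unrounded z-values, so the result can differ from a table lookup in the third decimal.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40  # SATM sample average and s.d.
low, high = 600, 750

# Standardize both endpoints.
z_low = (low - mean) / sd
z_high = (high - mean) / sd

# Area between the endpoints = F(z_high) - F(z_low).
F = NormalDist().cdf
share_between = F(z_high) - F(z_low)

print("z_low = %.3f, z_high = %.3f" % (z_low, z_high))
print("share between %d and %d: %.1f%%" % (low, high, 100 * share_between))
```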

Summary: Normal distribution calculations
Follow these steps:
1. State the problem. Calculate the sample average and the s.d., and define the interval you are interested in.
2. Standardize.
3. Compute the area under the standard normal density curve using Table A.
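The three steps can be collected into one helper that starts from raw data. This is only a sketch: the function name `normal_share` and the measurements are hypothetical, and the software function `NormalDist().cdf` stands in for Table A.

```python
import statistics
from statistics import NormalDist

def normal_share(data, low=None, high=None):
    """Normal-approximation share of observations between low and high.

    Pass low=None or high=None for an interval that is open on that side.
    """
    # Step 1: sample average and s.d. (the interval is given by low, high).
    average = statistics.mean(data)
    sd = statistics.stdev(data)
    cdf = NormalDist().cdf
    # Steps 2 and 3: standardize each endpoint, then take the area to its left.
    area_left_of_high = 1.0 if high is None else cdf((high - average) / sd)
    area_left_of_low = 0.0 if low is None else cdf((low - average) / sd)
    return area_left_of_high - area_left_of_low

# Hypothetical measurements, for illustration only.
measurements = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.6, 7.1]
print("share between 5 and 6: %.3f" % normal_share(measurements, 5, 6))
```

The approximation is only as good as the normal model: for data whose histogram is clearly not bell-shaped, the returned share can be far from the actual relative frequency.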

Inverse problem: What is the lowest SAT math score that a student must have to be in the top 25% of all CS students in the sample?
Find the value x such that 25% of the observations fall at or above it.
Mean = 595.28, Std Dev. s = 86.40
Sample Q3 = 650
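The inverse problem runs the table backwards: find the z-value with 75% of the area to its left, then convert back to score units. A sketch using `NormalDist.inv_cdf` in place of a reverse lookup in Table A, with the SATM mean 595.28 and s.d. 86.40:

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40  # SATM sample average and s.d.
top_share = 0.25          # we want the cutoff for the top 25%

# Top 25% means 75% of the area lies to the left of the cutoff.
z_star = NormalDist().inv_cdf(1 - top_share)
x_star = mean + z_star * sd   # undo the standardization

print("z* = %.3f" % z_star)
print("normal-approximation cutoff: %.1f" % x_star)
```

The cutoff from the normal approximation can then be compared with the sample Q3 of 650 as a check on how well the normal curve fits.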

Example: During a study on machine performance, the time between machine failures was recorded for 39 similar machines. From the data, the average time = 23.35 hours and the sample standard deviation = 1.67 hours.
1. What is the percentage of machines that failed after 24 hours?
2. What is the percentage of machines with failure time between 20 and 22 hours?
3. How short should the failure time be for a machine to be in the bottom 10%?

Answers
The observations are on the variable X, time of failure, which is approximately normal N(23.35, 1.67).
1. What is the percentage of machines that failed after 24 hours?
Compute the percentage for X > 24, which is equal to the area under the normal distribution to the right of 24.
Standardize: X > 24 becomes Z = (X - 23.35)/1.67 > (24 - 23.35)/1.67, or equivalently Z > 0.39.
Use the standard normal probability tables: (Area to the right of 0.39) = 1 - (Area to the left of 0.39) = 1 - 0.6517 = 0.3483.
The answer: about 35% of the machines failed after 24 hours.

2. What is the percentage of machines with failure time between 20 and 22 hours?
We need to compute the area under the normal distribution for 20 < X < 22. This is computed by subtracting: (Area for X < 22) - (Area for X < 20).
Standardize: X < 22 is, in standard units, Z < (22 - 23.35)/1.67 = -0.81; X < 20 is, in standard units, Z < (20 - 23.35)/1.67 = -2.00.
Use the standard normal probability tables: the area under the standard normal distribution for Z < -0.81 is 0.2090; the area for Z < -2.00 is 0.0228.
The answer is 0.2090 - 0.0228 = 0.1862: about 18.6% of the machines have failure time between 20 and 22 hours.

3. How short should the failure time be for a machine to be in the bottom 10%?
We need to compute the value x* for X ~ N(23.35, 1.67) such that the area under the normal distribution to the left of x* is equal to 0.1.
From the normal probability tables, the standard value z* that corresponds to an area P(Z < z*) = 0.1 is z* = -1.28.
Thus, transforming the z-value back to the x-units, we have x* = -1.28 * st.dev. + mean = -1.28 * 1.67 + 23.35 = 21.21.
So the bottom 10% of the machines have failure time equal to 21.21 hours or shorter.
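All three machine-failure questions can be checked with a few lines of Python, assuming the N(23.35, 1.67) model stated in the answers. `NormalDist` does the standardizing internally, so the results may differ from table lookups in the third decimal.

```python
from statistics import NormalDist

# Normal approximation for the time of failure X, in hours.
failure_time = NormalDist(mu=23.35, sigma=1.67)

# 1. Share of machines that failed after 24 hours: area to the right of 24.
share_after_24 = 1 - failure_time.cdf(24)

# 2. Share with failure time between 20 and 22 hours: difference of areas.
share_20_to_22 = failure_time.cdf(22) - failure_time.cdf(20)

# 3. Failure time that cuts off the bottom 10%.
bottom_10_cutoff = failure_time.inv_cdf(0.10)

print("failed after 24 h: %.1f%%" % (100 * share_after_24))
print("between 20 and 22 h: %.1f%%" % (100 * share_20_to_22))
print("bottom 10%% cutoff: %.2f h" % bottom_10_cutoff)
```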

Normal approximations
Is the normal approximation appropriate for these data?
[Figure: histogram with normal curve; labels: "Overestimate this area", "Underestimate this area"]
Use the normal approximation ONLY when the histogram of the observations is bell-shaped!

Normal quantile plots
A useful tool for assessing whether the data come from a normal distribution is a graph called a normal quantile plot. If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are normal. Deviations from a straight line indicate that the data are not normal.
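A sketch of how the points of a normal quantile plot can be computed without any plotting library. Several choices here are assumptions made for illustration: the plotting positions (i - 0.5)/n are one common convention among several, the correlation of the points is used as a rough numerical stand-in for judging straightness by eye, and both data sets are simulated.

```python
import random
from statistics import NormalDist, mean

def normal_quantile_points(data):
    """Pair each sorted observation with its standard normal quantile."""
    n = len(data)
    ordered = sorted(data)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, ordered))

def straightness(points):
    """Correlation of the plotted points; near 1 means nearly a straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the simulated data are reproducible
bell_shaped = [rng.gauss(50, 10) for _ in range(200)]
right_skewed = [rng.expovariate(1.0) for _ in range(200)]

r_bell = straightness(normal_quantile_points(bell_shaped))
r_skew = straightness(normal_quantile_points(right_skewed))
print("bell-shaped data: r = %.3f" % r_bell)
print("right-skewed data: r = %.3f" % r_skew)
```

Plotting the pairs returned by `normal_quantile_points` (z on the horizontal axis, data on the vertical axis) gives the normal quantile plot itself; the skewed sample bends away from a straight line in the upper tail.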

SAS for E.D.A.
PROC MEANS, PROC UNIVARIATE: to compute descriptive statistics
PROC CHART (GCHART): to plot histograms
PROC UNIVARIATE: to plot histograms, normal probability plots, boxplots