
1 Announcements: First quiz next Monday (Week 3) at 6:15-6:45. Summary: recap of first lecture (descriptive statistics, measures of center and spread); normal density (Section 1.3); SAS procedures for analyzing univariate data (PROC MEANS, PROC UNIVARIATE). CSC 323 Data analysis and Statistical software I

2 Describing distributions with numbers. The distribution of the data is described through its center and its spread. For symmetric distributions, use the mean and the standard deviation. For skewed distributions, use the five-number summary: Min, Q1, Median, Q3, Max. The median is the midpoint of a distribution: the number such that half the observations are below it and the other half are above it. Q1 is the first quartile or 25th percentile: the point such that 25% of the observations are below it. Q3 is the third quartile or 75th percentile: the point such that 25% of the observations are above it.

3 Example. Highway fuel economy (miles per gallon) for model year 2001 cars: 13 16 19 21 22 24 24 25 26 28 30 30 68. N = 13. Median = ? Q1 = ? Q3 = ? Possible outliers?
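The questions on this slide can be checked with a short Python sketch. It assumes the textbook convention in which Q1 and Q3 are the medians of the observations below and above the position of the overall median; statistical packages, SAS included, may use other quartile definitions and give slightly different values. The 1.5 x IQR rule for flagging suspected outliers is the one referred to later in the descriptive statistics for the CS students. The function names are illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves, excluding the middle observation when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

def suspected_outliers(values):
    """Flag observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - step or x > q3 + step]

mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]
summary = five_number_summary(mpg)
outliers = suspected_outliers(mpg)
print("Min, Q1, Median, Q3, Max:", summary)
print("Suspected outliers:", outliers)
```

Under this convention the median is 24, Q1 is 20 and Q3 is 29, so the IQR is 9 and only the 68 mpg car lies beyond Q3 + 1.5 x IQR = 42.5.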

4 Example: SAT Math scores of 224 computer science students. In a large university, data were collected to study the academic achievements of computer science majors. We’ll consider the SAT math (SATM) scores of 224 first-year CS students. The average SATM score is 595.28 with s.d. s = 86.40. [Figure: histogram of the SATM scores.] Are the average and s.d. good descriptions of the SATM score distribution? Roughly 68% of the students have scores between 510 and 680. Roughly 95% of the students have scores between 422 and 768. How did I compute these intervals?

5 Interpreting the s.d. value. For many lists of observations, especially if their histogram is bell-shaped: 1. Roughly 68% of the observations in the list lie within 1 standard deviation of the average. 2. Roughly 95% of the observations lie within 2 standard deviations of the average. [Figure: bell-shaped curve marking Ave ± s.d. (68%) and Ave ± 2 s.d. (95%).]
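As a small illustration of this rule, the sketch below computes the "average ± k s.d." intervals for the SATM example (mean 595.28, s.d. 86.40, both taken from the slides). The helper name is made up for this example.

```python
def sd_interval(mean, sd, k):
    """Interval covering k standard deviations on each side of the mean."""
    return mean - k * sd, mean + k * sd

mean, sd = 595.28, 86.40
one_sd = sd_interval(mean, sd, 1)   # holds roughly 68% of bell-shaped data
two_sd = sd_interval(mean, sd, 2)   # holds roughly 95% of bell-shaped data
print(f"about 68%: {one_sd[0]:.2f} to {one_sd[1]:.2f}")
print(f"about 95%: {two_sd[0]:.2f} to {two_sd[1]:.2f}")
```

The endpoints come out near 509 and 682 for one s.d. and near 422 and 768 for two, which matches the rounded intervals quoted for the SATM scores.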

6 CS students example: Descriptive statistics. Mean = 595.28, Std Deviation = 86.40, Min = 300, Max = 800, Q1 = 540, Median = 600.00, Q3 = 650, IQR = 110, 1.5 x IQR = 165, 5th percentile = 460, 95th percentile = 750. [Figure: histogram of the SATM scores; about 95% of scores lie between 422 and 768.]

7 Analysis of the scores for male and female students. [Figure: side-by-side box plots of SATM scores for men and SATM scores for women.]

8 Exploratory Data Analysis: 1. Always plot your data. 2. Look for overall patterns and striking deviations such as outliers. 3. Calculate a numerical summary to describe the center and the spread. Symmetric distributions: mean and standard deviation. Asymmetric distributions: five-number summary {Min, Q1, Median, Q3, Max}. 4. NEXT STEP: sometimes the overall pattern is so regular that we can describe it through a smooth curve, called a density curve.

9 Density curves. Density curves describe the overall pattern of distributions. A density curve is always on or above the horizontal axis and has area exactly 1 (that is, 100%) underneath it. The density curve is a mathematical model that can be used to describe empirical distributions. [Figure: SAT math scores for CS students.]

10 Normal distribution. Normal curves provide a simple, compact way to describe symmetric, bell-shaped distributions. [Figure: SAT math scores for CS students with a normal curve overlaid.]

11 The normal curve has the following expression: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. It is centered on the mean μ and has spread equal to the standard deviation σ.
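To make the role of the mean and the standard deviation concrete, here is a minimal sketch that evaluates the usual normal density, 1/(σ√(2π)) · exp(−(x − μ)²/(2σ²)), and compares it with `statistics.NormalDist.pdf` from the Python standard library. The function name `normal_density` is illustrative, and the SATM values are used only as sample inputs.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and s.d. sigma at the point x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 595.28, 86.40
peak = normal_density(mu, mu, sigma)          # the curve is highest at the mean
library_peak = NormalDist(mu, sigma).pdf(mu)  # same value from the standard library
left = normal_density(mu - 50, mu, sigma)
right = normal_density(mu + 50, mu, sigma)    # symmetry: equal heights around mu
narrow_peak = normal_density(mu, mu, sigma / 2)

print(peak, library_peak)
print(left, right)
print(narrow_peak > peak)  # a smaller s.d. gives a taller, narrower curve
```

The last comparison illustrates why two normal curves with the same mean but different standard deviations differ only in how concentrated they are around the center.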

12 Two normal curves with the same mean but different standard deviations.

13 Money spent in a supermarket: is the normal curve a good approximation?

14 The area under the histogram, i.e. the percentage of the observations, can be approximated by the corresponding area under the normal curve. If the histogram is symmetric and bell-shaped, we say that the data are approximately normal (or normally distributed). The approximating normal density curve is uniquely defined by the average and the standard deviation of the observations. [Figure: SAT math scores for CS students.]

15 The variable SAT math scores is approximately normally distributed with mean μ = 595.28 (sample average) and standard deviation σ = 86.40 (sample standard deviation). [Figure: SAT math scores for CS students.]

16 The standard normal curve. The normal approximation is commonly used in statistics. There is a special normal curve that is well known: the standard normal distribution, which has mean μ = 0 and standard deviation σ = 1. It has a simple mathematical formula, $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$, and the curve is perfectly symmetric around 0.

17 Benchmarks under the standard normal curve. In the normal distribution N(μ, σ): approximately 68% of the observations are between μ - σ and μ + σ (within 1 standard deviation of the mean); approximately 95% of the observations are between μ - 2σ and μ + 2σ (within 2 standard deviations of the mean); approximately 99.7% of the observations are between μ - 3σ and μ + 3σ (within 3 standard deviations of the mean). [Figure: normal curve with a 50% area label.]
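These benchmarks can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the printed table; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard = NormalDist()  # standard normal: mean 0, standard deviation 1

def share_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard.cdf(k) - standard.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} s.d.: {share:.4f}")
```

Because every normal curve becomes the standard one after standardizing, the same three shares apply to any N(μ, σ).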

18 Normal distribution function F(z). It is defined as the area under the standard normal curve to the left of z, that is, F(z) = P(Z <= z). The values of F(z) are tabulated: see Table A in the appendix of your book.

19 Standard normal probabilities F(z)=P(Z<=z)
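When the printed table is not at hand, F(z) can be evaluated directly. The following sketch treats `NormalDist().cdf` from the Python standard library as a substitute for Table A and prints a few entries rounded to four decimals, the precision such tables typically use. The z-values are picked from the worked examples in this lecture.

```python
from statistics import NormalDist

def F(z):
    """Area under the standard normal curve to the left of z: P(Z <= z)."""
    return NormalDist().cdf(z)

table_rows = [(z, round(F(z), 4)) for z in (-2.00, -1.28, -0.81, 0.00, 0.39, 1.21)]
for z, area in table_rows:
    print(f"z = {z:5.2f}   F(z) = {area:.4f}")
```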

20 Application of the normal distribution to the data. The normal distribution can be used to approximate the distribution of the data when the data have a symmetric, bell-shaped histogram. Result: if X is normally distributed N(μ, σ) with mean μ and standard deviation σ, then the standardized value of X, given by Z = (X - μ)/σ, is a standard normal variable N(0, 1) with mean 0 and standard deviation 1. Thus, we can compute the relative frequencies for any normal distribution by standardizing and using the probability Table A.

21 Example. Mean = 595.28, Std Dev. s = 86.40. The distribution of the SATM scores for the CS students is approximately normal with mean 595.28 and s.d. 86.40: N(595.28, 86.40). Problem: What is the percentage of CS students that had SAT math scores less than 700? Answer: Use the normal approximation, with X distributed as N(595.28, 86.40). The answer is the area under the normal density curve for X < 700. Standardize: subtract the average and divide by the standard deviation. X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212.

22 Answer: The answer is the area under the normal density curve for X < 700. Standardize: subtract the mean, then divide by the standard deviation. X < 700 is equivalent to Z = (X - 595.28)/86.40 < (700 - 595.28)/86.40 = 1.212. Look at Table A: we need the area to the left of Z = 1.212, which rounds to z = 1.21 for the table lookup, giving F(1.21) = 0.8869. Result: about 88.69% of the CS students have SATM scores of 700 or lower.
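The standardize-then-look-up calculation for X < 700 can be sketched as follows. `NormalDist().cdf` plays the role of Table A here, and rounding z to two decimals before the lookup mimics how the printed table is read; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 595.28, 86.40   # sample average and s.d. of the SATM scores
z = (700 - mean) / sd      # standardize the cutoff X = 700

standard = NormalDist()
share_exact = standard.cdf(z)            # area to the left of the unrounded z
share_table = standard.cdf(round(z, 2))  # area for z rounded as in a printed table

print(f"z = {z:.3f}")
print(f"area with unrounded z: {share_exact:.4f}")
print(f"area with z rounded to 2 decimals: {share_table:.4f}")
```

The two areas differ only in the fourth decimal, which is the price of rounding z for a table lookup.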

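23 Problem: What is the percentage of CS students that had SAT math scores between 600 and 750? How do we compute it? We use the values of the standard normal distribution function F(z) = P(Z <= z). Approximate answer: 1) Standardize: 600 < X < 750 is equivalent to (600 - 595.28)/86.40 < Z < (750 - 595.28)/86.40, that is, 0.05 < Z < 1.79. [Figure: normal curve centered at 595.28 with the area between 600 and 750 shaded.]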
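One way to finish this calculation is to subtract two left-tail areas, F(z_high) - F(z_low). The sketch below does this with `statistics.NormalDist` in place of Table A; it uses unrounded z-values, so the result may differ from a table-based answer in the third or fourth decimal.

```python
from statistics import NormalDist

satm = NormalDist(mu=595.28, sigma=86.40)  # normal approximation to the SATM scores
standard = NormalDist()

# Step 1: standardize both endpoints of the interval 600 < X < 750.
z_low = (600 - satm.mean) / satm.stdev
z_high = (750 - satm.mean) / satm.stdev

# Step 2: the area between them is the difference of two left-tail areas.
share_between = standard.cdf(z_high) - standard.cdf(z_low)

print(f"z_low = {z_low:.2f}, z_high = {z_high:.2f}")
print(f"share between 600 and 750: {share_between:.4f}")
```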

24 Summary: Normal distribution calculations. Follow these steps: 1. State the problem: calculate the sample average and the s.d., and define the interval you are interested in. 2. Standardize. 3. Compute the area under the standard normal density curve using Table A.

25 Inverse problem: What is the lowest SAT math score that a student must have to be in the top 25% of all CS students in the sample? Find the value x such that 25% of the observations fall at or above it. Mean = 595.28, Std Dev. s = 86.40, sample Q3 = 650. [Figure: normal curve with the top 25% shaded and the cutoff marked "?".]
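The inverse problem runs the table lookup backwards: find the z-value with area 0.75 to its left, then convert it back to the score scale with x = mean + z x s.d. The sketch below uses `NormalDist.inv_cdf` from the Python standard library for the reverse lookup; this is an illustration under the normal approximation, not a replacement for the sample quartile.

```python
from statistics import NormalDist

satm = NormalDist(mu=595.28, sigma=86.40)  # normal approximation to the SATM scores

# Top 25% means 75% of the area lies to the left of the cutoff.
z_star = NormalDist().inv_cdf(0.75)
cutoff_by_hand = satm.mean + z_star * satm.stdev  # back-transform to score units
cutoff = satm.inv_cdf(0.75)                       # same result in one call

print(f"z* = {z_star:.4f}")
print(f"cutoff score = {cutoff:.2f}")
```

The normal approximation puts the cutoff a few points above the sample Q3 of 650, which gives a rough sense of how well the approximation fits in this part of the distribution.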

26 Example: During a study on machine performance, the time between machine failures was recorded for 39 similar machines. From the data, the average time = 23.35 hours and the sample standard deviation = 1.67 hours. 1. What is the percentage of machines that failed after 24 hours? 2. What is the percentage of machines with failure time between 20 and 22 hours? 3. How short should the failure time be for a machine to be in the bottom 10%?

27 Answers. The observations are on the variable X, the time of failure, which is approximately normal N(23.35, 1.67). 1. What is the percentage of machines that failed after 24 hours? Compute the percentage for X > 24, which is equal to the area under the normal distribution to the right of 24. Standardize: X > 24 becomes (X - 23.35)/1.67 > (24 - 23.35)/1.67, or equivalently Z > 0.39. Use the standard normal probability tables: (area to the right of 0.39) = 1 - (area to the left of 0.39) = 1 - 0.6517 = 0.3483. The answer is 0.3483: about 35% of the machines failed after 24 hours.

28 2. What is the percentage of machines with failure time between 20 and 22 hours? We need to compute the area under the normal distribution for 20 < X < 22. This is computed by subtraction: (area for X < 22) - (area for X < 20). Standardize: X < 22 is, in standard units, Z < (22 - 23.35)/1.67 = -0.81; X < 20 is, in standard units, Z < (20 - 23.35)/1.67, approximately -2.00. Use the standard normal probability tables: the area under the standard normal distribution for Z < -0.81 is 0.2090, and the area for Z < -2.00 is 0.0228. The answer is 0.2090 - 0.0228 = 0.1862: 18.62% of the machines have failure time between 20 and 22 hours.

29 3. How short should the failure time be for a machine to be in the bottom 10%? We need to compute the value x* for X ~ N(23.35, 1.67) such that the area under the normal distribution to the left of x* is equal to 0.1. [Figure: normal curve centered at 23.35 with area 0.1 shaded to the left of x*.] From the normal probability tables, the standard value z* that corresponds to an area P(Z < z*) = 0.1 is z* = -1.28. Transforming the z-value back to x-units, we have x* = (z*)(s.d.) + mean = (-1.28)(1.67) + 23.35 = 21.21. So the bottom 10% of the machines have failure times of 21.21 hours or shorter.
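The three machine-failure answers can be reproduced in a few lines. This sketch uses `statistics.NormalDist` instead of the printed tables and works with unrounded z-values, so its results agree with the table-based answers only to about two or three decimals.

```python
from statistics import NormalDist

# Normal approximation to the failure times (hours), from the sample summary.
failure_time = NormalDist(mu=23.35, sigma=1.67)

# 1. Share of machines failing after 24 hours: area to the right of 24.
after_24 = 1 - failure_time.cdf(24)

# 2. Share failing between 20 and 22 hours: difference of two left-tail areas.
between_20_and_22 = failure_time.cdf(22) - failure_time.cdf(20)

# 3. Failure time that cuts off the bottom 10%: inverse lookup at area 0.10.
bottom_10_cutoff = failure_time.inv_cdf(0.10)

print(f"failed after 24 h:      {after_24:.4f}")
print(f"failed between 20-22 h: {between_20_and_22:.4f}")
print(f"bottom 10% cutoff:      {bottom_10_cutoff:.2f} hours")
```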

30 Normal approximations. Is the normal approximation appropriate for these data? [Figure: histogram with a normal curve overlaid; the curve overestimates the area in one region and underestimates it in another.] Use the normal approximation ONLY when the histogram of the observations is bell-shaped!

31 Normal quantile plots. A useful tool for assessing whether the data come from a normal distribution is a graph called a normal quantile plot. If the points on a normal quantile plot lie close to a straight line, the plot indicates that the data are approximately normal. Deviations from a straight line indicate that the data are not normal.
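The coordinates of a normal quantile plot can be computed by hand, which may help clarify what the graph shows. The sketch below pairs each sorted observation with the standard normal z-value at plotting position (i - 0.5)/n; that position formula is one common convention and an assumption here, since statistical packages use slightly different ones. The fuel economy data from the earlier example serve as input.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Pair each sorted observation with its expected standard normal z-value.

    Plotting positions (i - 0.5) / n are an assumed convention.
    """
    data = sorted(values)
    n = len(data)
    standard = NormalDist()
    return [(standard.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(data, start=1)]

mpg = [13, 16, 19, 21, 22, 24, 24, 25, 26, 28, 30, 30, 68]
points = normal_quantile_points(mpg)
for z, x in points:
    print(f"{z:6.2f}  {x}")
```

Plotting the second column against the first gives the normal quantile plot. For these data the 68 mpg observation would sit far above the line suggested by the other points, the kind of deviation the slide describes.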

32 SAS for E.D.A. PROC MEANS and PROC UNIVARIATE: to compute descriptive statistics. PROC CHART (GCHART): to plot histograms. PROC UNIVARIATE: to plot histograms, normal probability plots, boxplots.

