
1 Ch3 Elementary Descriptive Statistics

2 Section 3.1: Elementary Graphical Treatment of Data
Before doing ANYTHING with data:
Understand the question.
– "An approximate answer to the exact question is always better than an exact answer to an approximate question." (John Tukey)
Know how the experiment was conducted.

3 The FIRST thing to do with the data is to PLOT THE DATA.
– Plot all individual points.
– If there are connections between points, e.g. the points come from the same pairs (or sometimes separate blocks), show the connections between related points.

4 Plotting data is an extremely important step.
More often than not, the data I get when consulting have problems, such as incorrect values or attributes that I wasn't told about.
Plotting helps reveal relationships and answers.
Plotting is a very effective way to present results.
– "A picture is worth a thousand words."

5 Example: 8 lb. test fishing line
Question: Which type(s) of line are strongest?
Listing numerical data:
Trilene XL   11.5  11.3  11.7  11.6  11.7  11.4  11.5  11.5  11.6  11.4
Trilene XT   11.6  11.8  11.7  11.7  11.5  11.6  11.6  11.8  11.5  11.7
Stren        11.1  11.1  11.2  11.0  11.1  11.3  11.2  10.9  11.0  11.1
It's hard to see what's happening without organizing the data.

6 A "dot" diagram (one * per observation, tallied from the data listing on slide 5)
        XL      XT      Stren
11.8            **
11.7    **      ***
11.6    **      ***
11.5    ***     **
11.4    **
11.3    *               *
11.2                    **
11.1                    ****
11.0                    **
10.9                    *
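The tallies behind a dot diagram like this one can be checked in R. The following is a minimal sketch, not the deck's own code: the variable names are illustrative, and `stripchart` with `method = "stack"` is just one of several ways to draw stacked dots.

```r
# Breaking strengths as listed on slide 5
xl    <- c(11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4)
xt    <- c(11.6, 11.8, 11.7, 11.7, 11.5, 11.6, 11.6, 11.8, 11.5, 11.7)
stren <- c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)

strength <- c(xl, xt, stren)
line <- factor(rep(c("XL", "XT", "Stren"), each = 10),
               levels = c("XL", "XT", "Stren"))

# Number of observations at each strength value, by line type
counts <- table(strength, line)
print(counts)

# Stacked dot diagram, one row of dots per line type
stripchart(strength ~ line, method = "stack", pch = 16,
           xlab = "Breaking strength")
```

Each column of `counts` corresponds to one column of stars in the dot diagram.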

7 Stem and leaf plot
It shows the distribution shape and at the same time preserves the original values.
In the gear runout example, the data for the hung gears are 7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13…
A stem and leaf plot is
0 | 7 8 8
1 | 0 0 0 0 1 1 1 2 3
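As a rough sketch of how the stems and leaves are formed, the tens digit can serve as the stem and the units digit as the leaf. Only the twelve runout values quoted on the slide are used here (the list continues in the original data), and the names `runout` and `leaves` are illustrative.

```r
# First twelve runout values quoted on the slide (the full list is longer)
runout <- c(7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13)

# Stem = tens digit, leaf = units digit
leaves <- split(runout %% 10, runout %/% 10)

# Print one line per stem
for (s in names(leaves)) {
  cat(s, "|", leaves[[s]], "\n")
}
```

Base R also provides `stem(runout)`, which chooses its own stem width and may group the values differently.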

8 Two groups can be compared with back to back stem and leaf diagrams.
E.g. Stopping distances of bikes
Treaded tire | Stem | Smooth tire
             |  34  | 1 8 9
             |  35  | 5
           5 |  36  |
         6 4 |  37  | 5
             |  38  |
           1 |  39  | 1
         2 0 |  40  |
Or dot diagrams
Treaded  |     |     | *   | **  |     | *   | **
         340   350   360   370   380   390   400
Smooth   | *** | *   |     | *   |     | *   |

9 When there are associations between sets of data values, plot the data accordingly.
E.g., snowfall for Duluth and White Bear Lake, 1972-2000
A not very good way to plot the data:
[Figure: side-by-side dot diagrams of snowfall for White Bear Lake and Duluth, on a vertical scale running from 20 to 130]

10 [Figure: plot with axes labeled Duluth and White Bear]

11 A study of trace metals in South Indian River
T = top water zinc concentration (mg/L), B = bottom water zinc concentration (mg/L)
Site     1      2      3      4      5      6
Top      0.415  0.238  0.390  0.410  0.605  0.609
Bottom   0.430  0.266  0.567  0.531  0.707  0.716

12 One of the first things to do when analyzing data is to PLOT the data.
[Figure: separate dot plots of the Top and Bottom zinc concentrations]
This is not a useful way to plot the data. There is not a clear distinction between bottom water and top water zinc, even though Bottom > Top at all 6 locations.

13 A better way: connect points in the same pair.
[Figure: Top and Bottom zinc concentrations with a line joining the two values from each site]

14 Another way (scatter plot)
[Figure: scatter plot of the Top and Bottom pairs with the reference line Bottom = Top]
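The paired plots described on slides 13 and 14 can be sketched in R from the zinc table on slide 11. This is an illustrative reconstruction, not the original figures: the axis choices and plotting options are assumptions.

```r
# Zinc concentrations (mg/L) at the 6 sites, from slide 11
top    <- c(0.415, 0.238, 0.390, 0.410, 0.605, 0.609)
bottom <- c(0.430, 0.266, 0.567, 0.531, 0.707, 0.716)

# Within-pair differences: positive at every site
d <- bottom - top
print(round(d, 3))
print(all(d > 0))

# Connected pairs: one line segment per site, from its Top value to its Bottom value
matplot(rbind(top, bottom), type = "b", pch = 16, lty = 1, col = 1,
        xaxt = "n", xlab = "", ylab = "Zinc concentration (mg/L)")
axis(1, at = 1:2, labels = c("Top", "Bottom"))

# Scatter plot with the reference line Bottom = Top
plot(top, bottom, xlab = "Top (mg/L)", ylab = "Bottom (mg/L)")
abline(0, 1, lty = 2)
```

Every segment in the first plot slopes upward, and every point in the second lies above the dashed line, which is the pattern the unpaired dot plots hide.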

15 The following plot would imply a natural ordering of sites from 1 to 6. This would not be the best way to plot the data unless the sites 1-6 correspond to a natural ordering, such as distance downstream of a factory.

16 Run charts (a version of scatter plot)
The variable on the x axis is a time variable.
Table: 30 consecutive outer diameters turned on a lathe

17 Moving along time, the outer diameters tend to get smaller until part 16, where there is a large jump, followed by a pattern of diameter generally decreasing in time.

18 Section 3.2: Quantiles and Related Graphical Tools
Quantile: Roughly speaking, for a number p between 0 and 1, the p quantile of a distribution is a number such that a fraction p of the distribution lies to the left and a fraction 1-p of the distribution lies to the right.

19 p quantile = 100*p th percentile
Q(0.10) = 0.10 quantile = 10th percentile
Q(0.50) = 0.50 quantile = 50th percentile = median
Q(0.25) = 0.25 quantile = 25th percentile = first quartile
Q(0.75) = 0.75 quantile = 75th percentile = third quartile

20 The p quantile is the ordered data point with index $i = np + 0.5$.
So the cumulative probability corresponding to the i th ordered point is $p = (i - 0.5)/n$.

21 Consider the following n=10 points: [data table not shown]
Q(0.25) = 0.25 quantile = 8572
Q(0.50) = median = [value not shown]
Q(0.75) = 0.75 quantile = 9614
IQR = Interquartile Range = Q(0.75) - Q(0.25) = 9614 - 8572 = 1042

22 To find the 93rd percentile: 0.93 is part way between 0.85 and 0.95, so Q(0.93) is 0.8 of the way from Q(0.85) to Q(0.95):
Q(0.93) = Q(0.85) + 0.8(Q(0.95) - Q(0.85)) = 0.2*Q(0.85) + 0.8*Q(0.95) = 0.2(9614) + 0.8(10,688) = 10,473.
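The interpolation rule can be written as a small R function. This sketch assumes the convention that the i th ordered point is the (i - 0.5)/n quantile, which is what the worked values on slides 21 and 22 imply. The function name `Q` and the clamping for extreme p are illustrative choices. Because the n=10 data set for this example is not reproduced in the transcript, the Stren values from slide 5 are used instead.

```r
# p quantile under the convention Q((i - 0.5)/n) = i-th ordered value,
# with linear interpolation between neighbouring ordered values
Q <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- min(max(n * p + 0.5, 1), n)  # assumption: clamp the index to 1..n for extreme p
  lo <- floor(i)
  hi <- ceiling(i)
  x[lo] + (i - lo) * (x[hi] - x[lo])
}

# Stren breaking strengths from slide 5 (n = 10)
stren <- c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)

q25 <- Q(stren, 0.25)  # index 3: the third ordered value
q93 <- Q(stren, 0.93)  # index 9.8: 0.8 of the way from the 9th to the 10th value
print(c(q25, q93))

# The arithmetic on slide 22
slide22 <- 0.2 * 9614 + 0.8 * 10688
print(slide22)
```

Base R's `quantile(x, p, type = 5)` uses the same (i - 0.5)/n convention, so it can serve as a cross-check.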

23 Boxplots are useful summaries, particularly when there are too many points for a dot plot. To make a boxplot, we need essentially 5 numbers.
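The transcript does not list the five numbers, so the following sketch rests on a common convention that is an assumption here: the box runs from Q(0.25) to Q(0.75) with a line at the median, and the whiskers extend to the most extreme points within 1.5 IQR of the box. The Stren data from slide 5 are used for illustration.

```r
# Stren breaking strengths from slide 5
stren <- c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)

# Quartiles and median, using the (i - 0.5)/n quantile convention (type = 5)
q <- quantile(stren, c(0.25, 0.50, 0.75), type = 5)
iqr <- unname(q[3] - q[1])

# Assumed whisker rule: most extreme points within 1.5 * IQR of the box
lower <- min(stren[stren >= q[1] - 1.5 * iqr])
upper <- max(stren[stren <= q[3] + 1.5 * iqr])

five <- c(lower = lower, q1 = unname(q[1]), median = unname(q[2]),
          q3 = unname(q[3]), upper = upper)
print(five)

# R's own boxplot computes the box from hinges, which can differ slightly
boxplot(stren, horizontal = TRUE, xlab = "Breaking strength")
```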

24

25

26 Section 3.2.3 Q-Q Plots and Comparing Distributional Shapes Most of the statistical tools we will use in this class assume normal distributions (a bell shaped distribution for the population of possible values). In order to know if these are the right tools for a particular job, we need to be able to assess if the data appear to have come from a normal population.

27 With large amounts of data, one can draw a histogram of the measured values and see if it is bell-shaped. A normal plot is a method for assessing normality that works well with big or small data sets. It gives a good visual check for normality.

28 Simulation: 100 observations, normal with mean=5, st dev=1
x <- rnorm(100, mean=5, sd=1)  # draw 100 values from a normal with mean 5, sd 1
qqnorm(x)                      # normal Q-Q plot of the sample

29 A normal plot is a plot of the data in a way such that data from normal populations will come out pretty much in a straight line. We plot the corresponding quantiles of a "standard normal" distribution versus the ordered y values.

30 In other words: to plot the data and check for normality, we compare our observed data to what we would expect from a sample of standard normal data.

31

32 So if we plot ordered values from a normal population against corresponding quantiles of a standard normal population, we expect to get a reasonably straight line, since any normal distribution is linearly related to the standard normal distribution.
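A normal plot can be built by hand to make the linear relationship visible. The sketch below assumes the plotting positions (i - 0.5)/n from Section 3.2. The seed and variable names are arbitrary, and the ordered data are placed on the vertical axis, which is one of the two layouts in use.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x <- rnorm(100, mean = 5, sd = 1)

n <- length(x)
p <- (seq_len(n) - 0.5) / n       # cumulative probability of the i-th ordered point
z <- qnorm(p)                     # corresponding standard normal quantiles
y <- sort(x)                      # ordered data values

plot(z, y, xlab = "Standard normal quantile", ylab = "Ordered data value")

# For normal data, y is roughly mean + sd * z, so a fitted line should have
# an intercept near 5 and a slope near 1 here
fit <- lm(y ~ z)
abline(fit)
print(coef(fit))
print(cor(z, y))
```

A correlation close to 1 between `z` and `y` corresponds to a plot that looks fairly straight.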

33

34 The textbook plots the standard normal quantiles on the vertical axis and the ordered data points on the horizontal axis. Many software packages and other books plot the standard normal quantiles on the horizontal axis and the ordered data points on the vertical axis. Either way, the plot should look "fairly" straight if the data are from a normal distribution.

35

36

37 Excel File of Lifetime of Springs Data

38 Section 3.3: Numerical Summaries
Measures of Location: Around what value are the data spread?
Median = Q(0.50) = 50th percentile.
Sample mean = arithmetic mean = average.
The mean is more affected by unusual values than the median.
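A quick way to see the difference in sensitivity is to corrupt one value and recompute both summaries. In this sketch the Stren data come from slide 5, while the recording error (11.1 entered as 111) is hypothetical and introduced only for illustration.

```r
# Stren breaking strengths from slide 5
stren <- c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)

# Hypothetical recording error: the first value typed as 111 instead of 11.1
bad <- stren
bad[1] <- 111

clean_summary <- c(mean = mean(stren), median = median(stren))
bad_summary   <- c(mean = mean(bad),   median = median(bad))

print(clean_summary)
print(bad_summary)
```

The single bad value roughly doubles the mean, while the median does not move.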

39 Measures of Spread:
R = Range = Biggest – Smallest
The size of the range can be affected by how many values we have. A larger set of numbers will tend to have a larger range than a smaller set.
IQR = Interquartile Range = Q(0.75) – Q(0.25)
This is the width of the range that includes the middle half of the values.

40 Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Essentially an average squared deviation from the mean.
Sample standard deviation: $s = \sqrt{s^2}$

41 Example: $x_1 = 8$, $x_2 = 9$, $x_3 = 4$
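The worked computation for this example is not reproduced in the transcript. A minimal sketch of it in R, using the n - 1 divisor that the deck discusses on slide 44, could look like this (variable names are illustrative):

```r
x <- c(8, 9, 4)
n <- length(x)

xbar <- mean(x)               # (8 + 9 + 4) / 3 = 7
dev  <- x - xbar              # deviations from the mean: 1, 2, -3
s2   <- sum(dev^2) / (n - 1)  # (1 + 4 + 9) / 2 = 7
s    <- sqrt(s2)              # square root of 7, about 2.65

print(c(mean = xbar, variance = s2, sd = s))
```

The built-in `var(x)` and `sd(x)` use the same n - 1 divisor and give the same values.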

42 Statistics and Parameters
A statistic is a numerical summary of the sample data.
$\bar{x}$ = sample mean
$s^2$ = sample variance

43 A parameter is a summary of an entire population or a theoretical distribution, for example a normal distribution.
$\mu$ = population mean
$\sigma^2$ = population variance: average squared deviation from the mean.
$\sigma$ = population standard deviation

44 For a sample of size n, the sample variance is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Why divide by n-1? This makes $s^2$ an unbiased estimator of $\sigma^2$. Unbiased means correct on average.

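45 Suppose we have a large population of ball bearings with diameters $\mu = 1$ cm and $\sigma^2 = 0.0004$.
Sample   $\bar{x}$   $s^2$
1        0.98    0.00032
2        1.03    0.00031
3        1.01    0.00045
4        1.02    0.00052
...
Mean     1.00    0.0004
If we knew $\mu$, we would find $\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$.
Fact: $\sum_{i=1}^{n}(x_i - \bar{x})^2 \le \sum_{i=1}^{n}(x_i - \mu)^2$.
So $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \le \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$, and it would be too small for $\sigma^2$. Dividing by n-1 makes $s^2$ come out right (equal to $\sigma^2$) on average.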
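The "right on average" claim can be checked by simulation. This sketch assumes normally distributed diameters with mean 1 cm and standard deviation 0.02 cm, so that the population variance is 0.0004, matching the long-run mean of $s^2$ in the table. The sample size n = 5, the number of repetitions, and the seed are arbitrary illustrative choices.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
mu    <- 1       # population mean (cm)
sigma <- 0.02    # assumed population sd, so sigma^2 = 0.0004
n     <- 5       # hypothetical sample size
reps  <- 20000   # number of simulated samples

# One simulated sample per row
samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = reps)

s2     <- apply(samples, 1, var)  # divides by n - 1
biased <- s2 * (n - 1) / n        # same sum of squares divided by n

mean_s2     <- mean(s2)
mean_biased <- mean(biased)
print(c(sigma2 = sigma^2, mean_s2 = mean_s2, mean_biased = mean_biased))
```

Averaged over many samples, the n - 1 version lands close to 0.0004, while the divide-by-n version comes out smaller by a factor of (n - 1)/n.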

46 Notice that s 2 is undefined if n=1; we can't divide by zero. This makes sense. If we have only one number, that number tells us nothing about potential spread in the population.

47 Plotting summary statistics over time is useful for issues such as quality control. Read section 3.3.4 for general information.

48

