CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS STAT 241/251.

1 CHAPTERS 1 AND 2: DESCRIPTIVE STATISTICS STAT 241/251

2 Outline
- Data: definitions and examples
- Tabular and Graphical Summaries
  - Categorical
  - Quantitative
- Data Distributions
- Numerical Summaries
  - Measures of Location
  - Measures of Spread
- Boxplots
- Data transformations

3 Definitions and Summaries Data 3

4 Data – A Definition 4

5 A Need for Organization 5

6 6

7 Variables
- Categorical:
  - Nominal:
  - Ordinal:
- Quantitative:
  - Discrete:
  - Continuous:

8 More on Variables and Data
- Why should we care about the type of variable?
- Caution: Categorical variables are often recorded using numbers (e.g. yes=1, no=0). Don’t mistake these for quantitative variables.
- Univariate Data: Data on one variable. E.g. weight or age or time or…
- Multivariate Data: Data on multiple variables. E.g. weight and age and time and…

9 Raw Data
Four algorithms have been developed for cracking coded transmissions. Trials with each of the algorithms were run and the following data were collected.
Trial  Algorithm  Time to Completion (sec)  Success
1      3          1.34                      yes
2      1          3.45                      yes
3      4          0.99                      no
…      …          …                         …
In all, 2800 trials were done. Thus the complete table is 3 by 2800 (three variables recorded for each of 2800 trials). This does not lend itself well to drawing conclusions.

10 Tables are Still Useful
- Tables can be used to summarize data.
- There are two basic types of summary tables:
  - Frequency Table
  - Relative Frequency Table
- The definition of each changes according to the type of data (Categorical or Quantitative).

11 For Categorical Data
- A Frequency Table is a table that displays the total number of cases falling into each category of a single categorical variable.
- A Relative Frequency Table displays the percentage/proportion of cases rather than the number of cases.
Success  Count
Yes      2520
No       280
Success  Percentage
Yes      90%
No       10%
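The two tables above can be rebuilt from raw outcomes. The following is a minimal sketch, assuming the raw `Success` column is available as a list of strings; the list `outcomes` is made up here to match the slide's counts and is not the original data file.

```python
from collections import Counter

# Hypothetical raw outcomes, constructed to match the slide's counts
# (2520 successes and 280 failures out of 2800 trials).
outcomes = ["yes"] * 2520 + ["no"] * 280

# Frequency table: number of cases in each category.
frequency = Counter(outcomes)

# Relative frequency table: proportion of cases in each category.
total = sum(frequency.values())
relative_frequency = {cat: count / total for cat, count in frequency.items()}

# Print both tables side by side.
for cat, count in frequency.items():
    print(f"{cat:<4} {count:>5} {relative_frequency[cat]:>6.0%}")
```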

12 Steps to making a Frequency Table for Quantitative Data
1. Identify the smallest and largest observations to obtain the range of the values for the data.
2. Divide the (adjusted) range into equal-sized, non-overlapping bins.
3. Count the number of observations (frequency) in each bin.
4. Calculate the Relative Frequency for each bin (optional).

13 Time to Completion
Time (in seconds)  Count
0 – 0.99           780
1.00 – 1.99        1276
2.00 – 2.99        614
3.00 – 3.99        130
- The bins should be of equal length.
- It should be clear where each bin starts and ends.
- Use 5-20 bins.

14 Displaying Categorical Data
- Bar Charts
- Pie Charts
- Side-by-Side Bar Charts
- Side-by-Side Pie Charts

15 Pictures Speak Louder than Tables 15

16 Pie Charts
- Pie Charts present categories as slices of a circle, where the area of each slice is proportional to the number (or proportion) of cases in each category.

17 Caution with Fancy 3-D Plots 17

18 Displaying Quantitative Data
- Histograms: the quantitative equivalent of the bar chart. A histogram is a graphical version of the frequency or relative frequency table. One difference from the bar chart is that there is no space between the bars.
- Stem and Leaf plots (not covered in this course)
- Box-plots

19 Histogram Example
Consider the following ages: 18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71
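One way to turn these ages into a histogram is to follow the four frequency-table steps for quantitative data. The sketch below assumes a bin width of 10 years, which the slide does not specify; `BIN_WIDTH` and the text-based bars are illustrative choices only.

```python
ages = [18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71]

BIN_WIDTH = 10  # assumed bin width; not given on the slide

# Step 1: the smallest and largest observations give the range.
low, high = min(ages), max(ages)

# Step 2: adjust the range to bin boundaries and split it into equal bins.
start = (low // BIN_WIDTH) * BIN_WIDTH        # left edge of the first bin
stop = (high // BIN_WIDTH + 1) * BIN_WIDTH    # right edge of the last bin
edges = list(range(start, stop, BIN_WIDTH))   # left edge of every bin

# Step 3: count the observations in each bin [edge, edge + BIN_WIDTH).
counts = {edge: 0 for edge in edges}
for age in ages:
    counts[(age // BIN_WIDTH) * BIN_WIDTH] += 1

# Step 4 (optional): relative frequency of each bin.
relative = {edge: count / len(ages) for edge, count in counts.items()}

# A text histogram: one '#' per observation, no gaps between adjacent bins.
for edge in edges:
    print(f"{edge:>2}-{edge + BIN_WIDTH - 1:<2} {'#' * counts[edge]}")
```

With this bin width the ages fall into 7 bins, which is inside the 5-20 range recommended earlier.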

20 Distributions and Numerical Summaries 20

21 Data Distributions
- The definition of a data distribution differs between categorical and quantitative variables, but the goal is the same: to characterize the behavior of the variable.
- Categorical Variable: Its distribution is the list of categories of the variable, along with the frequency of each.
- Quantitative Variable: Listing every possible value with its frequency is not practical, so we need to describe other features.

22 Describing Distributions for Quantitative Data
- Shape
- Center
- Spread
The disadvantage here is that we don’t have as good a grasp of the data as we do with categorical variables, but the advantage is that we can work with these numerically.

23 Shape part 1 - Modes
- Does the distribution (viewed using a histogram, say) have no humps, one hump or more than one hump?
- We call the humps modes (the most popular value(s) the variable can take on).
- A distribution with no modes is called uniform.
- A distribution with one mode is called unimodal.
- A distribution with two modes is called bimodal.
- A distribution with many modes is called multimodal.

24 Shape part 2 - Symmetry
- If we cut the distribution at the center and find an approximate mirror image on both sides, the distribution is said to be symmetric.
- The ends of the distribution are known as its tails.

25 Skewed
- If the distribution is not symmetric, then we say that it is skewed.
- We say it is skewed in the direction of the longer tail. Thus it can be left/negatively skewed or right/positively skewed.

26 Shape part 3 - Outliers
- An Outlier is an observation that is quite far from the ‘body’ of the distribution.
- Outliers can cause problems with just about every method we will discuss in this course, so they must be identified.
- In some cases, outliers are removed, but this must be done with great caution.
- If an outlier is removed, this should be mentioned in any subsequent conclusion/discussion.

27 Numerical Summaries
- The center and spread of the data are described numerically using summary statistics.
- These try to communicate as much as possible about the data.
- The shape of the distribution plays an important role in the choice of summary statistics.

28 Center 1 – Median
- The median is the middle observation of the ordered list.
- Calculating the median:
  - Order the data (usually from lowest to highest).
  - If there is an odd number of observations, select the middle observation.
  - If there is an even number of observations, take the average of the two middle observations.
- E.g. if there are 7 observations, take the 4th; if there are 8 observations, take the average of observations 4 and 5.
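The median rule above translates directly into a few lines of code. This is a small sketch rather than a reference implementation; the function name `median` and the two example lists are made up for illustration.

```python
def median(values):
    """Return the median: the middle value of the ordered data."""
    ordered = sorted(values)          # order from lowest to highest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation.
        return ordered[mid]
    # Even count: the average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data: 7 observations, so the 4th ordered value is the median.
seven = [5, 1, 9, 3, 7, 2, 8]
print(median(seven))   # 5

# 8 observations: average of the 4th and 5th ordered values.
eight = seven + [10]
print(median(eight))   # 6.0
```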

29 Center 2 – Mean
- The mean is simply the average of the observations.
- We’ll use the mean extensively in this course.
Let $y_i$ be the $i$th observation of variable $y$, and let $n$ be the number of observations. Then we denote the mean using $\bar{y}$ and calculate it using:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

30 More Mean
- The mean summarizes the center well if the distribution is symmetric, unimodal and there are no outliers.
- Otherwise the median is a better choice.
- At this point, it may appear that the median is the natural choice to calculate the center, but the mean is often favoured. The reason is somewhat beyond our scope for the moment.

31 An exercise
- Consider two small data sets:
  - 1, 3, 7, 2, 12, 4, 8, 5, 8
  - 1, 3, 7, 2, 12, 4, 8, 5, 8, 120
- Calculate the mean and median of each.
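One way to check answers to this exercise is with Python's standard `statistics` module. The variable names below are illustrative; the point of the sketch is to show how the single extra value 120 moves the mean far more than the median.

```python
import statistics

set_a = [1, 3, 7, 2, 12, 4, 8, 5, 8]
set_b = set_a + [120]   # same data plus one extreme observation

for name, data in (("without 120", set_a), ("with 120", set_b)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name:<12} mean = {mean:.2f}  median = {median:.2f}")

# The median moves from 5 to 6, while the mean jumps from 50/9 to 17.
```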

32 Spread 1 – Interquartile Range
- The IQR is the range of the middle 50% of the data.
- The first quartile (Q1) is the value with 25% of the ordered data below it.
- The third quartile (Q3) is the value with 75% of the ordered data below it.
- IQR = Q3 - Q1
- It is a number, not an interval.

33 Spread 2 – Variance and SD
- The Variance is the ‘average’ of the squared differences (or deviations) from the mean.
- The Standard Deviation is the square root of the variance.
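The two definitions can be spelled out numerically. The sketch below assumes the sample version of the variance, in which the sum of squared deviations is divided by n-1 (the divisor the next slide asks about), and reuses the first data set from the exercise slide. The cross-check uses `statistics.variance` and `statistics.stdev`, which also divide by n-1.

```python
import math
import statistics

data = [1, 3, 7, 2, 12, 4, 8, 5, 8]   # first data set from the exercise slide
n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(y - mean) ** 2 for y in data]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: square root of the variance (same units as the data).
sd = math.sqrt(variance)

print(f"variance = {variance:.3f}, sd = {sd:.3f}")

# Cross-check against the standard library.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))
```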

34 Properties of SD and Variance
- They cannot be negative.
- The SD has the same units as the observations/mean.
- They are zero iff all observations have the same value (i.e. there is no spread!).
- The larger they are, the more spread out the data are.
- So why divide by n-1? Why not just take the average?

35 What we should retain
- The median and IQR are good measures of center and spread even when the distribution is skewed or has outliers.
- The mean and variance are good measures of center and spread when the distribution is symmetric, has no outliers and is not multimodal.
- When the distribution is symmetric, has no outliers and is not multimodal, we typically only report the mean and SD.
- If the distribution is multimodal, then summary statistics are not appropriate.

36 Boxplots 36

37 Box-plot
[Figure: box-plot with labels Q3 (3rd quartile), median, Q1 (1st quartile), and the upper and lower whiskers]

38 Box-plot
- The Box-plot is a visual display of the 5-number summary.
- It is useful for comparing two or more distributions.

39 Constructing a Boxplot
- Step 1 – the Box: Identify the Median, 1st and 3rd Quartiles and complete the box.
- Step 2 – the Fences: Fences are only used for construction purposes. The fences are 1.5 × IQR away from each Quartile (below Q1 and above Q3).
- Step 3 – the Whiskers: Extend a line from each quartile to the most extreme observation within the fences. The whiskers end at these points.
- Step 4 – Outliers: Any observation outside the fences should be drawn in as an individual point. These are potential outliers.

40 Lifetime of Pacemakers
Replacing a pacemaker is a big deal. Data were collected on pacemaker lifetimes (in years). Here are the raw data:
12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2
Summary Statistics: Median = 11.6, Q1 = 10.4, Q3 = 12.9
Draw an appropriate boxplot.
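The construction steps can be checked numerically before drawing anything. The sketch below takes the median and quartiles exactly as given on the slide rather than recomputing them, since software packages use differing quartile conventions; the variable names are illustrative.

```python
lifetimes = [12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9,
             20.4, 11.1, 15.5, 12.4, 10.4, 10.7, 10.2]

# Step 1 - the box: summary statistics as given on the slide.
q1, median, q3 = 10.4, 11.6, 12.9

# Step 2 - the fences: 1.5 x IQR beyond each quartile (construction only).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 3 - the whiskers: most extreme observations inside the fences.
inside = [x for x in lifetimes if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)

# Step 4 - potential outliers: observations outside the fences.
outliers = sorted(x for x in lifetimes if x < lower_fence or x > upper_fence)

print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f})")
print(f"whiskers end at {lower_whisker} and {upper_whisker}")
print(f"potential outliers: {outliers}")
```

With these numbers the whiskers stop at 9.2 and 15.5 years, and the lifetimes 1.2 and 20.4 are drawn as individual points.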

41 Boxplot Questions
- When are box-plots inappropriate?
- When do we favor box-plots over histograms?
- Explain why a point identified as an outlier by a box-plot may not be an outlier.

42 Linear Transformations! 42

43 Purpose
- There are many situations which lead to the need to linearly transform data.
- E.g. Your firm sends some temperature readings to an American firm, so it transforms the readings from degrees Celsius to degrees Fahrenheit.
- When we transform the data, what happens to summary statistics?

44 What are you measuring?
- How a numerical measure is affected depends on what it measures: location or spread.
- We create three classes of numerical summaries, which are affected differently by transformations:
  - Measures of Location: Mean, Median, Midrange, Quartiles, Percentiles, Min and Max
  - Measures of Spread: Standard Deviation, IQR, Range
  - Variance

45 Measures of Location
- These are affected by both adding/subtracting and multiplying/dividing.
- Let m be the current measure of location, then:
  - Adding the constant a will result in the new measure: m’ = m + a
  - Multiplying by b will lead to: m’ = bm
  - Using the linear function f(x) = a + bx will lead to: m’ = a + bm

46 Measures of Spread
- These are affected by multiplying/dividing, but not by adding/subtracting.
- Let m be the current measure of spread, then:
  - Adding the constant a will result in the new measure: m’ = m
  - Multiplying by b will lead to: m’ = |b|m (for positive b this is simply bm; the absolute value is needed because a measure of spread cannot be negative)
  - Using the linear function f(x) = a + bx will lead to: m’ = |b|m

47 Variance
- Same as measures of spread, but the effect is different.
- Let v be the variance, then:
  - Adding the constant a will result in the new measure: v’ = v
  - Multiplying by b will lead to: v’ = b^2 v
  - Using the linear function f(x) = a + bx will lead to: v’ = b^2 v
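The three transformation rules can be verified on a small example. The sketch below uses the Celsius to Fahrenheit conversion mentioned earlier, f(x) = 32 + 1.8x; the temperature readings are hypothetical and chosen only for illustration.

```python
import math
import statistics

celsius = [12.0, 15.5, 9.0, 21.0, 18.5]   # hypothetical readings
a, b = 32.0, 1.8                          # Fahrenheit = a + b * Celsius
fahrenheit = [a + b * c for c in celsius]

# Measures of location: m' = a + b*m
assert math.isclose(statistics.mean(fahrenheit),
                    a + b * statistics.mean(celsius))
assert math.isclose(statistics.median(fahrenheit),
                    a + b * statistics.median(celsius))

# Measures of spread: m' = |b|*m (adding a has no effect)
assert math.isclose(statistics.stdev(fahrenheit),
                    abs(b) * statistics.stdev(celsius))

# Variance: v' = b^2 * v
assert math.isclose(statistics.variance(fahrenheit),
                    b ** 2 * statistics.variance(celsius))

print("All three transformation rules hold for this example.")
```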

48 Questions
- The average weight of watermelons from a farm is 4.3 kg with a SD of 1.5 kg. What are the mean and SD of the weight in lbs?
- The median and IQR on a final exam are 50 and 22. The instructor decides to multiply the results by 1.13 and add 5 to each grade. What are the summary statistics now?
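Both questions apply the location and spread rules directly. The following sketch is one way to work them through; it assumes the approximate conversion factor 1 kg ≈ 2.20462 lb, which is not stated on the slide, and the helper function names are made up for illustration.

```python
KG_TO_LB = 2.20462   # assumed conversion factor (approximate)


def transform_location(m, a, b):
    """Location measure after applying f(x) = a + b*x."""
    return a + b * m


def transform_spread(s, b):
    """Spread measure after applying f(x) = a + b*x (a has no effect)."""
    return abs(b) * s


# Question 1: kilograms to pounds is f(x) = 0 + KG_TO_LB * x.
mean_lb = transform_location(4.3, 0.0, KG_TO_LB)
sd_lb = transform_spread(1.5, KG_TO_LB)
print(f"mean = {mean_lb:.2f} lb, SD = {sd_lb:.2f} lb")

# Question 2: new grade = 5 + 1.13 * old grade.
new_median = transform_location(50, 5, 1.13)
new_iqr = transform_spread(22, 1.13)
print(f"median = {new_median:.2f}, IQR = {new_iqr:.2f}")
```

Under the assumed conversion factor this gives a mean of about 9.48 lb and an SD of about 3.31 lb; the exam's new median is 61.5 and its new IQR is 24.86.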

