2 Plan of attack
  Distinguish different types of variables
  Summarize data numerically
  Summarize data graphically
  Use theoretical distributions to potentially learn more about a variable
3 The five steps of statistical analyses
  Form the question
  Collect data
  Model the observed data
    We start with exploratory techniques.
  Check the model for reasonableness
  Make and present conclusions
4 Just to make sure we are on the same page
  More (or repeated) vocabulary
  Individuals are the objects described by a set of data
    examples: employees, lab mice, states…
  A variable is any characteristic of an individual that is of interest to the researcher. It takes on different values for different individuals
    examples: age, salary, weight, location…
  How is this different from a mathematical variable?
5 Just to make sure we are on the same page #2
  Measurement: the value of a variable obtained and recorded on an individual
    Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
  Data is a set of measurements made on a group of individuals
  The distribution of a variable tells us what values it takes and how often it takes these values
6 Two Types of Variables
  A categorical/qualitative variable places an individual into one of several groups or categories
    examples: Gender, Race, Job Type, Geographic location…
    JMP calls these variables nominal
  A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
    examples: Height, Age, Salary, Price, Cost…
    Can be further divided into ordinal and continuous
  Why two types?
    Each requires its own summaries (graphical and numerical) and analysis.
    I can’t emphasize enough the importance of identifying the type of variable being considered before proceeding with any type of statistical analysis
9 Exploratory data analysis
  Statistical tools that help examine data in order to describe their main features
  Basic strategy
    Examine variables one by one, then look at the relationships among the different variables
    Start with graphs, then add numerical summaries of specific aspects of the data
10 Exploratory data analysis: One variable
  Graphical displays
    Qualitative/categorical data: bar chart, pie chart, etc.
    Quantitative data: histogram, stem-and-leaf plot, boxplot, time plot, etc.
  Summary statistics
    Qualitative/categorical: contingency tables
    Quantitative: mean, median, standard deviation, range, etc.
  Probability models
    Qualitative: Binomial distribution (others we won’t cover in this class)
    Quantitative: Normal curve (others we won’t cover in this class)
12 Summary table
  We summarize categorical data using a table. Note that percentages are often called relative frequencies.
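A summary table of counts and relative frequencies can be sketched in a few lines of Python. The degree labels and the ten observations below are made up for illustration and are not the data behind the slide.

```python
from collections import Counter

# Hypothetical data: highest degree held by ten people (not from the slides).
degrees = ["Bachelor", "Master", "Bachelor", "PhD", "Associate",
           "Bachelor", "Master", "Associate", "Bachelor", "Master"]

def summary_table(values):
    """Return {category: (count, relative frequency)} for categorical data."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, n / total) for category, n in counts.items()}

table = summary_table(degrees)
for category, (count, rel_freq) in sorted(table.items()):
    # Relative frequency is the count divided by the total number of individuals.
    print(f"{category:<10} {count:>3} {rel_freq:>6.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no category was left out.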
13 Bar graph
  The bar graph quickly compares the degrees of the four groups
  The heights of the four bars show the counts for the four degree categories
14 Pie chart
  A pie chart helps us see what part of the whole each group forms
  To make a pie chart, you must include all the categories that make up a whole
15 Summary of categorical variables
  Graphically: bar graphs, pie charts
    A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
  Numerically: tables with total counts or percents
20 Histograms
  Where did the bins come from?
    They were chosen rather arbitrarily
  Does choosing other bins change the picture?
    Yes!! And sometimes dramatically
  What do we do about this?
    Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
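To see how the choice of bins changes the picture, here is a minimal sketch that counts the same values with two different bin widths. The ages and the helper `bin_counts` are hypothetical; this does not implement any particular "optimal" bin-width rule.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in the bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical ages, not taken from the slides.
ages = [21, 22, 23, 24, 29, 30, 31, 38, 39, 40, 41, 42]

wide = bin_counts(ages, start=20, width=10, n_bins=3)
narrow = bin_counts(ages, start=20, width=5, n_bins=6)

for label, counts in (("width 10", wide), ("width 5", narrow)):
    print(label)
    for count in counts:
        print("  " + "*" * count)
```

With bins of width 10 the counts fall steadily, while bins of width 5 show a dip in the late twenties that the wider bins hide.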
21 Histogram
  The purpose of a graph is to help us understand the data
  After you make a graph, always ask, “What do I see?”
  Once you have displayed a distribution you can see the important features
22 Histograms
  We will describe the features of the distribution that the histogram is displaying with three characteristics
  Shape
    Symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
  Center
    Mean, median
  Spread (outliers or not)
    Standard deviation, Inter-quartile range
24 Incomes from 500 households in the 2000 Current Population Survey
25 Histogram vs. Bar graph
  Spaces mean something in histograms but not in bar graphs
  Shape means nothing with bar graphs
  The biggest difference is that they are displaying fundamentally different types of variables
26 Time Plots
  Many variables are measured at intervals over time
  Examples
    Closing stock prices
    Number of hurricanes
    Unemployment rates
  If the interest in a variable is to see how it changes over time, use a time plot
27 Time Plots
  Patterns to look for
    Patterns that repeat themselves at known regular intervals of time are called seasonal variation
    A trend is a persistent, long-term rise or fall
29 Numerical summaries of quantitative variables
  Want a numerical summary for center and spread
  Center
    Mean
    Median
    Mode
  Spread
    Range
    Inter-quartile range
    Standard deviation
  The 5 number summary is a popular collection of the following
    min, 1st quartile, median, 3rd quartile, max
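The 5 number summary can be computed with Python's standard library, as in the sketch below. The data are hypothetical, and the quartile rule used (`method="inclusive"`) is only one of several conventions; other software, possibly including the course software, may report slightly different quartiles.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one common quartile convention (an assumption here);
    # other conventions can give slightly different Q1 and Q3.
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# Hypothetical data, not taken from the slides.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]
low, q1, med, q3, high = five_number_summary(data)

print("five number summary:", (low, q1, med, q3, high))
print("range:", high - low)
print("inter-quartile range:", q3 - q1)
```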
30 Mean
  To find the mean of a set of observations, add their values and divide by the number of observations
  equation 1: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
  equation 2: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
31 Mean example
  The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    Does the average age change?
    If so, what is the new average age?
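One way to work this example is to go from the mean back to the total, adjust the total, and divide again. A short sketch of that arithmetic (the variable names are illustrative only):

```python
n_people = 20
old_mean = 25

# The mean is total / n, so the total of all ages is n * mean.
old_total = n_people * old_mean      # 500

# A 28 year old leaves and a 30 year old enters; the head count stays at 20.
new_total = old_total - 28 + 30      # 502
new_mean = new_total / n_people

print(f"new mean age: {new_mean}")
```

The total rises by 2, so the mean rises by 2/20 = 0.1.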
32 Median
  The median is the midpoint of a distribution
    The number such that half the observations are smaller and the other half are larger
    Also called the 50th percentile or 2nd quartile
  To compute a median
    Order the observations
    If the number of observations is odd, the median is the center observation
    If the number of observations is even, the median is the average of the two center observations
33 Median example
  The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    Does the median age change?
    If so, what is the new median age?
  The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
34 Mean vs Median
  When the histogram is symmetric, the mean and median are similar
  The mean and median are different when the histogram is skewed
    Skewed to the right: the mean is larger than the median
    Skewed to the left: the mean is smaller than the median
  The business magazine Forbes estimates that the “average” household wealth of its readers is either about $800,000 or about $2.2 million, depending on which “average” it reports. Which of these numbers is the mean wealth and which is the median wealth? Why?
38 Extreme example
  Income in a small town of 6 people
    $25,000 $27,000 $29,000 $35,000 $37,000 $38,000
    Mean is $31,833 and median is $32,000
  Bill Gates moves to town
    $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000
    Mean is $5,741,571 and median is $35,000
  The mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
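The numbers on this slide can be checked with a short sketch. It assumes the second list is the six original incomes plus a $40,000,000 newcomer, which reproduces the mean of $5,741,571 stated on the slide.

```python
from statistics import mean, median

# Incomes of the six residents, as listed on the slide.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# Assumption: the newcomer's $40,000,000 is added to the same six incomes.
with_newcomer = incomes + [40_000_000]

for label, values in (("before", incomes), ("after", with_newcomer)):
    print(f"{label}: mean = ${mean(values):,.0f}, median = ${median(values):,.0f}")
```

One extreme value moves the mean by millions of dollars, while the median only shifts from $32,000 to $35,000.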
39 Is a central measure enough?
  A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?
40 Measures of spread
  Range: subtract the smallest value from the largest
  Inter-quartile range: subtract the 1st quartile from the 3rd quartile
  Standard Deviation (SD): “average” distance from the mean
  Which one should we use?
41 Standard Deviation
  The standard deviation looks at how far observations are from their mean
  It is the square root of the average squared deviation from the mean
    Compute the distance of each value from the mean
    Square each of these distances
    Take the average of these squares, then take the square root
  Often we will use SD to denote standard deviation
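The three steps above translate directly into code. This sketch follows the slide's definition and divides by the number of observations; many software packages divide by one less than that number, so their SD can come out slightly larger. The data are hypothetical.

```python
from math import sqrt

def population_sd(values):
    """SD as described on the slide: root of the average squared deviation."""
    n = len(values)
    center = sum(values) / n
    # Step 1: distance of each value from the mean.
    deviations = [v - center for v in values]
    # Step 2: square each distance.
    squared = [d * d for d in deviations]
    # Step 3: average the squares (dividing by n), then take the square root.
    return sqrt(sum(squared) / n)

# Hypothetical data, not taken from the slides; the mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("SD:", population_sd(data))
```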
45 Problem from text (p. 74, #2)
  Which of the following sets of numbers has the smaller SD?
    a) 50, 40, 60, 30, 70, 25, 75
    b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
  Repeat for these two sets
    c) 50, 40, 60, 30, 70, 25, 75
    d) 50, 40, 60, 30, 70, 25, 75, 99, 1
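After reasoning about the problem by hand, the answer can be checked numerically. The sketch below uses `statistics.pstdev`, which divides by the number of observations, matching the "average squared deviation" definition of SD.

```python
from statistics import mean, pstdev

set_a = [50, 40, 60, 30, 70, 25, 75]
set_b = set_a + [50, 50, 50]   # adds three values equal to the mean
set_c = list(set_a)            # set c) is the same list as set a)
set_d = set_a + [99, 1]        # adds two values far from the mean

for name, values in (("a", set_a), ("b", set_b), ("c", set_c), ("d", set_d)):
    print(f"{name}) mean = {mean(values)}, SD = {pstdev(values):.2f}")
```

All four sets have mean 50. Adding values at the mean pulls the SD down, and adding values far from the mean pushes it up.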
46 More intuition behind the SD
  This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with repeats allowed.
    Give a list of six numbers with the largest standard deviation such a list can possibly have.
    Give a list of six numbers with the smallest standard deviation such a list can possibly have.
47 Properties of SD
  SD ≥ 0. (When is SD = 0?)
  Has the same unit of measurement as the original observations
  Inflated by outliers
48 Mean and SD
  What happens to the mean if you add 5 to every number in a list?
  What happens to the SD?
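A quick numerical experiment can confirm the answer to both questions. The list below is hypothetical, and `pstdev` divides by the number of observations.

```python
from statistics import mean, pstdev

# Hypothetical list, not taken from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [x + 5 for x in data]   # add 5 to every number

print("mean:", mean(data), "->", mean(shifted))
print("SD:  ", pstdev(data), "->", pstdev(shifted))
```

Every value moves by the same amount, so the center moves by 5 while the distances from the center, and hence the SD, stay the same.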
49 Standard deviation
  SDs are like measurement units on a ruler
  Any quantitative variable can be converted into “standardized” units
    These are often called z-scores and are denoted by the letter z
  Important formula
    $z = \dfrac{\text{observation} - \text{mean}}{\text{SD}}$
  Example: ACT versus SAT scores
    Which is more impressive: a 1340 on the SAT, or a 32 on the ACT?
50 The normal curve
  When the histogram looks like a bell-shaped curve, z-scores are associated with percentages
  The percentage of the data in between two different z-score values equals the area under the normal curve in between the two z-score values
  A bit of notation here
    N(μ, σ) is shorthand for the normal curve with mean μ and standard deviation σ (get used to this notation as it will be used fairly regularly throughout the course)
53 Properties of normal curve
  In the Normal distribution with mean μ and standard deviation σ:
    68% of the observations fall within 1σ of μ
    95% of the observations fall within 2σ of μ
    99.7% of the observations fall within 3σ of μ
  By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
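The 68, 95 and 99.7 figures are rounded areas under the normal curve. They can be reproduced with `statistics.NormalDist` from Python's standard library (version 3.8 or later), as in this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Area under the N(0, 1) curve between -k and +k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

Because z-scores put every normal curve on the same scale, the same areas hold for any mean and standard deviation.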
54 Properties of normal curves For a N(0,1) the following holds
55 IQ
  A person is considered to have mental retardation when
    IQ is below 70
    Significant limitations exist in two or more adaptive skill areas
    Condition is present from childhood
  What percentage of people have an IQ that meets the first criterion of mental retardation?
56 IQ
  A histogram of all people’s IQ scores has μ = 100 and σ = 16
  How do we get the % of people with IQ < 70?
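One way to answer this is to convert 70 to a z-score and then look up the area to its left under the normal curve. The sketch below assumes IQ scores follow a normal curve with μ = 100 and σ = 16, as stated on this slide, and uses `statistics.NormalDist` for the area.

```python
from statistics import NormalDist

mu, sigma = 100, 16            # values given on the slide
cutoff = 70

# Standardize: how many SDs below the mean is an IQ of 70?
z = (cutoff - mu) / sigma

# Area under the normal curve to the left of the cutoff.
share_below = NormalDist(mu, sigma).cdf(cutoff)

print(f"z-score of {cutoff}: {z}")
print(f"share of people with IQ < {cutoff}: {share_below:.1%}")
```

The result is close to the 2.5% that the 68-95-99.7 rule gives for values more than 2 SDs below the mean, since 70 sits just under 2 SDs below 100.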
57 More IQ
  Reggie Jackson, one of the greatest baseball players ever, has an IQ of [value missing]. What percentage of people have bigger IQs than Reggie?
  Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of [value missing]. What percentage of people have IQ scores smaller than Marilyn’s score?
  Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
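The Mensa question runs the calculation backwards: start from a percentage and find the score. A sketch, assuming the same normal curve as the earlier IQ slide (μ = 100, σ = 16), is given below. The first two questions are not worked here because the IQ values are missing from the text.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumed from the earlier IQ slide

# Upper 2% means 98% of people score below the cutoff.
cutoff = iq.inv_cdf(0.98)

# The same cutoff expressed as a z-score.
z = (cutoff - iq.mean) / iq.stdev

print(f"z-score for the top 2%: {z:.2f}")
print(f"IQ needed: about {cutoff:.1f}")
```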
58 Checking if data follow normal curve
  Look for a symmetric histogram
  A different method is a normal probability plot. When the normal curve is a good fit, points fall on a nearly straight line
59 Measurement error
  Measurement error model
    Measurement = truth + chance error
  Outliers
  Bias affects all measurements in the same way
    Measurement = truth + bias + chance error
  Often we assume that the chance error follows a normal curve that is centered at 0
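The measurement error model can be illustrated with a small simulation. The true value, bias, and chance-error SD below are hypothetical numbers chosen only for illustration; the chance error is drawn from a normal curve centered at 0, as assumed above.

```python
import random
from statistics import mean, pstdev

def simulate_measurements(truth, bias, chance_sd, n, seed=0):
    """Generate n readings of the form truth + bias + chance error."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return [truth + bias + rng.gauss(0, chance_sd) for _ in range(n)]

# Hypothetical settings: true value 10.0, bias 0.5, chance-error SD 0.2.
readings = simulate_measurements(truth=10.0, bias=0.5, chance_sd=0.2, n=10_000)

print(f"average reading: {mean(readings):.3f}")
print(f"SD of readings:  {pstdev(readings):.3f}")
```

Averaging many readings cancels most of the chance error, so the average settles near truth + bias rather than near the truth. Taking more measurements does not remove bias.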