Presentation on theme: "Describing Data: One Variable"— Presentation transcript:
1. Describing Data: One Variable. STAT 101, Dr. Kari Lock Morgan. Sections 2.1, 2.2, 2.3, 2.4: one categorical variable (2.1); one quantitative variable (2.2, 2.3, 2.4).
2. Announcements: Homework 1 is due now; turn it in according to lab section. Clicker grading starts today!
3. Why not always randomize? Randomized experiments are ideal, but sometimes they are not ethical or possible. Often, you have to do the best you can with data from observational studies. Example: research for the Supreme Court case on whether preferences for minorities in university admissions help or hurt minority students.
4. Randomization in Data Collection:
Was the sample randomly selected? Yes: possible to generalize to the population. No: should not generalize to the population.
Was the explanatory variable randomly assigned? Yes: possible to make conclusions about causality. No: cannot make conclusions about causality.
5. Two Fundamental Questions in Data Collection: [Diagram: Population to Sample (random sample?); Sample to DATA (randomized experiment?)]
6. Randomization: Doing a randomized experiment on a random sample is ideal, but rarely achievable. If the focus of the study is using a sample to estimate a statistic for the entire population, you need a random sample, but do not need a randomized experiment (example: election polling). If the focus of the study is establishing causality from one variable to another, you need a randomized experiment and can settle for a non-random sample (example: drug testing).
7. Review from Last Class: Association does not imply causation! In observational studies, confounding variables almost always exist, so causation cannot be established. Randomized experiments involve randomly determining the level of the explanatory variable. Randomized experiments prevent confounding variables, so causality can be inferred. A control or comparison group is necessary. The placebo effect exists, so a placebo and blinding should be used.
8. Descriptive Statistics: The Big Picture. [Diagram labels: Population; Sampling; Sample; Statistical Inference; Descriptive Statistics]
9. Descriptive Statistics: In order to make sense of data, we need ways to summarize and visualize it. Summarizing and visualizing variables and relationships between two variables is often known as descriptive statistics (also known as exploratory data analysis). The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).
10. One Categorical Variable: A random sample of US adults in 2012 was surveyed regarding the type of cell phone owned: Android? iPhone? Blackberry? Non-smartphone? No cell phone?
11. Cell Phones: Which type of cell phone do you own? Android; iPhone; Blackberry; Non-smartphone; No cell phone.
12. Frequency Table: A frequency table shows the number of cases that fall in each category. US data:
Android          458
iPhone           437
Blackberry       141
Non Smartphone   924
No cell phone    293
Total           2253
R: table(x)
13. The proportion in a category is found by: proportion = (number of cases in that category) / (total number of cases). Proportion for a sample: $\hat{p}$ (“p-hat”). Proportion for a population: $p$.
14. Proportion: What proportion of adults sampled do not own a cell phone? From the frequency table (No cell phone: 293; Total: 2253), $\hat{p} = 293/2253 \approx 0.13$, or 13%. Proportions and percentages can be used interchangeably.
15. Relative Frequency Table: A relative frequency table shows the proportion of cases that fall in each category. All the numbers in a relative frequency table sum to 1.
Android          0.203
iPhone           0.194
Blackberry       0.063
Non Smartphone   0.410
No cell phone    0.130
R: table(x)/length(x)
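The frequency table, the sample proportion, and the relative frequency table can be reproduced in R from the counts on the slides. This is a minimal sketch: the names `counts` and `phone` are hypothetical, and the case-level data are rebuilt from the published counts rather than read from the original survey file.

```r
# Counts copied from the frequency table on the slides.
counts <- c(Android = 458, iPhone = 437, Blackberry = 141,
            "Non Smartphone" = 924, "No cell phone" = 293)

# Rebuild one value per case (hypothetical variable name `phone`).
phone <- factor(rep(names(counts), times = counts), levels = names(counts))

freq <- table(phone)               # frequency table: count per category
rel_freq <- freq / length(phone)   # relative frequency table: proportion per category

print(freq)
print(round(rel_freq, 3))

# Sample proportion (p-hat) of adults with no cell phone.
p_hat <- as.numeric(freq["No cell phone"]) / length(phone)
print(round(p_hat, 2))
```

The rounded proportions should match the relative frequency table above, and they sum to 1 up to rounding.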
16. Bar Chart/Plot/Graph: In a bar chart, the height of each bar is the number of cases falling in that category. R: barchart(x)
17. Pie Chart: In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category. R: pie(table(x))
22. Dotplot: In a dotplot, each case is represented by a dot, and dots are stacked. It is an easy way to see each case. Highest is Harry Potter and the Deathly Hallows Part 2; second is Transformers; third is Pirates of the Caribbean: On Stranger Tides. Units: millions of $.
23. Histogram: The height of each bar corresponds to the number of cases within that range of the variable. R: hist(x)
24. Histogram vs Bar Chart: This is a: Histogram; Bar chart; Other; I have no idea.
25. Histogram vs Bar Chart: This is a: Histogram; Bar chart; Other; I have no idea.
26. Histogram vs Bar Chart: A bar chart is for categorical data, and the x-axis has no numeric scale. A histogram is for quantitative data, and the x-axis is numeric. For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed. For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different numbers of bars.
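To see how the number of bars changes a histogram, the same data can be binned twice. This is a sketch with made-up values (`x` is hypothetical, not data from the slides), using `plot = FALSE` so that `hist()` returns the bin counts instead of drawing.

```r
# Hypothetical quantitative data between 0 and 100.
x <- c(2, 5, 8, 11, 13, 17, 21, 24, 33, 38, 45, 58, 72, 95)

# Same data, two choices of bin width; plot = FALSE returns the counts only.
few  <- hist(x, breaks = seq(0, 100, by = 25), plot = FALSE)
many <- hist(x, breaks = seq(0, 100, by = 10), plot = FALSE)

print(few$counts)    # 4 bars
print(many$counts)   # 10 bars

# Every case is counted once in either version.
print(c(sum(few$counts), sum(many$counts)))
```

Dropping `plot = FALSE` draws the two histograms, which show the same right-skewed data at different levels of detail.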
27. Shape: [Figure labels: Symmetric; Right-skewed (long right tail); Left-skewed]
28. Notation: The sample size, the number of cases in the sample, is denoted by $n$. We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$. $x_1 =$ , $x_2 =$ , $x_3 =$ , …
29. Mean: The mean or average of the data values is
$$\text{mean} = \frac{\text{sum of all data values}}{\text{number of data values}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}$$
Sample mean: $\bar{x}$. Population mean: $\mu$ (“mu”). R: mean(x)
30. Median: The median, $m$, is the middle value when the data are ordered. If there is an even number of values, the median is the average of the two middle values. The median splits the data in half. R: median(x)
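The two definitions can be checked by hand against R's built-in functions. This is a minimal sketch on eight made-up values (`x` is hypothetical, not data from the slides).

```r
# Hypothetical data: eight made-up values.
x <- c(3, 7, 8, 5, 12, 14, 21, 13)
n <- length(x)

# Mean: sum of all data values divided by the number of data values.
mean_by_hand <- sum(x) / n

# Median: order the data; with an even n, average the two middle values.
sorted <- sort(x)
median_by_hand <- if (n %% 2 == 0) {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
} else {
  sorted[(n + 1) / 2]
}

print(c(by_hand = mean_by_hand, built_in = mean(x)))       # 10.375 both ways
print(c(by_hand = median_by_hand, built_in = median(x)))   # 10 both ways
```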
31. Measures of Center: [Figure: distribution of World Gross (in millions)] Median: $m = 76.66$. Mean: $\bar{x} = 150.74$. The mean is “pulled” in the direction of skewness.
32. Skewness and Center: A distribution is left-skewed. Which measure of center would you expect to be higher? Mean; Median. Answer: Median. The mean will be pulled down in the direction of the skew (toward the long tail).
33. Outlier: An outlier is an observed value that is notably distinct from the other values in a dataset.
34. Outliers: [Figure: World Gross (in millions), with outliers labeled: Transformers; Harry Potter; Pirates of the Caribbean]
35. Resistance: A statistic is resistant if it is relatively unaffected by extreme values. The median is resistant while the mean is not.
                       Mean           Median
With Harry Potter      $150,742,300   $76,658,500
Without Harry Potter   $141,889,900   $75,009,000
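The same comparison can be run on any dataset by dropping the most extreme value and recomputing. This is a sketch with made-up values (`x` is hypothetical; it is not the movie data).

```r
# Hypothetical values: one value is far larger than the rest.
x <- c(12, 15, 18, 20, 22, 25, 30, 400)
without_extreme <- x[x != max(x)]

# How much does each measure of center move when the extreme value is removed?
change_in_mean   <- mean(x) - mean(without_extreme)
change_in_median <- median(x) - median(without_extreme)

print(c(mean = change_in_mean, median = change_in_median))
```

The mean shifts by far more than the median, which is what "not resistant" versus "resistant" means in practice.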
36. Outliers: When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake. If not, you have to decide whether the outlier is part of your population of interest or not. Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.
37. Standard Deviation: The standard deviation for a quantitative variable measures the spread of the data.
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Sample standard deviation: $s$. Population standard deviation: $\sigma$ (“sigma”). R: sd(x)
38. Standard Deviation: The standard deviation gives a rough estimate of the typical distance of a data value from the mean. The larger the standard deviation, the more variability there is in the data and the more spread out the data are.
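The sample standard deviation formula (square root of the sum of squared deviations from the mean, divided by n − 1) can be checked against `sd()`. This is a minimal sketch on made-up values (`x` is hypothetical).

```r
# Hypothetical data: eight made-up values.
x <- c(3, 7, 8, 5, 12, 14, 21, 13)
n <- length(x)

# Deviations from the mean, squared, summed, divided by n - 1, then square-rooted.
deviations <- x - mean(x)
s_by_hand <- sqrt(sum(deviations^2) / (n - 1))

print(c(by_hand = s_by_hand, built_in = sd(x)))
```

Both values agree because `sd()` uses the same n − 1 denominator.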
39. Standard Deviation: Both of these distributions are bell-shaped.
40. 95% Rule: If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean. For a population, about 95% of the data will be between $\mu - 2\sigma$ and $\mu + 2\sigma$.
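The 95% rule can be checked by simulation. This is a sketch under stated assumptions: the data are drawn from a normal distribution with an arbitrary mean of 50 and standard deviation of 10 (both hypothetical), and the seed is fixed only to make the run repeatable.

```r
set.seed(101)                            # arbitrary seed, for repeatability
x <- rnorm(10000, mean = 50, sd = 10)    # hypothetical bell-shaped data

# Bounds two standard deviations either side of the mean.
lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of the simulated values that fall inside those bounds.
prop_within <- mean(x >= lower & x <= upper)
print(prop_within)
```

The printed proportion should be close to 0.95. Repeating the check on strongly skewed data would show the rule breaking down.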
42. The 95% Rule: The normal distribution app in StatKey is a good way to demonstrate the 95% rule: ask students to pick any mean and standard deviation, then have them guess the bounds for the middle 95% (available by clicking on two tail).
43. The 95% Rule: The standard deviation for hours of sleep per night is closest to: 1; 2; 4; I have no idea.
44. The z-score for a data value, $x$, is
$$z = \frac{x - \bar{x}}{s}$$
For a population, $\bar{x}$ is replaced with $\mu$ and $s$ is replaced with $\sigma$. Values farther from 0 are more extreme.
45. z-score: A z-score puts values on a common scale. A z-score is the number of standard deviations a value falls from the mean. 95% of all z-scores fall between what two values? -2 and 2. z-scores beyond -2 or 2 can be considered extreme.
46. z-score: Which is better, an ACT score of 28 or a combined SAT score of 2100? ACT: $\mu = 21$, $\sigma = 5$. SAT: $\mu = 1500$, $\sigma = 325$. Assume ACT and SAT scores have approximately bell-shaped distributions. ACT score of 28; SAT score of 2100; I don’t know.
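The comparison can be worked out in a few lines of R. This is a sketch that assumes the two numbers listed for each test are its mean and standard deviation; `z_score` is a hypothetical helper, not a built-in function.

```r
# Hypothetical helper: number of standard deviations a value falls from the mean.
z_score <- function(value, center, spread) {
  (value - center) / spread
}

z_act <- z_score(28, center = 21, spread = 5)          # (28 - 21) / 5
z_sat <- z_score(2100, center = 1500, spread = 325)    # (2100 - 1500) / 325

print(round(c(ACT = z_act, SAT = z_sat), 2))
```

The ACT score is 1.4 standard deviations above the mean and the SAT score is about 1.85 above, so on this common scale the SAT score of 2100 is the more impressive one.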
47. Other Measures of Location: Maximum = largest data value. Minimum = smallest data value. Quartiles: Q1 = median of the values below m; Q3 = median of the values above m.
48. Five Number Summary: Min, Q1, m, Q3, Max. [Figure: the five values mark off four sections, each holding 25% of the data] R: summary(x)
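The quartile definition used in these slides (median of the values below m, and median of the values above m) can be coded directly. This is a sketch: `five_number` is a hypothetical helper and `x` is made-up data.

```r
# Hypothetical helper implementing the "median of each half" definition.
# Values exactly equal to the median are left out of both halves.
five_number <- function(x) {
  sorted <- sort(x)
  m <- median(sorted)
  c(Min = min(sorted),
    Q1  = median(sorted[sorted < m]),
    m   = m,
    Q3  = median(sorted[sorted > m]),
    Max = max(sorted))
}

x <- c(3, 7, 8, 5, 12, 14, 21, 13)   # hypothetical data
print(five_number(x))                # 3, 6, 10, 13.5, 21

# R's own summary also reports the mean, and its quartiles come from
# quantile()'s default interpolation rule, so they can differ slightly.
print(summary(x))
```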
49. Five Number Summary:
> summary(study_hours)
Min. 1st Qu. Median 3rd Qu. Max.
The distribution of number of hours spent studying each week is: Symmetric; Right-skewed; Left-skewed; Impossible to tell.
50. The Pth percentile is the value which is greater than P% of the data. We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better. We could also have used percentiles: ACT score of 28: 91st percentile; SAT score of 2100: 97th percentile.
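If the scores are treated as exactly normal, the percentiles can be approximated with `pnorm()`. This is only a sketch under that assumption, and it also assumes the figures 21 and 5 (ACT) and 1500 and 325 (SAT) given earlier are the means and standard deviations.

```r
# Proportion of a normal distribution that lies below each score.
act_pct <- pnorm(28, mean = 21, sd = 5)
sat_pct <- pnorm(2100, mean = 1500, sd = 325)

print(round(c(ACT = act_pct, SAT = sat_pct), 2))
```

This gives roughly 0.92 and 0.97. The normal model is only an approximation, so it need not reproduce the quoted percentiles exactly, but the ordering is the same: the SAT score sits at the higher percentile.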
51. Five Number Summary: Min = 0th percentile; Q1 = 25th percentile; m = 50th percentile; Q3 = 75th percentile; Max = 100th percentile. Each section between consecutive values holds 25% of the data.
52. Measures of Spread: Range = Max – Min. Interquartile Range (IQR) = Q3 – Q1. Is the range resistant to outliers? Yes; No. Answer: No. The range depends entirely on the most extreme values. Is the IQR resistant to outliers? Yes; No. Answer: Yes. The IQR is based on the middle 50% of the data, which will not contain outliers.
53. Comparing Statistics: Measures of Center: Mean (not resistant); Median (resistant). Measures of Spread: Standard deviation (not resistant); IQR (resistant); Range (not resistant). Most often, we use the mean and the standard deviation, because they are calculated from all the data values and so use all the available information.
54. Outliers: Outliers can be informally identified by looking at a plot, but one rule of thumb is to flag data values more than 1.5 IQRs beyond the quartiles. A data value is an outlier if it is smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR).
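The 1.5 IQR rule of thumb is easy to apply in R. This is a sketch on made-up values (`x` is hypothetical). It takes the quartiles from `quantile()`, whose default interpolation rule can differ slightly from the "median of each half" definition.

```r
# Hypothetical values: one value is far larger than the rest.
x <- c(12, 15, 18, 20, 22, 25, 30, 400)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1                      # same value as IQR(x)

# Fences 1.5 IQRs beyond each quartile.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

Only the value 400 falls outside the fences here, so it is the only point flagged.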
55. Boxplot: [Figure: boxplot with Q1, Median, Q3, and Outliers labeled] Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier. R: boxplot(x)
56. Boxplot: Which boxplot goes with the histogram of waiting times for the bus? (a); (b); (c). The data do not show any low outliers.