# Impact of a simulation/randomization-based curriculum on student understanding of p-values and confidence intervals


Beth Chance, Karen McGaughey, Jimmy Wong

Cal Poly – San Luis Obispo

ICOTS9

Outline

- About the curriculum (Karen)
- Evaluating the curriculum (Beth)
- Benefits/Cautions/Suggestions (Karen)
- Next Steps (Beth)

Background

- Randomization-based introductory statistics courses (Saturday workshop)
- Introducing all inferential techniques through simulation and randomization-based methods, e.g., permutation tests, bootstrapping
- Tintle et al. (2015) text (Roy, Session 4A)
- Focus on overall statistical process via genuine research studies
- Normal-based methods presented as alternative approximation to simulation results

Background

Spiraled just-in-time curriculum:

- Brief introduction to probability through simulation, e.g., Monty Hall problem, coin tossing
  - Develop understanding of probability as a long-run proportion
- Statistical inference
  - (Ch. 1) Process probability/one proportion
  - One mean, two proportions, two means, matched pairs, multiple proportions, multiple means, regression
  - Deeper dive in each iteration
- Interspersed as needed: discussions of random sampling, random assignment, graphical displays, scope of conclusions, etc.

Background

Ch 1: Test of significance, one proportion

- Facial Prototyping – “Bob & Tim” (Lea, Thomas, Lamkin, & Bell, 2007)
  - Binary response
  - Overwhelmingly name left picture “Tim” (e.g., ~80%)
- Two competing explanations for the study outcome:
  - “Random chance alone”
  - Research conjecture
- Could the observed statistic plausibly have happened by random chance alone?
- Design the simulation: What does “by random chance alone” look like?
  - Coin tossing model
  - Tactile & via computer
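The coin-tossing model on this slide can be sketched in a few lines of Python. This is only an illustration: the slides do not give the class counts, so the 24-of-30 figure below is a hypothetical stand-in for "about 80%", and the function names are invented for the sketch.

```python
import random

def simulate_null_proportions(n, null_pi, reps, seed=0):
    """Simulate sample proportions assuming random chance alone (pi = null_pi)."""
    rng = random.Random(seed)
    return [sum(rng.random() < null_pi for _ in range(n)) / n for _ in range(reps)]

def one_sided_p_value(null_props, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    return sum(p >= observed for p in null_props) / len(null_props)

# Hypothetical class data (not from the slides): 24 of 30 students name the left picture "Tim".
n, successes = 30, 24
observed = successes / n

# Each repetition tosses n fair "coins" and records the proportion of heads.
null_props = simulate_null_proportions(n, null_pi=0.5, reps=10_000)
p_value = one_sided_p_value(null_props, observed)
print(f"observed proportion = {observed:.2f}, approximate p-value = {p_value:.4f}")
```

A small p-value here says the observed proportion would rarely arise from fair coin tosses, which is the "could this have happened by chance alone?" question in code form.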

Background

| Test | Two-sided p-value | Decision at 0.05 significance level | Plausible? |
|---|---|---|---|
| H₀: π = 0.26 | 0.0430 | Reject H₀ | No |
| H₀: π = 0.27 | 0.0800 | Fail to reject H₀ | Yes |
| ⋮ | ⋮ | Fail to reject H₀ | Yes |
| ⋮ | ⋮ | Fail to reject H₀ | Yes |
| H₀: π = 0.55 | 0.0770 | Fail to reject H₀ | Yes |
| H₀: π = 0.56 | 0.0450 | Reject H₀ | No |
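The table builds an interval of plausible values by testing many null values and keeping those that are not rejected. A minimal sketch of that idea follows. Two assumptions to flag: it swaps the simulation for an exact binomial calculation so the result is reproducible, and the data (16 successes in 40 trials) are hypothetical, since the slide does not state the sample behind its p-values.

```python
from math import comb

def binom_pmf(k, n, pi):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

def two_sided_p_value(x, n, pi):
    """Exact two-sided p-value: total probability of outcomes no more likely than x.
    (One common convention; other two-sided definitions exist.)"""
    p_obs = binom_pmf(x, n, pi)
    return sum(binom_pmf(k, n, pi) for k in range(n + 1)
               if binom_pmf(k, n, pi) <= p_obs * (1 + 1e-9))

def plausible_values(x, n, alpha=0.05):
    """Null values of pi (on a 0.01 grid) that are not rejected at level alpha."""
    grid = [i / 100 for i in range(1, 100)]
    return [pi for pi in grid if two_sided_p_value(x, n, pi) > alpha]

# Hypothetical data (not from the slides): 16 successes in 40 trials.
kept = plausible_values(16, 40)
print(f"plausible values of pi: {kept[0]:.2f} to {kept[-1]:.2f}")
```

The smallest and largest retained null values play the role of the endpoints of a 95% confidence interval.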

Background

Ch 5: Two proportions

- Dolphin Therapy (Antonioli & Reveley, 2005)
  - Binary response
  - Designed experiment
- Two competing explanations:
  - H₀: “random chance alone”
  - Hₐ: research conjecture
- Could the observed statistic plausibly have happened by random chance alone?
- Design the simulation: Card shuffling
  - Tactile & via computer

| Therapy group | Dolphin | Control |
|---|---|---|
| Improved | 10 | 3 |
| Did not improve | 5 | 12 |
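The card-shuffling simulation for the dolphin study can be sketched as follows, using the counts from the slide (10 of 15 improved with dolphin therapy, 3 of 15 in the control group). The function name, seed, and number of repetitions are choices made for this sketch, not part of the curriculum materials.

```python
import random

def shuffle_once(cards, group_size, rng):
    """Shuffle the outcome cards and return the difference in improvement proportions."""
    rng.shuffle(cards)
    first, second = cards[:group_size], cards[group_size:]
    return sum(first) / len(first) - sum(second) / len(second)

# Observed table from the slide: 10 of 15 improved (dolphin), 3 of 15 improved (control).
observed_diff = 10 / 15 - 3 / 15
cards = [1] * 13 + [0] * 17   # 13 "improved" cards, 17 "did not improve" cards

rng = random.Random(1)        # seed chosen only to make the sketch reproducible
reps = 10_000
# Under the null, the outcomes are fixed and only the group labels are random.
null_diffs = [shuffle_once(cards, 15, rng) for _ in range(reps)]
p_value = sum(d >= observed_diff - 1e-12 for d in null_diffs) / reps
print(f"observed difference = {observed_diff:.3f}, approximate p-value = {p_value:.4f}")
```

The approximate p-value should land close to the exact hypergeometric tail probability for these counts, which is about 0.013.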

2013-2014 Evaluation

- New and experienced teachers
  - 15 institutions (HS, community college, university)
  - 15 instructors (fall) and 23 instructors (spring, 12 new)
  - Over 1500 students
- Assessment
  - (Modified) CAOS pre and post tests (Tintle, Session 8A)
  - SATS attitudes pre and post tests (Swanson, Session 1F)
  - Set of common multiple choice exam questions: 25 instructors, 774-826 students
  - Final exam transfer question

One Proportion (Exam 1)

Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q1: Picking the correct null hypothesis (overall percentages)

- Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 92.9%
- Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 5.8%
- Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 0.6%
- Other 0.6%

One Proportion (Exam 1)

Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q2: Picking the correct alternative hypothesis

- Adult residents of the city are equally likely to choose to watch the movie at home as to watch the movie at the theater. 1.7%
- Adult residents of the city are more likely to choose to watch the movie at home than to watch the movie at the theater. 90.1%
- Adult residents of the city are less likely to choose to watch the movie at home than to watch at the theater. 5.6%
- Other 2.7%

One Proportion (Exam 1)

Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q3: Result is statistically significant (p = 0.012); which explanation is more plausible?

- (a) More than half of the adult residents in her city prefer to watch the movie at home. 65.6%
- (b) There is no overall preference for movie-watching-at-home in her city, but by pure chance her sample just happened to have an unusually high number of people choose to watch the movie at home. 6.0%
- (c) (a) and (b) are equally plausible explanations. 29.7%

Substantial section-to-section variability!

One Proportion (Exam 1)

Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q4: Most valid interpretation of p-value?

- A sample proportion as large as or larger than hers would rarely occur. 14.0%
- A sample proportion as large as or larger than hers would rarely occur if the study had been conducted properly. 6.9%
- A sample proportion as large as or larger than hers would rarely occur if 50% of adults in the population prefer to watch the movie at home. 59.9%
- A sample proportion as large as or larger than hers would rarely occur if more than 50% of adults in the population prefer to watch the movie at home. 20.3%

Higher for experienced instructors

One Proportion (Exam 1)

Research question: Are city residents more likely to watch a movie at home rather than in the theater?

Q5: Would 95% confidence interval contain 0.5?

- Yes 25.3%
- No 43.8%
- Not enough information 31.0%

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q1: Best conclusion from not significant (not small p-value) result?

- You have found strong evidence that there is no difference between the proportions of men and women in your community that dream in color. 14.5%
- You have not found enough evidence to conclude that there is a difference between the proportions of men and women in your community that dream in color. 72.8%
- You have found strong evidence against the claim that there is a difference between the proportions of men and women that dream in color. 10.7%
- Because the result is not significant, we can’t conclude anything from this study. 4.1%

Higher for new instructors

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q2: Best interpretation from small p-value?

- It would not be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 5.0%
- It would be very surprising to obtain the observed sample results if there is really no difference between the proportions of men and women in your community that dream in color. 56.5%
- It would be very surprising to obtain the observed sample results if there is really a difference between the proportion of men and women in your community that dream in color. 7.9%
- The probability is very small that there is no difference between the proportions of men and women in your community that dream in color. 22.6%
- The probability is very small that there is a difference between the proportions of men and women in your community that dream in color. 8.4%

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q3: If there really is a difference, why might you get a large p-value?

- Something went wrong with the analysis, and the results of this study cannot be trusted. 6.1%
- There must not be a difference after all and the other research studies were flawed. 3.8%
- The sample size might have been too small to detect a difference even if there is one. 90.1%

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q4: Which has stronger evidence of a difference: Study A vs. Study B?

- Study A: 40/100 vs. 20/100. 80.3%
- Study B: 35/100 vs. 25/100. 4.4%
- The strength of evidence would be similar for these two studies. 15.3%
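To see why Study A gives stronger evidence, one could run the same re-randomization on both sets of counts. The sketch below assumes a two-sided comparison and uses a hypothetical helper, `randomization_p_value`; the seed and number of repetitions are arbitrary.

```python
import random

def randomization_p_value(success_1, n_1, success_2, n_2, reps=5_000, seed=0):
    """Two-sided re-randomization p-value for a difference in two proportions."""
    rng = random.Random(seed)
    observed = abs(success_1 / n_1 - success_2 / n_2)
    total = success_1 + success_2
    cards = [1] * total + [0] * (n_1 + n_2 - total)
    extreme = 0
    for _ in range(reps):
        x = sum(rng.sample(cards, n_1))           # successes dealt to group 1
        diff = abs(x / n_1 - (total - x) / n_2)   # shuffled difference in proportions
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / reps

p_a = randomization_p_value(40, 100, 20, 100)  # Study A: 40/100 vs. 20/100
p_b = randomization_p_value(35, 100, 25, 100)  # Study B: 35/100 vs. 25/100
print(f"Study A p-value = {p_a:.4f}, Study B p-value = {p_b:.4f}")
```

With the same sample sizes and the same overall success rate, the larger observed difference in Study A falls much farther into the tail of the null distribution.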

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q5: Which has stronger evidence of a difference: Study C vs. Study D (30% vs. 20%)?

- Study C: sample sizes of 100 and 100. 83.0%
- Study D: sample sizes of 40 and 40. 6.0%
- The strength of evidence would be similar for these two studies. 10.8%
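A rough way to check the Study C vs. Study D comparison is the normal-based standardized statistic, used here as an approximation to what the simulation would show. The pooled proportion is 0.25 in both studies, so only the sample sizes change the standard error:

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
$$

$$
z_C = \frac{0.30 - 0.20}{\sqrt{0.25 \cdot 0.75 \cdot \left(\frac{1}{100} + \frac{1}{100}\right)}} \approx \frac{0.10}{0.0612} \approx 1.63,
\qquad
z_D = \frac{0.30 - 0.20}{\sqrt{0.25 \cdot 0.75 \cdot \left(\frac{1}{40} + \frac{1}{40}\right)}} \approx \frac{0.10}{0.0968} \approx 1.03
$$

The same 10-point difference sits farther from zero, in standard-error units, with the larger samples, so Study C provides the stronger evidence.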

Two Proportions (Exam 2)

Research question: Are women more likely to dream in color than men?

Q6: Small p-value, which explanation is more plausible?

- (a) Men and women in your community do not differ on this issue but by chance alone the random sampling led to the difference we observed between the two groups. 13.6%
- (b) Men and women in your community differ on this issue. 58.1%
- (c) (a) and (b) are equally plausible explanations. 28.2%

36% correct with draft curriculum four years ago

Two Proportions (Exam 2)

n = 404 students (8 instructors)

Q7: Main purpose of the randomness in the simulation?

- To allow me to draw a cause-and-effect conclusion from the study. 19.1%
- To allow me to generalize my results to a larger population. 11.4%
- To simulate values of the statistic under the null hypothesis. 58.8%
- To replicate the study and increase the accuracy of the results. 8.2%

Two Means (Exam 2/Final)

- 717 students, 14 instructors
- Want to compare mean score on video game with and without monetary incentive
- Simulation process is described and given null distribution
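The simulation process described to students in this item is a re-randomization of the scores between the two groups. The slides do not include the data, so the scores below are invented purely for illustration, as are the function and variable names.

```python
import random
from statistics import mean

def rerandomize_means(group_a, group_b, reps=5_000, seed=0):
    """Shuffle all scores between two groups and record the difference in means."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(reps):
        rng.shuffle(pooled)  # under the null, the incentive label does not matter
        diffs.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return diffs

# Hypothetical video game scores (not from the slides).
incentive = [98, 105, 110, 92, 120, 101, 115, 108, 96, 112]
do_your_best = [90, 99, 104, 85, 100, 95, 102, 88, 93, 97]

observed = mean(incentive) - mean(do_your_best)
null_diffs = rerandomize_means(incentive, do_your_best)
# One-sided p-value: how often chance alone gives a difference at least this large.
p_value = sum(d >= observed for d in null_diffs) / len(null_diffs)
print(f"observed difference = {observed:.1f}, approximate p-value = {p_value:.4f}")
```

The null distribution is centered near zero by construction, which is why its center alone says nothing about whether the incentive works; the location of the observed difference in the tail is what matters.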

Two Means (Exam 2/Final)

Q1: Main motivation for this process?

- This process allows her to compare her actual result to what could have happened by chance if gamers’ performances were not affected by whether they were asked to do their best or offered an incentive. 83.0%
- This process allows her to determine the percentage of time the \$5 incentive strategy would outperform the “do your best” strategy for all possible scenarios. 12.0%
- This process allows her to determine how many times she needs to replicate the experiment for valid results. 2.2%
- This process allows her to determine whether the normal distribution fits the data. 2.8%

Two Means (Exam 2/Final)

Q2: What’s assumed in carrying out the simulation?

- (a) The \$5 incentive is more effective than the “do your best” incentive for improving performance. 25.8%
- (b) The \$5 incentive and the “do your best” incentive are equally effective at improving performance. 60.9%
- (c) The “do your best” incentive is more effective than a \$5 incentive for improving performance. 6.0%
- (d) Both (a) and (b) but not (c). 7.3%

Two Means (Exam 2/Final)

Q3: Approximate p-value from graph

- 0.501 (using null value): 14.0%
- 0.047 (two-sided): 16.9%
- 0.022: 52.5%
- 0.001 (small): 16.2%

Two Means (Exam 2/Final)

Q4: What does histogram tell us about research question?

- The \$5 incentive is not effective because the distribution of differences generated is centered at zero. 16.3%
- The \$5 incentive is effective because distribution of differences generated is centered at zero. 14.8%
- The \$5 incentive is not effective because the p-value is greater than 0.05. 5.1%
- The \$5 incentive is effective because the p-value is less than 0.05. 63.4%

Two Means (Exam 2/Final)

Q5: Appropriate interpretation of p-value?

- The p-value is the probability that the \$5 incentive is not really helpful. 3.7%
- The p-value is the probability that the \$5 incentive is really helpful. 12.9%
- The p-value is the probability that she would get a result at least as extreme as the one she actually found, if the \$5 incentive is really not helpful. 82.3%
- The p-value is the probability that a student wins on the video game. 0.9%

CAOS Significance questions (n ≈ 2,000 pre, 1,500 post)

| Valid/invalid interpretations | Pre | Post: Exp | Post: New | Post: Non | CAOS |
|---|---|---|---|---|---|
| Large or small p-value, no impact | 50% | 89% | 85% | 62% en | 68% |
| Probability of results at least as extreme under null: valid | 50% | 65% | 66% | 52% | 57% |
| Probability of alternative: invalid | 40% | 53% | 58% | 48% | 54% |
| Probability of null: invalid | 53% | 72% | 67% | 58% | 60% |

CAOS Conf interval questions

| Valid/invalid interpretations | Pre | Post: Exp | Post: New | Post: Non | CAOS |
|---|---|---|---|---|---|
| 95% of all observations in population in interval: invalid | 57% | 63% | 64% | 56% | 65% |
| 95% confident an observational unit is in interval: invalid | 27% | 41% | 37% | 21% en | 49% |
| 95% of sample means from population are in interval: invalid | 51% | 60% | | 64% | 48% |
| 95% confident population mean is in interval: valid | 71% | 80% | | 82% | 76% |

CAOS Sampling variability questions

| Question | Pre | Post: Exp | Post: New | Post: Non | CAOS |
|---|---|---|---|---|---|
| Small sample (n = 60) may fail to detect difference | 71% | 58% | 57% | 49% | 67% |
| Necessary sample size for all 310 million U.S. residents | 10% | 19% | 22% | 11% | |
| “Hospital problem” | 33% | 39% | 38% | 34% | 33% |
| Values of 10 sample proportions | 42% | 44% | 52% e | 43% | 52% |
| Simulation design | 24% | 40% | 35% | 24% e | 22% |

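Topic areas – Summary

Auth = author team member; Mid = non-author but have used materials more than once

| Topic | Pre: Auth | Pre: Mid | Pre: New | Pre: Non | Post: Auth | Post: Mid | Post: New | Post: Non |
|---|---|---|---|---|---|---|---|---|
| Significance | 52% | 43% | 47% | 46% | 72% | 67% | 69% | 55% * |
| Confidence | 55% | 51% | | 49% | 63% | 60% | | 56% |
| Sampling variability | 35% | 36% | | 35% | 41% | 40% | 41% | 32% |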

Transfer Question (Final exam)

- A constant theme of course: Could the statistic have happened by chance alone?
  - Applicable in any situation vs. statistical test applicable in only one specific situation
- Can students apply the same logic to a novel problem?
- Spring 2014: Two Cal Poly instructors (169 students)
  - Final exam: mean/median as a measure of skewness to make inference about population shape (adapted from 2009 AP Statistics exam)
  - Earlier midterm: Ratio of standard deviations or relative risk

Transfer Question (Final exam)

Do the sample data provide convincing evidence the population is right skewed?

- Calculate statistic: mean/median = 1.05
- What values would you expect for the statistic with a normally distributed population? With a skewed right population?
  - 39% answered both questions correctly
  - Common errors:
    - Mean/median > 1.05 if right skewed
    - Wrong direction: mean/median < 1 if right skewed

Transfer Question (Final exam)

Do the sample data provide evidence the population is right skewed?

- Calculate statistic: mean/median = 1.05
- Given a simulated null distribution from a symmetric population (centered at 1)
- Evidence against the null hypothesis?
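A null distribution like the one given to students could be generated along the following lines. The slides state only the observed statistic (mean/median = 1.05), so the sample size and the normal population's mean and standard deviation below are hypothetical choices, and `null_ratios` is a name invented for this sketch.

```python
import random
from statistics import fmean, median

def null_ratios(n, mu, sigma, reps=5_000, seed=0):
    """Simulate mean/median for samples drawn from a symmetric (normal) population."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ratios.append(fmean(sample) / median(sample))
    return ratios

observed_ratio = 1.05  # statistic reported on the slide

# Hypothetical settings (not from the slides): n = 40 from a Normal(60, 10) population.
ratios = null_ratios(n=40, mu=60, sigma=10)
center = fmean(ratios)
# Right skew pulls the mean above the median, so large ratios count as evidence.
p_value = sum(r >= observed_ratio for r in ratios) / len(ratios)
print(f"null distribution centered near {center:.3f}, approximate p-value = {p_value:.4f}")
```

The reasoning that transfers is the comparison of the observed 1.05 with the spread of the null distribution; its symmetric shape and its center at 1 are consequences of the null model, not evidence in themselves.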

Transfer Question (Final exam)

Multiple choice version based on common responses from open-ended version. Answer choices focus on 3 characteristics of the null distribution:

There is strong evidence (or not) to suggest the actual population distribution is right skewed...

- Due to symmetric shape
- Because the center is at 1
- Because most values vary between 0.96 and 1.04

Transfer Question (Final exam)

Two instructors (5 sections/169 students) from Cal Poly

- does not provide strong evidence … because this null distribution is symmetric. 11%
- provides strong evidence … because this null distribution is symmetric. 12%
- does not provide strong evidence … because this null distribution is centered around one. 20%
- provides strong evidence … because this null distribution is centered around one. 26%
- does not provide strong evidence … because most of the values in this null distribution vary between 0.96 and 1.04. 10%
- provides strong evidence … because most of the values in this null distribution vary between 0.96 and 1.04. 18%
- Other: provided correct reasoning 7% *

\* 25% answered correctly and an additional 8% showed work indicating correct reasoning

Benefits

- Little to no confusion that small p-values mean statistical significance
- Students very comfortable (even initially) with idea of “could this have happened by chance alone”
  - Idea of large z-score or t-score (beyond 2 SE) also clicks
- Address difficult inferential reasoning earlier in course
  - Repeated exposures allow a synthesis of the ideas
- Understanding “inference process” as statistical method, rather than stand-alone methods for testing means, proportions, etc.
- Efficiency gains:
  - Still possible to do both simulation and normal-based methods
  - Exploration of other statistics (e.g., MAD for multiple means)
- Instructors enjoy approach, research study focus, richer student questions

Cautions

- Inferential reasoning is difficult and, initially, there is little carry-over of learning:
  - Non 50/50 cases
  - Comparing groups
  - Need several repeated exposures
- May introduce a misconception of “repeating the study”
- Possible increase in misconception that we are “providing evidence for the null hypothesis”
- Continue to struggle with identifying & defining parameters
- Balance inferential with descriptive statistics (less as Common Core comes on line?)

Main Suggestions

- Emphasize the ideas of model and simulation
  - Repeatedly test their ability to design a simulation
  - Ask students to predict simulation results (where will it be centered, why)
  - Focus on variability in null distribution as the key
- Clearly delineate observed data from simulation
- Explicitly discuss roles of randomness in the study design vs. randomness in simulation
- Use early experiential examples that give students ownership of the data (“observed” statistic)

Future Steps

- Three-year NSF grant (DUE/TUES – 1323210) to continue data collection across institutions
  - More “non-users” and other randomization-based curricula (e.g., Lock5, CATALST)
  - More studies of student retention of concepts
  - Next theme of common exam questions: Confidence intervals
- Email Nathan Tintle (nathan.tintle@dordt.edu) or Beth Chance (bchance@calpoly.edu) if you would like to participate

Questions?
