# Statistics That Deceive


- It is commonly accepted that the larger the data set, the better the results
- Simpson’s Paradox demonstrates that a great deal of care has to be taken when combining smaller data sets into a larger one
- Sometimes the conclusion from the larger data set is the opposite of the conclusion from the smaller data sets!

Baseball batting statistics for two players:

| | First Half | Second Half | Total Season |
|---|---|---|---|
| Player A | .400 | .250 | .264 |
| Player B | .350 | .200 | .336 |

How could Player A beat Player B in both halves individually, yet have a lower batting average for the total season?

| | First Half | Second Half | Total Season |
|---|---|---|---|
| Player A | 4/10 (.400) | 25/100 (.250) | 29/110 (.264) |
| Player B | 35/100 (.350) | 2/10 (.200) | 37/110 (.336) |

We weren’t told how many at bats each player had. Player A’s dismal second half and Player B’s great first half each covered 100 at bats, so they carry far more weight in the season totals than the two 10-at-bat halves.
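The reversal can be checked with a few lines of Python. This is only an illustrative sketch: the names `players`, `batting_average`, and `season_average` are made up for this example, and the hit and at-bat counts are the ones given above.

```python
from fractions import Fraction

# (hits, at_bats) for the first and second half, from the slide's figures.
players = {
    "Player A": [(4, 10), (25, 100)],
    "Player B": [(35, 100), (2, 10)],
}

def batting_average(hits, at_bats):
    """Batting average for a single stretch of the season."""
    return Fraction(hits, at_bats)

def season_average(halves):
    """Season average: total hits over total at bats (not the mean of the halves)."""
    total_hits = sum(hits for hits, _ in halves)
    total_at_bats = sum(at_bats for _, at_bats in halves)
    return Fraction(total_hits, total_at_bats)

for name, halves in players.items():
    per_half = [f"{float(batting_average(h, ab)):.3f}" for h, ab in halves]
    season = f"{float(season_average(halves)):.3f}"
    print(name, "halves:", per_half, "season:", season)
```

Player A leads in each half, but because the season figure divides total hits by total at bats, the two 100-at-bat halves dominate and Player B comes out ahead.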

Average college physics grades for students in an engineering program:

| | Taken HS physics | No HS physics |
|---|---|---|
| Number of Students | 50 | 5 |
| Average Grade | 80 | 70 |

Average college physics grades for students in a liberal arts program:

| | Taken HS physics | No HS physics |
|---|---|---|
| Number of Students | 5 | 50 |
| Average Grade | 95 | 85 |

It appears that in both majors (Liberal Arts and Engineering), taking high school physics improves your college physics grade by 10 points.

In order to get better results, let’s combine our datasets. In particular, let’s combine all the students that took high school physics. More precisely, let’s combine the Engineering majors that took high school physics with the Liberal Arts majors that took high school physics. Likewise, combine the Engineering majors that did not take high school physics with the Liberal Arts majors that did not take high school physics. But be careful! You can’t just take the average of the two averages, because each dataset has a different number of values. Each average has to be weighted by the number of students behind it.

Average college physics grades for students who took high school physics:

| | # Students | Avg Grade | Weighted Grade |
|---|---|---|---|
| Engineering | 50 | 80 | 50/55 × 80 = 72.73 |
| Lib Arts | 5 | 95 | 5/55 × 95 = 8.64 |
| Total | 55 | Average | 72.73 + 8.64 ≈ 81.4 |

Average college physics grades for students who did not take high school physics:

| | # Students | Avg Grade | Weighted Grade |
|---|---|---|---|
| Engineering | 5 | 70 | 5/55 × 70 = 6.36 |
| Lib Arts | 50 | 85 | 50/55 × 85 = 77.27 |
| Total | 55 | Average | 6.36 + 77.27 ≈ 83.6 |

Did the students that did not have high school physics actually do better?

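The same averages can be obtained from total grade points (number of students × average grade) divided by the total number of students.

Average college physics grades for students who took high school physics:

| | # Students | Avg Grade | Grade Pts |
|---|---|---|---|
| Engineering | 50 | 80 | 4000 |
| Lib Arts | 5 | 95 | 475 |
| Total | 55 | | 4475 |

Average = 4475 / 55 ≈ 81.4

Average college physics grades for students who did not take high school physics:

| | # Students | Avg Grade | Grade Pts |
|---|---|---|---|
| Engineering | 5 | 70 | 350 |
| Lib Arts | 50 | 85 | 4250 |
| Total | 55 | | 4600 |

Average = 4600 / 55 ≈ 83.6

Did the students that did not have high school physics actually do better?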
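As a rough check on the combined figures, the sketch below recomputes both pooled averages from the student counts and average grades given earlier, and contrasts them with the naive "average of the averages". The names `groups`, `pooled_average`, and `naive_average` are invented for this illustration.

```python
# major -> (number of students, average grade), split by high school physics.
groups = {
    "took HS physics": {"Engineering": (50, 80), "Liberal Arts": (5, 95)},
    "no HS physics": {"Engineering": (5, 70), "Liberal Arts": (50, 85)},
}

def pooled_average(majors):
    """Weighted average: total grade points divided by total students."""
    total_points = sum(n * avg for n, avg in majors.values())
    total_students = sum(n for n, _ in majors.values())
    return total_points / total_students

def naive_average(majors):
    """Unweighted mean of the per-major averages (ignores group sizes)."""
    averages = [avg for _, avg in majors.values()]
    return sum(averages) / len(averages)

for label, majors in groups.items():
    print(f"{label}: pooled = {pooled_average(majors):.2f}, "
          f"naive = {naive_average(majors):.2f}")
```

The naive averages still favor the students who took high school physics, while the correctly weighted pooled averages reverse the ordering. That reversal comes from the lopsided group sizes, not from any change in the per-major grades.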

- Two problems with combining the data:
  - Each combined table was dominated by one type of student: mostly engineering students among those who took high school physics, mostly liberal arts students among those who did not
  - The engineering students had a more rigorous physics class (e.g. “Physics for Engineers”) than the liberal arts students, thus there is a hidden variable
- In fact, this “lurking variable” that makes the subcategories different from one another is the most common cause of Simpson’s Paradox
- Key Point: Be very careful when you combine data into a larger set
