
1 Unit 1: Getting Data

2 Science? (I) The word “science” may just be Latin for “knowing”, but in its modern form... Science can only answer questions from data (“empirical evidence”) –so some things that claim scientific backing have none (Scientology) –Beware “anecdotal evidence” (Aunt Tilly, Marxism) There are many questions science can’t touch without a definition (usually arbitrary) based on data –(e.g., fairest social system, most beautiful art)

3 Science? (II) In your final project, you may find that something you thought was true and obvious isn’t supported by your data –Don’t torture your data to make it come out “right” –Report your results honestly. (You’re allowed to guess what might have gone wrong in the Discussion and Self-Critique.) No study is ever final! –Ex: Does a nightlight in a child’s room make him/her nearsighted? (This ex will reappear.)

4 Experiments The only way to verify causality. Best (but not always possible): –“a controlled experiment... –with random assignment to groups... –conducted double-blind”

5 O(ne) F(actor) A(t) A T(ime)???

6 Observational studies Census(es): poll the entire population –usually too expensive. Sampling: study a few, and infer that the sample’s results are close to the population’s –this is where “significance tests” [discussed later] are used: how far away might they be? An observational study proves only association, not causation,... but it may be the only possible or ethical method –That’s why scientific articles are so carefully hedged (even in Scientific American) –and why even legitimate scientific studies can disagree
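A minimal sketch of the sampling idea, under stated assumptions: the population below is hypothetical (100,000 people, 30% of whom answer "yes"), and the sample size of 1,000 and the random seed are arbitrary choices for illustration. It only shows that a random sample's rate tends to land near the population's rate; it is not a significance test.

```python
import random

# Hypothetical population: 1 = "yes", 0 = "no" (30% yes). Not from the slides.
population = [1] * 30_000 + [0] * 70_000
true_rate = sum(population) / len(population)

# Draw a simple random sample without replacement.
rng = random.Random(0)          # fixed seed so the run is repeatable
sample = rng.sample(population, 1_000)
sample_rate = sum(sample) / len(sample)

# How far is the sample's result from the population's?
gap = abs(sample_rate - true_rate)
print(f"population rate = {true_rate:.3f}")
print(f"sample rate     = {sample_rate:.3f}")
print(f"gap             = {gap:.3f}")
```

Rerunning with different seeds gives slightly different gaps; quantifying how large such gaps typically are is what the significance tests mentioned on the slide address.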

7 Statistical cliché “Correlation does not imply causation!” –Post hoc, ergo propter hoc fallacy (“After this, therefore because of this”) –correlated variables may have a common cause (Do larger feet improve reading skill?) –Do nightlights cause nearsightedness in children? Later researchers found parents’ nearsightedness was a common cause –confounding: “explanatory” variables are associated (Does TV make you dumber or just keep you from studying, which other activities might do?)

8 “Denmark’s Social Research Inst says single fathers are calmer and less likely to punish their children than lone mothers...” [1200 kids, 3-5, half with only mom, half with dad] Moms more stressed & depressed, less self-confident, more nightmares & insomnia, more conflict with kids, quicker to hit or punish... not because of genetics, but less money (jobless or underpaid) What caused what: Being single mom caused poverty & stress? Or poverty & stress caused single parent to be Mom?

9 Data from the internet? “Doing research on the Web is like using a library assembled piecemeal by packrats and vandalized nightly.” –Roger Ebert, on writing a review of the movie Wilde, from Yahoo Internet Life, Sept. ‘98

10 And now for something completely different: Simpson’s paradox(es?)

11 Simpson’s paradox I Averages don’t “average” correctly. –Use “weighted” averages [One text says: Averages of subgroups are “more accurate” than the overall average.]

13 Simpson’s paradox (toy #1) In A-ville, there are 1000 Whigs of whom 30% can juggle, and 100 Tories, of whom 20% can juggle. In B-ville, there are 100 Whigs, of whom 60% can juggle, and 1000 Tories, of whom 50% can juggle. So a given Whig is more likely to be able to juggle than a given Tory?

14 Solution to the juggling pols Fraction of Whigs who can juggle: [.3(1000)+.6(100)]/[1000+100] = 360/1100 ≈ .33. Fraction of Tories who can juggle: [.2(100)+.5(1000)]/[100+1000] = 520/1100 ≈ .47. The latter is clearly larger, even though the Whigs have the higher fraction in each town. The first is a “weighted average” of the fractions .3 and .6, weighted by relative populations: .3(1000/1100) + .6(100/1100) –“Weights” 1000/1100 and 100/1100 add up to 1
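The juggling arithmetic can be checked with a short script. This is a sketch using only the town populations and juggling fractions given in the toy example; the function name `overall_fraction` is made up for illustration.

```python
from fractions import Fraction

# (population, fraction who can juggle) per town: A-ville first, then B-ville.
whigs = [(1000, Fraction(3, 10)), (100, Fraction(6, 10))]
tories = [(100, Fraction(2, 10)), (1000, Fraction(5, 10))]

def overall_fraction(groups):
    """Weighted average of per-town fractions, weighted by relative population."""
    total = sum(pop for pop, _ in groups)
    return sum(frac * Fraction(pop, total) for pop, frac in groups)

whig_rate = overall_fraction(whigs)    # equals 360/1100
tory_rate = overall_fraction(tories)   # equals 520/1100

print(f"Whigs:  {float(whig_rate):.3f}")
print(f"Tories: {float(tory_rate):.3f}")
print("Whigs ahead in each town:", all(w[1] > t[1] for w, t in zip(whigs, tories)))
print("Whigs ahead overall:", whig_rate > tory_rate)
```

Exact fractions are used so the results match the hand calculation without rounding noise.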

15 Simpson’s paradox (toy #2) In the first half of the season, Ruth has a batting average of .280 on 100 at-bats, while Gehrig has a batting average of .270 on 200 at-bats. In the second half of the season, Ruth bats .190 on 200 at-bats, while Gehrig bats .180 on 100 at-bats. So Ruth’s average for the season is higher than Gehrig’s? –(Find weighted batting averages)
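One way to work the batting exercise, sketched with the half-season figures from the slide: recover hits in each half from average times at-bats, then divide total hits by total at-bats. The helper name `season_average` is illustrative only.

```python
# (batting average, at-bats) for the first and second half of the season.
ruth = [(0.280, 100), (0.190, 200)]
gehrig = [(0.270, 200), (0.180, 100)]

def season_average(halves):
    """Season average = total hits / total at-bats (a weighted average)."""
    hits = sum(round(avg * at_bats) for avg, at_bats in halves)
    at_bats = sum(ab for _, ab in halves)
    return hits / at_bats

ruth_season = season_average(ruth)      # (28 + 38) / 300
gehrig_season = season_average(gehrig)  # (54 + 18) / 300

print(f"Ruth:   {ruth_season:.3f}")
print(f"Gehrig: {gehrig_season:.3f}")
```

Ruth leads in each half, yet the season figures come out as .220 for Ruth and .240 for Gehrig, because most of Ruth's at-bats fall in the weaker second half.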

16 At least we can say that a weighted average of several numbers (like a list of averages) is somewhere between the smallest and the largest numbers in the list. Ex: Men’s averages in the Berkeley case: .62(825/2691) + .63(560/2691) + .37(325/2691) + .33(417/2691) + .28(191/2691) + .06(373/2691) ≈ .445 is between .06 and .63 Ex: Juggling Whigs: 360/1100 ≈ .33 is between .3 and .6
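The bounding property can be checked numerically. The sketch below uses the six rates and six applicant counts from the Berkeley example as listed on the slide; the weights are computed from the sum of those counts (825 + 560 + 325 + 417 + 191 + 373 = 2691) so that they add up to 1.

```python
# Admission rates and applicant counts for the six groups in the example.
rates = [0.62, 0.63, 0.37, 0.33, 0.28, 0.06]
applicants = [825, 560, 325, 417, 191, 373]

total = sum(applicants)
weights = [n / total for n in applicants]

# Weighted average: each rate counts in proportion to its group's size.
overall = sum(r * w for r, w in zip(rates, weights))

print(f"total applicants = {total}")
print(f"sum of weights   = {sum(weights):.6f}")
print(f"weighted average = {overall:.3f}")
print("between min and max:", min(rates) <= overall <= max(rates))
```

Any set of nonnegative weights that sums to 1 gives a result between the smallest and largest values, which is why the check on the last line holds.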

17 Simpson’s paradox II For two interrelated variables, say x and y (more about this later in the course): When x goes up, does y go up or down? –If they both go up, the “correlation” is + –If x goes up and y goes down, it’s - Parts of the data may each go up, but overall it goes down, or v.v. [One text: Look for a “lurking variable” that is separating your data into pieces that behave differently.]
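A small illustration of this second form of the paradox. The eight data points are invented for the example (they are not from the course), and the group label plays the role of the lurking variable: within each group y rises with x, but the pooled data slope downward.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: two groups separated by a lurking variable.
group_a_x, group_a_y = [1, 2, 3, 4], [10, 11, 12, 13]
group_b_x, group_b_y = [6, 7, 8, 9], [2, 3, 4, 5]

r_a = pearson(group_a_x, group_a_y)   # positive within group A
r_b = pearson(group_b_x, group_b_y)   # positive within group B
r_all = pearson(group_a_x + group_b_x, group_a_y + group_b_y)  # negative pooled

print(f"group A: r = {r_a:+.2f}")
print(f"group B: r = {r_b:+.2f}")
print(f"pooled:  r = {r_all:+.2f}")
```

Plotting the two groups in different colors would show two rising clusters, with the second cluster sitting lower and to the right of the first.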

