Week 6. Experiments, methodology, the responsible use of numbers, and paper structure. GRS LX 700 Language Acquisition & Linguistic Theory.

1 Week 6. Experiments, methodology, the responsible use of numbers, and paper structure. GRS LX 700 Language Acquisition & Linguistic Theory

2 Why we do experiments
In this context, we’re generally interested in the state and developmental course of children’s linguistic knowledge.
What does the child know?
To what extent does it differ from what an adult knows?
We have logical, abstract reasons to believe that a lot of what kids ultimately know about language is not deduced from their language input, but what evidence is there?

3 Universal Grammar
The poverty of the stimulus
  1 2 3, what’s next?
  1 2 3 5, what’s next?
Properties of “innateness”?
  Independence from external evidence.
  Universality?
  Early emergence?
Constraints and the absence of negative evidence.
  Candidate A: Who does Arnold wanna make breakfast for? Who does Arnold wanna make breakfast?
  Candidate B: Who does Arnold wanna make breakfast for? *Who does Arnold wanna make breakfast?

4 Hypotheses
Where you have two hypotheses that make different predictions, you use an experiment to determine which predictions are actually borne out.
A standard setup in child language studies is pitting an experimental hypothesis (H1) such as Children (Like Those We Are Testing) Know Grammatical Constraint X against the null hypothesis (H0) They Don’t.

5 This should be difficult
Experiments are naturally subject to error. We’re measuring things in the real world.
We only want to reject the null hypothesis if we’re sure.
So, we want to “stack the deck against” H1. Exclude the possibility that kids give the correct answers for the wrong reasons.

                    H0 actually true    H0 actually false
Reject H0           Type I error        Correct
Do not reject H0    Correct             Type II error

6 Production and comprehension
Two things to look at to assess kids’ grammatical knowledge.
Naturalistic production (e.g., CHILDES transcripts) is good for some things, but not so good for others.
If we’re interested in a particular construction, often this needs to be elicited in an experimental setting in order to get enough examples.
Non-production: Constraint? Or preference?
Test comprehension:
  Act-out
  Grammaticality judgment
  Truth value judgment

7 Elicited production
Wanna contraction
  Who do you want to kiss?
  Who do you wanna kiss?
  I want to kiss Bill
  I wanna kiss Bill
  Who do you want to kiss Bill?
  *Who do you wanna kiss Bill?
All signs point to the last one being good. But it isn’t. Do kids know this?
So: try to get kids to say Who do you wanna kiss Bill? Great. Suppose they don’t. Have we shown that they know the constraint? Hint: No.

8 But I don’t wanna contract.
The fact that a kid never says Who do you wanna kiss Bill? doesn’t tell us anything unless the kid would have otherwise contracted.
So we need controls in order to determine whether the kid knows/uses wanna contraction to begin with.
So: Try to get the kid to say Who do you wanna kiss? as well as trying to get them to say Who do you wanna kiss Bill?
If the kid never says wanna, we have no evidence of anything.
Also: Maybe the kid knows all there is to know about wanna contraction, but prefers not to contract.

9 Here’s how it might be done.
Design the experiment to be some kind of game, to keep the kid interested and willing to continue.
Use a puppet. Kids are more willing to interact with a puppet, and more willing to disagree with a puppet.
  Though: Gordon (1996) relates a tale of testing Kadiweu kids, who “had never encountered puppets before and reacted with a mixture of curiosity and fear that often led to tears”.
Ratty missed snacktime and is probably hungry. Various toy food items are arrayed in the playspace.
  E: The rat looks kind of hungry. I bet he wants to eat something. Ask him what.
  C: What do you want?
  R: Huh?
  C: What do you wanna eat?
  R: Is that pepperoni pizza over there? I’ll have some of that.
  E: I bet the rat wants someone to brush his teeth for him. Ask him who.
  C: Who do you want to brush your teeth?

10 Analyzing the results
Find the kids who
  Produced both subject and object questions.
  Produced wanna sometimes.
H1: Kids know that want to cannot contract to wanna over a (subject) wh-trace.
H0: They don’t.
Expectation given H0 would be that kids would not distinguish subjects and objects; they would be as likely to contract in one case as in the other.
If kids contract often with objects and never with subjects, that points to H1. We could reject H0.
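
The comparison described on this slide can be sketched in Python. The counts below are invented for illustration, and the one-sided Fisher exact test is an assumed choice of analysis, not one the slides prescribe. It also treats every utterance as independent, which pooled data from several kids would not strictly satisfy.

```python
from math import comb

def contraction_rate(contracted, total):
    """Proportion of eligible questions in which the kid said 'wanna'."""
    return contracted / total

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table
           contracted   not contracted
    object     a             b
    subject    c             d
    It is the probability, with the table margins fixed, of seeing at
    least `a` contractions in the object row if H0 is true.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(row2, col1 - k) / denom
    return p

# Hypothetical counts: 12 of 20 object questions contracted, 0 of 20 subject questions.
object_rate = contraction_rate(12, 20)
subject_rate = contraction_rate(0, 20)
p_value = fisher_one_sided(12, 8, 0, 20)
print(object_rate, subject_rate, p_value)
```

A very small p-value here would mean that a split this lopsided is unlikely if kids contract equally often in both question types, which is what H0 predicts.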

11 Testing interpretation
Do kids assign the same meanings to a sentence as adults? (More meanings? Fewer meanings?)
Constraints on meaning (e.g., Binding Theory)
The Truth Value Judgment task is a popular way to approach this.
Pros:
  Fun for the kids.
  Minimal extra cognitive demands.
  Gets at alternative meanings an act-out task can’t reliably exclude.
Con: A trial takes a long time, so not many data points are collected.

12 TVJ
The idea: Set up a context by telling a story. Provide a test sentence which is either true or false of the situation (and have the puppet say it). Kid then either agrees with the puppet and rewards it, or disagrees and punishes it. If the puppet is wrong, the kid is asked “What really happened?”
Also: Kids often like the puppet to be right, and will more readily agree with the puppet.
So: stack the deck against H1, and have adult-impossible readings correspond to “yes” responses.

13 Principle C, for example.
Jumping competition (Crain & Thornton 1998)
This is a story about a jumping competition. The judge is Robocop. Last year he won the jumping competition, so this year he gets to be judge. This year, these guys, Cookie Monster, the Troll, and Grover are in the jumping competition. They have to try and jump over this log, the barrels, and the benches over here.
R: The winner of the competition gets a great prize: colored pasta! See, it’s in this barrel right here.

14
R: Troll, you jumped very well. You didn’t crash into anything at all. You could be the winner. But let me judge Grover before I decide.
  Now Troll winning is a possibility.
R: Grover, your jumps were very good, too. You didn’t knock anything down, and you were also very fast. So, I think you were the best jumper. You win the prize, this colored pasta. Well done, Grover. Great job!
  Robocop will remain by Grover as a reminder.

15 Against all odds
T: No, Robocop, you’re wrong! I am the best jumper. I think I should get the prize. I’m going to take some colored pasta for myself.
K: Let me try to say what happened. That was a story about Robocop, who was the judge, and Cookie Monster, and Grover, and there was the Troll. I know one thing that happened. He said that the Troll is the best jumper.
C: No!! Bad Kermit. Eat this rag.
Yet “he_i said that Troll_i is the best jumper” is true.

16 TVJ
Distinguishing meaning 1 (disallowed by adult constraint) and meaning 2 (allowed by adult constraint).
Test sentence should be true on meaning 1, false on meaning 2.
Child judges puppet’s report to be true (reward) or false (punishment).
Evidence for meaning 1 should be acted out last.
Linguistic antecedent for meaning 1 should be mentioned (by puppet) last.
What really happened?
Ensure that the test sentence is relevant; it must be clear why it is true or false (condition of plausible dissent).

17 Experimental design
What we’re trying to determine is the degree to which variables in the situation affect one another.
  Does a Principle C configuration preclude a certain interpretation?
  Does a subject wh-extraction preclude wanna contraction?
Independent variables are the presumed causal variables.
Dependent variables are the presumed caused variables.
Nuisance or confounding variables are other factors that may introduce systematic “noise”.

18 Between- and within-subjects
Between-subjects designs vary independent variables with the subjects, so each subject represents one of the values (levels) of the independent variable.
  Age, for example (“Cross-sectional”).
Within-subjects designs vary independent variables for each subject, so each subject sees all of the levels of the independent variable.
  Subject extraction and object extraction, for example.
  Age, for another (“Longitudinal”).

19 Task considerations
Ideally, we want test items to be distinguished by just the factor we’re looking at.
This is important because other things may play a role and may confound the result.
If we find that kids are slower on:
  Who did Pat say met Chris?
than on:
  Who met Chris?
can we conclude that it takes more time to process a longer-distance extraction?
Well, it could just take longer because there are more words. We need to rule that out if we want to conclude that it has to do with long-distance extraction.

20 Things that matter
Performing the task on the test item changes the subject.
  Present the same item to them, they’ll remember, it’ll affect how they act.
  Seeing a pattern in all of the items they get may lead to an irrelevant strategy.
Items should include controls.
  Ensure that the subjects are performing the task.
  Rule out confounding variables.
Fillers should (often) be included.
  Irrelevant items to mask the actual goal of the experiment.
Items should be presented in different orders, ruling out another confounding variable.

21 Things that matter
The instructions given matter a lot. Is the task clear?
  Circle 1 for grammatical, 5 for ungrammatical.
  Circle 1 if the sentence sounds ok, 5 if you would never use it.
  Give the puppet a cookie if his sentence makes sense, and give him a rag if his sentence is silly.
Practice with feedback
  To confirm that the subjects understand the task (and are comfortable that they do), run a couple of practice trials.
  Practice items should not be test items.
  Should be relatively easy.

22 Things that matter
Balance the responses
  If a “no box” will be considered to have scored perfectly, there is a huge uncontrolled confound.
  If you are testing for obliviousness to a constraint, but obliviousness would yield all “yes” responses (or a big preponderance), subjects may start to “second guess” themselves.
  Fillers/controls are for this.
Balance the items
  There should be the same number of items at each level of your independent variable(s).
  This maximizes the power of statistical analysis later.
  If a subject misses one, not a huge problem, but design it as a nice “square” if you can.
Balance the conditions
  Eliminate confounds.
  Lexical items (Known? Frequent? Ambiguous? Long?)

23 Conditions
At the outset, we need to define what we’re going to test for.
Suppose we’re going to do a simple test of the that-trace effect, with some kind of acceptability judgment task (for adults, say).
The question is: are sentences that violate the that-trace filter worse than those that don’t?
  Who did John say that left?
  Which capybara did Madonna meet on Mars?

24 Confounds
Controlling for confounds is one of the most important things you have to do.
That-trace filter violations are not the only things that differentiate these sentences.
  Who did John say that left?
  Which capybara did Madonna meet on Mars?

25 Confounds
  Who did John say that left?
  Which capybara did Madonna meet on Mars?
Differences in lexical frequency can have a big effect on processing difficulty/time.
Differences in plausibility can have a big effect on ratings from subjects.
Differences in length can conceivably play a role.
Differences in structure can have an effect.

26 Confounds
  Who did John say that left?
  Which capybara did Madonna meet on Mars?
The point is: If you find that one sentence is judged worse than the other, we’ve learned nothing. We have no idea to what extent the that-trace violation played a role in the difference.

27 Confounds
You want to do everything you can to be testing exactly what you mean to be testing for.
We can’t control frequency, familiarity, plausibility very reliably, but we can control for them to some extent.
  Who did John say that left?
  Who did John say left?
Keep everything the same and at least they don’t differ in structure, frequency, plausibility; only in that-trace. (Well, and here, length.)
However, note that length now works against that-trace, unless shorter sentences are harder.

28 Conditions
To start, we might say we want to test two conditions:
  Sentences with a that-trace violation
  Sentences with no that-trace violation
But we can’t build these without a length confound: holding everything else constant, we still have one more word (that) in the that-trace case. How do we solve this?
How can we show that the effect of the extra word that isn’t responsible for the overall effect?

29 Conditions
The trick we’ll use is to have a second set of conditions, testing only the exact length issue. There’s no that-trace problem in object questions, so we can compare:
  Who did John say Mary met?
  Who did John say that Mary met?
to see how the difference compares to:
  Who did John say met Mary?
  Who did John say that met Mary?

30 Factors
We now have two “factors”: our sentences differ in terms of
  subject vs. object question
  presence vs. absence of that
When we analyze the result, we can determine the extent of the influence of the second factor by looking at the object condition and comparing it to the (disproportionately larger) effect of the presence of that in the subject condition.

31 2x2 factorial design
Often this is drawn in a table, with each factor on a different dimension.
This is known as a 2x2 factorial design.

                      without that                   with that
Subject extraction    Who do you think likes John?   Who do you think that likes John?
Object extraction     Who do you think John likes?   Who do you think that John likes?
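
The logic of comparing the effect of that across the two extraction types can be sketched as a difference of differences. The mean ratings below are invented for illustration, and the 1 (good) to 4 (bad) scale is an assumption. A real analysis would also need a statistical test, such as a two-way ANOVA, to decide whether the interaction is reliable.

```python
# Hypothetical mean ratings per cell of the 2x2 design (1 = good, 4 = bad).
ratings = {
    ("subject", "without that"): 1.4,
    ("subject", "with that"): 3.5,
    ("object", "without that"): 1.3,
    ("object", "with that"): 1.6,
}

def that_effect(extraction):
    """How much worse a sentence is rated when 'that' is present."""
    return ratings[(extraction, "with that")] - ratings[(extraction, "without that")]

# The object row estimates the cost of the extra word alone.
length_cost = that_effect("object")

# Whatever the subject row shows beyond that cost is the part
# attributable to the that-trace configuration itself.
interaction = that_effect("subject") - length_cost

print(round(length_cost, 2), round(interaction, 2))
```

If the interaction were near zero, the extra word alone would account for the lower ratings, and there would be no evidence for a that-trace effect.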

32 Context
It turns out that the context also seems to have an effect on people’s ratings of sentences.
What comes before can color your subjects’ opinions. This too needs to be controlled for.
One aspect of this is that we generally avoid showing a single subject two versions of the same sentence (more relevant when they’re more distinctive than the John and Mary sentences); the reaction to the second viewing may be based a lot on the first one.
Another is that you want to give the sentences in a different order to different subjects.
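
One common way to meet both requirements is a Latin square of presentation lists plus a per-subject shuffle. The sketch below is an assumed implementation, not one given in the slides: the condition labels, the function names, and the use of the subject ID as a random seed are all hypothetical.

```python
import random

CONDITIONS = [
    "subject/without that",
    "subject/with that",
    "object/without that",
    "object/with that",
]

def latin_square_list(n_items, list_index):
    """Build one presentation list.

    Each item appears once per list, in a single condition, and
    rotating list_index moves every item to a different condition.
    """
    return [
        (item, CONDITIONS[(item + list_index) % len(CONDITIONS)])
        for item in range(n_items)
    ]

def presentation_order(trials, subject_id):
    """Shuffle trials reproducibly, seeded by the subject's ID."""
    rng = random.Random(subject_id)
    order = list(trials)
    rng.shuffle(order)
    return order

# Subject 7 gets list 3 (7 % 4) in their own random order.
trials = latin_square_list(n_items=8, list_index=7 % len(CONDITIONS))
print(presentation_order(trials, subject_id=7))
```

With eight items and four lists, every subject sees each item once and each condition twice, and across the four lists every item occurs in every condition.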

33 Strategy
You also don’t want your subjects to “catch on” to what you’re testing for. They will often see that they’re getting a lot of sentences with a particular structure and start responding to them based on their own theory of whether the sentence should be good or not, no longer performing the task.
Nor do you want to include people who seem to simply have a crazy grammar (or more likely just aren’t understanding or doing the task).

34 Fillers
The solution to both problems is traditionally to use “fillers”, sentences which are not really part of the experiment.
These can provide a baseline to show that a given subject is behaving “normally” and can serve to obscure the real “test items.”
There’s no answer to “how many fillers should there be?” but it shouldn’t be fewer than the test items, and probably a 2:1 (filler:test item) ratio is a good idea.
Fillers can’t be all good! About half should be bad.

35 Instructions and practice
Another vital aspect of this procedure is to be sure that the subjects understand the task that they are supposed to be performing (and all in the same way).
The wordings of the instructions and the rating scales are very important, and it’s a good idea to give subjects a few “practice” items before the test begins (clear cases for which the answers are provided).

36 Instructions
“Is the sentence grammatical?” is not a good instruction.
  The closest the naïve subject can come to “grammatical” will probably be to evaluate based on prescriptive rules learned in grammar classes; the term does not have the same meaning in common usage.
“Is this a good sentence?” also has problems.
  I’d never say that, I’d say it another way.
  That could never happen.

37 Numerical / category ratings
How do you ask people to judge?
Good/bad
  Forces a choice. For anything other than “certainly good” and “certainly bad” there’s a chance that it doesn’t reflect the subject’s actual opinion: there is no differentiation between “great!” and “well, kind of ok”.
Good/neutral/bad
  Neutral also tends to get used for “I can’t decide”, which is different from “I’m confident it has an in-between status” (doesn’t change much if you call it “in-between”).

38 Numerical / category ratings
Rate the sentence: (good) 1 2 3 4 5 (bad)
  Some people will never use the ends of the scale, likely to confound certainty with acceptability. Also, for certain applications, “3” is unusable.
Rate the sentence: (good) 1 2 3 4 (bad)
  Can be treated as a categorical judgment, may be able to factor out some personality aspects. This is the one I tend to like best.

39 Online tasks
The nice thing about an online experiment is it to some extent takes it “out of their hands.” The subject simply reacts, and we time it.
Nevertheless, it is still important to ensure that the subject is performing the task, paying attention.
  This can often be addressed by questions about the sentence that they must answer afterwards.
  Feedback can strengthen the motivation.

41 Descriptive, inferential
Any discussion of statistics anywhere (± a couple) seems to begin with the following distinction:
Descriptive statistics
  Various measures used to describe/summarize an existing set of data. Average, spread, …
Inferential statistics
  Similar-looking measures, but aiming at drawing conclusions about a population by examining a sample.

42 Central tendency and dispersion
A good way to summarize a set of numbers (e.g., reaction times, test scores, heights) is to ascertain a “usual value” given the set, as well as some idea of how far values tend to vary from the usual.
Central tendency:
  mean (average), median, mode
Dispersion:
  Range, variance (S²), standard deviation (S)
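
As a small illustration, these measures can be computed with Python’s standard `statistics` module. The reaction times are invented, and treating S² as the sample variance (dividing by n − 1) is an assumption, since the slide does not say which version it means.

```python
import statistics

# Hypothetical reaction times in milliseconds (invented for illustration).
reaction_times = [520, 540, 540, 580, 610, 650, 900]

# Central tendency
mean_rt = statistics.mean(reaction_times)
median_rt = statistics.median(reaction_times)
mode_rt = statistics.mode(reaction_times)

# Dispersion
value_range = max(reaction_times) - min(reaction_times)
variance = statistics.variance(reaction_times)  # sample variance, divides by n - 1
std_dev = statistics.stdev(reaction_times)      # square root of the sample variance

print(mean_rt, median_rt, mode_rt)
print(value_range, variance, round(std_dev, 1))
```

The single slow response of 900 ms pulls the mean above the median, which is one reason to look at more than one measure of central tendency.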

43 Data points relative to the distribution: z-scores
Once we have the summary characteristics of a data set (mean, standard deviation), we can describe any given data point in terms of its position relative to the mean and the distribution using a standardized score (the z-score).
The z-score is defined so that 0 is at the mean, -1 is one standard deviation below, and 1 is one standard deviation above:
  z = (x − mean) / S
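
A minimal sketch of this standardization, using the usual definition z = (x − mean) / S. The data are invented, and using the sample standard deviation for S is an assumption.

```python
import statistics

def z_score(x, mean, s):
    """Position of x relative to the mean, in standard deviation units."""
    return (x - mean) / s

def z_scores(values):
    """Standardize every value against the set's own mean and sample SD."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [z_score(x, mean, s) for x in values]

# Hypothetical reaction times in ms; mean 600, sample standard deviation 100.
reaction_times = [500, 500, 600, 700, 700]
print(z_scores(reaction_times))

# A new observation of 750 ms lies 1.5 standard deviations above the mean.
print(z_score(750, 600, 100))
```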

44 Type I and Type II errors
As a reminder, as we evaluate data sampled from the world to draw conclusions, there are four possibilities for any given hypothesis:
  The hypothesis is (in reality) either true or false.
  We conclude that the hypothesis is true or false.
This leaves two outcomes that are correct, and two that are errors.

           Innocent        Guilty
Convict    Type I error    Correct
Acquit     Correct         Type II error

45 Type I and Type II errors The risk of making a Type I error is counterbalanced by the risk of making Type II errors; being safer with respect to one means being riskier with respect to the other. One needs to decide which is worse, what the acceptable level of risk is for a Type I error, and establish a criterion: a threshold of evidence that is needed in order to decide to convict. You may sometimes encounter Type I errors referred to as α errors, and Type II errors as β errors.
          Innocent       Guilty
Convict   Type I error   Correct
Acquit    Correct        Type II error

46 Binomial / sign tests If you have an experiment in which each trial has two possible outcomes (coin flip, rolling a 3 on a die, kid picking the right animal out of 6), you can do a binomial test. Called a sign test if success and failure have equal probabilities (e.g., coin toss). Hsu & Hsu’s (1996) example: Kid asked to pick an animal in response to stimulus sentence. Picking the right animal (of 6) serves as evidence of knowing the linguistic phenomenon under investigation. Random choice would yield a 1 out of 6 chance of getting it right. Success: probability .17. Failure: probability 1 - .17 = .83. Chance of getting it right 4 times out of 5 by guessing = .0035. Chance of getting it right all 5 times = .0001.

47 Hypothesis testing Independent variable is one which we control. Dependent variable is the one which we measure, and which we hypothesize may be affected by the choice of independent variable. Summary score: What we’re measuring about the dependent variable. Perhaps number of times a kid picks the right animal. H₀: The independent variable has no effect on the dependent variable. A grammatically indicated animal is not more likely to be picked. H₁: The independent variable does have an effect on the dependent variable. A grammatically indicated animal is more likely to be picked.

48 Hypothesis testing H₀: The independent variable has no effect on the dependent variable. A grammatically indicated animal is not more likely to be picked. H₁: The independent variable does have an effect on the dependent variable. A grammatically indicated animal is more likely to be picked. If H₀ is true, the kid has a 1/6 chance (0.17) of getting one right in each trial. So, given 5 tries, that’s a 40% chance (.40) of getting exactly one. But the chance of getting exactly 3 is about 3% (0.03), and the chance of getting exactly 4 is about .4% (0.0035). So, if the kid gets 3 of 5 right, the likelihood that this came about by chance (H₀) is slim. =BINOMDIST(3, 5, 0.17, false) yields 0.03: 3 is the number of successes, 5 is the number of tries, 0.17 is the probability of success per try. True instead of false would give the probability that at most 3 were successes.
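
For readers without a spreadsheet at hand, the same numbers can be reproduced with a short Python sketch. The helper names `binom_pmf` and `binom_cdf` are made up here; they are meant to mirror BINOMDIST with `false` and `true` as the last argument, respectively.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Probability of at most k successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

P_GUESS = 0.17  # rounded 1/6, as on the slide

for k in (1, 3, 4, 5):
    print(k, round(binom_pmf(k, 5, P_GUESS), 4))

# Counterpart of =BINOMDIST(3, 5, 0.17, true)
print(binom_cdf(3, 5, P_GUESS))
```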

49 Criteria In hypothesis testing, a criterion is set for rejecting the null hypothesis. This is a maximum probability that, if the null hypothesis were true, we would have gotten the observed result. This has arbitrarily been (conventionally) set to 0.05. So, if the probability p of seeing what we see if H₀ were true is less than 0.05, we reject the null hypothesis. If the kid gets 3 animals right in 5 trials, p = 0.03; that is, p < 0.05, so we reject the null hypothesis.
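
The decision rule can be sketched as follows. One assumption that goes slightly beyond the slide: this version sums the probability of 3 or more successes (a result at least as extreme as the one observed) rather than exactly 3. With the slide's numbers the conclusion is the same either way. The names `prob_at_least`, `ALPHA`, and `P_GUESS` are illustrative.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

ALPHA = 0.05    # conventional criterion
P_GUESS = 0.17  # chance of picking the right animal at random

p_value = prob_at_least(3, 5, P_GUESS)
reject_null = p_value < ALPHA
print(round(p_value, 3), reject_null)
```

By the same rule, 2 correct out of 5 would not be enough to reject the null hypothesis, since guessing produces that result or better roughly a fifth of the time.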

50 Measuring things When we go out into the world and measure something like reaction time for reading a word, we’re trying to investigate the underlying phenomenon that gives rise to the reaction time. When we measure reaction time of reading I vs. they, we are trying to find out if there is a real, systematic difference between them (such that I is generally faster).

51 Measuring things Does it take longer to read I than they? Suppose that in principle it takes Pat A ms to read I and B ms to read they. Except sometimes his mind wanders, sometimes he’s sleepy, sometimes he’s hyper-caffeinated. Does it take longer for people to read I than they? Some people read/react slower than Pat. Some people read/react faster than Pat.

52 Normally… Many things we measure, with their noise taken into account, can be described (at least to a good approximation) by this “bell-shaped” normal distribution. Often as we do statistics, we implicitly assume that this is the case…

53 Properties of the normal distribution A normal distribution can be described in terms of two parameters. μ = mean. σ = standard deviation (spread).

54 Interesting facts about the standard deviation About 68% of the observations will be within one standard deviation of the population mean. About 95% of the observations will be within two standard deviations of the population mean. Percentile (mean 80, score 75, stdev 5): 15.9
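
These figures can be checked with Python's standard-library `statistics.NormalDist` (available from Python 3.8). The sketch below uses the slide's own example of mean 80 and standard deviation 5; the variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=80, sigma=5)

# Share of observations within one and two standard deviations of the mean.
within_one = dist.cdf(85) - dist.cdf(75)
within_two = dist.cdf(90) - dist.cdf(70)

# Percentile of a score of 75: the share of the distribution below it.
percentile_75 = dist.cdf(75) * 100

print(round(within_one, 2), round(within_two, 2), round(percentile_75, 1))
```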

55 Inferential statistics For much of what you’ll use statistics for, the presumption is that there is a distribution out in the world, a truth of the matter. If that distribution is a normal distribution, there will be a population mean (μ) and standard deviation (σ). By measuring a sample of the population, we can try to guess μ and σ from the properties of our sample.

56 A common goal Commonly what we’re after is an answer to the question: are these two things that we’re measuring actually different? So, we measure for I and for they. Of the measurements we’ve gotten, I seems to be around A, they seems to be around B, and B is a bit longer than A. The question is: given the inherent noise of measurement, how likely is it that we got that difference just by chance?

57 So, more or less, … If we knew the actual mean of the variable we’re measuring and the standard deviation, we could be 95% sure that any given measurement we do will land within two standard deviations of that mean, and 68% sure that it will be within one. Of course, we can’t know the actual mean. But we’d like to.

58 Estimating If we take a sample of the population and compute the sample mean of the measures we get, that’s the best estimate we’ve got of the population mean. =AVERAGE(A2:A10) To estimate the spread of the population, we use a number related to the number of samples we took and the variance of our sample. =STDEV(A2:A10) If you want to describe your sample (that is, if you have the entire population sampled), use STDEVP instead.
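
The difference between the two spreadsheet functions can be seen in a small Python sketch. In the standard `statistics` module, `stdev` divides by n - 1 (like STDEV, for estimating a population from a sample) and `pstdev` divides by n (like STDEVP, for describing a complete population). The data are invented for illustration.

```python
import statistics

# Hypothetical measurements (invented for this sketch).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

estimate_of_mean = statistics.mean(sample)    # like =AVERAGE(...)
estimate_of_spread = statistics.stdev(sample)  # like =STDEV(...), divides by n - 1
descriptive_spread = statistics.pstdev(sample)  # like =STDEVP(...), divides by n

print(estimate_of_mean, estimate_of_spread, descriptive_spread)
```

The estimate based on n - 1 is always a little larger, which compensates for the fact that a sample tends to understate the spread of the population it came from.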

59 t-tests Take a sample from the population and measure it. Say you took n measurements. Population estimates: μ_M = AVERAGE(sample), σ_M = SQRT(VAR(sample)/n). Your hypotheses determine what you expect your population mean to be if the null hypothesis is true. We’re actually considering variability in the sample means here: what is the mean of the sample means you expect to get, and what is the variance in those means? You look at the distance of the sample mean from the estimated population mean (of sample means) and see if it’s far enough away to be very unlikely (e.g., p<0.05) to have arisen by chance.

60 t-tests Does caffeine affect heart rate (example from Loftus & Loftus 1988)? Sample 9 people, measure their heart rate pre- and post-caffeination. The measure for each subject will be the difference score (post - pre). This is a within-subjects design. Estimate the population of sample means: μ_M = AVERAGE(B1:B10) = 4.44, σ_M = SQRT(VAR(B1:B10)/COUNT(B1:B10)) = 1.37. The t-score (like the z-score) is scaled (here, against the estimated standard deviation), giving a measure of how “extreme” the sample mean was that we found. If the t-score (here 4.44/1.37 = 3.24) is higher than the criterion t (2.31, based on “degrees of freedom” = n-1 = 8) and desired α-level (0.05), we can reject the null hypothesis: caffeine affects heart rate.
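
The calculation on this slide can be sketched in Python. The nine difference scores below are invented for illustration and are not the Loftus & Loftus data, so the resulting t-score differs from the slide's 3.24. The criterion value 2.31 is taken from the slide (8 degrees of freedom, α = 0.05).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical post-minus-pre heart rate differences for 9 subjects.
diffs = [3, 6, -1, 4, 8, 2, 5, 0, 7]

n = len(diffs)
mean_diff = mean(diffs)                    # like =AVERAGE(...)
std_error = sqrt(variance(diffs) / n)      # like =SQRT(VAR(...)/COUNT(...))

# Under the null hypothesis the expected mean difference is 0.
t_score = (mean_diff - 0) / std_error

CRITERION_T = 2.31  # from the slide: df = n - 1 = 8, alpha = 0.05
reject_null = t_score > CRITERION_T
print(round(t_score, 2), reject_null)
```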

61 t-tests: 2 sample means The more normal use of a t-test is to see if two sample means are different from one another. H₀: μ₁ = μ₂. H₁: μ₁ > μ₂. This is a directional hypothesis: we are investigating not just that they are different, but that μ₁ is more than μ₂. For such situations, our criterion t score should be one-tailed. We’re only looking in one direction, and μ₁ has to be sufficiently bigger than μ₂ to conclude that H₀ is wrong.

62 Tails If we are taking as our alternative hypothesis (H₁) that two means simply differ, then they could differ in either direction, and so we’d conclude that they differ if the one were far out from the other in either direction. If H₁ is that the mean will increase, then it is a directional hypothesis, and then a one-tailed criterion is called for.

63 t-tests in Excel If you have one set of data in column A, and another in column B, =TTEST(A1:A10, B1:B10, 1, type). The third argument is the number of tails (1 for a one-tailed test, 2 for two-tailed). Type is 1 if paired (each row in column A corresponds to a row in column B), 2 if independently sampled but with equal variance, 3 if independently sampled but with unequal variance. Paired is generally better at keeping variance under control.
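
To see why pairing helps keep variance under control, here is a rough Python sketch that computes the t statistic two ways on the same invented data: once treating the rows as pairs (type 1) and once as independent samples with equal variance (type 2). It computes only the t statistics, not the p-value that TTEST returns. The function names and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance

def paired_t(a, b):
    """t statistic on the row-by-row differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))

def pooled_t(a, b):
    """t statistic for independent samples, assuming equal variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical scores: subjects differ a lot from each other,
# but each subject's two scores are close together.
col_a = [10, 12, 14, 16, 18]
col_b = [9, 10, 13, 14, 17]

t_paired = paired_t(col_a, col_b)
t_pooled = pooled_t(col_a, col_b)
print(round(t_paired, 2), round(t_pooled, 2))
```

In this example the between-subject variation swamps the effect in the independent-samples version, while the paired version removes it by looking only at each subject's difference.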

64 ANOVA Analysis of Variance (ANOVA), finding where the variance comes from. Suppose we have three conditions and we want to see if the means differ. We could do t-tests, condition 1 against condition 2, condition 1 against condition 3, condition 2 against condition 3, but this turns out to be not as good: each additional test is another chance of making a Type I error.

65 Finding the variance The idea of the ANOVA is to divide up the total variance in the data into parts (to “account for the variance”): Within group variance (variance that arises within a single condition). Between group variance (variance that arises between different conditions).
ANOVA            SS   df   MS   F      p       Fc
between groups   …    5    ..   2.45   0.045   2.39
within groups    …    54   ..
total
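
The partition of variance can be sketched by hand in Python for a one-way design. The three groups below are invented and are unrelated to the table on the slide; the function name `one_way_anova` is illustrative. The sketch stops at the F ratio and does not look up p or the critical F.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each value is from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

# Hypothetical scores in three conditions.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_between, ss_within, df_between, df_within, f_ratio = one_way_anova(groups)
print(ss_between, ss_within, df_between, df_within, f_ratio)
```

Each mean square (MS) is a sum of squares divided by its degrees of freedom, and F is the ratio of the between-groups MS to the within-groups MS.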

66 Confidence intervals As well as trying to decide if your observed sample is within what you’d expect your estimated distribution to provide, you can run this logic in reverse as well, and come up with a confidence interval: given where you see the measurements coming up, you can be about 68% confident that the population mean is within 1 standard error of the sample mean, and about 95% confident that it is within 2, so the more measurements you have the better guess you can make. A 95% CI like 209.9 < µ < 523.4 means “we’re 95% confident that the real population mean is in there”. =CONFIDENCE(0.05, STDEV(sample), COUNT(sample)) gives the half-width of the 95% interval, to be added to and subtracted from the sample mean.
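
A minimal Python sketch of the same calculation, assuming (as Excel's CONFIDENCE does) a normal distribution for the sample mean. The function name `confidence_margin` and the data are made up for illustration. With very small samples a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_margin(alpha, std_dev, n):
    """Half-width of the (1 - alpha) confidence interval for the mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * std_dev / sqrt(n)

# Hypothetical measurements (invented for this sketch).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

margin = confidence_margin(0.05, stdev(sample), len(sample))
lower = mean(sample) - margin
upper = mean(sample) + margin
print(round(lower, 2), round(upper, 2))
```

Because the margin shrinks with the square root of n, quadrupling the number of measurements halves the width of the interval.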

67 Correlation and Chi square Correlation between two measured variables is often measured in terms of (Pearson’s) r. If r is close to 1 or -1, the value of one variable can predict quite accurately the value of the other. If r is close to 0, predictive power is low. Chi-square test is supposed to help us decide if two conditions/factors are independent of one another or not. (Does knowing one help predict the effect of the other?)
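
Pearson's r can be computed by hand as sketched below. The function name `pearson_r` and the data are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and the variation of each on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_perfect = pearson_r(xs, [2 * x for x in xs])  # y fully determined by x
print(round(r, 2), round(r_perfect, 2))
```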


