The Data Collection and Statistical Analysis in IB Biology
John Gasparini, The Munich International School
Part IV – Hypothesis testing with the Chi-Squared Test


The Chi-Squared Test: So far we have talked about Standard Deviation as a way to measure variability in a dataset, and hypothesis testing using t-tests.

T-tests and Standard Deviation are used to analyze measurement data which, in theory, are continuously variable. Between a measurement of, say, 2 grams and 3 grams there is a continuous range from 2.0001 to 2.9999 grams (using the balance pictured on the right). Data consisting of measurements like the mass of a butterfly, or the length of a proboscis, should be analyzed with S.D. values and t-tests.

The Chi-Squared Test works with… ENUMERATION DATA

Counts, or enumeration data, are discontinuous (1 brown butterfly, 2 brown butterflies, 3, 4, etc.) and must be treated differently from continuous data. Often, the appropriate analysis is the Chi-Squared (χ²) Test, which we use to test whether the numbers of individuals in different categories fit a null hypothesis (an expected count or ratio of some sort).

A Simple Example of the Chi-Squared Test Applied: Suppose that the ratio of male to female monarch butterflies is exactly 1:1 at birth. In theory, for every 1000 Monarch butterflies born in the US and Canada, 500 are male and 500 are female. 500 males to 500 females is the EXPECTED number of butterflies of each sex. 1 to 1 is the expected sex ratio for populations of this species.

Monarchs are especially noted for their yearly migration over long distances. In North America they make massive southward migrations starting in August until the first frost. There is a northward migration in the spring. Female monarchs lay eggs for the next generation during these migrations.

The length of these journeys exceeds the normal lifespan of most monarchs, which is less than two months for butterflies born in early summer. The last generation of the summer enters into a nonreproductive phase known as diapause, which may last seven months or more. The overwintering generation generally does not reproduce until it has left the overwintering site sometime in February and March and is on its way North. During diapause, butterflies fly to one of many overwintering sites in Mexico or Southern California.

In one region of Northern Texas, after a long spring migration north from Mexico into the US, a sex ratio of 443 males to 557 females is found when the returning migrant butterflies are captured. Is this a significant departure from expectation? Do these numbers suggest that females are surviving at greater rates as they migrate and overwinter? The Chi-Squared Test can help us analyze these data and determine whether the difference in this enumeration data is significant.

Is this a significant departure from expectation? We proceed with a Chi-Squared Test as follows (but note that we are going to overlook a very important point that we shall deal with later).

We expect that the sex ratio will remain 1:1 during overwintering and migration, so for every 1000 Monarchs that return to North Texas, 500 should be male and 500 should be female. These are our EXPECTED VALUES (E).

The observed ratio in the returning population is 443 males to 557 females. These are our OBSERVED VALUES (O).

To conduct a Chi-Squared Test by hand, set out a table as shown below, with the "Observed" numbers and the "Expected" numbers (i.e. our null hypothesis).

Step 1: Calculate (Observed – Expected)
Step 2: Calculate (Observed – Expected)²
Step 3: Calculate (Observed – Expected)² / Expected
Step 4: Add up your (O – E)² / E values

This total of the (O – E)² / E values is the χ² value.

                         Female     Male    Total
Observed Numbers (O)        557      443     1000
Expected Numbers (E)        500      500     1000
O – E                        57      -57        0
(O – E)²                   3249     3249
(O – E)² / E              6.498    6.498

Total (χ²) = 12.99
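The four hand-calculation steps can also be sketched in a few lines of Python. This is only an illustration using the counts from the table; the function and variable names are made up for this example and are not part of the original slides. The unrounded result is 12.996, which the slide quotes as 12.99.

```python
# Observed and expected counts from the monarch example: [female, male].
observed = [557, 443]
expected = [500, 500]


def chi_squared(observed_counts, expected_counts):
    """Steps 1-4: sum (O - E)^2 / E over every category."""
    total = 0.0
    for o, e in zip(observed_counts, expected_counts):
        difference = o - e              # Step 1: O - E
        squared = difference ** 2       # Step 2: (O - E)^2
        total += squared / e            # Steps 3 and 4: divide by E, add up
    return total


chi2 = chi_squared(observed, expected)
print(round(chi2, 3))  # 12.996
```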

Now, as was the case with the t-test, we need to compare our χ² value to a table of "critical values" to see whether there is a significant difference between the Observed and Expected counts.

The number of Degrees of Freedom in conducting a χ² test is simply the number of count categories in our Observed and Expected data minus one. We have only two categories in this case: Male and Female.

Df = Count Categories – 1
Df = 2 – 1 = 1

In Biology we want to be conservative in drawing our conclusions. Therefore it is convention to run hypothesis tests at a 95% confidence level, that is, at a probability value (P) of 0.05. In the table of critical values, the entry for 1 degree of freedom at P = 0.05 is 3.841. 3.841 is our Critical Value!

If our calculated value of χ² is less than the critical value, then we have no significant difference from the expectation. ∴ Accept the Null Hypothesis (H₀): there is no difference between the number of males versus females returning from overwintering and migration and the expected 1:1 sex ratio.

If our calculated value of χ² exceeds the critical value, then we have a significant difference from the expectation. ∴ Reject the Null Hypothesis (H₀).

3.841 is our Critical Value and 12.99 is our χ² value. Our calculated χ² (12.99) exceeds the tabulated critical value (3.841) for p = 0.05, so here we reject the Null Hypothesis.
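The decision rule can be written as a simple comparison. The sketch below hard-codes the critical value 3.841 (1 degree of freedom, p = 0.05) quoted in the slides; the function name and printed messages are illustrative only.

```python
# Critical value for 1 degree of freedom at p = 0.05, as quoted in the slides.
CRITICAL_VALUE_DF1_P05 = 3.841


def is_significant(chi2_value, critical_value=CRITICAL_VALUE_DF1_P05):
    """Return True when the calculated chi-squared exceeds the critical value."""
    return chi2_value > critical_value


chi2_value = 12.99  # calculated value from the sex-ratio example
if is_significant(chi2_value):
    print("Reject the null hypothesis: significant difference from expectation")
else:
    print("No significant difference from expectation")
```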

Our calculated χ² (12.99) exceeds the tabulated critical value (3.841) for p = 0.05. Our χ² value (12.99) even exceeds the critical value for p = 0.001! This shows an EXTREME DEPARTURE FROM EXPECTATION! It is still possible that we could have got this result by chance, with a probability of less than 1 in 1000. But we could be 99.9% confident that some factor leads to a "bias" towards females returning from overwintering and migration. (What is causing this bias towards females, or increased female survival? Greater energy reserves? Less predation pressure? Greater concentration of toxic alkaloids in female body tissue? That's for another experiment to determine…)
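Instead of reading a table, the probability can be computed directly when there is exactly one degree of freedom. The sketch below relies on a standard identity that is not in the slides: for 1 degree of freedom, the chance of a χ² value at least as large as the one observed equals erfc(√(χ²/2)). It uses only Python's standard library, and the function name is hypothetical.

```python
import math


def p_value_one_df(chi2_value):
    """Upper-tail probability of a chi-squared value with 1 degree of freedom.

    Assumes exactly one degree of freedom; other cases need a different formula
    or a statistics library.
    """
    return math.erfc(math.sqrt(chi2_value / 2.0))


p = p_value_one_df(12.99)
print(p < 0.001)  # True: less than a 1 in 1000 chance under the null hypothesis
```

As a sanity check, feeding in the critical value 3.841 returns a probability very close to 0.05.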

WARNING: Please ignore this slide if you are struggling with how to perform a χ² test! But for those of you who are following this well, there is a problem with what we just did!

When there are only two categories (e.g. male/female) or, more correctly, when there is only one degree of freedom, the χ² test should not, strictly, be used. There have been various attempts to correct this deficiency, but the simplest is to apply the "Yates Correction" to our data. To do this, we simply subtract 0.5 from each calculated value of "O – E", ignoring the sign (plus or minus). In other words, an "O – E" value of +5 becomes +4.5, and an "O – E" value of -5 becomes -4.5. To signify that we are reducing the absolute value, ignoring the sign, we use vertical lines: |O – E| – 0.5. Then we continue as usual but with these new (corrected) values: we calculate (|O – E| – 0.5)² and (|O – E| – 0.5)² / E, and then sum the (|O – E| – 0.5)² / E values to get χ².

HA! The Yates Correction only applies when we have two categories (one degree of freedom).

WARNING: Please ignore this slide if you are struggling with how to perform a χ² test!

Here is how we would complete a two-category χ² table with the Yates Correction:

                            Female       Male    Total
Observed Numbers (O)           557        443     1000
Expected Numbers (E)           500        500     1000
O – E                           57        -57        0
|O – E| – 0.5                 56.5       56.5
(|O – E| – 0.5)²           3192.25    3192.25
(|O – E| – 0.5)² / E          6.38       6.38

Total (χ²) = 12.76

Not too big a difference in this instance (12.99 vs. 12.76), but this correction could be important to consider if your data are on the borderline near the critical value!
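As a rough sketch, the Yates Correction changes only one step of the calculation: 0.5 is subtracted from the absolute value of each O – E before squaring. The function name below is illustrative. The unrounded total is 12.769; the slide's 12.76 comes from adding the two rounded 6.38 values.

```python
def chi_squared_yates(observed_counts, expected_counts):
    """Chi-squared with the Yates Correction: sum of (|O - E| - 0.5)^2 / E.

    Intended only for the one-degree-of-freedom case described in the slides.
    """
    return sum(
        (abs(o - e) - 0.5) ** 2 / e
        for o, e in zip(observed_counts, expected_counts)
    )


corrected = chi_squared_yates([557, 443], [500, 500])
print(round(corrected, 3))  # 12.769
```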

Now view this exciting podcast on YouTube on how to conduct a "Chi-Squared Test" using Excel. It's pretty easy to do and will allow you to conduct this hypothesis test very quickly!

Remember: In Excel, the values that are calculated are really probability values (P-values), not χ² values.
P-values > 0.05 = No Sig. Dif!
P-values < 0.05 = Sig. Dif!

http://youtu.be/EOZu6O6i-Zk

The Chi-Squared Test: Multiple Classifications

Suppose that an ecologist studying Monarch populations theorizes that the choice of overwintering sites made by Monarchs might in some way be causing the difference in the sex ratio of returning butterflies. Almost all Monarchs from Eastern Canada and the US overwinter in the Oyamel Trees found in a preserve outside of Angangueo in the state of Michoacan, Mexico.

Oyamel Trees: a type of fir native to the mountains of central and southern Mexico. "This is the tree for me!"

She observes that Female Monarchs tend to be found lower on the branches of the Oyamel Trees. She wants to determine if her observation is statistically significant. So she collects enumeration data. More males roost higher on the trees. Females roost more frequently on the lower, more protected branches.

Monarchs Counted: 380 Females and 480 Males on branches above 5 meters.
Monarchs Counted: 623 Females and 370 Males on branches below 5 meters.

Is there a statistically significant difference in the choice of the height of overwintering roosts of Male vs. Female Monarchs?

Steps in carrying out a Chi-Squared Test with Multiple Classifications:

1. Set out a table as follows, with the classifications (roost heights) as the rows of the table:

                        Female    Male    Total
Roost sites > 5 meters     380     480      860
Roost sites < 5 meters     623     370      993
Totals                    1003     850     1853

2. Decide on the null hypothesis. In this case there is no "theory" that gives us an obvious null hypothesis. For example, we have no reason to suppose that 55% or 75% or any other percentage of Female Monarchs will roost in low tree branches. So the most sensible null hypothesis is that both the male and the female Monarchs will behave similarly and that both types of butterfly will roost equally along the height of the tree. In other words, we will test against the expectation that the same proportion of each sex roosts above and below 5 meters. (Because the row and column totals differ, this is not a strict 1:1:1:1 ratio of counts.) Then, if our data do not agree with this expectation, we will have evidence that females roost in greater frequency on lower branches.

3. Calculate the Expected frequencies, based on the null hypothesis: This step is complicated by the fact that we have different numbers of male and female Monarchs, and different numbers of these sexes above and below 5 meters on the trees. But we can find the expected frequencies (a, b, c and d) by using the grand total (1853) and the column and row totals (see table below).

                                            Female    Male    Total
Roost sites > 5 m   Observed Numbers (O)       380     480      860
                    Expected Numbers (E)         a       b
Roost sites < 5 m   Observed Numbers (O)       623     370      993
                    Expected Numbers (E)         c       d
Column Totals                                 1003     850     1853

3. Calculate the Expected frequencies, based on the null hypothesis, continued…

To find the expected value "a", we know that a total of 860 butterflies roosted > 5 m and that 1003 of the total 1853 butterflies were female. So a = 860 × (1003/1853) = 465.5

Similarly, b = 860 × (850/1853) = 394.5

[Actually, we could have found b simply by subtracting a from the 860 row total, since the expected total must always be the same as the observed total.]

3. Calculate the Expected frequencies, based on the null hypothesis, continued…

To find the expected value "c", we know that a total of 993 butterflies roosted < 5 m and that 1003 of the total 1853 butterflies were female. So c = 993 × (1003/1853) = 537.5

To find the expected value "d", we know that a total of 993 butterflies roosted < 5 m and that 850 of the total 1853 butterflies were male. So d = 993 × (850/1853) = 455.5

[Again, this value could also have been obtained by subtraction.]

                                            Female        Male    Total
Roost sites > 5 m   Observed Numbers (O)       380         480      860
                    Expected Numbers (E)   a = 465.5   b = 394.5    860
Roost sites < 5 m   Observed Numbers (O)       623         370      993
                    Expected Numbers (E)   c = 537.5   d = 455.5    993
Column Totals                                 1003         850     1853
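The expected frequencies a, b, c and d all follow the same pattern: row total × column total / grand total. A minimal Python sketch of that pattern, using the observed counts from the table, is given below. The table layout (rows are roost heights, columns are female then male) and the function name are assumptions made for this example.

```python
# Observed counts. Rows: roost sites > 5 m, roost sites < 5 m.
# Columns: female, male.
observed = [
    [380, 480],
    [623, 370],
]


def expected_counts(table):
    """Expected count for each cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [row_total * column_total / grand_total for column_total in column_totals]
        for row_total in row_totals
    ]


expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
# [465.5, 394.5]
# [537.5, 455.5]
```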

4. Decide the number of degrees of freedom

You might think that there are 3 degrees of freedom (because there are 4 categories). But there is actually one degree of freedom! The reason is that we lose one degree of freedom because we have 4 categories, and we lose a further 2 degrees of freedom because we used two pieces of information to construct our null hypothesis: a column total and a row total. Once we had used these, we would have needed only one data entry in order to fill in the rest of the values (therefore we have one degree of freedom). For a table with r rows and c columns this works out to (r – 1) × (c – 1), here (2 – 1) × (2 – 1) = 1. IF you are confused by this and how you find degrees of freedom, see me!

Of course, with one degree of freedom we must use the Yates Correction (subtract 0.5 from the absolute value of each O – E).
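The counting argument above matches the usual rule for a table of counts, (rows – 1) × (columns – 1), and the earlier single-row case used categories – 1. A tiny sketch of both rules follows; the function names are invented for illustration.

```python
def df_single_classification(n_categories):
    """Degrees of freedom when counts fall into one list of categories."""
    return n_categories - 1


def df_contingency_table(n_rows, n_columns):
    """Degrees of freedom for a table of counts with fixed row and column totals."""
    return (n_rows - 1) * (n_columns - 1)


print(df_single_classification(2))   # male/female sex-ratio example: 1
print(df_contingency_table(2, 2))    # roost height by sex example: 1
```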

5. Run the analysis as usual, calculating O – E, applying the Yates Correction to get |O – E| – 0.5, and then calculating (|O – E| – 0.5)² / E for each category. Sum the (|O – E| – 0.5)² / E values to obtain χ².

                                             Female    Male    Total
Roost sites > 5 m   Observed Numbers (O)        380     480      860
                    Expected Numbers (E)      465.5   394.5      860
                    O – E                     -85.5    85.5
                    Yates: |O – E| – 0.5       85.0    85.0
                    (|O – E| – 0.5)² / E       15.5    18.3
Roost sites < 5 m   Observed Numbers (O)        623     370      993
                    Expected Numbers (E)      537.5   455.5      993
                    O – E                      85.5   -85.5
                    Yates: |O – E| – 0.5       85.0    85.0
                    (|O – E| – 0.5)² / E       13.4    15.9
Column Totals                                  1003     850     1853

Total (χ²) = 63.1

We have calculated a χ² value of 63.1.
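Putting steps 3 to 5 together, the whole 2 × 2 calculation can be sketched in Python as follows. This is an illustrative sketch, not the method used in the Excel podcast; the names are hypothetical, and the Yates Correction is applied because there is one degree of freedom. Using unrounded expected values, the total comes out at roughly 63.1, in line with the table.

```python
# Observed counts. Rows: roost sites > 5 m, roost sites < 5 m.
# Columns: female, male.
observed = [
    [380, 480],
    [623, 370],
]


def chi_squared_2x2_yates(table):
    """Chi-squared for a table of counts, with the Yates Correction applied."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            # Expected count under the null hypothesis of equal behaviour.
            e = row_totals[i] * column_totals[j] / grand_total
            # Yates Correction: reduce |O - E| by 0.5 before squaring.
            total += (abs(o - e) - 0.5) ** 2 / e
    return total


chi2_yates = chi_squared_2x2_yates(observed)
print(round(chi2_yates, 1))
```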

6. Test the χ² value against the table of critical values: We have calculated a χ² value of 63.1. With 1 degree of freedom at p = 0.05, 3.841 is our Critical Value!

If our calculated value of χ² is less than the critical value, then we have no significant difference from the expectation. ∴ Accept the Null Hypothesis (H₀): there is no difference between males and females in the height at which they roost.

If our calculated value of χ² exceeds the critical value, then we have a significant difference from the expectation. ∴ Reject the Null Hypothesis (H₀).

3.841 is our Critical Value and 63.1 is our χ² value. Our calculated χ² (63.1) well exceeds the tabulated critical value (3.841) for p = 0.05, so here we reject the Null Hypothesis.

Our calculated χ² (63.1) exceeds the tabulated critical value (3.841) for p = 0.05. Our χ² value (63.1) even exceeds the critical value for p = 0.001! Again, this shows an EXTREME DEPARTURE FROM EXPECTATION! It is still possible that we could have got this result by chance, with a probability of less than 1 in 1000. But we could be 99.9% confident that some factor leads to a "bias" towards females overwintering in branches lower than 5 m while males overwinter in branches above 5 m.

Now view this even more exciting podcast on YouTube on how to conduct a "Chi-Squared Test with Multiple Categories" using Excel. It's pretty easy to do and will allow you to conduct this hypothesis test very quickly!

Remember: In Excel, the values that are calculated are really probability values (P-values), not χ² values.
P-values > 0.05 = No Sig. Dif!
P-values < 0.05 = Sig. Dif!

http://youtu.be/I2Gi2DN_y4A
