# Z-squared: the origin and use of χ² - or - what I wish I had been told about statistics (but had to work out for myself) Sean Wallis Survey of English.

z-squared: the origin and use of χ² - or - what I wish I had been told about statistics (but had to work out for myself) Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk

Outline
- What is the point of statistics?
  - Linguistic alternation experiments
  - How inferential statistics works
- Introducing z tests
  - Two types (single-sample and two-sample)
  - How these tests are related to χ²
- Comparing experiments and effect size
  - Swing and skew
- Low frequency events and small samples

What is the point of statistics?
- Analyse data you already have (observational science)
  - corpus linguistics
- Design new experiments (experimental science)
  - collect new data, add annotation
  - experimental linguistics in the lab
- Try new methods (philosophy of science)
  - pose the right question
- We are going to focus on z and χ² tests (a little maths)

What is inferential statistics?
- Suppose we carry out an experiment
  - We toss a coin 10 times and get 5 heads
  - How confident are we in the results?
- Suppose we repeat the experiment
  - Will we get the same result again?
- Inferential statistics is a method of inferring the behaviour of future "ghost" experiments from one experiment
  - Infer from the sample to the population
- Let us consider one type of experiment
  - Linguistic alternation experiments

Alternation experiments
- Imagine a speaker forming a sentence as a series of decisions/choices. They can
  - add: choose to extend a phrase or clause, or stop
  - select: choose between constructions
- Choices will be constrained
  - grammatically
  - semantically
- Research question:
  - within these constraints, what factors influence the particular choice?

Alternation experiments
- Laboratory experiment (cued)
  - pose the choice to subjects
  - observe the one they make
  - manipulate different potential influences
- Observational experiment (uncued)
  - observe the choices speakers make when they make them (e.g. in a corpus)
  - extract data for different potential influences
    - sociolinguistic: subdivide data by genre, etc.
    - lexical/grammatical: subdivide data by elements in the surrounding context

Statistical assumptions
- A random sample taken from the population
  - Not always easy to achieve
    - multiple cases from the same text and speakers, etc.
    - may be limited historical data available
  - Be careful with data concentrated in a few texts
- The sample is tiny compared to the population
  - This is easy to satisfy in linguistics!
- Repeated sampling tends to form a Binomial distribution
  - This requires slightly more explanation...

The Binomial distribution
- Repeated sampling tends to form a Binomial distribution
  - We toss a coin 10 times, and get 5 heads
- [Figure, built up over seven slides: frequency F of each outcome x (axis ticks 1, 3, 5, 7, 9) for N = 1, 4, 8, 12, 16, 20 and 24]

Binomial → Normal
- The Binomial (discrete) distribution tends to match the Normal (continuous) distribution
- [Figure: frequency F against x (axis ticks 1, 3, 5, 7, 9), Binomial distribution with Normal curve]
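The match between the two distributions can be checked numerically. The following is a minimal sketch for the coin example (10 tosses, P = 0.5); the function name `binomial_pmf` and the half-unit band used to read a probability off the Normal curve are illustrative choices, not part of the original slides.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k, n, P):
    """Probability of exactly k successes in n trials with success probability P."""
    return comb(n, k) * P**k * (1 - P) ** (n - k)

n, P = 10, 0.5                       # ten tosses of a fair coin
mean = n * P                         # expected number of heads
sd = sqrt(n * P * (1 - P))           # standard deviation on the count scale
normal = NormalDist(mean, sd)

for k in range(n + 1):
    exact = binomial_pmf(k, n, P)
    # Normal approximation: area under the curve between k - 0.5 and k + 0.5
    approx = normal.cdf(k + 0.5) - normal.cdf(k - 0.5)
    print(f"{k:2d} heads  binomial={exact:.4f}  normal={approx:.4f}")
```

Even at n = 10 the two columns are close near the centre of the distribution; the fit is weakest in the tails.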

The central limit theorem
- Any Normal distribution can be defined by only two variables and the Normal function z
  - population mean $\bar{x} = P$
  - standard deviation $s = \sqrt{P(1 - P) / n}$
- With more data in the experiment, s will be smaller
- Divide by 10 for probability scale
- 95% of the curve is within ~2 standard deviations of the mean (the correct figure is 1.95996!), leaving 2.5% in each tail
- [Figure: Normal curve, frequency F against p (axis ticks 0.1, 0.3, 0.5, 0.7), with z·s marked either side of the mean]
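As a rough illustration of these two parameters, the sketch below computes the critical value z and the Normal interval P ± z·s for the coin example (P = 0.5, n = 10). The function name `normal_interval` and the default 95% level are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

def normal_interval(P, n, confidence=0.95):
    """Normal (Gaussian) interval about a known population proportion P."""
    # Two-tailed critical value: 2.5% in each tail for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    s = sqrt(P * (1 - P) / n)        # standard deviation on the probability scale
    return P - z * s, P + z * s

z95 = NormalDist().inv_cdf(0.975)
lower, upper = normal_interval(0.5, 10)
print(f"z = {z95:.5f}, interval = ({lower:.3f}, {upper:.3f})")
```

Increasing n in the call shrinks s and therefore narrows the interval, which is the point made on the slide.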

The single-sample z test...
- Is an observation p more than z standard deviations from the expected (population) mean P?
  - If yes, the result is significant
- [Figure: Normal curve centred on P with z·s either side, and the observation p marked on the p axis]

...gives us a confidence interval
- P ± z·s is the confidence interval for P
  - Enough for a test
  - But we need the interval about p
- The interval about p is called the Wilson score interval, $(w^-, w^+)$ (Wilson, 1927)
  - This interval is asymmetric
  - It reflects the Normal interval about P: if P is at the upper limit of p, p is at the lower limit of P
- To calculate $w^-$ and $w^+$ we use this formula:

$$w^{\pm} = \frac{p + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{p(1 - p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

- [Figure: Normal curve centred on P, with the observation p and its Wilson interval bounds $w^-$ and $w^+$ marked on the p axis]
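The Wilson score interval named on this slide can be sketched in a few lines. This uses the standard form of Wilson's (1927) formula without continuity correction; the function name `wilson_interval` and the default 95% level are assumptions for the example.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, confidence=0.95):
    """Wilson score interval (w-, w+) about an observed proportion p from n cases."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

w_minus, w_plus = wilson_interval(0.5, 10)       # 5 heads in 10 tosses
skew_lo, skew_hi = wilson_interval(0.1, 10)      # asymmetric when p is near 0
print(f"p = 0.5: ({w_minus:.4f}, {w_plus:.4f})")
print(f"p = 0.1: ({skew_lo:.4f}, {skew_hi:.4f})")
```

Unlike p ± z·s, the bounds cannot fall outside [0, 1], which is why the interval is asymmetric whenever p is away from 0.5.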

Plotting confidence intervals
- E.g. plot the probability of adding successive attributive adjectives to an NP in ICE-GB
  - You can easily see that the first two falls are significant, but the last is not
- [Figure: p (0.00 to 0.25) plotted with confidence intervals at x = 0, 1, 2, 3, 4]

A simple experiment
- Consider two binary variables, A and B
  - Each one is subdivided:
    - A = {a, ¬a}, e.g. NP has AJP? {yes, no}
    - B = {b, ¬b}, e.g. speaker gender {male, female}
  - Does B affect A?
- We perform an experiment (or sample a corpus)
  - We find 45 cases (NPs) classified by A and B
  - This is a contingency table, with A the dependent variable and B the independent variable:

|       | a  | ¬a | A (total) |
|-------|----|----|-----------|
| b     | 20 | 5  | 25        |
| ¬b    | 10 | 10 | 20        |
| Total | 30 | 15 | 45        |

- Q1. Does B cause a to differ from A?
  - Does speaker gender affect the decision to include an AJP?

Does B cause a to differ from A?
- Compare column 1 (a) and column 3 (A)
  - Probability of picking b at random (gender = male): p(b) = 25/45 = 5/9 = 0.556
- Next, examine a (has AJP)
  - New probability of picking b: p(b | a) = 20/30 = 2/3 = 0.667
  - Population standard deviation: $s = \sqrt{p(b)(1 - p(b))/n} = \sqrt{(5/9 \times 4/9)/30} = 0.091$
  - Centred on the observation, p(b | a) ± z·s = (0.489, 0.845), which contains p(b)
  - Equivalently, the confidence interval for p(b) is p(b) ± z·s = (0.378, 0.733)
- Not significant: p(b | a) is inside the confidence interval for p(b)
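The arithmetic on this slide can be reproduced directly from the contingency table. The sketch below is illustrative only: the dictionary layout and variable names are assumptions, and a 95% level is assumed as in the slides.

```python
from math import sqrt
from statistics import NormalDist

# Contingency table from the slides: rows b / not-b, columns a / not-a
table = {"b": {"a": 20, "not_a": 5}, "not_b": {"a": 10, "not_a": 10}}

n_total = sum(sum(row.values()) for row in table.values())   # 45 cases
n_a = table["b"]["a"] + table["not_b"]["a"]                   # 30 cases in column a
P = sum(table["b"].values()) / n_total                        # p(b) = 25/45
p = table["b"]["a"] / n_a                                     # p(b | a) = 20/30

z = NormalDist().inv_cdf(0.975)
s = sqrt(P * (1 - P) / n_a)                # population standard deviation for n = 30
interval = (P - z * s, P + z * s)          # confidence interval for p(b)
significant = not (interval[0] <= p <= interval[1])

print(f"p(b) = {P:.3f}, p(b | a) = {p:.3f}")
print(f"interval = ({interval[0]:.3f}, {interval[1]:.3f}), significant = {significant}")
```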

Visualising this test
- Confidence interval for p(b)
  - P = expected value, E = expected distribution
- [Figure: expected distribution E centred on P = p(b) = 0.556 with interval (0.378, 0.733); the observation p = p(b | a) = 0.667 falls inside it]

The single-sample z test
- Compares an observation with a given value
  - We used it to compare p(b | a) with p(b)
  - This is a "goodness of fit" test
  - Identical to a standard 2 × 1 χ² test
  - No need to test p(¬b | a) with p(¬b)
- Note that p(b) is given
  - All of the variation is assumed to be in the estimation of p(b | a)
  - Could also compare p(b | ¬a) (no AJP) with p(b)
- Q2. Does B cause a to differ from ¬a?
  - Does speaker gender affect presence / absence of AJP?
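The claim that the single-sample z test is identical to a 2 × 1 goodness of fit χ² test can be checked on the example data: the χ² statistic should equal z². A minimal sketch, with variable names chosen for the illustration:

```python
from math import sqrt

observed = [20, 10]                  # column a: counts for b and not-b
prior = [25 / 45, 20 / 45]           # p(b) and p(not-b), taken from the totals column
n = sum(observed)

# 2 x 1 goodness of fit chi-square: sum of (O - E)^2 / E
expected = [n * q for q in prior]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Single-sample z score for the same comparison
p, P = observed[0] / n, prior[0]
z = (p - P) / sqrt(P * (1 - P) / n)

print(f"chi-square = {chi_sq:.4f}, z squared = {z * z:.4f}")
```

Both statistics fall below the critical value for one degree of freedom at the 0.05 level (3.841, i.e. 1.95996²), in line with the "not significant" result on the earlier slide.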

z test for 2 independent proportions
- Method: combine observed values
  - take the difference (subtract): $|p_1 - p_2|$
  - calculate an averaged confidence interval
- [Figure: observed distributions $O_1$ and $O_2$ on the p axis, centred on $p_1 = p(b \mid a)$ and $p_2 = p(b \mid \neg a)$]

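z test for 2 independent proportions
- New confidence interval for the difference distribution $D = |O_1 - O_2|$
  - mean $\bar{x} = 0$
  - standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}$
  - $\hat{p} = p(b) = 25/45 = 5/9$
  - $n_1 = 30$ and $n_2 = 15$ are the column totals for a and ¬a
  - compare $z \cdot s'$ with the difference in p, $x = |p_1 - p_2|$
- [Figure: difference distribution D centred on 0, with $z \cdot s'$ and the observed difference x marked]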

Does B cause a to differ from ¬a?
- Compare column 1 (a) and column 2 (¬a)
  - Probabilities (speaker gender = male)
    - p(b | a) = 20/30 = 2/3 = 0.667
    - p(b | ¬a) = 5/15 = 1/3 = 0.333
  - Confidence interval
    - pooled probability estimate $\hat{p} = p(b) = 5/9 = 0.556$
    - standard deviation $s' = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)} = \sqrt{(5/9 \times 4/9)(1/30 + 1/15)}$
    - $z \cdot s' = 0.308$
- Significant: |p(b | a) − p(b | ¬a)| > z·s'
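The same calculation as a short sketch. The function name `two_proportion_z_test` and its return shape are assumptions for the illustration; the pooled estimate and the formula for s' follow the slide.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(f1, n1, f2, n2, confidence=0.95):
    """Return (|p1 - p2|, z * s') for two independent proportions f1/n1 and f2/n2."""
    p1, p2 = f1 / n1, f2 / n2
    pooled = (f1 + f2) / (n1 + n2)                      # pooled probability estimate
    s = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2)) # s' from the slide
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return abs(p1 - p2), z * s

# p(b | a) = 20/30 against p(b | not-a) = 5/15
difference, threshold = two_proportion_z_test(20, 30, 5, 15)
significant = difference > threshold
print(f"|p1 - p2| = {difference:.3f}, z.s' = {threshold:.3f}, significant = {significant}")
```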

z test for 2 independent proportions
- Identical to a standard 2 × 2 χ² test
  - So you can use the usual method!
- BUT: these tests have different purposes
  - 2 × 1 goodness of fit compares a single value a with its superset A
    - assumes only a varies
  - 2 × 2 test compares two values a, ¬a within a set A
    - both values may vary
- Q: Do we need χ²?
- [Figure: a compared with A by goodness of fit χ²; a compared with ¬a by 2 × 2 χ²]
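The equivalence with the 2 × 2 χ² test can be verified on the example table: the χ² statistic computed by "the usual method" should equal the square of the two-proportion z score. A minimal sketch, with variable names chosen for the illustration:

```python
from math import sqrt

table = [[20, 5],      # b:     a, not-a
         [10, 10]]     # not-b: a, not-a

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
N = sum(row_totals)

# Usual 2 x 2 chi-square: expected cell = row total * column total / N
chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / N
        chi_sq += (table[i][j] - expected) ** 2 / expected

# Two-proportion z score comparing p(b | a) with p(b | not-a)
p1 = table[0][0] / col_totals[0]
p2 = table[0][1] / col_totals[1]
pooled = row_totals[0] / N
z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / col_totals[0] + 1 / col_totals[1]))

print(f"chi-square = {chi_sq:.4f}, z squared = {z * z:.4f}")
```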

Larger χ² tests
- χ² is popular because it can be applied to contingency tables with many values
  - r × 1 goodness of fit χ² tests (r ≥ 2)
  - r × c χ² tests for homogeneity (r, c ≥ 2)
- z tests have 1 degree of freedom
  - strength: significance is due to only one source
  - strength: easy to plot values and confidence intervals
  - weakness: multiple values may be unavoidable
- With larger χ² tests, evaluate and simplify:
  - Examine χ² contributions for each row or column
  - Focus on alternation: try to test for a speaker choice

How big is the effect?
- These tests do not measure the strength of the interaction between two variables
  - They test whether the strength of an interaction is greater than would be expected by chance
    - With lots of data, a tiny change would be significant
  - Don't use χ², p or z values to compare two different experiments
    - A result significant at p < 0.01 is not "better" than one significant at p < 0.05
- There are a number of ways of measuring association strength or effect size

Percentage swing
- Compare probabilities of a DV value (a, AJP) across a change in the IV (gender):
  - swing d = p(a | ¬b) − p(a | b) = 10/20 − 20/25 = −0.3
- As a proportion of the initial value
  - % swing d% = d / p(a | b) = −0.3/0.8 = −37.5%
- We can even calculate confidence intervals on d or d%
  - Use the z test for two independent proportions (we are comparing differences in p values)
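The two swing measures are simple enough to state as code. The variable names below are illustrative; the counts come from the contingency table used throughout.

```python
# Rows of the contingency table: b = (20 a, 5 not-a), not-b = (10 a, 10 not-a)
p_a_given_b = 20 / 25          # initial value: p(a | b) = 0.8
p_a_given_not_b = 10 / 20      # value after the change in the IV: p(a | not-b) = 0.5

swing = p_a_given_not_b - p_a_given_b        # absolute swing d
percentage_swing = swing / p_a_given_b       # swing as a proportion of the initial value

print(f"d = {swing:.3f}, d% = {percentage_swing:.1%}")
```

Note that d% depends on which value is treated as the starting point, so it is not symmetric in b and ¬b.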

Cramér's φ
- Can be used on any χ² table
  - Mathematically well defined
  - Probabilistic: φ ∈ [0, 1] (cf. swing d ∈ [−1, +1], d% = ?)
    - φ = 0: no relationship between A and B
    - φ = 1: B strictly determines A
    - straight line between these two extremes (an averaged swing)
  - Based on χ²
    - $\phi = \sqrt{\chi^2 / N}$ (2 × 2), where N = grand total
    - $\phi_c = \sqrt{\chi^2 / ((k - 1)N)}$ (r × c), where k = min(r, c)
- Can be used for r × 1 goodness of fit tests
  - Recalibrate using methods in Wallis (2012)
  - Better indicator than percentage swing

φ = 0:

|       | a   | ¬a  | Total |
|-------|-----|-----|-------|
| b     | 0.5 | 0.5 | 1     |
| ¬b    | 0.5 | 0.5 | 1     |
| Total | 1   | 1   | 2     |

φ = 1:

|       | a | ¬a | Total |
|-------|---|----|-------|
| b     | 1 | 0  | 1     |
| ¬b    | 0 | 1  | 1     |
| Total | 1 | 1  | 2     |
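A minimal sketch of the χ²-based formula, applied to the two extreme tables and to the example data. The function names are illustrative, and the code assumes every expected cell frequency is non-zero.

```python
from math import sqrt

def chi_square(table):
    """Chi-square for an r x c table of observed frequencies (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # assumed non-zero
            total += (observed - expected) ** 2 / expected
    return total

def cramers_phi(table):
    """Cramer's phi: sqrt(chi-square / ((k - 1) * N)), with k = min(rows, columns)."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / ((k - 1) * n))

print(cramers_phi([[0.5, 0.5], [0.5, 0.5]]))   # no relationship between A and B
print(cramers_phi([[1, 0], [0, 1]]))           # B strictly determines A
print(cramers_phi([[20, 5], [10, 10]]))        # example table from the slides
```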

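Significantly better?
- Suppose we have two similar experiments
  - How do we test if one result is significantly stronger than another?
- Test swings
  - Use the z test for two samples from different populations
  - Use $s' = \sqrt{s_1^2 + s_2^2}$
  - Test $|d_1(a) - d_2(a)| > z \cdot s'$

Experiment 1:

|       | a  | ¬a | Total |
|-------|----|----|-------|
| b     | 20 | 5  | 25    |
| ¬b    | 10 | 10 | 20    |
| Total | 30 | 15 | 45    |

Experiment 2:

|       | a  | ¬a | Total |
|-------|----|----|-------|
| b     | 50 | 5  | 55    |
| ¬b    | 10 | 10 | 20    |
| Total | 60 | 15 | 75    |

- [Figure: swings $d_1(a)$ and $d_2(a)$ plotted with confidence intervals on a scale from −0.7 to 0]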
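One possible reading of this test is sketched below. The slide does not say how $s_1$ and $s_2$ are estimated, so the unpooled Normal-approximation variance used here for each swing is an assumption, as are the function and variable names.

```python
from math import sqrt
from statistics import NormalDist

def swing_and_variance(f_b, n_b, f_not_b, n_not_b):
    """Return the swing d = p(a | not-b) - p(a | b) and an estimate of its variance."""
    p_b, p_not_b = f_b / n_b, f_not_b / n_not_b
    d = p_not_b - p_b
    # Assumption: unpooled Normal-approximation variance of a difference of proportions
    variance = p_b * (1 - p_b) / n_b + p_not_b * (1 - p_not_b) / n_not_b
    return d, variance

d1, var1 = swing_and_variance(20, 25, 10, 20)   # experiment 1
d2, var2 = swing_and_variance(50, 55, 10, 20)   # experiment 2

z = NormalDist().inv_cdf(0.975)
s_combined = sqrt(var1 + var2)                  # s' = sqrt(s1^2 + s2^2)
swings_differ = abs(d1 - d2) > z * s_combined

print(f"d1 = {d1:.3f}, d2 = {d2:.3f}, z.s' = {z * s_combined:.3f}, differ = {swings_differ}")
```

Under these assumptions the second swing is larger in absolute terms, but the gap between the two swings is well inside z·s', so the two results would not be reported as significantly different.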

Modern improvements on z and χ²
- Continuity correction for small n
  - Yates' χ² test: can be used elsewhere
- Wilson's score interval
  - The correct formula for intervals on p
- Newcombe (1998) improves on the 2 × 2 χ² test
  - Uses the Wilson interval
  - Better than χ² and log-likelihood (etc.) for low-frequency events
- [Figure: Wilson interval $(w^-, w^+)$ about an observation p close to 0]
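As an illustration of the Newcombe approach, the sketch below builds an interval for the difference between two proportions from the Wilson interval on each one. It follows the score-based method without continuity correction described in Newcombe (1998); the function names and the choice of example data are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(p, n, z):
    """Wilson score interval about an observed proportion p from n cases."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denominator = 1 + z * z / n
    return (centre - spread) / denominator, (centre + spread) / denominator

def newcombe_wilson_interval(f1, n1, f2, n2, confidence=0.95):
    """Interval for the difference p1 - p2, built from the two Wilson intervals."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p1, p2 = f1 / n1, f2 / n2
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# p(b | a) = 20/30 against p(b | not-a) = 5/15, as in the 2 x 2 example
lower, upper = newcombe_wilson_interval(20, 30, 5, 15)
significant = not (lower <= 0 <= upper)
print(f"difference interval = ({lower:.3f}, {upper:.3f}), significant = {significant}")
```

The difference is significant when the interval excludes zero. For this table the conclusion matches the 2 × 2 test above, but the lower bound sits close to zero.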

Conclusions
- The basic idea of all of these tests is
  - Predict future results if the experiment were repeated
    - "Significant" = effect > 0 (e.g. 19 times out of 20)
- Based on the Binomial distribution
  - Approximated by the Normal distribution: many uses
    - Plotting confidence intervals
    - Use goodness of fit or single-sample z tests to compare a sample, a, with a point it is dependent on, A
    - Use 2 × 2 tests or two independent sample z tests to compare two observed samples (a, ¬a)
    - When using larger r × c tests, simplify as far as possible to identify the source of variation!

Conclusions
- Two methods for measuring the size of an experimental effect
  - Simple idea, easy to report: absolute or percentage swing
  - More reliable, but possibly less intuitive: Cramér's φ
- You can compare two experiments
  - Is absolute swing significantly greater? Use a type of z test!
  - A similar approach is possible with φ
- Take care with small samples / low frequencies
  - Use Wilson's and Newcombe's methods instead!

References
- Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
- Wallis, S.A. 2009. Binomial distributions, probability and Wilson's confidence interval. London: Survey of English Usage.
- Wallis, S.A. 2010. z-squared: The origin and use of χ². London: Survey of English Usage.
- Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage.
- Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.

Assorted statistical tests:
- www.ucl.ac.uk/english-usage/staff/sean/resources/2x2chisq.xls
