Slide 1: Association Measures
Stefan Evert, IMS - Uni Stuttgart
Brigitte Krenn, ÖFAI Wien

Slide 2: Reminder: Contingency Tables
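
The table on this slide is not part of the transcript. As a reminder of the notation used throughout (cell counts O11, O12, O21, O22 derived from the pair frequencies f, f1, f2 and the sample size N), here is a minimal Python sketch; the function name and the example values are mine, not from the slides.

# Minimal sketch of the 2x2 contingency table used throughout these slides.
# f = cooccurrence frequency of the pair (u,v), f1/f2 = marginal frequencies
# of u and v, N = sample size.

def contingency_table(f, f1, f2, N):
    O11 = f                # u occurs together with v
    O12 = f1 - f           # u occurs, but not with v
    O21 = f2 - f           # v occurs, but not with u
    O22 = N - f1 - f2 + f  # neither u nor v
    return O11, O12, O21, O22

print(contingency_table(f=50, f1=100, f2=100, N=1000))  # (50, 50, 50, 850)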

Slide 3: General Remarks
- we will only use data from contingency tables
- we will consider each pair type on its own, independently of all other pair types (→ no distributional information)
- we won't distinguish between relational and positional cooccurrences

Slide 4: Association Measures (AMs)
- goal: assign an association score to each pair type = strength of association between the components
- high score = strong association
- association in a statistical sense, but there is no precise definition
- positive vs. negative association ("colourless green ideas")

Slide 5: Using Association Scores
- absolute values (cut-off threshold), input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful
- ranking of collocation candidates → only relative scores matter
- rank collocates of a given base → one marginal frequency fixed → only two free parameters

Slide 6: First Steps: Proportions
Workshop on Mechanized Documentation (Washington, 1964)

Slide 7: First Steps: Proportions
- proportions between 0 and 1; high proportion = strong (directional) association
- need to combine the two proportions into a single association score
- the average (P1 + P2) / 2 is not useful:
  f=1, f1=1, f2=1000 → avg. = 0.5005
  f=50, f1=100, f2=100 → avg. = 0.5
- → more "conservative" weighting needed

Slide 8: First Steps: Proportions
Combinations of the two proportions:
- harmonic mean
- geometric mean
- minimum
- Jaccard
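
A minimal sketch of these combinations in terms of the proportions P1 = f/f1 and P2 = f/f2. The slide's own formulas are not in the transcript, so these are the textbook definitions; the function name is mine.

import math

def proportion_measures(f, f1, f2):
    P1, P2 = f / f1, f / f2
    return {
        "harmonic mean": 2 * P1 * P2 / (P1 + P2),  # = 2f / (f1 + f2), the Dice coefficient
        "geometric mean": math.sqrt(P1 * P2),      # = f / sqrt(f1 * f2)
        "minimum": min(P1, P2),
        "Jaccard": f / (f1 + f2 - f),
    }

print(proportion_measures(f=50, f1=100, f2=100))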

Slide 9: First Steps: Proportions
- coefficients range from 0 to 1; 1 = total (positive) association
- interpretation of lower scores is less clear:
  positive vs. negative association?
  which score for no association? what is "no association"??
- → random combinations

Slide 10: Expected Frequencies
- assume that types u and v cooccur only by chance
- f1(u) occurrences of u and f2(v) occurrences of v are spread randomly over N tokens
- each instance of u has a chance of f2(v)/N to cooccur with a v
- → expected number of cooccurrences: E11 = f1(u) · f2(v) / N

Slide 11: Expected Frequencies
- expected frequencies for all cells of the contingency table
- assuming random combinations (→ statistical independence)
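
A sketch of the expected frequencies under independence, Eij = Ri · Cj / N with row sums Ri and column sums Cj. The slide's own table is not in the transcript; the function name is mine.

def expected_frequencies(O11, O12, O21, O22):
    N = O11 + O12 + O21 + O22
    R1, R2 = O11 + O12, O21 + O22   # row sums
    C1, C2 = O11 + O21, O12 + O22   # column sums
    return R1 * C1 / N, R1 * C2 / N, R2 * C1 / N, R2 * C2 / N

# for f=50, f1=100, f2=100, N=1000: E11 = 100*100/1000 = 10
print(expected_frequencies(50, 50, 50, 850))  # (10.0, 90.0, 90.0, 810.0)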

Slide 12: Expected Frequencies
- comparison of expected against observed frequencies
- note that row and column sums are the same for both tables

Slide 13: Mutual Information
- compares O11 with E11
- the ratio O11/E11 ranges from 0 to ∞; 1 = no association (O11 = E11)
- usually logarithmic: MI = log(O11/E11)
- values range from −∞ to +∞; 0 = no association, > 0 = positive association
- used in English lexicography

Slide 14: Low-Frequency Pairs & Random Variation
- large amount of low-frequency data (consequence of Zipf's law)
- a simple (invented) example:
  A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
  B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=0.001, MI = log 1000
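
The following sketch re-computes the two invented examples with MI = log(O11/E11) and E11 = f1·f2/N. A base-2 logarithm is used here; the slides do not fix a base, so only the ordering of the scores matters.

import math

def mutual_information(f, f1, f2, N):
    E11 = f1 * f2 / N
    return math.log2(f / E11)

print(mutual_information(f=50, f1=100, f2=100, N=1000))  # log2(5)    ~ 2.32
print(mutual_information(f=1,  f1=1,   f2=1,   N=1000))  # log2(1000) ~ 9.97
# the hapax pair B gets a much higher MI score than the robust pair A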

Slide 15: Low-Frequency Pairs & Random Variation
Three problems with case B:
- how meaningful is a single example? (not very, actually)
- could well be a spelling mistake or noise from automatic processing
- we want to make generalisations (from a particular corpus to "language")
→ this is the domain of statistics: draw inferences about a population (= language) from a sample (= corpus)

Slide 16: The Statistical Model: Random Sample
- assumption: corpus data is a random sample from the language
- → base data is a random sample from all coocs. in the language

Slide 17: The Statistical Model: Random Sample
- a random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token
- notation: U and V as "prototypes"
- for a given pair type (u,v), the contingency table can be computed from the Ui and Vi → random variables X11, X12, X21, X22

Slide 18: The Statistical Model: Random Sample
- population parameters π11, π12, π21, π22 for pair type (u,v)
- observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample
- theory of random samples predicts the distribution of X11, X12, X21, X22 from assumptions about the population parameters π11, π12, π21, π22

Slide 19: The Statistical Model: Random Sample

Slide 20: Two Footnotes
- vector notation for contingency tables
- population ≠ general language: restricted to the domain(s), genre(s), ... covered by the source corpus
  e.g. "black box" in computer science vs. newspapers vs. cooking

Slide 21: The Sampling Distribution
- multinomial sampling distribution
- each individual cell count Xij has a binomial distribution (but these are not independent)
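
A sketch of the multinomial sampling distribution of the full table, computed in log space to avoid numerical overflow. The cell parameters pi11..pi22 are whatever values we assume for the population; the function name and the example values are mine.

from math import lgamma, log, exp

def log_multinomial_pmf(k, pi):
    # log P(X11=k11, ..., X22=k22) for cell parameters pi11..pi22 summing to 1
    N = sum(k)
    logp = lgamma(N + 1) - sum(lgamma(kij + 1) for kij in k)
    logp += sum(kij * log(pij) for kij, pij in zip(k, pi))
    return logp

# running example table, evaluated under independence (pi_ij = Eij / N)
k  = (50, 50, 50, 850)
pi = (0.01, 0.09, 0.09, 0.81)
print(log_multinomial_pmf(k, pi), exp(log_multinomial_pmf(k, pi)))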

Slide 22: The Sampling Distribution
- given assumptions about the population parameters, we can compute the likelihood of the observed contingency table
- relatively high likelihood = consistent with the assumptions
- relatively low likelihood = evidence against the assumptions (inversely proportional to likelihood)

Slide 23: Adequacy of the Statistical Model
- particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)
- randomness assumption (random sample from fixed population):
  independence of pair tokens
  constancy of population parameters
- violations problematic only when they affect the sampling distribution

Slide 24: Adequacy of the Statistical Model
Three causes of non-randomness:
- local dependencies (e.g. syntax) → usually not problematic
- inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
- repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)

Slide 25: Making Assumptions about the Population Parameters
- population parameters (π, π1, π2) are unknown
- best guess from observation: MLE = maximum-likelihood estimate
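
A minimal sketch of the maximum-likelihood estimates: the unknown population probabilities are simply estimated by the corresponding relative frequencies in the sample. Variable names (pi_hat etc.) are mine, not from the slides.

def mle(f, f1, f2, N):
    pi_hat  = f  / N   # estimate of P(U=u, V=v)
    pi1_hat = f1 / N   # estimate of P(U=u)
    pi2_hat = f2 / N   # estimate of P(V=v)
    return pi_hat, pi1_hat, pi2_hat

print(mle(50, 100, 100, 1000))   # (0.05, 0.1, 0.1)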

Slide 26: Making Assumptions about the Population Parameters
- conditional probabilities with MLE
- Dice coefficient etc. are MLEs for population characteristics
- MI is the MLE for log(π / (π1 · π2)) → unreliable for small frequencies

Slide 27: The Null Hypothesis
- null hypothesis H0: no association = independence, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)
- not all parameters determined
- MLE: maximise probability of observed data under H0

Slide 28: Likelihood Measures
- probability of observed data under H0 (with MLE)
- probability of a single cell: X11 should be most "informative"

Slide 29: Likelihood Measures
- small likelihood values = strong association
- computed probabilities are often extremely small
- use negative base-10 logarithm → more convenient scale → high scores indicate strong association

Slide 30: Problems of Likelihood Measures
Three reasons for low likelihood:
- observed data is inconsistent with the null hypothesis because of strong association
- association may also be negative (fewer coocs. than expected)
- observed data is consistent, but the probability mass is spread across many similar contingency tables

Slide 31: Problems of Likelihood Measures
- high frequency = low likelihood, e.g. binomial likelihood:
  O11=1, E11=1 → L = 0.3679
  O11=1000, E11=1000 → L = 0.0126
  O11=4, E11=1 → L ≈ 0.0126
- need to "normalise" likelihood
- NB: likelihood association measures often have good empirical results nonetheless
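
A hedged re-computation of the quoted values, using the Poisson approximation to the binomial likelihood of the cell count X11 under H0 (mean E11) via scipy. The slides' exact binomial formula is not in the transcript, so the third value comes out slightly different here.

from scipy.stats import poisson

for O11, E11 in [(1, 1), (1000, 1000), (4, 1)]:
    print(O11, E11, round(poisson.pmf(O11, E11), 4))
# 1    1    -> 0.3679
# 1000 1000 -> 0.0126
# 4    1    -> 0.0153 (the slide quotes ~0.0126 for this case)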

Slide 32: Likelihood Ratios
- simplest normalisation technique: divide the maximum probability of the data under H0 by the unconstrained maximum probability
- suggested by Dunning (1993)

Slide 33: Statistical Hypothesis Tests
- compute probability of a group of outcomes instead of a single one
- observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0
- total probability is known as the p-value or significance
- problem: ranking of contingency tables

Slide 34: Asymptotic Tests
- asymptotic tests define the ranking of contingency tables explicitly: compute a test statistic from the data, higher values = more evidence against H0
- can use the test statistic as an AM
- theory: approximation of the p-value associated with the test statistic (accurate in the limit N → ∞)

Slide 35: Asymptotic Tests
- standard test for independence is Pearson's chi-squared test
- limiting distribution = χ² distribution with df = 1
- number of degrees of freedom was subject of a long debate
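
Sketch of Pearson's chi-squared statistic, X² = Σij (Oij − Eij)² / Eij, and its asymptotic p-value from the χ² distribution with df = 1 (scipy is used only for the distribution). The numbers continue the f=50, f1=f2=100, N=1000 example; function name is mine.

from scipy.stats import chi2

def pearson_chisq(O, E):
    # X2 = sum over cells of (O - E)^2 / E
    return sum((o - e) ** 2 / e for o, e in zip(O, E))

O = (50, 50, 50, 850)
E = (10.0, 90.0, 90.0, 810.0)
x2 = pearson_chisq(O, E)
print(x2, chi2.sf(x2, df=1))   # statistic and asymptotic p-value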

Slide 36: Two-Sided Tests
- chi-squared test is two-sided, i.e. no difference between positive and negative association
- ignore the small number of pairs with (non-total) negative association
- or convert to a one-sided test: reject H0 only when O11 > E11; the p-value is usually divided by 2

Slide 37: Yates' Continuity Correction
- Pearson's chi-squared test approximates the discrete binomial distributions of each cell by a continuous normal distribution (→ "normal theory")
- estimating probabilities P(Xij ≥ k) from the normal distribution introduces systematic errors

Slide 38: Yates' Continuity Correction

Slide 39: Yates' Continuity Correction

Slide 40: Yates' Continuity Correction
- generic form of Yates' continuity correction for contingency tables
- usefulness is still controversial (criticised as too conservative)
- applicability for the chi-squared test is generally accepted
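
Sketch of the Yates-corrected statistic in its usual textbook form for 2x2 tables: each |Oij − Eij| is reduced by 0.5 before squaring. The slide's own formula is not in the transcript; the function name is mine.

def yates_chisq(O, E):
    # each |O - E| is reduced by 0.5 (not below zero) before squaring
    return sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in zip(O, E))

print(yates_chisq((50, 50, 50, 850), (10.0, 90.0, 90.0, 810.0)))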

Slide 41: Asymptotic Tests
- a different form of the chi-squared test (comparison of two binomials) is equivalent to the independence test
- special equation with Yates' correction

Slide 42: Asymptotic Tests
- can also use the log-likelihood ratio as a test statistic (two-sided)
- limiting distribution is again the χ² distribution with df = 1
- more conservative than Pearson's chi-squared test
- Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)
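
Sketch of the log-likelihood statistic in its familiar G² form, G² = 2 Σij Oij · log(Oij / Eij), which equals minus twice the log of the likelihood ratio from slide 32. The function name is mine.

from math import log

def log_likelihood_g2(O, E):
    # G2 = 2 * sum over cells of O * log(O / E); cells with O = 0 contribute 0
    return 2 * sum(o * log(o / e) for o, e in zip(O, E) if o > 0)

print(log_likelihood_g2((50, 50, 50, 850), (10.0, 90.0, 90.0, 810.0)))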

Slide 43: Something I'd Rather Not Mention
- Church & Hanks: O11 and E11 are both random variables
- H0: their expected values are equal
- assume normal distribution with unknown variance
- compare O11 and E11 with Student's t-test, estimating the unknown variance from the observed data

Slide 44: Something I'd Rather Not Mention
- one-sided test
- statistical model is questionable
- limiting distribution: t-distribution with df ≈ N
- even more conservative than log-likelihood (low-frequency data)
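
Sketch of the resulting t-score in its commonly cited collocation form, t = (O11 − E11) / √O11, i.e. with the variance estimated from the observed data as described on slide 43. The slides' own formula is not in the transcript; the function name is mine.

from math import sqrt

def t_score(O11, E11):
    return (O11 - E11) / sqrt(O11)

print(t_score(50, 10.0))    # ~5.66 for the frequent example pair
print(t_score(1, 0.001))    # ~1.00 -- a hapax pair can never score much above 1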

Slide 45: Exact Tests
- problem: how to establish a ranking of contingency tables
- solution: reduce the set of alternatives
- if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking: binomial test
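
Sketch of the one-sided binomial test: the p-value is the probability of seeing at least O11 cooccurrences in N trials with success probability E11/N (i.e. under H0), computed here with scipy's binomial distribution. The function name is mine.

from scipy.stats import binom

def binomial_pvalue(O11, E11, N):
    # one-sided: P(X11 >= O11) when X11 ~ Binomial(N, E11/N)
    return binom.sf(O11 - 1, N, E11 / N)

print(binomial_pvalue(O11=50, E11=10.0, N=1000))
print(binomial_pvalue(O11=1, E11=0.001, N=1000))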

Slide 46: Exact Tests
- another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)
- condition on fixed row and column sums R1, R2, C1, C2
- conditional hypergeometric distribution does not depend on the parameters π1 and π2

Slide 47: Exact Tests
- X11 is the only free parameter
- we can use X11 – E11 for ranking
- Fisher's exact test (Pedersen 1996)
- computationally expensive, numerical difficulties
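
A sketch using scipy's implementation of Fisher's exact test on the running example table, one-sided (more cooccurrences than expected). Pedersen (1996) describes an implementation of his own, which is not reproduced here.

from scipy.stats import fisher_exact

table = [[50, 50],
         [50, 850]]
oddsratio, p_value = fisher_exact(table, alternative='greater')
print(p_value)   # very small p-value = strong positive association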

Slide 48: Comparing Hypothesis Tests
- Fisher's test is now widely accepted as most appropriate; tends to be conservative
- log-likelihood gives a good approximation of the "correct" p-values (slightly less conservative)
- chi-squared over-estimates
- t-score is far too conservative

Slide 49: Other Approaches to Measuring Association
- information-theoretic (MI, entropy) → equivalent to log-likelihood
- combined measures ("boosting")
- conservative estimates instead of MLE (confidence intervals)
- hypothesis tests with a different null hypothesis: π = C · π1 · π2
- mixture of conservative estimates and hypothesis tests?

Slide 50: Implementation
- one-sided vs. two-sided tests
- need special software to obtain p-values for asymptotic tests
- numerical accuracy
- beware of zero frequencies!

Slide 51: Errr.... Help!? Software?
- Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]
- UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o)]

Slide 52: More Association Measures
- lots of association measures
- will be updated: references, slides from this course
- under construction

Slide 53: Comparing Association Measures
- mathematical discussion: very complex, results only for special cases
- numerical simulation: computationally expensive, Dunning (1993, 1998)
- lazy man's approach: construct a mock data set where frequencies vary systematically
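
In that spirit, a small sketch of such a mock data set: f and f1 = f2 vary systematically at fixed N = 100,000 (the value used on the following slides), and the measures from the sketches above (MI, t-score, chi-squared, log-likelihood) are tabulated side by side. The particular frequency values chosen here are mine, not the slides'.

import math

N = 100_000

def scores(f, f1, f2):
    O = (f, f1 - f, f2 - f, N - f1 - f2 + f)
    E11 = f1 * f2 / N
    E = (E11, f1 - E11, f2 - E11, N - f1 - f2 + E11)
    mi = math.log2(f / E11)
    t  = (f - E11) / math.sqrt(f)
    x2 = sum((o - e) ** 2 / e for o, e in zip(O, E))
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(O, E) if o > 0)
    return mi, t, x2, g2

print(f"{'f':>5} {'f1=f2':>6} {'MI':>7} {'t-score':>8} {'chi-sq':>10} {'log-lik':>10}")
for f1 in (10, 100, 1000):
    for f in (1, f1 // 2, f1):
        mi, t, x2, g2 = scores(f, f1, f1)
        print(f"{f:>5} {f1:>6} {mi:>7.2f} {t:>8.2f} {x2:>10.1f} {g2:>10.1f}")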

Slides 54-58: Comparing Association Measures (plots, N = 100,000)

Slides 59-60: Comparing Association Measures (plots, N = 10,000,000)

Slides 61-64: Comparing Association Measures (plots, N = 100,000)

