Published by Diana Cross. Modified over 3 years ago.

1
The Significance of Result Differences
Stefan Evert, IMS - Uni Stuttgart; Brigitte Krenn, ÖFAI Wien

2
Why Significance Tests?
Everybody knows we have to test the significance of our results. But do we really?
Evaluation results are valid for data from a specific corpus, extracted with specific methods, for a particular type of collocations, according to the intuitions of one particular annotator (or two).

3
Why Significance Tests?
Significance tests are about generalisations.
Basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?"
Influencing factors: source corpus, domain, collocation type and definition, annotation guidelines, ...

4
Evaluation of Association Measures

5
Evaluation of Association Measures

6
A Different Perspective
Pair types are described by tables (O11, O12, O21, O22): coordinates in 4-D space.
O22 is redundant because O11 + O12 + O21 + O22 = N.
A pair type can also be described by its joint and marginal frequencies (f, f1, f2): "coordinates" in 3-D space.

7
A Different Perspective
Data set = cloud of points in three-dimensional space; visualisation is "challenging".
Many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial).
Projection to (O11, E11): coordinates in 2-D space (ignoring the ratio f1 / f2).
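The change of coordinates on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: f1 is taken to be the row total O11 + O12 and f2 the column total O11 + O21, E11 is the usual expected frequency under independence (f1 * f2 / N), MI is written as log2(O11 / E11) and the t-score as (O11 - E11) / sqrt(O11). The function names and the example counts are hypothetical and do not come from the slides.

```python
import math

def contingency_table(f, f1, f2, n):
    """Rebuild (O11, O12, O21, O22) from joint and marginal frequencies.

    Assumes f1 = O11 + O12 (row total) and f2 = O11 + O21 (column total).
    """
    o11 = f
    o12 = f1 - f
    o21 = f2 - f
    o22 = n - f1 - f2 + f  # redundant cell: the four cells sum to N
    return o11, o12, o21, o22

def project(f, f1, f2, n):
    """Project a pair type to the 2-D coordinates (O11, E11)."""
    e11 = f1 * f2 / n  # expected co-occurrence frequency under independence
    return f, e11

def mi(o11, e11):
    """Mutual information score, assumed here to be log2(O11 / E11)."""
    return math.log2(o11 / e11)

def t_score(o11, e11):
    """t-score, assumed here to be (O11 - E11) / sqrt(O11)."""
    return (o11 - e11) / math.sqrt(o11)

if __name__ == "__main__":
    # Hypothetical counts for one pair type in a sample of N = 10000 pairs.
    o11, e11 = project(f=30, f1=100, f2=200, n=10000)
    print(contingency_table(30, 100, 200, 10000))
    print(o11, e11, mi(o11, e11), t_score(o11, e11))
```

Both scores take only (O11, E11) as input, which is why the 2-D projection loses nothing for these measures.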

8
The Parameter Space of Collocation Candidates

9
The Parameter Space of Collocation Candidates

10
The Parameter Space of Collocation Candidates

11
The Parameter Space of Collocation Candidates

12
The Parameter Space of Collocation Candidates

13
N-best Lists in Parameter Space
The N-best list for an AM includes all pair types with score ≥ c (threshold c obtained from the data).
{score ≥ c} describes a subset of the parameter space.
For a sound association measure, the isoline {score = c} is the lower boundary of this subset (because scores should increase with O11 for a fixed value of E11).
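To make the isoline idea concrete, the following sketch computes the threshold c of an N-best list and the O11 value on the isoline for a given E11. It assumes the same score definitions as commonly used for these measures (MI = log2(O11 / E11), t-score = (O11 - E11) / sqrt(O11)); the slides do not spell the formulas out, and all function names here are hypothetical.

```python
import math

def nbest_threshold(scores, n):
    """Score c of the n-th best candidate; the N-best list is {score >= c}."""
    return sorted(scores, reverse=True)[n - 1]

def mi_isoline(e11, c):
    """O11 on the isoline MI = c, assuming MI = log2(O11 / E11)."""
    return e11 * 2 ** c

def t_score_isoline(e11, c):
    """O11 on the isoline t = c, assuming t = (O11 - E11) / sqrt(O11).

    With s = sqrt(O11) the condition becomes s^2 - c*s - E11 = 0,
    whose positive root is s = (c + sqrt(c^2 + 4*E11)) / 2.
    """
    s = (c + math.sqrt(c * c + 4 * e11)) / 2
    return s * s

if __name__ == "__main__":
    # Hypothetical scores and threshold, for illustration only.
    c = nbest_threshold([1.2, 5.0, 3.3, 4.1], n=2)
    for e11 in (0.1, 1.0, 10.0):
        print(e11, mi_isoline(e11, c), t_score_isoline(e11, c))
```

Every pair type whose O11 lies on or above the returned value (for its E11) is in the N-best list, which is the "lower boundary" reading of the isoline.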

14
N-Best Isolines in the Parameter Space: MI

15
N-Best Isolines in the Parameter Space: MI

16
N-Best Isolines in the Parameter Space: t-score

17
N-Best Isolines in the Parameter Space: t-score

18
95% Confidence Interval

19
99% Confidence Interval

20
95% Confidence Interval

21
Comparing Precision Values
Number of TPs and FPs for 1000-best lists (tbl):
         t-score   frequency
TPs      322       283
FPs      678       717
(TP counts follow from the list size, 1000 − FPs.)

22
McNemar's Test
+ = in 1000-best list, – = not in 1000-best list
Ideally: all TPs in 1000-best list (possible!)
H0: differences between AMs are random
tbl cross-tabulates – t-score / + t-score against – freq / + freq; of its cell values only the digits "7276" survive.

23
McNemar's Test
+ = in 1000-best list, – = not in 1000-best list
> mcnemar.test(tbl)
p-value < [value lost]: highly significant
tbl cross-tabulates – t-score / + t-score against – freq / + freq; of its cell values only the digits "7276" survive.
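For readers without R at hand, the test can be approximated with the Python standard library. The sketch below counts the TPs that appear in only one of two N-best lists and then computes the continuity-corrected chi-squared statistic (one degree of freedom) that R's mcnemar.test uses by default. The function names and the example counts are hypothetical; the cell values of the slide's own table are not recoverable from the transcript, so none are used here.

```python
import math

def discordant_counts(true_positives, list_a, list_b):
    """Count TPs that are in exactly one of the two N-best lists."""
    tps = set(true_positives)
    a, b = set(list_a), set(list_b)
    only_a = len(tps & (a - b))  # TPs found by measure A but not by B
    only_b = len(tps & (b - a))  # TPs found by measure B but not by A
    return only_a, only_b

def mcnemar(only_a, only_b):
    """Continuity-corrected McNemar statistic and its p-value.

    Under H0 (differences are random) the statistic is approximately
    chi-squared with 1 degree of freedom, whose upper tail probability
    is erfc(sqrt(x / 2)).
    """
    n = only_a + only_b
    if n == 0:
        return 0.0, 1.0  # no discordant TPs: nothing to test
    stat = (abs(only_a - only_b) - 1) ** 2 / n
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

if __name__ == "__main__":
    # Hypothetical discordant counts, for illustration only.
    stat, p = mcnemar(only_a=10, only_b=25)
    print(stat, p)
```

Only the discordant TPs enter the statistic: TPs found by both measures, or by neither, say nothing about which measure is better.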

24
Significant Differences

25
Significant Differences

26
Significant Differences (figure legend: one marker = significant, another = relevant (2%))

27
Lowest-Frequency Data: Samples
Too much data for full manual evaluation, so random samples were annotated.
AdjN data: 965 pairs with f = 1 (15% sample); manually identified 31 TPs (3.2%).
PNV data: 983 pairs with f < 3 (0.35% sample); manually identified 6 TPs (0.6%).

28
Lowest-Frequency Data: Samples
Estimate the proportion p of TPs among all lowest-frequency data, using a confidence set from the binomial test.
AdjN: 31 TPs among 965 items, so p ≤ 5% with 99% confidence: at most 320 TPs.
PNV: 6 TPs among 983 items, so p ≤ 1.5% with 99% confidence: there might still be 4200 TPs!
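The upper bounds on this slide can be reproduced approximately with a one-sided exact binomial bound: the largest p for which observing at most k TPs among n items still has probability at least 1 − confidence. The sketch below uses only the standard library and assumes a one-sided bound, which the slide does not state explicitly; the function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def upper_bound(k, n, confidence=0.99):
    """One-sided upper confidence bound for p, found by bisection."""
    alpha = 1.0 - confidence
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # observation still plausible: bound lies higher
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    # Sample sizes and TP counts as given on the slides.
    for name, k, n, sample_share in (("AdjN", 31, 965, 0.15),
                                     ("PNV", 6, 983, 0.0035)):
        p_max = upper_bound(k, n)
        population = n / sample_share  # rough size of the full data set
        print(name, p_max, p_max * population)
```

Multiplying the bound by the estimated size of the full lowest-frequency data set gives the kind of "at most so many TPs" figure quoted on the slide.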

29
N-best Lists for Lowest-Frequency Data
Evaluate 10,000-best lists.
To reduce manual annotation work, take a 10% sample from each list (i.e. 1,000 candidates for each AM).
Precision graphs for N-best lists up to N = 10,000 for the PNV data.
95% confidence estimates for the precision of the best-performing AM (from the binomial test).
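A confidence estimate for the precision of a sampled N-best list can be sketched as a two-sided exact (Clopper-Pearson style) binomial interval. This is an assumption about the kind of interval meant by "from the binomial test"; the slides do not give the procedure, and the counts in the example are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), for 0 < p < 1."""
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def bisect(func, lo, hi, steps=60):
    """Root of a function that is positive at lo and negative at hi."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def precision_interval(tp, n, confidence=0.95):
    """Two-sided exact interval for the precision tp / n of a sample."""
    half_alpha = (1.0 - confidence) / 2
    if tp == 0:
        lower = 0.0
    else:
        # Smallest p for which seeing >= tp TPs is not too unlikely.
        lower = bisect(lambda p: half_alpha - (1.0 - binom_cdf(tp - 1, n, p)),
                       0.0, tp / n)
    if tp == n:
        upper = 1.0
    else:
        # Largest p for which seeing <= tp TPs is not too unlikely.
        upper = bisect(lambda p: binom_cdf(tp, n, p) - half_alpha,
                       tp / n, 1.0)
    return lower, upper

if __name__ == "__main__":
    # Hypothetical result: 50 TPs in a 1,000-candidate sample of one list.
    print(precision_interval(50, 1000))
```

Because only a 10% sample of each list is annotated, the interval width reflects the sample size (1,000), not the size of the full 10,000-best list.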

30
Random Sample Evaluation

31
Random Sample Evaluation

32
Random Sample Evaluation
