Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd.

1 Simple Maths for Keywords
Adam Kilgarriff
Lexical Computing Ltd

2 Liverpool, July 2009. Kilgarriff: Simple Maths
“This word is twice as common here as there”

3 “This word is twice as common here as there”
• What does it mean?
For the word wubble:
                      Freq (f)   Corpus size   Per million
Focus corp (fc)       40         10m           4
Reference corp (rc)   50         25m           2
Ratio = 2: wubble is twice as common in fc as in rc.
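The arithmetic on this slide can be sketched in a few lines of Python. The function name per_million is illustrative and not taken from the talk; the numbers are the ones in the table.

```python
def per_million(freq: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size

# Figures for "wubble" from the slide.
fc_pm = per_million(40, 10_000_000)   # focus corpus: 40 hits in 10m tokens
rc_pm = per_million(50, 25_000_000)   # reference corpus: 50 hits in 25m tokens
ratio = fc_pm / rc_pm

print(fc_pm, rc_pm, ratio)
```

Normalising first matters because the raw count is higher in the reference corpus (50 against 40) even though the word is relatively rarer there.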

4 “This word is twice as common here as there”
• Not just words
  - Grammatical constructions
  - Suffixes
  - …
• Keyword list
  - Calculate the ratio for all words
  - Sort
  - Keywords: at the top of the list
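A minimal sketch of the keyword-list recipe (calculate the ratio for every word, then sort) might look like this. The function name, the toy frequency tables and the choice to skip words missing from the reference corpus are all assumptions made for illustration; only the wubble figures come from the slides.

```python
def keyword_list(fc_freqs, rc_freqs, fc_size, rc_size):
    """Rank words by the ratio of their per-million frequencies (fc / rc)."""
    scored = []
    for word, fc_count in fc_freqs.items():
        rc_count = rc_freqs.get(word, 0)
        if rc_count == 0:
            continue  # plain ratio is undefined here (division by zero)
        fc_pm = fc_count * 1_000_000 / fc_size
        rc_pm = rc_count * 1_000_000 / rc_size
        scored.append((word, fc_pm / rc_pm))
    # Highest ratio first: keywords end up at the top of the list.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical counts, apart from "wubble".
fc = {"wubble": 40, "the": 600_000, "snark": 5}
rc = {"wubble": 50, "the": 1_500_000, "snark": 50}
ranked = keyword_list(fc, rc, 10_000_000, 25_000_000)
for word, score in ranked:
    print(word, score)
```

The same loop would work for grammatical constructions or suffixes, as long as each item has a count in both corpora.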

5 Good enough for keywords?
• Almost, but
  1. Are corpora well matched?
  2. Burstiness
  3. You can’t divide by zero
  4. High ratios more common for rare words

6 1. Are corpora well matched?
• Proportionality
  - If fiction contains more American, newspaper more British … genre is compromised by region
• Usual problem
• Issue in corpus design
• Not covered here

7 2. Burstiness
Word          BNC freq   BNC files
mucosa        1031       9
theology      1032       230
unfortunate   1031       648
• Discount frequency for bursty words
• Gries, CL 2007, also CL journal
• We use ARF (average reduced frequency)
• Not covered here
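The talk does not define ARF, so the following is only a sketch. It assumes the usual definition (Savický and Hlaváčová): with corpus size N and frequency f, let v = N/f, take the distance from each occurrence to the previous one (wrapping round the corpus), cap each distance at v, sum, and divide by v. The function name and the toy positions are made up for illustration.

```python
def arf(positions, corpus_size):
    """Average reduced frequency of a word found at the given token positions."""
    pos = sorted(positions)
    f = len(pos)
    if f == 0:
        return 0.0
    v = corpus_size / f  # gap size if the word were spread perfectly evenly
    # Distance from each occurrence back to the previous one; the first
    # occurrence wraps round to the last, treating the corpus as a circle.
    gaps = [pos[0] + corpus_size - pos[-1]]
    gaps += [later - earlier for earlier, later in zip(pos, pos[1:])]
    return sum(min(gap, v) for gap in gaps) / v

# Two toy words, each with 10 hits in a 100-token corpus.
even = arf(range(0, 100, 10), 100)   # one hit every 10 tokens
clumped = arf(range(10), 100)        # all hits in the first 10 tokens
print(even, clumped)
```

An evenly spread word keeps its full frequency, while a word packed into one stretch of text is discounted towards 1, which is the behaviour wanted for bursty words like mucosa.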

8 3. You can’t divide by zero
            fc     rc   ratio
buggle      10     0    ?
stort       100    0    ?
nammikin    1000   0    ?
• Standard solution: add one
• Problem solved
            fc+1   rc+1   ratio
buggle      11     1      11
stort       101    1      101
nammikin    1001   1      1001
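As a small illustration of the add-one fix, using the three invented words from the slide (the function name is an assumption):

```python
def add_one_ratio(fc: int, rc: int) -> float:
    """Ratio of focus to reference frequency after adding 1 to each."""
    return (fc + 1) / (rc + 1)

examples = [("buggle", 10, 0), ("stort", 100, 0), ("nammikin", 1000, 0)]
ratios = {word: add_one_ratio(fc, rc) for word, fc, rc in examples}
for word, value in ratios.items():
    print(word, value)
```

Every word now gets a finite score, including those absent from the reference corpus.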

9 4. High ratios more common for rarer words
        fc     rc    ratio   interesting?
spug    10     1     10      no
grod    1000   100   10      yes
• Some researchers: grammar, grammar words
• Some researchers: lexis, content words
• No right answer
• Slider?

10 Solution
• Don’t just add 1, add n:
• n=1
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       11      1       11.00   1
middling    200     100     201     101     1.99    2
common      12000   10000   12001   10001   1.20    3
• n=100
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       110     100     1.10    3
middling    200     100     300     200     1.50    1
common      12000   10000   12100   10100   1.20    2

11 Solution
• n=1000
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       1010    1000    1.01    3
middling    200     100     1200    1100    1.09    2
common      12000   10000   13000   11000   1.18    1
Summary
word        fc      rc      n=1   n=100   n=1000
obscurish   10      0       1st   3rd     3rd
middling    200     100     2nd   1st     2nd
common      12000   10000   3rd   2nd     1st
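The add-n calculation can be reproduced with a short Python sketch. It assumes, as the slides do, that n is added directly to the two frequencies, which presumes they are already comparable (equal-sized corpora or per-million figures). The names WORDS, simple_maths and ranking are illustrative.

```python
# Frequencies (fc, rc) for the three invented words on the slides.
WORDS = {
    "obscurish": (10, 0),
    "middling": (200, 100),
    "common": (12000, 10000),
}

def simple_maths(fc: float, rc: float, n: float) -> float:
    """Keyword score: add n to both frequencies, then take the ratio."""
    return (fc + n) / (rc + n)

def ranking(n: float) -> list:
    """Words ordered from highest to lowest score for a given n."""
    return sorted(WORDS, key=lambda w: simple_maths(*WORDS[w], n), reverse=True)

for n in (1, 100, 1000):
    scores = {w: round(simple_maths(fc, rc, n), 2) for w, (fc, rc) in WORDS.items()}
    print(n, ranking(n), scores)
```

Raising n moves the top of the list from rare words towards common ones, so n plays the role of the slider mentioned earlier.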

12 But what about
• Mutual information
• Log-likelihood
• Chi-square
• Fisher’s test
• …
• Don’t they use cleverer maths?

13 Yes, but
• Clever maths is for hypothesis testing
  - Can you defeat the null hypothesis?
• Language is not random, so
• … you always can
• Null hypothesis never true
• Hypothesis testing not informative
• Clever maths irrelevant
Kilgarriff 2006, CLLT
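One way to see why a significance test says little here is to hold the relative difference fixed and let the corpora grow. The sketch below uses the textbook chi-square formula for a 2x2 table and counts invented for the purpose; it is not taken from the talk and covers only the sample-size side of the argument.

```python
def chi_square_2x2(f1: int, size1: int, f2: int, size2: int) -> float:
    """Pearson chi-square (no continuity correction) for word vs. other tokens."""
    a, b = f1, size1 - f1   # corpus 1: the word, all other tokens
    c, d = f2, size2 - f2   # corpus 2: the word, all other tokens
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_05 = 3.841  # critical value for 1 degree of freedom at p = 0.05

# Same 5% relative difference, corpora 100 times larger in the second case.
small = chi_square_2x2(105, 1_000_000, 100, 1_000_000)
large = chi_square_2x2(10_500, 100_000_000, 10_000, 100_000_000)
print(small > CRITICAL_05, large > CRITICAL_05)
```

The ratio between the two corpora is 1.05 in both cases, yet only the larger sample rejects the null hypothesis: the statistic tracks the amount of data as much as the size of the difference.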

14 Moreover…
• just one answer
  - grammar words vs content words?
  - does not help
• confuses and obscures

15 You should understand the maths you use

16 The Sketch Engine
• Leading corpus query tool
• Widely used by dictionary publishers and at universities
• Large corpora available for many languages
• Word sketches
• Web service
• Since last week: implements SimpleMaths

17 Example
• BAWE: British Academic Written English
• Nesi and Thompson, completed last year
  - Student essays
• Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
  - fc: ArtsHum, rc: SocSci
  - With n=10 and n=1000

20 Thank you
http://www.sketchengine.co.uk

21 Language is never ever ever random

22 Language

23 is

24 never

25 ever

26 ever

27 random

