Using Corpora in Linguistics and Lexicography Adam Kilgarriff Lexical Computing Ltd Universities of Leeds, Sussex, UK.

1 Using Corpora in Linguistics and Lexicography Adam Kilgarriff Lexical Computing Ltd Universities of Leeds, Sussex, UK

2 Outline (IDS Mannheim 2010, Kilgarriff)
• Precision and recall
• History of corpus lexicography
• The Sketch Engine
  – Demo
• Web corpora
• Corpus and dictionary

3 Find me all the fat cats
• a request for information

4 High recall
• Lots of responses
• Maybe not all good

5 High precision
• Fewer hits
• Higher confidence

6 Information-seeking
           Recall  Precision
Computers  good    bad
People     bad     good
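
The two measures on the preceding slides can be made concrete with a small sketch. The cat names and the two search results below are invented for illustration; only the definitions of precision and recall are standard.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against the truly relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: five cats really are fat; two searches try to find them.
fat_cats = {"tom", "felix", "garfield", "sylvester", "salem"}
broad_search = fat_cats | {"rex", "fido", "polly", "nemo", "babe"}  # lots of responses
narrow_search = {"tom", "garfield"}                                  # fewer hits

broad = precision_recall(broad_search, fat_cats)    # high recall, lower precision
narrow = precision_recall(narrow_search, fat_cats)  # high precision, lower recall
```

The broad search finds every fat cat but half of its answers are wrong; the narrow search is always right but misses most of them.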

7 Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

8 Four ages of corpus lexicography

9 Age 1: Pre-computer
• Oxford English Dictionary: 5 million index cards

10 Age 2: KWIC Concordances
• From 1980
• Computerised
• Overhauled lexicography

11 Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no
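
A KWIC (key word in context) concordance can be sketched in a few lines. This is a minimal illustration with whitespace tokenisation and an invented sample sentence, not the tooling used in the projects described here.

```python
def kwic(text, node, width=3):
    """Return (left context, node word, right context) for each hit of `node`."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively and ignore trailing punctuation.
        if tok.lower().strip(".,;:!?") == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text.
sample = "We must save money now. They hope to save lives and save jobs."
concordance = kwic(sample, "save", width=2)

for left, node, right in concordance:
    print(f"{left:>12}  {node}  {right}")
```

Every hit produces one line, which is why the approach stops scaling once a word has thousands of occurrences.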

12 Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience

13 Age-3 collocation statistics: limitations
Lists
• contain junk
• are unsorted for type
  – adverbs, subjects, objects, prepositions are mixed together
What we really want:
• noise-free lists
• one list for each grammatical relation

14 Collocation listing
For collocates of save (>5 hits), window 1-5 words to right of nodeword
word
forests    life
$1.2       dollars
lives      costs
enormous   thousands
annually   face
jobs       estimated
money      your
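
A listing of this kind can be produced by counting the words that fall in a window to the right of the node word. The sketch below assumes a pre-tokenised corpus and uses an invented toy sentence; it sorts by raw frequency only, whereas the slides sort by salience.

```python
from collections import Counter


def right_collocates(tokens, node, window=5, min_hits=1):
    """Count words occurring 1..window positions to the right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + window])
    # Keep only collocates that reach the frequency threshold.
    return [(word, freq) for word, freq in counts.most_common() if freq >= min_hits]


# Hypothetical toy corpus; a real run would use window=5 and a higher threshold.
toy = "save money and save lives to save money".split()
collocates = right_collocates(toy, "save", window=2, min_hits=2)
```

Because the window ignores grammar, function words and other noise ("your", "estimated") end up in the same list as genuine objects such as "money" or "lives".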

15 Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
  – subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
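
The idea of one sorted list per grammatical relation can be sketched as follows. The input triples are assumed to come from some parser and are invented here. The slide does not name the sorting statistic; logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))), is used below as one plausible choice, with frequencies counted per relation.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group collocates of `headword` by relation, each list sorted by logDice."""
    pair_f = Counter(triples)                          # f(relation, head, collocate)
    head_f = Counter((r, h) for r, h, _ in triples)    # f(relation, head, *)
    coll_f = Counter((r, c) for r, _, c in triples)    # f(relation, *, collocate)
    sketch = defaultdict(list)
    for (rel, head, coll), f in pair_f.items():
        if head != headword:
            continue
        score = 14 + math.log2(2 * f / (head_f[(rel, head)] + coll_f[(rel, coll)]))
        sketch[rel].append((coll, f, round(score, 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda item: item[2], reverse=True)
    return dict(sketch)


# Hypothetical parser output: (relation, head, collocate).
triples = (
    [("object", "save", "money")] * 3
    + [("object", "save", "time")]
    + [("object", "spend", "time")] * 4
    + [("subject", "save", "government")] * 2
)
sketch = word_sketch(triples, "save")
```

In this toy data "money" outranks "time" as an object of "save" because "time" is mostly the object of another verb.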

16 Macmillan English Dictionary For Advanced Learners
Ed: Rundell, 2002

17 Working practice
• Lexicographers mainly used sketches not concordances
  – Missed less, more consistent
  – Faster

18 Euralex 2002

19 Euralex 2002
• Can I have them for my language, please?

20 The Sketch Engine
• Input:
  – any corpus, any language
    • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

21 Customers
• Dictionary publishers
  – Oxford University Press
  – Cambridge University Press
  – Collins
  – Macmillan
  – FrameNet Project (Berkeley, US)
  – National dictionary projects in
    • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
  – Teaching and research
  – Languages, linguistics, language technology
  – UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia,…
• Other
  – Language teaching, textbook writing
  – Information management, web search companies
  – Automatic translation

22 Web corpora
• Replaceable or replacable?
  – http://googlefight.com
  – http://looglefight.com

23 The web is
  – Very very large
  – Most languages
  – Most language types
  – Up-to-date
  – Free
  – Instant access

24 Web corpus types
• Large, general corpora
• Small, specialised corpora
  – Specially for translators

25 Basic steps
• Gather pages
  – CSE hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
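
The filter and de-duplicate steps can be sketched on pages that have already been gathered. Everything here is a simplification made for illustration: the length threshold is an arbitrary assumption, only exact duplicates (after case and whitespace normalisation) are caught, and tokenisation stands in for real linguistic processing.

```python
import hashlib


def build_corpus(pages, min_words=5):
    """Filter out very short pages, drop exact duplicates, and tokenise the rest."""
    seen = set()
    corpus = []
    for text in pages:
        tokens = text.split()
        if len(tokens) < min_words:
            # Filter: navigation stubs and similar pages carry little running text.
            continue
        normalised = " ".join(t.lower() for t in tokens)
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key in seen:
            # De-duplicate: the same text was already collected from another page.
            continue
        seen.add(key)
        corpus.append(tokens)  # stand-in for lemmatising and tagging
    return corpus


# Hypothetical downloaded pages.
pages = [
    "Home | Login",
    "The cat sat on the mat today",
    "the cat  sat on the mat TODAY",
    "Corpora are useful for lexicographers everywhere",
]
corpus = build_corpus(pages)
```

Web-scale corpus building usually also needs near-duplicate detection, which this sketch does not attempt.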

26 WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
  – Currently 30 languages
  – Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

27 Corpora
Arabic     174    Hindi        31    Russian     188
Chinese    456    Indonesian  102    Slovak      536
Czech      800    Irish        34    Slovene     738
Dutch      128    Italian    1910    Spanish     117
English   5508    Japanese    409    Swedish     114
French     126    Norwegian    95    Telugu        5
German    1627    Persian       6    Thai        108
Greek      149    Portuguese   66    Vietnamese  174
Estonian    11    Romanian     53    Welsh        63
Korean      77    Polish      156    Malay       230

28 How good are they?
• How to assess?
  – Hard question, open research topic
• Good coverage
  – Newspapers: news, politics bias
  – Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
  – First two are close

29 Evaluating word sketches
• 11 years
  – 1999-2010
• Feedback
  – Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

30 Goal
• Collocations dictionary
  – Model: Oxford Collocations Dictionary
  – Publication-quality
• Ask a lexicographer
  – For 42 headwords
    • For 20 best collocates per headword
  – “Should we include this collocation in a published dictionary?”

31 Sample of headwords
• Nouns, verbs, adjectives, random
• High (Top 3000)
  – N: space solution opinion mass corporation leader
  – V: serve incorporate mix desire
  – Adj: high detailed open academic
• Mid (3000-9999)
  – N: cattle repayment fundraising elder biologist sanitation
  – V: grieve classify ascertain implant
  – Adj: adjacent eldest prolific ill
• Low (10,000-30,000)
  – N: predicament adulterer bake bombshell candy shellfish
  – V: slap outgrow plow traipse
  – Adj: neoclassical votive adulterous expandable

32 Precision and recall
• We test precision
• Recall is harder
  – How do we find all the collocations that the system should have found?
• Current work
  – 200 collocates per headword
  – Selected from
    • All the corpora we have
    • Various parameter settings
  – Plus just-in-time evaluation for 'new' collocates

33 Four languages, three families
• Dutch
  – ANW, 102m-word lexicographic corpus
• English
  – UKWaC, 1.5b web corpus
• Japanese
  – JpWaC, 400m web corpus
• Slovene
  – FidaPlus, 620m lexicographic corpus

34 User evaluation
• Evaluate whole system
  – Will it help with my task
    • Eg preparing a collocations dictionary
• Contrast: developer evaluation
  – Can I make the system better?
    • Evaluate each module separately
    • Current work

35 Components
• Corpus
• NLP tools
  – Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

36 Practicalities
• Interface
  – Good, Good-but
    • Merge to good
  – Maybe, Maybe-specialised, Bad
    • Merge to bad
• For each language
  – Two/three linguists/lexicographers
  – If they disagree
    • Don't use for computing performance
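
The scoring procedure on this slide can be sketched as below: merge the five interface labels into good/bad, drop any collocation on which the judges disagree, and report the share of the remainder judged good. The collocations and judgements are invented for illustration, and the function name is hypothetical.

```python
# Five interface labels merged to two classes, as on the slide.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}


def precision_from_judgements(judgements):
    """judgements maps each collocation to the labels given by its judges."""
    agreed = []
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:        # all judges agree after merging
            agreed.append(merged.pop())
        # Disagreements are left out of the performance figure.
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Hypothetical judgements from two judges.
judgements = {
    "save money": ["Good", "Good-but"],
    "save face": ["Good", "Good"],
    "save your": ["Bad", "Maybe"],
    "save annually": ["Good", "Bad"],   # disagreement: excluded
}
precision = precision_from_judgements(judgements)
```

Here three of the four items survive the agreement check and two of those are good, giving a precision of two thirds.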

37 Results
Dutch     66%
English   71%
Japanese  87%
Slovene   71%

38 Two thirds of a collocations dictionary can be gathered automatically

39 Small specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
• Instant small web corpora
  – BootCaT: Baroni and Bernardini 2004
  – WebBootCaT demo

40 Cyborgs
• A creature that is partly human and partly machine
  – Macmillan English Dictionary

45 Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).

46 Treat your computer with respect. You and it can do great things together.

47 Thank you
http://www.sketchengine.co.uk

48 Corpus and dictionary
• Established model:
  – Lexicographers use corpora, users use dictionaries
• But
  – Users like collocations, examples
  – But are not corpus linguists
• Explore the space between corpus and dictionary

49 Collocationality
• Which words are most ‘collocational’?
• Dictionary publishers
  – Where to put ‘collocation boxes’
• Language learners

50 Calculation of entropy for advantage (object relation)
Verb      Freq   MLE Prob x log = entropy
Take      2084   -.469
Gain       131   -.169
Offer      117   -.157
See        110   -.150
Enjoy       67   -.104
…            …   …
Clarify      1   -0.0031
…            …   …
Total     3730   -3.909
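
The per-verb values in this table can be reproduced from the frequencies. Each row is p · log(p), where p is the maximum-likelihood estimate freq / total. The slide does not state the log base; the sketch below assumes base 2, which matches the listed values to within 0.001. Only the rows shown on the slide are used, so the full sum of -3.909 cannot be recomputed here.

```python
import math

TOTAL = 3730  # total frequency of verbs taking "advantage" as object (from the slide)

# Rows listed on the slide; the elided rows are not reconstructed.
listed = {"take": 2084, "gain": 131, "offer": 117, "see": 110, "enjoy": 67, "clarify": 1}


def entropy_term(freq, total):
    """One row of the table: p * log2(p), with p the MLE probability freq / total."""
    p = freq / total
    return p * math.log2(p)


def entropy(freqs):
    """Entropy in bits of a complete frequency distribution (negated sum of the rows)."""
    total = sum(freqs.values())
    return -sum(entropy_term(f, total) for f in freqs.values())


terms = {verb: entropy_term(freq, TOTAL) for verb, freq in listed.items()}
for verb, value in terms.items():
    print(f"{verb:8} {listed[verb]:5d} {value:8.4f}")
```

A noun whose object slot is dominated by one verb, as "advantage" is by "take", gets a low entropy, which is the sense in which it counts as highly collocational.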

52 place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),

53 What is there on the web?
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
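
The comparison can be sketched as follows, using two small invented frequency lists in place of Web1T and the BNC. Normalising each count by its corpus total before taking the ratio is an assumption of this sketch; the slide does not say how the ratio was computed.

```python
def compare(web, bnc, top_n, k):
    """Return (words only in web top-N, k highest-ratio words, k lowest-ratio words)."""
    top_web = set(sorted(web, key=web.get, reverse=True)[:top_n])
    top_bnc = set(sorted(bnc, key=bnc.get, reverse=True)[:top_n])
    web_only = top_web - top_bnc
    shared = top_web & top_bnc
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    # Ratio of relative frequencies, so the corpus sizes cancel out.
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return web_only, ranked[:k], ranked[-k:]


# Hypothetical counts, for illustration only.
web = {"the": 1000, "browser": 200, "url": 100, "she": 50, "council": 10}
bnc = {"the": 1000, "she": 300, "council": 100, "herself": 50, "browser": 5}

web_only, web_high, web_low = compare(web, bnc, top_n=5, k=1)
```

With the real data the slide's settings would correspond to top_n=50000 and k=50.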

54 Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

55 Web-low
• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

56 Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself

