A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project. Adam Kilgarriff, Lexical Computing.

Presentation transcript:

1 A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project. Adam Kilgarriff, Lexical Computing Ltd. http://www.sketchengine.co.uk

2 English Profile. From 2006: Cambridge Univ, Univ Press, ESOL (+ others). Goal: for each CEFR level, find characteristic lexis and grammar. CEFR: Common European Framework of Reference – A1, A2: Beginner – B1, B2: Intermediate – C1, C2: Advanced. Main resource: CLC. NTNU, Nov 2011, Kilgarriff.

3 Cambridge Learner Corpus (CLC). Since 1993. Leading resource. CUP and Cambridge Assessment – For better dictionaries, ELT courses, tests – Material: all from exams (levels A1-C2). 45m words; 22m error-tagged. 200,000 scripts, 138 L1s, 203 nationalities.

4 Sketch Engine. Leading corpus tool. Word sketches – One-page summaries of a word’s grammatical and collocational behaviour. In use at OUP, CUP, Collins, Macmillan, INL … 55 languages – 175 corpora – Since May including CHILDES: demo – Since last year including CLC

5 Macmillan English Dictionary for Advanced Learners. Ed: Rundell, 2002

6 Error-coded corpus. Challenge – Intuitive to search for x: anywhere; only where it is part of an error; only where it is part of a correction; where x can be a word, phrase, grammar pattern … Requirement for CLC in Sketch Engine
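
The three search scopes on this slide can be sketched as follows. This is only an illustration: the slide does not show the CLC's error-coding format or the Sketch Engine query syntax, so the `Token` structure, its two flags and the `search` function are hypothetical, and the learner sentence is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    word: str
    in_error: bool = False       # hypothetical flag: part of the learner's erroneous text
    in_correction: bool = False  # hypothetical flag: part of the annotator's correction


def search(tokens: List[Token], x: str, scope: str = "anywhere") -> List[int]:
    """Return the positions of word x, restricted to the requested scope."""
    if scope not in ("anywhere", "error", "correction"):
        raise ValueError(f"unknown scope: {scope}")
    hits = []
    for i, tok in enumerate(tokens):
        if tok.word != x:
            continue
        if (scope == "anywhere"
                or (scope == "error" and tok.in_error)
                or (scope == "correction" and tok.in_correction)):
            hits.append(i)
    return hits


# Invented example: the learner wrote "depend of", corrected to "depend on".
tokens = [
    Token("on"), Token("Monday"), Token("we"), Token("depend"),
    Token("of", in_error=True), Token("on", in_correction=True),
    Token("the"), Token("weather"),
]

anywhere_hits = search(tokens, "on")                  # every "on"
correction_hits = search(tokens, "on", "correction")  # "on" only inside a correction
error_hits = search(tokens, "of", "error")            # "of" only inside an error
```

A real system would also need x to match phrases and grammar patterns, which this word-level sketch leaves out.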

7 Error-coded corpora in SkE: demo

8 HOO / HOO+. Helping Our Own. HOO: English-NNS NLP researchers – Developer = user: motivation – Shared task/competitive evaluation: organisers define task and prepare ‘gold standard’; teams participate by running their software over test data. Six teams (incl Tübingen), workshop end Sept

9 HOO+ (2012). Probably – English: learner data from CLC – Other languages? – Tasks: essay scoring; determiner, preposition errors; ? http://www.clt.mq.edu.au/research/projects/hoo/

10 DANTE Highlights of English lexicography

11 DANTE

12 DANTE

13 DANTE

14 DANTE http://webdante.com

15 The KELLY Project. EU Lifelong Learning Project. Word cards – 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish – All 36 pairs – Words the learner should know (at A1 … C2). Partners: Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

16 Interesting question: How close to purely corpus-based can a pedagogic list be?

17 Method. Take a general corpus. Count. Review, add, delete using other lists and corpora. Translate (72 directed-lg-pairs). Words not in source list which occur in translations: – Review source list. http://kelly.sketchengine.co.uk
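
A rough sketch of the counting and review steps, under stated assumptions. The function names, the data layout and the toy Swedish/English entries are invented for illustration and are not the project's actual tooling. The nine languages come from the KELLY slide; ordered pairs of nine languages give the 72 directed pairs.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List, Set, Tuple

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Every ordered (source, target) pair: 9 * 8 = 72 directed pairs.
DIRECTED_PAIRS = list(permutations(LANGUAGES, 2))


def frequency_list(tokens: List[str], size: int) -> List[str]:
    """Top `size` words by corpus frequency (ties broken alphabetically)."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def review_candidates(
    lists: Dict[str, Set[str]],
    translations: Dict[Tuple[str, str], Dict[str, Set[str]]],
) -> Dict[str, Set[str]]:
    """For each language, collect words that occur as translations into it
    but are missing from its own list; these are candidates for review."""
    candidates: Dict[str, Set[str]] = {lang: set() for lang in lists}
    for (_source, target), mapping in translations.items():
        own_list = lists.get(target, set())
        for target_words in mapping.values():
            candidates.setdefault(target, set()).update(target_words - own_list)
    return candidates


# Toy data (invented): the English list lacks "library", but it turns up
# as the translation of a word on the Swedish list.
lists = {"English": {"sun", "music"}, "Swedish": {"sol", "musik", "bibliotek"}}
translations = {
    ("Swedish", "English"): {
        "sol": {"sun"}, "musik": {"music"}, "bibliotek": {"library"},
    },
}
candidates = review_candidates(lists, translations)
top_words = frequency_list("the cat the dog the cat".split(), 2)
```

The manual "review, add, delete" step is deliberately absent: the sketch only produces the candidates a human would then look at.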

18 Symmetrical pairs: and Cliques: – For x, y, z, … all pairs are symmetrical – 9-language cliques (English members): hospital, library, music, sun, theory
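
One way to check for cliques, as a sketch only. The slide's definition of a symmetrical pair did not survive in the transcript, so this assumes a pair is symmetrical when each word is listed as a translation of the other. The `Word` representation and the toy translation table are invented.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Word = Tuple[str, str]  # (language code, word); a hypothetical representation


def is_symmetrical(a: Word, b: Word, trans: Dict[Word, Set[Word]]) -> bool:
    """Assumed definition: a translates to b and b translates back to a."""
    return b in trans.get(a, set()) and a in trans.get(b, set())


def is_clique(words: Iterable[Word], trans: Dict[Word, Set[Word]]) -> bool:
    """True when every pair of words in the set is symmetrical."""
    return all(is_symmetrical(a, b, trans)
               for a, b in combinations(list(words), 2))


# Toy table (invented): the Italian entry lacks the Swedish translation.
en_sun, sv_sol, it_sole = ("en", "sun"), ("sv", "sol"), ("it", "sole")
trans = {
    en_sun: {sv_sol, it_sole},
    sv_sol: {en_sun, it_sole},
    it_sole: {en_sun},
}

pair_ok = is_clique([en_sun, sv_sol], trans)
triple_ok = is_clique([en_sun, sv_sol, it_sole], trans)
```

With this toy table the English/Swedish pair passes, while the three-word set fails because one direction is missing.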

19 Web corpora. Replaceable or replacable? – http://googlefight.com – http://looglefight.com

20 The web is – Very very large – Most languages – Most language types – Up-to-date – Free – Instant access

21 Web corpus types. Large, general corpora. Small, specialised corpora – Specially for translators

22 Basic steps. Gather pages – CSE hits – Select and gather whole sites – General crawl. Filter. De-duplicate. Linguistic processing. Load into corpus tool
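
The filter and de-duplicate steps might look roughly like this. It is a minimal sketch with assumptions: the slide does not say what is filtered or how duplicates are detected, so the minimum-length filter and the exact-match hashing are stand-ins, and all function names are hypothetical. Production pipelines typically also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import re
from typing import Iterable, List


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase, so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def keep_page(text: str, min_words: int = 5) -> bool:
    """Hypothetical filter: keep only pages with enough running text."""
    return len(text.split()) >= min_words


def deduplicate(pages: Iterable[str]) -> List[str]:
    """Drop pages whose normalised text has been seen before (exact match only)."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def build_corpus(pages: Iterable[str], min_words: int = 5) -> List[str]:
    """Filter first, then de-duplicate, in the order the slide lists them."""
    return deduplicate(p for p in pages if keep_page(p, min_words))


pages = [
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",   # same text, different case and spacing
    "Click here",                        # too short, filtered out
    "A different page with enough words here.",
]
corpus = build_corpus(pages)
```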

23 WaC family corpora. 100m to 2b word corpora. 2-month project each. All major world languages available in Sketch Engine – Currently 42 languages – Growing monthly. Pioneers: Marco Baroni, Serge Sharoff. Corpus Factory. Seeds: – mid-frequency words from ‘core vocab’ lists and corpora. Google on seed words, then crawl

24 How good are they? How to assess? – Hard question, open research topic. Good coverage – Newspapers: news, politics bias – Web corpora: also cover personal, kitchen vocab. Web corpus / BNC / journalism corpus – First two are close

25 Evaluating word sketches. 11 years – 1999-2011. Feedback – Good but anecdotal. Formal evaluation. Method also lets us evaluate corpora

26 Goal. Collocations dictionary – Model: Oxford Collocations Dictionary – Publication-quality. Ask a lexicographer – For 42 headwords. For 20 best collocates per headword – “should we include this collocation in a published dictionary?”

27 Sample of headwords: nouns, verbs, adjectives, random
High (top 3000): N space solution opinion mass corporation leader; V serve incorporate mix desire; Adj high detailed open academic
Mid (3000-9999): N cattle repayment fundraising elder biologist sanitation; V grieve classify ascertain implant; Adj adjacent eldest prolific ill
Low (10,000-30,000): N predicament adulterer bake bombshell candy shellfish; V slap outgrow plow traipse; Adj neoclassical votive adulterous expandable

28 Precision and recall. A request for information – Find me all the fat cats

29 High recall. Lots of responses. Maybe not all good

30 High precision. Fewer hits. Higher confidence

31 Precision and recall. We test precision. Recall is harder: how do we find all the collocations that the system should have found? Current work: 200 collocates per headword, selected from all the corpora we have and various parameter settings, plus just-in-time evaluation for 'new' collocates
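
For reference, the two measures in their standard set-based form. The function name and the cat names are made up, echoing the "find me all the fat cats" request. The sketch also shows why recall is the harder one to test: it needs the full set of relevant items, which for collocations is not known in advance.

```python
from typing import Set, Tuple


def precision_recall(found: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision: share of found items that are relevant.
    Recall: share of relevant items that were found."""
    true_positives = len(found & relevant)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four fat cats exist; the system returns three animals,
# two of which really are fat cats.
relevant = {"tom", "felix", "garfield", "sylvester"}
found = {"garfield", "tom", "rex"}
precision, recall = precision_recall(found, relevant)
```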

32 Four languages, three families. Dutch – ANW, 102m-word lexicographic corpus. English – UKWaC, 1.5b web corpus. Japanese – JpWaC, 400m web corpus. Slovene – FidaPlus, 620m lexicographic corpus

33 User evaluation. Evaluate whole system – Will it help with my task? E.g. preparing a collocations dictionary. Contrast: developer evaluation – Can I make the system better? Evaluate each module separately. Current work

34 Components. Corpus. NLP tools – Segmenter, lemmatiser, POS-tagger. Sketch grammar. Statistics

35 Practicalities. Interface – Good, Good-but: merge to good – Maybe, Maybe-specialised, Bad: merge to bad. For each language – Two/three linguists/lexicographers – If they disagree: don't use for computing performance
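
The scoring procedure on this slide could be sketched as follows. The five labels and the merge rule come from the slide. The function names, the placeholder collocate ids `c1` to `c4` and their judgements are invented, and treating "performance" as the share of agreed items judged good is an assumption.

```python
from typing import Dict, List, Optional

# Merge rule from the slide: five interface labels collapse to two classes.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}


def agreed_label(judgements: List[str]) -> Optional[str]:
    """Merged label if all judges agree after merging, otherwise None."""
    merged = {MERGE[j] for j in judgements}
    return merged.pop() if len(merged) == 1 else None


def share_good(items: Dict[str, List[str]]) -> Optional[float]:
    """Share of 'good' among items the judges agree on; disagreements are dropped."""
    labels = [agreed_label(j) for j in items.values()]
    agreed = [label for label in labels if label is not None]
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Invented judgements from two judges for four placeholder collocates.
judgements = {
    "c1": ["Good", "Good-but"],   # both merge to good
    "c2": ["Maybe", "Bad"],       # both merge to bad
    "c3": ["Good", "Maybe"],      # disagreement: excluded
    "c4": ["Good", "Good"],
}
score = share_good(judgements)
```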

36 Results: Dutch 66%, English 71%, Japanese 87%, Slovene 71%

37 Two thirds of a collocations dictionary can be gathered automatically

38 Thank you http://www.sketchengine.co.uk

40 Lexicography: finding facts about words: collocations, grammatical patterns, idioms, synonyms, meanings, translations

41 Four ages of corpus lexicography

42 Age 1: Pre-computer. Oxford English Dictionary: 5 million index cards

43 Age 2: KWIC concordances. From 1980. Computerised. Overhauled lexicography

44 Age 2: limitations. As corpora get bigger: too much data. 50 lines for a word: read all. 500 lines: could read all, takes a long time, slow. 5000 lines: no

45 Age 3: Collocation statistics. Problem: too much data - how to summarise? Solution: list of words occurring in neighbourhood of headword, with frequencies, sorted by salience
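
A small sketch of such a list: collocates within a fixed window of the headword, with co-occurrence counts, sorted by a salience score. The slide does not name the statistic, so logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))), is used here purely as one plausible choice. The window size, function name and toy sentence are also assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def collocates(tokens: List[str], headword: str,
               window: int = 1) -> List[Tuple[str, int, float]]:
    """Return (collocate, co-occurrence count, salience) rows, most salient first."""
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                cooc[tokens[j]] += 1
    rows = []
    for word, f_xy in cooc.items():
        # logDice (assumed salience measure, not taken from the slide)
        score = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        rows.append((word, f_xy, score))
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Invented toy text.
tokens = "strong tea and strong tea or weak tea".split()
rows = collocates(tokens, "tea", window=1)
ranked_words = [word for word, _count, _score in rows]
```

Note that the output is a single undifferentiated list: modifiers and conjunctions sit side by side, with no indication of how each word relates to the headword.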

46 Age-3 collocation statistics: limitations. Lists contain junk. Unsorted for type – mixes together adverbs, subjects, objects, prepositions. What we really want: noise-free lists, one list for each grammatical relation

47 Age 4: The word sketch. Large well-balanced corpus. Parse to find – subjects, objects, heads, modifiers etc. One list for each grammatical relation. Statistics to sort each list, as before
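
The grouping step can be sketched as below, assuming the parse is already available as (relation, headword, collocate) triples. The triple format, the function name and the toy data are hypothetical. Raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (grammatical relation, headword, collocate)


def word_sketch(triples: Iterable[Triple],
                headword: str) -> Dict[str, List[Tuple[str, int]]]:
    """One collocate list per grammatical relation, most frequent first."""
    by_relation: Dict[str, Counter] = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {
        relation: sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for relation, counts in by_relation.items()
    }


# Invented parser output.
triples = [
    ("object", "drink", "tea"),
    ("object", "drink", "tea"),
    ("object", "drink", "water"),
    ("subject", "drink", "child"),
    ("modifier", "tea", "strong"),
]
sketch = word_sketch(triples, "drink")
```

Compared with a window-based list, each collocate now arrives labelled with its relation to the headword, so objects and subjects no longer share one list.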

48 Working practice. Lexicographers mainly used sketches not concordances – missed less, more consistent – Faster

49 Euralex 2002

50 Euralex 2002. Can I have them for my language please

51 The Sketch Engine. Input: – any corpus, any language: lemmatised, part-of-speech tagged – specification of grammatical relations. Word sketches integrated with corpus query system – Supports complex searching, sorting etc. Credit: Pavel Rychly, Masaryk Univ

52 Customers. Dictionary publishers – Oxford University Press – Cambridge University Press – Collins – National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia. Universities – Teaching and research – Languages, linguistics, language technology – UK, Germany, US, Greece, Taiwan, Japan, China, … Other – Language teaching, textbook writing – Information management, web search

53 Demo – http://sketchengine.co.uk – Free trial

54 What is there on the web? Web1T – Present from Google – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12 = 1,000,000,000,000) words of English. Compare with BNC – Take top 50,000 items of each – 105 Web1T words not in BNC top50k – 50 words with highest Web1T:BNC ratio – 50 words with lowest ratio
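
The comparison can be sketched like this. Several details are assumptions, since the slide does not give them: ratios are computed on relative frequencies and only for words in both top lists. The function names are hypothetical and the counts are invented toy numbers, not Web1T or BNC figures.

```python
from typing import Dict, List, Set, Tuple


def top_items(counts: Dict[str, int], n: int) -> Set[str]:
    """The n most frequent words (ties broken alphabetically)."""
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return set(ranked[:n])


def compare(web: Dict[str, int], bnc: Dict[str, int],
            top_n: int, k: int) -> Tuple[List[str], List[str], List[str]]:
    """Return (web-only words, k highest-ratio words, k lowest-ratio words); k >= 1."""
    web_top, bnc_top = top_items(web, top_n), top_items(bnc, top_n)
    web_only = sorted(web_top - bnc_top)
    shared = web_top & bnc_top
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    # Assumed: ratio of relative frequencies, web corpus over BNC.
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=lambda w: (-ratio[w], w))
    return web_only, ranked[:k], ranked[-k:]


# Invented toy counts.
web_counts = {"the": 100, "url": 50, "www": 40, "she": 5}
bnc_counts = {"the": 100, "she": 30, "council": 20, "url": 1}
web_only, web_high, web_low = compare(web_counts, bnc_counts, top_n=4, k=1)
```

In the slide's setting `top_n` would be 50,000 and `k` would be 50, giving the web-only words plus the 50 highest-ratio and 50 lowest-ratio words.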

55 Web-high (155 terms). 61 web and computing – config browser spyware url www forum. 38 porn. 22 US English. 18 business/products common on web – poker viagra lingerie ringtone dvd casino rental collectible tiffany – NB: BNC is old. 4 legal – trademarks pursuant accordance herein

56 Web-low. Exclude British English, transcription/tokenisation anomalies – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

57 Observations. Pronouns and past tense verbs – Fiction. Masc vs fem. Yesterday – Probably daily newspapers. Constancy of ratios: – He/him/himself – She/her/herself

