A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project
Adam Kilgarriff
Lexical Computing Ltd
http://www.sketchengine.co.uk

English Profile
From 2006
Cambridge Univ, Univ Press, ESOL (+ others)
Goal
– for each CEFR level, find characteristic lexis and grammar
CEFR: Common European Framework of Reference
– A1, A2: Beginner
– B1, B2: Intermediate
– C1, C2: Advanced
– Main resource: CLC
NTNU, Nov 2011, Kilgarriff

Cambridge Learner Corpus (CLC)
Since 1993
Leading resource
CUP and Cambridge Assessment
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
45m words; 22m error-tagged
200,000 scripts, 138 L1s, 203 nationalities

Sketch Engine
Leading corpus tool
Word sketches
– One-page summaries of a word's grammatical and collocational behaviour
In use at OUP, CUP, Collins, Macmillan, INL …
55 languages
– 175 corpora
– Since May including CHILDES: demo
– Since last year including CLC

Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002

Error-coded corpus
Challenge
– Intuitive to search for x
  anywhere
  only where it is part of an error
  only where it is part of a correction
  where x can be a word, phrase, grammar pattern …
Requirement for CLC in Sketch Engine
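The three search scopes on this slide can be illustrated with a minimal sketch. It assumes a hypothetical token representation in which every token carries a role label ("plain", "error" or "correction"). This is neither the CLC's actual error-coding scheme nor Sketch Engine query syntax, and the example sentence is invented.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    role: str  # hypothetical labels: "plain", "error" or "correction"


def find(tokens, x, scope="anywhere"):
    """Return the positions of word x, restricted to the given scope."""
    allowed = {
        "anywhere": {"plain", "error", "correction"},
        "error": {"error"},
        "correction": {"correction"},
    }[scope]
    return [i for i, t in enumerate(tokens) if t.word == x and t.role in allowed]


# Invented learner sentence "I am agree with you", corrected to "I agree with you".
tokens = [
    Token("I", "plain"),
    Token("am", "error"),
    Token("agree", "error"),
    Token("agree", "correction"),
    Token("with", "plain"),
    Token("you", "plain"),
]

hits_anywhere = find(tokens, "agree")                  # both occurrences
hits_in_error = find(tokens, "agree", "error")         # only the learner's version
hits_in_fix = find(tokens, "agree", "correction")      # only the corrected version
```

Here x is a single word; phrases and grammar patterns would need a matcher over token sequences rather than a per-token test.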

Error-coded corpora in SkE: demo

HOO / HOO+
Helping Our Own
HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
  Organisers define task and prepare 'gold standard'
  Teams participate by running their software over test data
Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)
Probably
– English: learner data from CLC
– Other languages?
– Tasks
  Essay scoring
  Determiner, preposition errors
  ?
http://www.clt.mq.edu.au/research/projects/hoo/

DANTE
Highlights of English lexicography

DANTE

DANTE
http://webdante.com

The KELLY Project
EU Lifelong Learning Project
Word cards
– 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
Partners
– Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question
How close to purely corpus-based can a pedagogic list be?

Method
Take a general corpus
Count
Review, add, delete using other lists and corpora
Translate (72 directed language pairs)
Words not in source list which occur in translations:
– Review source list
http://kelly.sketchengine.co.uk

Symmetrical pairs and cliques
Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members): hospital, library, music, sun, theory
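A small sketch of the two notions on this slide, under stated assumptions: a pair is taken to be symmetrical when each word is given as the translation of the other, and a clique is a set of words in which every pair is symmetrical. The `translations` table, its layout and the function names are hypothetical toy data, not the Kelly database.

```python
from itertools import combinations

# Hypothetical toy data: translations[(lang, word)][target_lang] = translated word
translations = {
    ("en", "sun"): {"sv": "sol", "it": "sole"},
    ("sv", "sol"): {"en": "sun", "it": "sole"},
    ("it", "sole"): {"en": "sun", "sv": "sol"},
    ("en", "big"): {"sv": "stor"},
    ("sv", "stor"): {"en": "large"},  # does not translate back to "big"
}


def is_symmetric(a, b):
    """True if a translates to b and b translates back to a."""
    (lang_a, word_a), (lang_b, word_b) = a, b
    return (
        translations.get(a, {}).get(lang_b) == word_b
        and translations.get(b, {}).get(lang_a) == word_a
    )


def is_clique(items):
    """True if every pair of (lang, word) items is symmetrical."""
    return all(is_symmetric(a, b) for a, b in combinations(items, 2))


sun_clique = is_clique([("en", "sun"), ("sv", "sol"), ("it", "sole")])
big_pair = is_symmetric(("en", "big"), ("sv", "stor"))
```

With nine languages, a full clique has 36 unordered pairs to check, which matches the "all 36 pairs" figure given for the project.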

Web corpora
Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com

The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access

Web corpus types
Large, general corpora
Small, specialised corpora
– Specially for translators

Basic steps
Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool
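The filter and de-duplicate steps can be sketched as follows. This is a minimal illustration that assumes the pages have already been gathered as plain-text strings. The word-count threshold and the exact-match hashing are assumptions made for the example, not the settings used for the corpora described in the talk; real pipelines typically also remove near-duplicates and boilerplate.

```python
import hashlib


def keep(page, min_words=5):
    """Filter step: drop pages with too little running text (threshold is an assumption)."""
    return len(page.split()) >= min_words


def dedupe(pages):
    """De-duplicate step: drop pages whose case- and whitespace-normalised text was seen before."""
    seen, unique = set(), []
    for page in pages:
        normalised = " ".join(page.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def build(pages):
    """Run filter then de-duplicate; linguistic processing and loading would follow."""
    return dedupe([p for p in pages if keep(p)])


pages = [
    "Short page",
    "The quick brown fox jumps over the lazy dog",
    "the  quick brown fox jumps over the lazy dog",
    "Another sufficiently long page of running text",
]
corpus = build(pages)
```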

WaC family corpora
100m to 2b word corpora
2-month project each
All major world languages available in Sketch Engine
– Currently 42 languages
– Growing monthly
Pioneers: Marco Baroni, Serge Sharoff
Corpus Factory
Seeds:
– mid-frequency words from 'core vocab' lists and corpora
Google on seed words, then crawl

How good are they?
How to assess?
– Hard question, open research topic
Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
Web corpus / BNC / journalism corpus
– First two are close

Evaluating word sketches
11 years
– 1999-2011
Feedback
– Good but anecdotal
Formal evaluation
Method also lets us evaluate corpora

Goal
Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
Ask a lexicographer
– For 42 headwords
  For 20 best collocates per headword
– "should we include this collocation in a published dictionary?"

Sample of headwords
Nouns, verbs, adjectives; random
High (top 3000)
  N: space solution opinion mass corporation leader
  V: serve incorporate mix desire
  Adj: high detailed open academic
Mid (3000-9999)
  N: cattle repayment fundraising elder biologist sanitation
  V: grieve classify ascertain implant
  Adj: adjacent eldest prolific ill
Low (10,000-30,000)
  N: predicament adulterer bake bombshell candy shellfish
  V: slap outgrow plow traipse
  Adj: neoclassical votive adulterous expandable

Precision and recall
a request for information
– Find me all the fat cats

High recall
Lots of responses
Maybe not all good

High precision
Fewer hits
Higher confidence

Precision and recall
We test precision
Recall is harder
– How do we find all the collocations that the system should have found?
Current work
– 200 collocates per headword
– Selected from
  All the corpora we have
  Various parameter settings
– Plus just-in-time evaluation for 'new' collocates
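For reference, the two measures can be written out with the standard set-based definitions. The sketch below uses invented names for the "fat cats" request; `found` stands for what a system returns and `relevant` for everything it should have returned.

```python
def precision(found, relevant):
    """Share of returned items that are correct."""
    return len(found & relevant) / len(found) if found else 0.0


def recall(found, relevant):
    """Share of correct items that were returned."""
    return len(found & relevant) / len(relevant) if relevant else 0.0


# Invented example: three fat cats exist; the system returns four animals.
relevant = {"tom", "felix", "garfield"}
found = {"tom", "garfield", "rex", "fido"}

p = precision(found, relevant)  # 2 of the 4 hits are fat cats
r = recall(found, relevant)     # 2 of the 3 fat cats were found
```

Precision only needs judgements on what was returned, which is why it is the easier of the two to test; recall needs the full `relevant` set.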

Four languages, three families
Dutch
– ANW, 102m-word lexicographic corpus
English
– UKWaC, 1.5b web corpus
Japanese
– JpWaC, 400m web corpus
Slovene
– FidaPlus, 620m lexicographic corpus

User evaluation
Evaluate whole system
– Will it help with my task
  Eg preparing a collocations dictionary
Contrast: developer evaluation
– Can I make the system better?
  Evaluate each module separately
  Current work

Components
Corpus
NLP tools
– Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics

Practicalities
Interface
– Good, Good-but
  Merge to good
– Maybe, Maybe-specialised, Bad
  Merge to bad
For each language
– Two/three linguists/lexicographers
– If they disagree
  Don't use for computing performance
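The scoring procedure on this slide can be sketched as below. The five labels and the two merges come from the slide; the data layout, the function name and the example collocations are invented, and taking "performance" to mean the share of agreed items judged good is an assumption.

```python
# Label merges as given on the slide.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}


def score(judgements):
    """judgements maps each collocation to the list of labels from the judges.

    Items on which the merged labels disagree are left out of the calculation.
    Returns the share of agreed items judged good, or None if nothing is agreed.
    """
    agreed = []
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    return agreed.count("good") / len(agreed) if agreed else None


# Invented judgements from two judges.
judgements = {
    "strong tea": ["Good", "Good-but"],  # agree after merging: good
    "heavy rain": ["Good", "Good"],      # agree: good
    "blue idea": ["Bad", "Maybe"],       # agree after merging: bad
    "tall order": ["Good", "Maybe"],     # disagree: not counted
}
result = score(judgements)
```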

Results
Dutch     66%
English   71%
Japanese  87%
Slovene   71%

Two thirds of a collocations dictionary can be gathered automatically

Thank you
http://www.sketchengine.co.uk

Lexicography: finding facts about words
– collocations
– grammatical patterns
– idioms
– synonyms
– meanings
– translations

Four ages of corpus lexicography

Age 1: Pre-computer
Oxford English Dictionary: 5 million index cards

Age 2: KWIC concordances
From 1980
Computerised
Overhauled lexicography

Age 2: limitations
As corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, takes a long time, slow
– 5000 lines: no

Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience
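The slide does not name the salience statistic. As an illustration only, the sketch below ranks neighbours of a headword with logDice, one published association measure, defined as 14 + log2(2·f_xy / (f_x + f_y)). All counts are invented.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score from co-occurrence and individual frequencies."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented counts for headword "cat": collocate -> (co-occurrences, collocate frequency)
headword_freq = 1000
neighbours = {
    "the": (500, 1_000_000),  # co-occurs often, but is frequent everywhere
    "fat": (50, 2_000),       # co-occurs less often, but is far more specific
}

ranked = sorted(
    neighbours,
    key=lambda w: log_dice(neighbours[w][0], headword_freq, neighbours[w][1]),
    reverse=True,
)
```

Sorting by raw frequency would put "the" first; sorting by a salience score moves the more specific "fat" to the top, which is the point of the slide.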

Age-3 collocation statistics: limitations
Lists contain junk
Unsorted for type
– mixes together adverbs, subjects, objects, prepositions
What we really want:
– noise-free lists
– one list for each grammatical relation

Age 4: The word sketch
Large well-balanced corpus
Parse to find
– subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
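The "one list for each grammatical relation" idea can be sketched as a grouping step over parsed triples. The sketch assumes a parser has already produced (relation, headword, collocate) triples; the triples and relation names are invented, and raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one ranked list per grammatical relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Raw frequency is a placeholder for the salience statistic.
    return {
        relation: [c for c, _ in counts.most_common()]
        for relation, counts in by_relation.items()
    }


# Invented parser output.
triples = [
    ("object", "drink", "tea"),
    ("object", "drink", "tea"),
    ("object", "drink", "coffee"),
    ("subject", "drink", "guest"),
    ("object", "eat", "cake"),
]
sketch = word_sketch(triples, "drink")
```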

Working practice
Lexicographers mainly used sketches not concordances
– missed less, more consistent
– Faster

Euralex 2002

Euralex 2002
Can I have them for my language please

The Sketch Engine
Input:
– any corpus, any language
  Lemmatised, part-of-speech tagged
– specification of grammatical relations
Word sketches integrated with corpus query system
– Supports complex searching, sorting etc
Credit: Pavel Rychly, Masaryk Univ

Customers
Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, …
Other
– Language teaching, textbook writing
– Information management, web search

Demo
– http://sketchengine.co.uk
– Free trial

What is there on the web?
Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12 = 1,000,000,000,000) words of English
Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
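The comparison can be sketched as a ratio of relative frequencies for words present in both lists. The function name, the normalisation by corpus size and all counts below are assumptions made for illustration; the slide gives only the outline of the method.

```python
def frequency_ratios(web, bnc, web_total, bnc_total):
    """Web:BNC ratio of relative frequencies for words present in both lists."""
    shared = web.keys() & bnc.keys()
    return {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}


# Invented counts and corpus sizes.
web = {"browser": 900, "she": 1000, "spyware": 300}
bnc = {"browser": 1, "she": 400}

ratios = frequency_ratios(web, bnc, web_total=1_000_000, bnc_total=100_000)
by_ratio = sorted(ratios, key=ratios.get, reverse=True)  # highest ratio first
web_only = sorted(web.keys() - bnc.keys())               # in the web list but not the BNC list
```

Taking the first and last 50 entries of `by_ratio` on real top-50,000 lists would give the "web-high" and "web-low" sets, with `web_only` corresponding to the words absent from the BNC list.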

Web-high (155 terms)
61 web and computing
– config browser spyware url www forum
38 porn
22 US English
18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
4 legal
– trademarks pursuant accordance herein

Web-low
Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations
Pronouns and past tense verbs
– Fiction
Masc vs fem
Yesterday
– Probably daily newspapers
Constancy of ratios:
– He/him/himself
– She/her/herself
