The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project Adam Kilgarriff Lexical Computing Ltd
The Cambridge Learner Corpus, English Profile, the Sketch Engine, “freely available”, HOO, DANTE and the Kelly Project Adam Kilgarriff Lexical Computing Ltd
Cambridge Learner Corpus (CLC)
  Since 1993
    – Nearly as old as CECL
  Leading resource (like ICLE)
  CUP and Cambridge ESOL
    – For better dictionaries, ELT courses, tests
    – Material: all from exams (levels A1-C2)
  45m words; 22m error-tagged
  200,000 scripts, 138 L1s, 203 nationalities
English Profile
  From 2006
  Cambridge Univ, Univ Press, ESOL (+ others)
  Goal
    – For each CEFR level, find characteristic lexis and grammar
    – Main resource: CLC
    – Talk on Thursday: Theodora Alexopoulou, Helen Yannakoudakis
Sketch Engine
  Leading corpus tool
  Word sketches
    – One-page summaries of a word’s grammatical and collocational behaviour
  In use at OUP, CUP, Collins, Macmillan, INL …
  42 languages
    – Over 150 corpora
    – Since May including CHILDES: demo
    – Since last year including CLC
Error-coded corpus
  Challenge
    – Intuitive to search for x
        anywhere
        only where it is part of an error
        only where it is part of a correction
      where x can be a word, phrase, grammar pattern …
  Requirement for CLC in Sketch Engine
Sample text We will only use those informations to take part of our guest survey
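The three search scopes (anywhere, only in an error, only in a correction) can be illustrated with a minimal Python sketch over the sample sentence. The `Span` representation, the `find` function and the corrections attached to the two errors are assumptions made for illustration; this is not the CLC's error-coding scheme, nor the Sketch Engine's query syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """A run of learner text, optionally paired with a correction (hypothetical format)."""
    original: List[str]                      # tokens as the learner wrote them
    correction: Optional[List[str]] = None   # None when the span contains no error


def find(spans, word, scope="anywhere"):
    """Return the indexes of spans where `word` occurs within the given scope."""
    hits = []
    for i, span in enumerate(spans):
        is_error = span.correction is not None
        in_error = is_error and word in span.original
        in_correction = is_error and word in span.correction
        in_plain = (not is_error) and word in span.original
        if scope == "error":
            matched = in_error
        elif scope == "correction":
            matched = in_correction
        elif scope == "anywhere":
            matched = in_plain or in_error or in_correction
        else:
            raise ValueError(f"unknown scope: {scope}")
        if matched:
            hits.append(i)
    return hits


# The sample sentence, with two assumed corrections.
spans = [
    Span(["We", "will", "only", "use"]),
    Span(["those", "informations"], ["that", "information"]),
    Span(["to", "take", "part"]),
    Span(["of"], ["in"]),
    Span(["our", "guest", "survey"]),
]

print(find(spans, "informations", scope="error"))
print(find(spans, "in", scope="correction"))
print(find(spans, "survey", scope="anywhere"))
```

Keeping the learner's text and the correction side by side in one span is what lets a single query function restrict a match to either side.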
Error-coded corpora in SkE demo
freely available
  Free (MED online)
    Sense 1: not costing anything
    Sense 4: not limited by rules … anyone can get hold of it??
  Available
    To download onto your computer
    To use
Case studies: ICLE and CLC
  Money: 225 EUR (ICLE); No (CLC)
  To everyone: Yes (ICLE); Cambridge author/collab (CLC)
  To download? No
  To use: Yes
Non-geeks
  Access is important, not download
  Web is beautiful
HOO / HOO+
  Helping Our Own
  HOO: English-NNS NLP researchers
    – Developer = user: motivation
    – Shared task/competitive evaluation
        Organisers define task and prepare ‘gold standard’
        Teams participate by running their software over test data
  Six teams (incl Tübingen), workshop end Sept
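As a rough illustration of how a shared task can compare a team's output with the gold standard, here is a generic precision/recall/F-score sketch in Python. The edit format `(start, end, replacement)`, the `score` function and the example edits are hypothetical; the slides do not specify how HOO entries were actually scored.

```python
def score(gold, system):
    """Compare system edits with gold-standard edits by exact match.

    Each edit is a (start_token, end_token, replacement) tuple.
    Returns (precision, recall, f1).
    """
    gold, system = set(gold), set(system)
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard edits prepared by the organisers.
gold_edits = {(4, 6, "that information"), (9, 10, "in")}
# Hypothetical output of one team's system: one correct edit, one spurious.
system_edits = {(4, 6, "that information"), (2, 3, "just")}

precision, recall, f1 = score(gold_edits, system_edits)
print(precision, recall, f1)
```

Exact matching is the strictest choice; a real evaluation might also give credit for detecting an error without correcting it properly.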
HOO+ (2012)
  Probably
    – English: learner data from CLC
    – Other languages?
    – Tasks
        Essay scoring
        Determiner, preposition errors
        ?
DANTE Highlights of English lexicography
The KELLY Project
  EU Lifelong Learning Project
  Word cards
    – 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
    – All 36 pairs
    – Words the learner should know (at A1 … C2)
  Partners: Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd
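The pair counts follow directly from the nine languages: 36 unordered pairs, and 72 once translation direction is taken into account (the figure given on the Method slide). A quick check in Python, with variable names chosen here for illustration:

```python
from itertools import combinations, permutations

languages = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered pairs: {English, Swedish} counts once.
unordered_pairs = list(combinations(languages, 2))
# Directed pairs: English->Swedish and Swedish->English count separately.
directed_pairs = list(permutations(languages, 2))

print(len(unordered_pairs), len(directed_pairs))
```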
Interesting question
  How close to purely corpus-based can a pedagogic list be?
Method
  Take a general corpus
  Count
  Review, add, delete using other lists and corpora
  Translate (72 directed-lg-pairs)
  Words not in source list which occur in translations:
    – Review source list
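The counting step and the final review step might be sketched as follows. The function names, the toy corpus and the toy translation table are hypothetical; the project's actual tooling is not described in these slides.

```python
from collections import Counter


def frequency_list(tokens, size):
    """Count tokens in a corpus and return the `size` most frequent words."""
    return [word for word, _ in Counter(tokens).most_common(size)]


def review_candidates(source_list, translations_into_source):
    """Words that occur as translations into a language but are missing
    from that language's own list; these are candidates for review."""
    known = set(source_list)
    translated = {
        word
        for targets in translations_into_source.values()
        for word in targets
    }
    return sorted(translated - known)


# Toy corpus standing in for a general corpus.
tokens = "the cat saw the dog and the cat".split()
top_words = frequency_list(tokens, 2)

# Toy data: an English list, and translations of another language's
# list into English.
english_list = ["hospital", "music", "sun"]
into_english = {
    "sjukhus": ["hospital"],
    "bibliotek": ["library"],
    "sol": ["sun"],
}
candidates = review_candidates(english_list, into_english)
print(top_words, candidates)
```

Here "library" surfaces as a candidate because another language's list translates into it although the toy English list lacks it.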
Symmetrical pairs: x translates to y and y translates to x
Cliques:
  – For x, y, z, … all pairs are symmetrical
  – 9-language cliques (English members): hospital, library, music, sun, theory
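A minimal sketch of how pairs and cliques could be checked, assuming that a pair is symmetrical when each word is listed as a translation of the other. The data structure, the function names and the three-language toy dictionary are illustrative assumptions, not the project's data.

```python
from itertools import combinations


def is_symmetric(a, b, translations):
    """True if a translates to b and b translates back to a."""
    return b in translations.get(a, set()) and a in translations.get(b, set())


def is_clique(words, translations):
    """True if every pair among `words` is symmetrical."""
    return all(is_symmetric(a, b, translations) for a, b in combinations(words, 2))


# Toy directed translation table: (language, word) -> set of (language, word).
translations = {
    ("en", "hospital"): {("sv", "sjukhus"), ("it", "ospedale")},
    ("sv", "sjukhus"): {("en", "hospital"), ("it", "ospedale")},
    ("it", "ospedale"): {("en", "hospital"), ("sv", "sjukhus")},
    # In this toy table, English "sun" is not translated into Italian,
    # so the three "sun" words do not form a clique.
    ("en", "sun"): {("sv", "sol")},
    ("sv", "sol"): {("en", "sun"), ("it", "sole")},
    ("it", "sole"): {("sv", "sol"), ("en", "sun")},
}

hospital_words = [("en", "hospital"), ("sv", "sjukhus"), ("it", "ospedale")]
sun_words = [("en", "sun"), ("sv", "sol"), ("it", "sole")]

print(is_clique(hospital_words, translations))
print(is_clique(sun_words, translations))
```

A 9-language clique is the same check applied to nine words, one per language, which means all 36 pairs must be symmetrical.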