Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass.

1 Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project
Adam Kilgarriff
Lexical Computing Ltd, Lexicography MasterClass, Universities of Leeds and Sussex

Malta, May 2010, Kilgarriff: BUCC
2 Two corpora are comparable iff roughly the same text types, subject matter, proportions

3 Two corpora are comparable iff roughly the same text types, subject matter, proportions
- Applicable where
    - Different languages
    - Same language
- comparable = similar
- Any corpus is entirely similar to itself

4 Comparing Corpora
- Input
    - Word freq list for c1
    - Word freq list for c2
- For top 500 words compute sum of (observed - expected)^2 / expected
- Chi-square-based
    - Discriminates well
- Better than Spearman rank, cross-entropy
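A minimal sketch of the chi-square-based comparison, for illustration only. The slide gives the statistic but not how "expected" is obtained; the code assumes the usual null hypothesis that both corpora are samples of one population, so a word's expected count in each corpus is proportional to that corpus's size. The function name, the toy lists and that choice are assumptions.

```python
from collections import Counter

def chi_square_distance(freq1, freq2, top_n=500):
    """Sum of (observed - expected)^2 / expected over the top_n words
    of the combined frequency list (both corpora contribute a term)."""
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    combined = Counter(freq1) + Counter(freq2)
    score = 0.0
    for word, total in combined.most_common(top_n):
        for observed, size in ((freq1.get(word, 0), n1), (freq2.get(word, 0), n2)):
            # Assumed null hypothesis: the word is spread in proportion to corpus size.
            expected = total * size / (n1 + n2)
            score += (observed - expected) ** 2 / expected
    return score

# Toy frequency lists (made up for the example).
c1 = {"the": 60, "of": 30, "corpus": 10}
c2 = {"the": 55, "of": 25, "web": 20}

print(chi_square_distance(c1, c1))  # identical lists score 0.0
print(chi_square_distance(c1, c2))  # differing lists score above 0
```

A lower score means more similar lists; a corpus compared with itself scores zero, matching "any corpus is entirely similar to itself".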

5 [...]s work
- Then
    - Very few corpora
    - Purely theoretical interest
- Now
    - Web
    - lots of corpora, created to spec
    - Compare...
        - first question to ask about a new corpus

6 (Monolingual) Word Lists
- Define a syllabus
- Which words get used in
    - Learning-to-read books (NS children)
    - NNS language learner textbooks
    - Dictionaries
    - Language testing
- NS: educational psychologists
- NNS: proficiency levels

7 Should be corpus-based
- Most aren't
    - Corpora are quite new
- Easy to do better
- People will use them
    - Maybe also governments

8 How
- Take your corpus
- Count
- Voilà
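The "take your corpus, count" recipe can be sketched in a few lines. This is an illustration under simplifying assumptions: the "corpus" is one short made-up string, and a word is taken to be any run of the letters a-z, which sidesteps the complications about words, lemmas and names that the talk goes on to raise.

```python
import re
from collections import Counter

def word_frequency_list(text):
    """Return (word, frequency) pairs, most frequent first."""
    # Assumption: a word is a run of ASCII letters, lower-cased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

sample = "The cat sat on the mat. The mat was flat."
for word, freq in word_frequency_list(sample)[:3]:
    print(word, freq)
```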

9 Complications
- What is a word
- Words and lemmas
- Grammatical classes
- Numbers, names...
- Multiwords
- Homonymy
All are slightly different issues for each lg

10 What is a word; delimiters
- Found between spaces
    - Not for Chinese: segmentation
- English
    - co-operate, widely-held, farmer's, can't
- Norwegian, Swedish
    - Compounding, separable verbs
- Arabic, Italian
    - Clitics, al, ...
- ...
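To see why the delimiter question matters even for English, the sketch below runs three plausible tokenisers over a sentence built from the slide's examples. The three functions and the sentence are illustrative assumptions, not the tokeniser used in the project; the point is only that each policy yields a different set of "words" to count.

```python
import re

text = "They co-operate on a widely-held view; the farmer's dog can't."

def split_on_spaces(s):
    # "Found between spaces": punctuation stays glued to the word.
    return s.split()

def letters_only(s):
    # Hyphens and apostrophes act as delimiters.
    return re.findall(r"[A-Za-z]+", s)

def keep_internal_marks(s):
    # Hyphens and apostrophes inside a word are kept as part of it.
    return re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", s)

for tokenise in (split_on_spaces, letters_only, keep_internal_marks):
    tokens = tokenise(text)
    print(tokenise.__name__, len(tokens), tokens)
```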

11 Words and lemmas
- Word form (in text)
    - invading
- Lemma (dictionary headword)
    - invade, for forms invade, invades, invaded, invading
- Lemmatisation
    - Chinese, none; English, simple
    - Middling: Swe, Nor, It, Gr
    - Tough: Rus, Pol, Ara
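A small sketch of how word-form counts roll up into lemma counts. The hand-written mapping stands in for a real lemmatiser and the frequencies are invented; both are assumptions made for the example.

```python
from collections import Counter

# Hypothetical form-to-lemma table; a real system would use a lemmatiser.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_frequencies(form_freqs, lemma_of):
    """Add up word-form frequencies under their lemma (dictionary headword)."""
    lemma_freqs = Counter()
    for form, freq in form_freqs.items():
        # Forms missing from the table are treated as their own lemma.
        lemma_freqs[lemma_of.get(form, form)] += freq
    return lemma_freqs

form_freqs = {"invade": 4, "invades": 1, "invaded": 7, "invading": 3, "army": 5}
print(lemma_frequencies(form_freqs, LEMMA_OF))
```

Counted by form, "army" outranks every form of "invade" except "invaded"; counted by lemma, "invade" comes first.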

12 Word Families
- Derivational morphology
    - efficient/efficiently
    - access/accessible/accessibility
    - available/availability/unavailable
- 'Word families' tradition
- eg: Coxhead, Academic Word List
    - Pedagogy: one item to learn
- But
    - Where do families end?
    - Different meanings

13 Grammatical classes
- brush (verb) and brush (noun)
    - Same item or different?
    - (both in same word family)
- Required
    - (short) list of word classes
    - POS-tagger
        - Will make mistakes

14 Marginal cases
- Numbers
    - twelve, seventeenth, fifties
- Closed sets
    - Days of week, months
- Countries
    - Capitals, nationalities, currencies, adjectives, languages
- regional/dialects, political groups, religions
    - easter, christmas, islam, republican
- policies always needed

15 Multiwords
- "according to"
    - Linguistically a word, but
- Multiword frequency list: top item "of the"
    - Can't use freqs (alone) to select multiwords

16 Homonymy
- bank (river) and bank (money)
- Word sense disambiguation
    - We can't do (with decent accuracy)
    - We can't give freqs for senses
- Lists of words not meanings
    - Sometimes disconcerting

17 Corpora
- A fairly arbitrary sample of a lg
- To limit arbitrariness of wordlist
    - Make it big and diverse
- WaCky corpora
    - From web
    - Can do for any language
- ??? Comparable ???
    - Web language: less formal

18 Comparing corpora
- Corpora: new
- We are all beginners
- Best way to get sense of a corpus
    - Compare with another
    - Keywords of each vs. other
- Case studies
- Sketch Engine functions

19 Comparing frequency lists
- Web1T
    - Present from Google
    - All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    - that's 1,000,000,000,000
- Compare with BNC
    - Take top 50,000 items of each
    - 105 Web1T words not in BNC top 50k
    - 50 words with highest Web1T:BNC ratio
    - 50 words with lowest ratio
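The comparison procedure can be sketched on toy data as follows. The slide does not say how the ratio is normalised; this sketch assumes frequencies per million words, and computes ratios only for words in both top lists so no division by zero arises. Function names and the tiny frequency lists are invented for the example.

```python
def per_million(freqs):
    """Convert raw counts to frequencies per million words."""
    total = sum(freqs.values())
    return {w: f * 1_000_000 / total for w, f in freqs.items()}

def compare_lists(freqs_a, freqs_b, top_n, k):
    """Return (words only in A's top_n, k highest A:B ratios, k lowest A:B ratios)."""
    pm_a, pm_b = per_million(freqs_a), per_million(freqs_b)
    top_a = sorted(pm_a, key=pm_a.get, reverse=True)[:top_n]
    top_b = set(sorted(pm_b, key=pm_b.get, reverse=True)[:top_n])
    only_a = [w for w in top_a if w not in top_b]
    shared = [w for w in top_a if w in top_b]
    ratio = {w: pm_a[w] / pm_b[w] for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return only_a, ranked[:k], ranked[-k:]

# Toy stand-ins for the Web1T and BNC lists (made-up counts).
web = {"the": 500, "url": 40, "browser": 30, "she": 10, "council": 5}
bnc = {"the": 500, "she": 60, "council": 30, "walked": 20, "url": 1}

only_web, web_high, web_low = compare_lists(web, bnc, top_n=4, k=1)
print(only_web, web_high, web_low)
```

With the real lists, `top_n` would be 50,000 and `k` would be 50.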

20 Web-high (155 terms)
- 61 web and computing
    - config browser spyware url www forum
- 38 porn
- 22 US English (incl Spanish influence: los)
- 18 business/products common on web
    - poker viagra lingerie ringtone dvd casino rental collectible tiffany
    - NB: BNC is old
- 4 legal
    - trademarks pursuant accordance herein

21 Web-low
- Exclude British English, transcription/tokenisation anomalies
    - herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

22 Observations
- Pronouns and past tense verbs
    - Fiction
- Masc vs fem
- Yesterday
    - Probably daily newspapers
- Constancy of ratios:
    - He/him/himself
    - She/her/herself

23 Corpus Factory
- Many languages
- General corpus, 100m+ words
    - Fast
    - High quality
    - Comparable across languages

24 Gather Seed words
- Wikipedia (Wiki) Corpora
    - many domains
    - free
    - 265 languages covered, more to come
- Extract text from Wiki.
    - Wikipedia 2 Text
- Tokenise the text.
    - Morphology of the language is important
    - Can use the existing word tokeniser tools.

25 Web Corpus Statistics

26 Evaluation
- For each of the languages, two corpora available:
    - Web and Wiki
    - Dutch: also a carefully designed lexicographic corpus.
- Hypothesis: Wiki corpora are 'informational'
    - Informational --> typical written
    - Interactional --> typical spoken

27 Evaluation
- 1st, 2nd person pronouns
    - strong indicators of interactional language.
    - English: I me my mine you your yours we us our
- For each language
    - Take ten commonest 1st and 2nd person pronouns
    - For each
        - Calculate ratio: web:wiki
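A hedged sketch of the pronoun test. The slide says only "calculate ratio: web:wiki"; the code assumes the ratio is taken between relative frequencies (count divided by corpus size), so corpora of different sizes can be compared. All counts and sizes below are invented, and only three pronouns are used to keep the example short.

```python
def web_wiki_ratios(web_freqs, web_size, wiki_freqs, wiki_size, pronouns):
    """Ratio of relative frequencies, web:wiki, for each pronoun."""
    ratios = {}
    for p in pronouns:
        wiki_count = wiki_freqs.get(p, 0)
        if wiki_count == 0:
            continue  # ratio undefined when the pronoun never occurs in Wiki
        # (web_count / web_size) / (wiki_count / wiki_size), rearranged
        ratios[p] = (web_freqs.get(p, 0) * wiki_size) / (wiki_count * web_size)
    return ratios

def summarise(ratios):
    """Average, minimum and maximum ratio for one language."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)

# Made-up counts from two equally sized toy corpora.
pronouns = ["I", "you", "we"]
web = {"I": 300, "you": 200, "we": 100}
wiki = {"I": 10, "you": 20, "we": 50}

ratios = web_wiki_ratios(web, 10_000, wiki, 10_000, pronouns)
print(ratios)
print(summarise(ratios))
```

Ratios well above 1 would support the hypothesis that the Web corpus is more interactional than the Wiki corpus.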

28 Results: ratios, web:wiki
Language    Average  Min  Max
Dutch
Hindi
Telugu
Thai
Vietnamese

29 KELLY
- EU lifelong learning project
- Goal: wordcards
    - Word in one lg on one side, other on other
    - Language learning
- 9 languages, 36 pairs
    - Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
- Partners in 6 countries

30 Method
- Prepare monolingual lists
- Translate
    - Each into 8 target languages
    - Professional translation services
- Integrate, finalise
- Produce cards
- Goal for each set
    - 9000 pairs at 6 levels

31 Stages
- Sort out corpora, tagging
- Automatically generate M1 lists
    - names, numbers, countries...
    - keywords vis-a-vis other corpora
- Review, compare, prepare M2 lists
- Translate
- Use translations: M3 lists
- Finalise

32 Review - how?
- points system
    - 2 points for each of 6 levels
    - 12 points for most freq words
- deduct points for words in over-represented areas
- add in words from other corpora
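One possible reading of the points system, sketched for illustration. The slide gives only "2 points for each of 6 levels" and "12 points for most freq words"; the code assumes that frequency rank is cut into six equal bands, with the top band earning 12 points and each lower band 2 fewer. The band size of 1500 (9000 divided by 6) and the size of the deduction are assumptions, not figures from the project.

```python
def base_points(rank, band_size=1500, levels=6, points_per_level=2):
    """Points from frequency rank (0 = most frequent word).

    Assumed reading: top band gets 12, then 10, 8, ... down to 2;
    words ranked beyond the last band get 0.
    """
    band = rank // band_size
    return max(levels - band, 0) * points_per_level

def review_score(rank, over_represented=False, penalty=2):
    """Base points minus a hypothetical deduction for words from
    over-represented areas of the corpus."""
    score = base_points(rank)
    if over_represented:
        score -= penalty
    return score

print(review_score(0))                           # 12
print(review_score(1500))                        # 10
print(review_score(8999))                        # 2
print(review_score(100, over_represented=True))  # 10
```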

33 Translation database
- On the web
- All translations entered into it
- Queries like
    - All Swedish words used as translations more than six times
    - All 1:1:1:1... 'simple cases'
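A sketch of the first kind of query over a toy set of records. The record layout (source language, source word, target language, target word) and the sample entries are hypothetical; nothing here reflects the actual schema of the KELLY translation database.

```python
from collections import Counter

# Hypothetical records: (source_lang, source_word, target_lang, target_word)
records = [
    ("en", "big", "sv", "stor"),
    ("en", "large", "sv", "stor"),
    ("it", "grande", "sv", "stor"),
    ("en", "house", "sv", "hus"),
]

def frequent_targets(records, target_lang, more_than):
    """Words of target_lang used as a translation more than `more_than` times."""
    counts = Counter(
        target_word
        for _, _, lang, target_word in records
        if lang == target_lang
    )
    return {word: n for word, n in counts.items() if n > more_than}

# The slide's query would use more_than=6; the toy data needs a lower threshold.
print(frequent_targets(records, "sv", more_than=2))
```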

34 Using the translations database
- Find words not in M2 lists, that need adding
    - Multiwords
        - English "look for"
        - Probably, the translation of a high-freq word in several of the 8 other lgs
        - So: add it to English list
    - Homonyms: could be similar

35 Monolingual master lists (M3)
- Based on a WAC corpus
- Input from other same-lg corpora
- And from translations from 8 lgs
    - Useful words which might not be hi-freq
    - added words/multiwords must be above a lower freq threshold
- Target 9000

36 Numbers
- Target: 9000 per list
- M2 lists
    - Estimate: [...] needed
    - We add multiwords and other 'back-translations'

37 Current status
- M1 lists prepared
- Lists checked, compared with other lists
    - Corpus-based and other
- M2 lists prepared
- Translation underway

38 Big problems
- Multiwords (as anticipated)
- Homonymy (as anticipated)
- orange banana alphabet elbow, Hello
    - Worse than anticipated
    - Lists from spoken corpora, learner corpora, needed
    - Relation between
        - Competence for communicating
        - The corpora at our disposal

39 Word lists are useful, but
- ...are they scientific?
    - A tiny bit, occasionally
- ...could they be scientific?
    - Yes
        - article of faith
    - By the end of KELLY, we'll have a clearer idea how

40 And now for something completely different: DANTE
- Lexical database for English
    - Detailed, accurate, extensive [...] of English
    - Highly corpus-driven
    - 3 yr project
    - 18 expert lexicographers
    - Led by Sue Atkins
- BNC, FrameNet, Euralex, COBUILD...
- English side, New English-Irish dictionary
- Available for NLP research imminently