Corpora for the coming decade. Adam Kilgarriff, Lexical Computing Ltd.

Presentation transcript:

Slide 1: Corpora for the coming decade
Adam Kilgarriff, Lexical Computing Ltd

Aston, May 2009. Kilgarriff: Corpora for the coming decade

Slide 2: How to build a corpus
• Like this: (demo)

Slide 3: How should they be different?
• Bigger
• Better

Slide 4: Bigger
• Motivation
  - Ample data for rare phenomena
  - Big subcorpora
  - For language modelling
• More like Google-scale but without Google disadvantages
  - See "Googleology is Bad Science", CL 2007

Slide 5: Better
• Less noise
• Fewer duplicates
• Richer markup
  - At word, sentence level
  - At document level (text type, subcorpora)

Slide 6: Divide and rule
• Bigger (+ cleaning + deduplication)
  - Big Web Corpus (BiWeC)
    - Currently 5.5b fully processed
    - Target 20b words
    - Jan Pomikalek, Pavel Rychly
• Better
  - New Model Corpus
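The slides name deduplication as a step but do not say how it is done for BiWeC. As a minimal sketch of the general idea only, the Python below drops exact duplicate paragraphs by hashing a normalised form of each one. The function names (`normalise`, `deduplicate`) and the normalisation rule (lower-casing and collapsing whitespace) are assumptions made for illustration, not the authors' method; real web-corpus pipelines usually also need near-duplicate detection.

```python
import hashlib


def normalise(paragraph: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal (assumed rule)."""
    return " ".join(paragraph.lower().split())


def deduplicate(paragraphs):
    """Return the paragraphs in order, keeping only the first copy of each exact duplicate."""
    seen = set()
    unique = []
    for p in paragraphs:
        # Hash the normalised text so the set stores short digests, not full paragraphs.
        key = hashlib.sha1(normalise(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


# Hypothetical sample data: the second paragraph repeats the first with different spacing and case.
sample = [
    "Corpora are getting bigger.",
    "corpora   are getting BIGGER.",
    "Web text contains many duplicates.",
]
kept = deduplicate(sample)
print(kept)
```

Storing digests rather than full text keeps memory use modest, which matters at the billions-of-words scale the slide mentions.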

Slide 7: New Model Corpus
• model
  1. small version: model train
  2. design: data model
• New Model Corpus
  - 1:100 scale model
  - To replace BNC as design model

Slide 8: BNC design model
• Most often used
  - E.g. for other languages
• Pre-web
  - f(blog) = 0
• Corpora now bigger, far quicker, far cheaper, different issues
• BNC design model past its sell-by
  - Kilgarriff, Atkins, Rundell, Corpus Lg 2007

Slide 9: New model
• Data
• Markup

Slide 10: Data
• From the web
• 100m words
• Small sample size
  - Copyright ?? Creative Commons Licence

Slide 11: Composition
• General crawl: 50
• Targeted
  - Fiction: 7
  - Blog: 7
  - Newspaper (RSS feed): 7
  - Speech: 10 (film transcripts, chatshow)
  - Domain-specific: 19 (business, medical, law)
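The composition figures add up to 100, which matches the 100m-word target on the Data slide. The short Python sketch below checks that sum and converts each share into a word count. Reading the figures as percentages of the 100m-word total is an assumption (they could equally be millions of words, which gives the same result), and the names `COMPOSITION`, `TARGET_WORDS` and `words_per_component` are hypothetical.

```python
# Shares as listed on the Composition slide; treating them as percentages is an assumption.
COMPOSITION = {
    "General crawl": 50,
    "Fiction": 7,
    "Blog": 7,
    "Newspaper (RSS feed)": 7,
    "Speech": 10,
    "Domain-specific": 19,
}

# Corpus size from the Data slide: 100m words.
TARGET_WORDS = 100_000_000


def words_per_component(composition, target_words):
    """Scale each share to a word count, proportional to the sum of all shares."""
    total = sum(composition.values())
    return {
        name: target_words * share // total
        for name, share in composition.items()
    }


sizes = words_per_component(COMPOSITION, TARGET_WORDS)
for name, n_words in sizes.items():
    print(f"{name:<22}{n_words:>14,}")
print(f"{'Total':<22}{sum(sizes.values()):>14,}")
```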

Slide 12: Markup
• Collaborative
  - We distribute data
  - Anyone applies their tools
    - POS tagger, parser, co-reference resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...
  - We integrate, display in Sketch Engine
  - Research potential from multiple markup

Slide 13: Recombine the two strands
• Apply methods with good accuracy (and fast) to BiWeC
• Result will be
  - Bigger
  - Better

Slide 14: The Sketch Engine
• Full-functionality corpus system
• Fast
• Web-based
• In daily use for lexicography at
  - OUP, Collins, CUP, Macmillan, …
  - Le Robert, Cornelsen, Patakis, INL, …
• Many universities, language teaching
• Free trial
• Demo:

Slide 15: What can computers count up to?
• By default, 2 billion
  - 32 bits, one for the sign: 2^31 ≈ 2 billion
  - Re-engineering required to go beyond
• Most corpus systems: tough limit
• Sketch Engine
  - Recently re-engineered for 64-bit integers
  - No longer limited
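The limit on this slide is simple arithmetic: a signed 32-bit integer tops out at 2^31 - 1 = 2,147,483,647. The sketch below is illustrative only and is not Sketch Engine code. It assumes each corpus position is stored as one signed integer, and the helper names `max_signed` and `fits_signed` are hypothetical. It tests the corpus sizes quoted in these slides against 32-bit and 64-bit widths.

```python
def max_signed(bits: int) -> int:
    """Largest value a signed two's-complement integer of the given width can hold."""
    return 2 ** (bits - 1) - 1


def fits_signed(n_tokens: int, bits: int) -> bool:
    """True if positions 0 .. n_tokens - 1 can all be stored in a signed integer of that width."""
    return n_tokens - 1 <= max_signed(bits)


# Sizes quoted in the slides, treated here as token counts (an assumption).
CORPORA = {
    "New Model Corpus (100m)": 100_000_000,
    "BiWeC, processed so far (5.5b)": 5_500_000_000,
    "BiWeC, target (20b)": 20_000_000_000,
}

for name, size in CORPORA.items():
    print(f"{name}: 32-bit ok = {fits_signed(size, 32)}, 64-bit ok = {fits_signed(size, 64)}")
```

Under these assumptions the 100m-word corpus fits comfortably in 32 bits, while both BiWeC sizes need the wider integers.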

Slide 16: NLP by web services?
• Big corpora big to hold, hard to access fast
• Sketch Engine: corpus specialist
• Web API
  - FrameNet
  - TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation
• All welcome

Slide 17: Practicalities
• Free trial accounts
• Collaborators, innovative users: free longer-term accounts
  - Wikinomics, Tapscott and Williams
• API
  - Details under 'help' on SkE home page
• New Model Corpus
  - Available by end 2009: watch Corpora

Slide 18: Thank you
Enjoy!