Corpora for the coming decade
Adam Kilgarriff, Lexical Computing Ltd.


1 Corpora for the coming decade
Adam Kilgarriff, Lexical Computing Ltd

Aston May 2009 Kilgarriff: Corpora for the coming decade

2 How to build a corpus
- Like this: (demo)
- http://beta.sketchengine.co.uk/auth/wbc

3 How should they be different?
- Bigger
- Better

4 Bigger
- Motivation
  - Ample data for rare phenomena
  - Big subcorpora
  - For language modelling
- More like Google-scale, but without Google disadvantages
- See "Googleology is Bad Science", CL 2007

5 Better
- Less noise
- Fewer duplicates
- Richer markup
  - At word, sentence level
  - At document level (text type, subcorpora)

6 Divide and rule
- Bigger (+ cleaning + deduplication)
  - Big Web Corpus (BiWeC)
    - Currently 5.5b fully processed
    - Target 20b words
    - Jan Pomikalek, Pavel Rychly
- Better
  - New Model Corpus
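The slide names deduplication as part of building the bigger corpus but does not say how it is done. As an illustration only, here is a minimal sketch of the simplest variant: dropping exact duplicate paragraphs after whitespace and case normalisation. The function names are hypothetical and this is not the BiWeC pipeline; web-scale work would typically also need near-duplicate detection.

```python
import hashlib


def normalise(paragraph: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, in original order.

    Hypothetical helper: exact-match only, keyed on a hash of the
    normalised text so the set of seen keys stays small.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = hashlib.sha1(normalise(paragraph).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # already kept an equivalent paragraph
        seen.add(key)
        unique.append(paragraph)
    return unique
```

Storing hashes rather than full paragraphs keeps memory use roughly constant per item, which matters once the input runs to billions of words.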

7 New Model Corpus
- model
  1. small version: model train
  2. design: data model
- New Model Corpus
  - 1:100 scale model
  - To replace BNC as design model

8 BNC design model
- Most often used
  - E.g. for other languages
- Pre-web
  - f(blog) = 0
- Corpora now bigger, far quicker, far cheaper, different issues
- BNC design model past its sell-by
  - Kilgarriff, Atkins, Rundell, Corpus Lg 2007

9 New model
- Data
- Markup

10 Data
- From the web
- 100m words
- Small sample size
  - Copyright ??
  - Creative Commons Licence

11 Composition
- General crawl: 50
- Targeted
  - Fiction: 7
  - Blog: 7
  - Newspaper (RSS feed): 7
  - Speech: 10
    - Film transcripts, chatshow
  - Domain-specific: 19
    - Business, medical, law
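The composition figures sum to 100. The slide gives no unit, so the sketch below assumes they are shares of the 100m-word target on the Data slide and converts them into word budgets per component. The names `SHARES`, `TARGET_WORDS` and `word_budget` are illustrative.

```python
TARGET_WORDS = 100_000_000  # "100m words" from the Data slide

# Figures from the Composition slide; read here as shares of the total (assumption).
SHARES = {
    "General crawl": 50,
    "Fiction": 7,
    "Blog": 7,
    "Newspaper (RSS feed)": 7,
    "Speech": 10,
    "Domain-specific": 19,
}


def word_budget(shares, total_words):
    """Split total_words across components in proportion to their shares."""
    total_share = sum(shares.values())
    return {
        name: total_words * share // total_share
        for name, share in shares.items()
    }


if __name__ == "__main__":
    for name, words in word_budget(SHARES, TARGET_WORDS).items():
        print(f"{name:22s} {words:>12,d}")
```

Under this reading the general crawl supplies half the corpus and the targeted components together supply the other half.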

12 Markup
- Collaborative
  - We distribute data
  - Anyone applies their tools
    - POS-tagger, parser, co-ref resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...
  - We integrate, display in Sketch Engine
  - Research potential from multiple markup

13 Recombine the two strands
- Apply methods with good accuracy (and fast) to BiWeC
- Result will be
  - Bigger
  - Better

14 The Sketch Engine
- Full-functionality corpus system
- Fast
- Web-based
- In daily use for lexicography at
  - OUP, Collins, CUP, Macmillan, …
  - Le Robert, Cornelsen, Patakis, INL, …
- Many universities, language teaching
- Free trial
- Demo: http://sketchengine.co.uk

15 What can computers count up to?
- By default, 2 billion
  - 32 bits, one for the sign: 2^31 ≈ 2 billion
  - Re-engineering required to go beyond
- Most corpus systems: tough limit
- Sketch Engine
  - Recently re-engineered for 64-bit integers
  - No longer limited
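The 32-bit limit can be checked directly. The sketch below is a generic illustration, not Sketch Engine code. It uses Python's standard `struct` module to show that a token position beyond 2^31 - 1 cannot be stored as a signed 32-bit integer but fits in 64 bits. The helper names are hypothetical; the corpus sizes are the 5.5b and 20b figures from the BiWeC slide.

```python
import struct

INT32_MAX = 2**31 - 1  # largest signed 32-bit value: 2,147,483,647
INT64_MAX = 2**63 - 1  # largest signed 64-bit value


def fits_in_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


def pack_position(position: int, fmt: str):
    """Pack a token position with a struct format; return None on overflow.

    "<i" is a little-endian signed 32-bit int, "<q" a signed 64-bit int.
    """
    try:
        return struct.pack(fmt, position)
    except struct.error:
        return None


if __name__ == "__main__":
    for size in (5_500_000_000, 20_000_000_000):
        print(size, "fits in 32 bits:", fits_in_signed(size, 32),
              "| fits in 64 bits:", fits_in_signed(size, 64))
```

Any index that stores positions as 32-bit values therefore caps a corpus at roughly 2 billion tokens, which is why going beyond that requires re-engineering rather than just more disk space.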

16 NLP by web services?
- Big corpora big to hold, hard to access fast
- Sketch Engine: corpus specialist
- Web API
  - FrameNet
  - TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation
- All welcome

17 Practicalities
- Free trial accounts
- Collaborators, innovative users
  - Free longer-term accounts
  - Wikinomics, Tapscott and Williams
- API
  - Details under 'help' on SkE home page
- New Model Corpus
  - Available by end 2009: watch Corpora

18 Thank you
http://www.sketchengine.co.uk
Enjoy!

