Corpora for the coming decade. Adam Kilgarriff. Dublin, June 2009.


1 Corpora for the coming decade. Adam Kilgarriff

2 How should they be different? (Dublin, June 2009. Kilgarriff: Corpora for the coming decade)
- Bigger
- Better
- Communal

3 Bigger
- Motivation
  - Ample data for rare phenomena
  - Big subcorpora
  - For language modelling
- More like Google-scale, but without Google disadvantages
  - See "Googleology is Bad Science", CL 2007

4 Better
- Less noise
- Fewer duplicates
- Richer markup
  - At word, sentence level
  - At document level (text type, subcorpora)

5 Divide and rule
- Bigger (+ cleaning + deduplication)
  - Big Web Corpus (BiWeC)
    - Currently 5.5b fully processed
    - Target 20b words
    - Jan Pomikalek, Pavel Rychly
- Better
  - New Model Corpus
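
The slide names deduplication as one step in building BiWeC but does not say how it is done. As a rough illustration only, the sketch below drops exact duplicate paragraphs by hashing case- and whitespace-normalised text. This is a deliberately simple, swapped-in technique, not the method used for BiWeC, and `normalise` and `dedupe_paragraphs` are hypothetical helper names.

```python
import hashlib


def normalise(paragraph: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())


def dedupe_paragraphs(paragraphs):
    """Yield each paragraph the first time its normalised form is seen."""
    seen = set()
    for paragraph in paragraphs:
        digest = hashlib.sha1(normalise(paragraph).encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate after normalisation: drop it
        seen.add(digest)
        yield paragraph


if __name__ == "__main__":
    docs = [
        "Corpora for the coming decade.",
        "corpora  for the coming   decade.",  # same text, different case/spacing
        "Bigger, better, communal.",
    ]
    for kept in dedupe_paragraphs(docs):
        print(kept)
```

Storing only digests keeps memory per paragraph small and constant, which matters at billions of words. Near-duplicates (pages that differ by a few words) would need a fuzzier comparison than this.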

6 New Model Corpus
- model
  1. small version: model train
  2. design: data model
- New Model Corpus
  - 1:100 scale model
  - To replace BNC as design model

7 BNC design model
- Most often used
  - E.g. for other languages
- Pre-web
  - f(blog) = 0
- Corpora now bigger, far quicker, far cheaper; different issues
- BNC design model past its sell-by
  - Kilgarriff, Atkins, Rundell, Corpus Lg 2007

8 New model
- Data
- Markup

9 Data
- From the web
- 100m words
- Small sample size
  - Copyright ?? Creative Commons Licence

10 Composition
- General crawl: 50
- Targeted
  - Fiction: 7
  - Blog: 7
  - Newspaper (RSS feed): 7
  - Speech: 10
    - Film transcripts, chatshow
  - Domain-specific: 19
    - Business, medical, law
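
The composition figures can be read as a sampling plan. The sketch below is a minimal illustration, assuming the figures are percentage shares of the 100m-word target given on the Data slide (the slide states no unit). The names `COMPOSITION`, `TOTAL_WORDS` and `word_targets` are illustrative and do not come from the presentation.

```python
# Figures from the Composition slide. Reading them as percentage shares
# is an assumption: the slide does not state a unit.
COMPOSITION = {
    "general crawl": 50,
    "fiction": 7,
    "blog": 7,
    "newspaper (RSS feed)": 7,
    "speech": 10,
    "domain-specific": 19,
}

TOTAL_WORDS = 100_000_000  # 100m words, from the Data slide


def word_targets(composition, total_words):
    """Convert percentage shares into per-component word targets."""
    share_sum = sum(composition.values())
    if share_sum != 100:
        raise ValueError(f"shares sum to {share_sum}, expected 100")
    return {name: total_words * share // 100
            for name, share in composition.items()}


if __name__ == "__main__":
    for name, words in word_targets(COMPOSITION, TOTAL_WORDS).items():
        print(f"{name:22s} {words:>12,d}")
```

The check on the sum guards against a component being added or resized without rebalancing the rest.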

11 Markup
- Collaborative
  - We distribute data
  - Anyone applies their tools
    - POS-tagger, parser, co-ref resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...
  - We integrate, display in Sketch Engine
  - Research potential from multiple markup
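
One way to picture integrating markup from several contributors is as a merge of per-token annotation layers into a single one-token-per-line table. The sketch below assumes that every tool annotates the same tokenisation and returns exactly one label per token. The function `merge_layers`, the layer names and the labels are invented for illustration, and the tab-separated output is not presented as a Sketch Engine import specification.

```python
def merge_layers(tokens, layers):
    """Merge per-token annotation layers into tab-separated lines.

    tokens: list of word forms (the shared tokenisation).
    layers: dict mapping layer name -> list of labels, one per token.
    Returns a header line followed by one line per token.
    """
    for name, labels in layers.items():
        if len(labels) != len(tokens):
            raise ValueError(
                f"layer {name!r} has {len(labels)} labels "
                f"for {len(tokens)} tokens"
            )
    names = list(layers)
    lines = ["\t".join(["word"] + names)]
    for i, token in enumerate(tokens):
        lines.append("\t".join([token] + [layers[name][i] for name in names]))
    return lines


if __name__ == "__main__":
    tokens = ["Dublin", "hosts", "corpora"]
    layers = {
        "pos": ["NNP", "VBZ", "NNS"],  # hypothetical POS-tagger output
        "ne": ["LOC", "O", "O"],       # hypothetical named-entity output
    }
    print("\n".join(merge_layers(tokens, layers)))
```

Rejecting layers whose length differs from the token list catches the most common integration problem, a contributor re-tokenising the distributed data.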

12 Two strands
- Apply methods with good accuracy (and fast) to BiWeC

13 Two strands
- Apply methods with good accuracy (and fast) to BiWeC
- Bigger
- Better

14 Communal
- Collective effort to produce
  - Markup, see above
- Access
  - Free and open
  - Integrate into applications

15 NLP by web services?
- Big corpora big to hold, hard to access fast
- Sketch Engine: corpus specialist
- Web API
  - FrameNet
  - TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation
- All welcome

16 Practicalities
- Free trial accounts
- Collaborators, innovative users: free longer-term accounts
  - Wikinomics, Tapscott and Williams
- API
  - Details under 'help' on SkE home page
- New Model Corpus
  - Available by end 2009: watch Corpora

17 Thank you
- http://www.sketchengine.co.uk
