Published by Agnes Stevens. Modified over 9 years ago.
Corpora for the coming decade. Adam Kilgarriff
Dublin June 2009 Kilgarriff: Corpora for the coming decade
How should they be different? Bigger; better; communal.
Bigger. Motivation: ample data for rare phenomena; big subcorpora; for language modelling. More like Google-scale, but without Google's disadvantages. See "Googleology is Bad Science", CL 2007.
Better. Less noise; fewer duplicates; richer markup, both at word and sentence level and at document level (text type, subcorpora).
Divide and rule. Bigger (+ cleaning + deduplication): Big Web Corpus (BiWeC); currently 5.5b words fully processed; target 20b words; Jan Pomikalek, Pavel Rychly. Better: New Model Corpus.
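The slide names deduplication as a processing step but does not say how it is done for BiWeC. As a minimal sketch only, the following shows one simple stand-in technique: dropping exact repeats of a paragraph after light normalisation (lowercasing and collapsing whitespace). The function names and the sample paragraphs are hypothetical, and real web-corpus pipelines generally also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import re


def normalise(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph; drop later exact repeats.

    Illustrative only: this is not the method used for BiWeC.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        key = hashlib.sha1(normalise(paragraph).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(paragraph)
    return kept


# Hypothetical sample data: the second item differs only in case and spacing.
docs = [
    "Corpora are getting bigger.",
    "corpora  are getting   bigger.",
    "Web data contains many duplicates.",
]
unique = deduplicate(docs)
print(unique)
```

Hashing the normalised text keeps memory use proportional to the number of distinct paragraphs rather than to their total length, which matters at the multi-billion-word scale the slide mentions.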
New Model Corpus. Two senses of "model": 1. small version (as in model train); 2. design (as in data model). New Model Corpus: a 1:100 scale model; to replace the BNC as design model.
BNC design model. Most often used, e.g. for other languages. Pre-web: f(blog) = 0. Corpora are now bigger, far quicker, far cheaper, with different issues. "BNC design model past its sell-by" (Kilgarriff, Atkins, Rundell, Corpus Lg 2007).
New model. Data; markup.
Data. From the web; 100m words; small sample size. Copyright: ?? Creative Commons Licence.
Composition. General crawl: 50. Targeted: fiction 7; blog 7; newspaper (RSS feed) 7; speech 10 (film transcripts, chatshow); domain-specific 19 (business, medical, law).
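The composition figures can be turned into per-category word targets. The sketch below assumes the numbers on the slide are percentages of the 100m-word corpus given on the Data slide (they sum to 100, so reading them as millions of words gives the same result). The dictionary and function names are illustrative, not part of the original.

```python
# Shares as listed on the Composition slide, read here as percentages
# (an assumption: the slide gives bare numbers without a unit).
COMPOSITION = {
    "General crawl": 50,
    "Fiction": 7,
    "Blog": 7,
    "Newspaper (RSS feed)": 7,
    "Speech": 10,
    "Domain-specific": 19,
}

TOTAL_WORDS = 100_000_000  # 100m-word target from the Data slide


def word_targets(composition, total_words):
    """Convert percentage shares into absolute word counts."""
    if sum(composition.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {
        name: total_words * share // 100
        for name, share in composition.items()
    }


targets = word_targets(COMPOSITION, TOTAL_WORDS)
for name, words in targets.items():
    print(f"{name:<22}{words:>12,d}")
```

On this reading, half the corpus comes from the general crawl and the targeted categories together supply the other 50m words.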
Markup. Collaborative: we distribute the data; anyone applies their tools (POS-tagger, parser, co-reference resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...); we integrate and display in the Sketch Engine. Research potential from multiple markup.
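One way to picture the integration of markup from several independent tools is as stand-off annotation layers merged over a shared tokenisation. The sketch below is purely illustrative: the slides do not describe the actual exchange format or how the Sketch Engine integrates contributions, and the layer names, span format `(start, end, label)` and sample labels are all hypothetical.

```python
# Shared tokenisation that every contributed tool annotates against.
tokens = ["Adam", "Kilgarriff", "spoke", "in", "Dublin"]

# Hypothetical output of two independent tools. Each span is
# (start, end, label) over token indices, with `end` exclusive.
layers = {
    "pos": [(0, 1, "NNP"), (1, 2, "NNP"), (2, 3, "VBD"),
            (3, 4, "IN"), (4, 5, "NNP")],
    "ner": [(0, 2, "PERSON"), (4, 5, "LOCATION")],
}


def merge_layers(tokens, layers):
    """Attach every layer's label to the tokens its spans cover."""
    merged = [{"word": token} for token in tokens]
    for layer_name, spans in layers.items():
        for start, end, label in spans:
            for i in range(start, end):
                merged[i][layer_name] = label
    return merged


merged = merge_layers(tokens, layers)
for row in merged:
    print(row)
```

Because each layer refers only to token positions, contributors can work independently, and queries can then combine layers, for example finding proper nouns that a named-entity tool did not label.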
Two strands. Apply methods with good accuracy (and fast) to BiWeC. Bigger; better.
Communal. Collective effort to produce markup (see above). Access: free and open; integrate into applications.
NLP by web services? Big corpora are big to hold and hard to access fast. Sketch Engine: corpus specialist, with a Web API. FrameNet. TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation. All welcome.
Practicalities. Free trial accounts. Collaborators and innovative users: free longer-term accounts (Wikinomics, Tapscott and Williams). API: details under 'help' on the SkE home page. New Model Corpus: available by end 2009; watch Corpora.
Thank you. http://www.sketchengine.co.uk