Corpora by Web Services. Adam Kilgarriff (Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex)

1 Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex

2 Leeds, April 2010 Kilgarriff: Corpora by Web Services
Starting a PhD in NLP
- Then
  - Prolog
  - Type in a few
    - grammar rules
    - lexical entries
    - example sentences
  - We’re off!

3 Now
- Corpus
  - Which?
  - Budget/schedule
  - How much can we afford?
  - Hard disk space
- Access software
  - Build
    - Big job, making it fast is hard
  - or: research, acquire, install, maintain …

4
- Research question
  - Morphology, syntax, discourse structure, semantics, anaphor
- First six months at least
  - Acquiring data, software
  - Complications

6 If you’re not super-geeky
- Did I do it properly?
- Dumbing down
  - Let’s choose an easier question
- Looking over shoulder

7 Disappointment

8 Making it easy
- Like picking up a hire car

9 Corpora by web services
- Possible?
- Already available

10 Sketch Engine
- Corpus querying
- Fast
- Handles large corpora
- In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
- Word sketches
  - Data-driven summary of a word’s grammatical and collocational behaviour

12 Corpora
Sizes in millions

Language    Size   Language    Size   Language    Size
Arabic       174   Hindi         31   Russian      188
Chinese      456   Indonesian   102   Slovak       536
Czech        800   Irish         34   Slovene      738
Dutch        128   Italian     1910   Spanish      117
English     5508   Japanese     409   Swedish      114
French       126   Norwegian     95   Telugu         5
German      1627   Persian        6   Thai         108
Greek        149   Portuguese    66   Vietnamese   174
                   Romanian      53   Welsh         63

13 Big, High Quality corpora
- Big
  - Performance
    - Banko and Brill 2004
    - There’s no data like more data
  - Ample data for rare phenomena
  - Big subcorpora
    - 5b
    - Medical: 30m

14 Quality
- Bad data
  - Spam
  - Navigation-bars
  - Duplicates
  - Lists
  - Bungled formatting
  - Wrong language
  - …
- Less discussed
  - Maybe a footnote
  - I wonder why
- Quick fixes and run

15 The Google/Yahoo/Bing option
- Appeal
  - No setup costs
  - Start googling today

16
- Very interesting work
  - Keller and Lapata
    - Validity of search engine (SE) counts vs BNC counts vs psycholinguistic validity of collocations
    - 36 queries per collocation: “fulfil obligation”, “fulfil ? obligation”, “fulfilling obligations”, ...
  - Nakov, Nakov and Hearst
    - Great interest in query syntax

17 but
- Limited hits-per-query
- Limited hits-per-day
- Sort order
  - Not documented
  - 'unsorted' not possible
- Snippets too short for research
- No (documented) morphology
- Limited query syntax

18 and
- At mercy of commercial company
- Might change at any time
- Not replicable

19 So
- Appeal
  - No setup costs
- Serious research
  - Many difficult practical issues
  - Not a tool designed for linguists
- Conclusion
  - If only SE indexes are big enough
    - Yes
  - Else no

20 Strategy
- More languages
  - Corpus Factory, as Sharoff
- Bigger
  - Big Web Corpus (BiWeC)
  - Currently 5.5b fully processed
  - Target 20b
- Better

21 New Model Corpus
- BNC is past its sell-by
  - Early 1990s
  - Pre web
  - Still dominant model
- New model needed

22 Model
- Small: model train
  - Model train
- Design: software model
  - NMC
- 1:100 for BiWeC-scale
  - 100m
- Update of BNC as design model
  - Data from web but
  - Text type available

23 Open-source/collaboration
- We distribute
- You annotate
  - POS-tags, parses, anaphor, discourse moves, semantics, multiwords, entity-types...
  - Domain, register, region...
- Send us annotations
- We integrate
  - And give access in SkE

24 Divide and rule
- Bigger (BiWeC)
- Better (NMC)
- Take best annotations
  - Accuracy
  - Speed
  - Usefulness
  - Good collaboration
- from NMC, apply to BiWeC

25 TEDDCLOG
Taiwan English Data-Driven CLOze Generation
with Simon Smith and colleagues, Taipei
- API case study

26 Cloze
- 'fill-the-gap'
  - Several metals _____ violently with cold water
    - A: behave
    - B: react
    - C: realise
    - D: respond
- Popular with students, teachers, testers
  - Unpopular with theorists :-(

27 One objection
- Test item writers make them up
- Not naturally-occurring language
  - The Sinclair-Johns critique
  - Also: expensive
- TEDDCLOG
  - Uses corpus sentences and distractors

28 [Diagram: TEDDCLOG pipeline]
- Key: react
- Modules: Thesaurus, Diffs, Concordance, Text processing
- Related words: behave, interact, respond
- Distractors: behave, realise, respond
- Collocation check: metals behave x, metals respond x, metals realise x, metals react √
- Carrier sentence: Several metals react violently with cold water.
- Output: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond
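
The last step in the diagram, turning a carrier sentence into a test item, can be sketched in a few lines of Python. This is only an illustration of the idea, not the TEDDCLOG text processing module itself: the function name `make_cloze`, the three-underscore gap and the alphabetical ordering of options are assumptions made for the example.

```python
import re

def make_cloze(sentence, key, distractors):
    """Blank the first whole-word occurrence of `key` and label the options.

    Hypothetical helper for illustration; not part of any Sketch Engine API.
    """
    pattern = re.compile(r"\b" + re.escape(key) + r"\b")
    gapped, n_subs = pattern.subn("___", sentence, count=1)
    if n_subs == 0:
        raise ValueError(f"key {key!r} not found in sentence")
    # Options are the key plus the distractors, in alphabetical order
    options = sorted(set(distractors) | {key})
    letters = "abcdefghijklmnopqrstuvwxyz"
    labelled = [f"({letters[i]}) {word}" for i, word in enumerate(options)]
    answer = letters[options.index(key)]
    return gapped, labelled, answer

gapped, labelled, answer = make_cloze(
    "Several metals react violently with cold water.",
    "react",
    ["behave", "realise", "respond"],
)
print(gapped)
print(" ".join(labelled))
print("answer:", answer)
```

Sorting the options keeps the position of the key from depending on the order in which the distractors were found.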

29 API calls
- Find distractors
  - thesaurus
- Find key-only collocate
  - Sketch diffs
    - Needs optimising
- Find carrier sentence
  - Concordance with GDEX module
    - Good Dictionary Example Finder
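
How the three calls might chain together can be sketched with toy in-memory data standing in for the web service. Everything here is an assumption made for illustration: the dictionaries, the function names and the return shapes are hypothetical and are not the Sketch Engine API, and the GDEX ranking of example sentences is not modelled (the sketch simply takes the first matching sentence).

```python
# Toy stand-ins for the thesaurus, sketch-diff and concordance calls.
# All data and names below are hypothetical, for illustration only.
THESAURUS = {"react": ["behave", "realise", "respond"]}
COLLOCATES = {
    "react": ["metals", "people", "markets"],
    "behave": ["people", "children"],
    "realise": ["people"],
    "respond": ["people", "markets"],
}
CONCORDANCE = [
    "People react badly to criticism.",
    "Several metals react violently with cold water.",
]

def find_distractors(key, n=3):
    """Thesaurus step: words similar to the key."""
    return THESAURUS.get(key, [])[:n]

def key_only_collocates(key, distractors):
    """Sketch-diff step: collocates of the key that no distractor shares."""
    shared = set().union(*(COLLOCATES.get(d, []) for d in distractors))
    return [c for c in COLLOCATES.get(key, []) if c not in shared]

def find_carrier_sentence(key, collocate):
    """Concordance step: first sentence containing both key and collocate."""
    for sentence in CONCORDANCE:
        words = sentence.lower().rstrip(".").split()
        if key in words and collocate in words:
            return sentence
    return None

def build_item(key):
    """Chain the three steps; return None if no usable sentence is found."""
    distractors = find_distractors(key)
    for collocate in key_only_collocates(key, distractors):
        sentence = find_carrier_sentence(key, collocate)
        if sentence:
            return {"key": key, "distractors": distractors,
                    "collocate": collocate, "sentence": sentence}
    return None

print(build_item("react"))
```

The key-only collocate is what rules the distractors out: in the toy data "metals" occurs with "react" but with none of the other verbs, so a sentence built around "metals" has only one acceptable answer.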

30 Current status
- TEDDCLOG
  - Next phase: producing decent results
- Corpora by Web Services
  - Upping server capacity
  - Looking for users (currently with UKWaC)
- New Model Corpus
  - Nervous over copyright but
  - Available in SkE, for download

31 Another announcement: DANTE
- Lexical database for English
  - Detailed Accurate Extensive of English
  - Highly corpus-driven
  - 3 yr project
  - 18 expert lexicographers
  - Led by Sue Atkins
    - BNC, FrameNet, Euralex, COBUILD...
- English side, New English-Irish dictionary
- Available for NLP research imminently

