CL 2005, Birmingham. Web as Corpus Workshop. Intro: Adam Kilgarriff. Co-chairs: Marco Baroni, Adam Kilgarriff, Sebastian Hoffman.


1 Web as Corpus Workshop
Co-chairs: Marco Baroni, Adam Kilgarriff, Sebastian Hoffman

2 "When you have tons of data and tons of computation you can make things work that don’t work on smaller systems" - Google's VP-engineering, Urs Hölzle

3 History within CL
- 1989: corpora arrive on scene
- 1989-1993: “too dirty”: battles
- 1993: CL Special Issue: consummation
- …
- 1999: web arrives on scene
- 1999-2003: “too dirty”
- 2003: CL Special Issue

4 History within CL
- 1989: corpora arrive on scene
- 1989-1993: “too dirty”: battles
- 1993: CL Special Issue: consummation
- 1993: WVLC workshop series starts
- …
- 1999: web arrives on scene
- 1999-2003: “too dirty”
- 2003: CL Special Issue
- 2005: WAC workshop series starts

5 History
[Chart: corpus size in words (log scale, 10^6 to 10^9) by decade, 1960s to 2010. Labelled points: Brown/LOB, COBUILD, BNC, Gigaword, ?]

6 Approaches
- Use Google hit counts
- Use snippets
- Use Google, then download pages
- Spider from relevant starting sites (Marco Baroni’s analysis)

7 The Trouble with Google
- not enough instances (max 1000)
- not enough context
  - ca. 10-word snippet around search term
- ridiculous sort order
  - search term in titles and headings
- linguistically dumb
  - not lemmatised
    - think/thinks/thinking/thought: four searches
  - not POS-tagged
    - mixes up beat (n) and beat (v)
  - and why not parsed
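The lemmatisation and context complaints can be illustrated with a minimal sketch of a lemma-based concordance over locally held text. Everything here is an assumption made for illustration: the `FORM_TO_LEMMA` table is hand-made and hypothetical (a real system would use a lemmatiser), and `concordance` is not any particular tool's API. It shows one query covering think/thinks/thinking/thought, with a context width the user controls.

```python
import re

# Hypothetical hand-made lemma table; a real system would use a lemmatiser.
FORM_TO_LEMMA = {
    "think": "think",
    "thinks": "think",
    "thinking": "think",
    "thought": "think",
}

def concordance(text, lemma, width=5):
    """Return (left context, matched form, right context) for every
    token whose lemma equals `lemma`; contexts are `width` tokens wide."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        # Forms missing from the table are treated as their own lemma.
        if FORM_TO_LEMMA.get(tok, tok) == lemma:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = "She thinks it will rain. I thought so too. Thinking helps."
hits = concordance(sample, "think", width=2)
for left, form, right in hits:
    print(f"{left:>12} [{form}] {right}")
```

Note that this table sends every "thought" to the verb lemma, including the noun, which is the same kind of confusion as beat (n) versus beat (v) and is why POS tagging appears on the slide.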

8 DIY
- do it ourselves
  - this community
- Wacky

9 Components
1. web crawler
2. filters/classifiers - language id, non-text, boilerplate, genre
3. linguistic processor (optional)
4. database/indexing
5. statistical summariser (optional)
6. user interface
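The component list above can be sketched as a toy pipeline. This is a rough illustration under stated assumptions, not the workshop's actual software: the crawler is replaced by a hypothetical fixed dictionary `PAGES` (no network access), the boilerplate filter and language identifier are deliberately crude heuristics, the "database" is an in-memory inverted index, and the summariser is a plain frequency count. The linguistic processor and user interface are left out.

```python
import re
from collections import Counter, defaultdict

# Component 1 (stand-in): a hypothetical fixed set of "crawled" pages.
PAGES = {
    "http://example.org/a": (
        "Home | About | Contact\n"
        "The cat sat on the mat.\n"
        "The dog beat the cat."
    ),
    "http://example.org/b": "Login | Register\nEl gato duerme en la casa.",
}

# Tiny function-word list used as a crude English detector (assumption).
ENGLISH_HINTS = {"the", "on", "and", "of", "a"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def drop_boilerplate(text):
    # Component 2a: treat lines containing '|' as navigation bars.
    return "\n".join(line for line in text.splitlines() if "|" not in line)

def looks_english(tokens):
    # Component 2b: accept a page if >= 20% of tokens are hint words.
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in ENGLISH_HINTS)
    return hits / len(tokens) >= 0.2

def build_index(pages):
    index = defaultdict(set)  # Component 4: word -> URLs containing it
    freq = Counter()          # Component 5: corpus-wide word frequencies
    for url, raw in pages.items():
        tokens = tokenize(drop_boilerplate(raw))
        if not looks_english(tokens):
            continue
        freq.update(tokens)
        for t in tokens:
            index[t].add(url)
    return index, freq

index, freq = build_index(PAGES)
print(freq.most_common(3))
print(sorted(index["cat"]))
```

Keeping each stage as a separate function mirrors the slide's point that the components are independent: any one of them could be replaced by a serious implementation without touching the others.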

10 Programme
 9.30  Welcome, goals (Adam Kilgarriff)
10.00  Crawling (Marco Baroni)
10.30  coffee
11.00  Creating specialized and general corpora using automated search engine querying (Marco Baroni and Serge Sharoff)
12.00  Small groups: what we have all been doing
 1.00  lunch
 2.30  Processing web-derived text (Sebastian Hoffman)
 3.15  Indexing and interfaces (Stefan Evert and Adam Kilgarriff)
 4.00  coffee
 4.30  Representing genre-specific websites (Alexander Mehler and Rüdiger Gleim)
 5.00  Small groups: “what are critical next steps for WaC activity?”
 5.30  Plenary: where next?
 6.10  end

11 Small groups (proposal)
Around topics:
- WaC for theoretical linguistics
- WaC for applied linguistics
  - language teaching, translation, terminology
- WaC for NLP
- WaC for lexicography
- WaC for ontology engineering
Around problems:
- large crawls
- text processing, boilerplate removal, etc.
- indexing and interfaces

