Web Corpora. Adam Kilgarriff. Auckland 2012.


1 Web Corpora
Adam Kilgarriff
Auckland 2012

2 You can’t help noticing
Replaceable or replacable?
  – http://googlefight.com

3 Very very large
  – 2006 estimates for the duplicate-free, linguistic, Google-indexed web
    German: 44 billion words
    Italian: 25 billion words
    English: 1,000 billion to 10,000 billion words
Most languages
Most language types
Up-to-date
Free
Instant access

4 Overview
Is the web a corpus?
Representativeness
What is out there?
  – Web1T
Googleology
Web corpus types
  – Targeted sites: Oxford English Corpus
  – General: WaC family
  – WebBootCaT

5 Is the web a corpus?
Sinclair
  – in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”
    “…not a corpus because dimensions unknown, constantly changing, not designed from a linguistic perspective”
But
  – We can find out dimensions
  – Many corpora are not designed
    “as much chatroom dialogue as I can get”
Def: a corpus is a collection of texts
  – when viewed as an object of language research

6 Is the web a corpus?
Yes

7 but it’s not representative

8 Theory
A random sample of a population is representative of it.
Observations on the sample support inferences about the population (within confidence bounds)
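A small worked example of "within confidence bounds", as a sketch only. It assumes a well-defined population and a genuinely random sample, and uses the normal approximation for a proportion; the function name and the figures (120 hits in 1,000 sampled items) are invented for illustration and do not come from the talk.

```python
import math


def proportion_confidence_interval(hits, sample_size, z=1.96):
    """Normal-approximation interval for a proportion seen in a random sample.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = hits / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical figures: 120 of 1,000 randomly sampled sentences contain a feature.
low, high = proportion_confidence_interval(hits=120, sample_size=1000)
print(f"{low:.3f} to {high:.3f}")
```

The interval only means something if the sample really is random with respect to a defined population.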

9 Theory
A random sample of a population is …
What is the population?
  – production and reception
  – speech and text
  – copying

10 Theory
Population not defined
Representative sample not possible

11 Sublanguage
Language = core + sublanguages
Options for corpus construction
  – none
  – some
  – all
None
  – impoverished view of language
Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
All: until recently, not viable

12 Representativeness
The web is not representative, but nor is anything else
Text type variation
  – under-researched, lacking in theory
    Atkins, Clear and Ostler 1993 on the design brief for the BNC; Biber 1988; Kilgarriff 2001; Sharoff 2006
Text type is an issue across linguistics
  – Web: the issue is acute because, as against the BNC or WSJ, we simply don’t know what is there

13 What is out there?
What text types are there on the web?
  – some are new: chatroom
  – proportions
    is it overwhelmed by porn? How much?
Hard question

14 Comparing frequency lists
Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    that’s 1,000,000,000,000
Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top 50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
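The comparison above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used for the talk: it assumes each list is a word-to-count mapping, that counts are normalised by the total of the list itself, and that ratios are computed only for words in both top-k lists. The function names and the toy counts are hypothetical.

```python
def top_items(freqs, k):
    """Return the k most frequent words, ties broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def compare_frequency_lists(freq_a, freq_b, k, n_extremes):
    """Compare the top-k words of corpus A against the top-k words of corpus B.

    Returns (words only in A's top k, highest A:B ratios, lowest A:B ratios).
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    top_a = top_items(freq_a, k)
    top_b = set(top_items(freq_b, k))

    only_a = [w for w in top_a if w not in top_b]
    shared = [w for w in top_a if w in top_b]

    # Ratio of relative frequencies, so corpus size does not dominate.
    ratio = {w: (freq_a[w] / size_a) / (freq_b[w] / size_b) for w in shared}
    ranked = sorted(shared, key=lambda w: ratio[w], reverse=True)
    return only_a, ranked[:n_extremes], ranked[-n_extremes:]


# Toy counts, invented for illustration only.
web = {"the": 500, "browser": 40, "url": 30, "he": 20, "she": 10}
bnc = {"the": 500, "she": 50, "he": 40, "council": 10}
only_web, web_high, web_low = compare_frequency_lists(web, bnc, k=4, n_extremes=1)
print(only_web, web_high, web_low)
```

With real lists, k would be 50,000 and n_extremes 50, matching the figures on the slide.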

15 Web-high (155 terms)
61 web and computing
  – config browser spyware url www forum
38 porn
22 US English (incl Spanish influence – los)
18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
4 legal
  – trademarks pursuant accordance herein

16 Web-low
Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

17 Observations
Pronouns and past tense verbs
  – Fiction
Masc vs fem
Yesterday
  – Probably daily newspapers
Constancy of ratios:
  – He/him/himself
  – She/her/herself

18 The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
    we are well placed
  – a lot of people will be interested
Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure

19 Using Search Engines
No setup costs
Start querying today
Methods
Hit counts
‘Snippets’
  – Metasearch engines, WebCorp
Find pages and download

20 Googleology
Google hit counts for language modelling
  – Example: (Keller & Lapata 2003)
  – 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
Very interesting work
Great interest in query syntax

21 The Trouble with Google
not enough instances
  – max 1000
not enough queries
  – max 1000 per day with API
not enough context
  – 10-word snippet around search term
sort order
  – search term in titles and headings
untrustworthy hit counts
limited search options
linguistically dumb, e.g. not lemmatised
  aime/aimer/aimes/aimons/aimez/aiment …

22 Appeal
  – Zero-cost entry, just start googling
Reality
  – High-quality work: high-cost methodology

23 Also:
No replicability
Methods, stats not published
At mercy of commercial corporation

24 Also:
No replicability
Methods, stats not published
At mercy of commercial corporation
Googleology is bad science

25 Web corpus types
Large, general corpora
Small, specialised corpora
  – Specially for translators

26 Basic steps
Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool
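The basic steps can be strung together as a toy pipeline. This is a minimal sketch, assuming the pages have already been gathered as (url, plain text) pairs; the minimum-length filter, the exact-match de-duplication by hash and the regex tokeniser are simple stand-ins chosen for illustration, and `build_corpus` and the example.org URLs are hypothetical.

```python
import hashlib
import re


def build_corpus(pages, min_tokens=5):
    """pages: iterable of (url, plain_text) pairs that were already gathered."""
    seen_digests = set()
    corpus = []
    for url, text in pages:
        # Linguistic processing stand-in: lowercase word tokenisation.
        tokens = re.findall(r"\w+", text.lower())

        # Filter: drop pages with too little text to be useful.
        if len(tokens) < min_tokens:
            continue

        # De-duplicate: skip pages whose token sequence was seen before.
        digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)

        # Load: an in-memory list here; a corpus tool would index this.
        corpus.append({"url": url, "tokens": tokens})
    return corpus


pages = [
    ("http://example.org/a", "Web corpora are built by crawling and cleaning pages."),
    ("http://example.org/b", "Web corpora are built by  crawling and cleaning pages."),
    ("http://example.org/c", "Home | About"),
]
corpus = build_corpus(pages)
print([doc["url"] for doc in corpus])
```

Page b is dropped as a duplicate of a (it differs only in whitespace), and page c is dropped as too short.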

27 Oxford English Corpus
Whole domains chosen and harvested
  – control over text type
2.3 billion words

28 Oxford English Corpus

29 Oxford English Corpus

30 WaC family (DeWaC, ItWaC)
1.5 billion words each
Baroni and colleagues
Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
Google on seed words, then crawl
TenTen family (enTenTen, deTenTen)
2-10 billion words each, same methodology
Lexical Computing

31 Filtering
Non-text (sound, image etc) files
Boilerplate (within file)
  – Copyright notices, navigation bars
  – “high markup” heuristic
Not “text in sentences”
  – Look for function words
  – Lists?? Sports results?? Crossword puzzles??
Spam, pornography
  – Tough
De-duplication (also tough)
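The "look for function words" check can be illustrated as follows. This is a sketch under loud assumptions: the short English function-word list, the 0.25 threshold and the minimum token count are all invented for illustration and would need tuning per language; none of them is taken from the talk.

```python
import re

# A deliberately tiny, illustrative list; a real filter would use a fuller one.
FUNCTION_WORDS = frozenset(
    {"the", "of", "and", "to", "a", "in", "that", "is", "was", "it", "for", "with", "as", "on"}
)


def function_word_ratio(text):
    """Share of alphabetic tokens that are function words (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in FUNCTION_WORDS for token in tokens) / len(tokens)


def looks_like_running_text(text, threshold=0.25, min_tokens=5):
    """Heuristic: connected prose has many function words; lists and tables have few."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) < min_tokens:
        return False
    return function_word_ratio(text) >= threshold


prose = "The council said that it was in the interest of the town to wait for the report."
results = "Arsenal 2 Chelsea 1 Liverpool 0 Everton 0 Leeds 3 Fulham 2"
print(looks_like_running_text(prose), looks_like_running_text(results))
```

The sports-results line contains no function words at all, so it is rejected, while the prose sentence passes easily.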

32 Small, specialised corpora
Terminologists
Translators needing target-language domain-specific vocab
Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date

33 BootCaT (Bootstrapping Corpora and Terms)
  – Put in seed terms
  – Google/Yahoo/Bing search
  – Retrieve Google/Yahoo/Bing hits
    Remove duplicates, boilerplate
  – Small instant corpora
  – Baroni and Bernardini, LREC 2004
  – Web version: WebBootCaT
    At Sketch Engine site
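The first step, turning seed terms into search queries, might look like this. It is a sketch only, assuming seeds are combined into small random tuples and each tuple is sent as one query; the tuple size, the query limit and the function name `seed_queries` are illustrative assumptions, and no search engine is actually called.

```python
import itertools
import random


def seed_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed terms into query strings, one query per tuple of seeds."""
    unique = sorted(set(seeds))
    if len(unique) < tuple_size:
        raise ValueError("need at least tuple_size distinct seed terms")

    combos = list(itertools.combinations(unique, tuple_size))
    # A fixed seed keeps the selection reproducible between runs.
    random.Random(rng_seed).shuffle(combos)

    queries = []
    for combo in combos[:max_queries]:
        # Multi-word terms are quoted so they are searched as phrases.
        parts = [f'"{term}"' if " " in term else term for term in combo]
        queries.append(" ".join(parts))
    return queries


# Hypothetical seed terms for a small corpus on web corpora.
for query in seed_queries(["corpus", "crawler", "lemma", "tokeniser", "concordance"], max_queries=4):
    print(query)
```

Five seeds give ten possible three-term tuples, which is why a handful of seeds is enough to generate a varied set of queries.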

34 Task
Choose area of specialist interest
  – Choose your language
Select at least 5 seed terms
  – Specialist: good
Build corpus
  – At least 100,000 words
  – Iterate if necessary
Find at least six words/phrases/meanings you did not know before

