How Useful Is the Web as a Linguistic Corpus? William H. Fletcher United States Naval Academy 2002 North American Symposium on Corpus Linguistics American.

1 How Useful Is the Web as a Linguistic Corpus?
William H. Fletcher, United States Naval Academy
2002 North American Symposium on Corpus Linguistics
American Association of Applied Corpus Linguistics
Indianapolis, IN, 1-3 November 2002

2 Making the Web More Useful as a Corpus
- Objective of this ongoing study: to develop and evaluate linguistic methods and PC tools to identify domain-relevant and linguistically representative documents more efficiently
- Long-range goal: to establish the Web both as a "corpus of first resort" and as a supplementary corpus for language professionals and learners

3 Advantages of Web
- Virtually comprehensive coverage of major languages and language varieties, content domains and written text types
- Ready availability and low cost throughout developed world
- Freshness and topicality: emerging usage and current issues well documented
- Easy to compile an ad-hoc corpus to answer a specific question or meet a specialized information need
- User familiarity with Web and independent motivation to become more proficient in using it

4 Disadvantages of Web
- Generally unknown provenance and authorship, reliability and authoritativeness of texts, both for content and linguistic form
- Predominance of certain text types among coherent texts, especially legal, journalistic, commercial and academic prose
- Overall lower standards of form and content verification than printed sources
- Systematically accessible only through commercial search engines, which support only very rough search criteria
- Counts of a given linguistic feature give only a general numeric indication, not statistical proof

5 “Noise” Filter for HRDs
- Highly Repetitive Documents
  - Discussion groups where replies incorporate original post
  - Internal links
  - Boilerplate
  - Search engine spam
- Strategy: identify documents with frequent n-grams
  - 8-grams, 12-grams, 25-grams useful range
  - Either eliminate document or eliminate redundant text
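
The n-gram strategy for highly repetitive documents can be sketched roughly as follows. This is a minimal illustration, not the method used in the study: whitespace tokenization, the function names and the 0.2 cut-off are all assumptions made for the example; only the idea of looking for frequent long n-grams (here 8-grams) comes from the slide.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_ratio(text, n=8):
    """Share of n-gram occurrences whose n-gram appears more than once."""
    tokens = text.lower().split()  # assumption: crude whitespace tokenizer
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def is_highly_repetitive(text, n=8, threshold=0.2):
    """Flag a document as an HRD candidate; the threshold is hypothetical."""
    return repeated_ngram_ratio(text, n) >= threshold
```

A document flagged this way could either be dropped or have the repeated spans removed, as the slide suggests; a usable threshold would have to be tuned on real data.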

6 “Noise” Filter for VIDs
- Virtually Identical Documents
  - Mirrored documents with slight differences
  - News stories
- Rank and absolute frequency of 3- to 5-grams alerts to VIDs
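
One way to compare the 3- to 5-gram frequencies of two documents is sketched below. The slide does not say how the comparison was scored, so the weighted Jaccard overlap used here is a stand-in chosen for the example, and the function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, sizes=(3, 4, 5)):
    """Count every 3-, 4- and 5-gram in a whitespace-tokenized text."""
    tokens = text.lower().split()
    profile = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            profile[tuple(tokens[i:i + n])] += 1
    return profile


def profile_overlap(a, b):
    """Weighted Jaccard overlap of two profiles: 1.0 identical, 0.0 disjoint."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    shared = sum(min(a[k], b[k]) for k in keys)  # Counter yields 0 if missing
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total
```

Two mirrored pages that differ only in a date line or a footer would score close to 1.0, while unrelated pages would score near 0.0; where to draw the line is an empirical question.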

7 “Noise” Filter for IDs
- (Fully) Identical Documents
  - Mirrored documents
  - Multiple URLs for same document
  - Server-generated error messages
- MD5 / SHA (Message Digest 5 / Secure Hash Algorithm) reduces normalized text of any length to a short fixed-length code (16 bytes for MD5, 20 bytes for SHA-1) with high probability of uniqueness
- Digest codes from thousands of documents can be stored in a binary tree for efficient comparison and elimination of redundant documents
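
Digest-based elimination of identical documents might look like the sketch below. It uses Python's `hashlib.sha1`, whose digest is 20 bytes (`hashlib.md5` would give 16 bytes). The normalization step (lowercasing and collapsing whitespace) is an assumption, and a Python `set` stands in for the binary tree mentioned on the slide.

```python
import hashlib


def normalize(text):
    """Assumed normalization: lowercase and collapse all whitespace runs."""
    return " ".join(text.lower().split())


def digest(text):
    """Return the 20-byte SHA-1 digest of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).digest()


def drop_identical(documents):
    """Keep the first (url, text) pair for each distinct digest."""
    seen = set()  # a set replaces the slide's binary tree in this sketch
    unique = []
    for url, text in documents:
        key = digest(text)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```

Because only the fixed-length digests are kept in memory, the check stays cheap even for thousands of long documents.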

8 Unproven “Noise” Filters
- Microsoft Word Spelling Checker to recognize, normalize ill-formed documents automatically
  - Some success; deserves further attention
  - Problem: large number of items (personal, commercial and place names, technological terms) not in default lexicon, so it rejects too many good documents.
- Patterns of 1- and 2-grams to recognize PFDs (Primarily Fragmentary Documents)
  - Some high-frequency types (articles, copula) rare in fragments, others (common prepositions) frequent
  - Content words and special terms (see above) relatively prominent
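
The 1-gram cue for fragmentary documents could be tried out along these lines. This is only a sketch of the observation that articles and the copula are rare in fragments: the word list, the function names and the 5% cut-off are assumptions for the example, and the slide itself labels the approach unproven.

```python
import re

# Hypothetical word list: English articles and forms of the copula "be".
ARTICLES_AND_COPULA = {
    "a", "an", "the", "am", "is", "are", "was", "were", "be", "been", "being",
}


def article_copula_share(text):
    """Fraction of word tokens that are articles or copula forms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ARTICLES_AND_COPULA)
    return hits / len(tokens)


def looks_fragmentary(text, min_share=0.05):
    """Flag text whose article/copula share falls below an assumed cut-off."""
    return article_copula_share(text) < min_share
```

A link list such as a navigation menu contains almost none of these words, whereas running prose typically contains many.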

9 Size as A Priori Filter
- Webpages under 3 kB or over 150 kB have lower “signal to noise” ratio
  - In these extreme ranges documents consist of coherent text less frequently or to a lesser degree
  - Shorter files tend to have much lower ratio of text file size to HTML file size (49% vs. 64% overall)
- Rule of thumb: download and process only pages larger than 5 kB and smaller than 200 kB (size before stripping HTML tags)
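
The size rule of thumb and the text-to-HTML ratio can be sketched as below. The 5 kB and 200 kB bounds come from the slide; treating 1 kB as 1024 bytes, the function names, and the very crude tag stripping via the standard library's `html.parser` (which also counts script and style content as text) are assumptions of this example.

```python
from html.parser import HTMLParser

MIN_BYTES = 5 * 1024    # slide: larger than 5 kB (1 kB = 1024 bytes assumed)
MAX_BYTES = 200 * 1024  # slide: smaller than 200 kB


class _TextCollector(HTMLParser):
    """Collect the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def within_size_window(html_bytes):
    """Apply the size filter to the raw page, before stripping tags."""
    return MIN_BYTES < len(html_bytes) < MAX_BYTES


def text_to_html_ratio(html):
    """Ratio of extracted text length to original HTML length."""
    if not html:
        return 0.0
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return len("".join(parser.chunks)) / len(html)
```

Checking the size first avoids downloading or parsing pages that are unlikely to contain much coherent text.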

10 My Web Corpus 1
- Compiled one afternoon in October 2001 via KWiCFinder searches on the 20 most frequent words in English
- Preliminary studies of 100 and 5859 webpages respectively revealed great bias towards commercial sites due to "paid positioning" on AltaVista; sites ranked highest for this reason were excluded from this study
- Initially consisted of 11,201 online documents (OLDs)
- Various "noise filters" were applied to make the results more useful
- 7294 survived automatic elimination of IDs and VIDs
- 256 HRDs were eliminated
- Remaining documents were viewed individually and classified as
  - Primarily useful text
  - "Noisy" text
  - Primarily non-text (link lists, fragments, headers / footers predominated...)

11 My Web Corpus
- 4949 unique documents passed all automatic tests and human classification
- 5.25 million tokens in 35 MB of files
- Longer coherent texts from government, academic, legal, religious (Christian, Jewish, Muslim, Hindu), journalistic and commercial sources, plus many “hobbyist” pages on a wide range of topics
- Compared to the BNC as a standard reference corpus (see appendix with annotated comparison of n-gram frequencies)
- Generally quite comparable, but important differences:
  - UK vs. US bias in institutions, place names, spelling
  - BNC: bias toward third person, past tense, narrative style
  - WC: bias toward first (especially we) and second person, present tense, interactive style
  - Words referring to Internet concepts and information missing or rare in BNC, highly prominent in WC (and in contemporary English)
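
Comparing frequencies across corpora of different sizes usually means normalizing the counts first. The sketch below shows one common way to do that for 1-grams, using frequencies per million tokens and a smoothed ratio. It is not taken from the appendix mentioned above: the function names, the whitespace tokenizer and the smoothing constant are assumptions.

```python
from collections import Counter


def per_million(text):
    """Relative frequency of each word type per million tokens."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    scale = 1_000_000 / len(tokens)
    return {word: count * scale for word, count in Counter(tokens).items()}


def frequency_ratios(corpus_a, corpus_b, floor=1.0):
    """Ratio of per-million frequencies (A over B); floor avoids division by zero."""
    fa, fb = per_million(corpus_a), per_million(corpus_b)
    return {
        word: (fa.get(word, 0.0) + floor) / (fb.get(word, 0.0) + floor)
        for word in set(fa) | set(fb)
    }
```

Ratios well above 1 mark words that are more prominent in the first corpus, and ratios well below 1 mark words that are more prominent in the second.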

