1 Corpora: Annotating and Searching LING 5200 Computational Corpus Linguistics Martha Palmer.

2 LING 5200, 2006, based on Kevin Cohen’s LING 5200. Features of corpora: size (little/big/huge); plasticity (finite/monitor); metadata (none/lots); annotation (none, …, lots); balance.

3 Features: size. Relative over time. Currently: micro/small/large/massive.

4 Features: size. Relative over time: 1960s: 1M words (Brown); 1990s: 4.5M words (Penn Treebank); 2000s: 415M words (Bank of English, BOE); 2000s: 1000M words (English Gigaword). Currently: micro/small/large/massive.

5 Features. Finite: size established in advance; sample sizes adjusted accordingly; doesn't change over time. Monitor: allows diachronic analysis; grows over time.

6 Metadata. (Practically) none: language, at least; document boundaries. Some: document attributes (title, body, author, date). Example:
PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
PG - 510-25
AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of

7 Metadata. Lots: author characteristics (gender, age, mother tongue(s), dialect, educational level); genre classification (news, scientific, personal); topic; relevance. Example:
MH - Aged
MH - Azores/ethnology
MH - Cerebellar Ataxia/diagnosis
MH - Gene Frequency
MH - Human
MH - Phenotype
MH - Portugal/ethnology
MH - Support, Non-U.S. Gov't
MH - Syndrome
MH - United States
MH - Variation (Genetics)
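The tagged fields in the example record on the metadata slides (PMID, DP, TI, PG, AB, MH) can be read into a simple data structure. The following is a minimal sketch, not the course's own tooling: the function name `parse_tagged_record` is hypothetical, and the layout assumed here (one `TAG - value` field per line, indented continuation lines, line break inside the title) is an assumption about how the original record was formatted before it was flattened onto the slide.

```python
import re

# Assumed field layout: a 2-4 letter uppercase tag, optional padding, a hyphen, then the value.
FIELD_RE = re.compile(r"([A-Z]{2,4})\s*-\s?(.*)")

def parse_tagged_record(text):
    """Return a dict mapping each tag to the list of values seen for it."""
    fields = {}
    current_tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = FIELD_RE.match(line)
        if match:
            # A new field starts; repeated tags (e.g. MH) accumulate in a list.
            current_tag = match.group(1)
            fields.setdefault(current_tag, []).append(match.group(2).strip())
        elif current_tag is not None:
            # Indented continuation line: extend the most recent value.
            fields[current_tag][-1] += " " + line.strip()
    return fields

# Sample built from the slide's record; the line wrapping is an assumption.
RECORD = """PMID- 6509398
DP  - 1984 Nov
TI  - The natural history of Machado-Joseph disease. An analysis of 138
      personally examined cases.
PG  - 510-25
MH  - Aged
MH  - Azores/ethnology
"""

record = parse_tagged_record(RECORD)
print(record["PMID"], record["MH"])
```

Keeping every value in a list, even for single-valued tags, avoids special-casing repeatable fields such as MH when filtering a corpus by metadata.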

8 Balanced corpora. What are you balancing? Most common: genre. Authors: gender; age; education; dialect.

9 Balanced corpora. [Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Tree labels: speech, writing; unpublished, published; non-fiction, fiction; informative, instructional, persuasive; academic, popular, news.]

10 Balanced corpora. [Figure: Composition of the International Corpus of English (adapted from Meyer 2002). Tree labels: speech, writing; dialogue, monologue; scripted, unscripted; talks, news, speeches.]

11 Corpus length. Overall length. Sample size: partial: 2,000 words (Brown, LOB, ICE), 5,000 words (London-Lund); full: takes up space, and copyright permission is harder to obtain.

12 Sample size. Motivating assumption: it is more important to maximize the number of authors/genres than the length of text from each.
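The partial-sample idea can be illustrated with a small sketch: cap every source text at a fixed number of words so that no single author or genre dominates. This is only an illustration under stated assumptions; the names `take_sample` and `build_sampled_corpus` are hypothetical, tokens are split on whitespace, and the sample is taken from the start of each text, none of which is claimed to match how Brown, LOB, or ICE were actually sampled.

```python
def take_sample(text, size=2000, start=0):
    """Return up to `size` whitespace-separated tokens beginning at token `start`."""
    tokens = text.split()
    return tokens[start:start + size]

def build_sampled_corpus(texts, size=2000):
    """Map each text name to a fixed-size sample, so long texts do not dominate."""
    return {name: take_sample(text, size) for name, text in texts.items()}

# Toy input: one long text and one short text.
texts = {"long_text": "word " * 5000, "short_text": "only three words"}
corpus = build_sampled_corpus(texts)
print({name: len(sample) for name, sample in corpus.items()})
```

Texts shorter than the sample size are kept whole, which is one possible design choice; a stricter sampler might exclude them instead.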

13 By purpose. Linguistic-y: lexicon vs. other. NLP: general purpose; information retrieval; information extraction.

14 By purpose. Linguistic-y: lexicon vs. other. NLP: general purpose; information retrieval; information extraction. Foreign language instruction: native L2; "learner" L2.

15 Is there a corpus…

16 Is there a corpus… http://www.ldc.upenn.edu/

17 Annotation: none/some/lots.

18 Annotation. None: "collection". Some: POS; lemmas, e.g. lemma(be) = {be, am, is, are, was, were, being, been}.
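The lemma(be) mapping can be expressed as a lookup table from inflected forms back to their lemma. This is a minimal sketch with hypothetical names (`LEMMA_FORMS`, `lemmatize`); it covers only the one lemma used as the example, lowercases all tokens, and leaves unknown words unchanged, which a real lemmatizer would not do.

```python
# Lemma -> set of inflected forms (only "be" is listed, as an illustration).
LEMMA_FORMS = {
    "be": {"be", "am", "is", "are", "was", "were", "being", "been"},
}

# Invert the table so each surface form points to its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEMMA_FORMS.items() for form in forms
}

def lemmatize(tokens):
    """Replace each token with its lemma if known, otherwise its lowercase form."""
    return [FORM_TO_LEMMA.get(tok.lower(), tok.lower()) for tok in tokens]

print(lemmatize("Dogs are happy".split()))
```

With lemma annotation in place, a search for the lemma "be" retrieves all of its forms in one query instead of one query per form.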

19 Annotation. Lots: syntax (treebank, "bracketing"); semantics (predicate/argument structure, ontological). Example: Dogs make me happy.
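To make "bracketing" concrete, the sketch below reads a parenthesized tree into nested Python tuples and recovers the words. The bracketing given for "Dogs make me happy." is an illustrative analysis written for this example, not taken from any treebank release, and `parse_bracketing` and `leaves` are hypothetical helper names.

```python
def parse_bracketing(text):
    """Parse '(LABEL child ...)' notation into (label, [children]) tuples; words stay strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a leaf word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()

def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

# Illustrative bracketing (an assumption, not treebank data).
EXAMPLE = "(S (NP (NNS Dogs)) (VP (VBP make) (S (NP (PRP me)) (ADJP (JJ happy)))) (. .))"

tree = parse_bracketing(EXAMPLE)
print(tree[0], leaves(tree))
```

Once the tree is a nested structure, searches over syntax (for example, every NP directly under S) become ordinary tree traversals.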

20 Diachronic. Historical (OE, ME, …); later sampling of earlier balanced corpus; monitor.

21 Spoken. Phonetically motivated (elicited); other ("natural").

22 Multilingual. Parallel: L1 contents == L2 contents; parliamentary proceedings in English & French; Shakespeare in English and German. Translation/comparable: two L1's; genre == genre; e.g., weather reports.

23 Penn Treebank. Treebank: corpus of syntactically annotated data. First release: 4.5 million words, 3 years' work. Currently 4.9M.

24 Penn Treebank

25 Penn Treebank.
POS-tagged Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-pos.html
Dysfluency-annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-dfl.html
Syntactically annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-bkt.html
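POS-tagged text is commonly distributed as `word/TAG` tokens, and a few lines of code are enough to split them and count tags. The sketch below assumes that format; the sample line is invented for illustration and is not Switchboard data, and `parse_tagged_line` is a hypothetical name.

```python
from collections import Counter

def parse_tagged_line(line):
    """Split 'word/TAG' tokens into (word, tag) pairs; untagged tokens get tag None."""
    pairs = []
    for token in line.split():
        # Split on the last slash so words that contain '/' keep their full form.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs

# Invented sample line in the assumed word/TAG format.
SAMPLE = "Dogs/NNS make/VBP me/PRP happy/JJ ./."

pairs = parse_tagged_line(SAMPLE)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```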

26 GENIA. 2000 abstracts; red blood cell transcription factors; POS-tagged (HW2, #16); semantic annotation with molecular biology ontology.

27 Corpora/resources. Dictionaries, ontologies, ...: CELEX; WordNet.

28 Corpora/resources. Dictionaries, ontologies, ... "Discovery procedure". Phonology: contrasts; phonotactics. Morphology: term formation; inflectional.

29 McEnery & Wilson's definition of "corpus": sampled & representative; finite size; machine-readable; "standard reference" ???

