Corpora: Annotating and Searching
LING 5200 Computational Corpus Linguistics
Martha Palmer

Presentation transcript:


LING 5200, 2006 BASED on Kevin Cohen’s LING

Features of corpora
- Size (little/big/huge)
- Plasticity (finite/monitor)
- Metadata (none/lots)
- Annotation (none, …, lots)
- Balance


Features: size
- Relative over time
  - 1960s: 1M words (Brown)
  - 1990s: 4.5M words (Penn Treebank)
  - 2000s: 415M words (Bank of English, BOE)
  - 2000s: 1000M words (English Gigaword)
- Currently, micro/small/large/massive

Features: plasticity
- Finite
  - size established in advance
  - sample sizes adjusted accordingly
  - doesn't change over time
- Monitor
  - allows diachronic analysis
  - grows over time

Metadata
- (practically) none
  - language, at least
  - document boundaries
- some
  - document attributes: title, body, author, date

Example record:
PMID
DP Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
PG
AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of

Metadata
- Lots
  - Author characteristics: gender, age, mother tongue(s), dialect, educational level
  - genre classification: news, scientific, personal
  - topic
  - relevance

Example record (subject headings):
MH - Aged
MH - Azores/ethnology
MH - Cerebellar Ataxia/diagnosis
MH - Gene Frequency
MH - Human
MH - Phenotype
MH - Portugal/ethnology
MH - Support, Non-U.S. Gov't
MH - Syndrome
MH - United States
MH - Variation (Genetics)
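
The MH lines in the record above follow a tagged-field layout: a short uppercase tag, a hyphen, then a value. A minimal sketch of reading such lines into a dictionary is given below. The function name `parse_tagged_fields` and the simplified `TAG - value` pattern are assumptions made for illustration; real MEDLINE exports pad the tags and wrap long values across lines, which this sketch does not handle.

```python
import re
from collections import defaultdict

# Assumed simplified layout: uppercase tag, optional spaces, hyphen, then the value.
FIELD = re.compile(r"^([A-Z]{2,4})\s*-\s+(.*)$")

def parse_tagged_fields(lines):
    """Group field values by tag; repeated tags (e.g. MH) accumulate in a list."""
    fields = defaultdict(list)
    for line in lines:
        match = FIELD.match(line.strip())
        if match:
            tag, value = match.groups()
            fields[tag].append(value.strip())
    return dict(fields)

# A few lines copied from the slide's example record.
record = [
    "TI - The natural history of Machado-Joseph disease.",
    "MH - Aged",
    "MH - Azores/ethnology",
    "MH - Cerebellar Ataxia/diagnosis",
    "MH - Support, Non-U.S. Gov't",
]

fields = parse_tagged_fields(record)
print(fields["MH"])
```

Keeping every tag's values in a list means a search over subject headings can treat single and repeated fields the same way.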

Balanced corpora
- What are you balancing?
- Most common: genre
- Authors
  - gender
  - age
  - education
  - dialect
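
One way to see whether a corpus is balanced on some attribute is to count documents per category in its metadata. The sketch below does this for genre; the document records, the attribute names (`genre`, `author_gender`), and the helper `category_shares` are all hypothetical and only illustrate the idea.

```python
from collections import Counter

def category_shares(documents, attribute):
    """Return the proportion of documents falling into each value of `attribute`."""
    counts = Counter(doc[attribute] for doc in documents)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical metadata records, one per document.
documents = [
    {"id": "d1", "genre": "news", "author_gender": "f"},
    {"id": "d2", "genre": "news", "author_gender": "m"},
    {"id": "d3", "genre": "scientific", "author_gender": "f"},
    {"id": "d4", "genre": "personal", "author_gender": "m"},
]

shares = category_shares(documents, "genre")
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.2f}")
```

The same function can be called with `"author_gender"` (or any other metadata field) to check balance along a different dimension.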

Balanced corpora
Composition of the International Corpus of English (adapted from Meyer 2002), writing branch:
- speech
- writing
  - unpublished
  - published
    - non-fiction
      - informative: academic, popular, news
      - instructional
      - persuasive
    - fiction

Balanced corpora
Composition of the International Corpus of English (adapted from Meyer 2002), speech branch:
- speech
  - dialogue
  - monologue
    - scripted: talks, news, speeches
    - unscripted
- writing

Corpus length
- Overall length
- Sample size
  - partial
    - 2,000 words (Brown, LOB, ICE)
    - 5,000 words (London-Lund)
  - full
    - takes up space
    - copyright permission issues are harder

Sample size
- Motivating assumption: it is more important to maximize the number of authors/genres than the length of text from each
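
The partial-sample approach can be sketched as taking a fixed number of words from each source text. The code below is a rough illustration under simplifying assumptions: tokens are split on whitespace, the sample is taken from the start of the text, and the names `take_sample` and `build_sampled_corpus` are made up for this example. The corpora named on the slide follow their own sampling procedures.

```python
def take_sample(text, sample_size=2000):
    """Return the first `sample_size` whitespace-separated tokens of a text."""
    return text.split()[:sample_size]

def build_sampled_corpus(texts, sample_size=2000):
    """Map each text id to a fixed-size partial sample of that text."""
    return {text_id: take_sample(text, sample_size) for text_id, text in texts.items()}

# Placeholder texts: one longer than the sample size, one shorter.
texts = {
    "long_text": "word " * 6000,
    "short_text": "word " * 500,
}

corpus = build_sampled_corpus(texts, sample_size=2000)
for text_id, tokens in corpus.items():
    print(text_id, len(tokens))
```

Capping every text at the same length keeps any one author or genre from dominating the counts, which is the motivating assumption stated above.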


By purpose
- Linguistic-y
  - lexicon vs. other
- NLP
  - General purpose
  - information retrieval
  - information extraction
- Foreign language instruction
  - Native L2
  - "Learner" L2

Is there a corpus…


Annotation
- None/some/lots

Annotation
- None
  - "collection"
- Some
  - POS
  - lemmas
    lemma(be) = {be, am, is, are, was, were, being, been}
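
Lemma annotation can be pictured as a lookup from each inflected form to its lemma. The sketch below inverts the slide's `lemma(be)` set (with "was" included, since it is also a form of "be") into a form-to-lemma table. The names `build_form_index` and `lemmatize` are hypothetical, and unknown words are simply lowercased, which a real lemmatizer would not do.

```python
# Lemma -> set of inflected forms, following the slide's example for "be".
LEMMA_FORMS = {
    "be": {"be", "am", "is", "are", "was", "were", "being", "been"},
}

def build_form_index(lemma_forms):
    """Invert a lemma -> forms mapping into a form -> lemma lookup table."""
    index = {}
    for lemma, forms in lemma_forms.items():
        for form in forms:
            index[form] = lemma
    return index

FORM_TO_LEMMA = build_form_index(LEMMA_FORMS)

def lemmatize(tokens):
    """Replace known forms by their lemma; other tokens are only lowercased."""
    return [FORM_TO_LEMMA.get(token.lower(), token.lower()) for token in tokens]

print(lemmatize("They were being watched".split()))
```

With lemmas in place, a single search for "be" can retrieve every occurrence of "am", "is", "were", and so on.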

Annotation
- Lots
  - syntax (treebank, "bracketing")
  - semantics
    - predicate/argument structure
    - ontological

Example sentence: Dogs make me happy.
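
To make "bracketing" concrete, the sketch below reads a bracketed parse of the example sentence into a nested structure and lists each word with its POS tag. The bracketing string is an illustrative Penn Treebank-style analysis assumed for this example; it is not taken from the slides or from any released treebank. The helper names `parse_bracketing` and `tagged_leaves` are likewise made up.

```python
def parse_bracketing(text):
    """Parse a bracketed string into nested (label, children) tuples; words stay strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()

def tagged_leaves(node):
    """Collect (word, POS) pairs from preterminal nodes, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    leaves = []
    for child in children:
        if isinstance(child, tuple):
            leaves.extend(tagged_leaves(child))
    return leaves

# Assumed, illustrative bracketing of the slide's example sentence.
BRACKETING = "(S (NP (NNS Dogs)) (VP (VBP make) (S (NP (PRP me)) (ADJP (JJ happy)))) (. .))"

tree = parse_bracketing(BRACKETING)
print(tagged_leaves(tree))
```

The same nested structure supports structural searches, for example finding every NP directly under S, which plain text or POS tags alone cannot express.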

Diachronic
- Historical (OE, ME, …)
- Later sampling of earlier balanced corpus
- Monitor

Spoken
- Phonetically motivated (elicited)
- Other ("natural")

Multilingual
- Parallel
  - L1 contents == L2 contents
  - Parliamentary proceedings in English & French
  - Shakespeare in English and German
- Translation/comparable
  - two L1's; genre == genre
  - E.g., weather reports
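
A parallel corpus can be stored as a list of aligned sentence pairs, so that a search on one side returns the matching text on the other. The sketch below assumes sentence-level alignment has already been done. The English/French pairs are invented for illustration (they are not from any parliamentary corpus), and `find_aligned` is a hypothetical helper.

```python
# Invented, already-aligned (English, French) sentence pairs.
ALIGNED_PAIRS = [
    ("It is raining.", "Il pleut."),
    ("It is snowing.", "Il neige."),
    ("Thank you very much.", "Merci beaucoup."),
]

def find_aligned(pairs, word):
    """Return the pairs whose first-language side contains `word` (case-insensitive)."""
    target = word.lower()
    hits = []
    for source, translation in pairs:
        source_tokens = [token.strip(".,!?").lower() for token in source.split()]
        if target in source_tokens:
            hits.append((source, translation))
    return hits

for source, translation in find_aligned(ALIGNED_PAIRS, "raining"):
    print(source, "=>", translation)
```

A comparable corpus would not support this lookup, since its texts match in genre but are not translations of each other.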

Penn Treebank
- treebank: corpus of syntactically annotated data
- first release: 4.5 million words, 3 years' work
- currently 4.9 M

Penn Treebank

Penn Treebank
- POS-tagged Switchboard data
- Dysfluency-annotated Switchboard data
- Syntactically annotated Switchboard data

GENIA
- 2000 abstracts
- red blood cell transcription factors
- POS-tagged (HW2, #16)
- semantic annotation with molecular biology ontology

Corpora/resources
- Dictionaries, ontologies, ...
  - CELEX
  - WordNet

Corpora/resources
- Dictionaries, ontologies, ...
- "discovery procedure"
  - phonology
    - contrasts
    - phonotactics
  - morphology
    - term formation
    - inflectional

McEnery & Wilson's definition of "corpus"
- sampled & representative
- finite size
- machine-readable
- "standard reference" ???