Research methods in corpus linguistics Xiaofei Lu.

2 Overview
- What is a corpus?
- Types of corpora
- Corpus design
- Where to obtain corpora
- Corpus annotation
- Corpus analysis
- Note on research project design
- Exercises and demos in between
- Future courses on corpus linguistics

3 What is a corpus?
- Leech (1992): an unexciting phenomenon, a helluva lot of text, stored on a computer
- Francis (1982): a collection of texts assumed to be representative of a given language, dialect, or other subset of a language to be used for linguistic analysis
- Sinclair (1991): a collection of naturally-occurring language text, chosen to characterise a state or a variety of language

4 Types of corpora
- General-purpose monolingual corpora
  - The British National Corpus
- Specialized corpora
  - Lancaster Corpus of Academic Written English
- Learner corpora
  - International Corpus of Learner English
- Parallel & comparable corpora
  - The JRC-Acquis Multilingual Parallel Corpus
  - The English-Chinese Parallel Concordancer
- Corpora and varieties
  - International Corpus of English
- Synchronic and diachronic corpora

5 Corpus design
- Purpose
- Comparability
- Type
- Content: mode, interaction, domain, medium
- Structure: proportions
- Size
- Sampling?
- Design of the BNC

6 Where to obtain corpora
- Linguistic Data Consortium
- Bookmarks for corpus-based linguists
- Ask on the corpora list
- Compile your own corpora
  - Design your corpus
  - Getting permission
  - File format, metadata, and data markup
  - Text capture
    - Scanning, typing, electronic files, web crawlers, e.g., WebSPHINX
    - Transcription tools, e.g., Transcriber
  - A Guide to Good Practice

7 Corpus annotation
- Why annotate
- Levels of corpus annotation
- Difficulties for corpus annotation
- Tools for corpus annotation

8 Why annotate
- For linguistic research
  - Allow more effective corpus searches
- For natural language processing
  - Spelling and grammar checking
  - Text summarization
  - Machine translation
  - Question answering

9 Levels of corpus annotation
- Sentence segmentation
- Word segmentation/tokenization
- Part-of-speech (POS) tagging
- Chunking/shallow parsing
- Syntactic parsing
- Semantic annotation
- Pragmatic annotation
- Parallel corpora: sentence alignment
- Learner corpora: error annotation

10 Difficulties for corpus annotation
- Ambiguity
  - Example: I saw a pig with binoculars.
  - Problems for tagging, parsing, & word sense disambiguation (WSD)
- Unknown words
  - Identification
  - POS tagging
  - Semantic annotation

11 Tools for corpus annotation
- Bookmarks for corpus-based linguists
- Corpora and Corpus Annotation Tools on the WWW
- POS tagger demonstration
  - Sentence segmentation
  - POS tagging
  - Extracting NPs of the form DT NN NN
- Dexter: Tools for analyzing language data
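The last step of the tagger demonstration, extracting NPs of the form DT NN NN, can be sketched in a few lines of Python once a text has been POS-tagged. This is only an illustrative sketch: the function name `extract_dt_nn_nn` is hypothetical, the input is assumed to be a list of (word, tag) pairs using Penn Treebank tags, and the sample sentence was tagged by hand rather than by any of the tools listed above.

```python
def extract_dt_nn_nn(tagged):
    """Return every three-word sequence whose tags are DT NN NN.

    `tagged` is assumed to be a list of (word, tag) pairs.
    """
    pattern = ("DT", "NN", "NN")
    matches = []
    for i in range(len(tagged) - len(pattern) + 1):
        window = tagged[i:i + len(pattern)]
        # Compare the tag sequence in the window with the target pattern.
        if tuple(tag for _, tag in window) == pattern:
            matches.append(" ".join(word for word, _ in window))
    return matches


# Hand-tagged sample sentence (an assumption, not tagger output).
tagged_sentence = [
    ("The", "DT"), ("research", "NN"), ("team", "NN"), ("bought", "VBD"),
    ("a", "DT"), ("laser", "NN"), ("printer", "NN"), (".", "."),
]

noun_phrases = extract_dt_nn_nn(tagged_sentence)
print(noun_phrases)
```

The sliding window is deliberately naive: it matches the tag sequence wherever it occurs, so it can also pick up the first three words of a longer noun phrase.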

12 Corpus analysis
- Levels of corpus analysis
- Tools for corpus analysis
- Interpreting corpus data

13 Levels of corpus analysis
- Word frequency lists
- Concordances
  - Collocation (lexical patterning)
  - Colligation (syntactic patterning)
- Keyword lists

14 Tools for corpus analysis
- Bookmarks for corpus-based linguists
- Recommendations:
  - WordSmith Tools (not free)
  - AntConc (free)
  - TextStat (free)
- Unix tools
- Write your own scripts

15 Exercise (part 1)
- Download and install AntConc
- Download some text for processing
  - Project Gutenberg
- Generate a word frequency list for your mini-corpus
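For those who prefer the "write your own scripts" route, a word frequency list can also be produced with a short Python script. The sketch below is not how AntConc works internally; the function name `word_frequencies`, the simple regular-expression tokenizer, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Naive tokenizer: runs of letters, allowing internal apostrophes.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Stand-in for a downloaded text file (an assumption for this sketch).
sample = "The cat sat on the mat. The mat was the cat's favourite."
freq = word_frequencies(sample)

# Print the list in descending order of frequency.
for word, count in freq.most_common():
    print(f"{count}\t{word}")
```

To process a real file, the sample string could be replaced with the contents of a plain-text download, for example one read with `open(path, encoding="utf-8").read()`.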

16 Interpreting corpus data
- Are frequency differences statistically significant?
  - w appears x times in an n-word corpus, and y times in an m-word corpus
  - Chi-square test (unreliable for small frequencies)
  - Fisher’s Exact Test (the standard form is limited to a 2×2 cross table)
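Both tests can be worked through by hand for the 2×2 case described here, where w appears x times in an n-word corpus and y times in an m-word corpus. The following Python sketch uses only the standard library (Python 3.8 or later for `math.comb`). The function names and the example counts are hypothetical. The chi-square version is Pearson's statistic without continuity correction, and the Fisher version is the common two-sided variant that sums the probabilities of all tables no more likely than the observed one. Both choices are assumptions, since the slides do not say which variants the linked calculators use.

```python
import math


def chi_square_2x2(x, n, y, m):
    """Pearson chi-square (no continuity correction) and p-value, df = 1.

    Assumes w occurs in at least one corpus and is not the only word.
    """
    a, b = x, n - x          # corpus 1: occurrences of w, all other tokens
    c, d = y, m - y          # corpus 2: occurrences of w, all other tokens
    total = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = total * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def fisher_exact_2x2(x, n, y, m):
    """Two-sided Fisher's exact test p-value for the same 2x2 table."""
    row1, col1, total = n, x + y, n + m

    def prob(k):
        # Hypergeometric probability that k of the col1 occurrences fall in corpus 1.
        return (math.comb(col1, k) * math.comb(total - col1, row1 - k)
                / math.comb(total, row1))

    p_obs = prob(x)
    lo = max(0, row1 + col1 - total)
    hi = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Made-up counts: 30 hits in 1,000 words versus 10 hits in 1,000 words.
chi2, p_chi2 = chi_square_2x2(30, 1000, 10, 1000)
p_fisher = fisher_exact_2x2(30, 1000, 10, 1000)
print(f"chi-square = {chi2:.3f}, p = {p_chi2:.4f}")
print(f"Fisher's exact p = {p_fisher:.4f}")
```

The exact computation is written for clarity rather than speed, so it suits small examples; for corpora of millions of words a statistics package would be the more practical choice.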

17 Exercise (part 2)
- Compare your word frequency list with that of the BNC
- Anything interesting?
- Run the chi-square test and Fisher’s Exact Test on some interesting words

18 Interpreting corpus data (cont.)
- Collocational analysis: how strongly are X and Y associated?
  - Mutual information (MI)
    - Measures difference between observed and expected frequencies of (X,Y)
    - Higher MI, stronger association
    - Doesn’t work well for low frequencies
  - T-test
    - Measures confidence with which to claim strong association between X and Y
    - Higher t-score, stronger association
  - Online calculations
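The slide does not give formulas for the two measures. A minimal sketch, assuming the formulations commonly used in corpus linguistics, is MI = log2(O / E) and t = (O − E) / √O, where O is the observed co-occurrence frequency of (X,Y) and E = f(X) · f(Y) / N is the frequency expected by chance in an N-word corpus. The function names and the counts below are hypothetical, and no adjustment is made for the size of the collocation window.

```python
import math


def expected_frequency(f_x, f_y, corpus_size):
    """Co-occurrence frequency expected if X and Y were independent."""
    return f_x * f_y / corpus_size


def mutual_information(f_xy, f_x, f_y, corpus_size):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy / expected_frequency(f_x, f_y, corpus_size))


def t_score(f_xy, f_x, f_y, corpus_size):
    """t-score: (observed - expected) scaled by the square root of observed."""
    return (f_xy - expected_frequency(f_x, f_y, corpus_size)) / math.sqrt(f_xy)


# Made-up counts: X occurs 1,000 times, Y 500 times, and the pair 50 times
# in a corpus of 1,000,000 words.
mi = mutual_information(50, 1000, 500, 1_000_000)
t = t_score(50, 1000, 500, 1_000_000)
print(f"MI = {mi:.3f}, t-score = {t:.3f}")
```

Because MI is a ratio, a pair seen only once or twice can still score very high when both words are rare, which is the low-frequency weakness noted above. The t-score weights the raw number of co-occurrences more heavily and is less affected by this.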

19 Exercise (part 3)
- Generate a concordance for a target word
- Find a word that co-occurs frequently with the target word
- Test if the word is strongly associated with the target word
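The first two steps of this exercise can also be tried outside a concordancer. The Python sketch below builds keyword-in-context lines for a target word and counts the words that appear within a fixed window around it. The function names, the tokenizer, the window size, and the sample text are assumptions made for illustration and do not reflect how any particular tool works.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def concordance(tokens, target, width=4):
    """Return (left context, target, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, target, width=4):
    """Count words occurring within `width` tokens of the target word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - width):i])
            counts.update(tokens[i + 1:i + 1 + width])
    return counts


# Sample text (an assumption standing in for a real mini-corpus).
text = "Strong tea is better than weak tea. He made strong tea every morning."
tokens = tokenize(text)

for left, word, right in concordance(tokens, "tea", width=1):
    print(f"{left:>10} [{word}] {right}")

neighbours = collocates(tokens, "tea", width=1)
print(neighbours.most_common(1))
```

The counts from `collocates`, together with the overall frequencies of the two words and the corpus size, are the inputs needed for the association test in the third step.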

20 Note on research project design
- Purpose of project
- Corpus compilation and annotation
- Corpus analysis
  - Bottom-up: from observations of recurring patterns to hypotheses and generalizations
  - Top-down: start with given categories and search for evidence of use and variance
- Caution on generalizability

21 Future courses on corpus linguistics
- Spring 2007
  - APLING 597E: Introduction to Corpus Linguistics
    - Hands-on course on principles and tools for corpus compilation, annotation, processing, and analysis
- Spring 2008
  - APLING 597: Seminar on Corpus Linguistics
    - Advanced seminar on using corpora for serious research projects