Corpus Linguistics Anca Dinu February, 2017.

Slides:



Advertisements
Similar presentations
Uses of a Corpus “[E]xplore actual patterns of language use”
Advertisements

Using Corpus Tools in Discourse Analysis Discourse and Pragmatics Week 12.
Recent Developments in Technological Tools for the Purpose of Facilitating SLA.
What is a corpus?* A corpus is defined in terms of  form  purpose The word corpus is used to describe a collection of examples of language collected.
1/26 Corpus Linguistics. 2/26 Varieties of English Relevance of corpus linguistics to this course –Previously studies of stylistics were largely informal.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Corpus Linguistics: session 2 Corpus Linguistics (2): The Tools of the Trade 669o4zt
Corpus Linguistics What can a corpus tell us ? Levels of information range from simple word lists to catalogues of complex grammatical structures and.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
Research methods in corpus linguistics Xiaofei Lu.
Chapter 3: An Introduction to Corpus Linguistics Compiled by: Sajjad Ghadamyari Farhad Ghiasvand Presentation Date: Dec. 8, Monday.
CORPUS LINGUISTICS: AN INTRODUCTION Susi Yuliawati, M.Hum. Universitas Padjadjaran
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Communicative Language Teaching Vocabulary
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
Translation Studies 8. Research methods in Translation Studies Krisztina Károly, Spring, 2006 Sources: Károly, 2002; Klaudy, 2003.
Researching language with computers Paul Thompson.
©2006 Barry Natusch Tools for Language Researchers Barry Natusch “ Man is a tool-using animal. Without tools he is nothing, with tools he is all. ” - Thomas.
ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES exploring frequencies in texts Bambang Kaswanti Purwo
Chapter 10 Language and Computer English Linguistics: An Introduction.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
Jan 9, 2004 Symposium on Best Practice LSA, Boston, MA 1 Comparability of language data and analysis Using an ontology for linguistics Scott Farrar, U.
New Englishes. Global English  ‘[…] the English language ceased to be the sole possession of the English some time ago’ (Rushdie, 1991)  Loss of ownership.
Levels of Linguistic Analysis
Corpus Linguistics MOHAMMAD ALIPOUR ISLAMIC AZAD UNIVERSITY, AHVAZ BRANCH.
What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.
INTRODUCTION TO APPLIED LINGUISTICS
CORPUS LINGUISTICS 1) A revision of corpus linguistics 2) Language corpora in the ESL/EFL classroom.
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 جامعة الملك فيصل عمادة.
Labov’s Principles—1972 Language in Society, Vol.1 No. 1 “ Principles ” 1.Cumulative Principle 2.The Neogrammarian Hypothesis 3.The Uniformitarian Principle.
PRIMENJENA LINGVISTIKA I NASTAVA JEZIKA II 3 rd class.
Using Parallel Corpora for Contrastive Studies Michael Barlow.
Text Linguistics. Definition of linguistics Linguistics can be defined as the scientific or systematic study of language. It is a science in the sense.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Language Identification and Part-of-Speech Tagging
contrastive linguistics
Collecting Written Data
An Introduction to Linguistics
Syntax 1 Introduction.
Cognitive Processes in SLL and Bilinguals:
CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. An approach to derive at a set of.
Inquiry, Pedagogy, & Technology: Automated Textual Analysis of 30 Refereed Journal Articles David A. Thomas Mathematics Center, University of Great Falls,
XML QUESTIONS AND ANSWERS

Using Corpora in Linguistics
What is linguistics?.
Computational and Statistical Methods for Corpus Analysis: Overview
Exploring the BNC Corpus
Phil Durrant Mark Brenchley Debra Myhill
Corpus Linguistics I ENG 617
عمادة التعلم الإلكتروني والتعليم عن بعد
Introduction to Corpus Linguistics: Exploring Collocation
Natural Language Processing (NLP)
Introduction to Corpus Linguistics: Dispersion/concordance plots
contrastive linguistics
Corpus-Based ELT CEL Symposium Creating Learning Designers
Social Knowledge Mining
Phil Durrant Debra Myhill Mark Brenchley
Topics in Linguistics ENG 331
Social Research Methods
Levels of Linguistic Analysis
Using GOLD to Tracking L2 Development
Applied Linguistics Chapter Four: Corpus Linguistics
COMPARATIVE Linguistics 2018/2019
Natural Language Processing (NLP)
contrastive linguistics
AntConc Search Wildcards (not Regex)
contrastive linguistics
Natural Language Processing (NLP)
Presentation transcript:

Corpus Linguistics Anca Dinu February, 2017

Where did corpus linguistics come from? Understanding that language in use is worthy of study; Understanding that large quantities of authentic language are needed for meaningful study; Understanding that context is important; General shift in social sciences to empiricism; Rise of technology; Recognition that a data-based approach opens up research.

Corpus Linguistics Corpus linguistics is a method of carrying out linguistic analyses. Corpus linguistics is the analysis of naturally occurring language on the basis of computerized corpora. Usually, the analysis is performed with the help of the computer, i.e. with specialized software, and takes into account the frequency of the phenomena investigated.

Corpus Linguistics It has become one of the most wide-spread methods of linguistic investigation. It can be used for the investigation of many kinds of linguistic questions. It has the potential to yield highly interesting, fundamental, and often surprising new insights about language.

Linguistic Data What data do linguists use to investigate linguistic phenomena? Roughly, four types of data can be distinguished: 1) data gained by intuition a) the researcher’s own intuition (“introspection”) b) other people’s intuition (accessed, for example, by elicitation tests) 2) naturally occurring language a) randomly collected texts or occurrences (“anecdotal evidence”) b) systematic collections of texts - CORPORA

Corpora A corpus is as a systematic collection of naturally occurring texts (of both written and spoken language). Systematic means that the structure and contents of the corpus follows certain extralinguistic principles or criteria.

Corpora For example, the texts or transcriptions of a corpus are often restricted to certain time span, domain, genre, style, dialect, language, etc... If several of these subcategories are present in a corpus, these are often represented by the same amount of text and separated as such in the corpus. Different types of corpora are used for different kind of analysis.

Use of corpora In linguistics, the typical use of corpora is: the (in)validation of linguistic hypothesis and statistical analysis of the linguistic data (Corpus Pattern Analysis, frequency lists, word co-occurrences, concordances, idioms, structures). The (semi-)automated data extraction (like argumental structure, thematic role) for the creation of electronic lexicons.

What corpora are there? Depending of the type of text or transcript, corpora can be: general/reference corpora (vs. specialized corpora) (e.g. BNC = British National Corpus, or Bank of English) aim at representing a language or variety as a whole (contain both spoken and written language, different text types etc.) historical corpora (vs. corpora of present-day language) (e.g. Helsinki Corpus, ARCHER) aim at representing an earlier stage or earlier stages of a language.

What corpora are there? regional corpora (vs. corpora containing more than one variety) (e.g. WCNZE = Wellington Corpus of Written New Zealand English) aim at representing one regional variety of a language. learner corpora (vs. native speaker corpora) (e.g. ICLE = International Corpus of Learner English) aim at representing the language as produced by learners of this language. multilingual corpora (vs. one-language corpora) aim at representing several, at least two, different languages, often with the same text types (for contrastive analyses). spoken (vs. written vs. mixed corpora) (e.g. LLC = London-Lund Corpus of Spoken English) aim at representing spoken language.

Annotation Annotation of corpora means that some kind of linguistic analysis has already been performed and marked on the texts, such as sentence analysis, or part of speach tagging. Depending of the type of annotation made on the text or transcript, a corpus can be: un-annotated (ortographic, raw, with just meta-annotation), phonetically, morphologically, syntactically, semantically or pragmatically annotated.

Annotation Annotation schemata should focus on a single coherent theme: Different linguistic phenomena should be annotated separately over the same corpus. Annotations must be consistent with each other: Unification and merging of multiple annotation is necessary.

Example of semantic annotation Predicators and their named arguments: [The man]agent painted [the wall]patient. • Anaphors and their antecedents: [The protein] inhibits growth in yeast. [It] blocks production . . . • Acronyms and their long forms: [Platelet-derived growth factor] (known as [pdgf]) impacts . . . • Semantic Typing of entities: [The man]human fired [the gun]firearm.

Annotation Corpus annotation is usually made in a standardized manner with: XML (eXtensible Markup Language), designed to be both human- and machine-readable, via intuitive tags. Or TEI (Text Encoding Initiative), a text-centric community of practice that defined text guidelines in XML format).

Corpus Software Two types of software for corpus analysis can be distinguished in principle: software that is tailored to one specific corpus, (such as SARA and BNCWeb for BNC, or ICE-CUP for ICE-GB) and software that can be used with almost any kind of corpus (such as AntConc, MonoConc Pro and WordSmith Tools, which is probably the most widely used corpus software) We will use AntConc.

What can the software do? While there are many differences between the software packages designed for corpus analysis, certain basic functions can be performed by practically all the available software. For most kinds of linguistic analyses, the most important one of these is the possibility of searching the corpus in question for the (co-)occurrence of certain strings (words or phrases).

What can the software do? As output, the software then usually gives information on: the number of these strings occurring in the corpus, on the text in which they were found, and the so-called concordance-lines, which show the string in question in context (with the search term(s) highlighted).

Bibliography Nadja Nesselhauf , Corpus Linguistics: A Practical Introduction, 2011 Charlotte Taylor, What is corpus linguistics? What the data says, ICAME Journal No. 32, 2008