Current trends in corpus linguistics


Sinclair (1991: 171)
A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. In modern computational linguistics, a corpus typically contains many millions of words: this is because it is recognised that the creativity of natural language leads to such immense variety of expression that it is difficult to isolate the recurrent patterns that are the clues to the lexical structure of the language.

EAGLES (Expert Advisory Group on Language Engineering Standards)
A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. Note that the non-committal word `pieces' is used above, and not `texts'. This is because of the question of sampling techniques used. If samples are to be all the same size, then they cannot all be texts. Most of them will be fragments of texts, arbitrarily detached from their contexts.

A computer corpus is a corpus which is encoded in a standardised and homogenous way for […] retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

The Text Encoding Initiative (TEI)
An international standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using a strict encoding scheme. Its main aim is the reusability of corpora. http://www.tei-c.org/

The International Computer Archive of Modern and Medieval English (ICAME)
An international organization of linguists and information scientists working with English machine-readable texts.

Corpus design
In order to draw conclusions that are significant, one has to adhere to clearly defined rules in the composition of a corpus. If there is a selection bias, the conclusions will not be valid. Sinclair (1991: 13) even argues that the job could be « outsourced » to social scientists.

Spoken vs. written
Most corpora are short on data that reflect spoken use of the language. The EAGLES guidelines warn against the use of material that is not “gathered from the genuine communications of people going about their normal business. […] For example, some television shows deliberately put participants into artificial and indeed bizarre conditions and induce extremely odd responses. Casual conversation is expected to be impromptu but it can be rehearsed by one or more parties.”

The birth of corpus linguistics
Corpus linguistics is linked with the advent of the computer. An early landmark is Computational Analysis of Present-Day American English (Kučera and Francis 1967). The Brown Corpus, compiled in the early 1960s, was a carefully chosen selection of current American English, totalling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses. Their book combines elements of linguistics, psychology, statistics, and sociology.

Corpus linguistics in the UK
The British National Corpus contains 100 million words of modern British English, 10% of them spoken. It has inspired various works, notably Sinclair (1990). It is searchable through the website Phrases in English.

Corpus linguistics in France
The FRANTEXT database was created in the 1960s and is maintained by the INALF. It contains texts that range from the Renaissance period to modern French. The corpus is made up of about 80% literary works and 20% technical or scientific writing. It served as a basis for the « Trésor de la langue française informatisé »: http://atilf.atilf.fr/tlf.htm
Base lexicale du français (Binon, Verlinde): http://ilt.kuleuven.be/blf/

Fillmore’s description of the two approaches in “‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’” (1992)
The corpus linguist: "He has all the primary facts that he needs, in the form of a corpus of approximately one zillion running words, and he sees his job as that of deriving secondary facts from his primary facts. At the moment, he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus the second word of a sentence."

The "armchair" (introspective) linguist: "He sits in a deep soft armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, ‘Wow, what a neat fact!’, grabs his pencil, and writes something down… having come still no closer to knowing what language is really like."

Chomsky’s opinion about corpus linguistics (1958 conference)
“Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list.”

Chomsky criticized corpus data as being only a small sample of a potentially infinite population. This criticism can be applied not just to CL but to any form of scientific investigation which is based on sampling. Chomsky’s criticism was based on the fact that corpora were relatively small when he started airing those views.

Chomsky on corpus linguistics (2004 interview)
“Corpus linguistics doesn't mean anything. It’s like saying […] suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights.”

Performance may be flawed or ungrammatical, owing to lapses of attention or memory or to other psychological factors, and consequently cannot be taken at face value. The ‘raw data’ has to be ‘idealised’.

Chomsky (1965, p. 4) admitted the similarity between the competence-performance distinction and the Saussurian langue-parole distinction; but to him, whereas langue is merely a "systematic inventory of items," competence refers to the conception of "a system of generative processes." The motivation for the distinction stems from the observation that the grammaticality of individuals' speech fluctuates, and from giving this observation its proper theoretical significance: the speech of individuals does not directly reflect their grammatical knowledge.

A mature speaker's knowledge of his language does not fluctuate from moment to moment as does the grammaticality of his utterances. Consequently, the linguist's task in building a grammar of his native language becomes, in effect, one of describing the speaker's "permanent knowledge" of his language, or his linguistic competence. It is then left for the psychologist to describe how the interfering effects that manifest themselves during speaking interact with the speaker's linguistic competence to produce the grammatically impaired utterances that are typical in everyday situations.

Corpus-based linguistics
The essential characteristics of corpus-based analysis, according to Biber (1998: 4):
it is empirical, analysing the actual patterns of use in natural texts;
it utilizes a large and principled collection of natural texts, known as a “corpus”, as the basis for analysis;
it makes extensive use of computers for analysis, using both automatic and interactive techniques;
it depends on both quantitative and qualitative techniques.

Tagging
The first editions of the TEI Guidelines used the Standard Generalized Markup Language (SGML). The most recent edition can also be expressed in the Extensible Markup Language (XML).

An example
<pb n='474'/>
<div1 type="chapter" n='38'>
<p>Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said —</p>
<p><q>Mary, I have been married to Mr Rochester this morning.</q> The housekeeper and her husband were of that decent, phlegmatic order of people,[…]; but Mary, bending again over the roast, said only — </p>
<p><q>Have you, miss? Well, for sure!</q></p>

A TEI document at the textual level consists of the following elements:
<front> contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.
<group> contains a number of unitary texts or groups of texts.
<body> contains the whole body of a single unitary text, excluding any front or back matter.
<back> contains any appendixes, etc., following the main part of a text.
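Because the markup is XML, a tagged text can be queried with any standard XML parser. The following is a minimal sketch in Python, not a TEI tool: the `SAMPLE` string is a shortened version of the excerpt above, with a wrapping `<body>` and a closing `</div1>` added as assumptions so that it is well-formed, and the helper names `quoted_speech` and `chapter_numbers` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt; <body> and </div1> are added here (an assumption)
# so that the fragment is well-formed XML.
SAMPLE = """<body>
<div1 type="chapter" n="38">
<p>Reader, I married him.</p>
<p><q>Mary, I have been married to Mr Rochester this morning.</q></p>
<p><q>Have you, miss? Well, for sure!</q></p>
</div1>
</body>"""


def quoted_speech(xml_text):
    """Return the text of every <q> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(q.itertext()).strip() for q in root.iter("q")]


def chapter_numbers(xml_text):
    """Return the n attribute of every <div1> whose type is 'chapter'."""
    root = ET.fromstring(xml_text)
    return [d.get("n") for d in root.iter("div1") if d.get("type") == "chapter"]


if __name__ == "__main__":
    print(chapter_numbers(SAMPLE))
    for line in quoted_speech(SAMPLE):
        print(line)
```

Real TEI files usually declare an XML namespace and carry a header, both of which this sketch leaves out; element names would then need to be qualified accordingly.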

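Part-of-speech tagging
The man who was mixing it fell into the cement he was mixing.
the/DT man/NN who/WP was/VBD mixing/VBG it/PRP fell/VBD into/IN the/DT cement/NN he/PRP was/VBD mixing/VBG
The horse raced past the barn fell.
the/DT horse/NN raced/VBD past/JJ the/DT barn/NN fell/VBD
Note that the tags on "raced" and "past" in the second sentence follow the garden-path misreading: under the intended reading (the horse that was raced past the barn), "raced" is a past participle (VBN) and "past" a preposition (IN).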
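Tagged text in this word/TAG format is easy to process once it is split back into pairs. The sketch below, in Python, reads the first tagged sentence above and counts how often each tag occurs; the function names `parse_tagged` and `tag_frequencies` are hypothetical, and the code only reads existing tags, it does not perform tagging.

```python
from collections import Counter

# The first tagged sentence from the slide, copied as-is.
TAGGED = (
    "the/DT man/NN who/WP was/VBD mixing/VBG it/PRP fell/VBD into/IN "
    "the/DT cement/NN he/PRP was/VBD mixing/VBG"
)


def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_frequencies(pairs):
    """Count how many tokens carry each tag."""
    return Counter(tag for _, tag in pairs)


if __name__ == "__main__":
    for tag, count in tag_frequencies(parse_tagged(TAGGED)).most_common():
        print(tag, count)
```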

Parsing or Tree-tagging

Expression of syntactic dependencies via square brackets
[S [NP [NP [Det the][N man]][S [NP who][VP was mixing it]]] [VP [V fell] [PP [P into][NP [NP [Det the][N cement]][S he was mixing]]]]].
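A labelled bracketing of this kind can be read back into a tree with a short recursive parser. The sketch below is in Python and is only an illustration: the tuple representation `(label, children)` and the function names `parse_brackets` and `leaves` are assumptions made here, not part of any parsing toolkit.

```python
import re

# The bracketed analysis from the slide, without the final full stop.
BRACKETED = (
    "[S [NP [NP [Det the][N man]][S [NP who][VP was mixing it]]] "
    "[VP [V fell] [PP [P into][NP [NP [Det the][N cement]][S he was mixing]]]]]"
)


def parse_brackets(text):
    """Return a nested (label, children) tree; leaves are plain word strings."""
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected an opening bracket")
        pos += 1
        label = tokens[pos]  # the first token after '[' is the category label
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the closing bracket")
    return tree


def leaves(node):
    """Collect the words of a tree from left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


if __name__ == "__main__":
    tree = parse_brackets(BRACKETED)
    print(tree[0], [child[0] for child in tree[1]])
    print(" ".join(leaves(tree)))
```

Reading the leaves back in order is a quick sanity check that the bracketing covers the whole sentence.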

Semantic tagging
Semantic tagging is still in its infancy, but some promising applications using word sense disambiguation have been tested on easy cases (e.g. pen in English). WordNet is a thesaurus-like database that groups words into sets of synonyms (synsets), one per word sense. It is available in the major European languages.

Study of lexical co-occurrence
This is done through the use of concordancing software, which provides a KWIC (Key-Word in Context) display. Such software also provides a wide range of statistical information about the corpus and the collocates of any given word.

Example of a KWIC display
1. ent été piratés en 2005, hors piratages numériques via Internet, selon l'OCDE. 20080 ...
2. ient été piratés en 2005, hors piratage numériques via Internet, selon des chiffres publié ...
3. La Loi sur la confiance dans l'économie numérique (LCEN) ne prévoit pas une responsabilit ...
4. alisation de la technologie de synthèse numérique d'horloges de référence multiples (MRCG ...
5. MRCG) de Motorola. « Cette technologie numérique permet de s'affranchir des limites des ...
6. ue d'information, notamment d'appareils numériques multifonctions réseau (MFP) et d'imprim ...
7. tifs aux offres de logiciels d'imagerie numérique de Peerless, ainsi que tous les brevets ...
8. fabrique et commercialise des copieurs numériques couleur et noir et blanc, des appareils ...
9. puissant et des dernières technologies numériques réseau, Kyocera Mita soutient les entre ...
10. des marchés de l'imagerie documentaire numérique, comprenant notamment les fabricants de ...
11. couleur et monochromes, et d'appareils numériques. Afin de traiter les textes numériques ...
12. numériques. Afin de traiter les textes numériques et les graphiques, les produits d'image ...
13. s, les produits d'imagerie documentaire numériques se basent sur un logiciel d'imagerie et ...
14. lorsque cet objet est un enregistrement numérique, les États membres peuvent prévoir que ...
15. à disposition du demandeur, sous format numérique, sur un ou plusieurs sites publics acce ...
16. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
17. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
18. l Rights Management (gestion des droits numériques, euphémisme pour "protection contre la ...
19. l Rights Management (gestion des droits numériques, euphémisme pour "protection contre la ...
20. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
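A display of this kind can be produced with very little code. The sketch below, in Python, is a minimal illustration of the KWIC idea rather than a reconstruction of the software used for the display above: the function name `kwic`, the fixed character window and the prefix matching (so that "numérique" also finds "numériques") are all assumptions, and `SAMPLE` joins two fragments taken from the display.

```python
import re

# Two fragments from the display above, joined into one running text.
SAMPLE = (
    "Cette technologie numérique permet de s'affranchir des limites. "
    "Afin de traiter les textes numériques et les graphiques."
)


def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword starting in the same column.

    The keyword is matched as a word prefix, so inflected forms are found too.
    Context is measured in characters, not words.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\w*", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that every keyword lines up.
        lines.append(f"{left:>{width}} {match.group(0)} {right}")
    return lines


if __name__ == "__main__":
    for line in kwic(SAMPLE, "numérique", width=25):
        print(line)
```

Cutting the context by characters explains why the lines in such displays often begin and end in the middle of a word.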

Collocates for « droits » in a French corpus containing texts about intellectual property
59 respect 28 certains 44 voisins 214 propriété 31 aspects 14 tous 37 fondamentaux 4 concernés 23 protection 5 les 13 d'auteur 4 location 15 titulaires 5 différents 13 réservés 4 protégés 15 Charte 4 nouveaux 8 incorporels 4 — 8 protéger 4 lesdits 8 nationaux 3 page 6 Application 3 ayants 6 protégés 3 6 Inc 1 ces 5 exclusifs 3 brevet 5 respecte 1 II 5 visés 3 libertés 5 Français 1 section 5 d’auteur 2 sinon
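Collocate counts like these come from tallying the words that occur within a fixed distance of the node word. The Python sketch below shows one simple way of doing so, assuming a symmetric window measured in tokens and a crude regular-expression tokenizer; the sentence in `SAMPLE` is invented for illustration from words in the list above, and the function name `collocates` is hypothetical.

```python
import re
from collections import Counter

# Invented sample text (an assumption), built from words in the list above.
SAMPLE = (
    "le respect des droits de propriété intellectuelle et des droits voisins ; "
    "les droits de propriété industrielle"
)


def collocates(text, node, span=2):
    """Count the words found up to `span` tokens left or right of `node`."""
    # Crude tokenizer: runs of letters/digits, apostrophes and hyphens.
    tokens = re.findall(r"[\w'’-]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


if __name__ == "__main__":
    for word, count in collocates(SAMPLE, "droits").most_common(5):
        print(count, word)
```

Raw counts favour grammatical words such as "de" and "des"; concordancers typically offer association statistics to push these down the list.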

Terminological extraction
Terminological extraction (TE) is one of the fastest developing applications in the field of natural language processing (NLP), along with computer-assisted translation (CAT). It is based on the automatic identification of typical terminological syntactic patterns (e.g. ADJ N or N N in English). Terminological extraction produces a list of “candidate terms” from which the noise must be sifted.

An example of N-ADJ patterns drawn from the same corpus
532 propriété intellectuelle
61 parlement européen
57 propriété industrielle
49 santé publique
38 sanctions pénales
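Pattern-based extraction of this kind can be sketched in a few lines of Python once the text has been part-of-speech tagged. Everything below is illustrative: the tagged sample is hand-made, the tag names (N, ADJ, DET, CONJ) are simplified assumptions, and `candidate_terms` is a hypothetical helper, not the extractor used for the list above.

```python
from collections import Counter

# Hand-tagged sample (an assumption), using simplified tag names.
TAGGED = [
    ("la", "DET"), ("propriété", "N"), ("intellectuelle", "ADJ"),
    ("et", "CONJ"),
    ("la", "DET"), ("santé", "N"), ("publique", "ADJ"),
    ("ou", "CONJ"),
    ("la", "DET"), ("propriété", "N"), ("intellectuelle", "ADJ"),
]


def candidate_terms(tagged, pattern=("N", "ADJ"), min_freq=1):
    """Count word sequences whose tags match `pattern` exactly.

    min_freq defaults to 1 so that terms occurring only once are kept.
    """
    size = len(pattern)
    counts = Counter()
    for i in range(len(tagged) - size + 1):
        window = tagged[i:i + size]
        if tuple(tag for _, tag in window) == tuple(pattern):
            counts[" ".join(word for word, _ in window)] += 1
    return Counter({term: n for term, n in counts.items() if n >= min_freq})


if __name__ == "__main__":
    for term, n in candidate_terms(TAGGED).most_common():
        print(n, term)
```

Raising `min_freq` removes noise but also discards genuine terms that occur only once, so the threshold is a trade-off rather than a fixed setting.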

"Intelligent" automatic term extraction needs to focus on word sense disambiguation to reduce the amount of noise. The frequency criterion cannot be applied too systematically if the extraction process is meant to be comprehensive (many terms occur only once in a given corpus).

Learner corpora
Learner corpora are compiled from texts written by non-native students in a given foreign language. Studying such corpora allows language teachers to focus on the most frequent grammar mistakes that are typical of a particular language pair, and on any over- or under-used syntactic patterns or lexical items. The major learner corpus project is the International Corpus of Learner English headed by Sylviane Granger (Université de Louvain-la-Neuve, Belgium).
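Over- and under-use is usually judged by comparing normalized frequencies, since a learner corpus and a native reference corpus rarely have the same size. The Python sketch below shows the arithmetic only; the function names are hypothetical, the counts in the usage example are made up, and a real study would add a significance test.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size * 1_000_000


def usage_ratio(learner_count, learner_size, reference_count, reference_size):
    """Ratio of normalized frequencies: learner corpus over reference corpus.

    A value above 1 suggests over-use by learners, below 1 under-use.
    """
    reference_rate = per_million(reference_count, reference_size)
    if reference_rate == 0:
        raise ValueError("the item does not occur in the reference corpus")
    return per_million(learner_count, learner_size) / reference_rate


if __name__ == "__main__":
    # Made-up figures: 30 hits in 200,000 learner words
    # against 50 hits in 1,000,000 reference words.
    print(round(usage_ratio(30, 200_000, 50, 1_000_000), 2))
```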

Bilingual corpora
There are two kinds of bilingual corpora:
Translation corpora, which consist of translated texts that are generally aligned at sentence level (they may involve more than two languages).
Comparable corpora, in which both halves have a common subject matter but are not mutual translations.
The appellation "parallel corpus" is considered ambiguous, as it may be used to refer to either kind of corpus.

Using the Web as a corpus
The Web does not fit most linguists’ definitions of a corpus.
Sinclair (1991: 171): A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language.
Biber (1998: 4): a large and principled collection of natural texts.

The Web may be viewed as a very large corpus, which is constantly being updated, and cannot possibly be annotated. If it is to be used as a sample for linguistic exploration, questions must be raised about what exactly it is representative of. It is probably biased as regards several social categories (age, gender, social class) and is consequently not representative of general usage. Furthermore, an undefined percentage of its contents (probably high in English) is posted by non-natives.