Linguistic annotation of learner corpora
A. Díaz-Negrillo, D. Meurers & H. Wunsch
University of Jaén (Spain), University of Tübingen (Germany)


1. Introduction
This study examines the linguistic annotation of learner corpora, in particular part-of-speech (POS) annotation. It discusses where native POS tagsets fail to describe learner language accurately, by:
– describing POS annotation practice in learner corpora, and
– characterizing the areas where properties of learner language differ from those assumed by native POS annotation schemes.

Learner corpora can play a role in identifying areas of relevance in, for example, foreign language teaching (FLT), second language acquisition (SLA) and materials design.
The terminology used to single out aspects of learner language needs to be mapped to instances in the corpus, i.e. through annotation.

Linguistic annotation of learner corpora, in particular POS tagging, is becoming common practice because:
– Generally agreed linguistic categories allow units of interest to be identified objectively.
– Other annotation specific to learner corpora (error tagging) mostly supports research into deviances, is costly and involves a degree of subjectivity.
– SLA research is interested in the developmental stages of the acquisition process.
– POS tagging can be done automatically.

Recent initiatives:
– International Corpus of Learner English (ICLE)
– Cambridge Learner Corpus (CLC)
– Japanese EFL Learner Corpus (JEFLL)
– Polish Learner Corpus of English

Automatic POS tagging consists of 2 parts:
– Tag look-up: all possible tags for a given token are determined from a lexical database or by morphological analysis.
– Tag disambiguation: the possible tags are reduced to the correct tag based on distribution.
Fallback strategies: weaker versions of the 3 sources of evidence (lexical look-up, morphology and distribution) and, as a last resort, use of the most frequent tags.
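The two steps and the last-resort fallback can be sketched in a few lines of Python. This is a toy illustration only, not how any of the taggers named in this presentation work internally: the lexicon, the tag-bigram counts, the suffix rules and the function names (`look_up`, `disambiguate`, `tag`) are all hypothetical, and the tags follow Penn Treebank conventions.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> possible Penn Treebank-style tags.
LEXICON = {
    "the": {"DT"},
    "demonstrations": {"NNS"},
    "ended": {"VBD", "VBN"},
    "without": {"IN"},
    "confrontation": {"NN"},
}

# Hypothetical tag-bigram counts standing in for distributional evidence.
BIGRAMS = Counter({
    ("<s>", "DT"): 10, ("DT", "NNS"): 8,
    ("NNS", "VBD"): 6, ("NNS", "VBN"): 1,
    ("VBD", "IN"): 5, ("VBN", "IN"): 3, ("IN", "NN"): 7,
})

MOST_FREQUENT_TAG = "NN"  # last-resort fallback (an assumption of this sketch)


def look_up(token):
    """Tag look-up: lexicon first, then a crude suffix-based guess."""
    token = token.lower()
    if token in LEXICON:
        return set(LEXICON[token])
    if token.endswith("ed"):      # weak morphological evidence
        return {"VBD", "VBN"}
    if token.endswith("s"):
        return {"NNS", "VBZ"}
    return {MOST_FREQUENT_TAG}    # nothing else to go on


def disambiguate(candidates, previous_tag):
    """Tag disambiguation: keep the candidate seen most often after previous_tag."""
    return max(sorted(candidates), key=lambda t: BIGRAMS[(previous_tag, t)])


def tag(tokens):
    """Tag a token sequence left to right using look-up plus disambiguation."""
    tagged, previous = [], "<s>"
    for token in tokens:
        previous = disambiguate(look_up(token), previous)
        tagged.append((token, previous))
    return tagged


print(tag("The demonstrations ended without confrontation".split()))
```

A learner form that is missing from the lexicon, such as a non-word ending in "-ed", falls through to the suffix guess, which is the kind of weaker evidence the fallback strategies rely on.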

POS tagging learner language is essentially perceived as an instance of domain transfer (van Rooy & Schäfer 2003; Thouësny 2009):
– Automatic POS taggers trained on native data are run on learner data.
– Due to differences in genre and data type, the annotations are less accurate.
– To make up for this degradation in performance, post-correction is often added.

De Haan (2000) and van Rooy & Schäfer (2002) investigated POS tagging error types.
Spelling errors seem to be a major source of problems, but they can be handled rather straightforwardly, especially if they result in non-words.
De Haan (2000) proposes a fine-grained classification of the learner errors that are relevant to the POS tagging process, and suggests adapting the TOSCA-ICLE POS tagset to cater for these learner-specific features.

Native taggers
- map the linguistic categories of native language onto POS tags, based on the combinatory possibilities of stem, morphology and distribution:
  The demonstrations ended without confrontation (demonstrations: NNS)
but learner language
- does not always present the same POS categories, because the combinatory possibilities of stem, morphology and distribution are different:
  […] If he want to know this […] (want: VB/VBP?)
Do native taggers always provide the categories needed to describe learner language?

2. Method
This paper is based on a sample of around 40,000 words from the NOn-native Corpus of English (NOCE, Díaz Negrillo, 2007).
The NOCE corpus is a written corpus of EFL:
– Over 300,000 words of written English by Spanish undergraduates.
– 1,054 samples of an average of 250 words each.

The samples were collected:
– from 2003 to 2009, primarily among first-year students on the English degree programme at the Universities of Granada and Jaén (Spain),
– at 3 stages in the academic year (beginning, mid-term and end),
– by the students' lecturers, assisted by corpus compilers, in 1-hour teaching sessions,
– as a timed classroom task (essay writing), and
– on a voluntary basis and under appropriate conditions of anonymity.

The corpus contains 3 types of annotation:
– Editorial annotation: the corpus is annotated for students' edits to their own writing (e.g. strikeouts, late insertions, reordering of units and missing/unreadable text).
– Error annotation: a section of the corpus of around 40,000 words is error-tagged with the EARS tagset (Error-Annotation and Retrieval System, Díaz Negrillo, 2009).
– POS annotation: the corpus is annotated with 3 automatic POS taggers: TnT, Stanford and Treebank.

General observations of the corpus's POS annotations by the 3 POS taggers suggest:
– There are areas where the taggers do not provide the same tag for a given token,
– Certain cases are easy to disambiguate manually, but
– In other cases disambiguation is difficult because the tagsets do not fully map the categories present in the learner corpus.
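One way to locate the areas where taggers disagree is to align their outputs token by token and keep the positions where the tags differ. The sketch below assumes each tagger's output is already available as a list of tags aligned with the tokens. The tagger names are placeholders and the tag sequences are invented for illustration; they are not actual output from TnT, Stanford or any other tool.

```python
def find_disagreements(tokens, annotations):
    """Return (index, token, {tagger: tag}) for every token the taggers disagree on."""
    disagreements = []
    for i, token in enumerate(tokens):
        tags = {name: sequence[i] for name, sequence in annotations.items()}
        if len(set(tags.values())) > 1:   # more than one distinct tag
            disagreements.append((i, token, tags))
    return disagreements


tokens = ["If", "he", "want", "to", "know", "this"]

# Invented tag sequences, for illustration only.
annotations = {
    "tagger_a": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
    "tagger_b": ["IN", "PRP", "VB", "TO", "VB", "DT"],
    "tagger_c": ["IN", "PRP", "VBP", "TO", "VB", "DT"],
}

for index, token, tags in find_disagreements(tokens, annotations):
    print(index, token, tags)
```

The positions returned this way are candidates for manual disambiguation. Whether a satisfactory tag exists at all for such a token is the question taken up in the mismatch cases.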

A preliminary examination of the mismatches between the native and learner POS categories suggests 4 main types of mismatch.
The mismatches are discussed on the basis of the 3 sources of information that automatic POS taggers use to select tags for tokens:
– Lexical look-up: the token's stem,
– Morphology: the token's derivational and inflectional markings, and
– Distribution: the token's syntactic context.

3. Mismatches in POS classification variables
Case 1. Stem-Distribution mismatch
(1) You can find a big vary of beautiful beaches […]
    Stem: Verb ≠ Distribution: Noun
(2) They are very kind and friendship […]
    Stem: Noun ≠ Distribution: Adjective ≠ Morphology: Noun

3. Mismatches in POS classification variables
Case 2. Stem-Distribution and Stem-Morphology mismatch
(3) […] one of the favourite places to visit for foreigns.
    Stem: Adjective ≠ Distribution: Noun ≠ Morphology: Noun
(4) […] to be choiced for a job […]
    Stem: Noun ≠ Distribution: Verb ≠ Morphology: Verb

3. Mismatches in POS classification variables
Case 3. Stem-Morphology mismatch
(5) […] this film is one of the bests ever.
    Stem: Adjective ≠ Distribution: Adjective ≠ Morphology: Noun
(6) […] television, radio are very subjectives […]
    Stem: Adjective ≠ Distribution: Adjective ≠ Morphology: Noun

3. Mismatches in POS classification variables
Case 4. Distribution-Morphology mismatch
(7) […] for almost every jobs nowadays.
    Stem: Noun ≠ Distribution: Noun Sing ≠ Morphology: Noun Pl
(8) […] it has grew up a lot especially since 1996 […]
    Stem: Verb ≠ Distribution: Verb PP ≠ Verb PT (Morphology)
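The four cases can be reproduced from the category labels given with examples (1) to (8). The sketch below is one possible reading, not the authors' procedure. It assumes the labels are listed in the order stem, distribution, morphology, that the stem category is compared with the other two at the coarse level (the first word of the label), and that distribution and morphology are compared at the fine-grained level only when the stem agrees with both. The function names `coarse` and `classify` are hypothetical.

```python
def coarse(category):
    """Reduce a label such as 'Noun Pl' to its coarse category, 'Noun'."""
    return category.split()[0]


def classify(stem, distribution, morphology=None):
    """Name the mismatch type(s) for one token's three category assignments."""
    labels = []
    if coarse(stem) != coarse(distribution):
        labels.append("Stem-Distribution")
    if morphology is not None and coarse(stem) != coarse(morphology):
        labels.append("Stem-Morphology")
    # Fine-grained comparison only when the stem agrees with both other sources.
    if not labels and morphology is not None and distribution != morphology:
        labels.append("Distribution-Morphology")
    return labels


# Category labels as given with examples (1)-(8): stem, distribution, morphology.
EXAMPLES = {
    1: ("Verb", "Noun", None),
    2: ("Noun", "Adjective", "Noun"),
    3: ("Adjective", "Noun", "Noun"),
    4: ("Noun", "Verb", "Verb"),
    5: ("Adjective", "Adjective", "Noun"),
    6: ("Adjective", "Adjective", "Noun"),
    7: ("Noun", "Noun Sing", "Noun Pl"),
    8: ("Verb", "Verb PP", "Verb PT"),
}

for number, categories in EXAMPLES.items():
    print(number, classify(*categories))
```

Under these assumptions, examples (1) and (2) come out as Case 1, (3) and (4) as Case 2, (5) and (6) as Case 3, and (7) and (8) as Case 4.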

4. POS tagging learner data and deviances
Not all learner errors demand special attention in POS tagging:
(9) […] Internet can modificate […]
(10) He runned to by one […]
(11) […] The 11th March cames to out minds.
(12) Childrens spend so much time […]
(13) […] people shouldn’t be menospreciated […]

5. Conclusions
Linguistic annotation of learner data is a powerful means of gaining access to learner properties with a view to conducting theoretical and applied research.
Applying native automatic POS taggers is a sensible point of departure.
However, for linguistic annotation to be fully relevant in learner corpus research, it should capture the properties of learner language systematically.
Adapting existing native POS tagsets to the specifics of learner data seems necessary.

References
de Haan, P. (2000). Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus Linguistics and Linguistic Theory. Amsterdam: Rodopi.
Díaz Negrillo, A. (2007). A Fine-Grained Error Tagger for Learner Corpora. Unpublished Ph.D. thesis, University of Jaén, Jaén.
Díaz Negrillo, A. (2009). EARS: A User’s Manual. Munich: LINCOM.
Thouësny, S. (2009). Increasing the reliability of a part-of-speech tagging tool for use with learner language. Paper presented at the Automatic Analysis of Learner Language (AALL’09) Workshop, Tempe, AZ.
van Rooy, B. & Schäfer, L. (2002). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20.
van Rooy, B. & Schäfer, L. (2003). An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 Conference, Lancaster University (UK), March, Vol. 16. Lancaster: UCREL, Lancaster University.
