GIVE and TAKE: towards overcoming of the bottlenecks in learner corpus linguistics Przemysław Kaszubski School of English Adam Mickiewicz University Poznań,

1 GIVE and TAKE: towards overcoming of the bottlenecks in learner corpus linguistics Przemysław Kaszubski School of English Adam Mickiewicz University Poznań, Poland ICAME 2001 Louvain-la-Neuve 16-20 May 2001

2 Central issues
- Pedagogical perspective: annotation (& disambiguation) of data
- Procedures accessible for applied (non-specialist) researchers
- Corpus-based goals vs. corpus-driven methods

3 Example: research into learners’ core-word based phraseology
- EFL learners’ overuse of high-frequency words: what does it mean?
  - intensive collocability of core lexical items
  - multi-word extensions (compounds, coinages, idioms, expressions, phrasals)
  - need for an idiomatic scale
- multi-corpus scheme with Polish advanced EFL learner data as hub data

4 Corpus-driven methods: precision & recall problems
- Language-based obstacles:
  - Nature of learner language
  - Cross-corpus comparability
- Technical aspects:
  - POS tagging error margin
  - Word-sense disambiguation and/or syntactic parsing
  - Co-occurrence statistics

5 Problem 1: the nature of learner data
- Difference in proficiency levels essential in cross-corpus comparisons
  - Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand
  - Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’
  - Unrecognised words vs. tagger default option tag

6 Problem 2: cross-corpus comparability
- genre homogeneity
- topic-skewed word distribution; heuristic method of isolating it: sort by standard deviation
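The "sort by standard deviation" heuristic could be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used in the talk: the function name `rank_by_dispersion`, the whitespace tokenisation, and the use of per-text relative frequencies are all hypothetical choices. Words tied to one essay topic show a high standard deviation across texts, while evenly spread function words stay near zero.

```python
from collections import Counter
from statistics import pstdev


def rank_by_dispersion(texts):
    """Return (word, sd) pairs, highest standard deviation first.

    sd is the population standard deviation of the word's relative
    frequency across the texts (0.0 where the word is absent).
    """
    profiles = []
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty texts
        profiles.append({word: n / total for word, n in counts.items()})

    vocabulary = set().union(*profiles)
    scored = []
    for word in vocabulary:
        rel_freqs = [profile.get(word, 0.0) for profile in profiles]
        scored.append((word, pstdev(rel_freqs)))
    # Sort by sd (descending), then alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Toy data: three invented "essays", each dominated by one topic word.
essays = [
    "the dog barks and the dog runs",
    "the cat sleeps and the cat eats",
    "the sun shines and the sun sets",
]
for word, sd in rank_by_dispersion(essays)[:3]:
    print(f"{word}\t{sd:.3f}")
```

Items at the top of such a ranking are candidates for topic-induced skew and can be inspected or set aside before cross-corpus frequency comparison.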

7 Technical 3: performance of POS taggers
- Affected: extraction of lemmas meeting POS criteria
- Tagset vs. research criteria, e.g. gerund = noun or verb?
- Precision (noise in data): non-verbs tagged as verbs, e.g.:
    Not-telling   VB(lex,montr,ingp)   ?not-tel?      (7)
    agressive     VB(lex,intr,infin)   ?agressive?    (3)
    well-behaved  VB(lex,montr,edp)    ?well-behave?  (2)
- Recall (data ignored): verbs tagged as non-verbs, or lexical verbs tagged as auxiliaries, e.g.:
    ... who in sharing their lives with a retarded sibiling [sic!] and taking {taking} part in every-day care problems, may decide never to have...
- Untagged and heuristically tagged items: explicit marking vs. default tag

8 Tracking & rectifying POS errors
- TOSCA-ICLE tagger built-in tag editor: on-line targeting of precision & recall errors
  - Problem: insufficient query language: word OR lemma OR tag pattern
- no tagger built-in editor:
  - Problem 1: comprehensive check or intuitive selection
  - Problem 2: most concordancers are browsers only
  - Problem 3: large corpora may not load into common editors
- Use of text-processing UNIX tools

9 Technical 4: semantic disambiguation and associations
- sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998: 4)
- automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project
- University of Lancaster disambiguation tool (ACASD package, Thomas & Wilson 1996)
- Tools unavailable or not at implementable stage
- Consequently: manual, POS-aided disambiguation

10 Technical 5: corpus-driven phraseology extraction (1)
- collocation vs. co-occurrence & adjacency
  - word clusters
    - precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
    - recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs. ‘TAKE good care of’; ‘the chance which were not eager to take’)
    - stop-listing not quite possible with high-frequency items (exc. Ted Pedersen’s ‘Bigram Statistics Package’)
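The recall problem with a fixed window can be illustrated with a small sketch. It is not taken from any of the tools named in the talk: the function `window_collocates` is hypothetical, and it matches plain word forms rather than lemmas such as TAKE. With a 4:4 span, ‘care’ is still caught in ‘take good care of’, but ‘chance’ in the learner example lies six tokens to the left of ‘take’ and is lost unless the span is widened.

```python
def window_collocates(tokens, node, left=4, right=4):
    """List (collocate, offset) pairs found within the span around each node hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() != node:
            continue
        start = max(0, i - left)
        stop = min(len(tokens), i + right + 1)
        for j in range(start, stop):
            if j != i:  # skip the node itself
                hits.append((tokens[j].lower(), j - i))
    return hits


near = "take good care of".split()
far = "the chance which were not eager to take".split()

print(window_collocates(near, "take"))
# Is 'chance' found with the typical 4:4 span, and with a widened left span?
print("chance" in dict(window_collocates(far, "take")))
print("chance" in dict(window_collocates(far, "take", left=6)))
```

Widening the span restores recall here, but at the cost of more noise, which is the precision side of the same trade-off.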

11 Technical 5: corpus-driven phraseology extraction (2)
- co-occurrence statistics (WordSmith, TACT)
  - precision: not all co-occurrence patterns testify to meaningful collocations
  - recall: collocations may extend beyond typical 4:4 word spans
  - MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):
      GIVE 1722458
      collocate     MI
      birth         4.65
      vote          4.24
      opening       4.24
      antibiotic    4.01
      vaccination   3.91
      ingenuity     3.91
      isolate       3.43
      habit         3.43
      happiness     3.24
      away          2.91
  - WordSmith: output limited to 10 collocates
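Why MI tends to favour ‘idiosyncratic’ collocates can be shown with a short calculation. The sketch below assumes one common formulation, MI = log2(observed / expected) with expected = f(node) × f(collocate) × span / N; the exact formula used by WordSmith or TACT is not stated in the slides, and the frequencies in the example are invented for illustration only.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size, span=1):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    if pair_freq <= 0:
        raise ValueError("MI is undefined for pairs that never co-occur")
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(pair_freq / expected)


# Invented figures: a 10,000-token corpus with a node occurring 100 times.
# A well-attested collocate: 100 occurrences, 8 of them next to the node.
print(round(mutual_information(8, 100, 100, 10_000), 2))
# A word occurring once in the corpus, and that once next to the node.
print(round(mutual_information(1, 100, 1, 10_000), 2))
```

The single chance co-occurrence scores higher than the well-attested pair, because the expected frequency of a rare word is tiny. This is the behaviour that pushes low-frequency items to the top of MI-ranked lists.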

12 Enhancing collocation extraction
- Oliver Mason’s QWICK ( :
  - MI with weighting factors for frequent words
  - unlimited display of collocates
  - multi-test package: incl. t-score; log-likelihood; modified log-likelihood; expected/observed ratio
- Remaining problems
  - effective stop-listing not quite possible with high-frequency item tests
  - collocations outside a heuristic window
  - lexical associations between collocates (synsets)
  - semi-manual grouping of data essential
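Two of the tests listed for the multi-test package can be sketched from their textbook definitions. This is not QWICK's code, and its weighted MI and modified log-likelihood are not reproduced here. The assumptions are: t-score = (O − E) / √O, and log-likelihood computed as G² over the 2×2 contingency table of node and collocate; the function names and example frequencies are hypothetical.

```python
import math


def t_score(pair_freq, node_freq, collocate_freq, corpus_size):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = node_freq * collocate_freq / corpus_size
    return (pair_freq - expected) / math.sqrt(pair_freq)


def log_likelihood(pair_freq, node_freq, collocate_freq, corpus_size):
    """G2 statistic over the 2x2 table: 2 * sum(O * ln(O / E))."""
    observed = [
        pair_freq,                                               # node and collocate
        node_freq - pair_freq,                                   # node without collocate
        collocate_freq - pair_freq,                              # collocate without node
        corpus_size - node_freq - collocate_freq + pair_freq,    # neither
    ]
    rows = [node_freq, corpus_size - node_freq]
    cols = [collocate_freq, corpus_size - collocate_freq]
    expected = [r * c / corpus_size for r in rows for c in cols]
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Invented figures: 10,000-token corpus, node frequency 100.
for pair, coll in [(8, 100), (1, 1)]:
    print(pair, coll,
          round(t_score(pair, 100, coll, 10_000), 2),
          round(log_likelihood(pair, 100, coll, 10_000), 2))
```

Unlike a raw observed/expected ratio, both measures take the amount of evidence into account, so the well-attested pair outranks the single chance co-occurrence. Running several tests side by side, as the slide suggests, makes such differences in ranking visible.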

13 Semi-manual disambiguation (WordSmith Tools)

14 Problems with semi-manual editing
- CONCORDANCER:
  - not all important information can be marked: insufficient single letter annotation
  - limited saving options: no possibility to circulate concordance data
- TEXT EDITOR:
  - no node-based display: no easy sorting
  - large corpora: handling of large or multiple files

15 Solution: dedicated concordancer-annotator
- Feature 1: simple built-in POS tagger [?]
- Feature 2: allow editing of concordance lines (text and/or tags and/or lemmas), like built-in tagger editors
- Feature 3: allow adding custom information to concordance lines (specialised annotation / grouping of data)
- Feature 4: allow saving concordances as text BACK into the corpus (pasting)
- Feature 5: multiple co-occurrence tests
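A very small sketch may help to picture Features 3 and 4. It is not a design from the talk: the `ConcordanceLine` class, the `concordance` and `paste_back` functions, and the `node<note>` markup are all hypothetical, chosen only to show free-text annotation of concordance lines, grouping by annotation, and writing the result back into the corpus text.

```python
from dataclasses import dataclass


@dataclass
class ConcordanceLine:
    index: int      # position of the node token in the corpus
    left: str       # left context
    node: str       # the node word as it appears in the text
    right: str      # right context
    note: str = ""  # analyst's free-text annotation (Feature 3)


def concordance(tokens, node, width=4):
    """Build one concordance line per occurrence of the node word form."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(ConcordanceLine(i, left, token, right))
    return lines


def paste_back(tokens, lines):
    """Return a copy of the corpus with annotated nodes marked as node<note>."""
    annotated = list(tokens)
    for line in lines:
        if line.note:
            annotated[line.index] = f"{line.node}<{line.note}>"
    return annotated


corpus = "they take good care of him and take part in it".split()
lines = concordance(corpus, "take")
lines[0].note = "take-care-of"   # annotations longer than a single letter
lines[1].note = "take-part"
lines.sort(key=lambda line: line.note)  # group lines by annotation
print(" ".join(paste_back(corpus, lines)))
```

Because each line remembers its position in the corpus, the lines can be sorted or grouped freely and still be written back to the right place.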

16 Summary
- Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity
- With small corpora, mere automated methods of processing and analysis display insufficient precision and recall
- Loss of data may prove too costly when pedagogical conclusions are sought
- Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)
- Dedicated new type of hybrid concordancer-editor needed

17 This show shortly available from:
