Corpus annotation and retrieval: an introduction
Paul Rayson, Computing Department, Lancaster University
Dawn Archer, School of Humanities, University of Central Lancashire
Text Mining for Historians July Glasgow University
Session outline: What is a corpus? What is corpus linguistics? Applying these techniques to historical data. What research questions can we answer with CL techniques … in linguistics? … in computing? … in history?
1. Background: corpora, corpus linguistics, annotation, retrieval methods
Underlying assumption: intuition is not enough to study language … A reaction to Noam Chomsky's focus on introspection in the 1950s/60s. Empirical observation of naturally occurring data versus theory of how human language processing is actually undertaken.
What is a corpus? Old meaning = body of text (Latin). Now = (any) collection of texts or language examples, usually in an electronic format. This demonstrates the extent to which the revival of CL was led by advances in computing technology.
A corpus tends to be representative, i.e. a balanced sample of a language or a particular variety of language (cf. national corpora: British, American, Czech, Polish …). Reasoning? Helps to remove intuitive bias. Helps us to find common/rare phenomena. Exceptions …?
And large … because size helps us to: establish norms about the variety being studied; reveal lots of cases of rare features of language. Zipf's law.
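The link between corpus size and rare features can be made concrete with a rank-frequency table. The following is a minimal sketch, not taken from the slides: the tokeniser, the function names and the toy sentence are assumptions for illustration only. Zipf's law predicts that in a large corpus rank × frequency stays roughly constant, so most word types are rare and only a large sample shows them often enough to study.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent word first."""
    # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


def hapax_share(text):
    """Proportion of distinct words (types) that occur exactly once."""
    table = rank_frequency(text)
    once = sum(1 for _, _, freq in table if freq == 1)
    return once / len(table) if table else 0.0


sample = "the cat and the dog and the bird"
for rank, word, freq in rank_frequency(sample):
    # In a large corpus, Zipf's law predicts rank * freq is roughly constant.
    print(rank, word, freq, rank * freq)
print("share of types seen only once:", hapax_share(sample))
```

A toy sentence is far too small to show the law itself; the point is only the shape of the calculation.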
Size matters!
Brown/LOB, 1960s: 1 million words
BNC, 1990s: 100 million words
Web, present day: ? billion words
Birmingham corpus million Collins Bank of English Cambridge International Corpus Oxford English Corpus million – 1 billion Web Future ? billions
So what is corpus linguistics? = the study of language using corpora = an empirical methodology = a useful means of exploring: synchronic and diachronic variation; syntax, semantics, pragmatics; lexicography; dialects, minority languages. Not just English.
Corpus techniques we utilise. Retrieval: frequency profiling, concordancing, collocations, key words, key domains. Annotation: POS tagging, semantic tagging.
Annotation: part-of-speech tagging, semantic field tagging. Retrieval: frequency lists, concordances.
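The two retrieval methods named here can be sketched in a few lines. This is an illustrative sketch only, not the software used in the session: the tokeniser, the function names and the window width are assumptions. A frequency list counts word forms; a concordance shows every occurrence of a chosen word with its context on either side (key word in context).

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokeniser (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, n=10):
    """The n most frequent word forms with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, node, width=3):
    """Key-word-in-context lines: (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = tokenize("He came to Town at Twelve, and he went to a Brokers.")
print(frequency_list(tokens, 3))
for left, node, right in concordance(tokens, "to", width=2):
    print(f"{left:>12} [{node}] {right}")
```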
Key words (diagram labels: Text; Text or reference corpus; Keywords). What are key words? And why are they so useful?
Key words / word clouds. If we compare text A with text B, we can discover the most significant items within text A, and not only the frequent items.
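The comparison of text A with text B can be sketched as follows. The slide does not name a statistic, so the log-likelihood score used here is an assumption (it is one common choice for keyness), and the function names and toy word lists are hypothetical. For a word occurring a times in A (of c words) and b times in B (of d words), the expected counts are E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d), and LL = 2(a ln(a/E1) + b ln(b/E2)).

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in text A (c tokens) and b times in B (d tokens)."""
    e1 = c * (a + b) / (c + d)  # expected count in A if usage were the same
    e2 = d * (a + b) / (c + d)  # expected count in B if usage were the same
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(tokens_a, tokens_b, top=5):
    """Words relatively more frequent in A than in B, highest score first.

    Both token lists are assumed to be non-empty.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scored = []
    for word, a in freq_a.items():
        b = freq_b.get(word, 0)
        if a / c > b / d:  # keep only items overused in A
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


text_a = "court trial witness court judge the the".split()
text_b = "the the the the farm field crop".split()
print(keywords(text_a, text_b))
```

Note that "the" is as frequent as "court" in text A but is not returned, because it is even more frequent in text B: significance, not raw frequency, decides what counts as key.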
Collocations: collocation = a relationship between words that tend to occur together in texts. Words that tend to occur near word X are the collocates of word X (consider fish and XXXXX). Based on frequency (how frequent the words are separately vs. how frequent together). The company a word keeps: implicit associations or assumptions. Bachelor: eligible, flat, life, days. Spinster: elderly, widows, sisters, parish.
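The idea of comparing how frequent two words are separately with how frequent they are together can be sketched with a mutual information score. The slide does not name a measure, so MI is an assumption here, as are the window size, the function name and the toy sentence. The score is log2(O/E), where O is the observed number of co-occurrences within the window and E = f(node) × f(collocate) × span / N is the number expected if the words were independent.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2, min_count=1):
    """Score words found within `window` tokens of `node` by mutual information."""
    n = len(tokens)
    freq = Counter(tokens)   # how frequent each word is separately
    together = Counter()     # how often each word occurs near the node
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                together[tokens[j]] += 1
    scores = {}
    for word, observed in together.items():
        if observed >= min_count:
            # Expected co-occurrences if node and word were independent.
            expected = freq[node] * freq[word] * 2 * window / n
            scores[word] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


tokens = "fish and chips with salt and fish and chips with peas".split()
for word, score in collocates(tokens, "fish"):
    print(word, round(score, 2))
```

In practice a minimum co-occurrence count (`min_count`) matters, because MI tends to favour rare words.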
Corpus software
Modern methods in an historical setting (focussing on the EModE period). Tools/methods don't take account of spelling variation: variant spelling detector (VARD). The need to use historically valid taxonomies or thesauri, or revise our existing modern tagsets: Historical Thesaurus of English; Spevack (1993).
Using automated systems of annotation on historical texts is problematic …
EModE texts pose the following problems:
Archaic -eth and -(e)st verb suffixes, e.g. doth, hath, hast, sayeth, etc., which persist in specialised contexts: religious and poetic usage
Fused forms, e.g. 'Tis (It is)
Spellings that are variable even in modern-day usage, e.g. center/centre, skilful/skillful/skilfull, the suffixes -or/-our, -ise/-ize
Archaic forms like howbeit, betwixt, for which no obvious modern equivalent exists
Compound words, e.g. it self, now adays, in stead
Proper names of Latin origin that are sometimes modernised, e.g. Galilaeo (Galileo)
Due to different conventions and compositing practices
Previous work in … information retrieval, corpus linguistics, natural language processing. Fuzzy search engine: aimed at successful retrieval for novice users without expertise in the text; expands the search term using known letter replacements. Changing the dictionary built in to corpus annotation software: back-dating inbuilt dictionaries by adding historical variants.
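Expanding a search term with known letter replacements can be sketched as below. This is not the fuzzy search engine referred to on the slide; the replacement table and the function name are hypothetical, chosen only to show the mechanism. Each letter is swapped for every alternative in the table and all combinations are generated, so a novice user typing a modern spelling also retrieves the older ones.

```python
from itertools import product

# Hypothetical replacement table: each letter maps to the letters it may
# alternate with in older spellings (the letter itself is always included).
REPLACEMENTS = {
    "u": ["u", "v"],
    "v": ["v", "u"],
    "i": ["i", "y", "j"],
    "j": ["j", "i"],
    "y": ["y", "i"],
}


def expand(term):
    """Return the set of candidate spellings for a search term."""
    options = [REPLACEMENTS.get(ch, [ch]) for ch in term.lower()]
    return {"".join(combo) for combo in product(*options)}


print(sorted(expand("love")))
print(len(expand("university")), "candidate spellings for 'university'")
```

The number of candidates grows quickly with word length, so a real system would presumably filter them against the forms actually attested in the collection.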
Our scenario: VARD (detect variant spellings and insert modern equivalents), then POS tagger, then SEM tagger.
An important point about the VARD: although the VARD allows for the detection and normalisation of variants to their modern equivalents, it should be noted that the original variants are retained in the text, and that we're not carrying out spell checking per se (there was no correct spelling in the EModE period). Our ultimate aim is to develop a system that automatically regularises variants within a text to their modernised forms so that historical corpora become more amenable to further annotation and analysis.
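The principle that the original variants are retained can be sketched with a simple lookup. This is not the VARD's actual method; the variant list and the function name are hypothetical. Each token is paired with a modern equivalent where one is known, so later tools (such as a tagger) can read the modern form while the original spelling stays available.

```python
# Hypothetical variant list: historical spelling -> modern equivalent.
VARIANTS = {
    "haue": "have",
    "loue": "love",
    "vnto": "unto",
    "onely": "only",
}


def normalise(tokens):
    """Pair every token with a modern form, keeping the original alongside it."""
    pairs = []
    for tok in tokens:
        modern = VARIANTS.get(tok.lower(), tok)  # unknown tokens pass through
        pairs.append((tok, modern))
    return pairs


pairs = normalise("I haue onely loue".split())
print(pairs)
# Downstream annotation can read the modern forms ...
print(" ".join(modern for _, modern in pairs))
# ... while the original text is still recoverable.
print(" ".join(original for original, _ in pairs))
```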
2. Historical data
Existing corpora. What is already available:
LOB family, Brown family (20th century): 15 genres: press, religion, skills & hobbies, biography, learned, fiction (detective, science, adventure), romance, humour
Lampeter ( ): Religion, Politics, Economy, Science, Law and Misc.
Corpus of English Dialogues ( ): Trial proceedings, depositions, drama, prose fiction
Helsinki (Old, Middle and Early Modern English)
Archer ( , sampled at 50-year periods): Journals, letters, fiction, news, medicine, science
Book Search: other historical texts, not compiled for corpus linguistics
Changing English Across the 20th Century: a corpus-based study
ucrel.lancs.ac.uk/20thCenturyEnglish/
Leverhulme Trust (2005-7)
Corpora:
BrE: Lanc-1901, B-LOB (Lanc-1931), LOB, F-LOB
AmE: ?, Pre-Brown31, Brown, Frown
Background: recent observations of significant shifts having occurred among expressions of obligation/necessity in the period, e.g. a decline of the central modals MUST and NEED; a spread of the semi-modals HAVE TO, NEED TO
Research questions: Are these changes recent? How do these changes compare to the development of the semantic field of OBLIGATION/NECESSITY as a whole?
Project outputs: compile a new corpus of British English called Lancaster1901; enhance the encoding and annotation of Lancaster1901 and the three existing corpora (Lancaster1931, LOB and FLOB); 10 conference presentations; 1 book chapter; 1 book; 2 journal articles
Application 2: Historical CL
In particular, courtroom research (1640+), from a linguistic perspective
Utilise a specially designed corpus, the Sociopragmatic Corpus, which has been annotated for: age, gender, status and role; speech acts such as questions, requests and commands
[$ (^Record.^) $] He did not go out of your Company at all?
[$ (^Ann.^) $] Yes about Ten a Clock.
[$ (^Record.^) $] Woman you must be mistaken, he came to Town at Twelve or One, and might be in thy company, but it is plain he went to a Brokers in (^Long-lane^), and so to the (^Artillery-Ground^) at (^Cripple-Gate^), for I guess it might be so: Then they went to (^Whetstones-Park^), and spent Six-Pence, and after that they went into (^Drury-lane^).
[$ (^Giles,^) $] My Lord, she don't say she was with us all the while, but we came to an House where she was, and several other People our Neighbours.
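The speaker markup visible in the excerpt can be read mechanically. The sketch below is an assumption based only on the `[$ (^Name^) $]` pattern shown above; it does not cover the corpus's full annotation for age, gender, status, role or speech acts, and the function name is hypothetical. It splits the text into (speaker, utterance) turns and then counts the turns that end in a question mark.

```python
import re

# Speaker tags as they appear in the excerpt, e.g. "[$ (^Record.^) $]".
SPEAKER = re.compile(r"\[\$ \(\^(.+?)\^\) \$\]")


def turns(text):
    """Split marked-up text into (speaker, utterance) pairs."""
    # With one capturing group, split() returns
    # [preamble, speaker1, utterance1, speaker2, utterance2, ...].
    parts = SPEAKER.split(text)
    return [(parts[i].strip(" .,"), parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]


sample = ("[$ (^Record.^) $] He did not go out of your Company at all? "
          "[$ (^Ann.^) $] Yes about Ten a Clock. "
          "[$ (^Record.^) $] it is plain he went to a Brokers in (^Long-lane^).")

for speaker, utterance in turns(sample):
    print(f"{speaker}: {utterance}")

questions = sum(1 for _, utterance in turns(sample) if utterance.endswith("?"))
print("turns ending in a question mark:", questions)
```

Place names such as `(^Long-lane^)` are left inside the utterance, because the pattern only matches the full `[$ … $]` speaker tag.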
Some important findings
Historical courtroom discourse is not just made up of questions and answers (even during examination sequences)
The frequency with which questions (and directives) were used, the function that they served, and their ability to achieve their social and/or interactional goal depended (in large part) on a number of socio-pragmatic factors: type and date of trial; position in discourse; role of user & addressee; ultimate aim of interaction
was a period of emerging and changing roles
Now beginning to explore the nineteenth century, i.e. the period in which the courtroom adopted advocacy in its modern form (Cairns 1998)
Utilising full trials: emerging need to consider opening/closing statements
Diagram labels: Linguistic theory; Natural language processing & computational linguistics; Corpus; Empirical evidence to inform theory; Statistical and rule-based language models; Corpus linguistics; Historical theory; Historical text mining (HTM)
3. Over to you …
What research questions would you like to answer, but can't? Search engines for new text collections and digital libraries; named entity extraction for GIS; variant spellings; historical text mining; new research methods in History.