Lou Burnard http://www.natcorp.ox.ac.uk BNC-XML: an introduction
What is the BNC? A snapshot of British English, taken at the end of the 20th century; 100 million words in approx. 4000 different text samples, both spoken (10%) and written (90%); a synchronic (1990-4), sampled, general-purpose corpus; available under licence; latest edition is BNC-XML (13 Mar 2007)
Production of the BNC: managed by an academic-industrial consortium with significant government funding; took three years (at least); cost GBP 1.6 million (at least); came about through an unusual coincidence of interests amongst lexicographical publishers, Government (DTI), and the Science and Engineering Research Council. Target audience: lexicographers and NLP researchers, but not language teachers!
Remember the Nineties? WinWord or WP5? The choice is yours. On your desk: a 386 with 50 MB diskspace (just about enough to run Windows 3). In your lab: a VAX or a Sparc for serious work. On the WWW (maybe): Mosaic for X. Little text in digital format. Text encoding (under development): TEI, SGML.
Corpus linguistics 90s-style: a world without the web! Corpus linguistics: traditionalists (ICAME), expansionists (LDC, monitor corpora). Text encoding theory. Language engineering and NLP: the JFIT mentality.
Project Goals. Stated: a synchronic (1990-4) corpus of samples, both spoken and written, from the full range of British English language production; of non-opportunistic design, for generic applicability; with word class annotation and contextual information. Unstated: better, more authoritative learner dictionaries; a new template for European language resources; a REALLY BIG corpus.
The BNC “sausage machine”: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS). Stages: Selection, clearance, and capture; Enrichment and encoding; Documentation, distribution, maintenance.
Distinctive features of the BNC: non-opportunistic design; standardized markup system; structural annotation; word class annotation; contextual information; general availability. In these respects, the BNC remains distinctive, twenty years on!
Why BNC XML? The BNC is still widely used... but the technology has moved on. XML tools are everywhere... so using the corpus is much easier. Conversion to XML was easy and (fairly) automatic... but with more tractable markup some dusty corners needed sweeping out.
Needles and haystacks. The BNC has an extraordinary range: travel agent brochures, weather reports, formal invitations, advertising, publicity leaflets, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, business letters, personal diaries and correspondence... The problem is finding the specific texts you want: selection criteria; descriptive criteria; post-hoc categorization (or use the WLD principle).
BNC Design. Criteria for written texts (90%): Medium (books, newspapers, unpublished…); Domain (informative, entertaining…). Criteria for transcribed speech events (10%): a context-governed half (predefined list of speech situations) and a demographically sampled half (200 volunteers, sampled for age, sex, region). These selection criteria make up a taxonomy, which is defined in the corpus header.
Descriptive criteria. Spoken texts: speaker occupation, perceived accent, education level, personal relationship…; speech domain, region, locale… Written texts: author age, sex, type; audience, circulation, status; text-type classification. These criteria were used to maximize variation once selectional constraints had been applied.
Annotation, encoding, markup: a means of making explicit, and thus processable: structure (texts, sections, paragraphs, turns, sentences, words...); metadata (text-type, situational parameters, context); analysis (morphology, syntactic function, translation). Adopting a single framework facilitates integration and sharing of fragmentary resources, thus enhancing research outcomes; it also makes tool development much easier.
BNC-XML structure [diagram: written texts (wtext) are made up of div, p, s and w elements; spoken texts (stext) are made up of div, u, s and w elements. Figures shown in the diagram: 6,026,284; 98,363,784; 784,484; 1,599,692]
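Element counts of this kind can be obtained by streaming through a file and tallying tags. A minimal sketch in Python, assuming the element names shown in the diagram (wtext, div, p, s, w) and a root element called bncDoc; the miniature document is invented for illustration and is not taken from the corpus:

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Invented miniature document; real BNC-XML files are far larger.
SAMPLE = (
    b'<bncDoc><wtext><div><p>'
    b'<s n="1"><w>A </w><w>test</w></s>'
    b'<s n="2"><w>Again</w></s>'
    b'</p></div></wtext></bncDoc>'
)

def count_elements(source):
    """Count elements by tag name, streaming so memory use stays low."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop content that has already been counted
    return counts

counts = count_elements(io.BytesIO(SAMPLE))
print(counts["p"], counts["s"], counts["w"])
```

Passing a file path instead of the in-memory buffer should work the same way, since iterparse accepts either.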
Word class annotation: CLAWS (Leech, Garside et al) approach. What counts as a word? This isn't prima facie obvious, in spite of spelling conventions. In BNC-XML, each word is explicitly marked and annotated with: a root form or lemma; an automatically assigned C5 word class code; a simplified POS code.
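The three annotations on each word can be read off with any XML parser. A minimal sketch in Python, assuming each word is a w element carrying the lemma in an hw attribute, the C5 code in c5 and the simplified code in pos; the sample sentence and its attribute values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample; the attribute names (hw, c5, pos) are assumptions here.
SAMPLE_S = (
    '<s n="1">'
    '<w c5="NN2" hw="dog" pos="SUBST">Dogs </w>'
    '<w c5="VVB" hw="bark" pos="VERB">bark</w>'
    '</s>'
)

def word_annotations(xml_text):
    """Return (form, lemma, C5 code, simplified POS) for every w element."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        form = (w.text or "").strip()
        rows.append((form, w.get("hw"), w.get("c5"), w.get("pos")))
    return rows

rows = word_annotations(SAMPLE_S)
for row in rows:
    print(row)
```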
Words and multiwords. English orthography can be misleading. In BNC XML, some “multiwords” are explicitly marked as single units: “in spite of...” within “in spite of common sense...”. Conversely, one orthographic word may be split: “it wasn't me” becomes “it was n't me”.
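Treating a marked multiword as one unit is then a matter of stopping at the wrapper element instead of descending into it. A minimal sketch in Python, assuming multiwords are wrapped in an mw element that carries its own c5 code; the sample and its codes are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample; the mw wrapper and its c5 attribute are assumptions here.
SAMPLE_MW = (
    '<s n="1">'
    '<mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN1" hw="spite" pos="SUBST">spite </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    '</mw>'
    '<w c5="AJ0" hw="common" pos="ADJ">common </w>'
    '<w c5="NN1" hw="sense" pos="SUBST">sense</w>'
    '</s>'
)

def lexical_units(xml_text):
    """List (text, C5 code) pairs, counting each mw as a single unit."""
    root = ET.fromstring(xml_text)
    units = []
    for child in root:  # direct children only, so w inside mw is not repeated
        if child.tag in ("mw", "w"):
            text = "".join(child.itertext()).strip()
            units.append((text, child.get("c5")))
    return units

units = lexical_units(SAMPLE_MW)
print(units)
```

Iterating over all w elements instead would give five words; the choice between the two views depends on whether the analysis is orthographic or lexical.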
Structure of written texts. Most written texts are organized hierarchically into various kinds of division, shown by headings or other features. Some divisions are typed: e.g. chapter, section, story, subsection, column, front, part, recipe, leaflet... All spoken texts are divided into “conversations”...
Features of written texts. Paragraph-like: <p> marks paragraphs; <head> marks headings or captions; <list> marks lists; <quote> marks quotes; <l> marks verse lines. Paragraph-parts: <hi> for typographic highlighting; <corr> for corrected passages; <gap> for deliberate omissions; <pb> for page breaks.
Speech in writing... Mr. Skinner... That millionaire mammy 's boy — Interruption Mr. Speaker Order. That is not wholly unparliamentary.
Structure of spoken texts. <u who="XXX"> marks a stretch of speech initiated by the speaker identified as XXX; <align> marks a synchronization point. Detailed information on speakers is given in the text header. Other features of transcribed speech are also marked...
Features of spoken texts. <shift> marks changes in voice quality, e.g. whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance. <vocal> marks non-verbal but vocalised sounds, e.g. coughs, humming noises etc. <event> marks non-verbal and non-vocal events, e.g. passing lorries, animal noises, and other matters considered worthy of note. <pause> marks significant pauses: silence, within or between utterances, longer than was judged normal for the speaker or speakers. <unclear> marks unclear passages: whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.
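Because utterances and speech features are marked explicitly, simple per-speaker statistics need very little code. A minimal sketch in Python, assuming utterances are u elements with a who attribute and that the features above appear as shift, vocal, event, pause and unclear elements; the speaker identifiers and transcript are invented for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented transcript; element and attribute names are assumptions here.
SAMPLE_STEXT = (
    '<stext>'
    '<u who="SPK1"><s n="1"><w>Hello </w><pause/><w>there</w></s></u>'
    '<u who="SPK2"><s n="2"><vocal desc="laugh"/><w>Hi</w><unclear/></s></u>'
    '<u who="SPK1"><s n="3"><w>Right</w></s></u>'
    '</stext>'
)

FEATURE_TAGS = {"shift", "vocal", "event", "pause", "unclear"}

def speaker_profile(xml_text):
    """Count words per speaker and speech features per element name."""
    root = ET.fromstring(xml_text)
    words = Counter()
    features = Counter()
    for u in root.iter("u"):
        who = u.get("who")
        for elem in u.iter():
            if elem.tag == "w":
                words[who] += 1
            elif elem.tag in FEATURE_TAGS:
                features[elem.tag] += 1
    return words, features

words, features = speaker_profile(SAMPLE_STEXT)
print(dict(words), dict(features))
```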
Contextual information: each text has a TEI header (identification and classification; specific details, e.g. speakers); all common data is in the corpus header; classification(s) in the header are pointed to by individual texts.
Structure of the TEI Header. File Description: Title Statement, Responsibility Statement/s, Edition Statement, Extent, Publication Statement, Identification numbers, Source Description. Encoding Description: Tagging Declaration. Profile Description: Creation, [Participant Description], Text Classification. Revision Description.
The title Statement How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure) Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context 32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings. [Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce) The age of capital 1848-1875. Sample containing about 41650 words from a book (domain: world affairs) Data capture and transcription Oxford University Press
The edition statement BNC XML Edition, December 2006 41650 tokens; 41573 w-units; 1436 s-units Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium. This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions. J0P AgeCap
The source description 1 The age of capital 1848-1875. Hobsbawm, E J Abacus London 1977 203-316
The profile description (written) W nonAc: humanities arts History, Modern - 19th century Capitalism - History - 19th century World, 1848-1875
Classification codes Codes used are predefined in the Corpus header Written Domain Imaginative Natural and pure sciences Applied sciences...
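Since the codes are declared once and only referenced from each text, resolving a text's classification means looking its codes up in the corpus header. A minimal sketch in Python, assuming the header declares each code as a category element with an xml:id and a catDesc description, and that a text refers to codes through a catRef element with a space-separated targets attribute; the identifiers DOM1 to DOM3 are hypothetical:

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so xml:id parses to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragments; element names and code identifiers are assumptions.
SAMPLE_CORPUS_HEADER = (
    '<taxonomy>'
    '<category xml:id="DOM1"><catDesc>Imaginative</catDesc></category>'
    '<category xml:id="DOM2"><catDesc>Natural and pure sciences</catDesc></category>'
    '<category xml:id="DOM3"><catDesc>Applied sciences</catDesc></category>'
    '</taxonomy>'
)
SAMPLE_TEXT_HEADER = '<textClass><catRef targets="DOM2"/></textClass>'

def resolve_categories(corpus_header_xml, text_header_xml):
    """Map each code referenced by a text to its corpus-header description."""
    header = ET.fromstring(corpus_header_xml)
    descriptions = {
        cat.get(XML_ID): cat.findtext("catDesc")
        for cat in header.iter("category")
    }
    codes = []
    for ref in ET.fromstring(text_header_xml).iter("catRef"):
        codes.extend(ref.get("targets", "").split())
    # Codes missing from the header map to None rather than raising.
    return {code: descriptions.get(code) for code in codes}

resolved = resolve_categories(SAMPLE_CORPUS_HEADER, SAMPLE_TEXT_HEADER)
print(resolved)
```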
The profile description (spoken) 1992-02-23 20 Wayne unemployed Central South-west England.... Hampshire: Andover local shop visiting friends...
Has English moved on? Types of text: e-mail, web pages / blogs, SMS, personal letters. Topics: globalization, internet, Elvis, Word Perfect.
Out of date? The composition (and date) of any corpus affects inferences drawn from it. There aren't many alternatives: Web-as-corpus; sources of spoken texts?; monitor corpora are non-replicable; copyright permissions unrepeatable. Quantitative and qualitative comparative evaluations of BNC coverage are needed, but “it's surprising how much is there”.
Why is it still useful? The BNC is a problematizing resource: it complements (and corrects) intuition, increases learner autonomy, and critiques the myth of the native speaker... for teacher and learner alike. XML makes it more usable by non-specialist software. Its range and availability make it unique.
Where can I get one? BNC XML: http://www.natcorp.ox.ac.uk; now available on DVD; standalone single user licence or institutional licence; existing licensees should renew. XAIRA: delivered free with the BNC (and also available free from http://xaira.sf.net); usable with any XML corpus; usable/ish on any platform.