The BNC XML edition Guy Aston

2 The BNC: 100 million words of late 20th-century British English (written and spoken). Synchronic: begun in 1991, completed in 1994. Slightly revised for the 2nd edition, BNC World (2001), and the 3rd edition, BNC XML (2007). Sub-corpus releases: –BNC Sampler (samples of one million written words, one million spoken) –BNC Baby (four one-million-word samples from four different genres: academic, fiction, newspaper, conversation)

3 BNC-XML (published last week). Single user licence: £60 (£400 for 10) + VAT. NB: requires Windows XP. Network licence: £350 + VAT. Free online service with limited query options. Prices valid till 01/06/2007

4 The BNC consortium: Oxford University Press; Addison-Wesley Longman; Larousse Kingfisher Chambers; Oxford University Computing Services; University Centre for Computer Corpus Research on Language, University of Lancaster; British Library Research and Innovation Centre. Funded by the commercial partners, the Science and Engineering Research Council (now EPSRC) and the DTI under the Joint Framework for Information Technology programme. Additional support: British Library and British Academy

5 Selection criteria: written texts. Domain: 75% from informative writing, in roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science, world affairs; 25% from imaginative writing, i.e. literary and creative works

6 Selection criteria: written texts. Medium: 60% from books; 25% from periodicals (newspapers etc.); 5-10% from miscellaneous published material (brochures, advertising leaflets, etc.); 5-10% from unpublished material (personal letters and diaries, essays and memoranda, etc.); <5% from written-to-be-spoken material (political speeches, play texts, broadcast scripts, etc.)

7 Selection criteria: written texts. Time: post-1975; a few imaginative works date back to 1964, because of their continued sales/popularity

8 Further classification criteria: written texts (original criteria). Sample size (number of words) and extent (start and end points); topic or subject of the text; author's name, age, gender, region of origin, and domicile; target age group and gender; "level" of writing (reading difficulty): the more literary or technical a text, the "higher" its level

9 New classification criteria (written texts: derived from Lee 2001): academic writing; non-academic prose and biography; fiction and verse; newspapers; other published written material; unpublished written material

10 Selection criteria: spoken texts. Roughly equal quantities of: demographic (spoken conversation): transcriptions of spontaneous natural conversations made by members of the public; context-governed (other spoken material): transcriptions of recordings made at specific types of meeting and event. The original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archive of the British Library.

11 Spoken texts: demographic. 124 volunteers: males and females of a wide range of ages and social groupings, living in 38 different locations across the UK; similar numbers of men and women, from each age and from each social grouping; conversations recorded unobtrusively over two or three days; permissions obtained after each conversation; participants' age, sex, accent, occupation and relationship recorded, if possible, as classification criteria

12 Spoken texts: context-governed. Four broad categories of social context, with roughly equal quantities of speech: educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials; business events, such as sales demonstrations, trades union meetings, consultations, interviews; institutional and public events, such as sermons, political speeches, council meetings; leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins. Specific type of event as classification criterion

13 BNC-XML: Composition [table; columns: texts, w-units, %; rows: Spoken demographic, Spoken context-governed, Written books/periodicals, Written-to-be-spoken, Written miscellaneous, TOTAL]

14 BNC-XML: same texts and same part-of-speech tagging as BNC World; not checked against original texts/recordings; counts hopefully more accurate: –odd duplicate texts and parts of texts eliminated –text categorisation errors corrected –tokenisation/segmentation errors corrected –multi-word tokens eliminated –non-linguistic and paralinguistic descriptions standardised. Query software improved (Xaira)

15 BNC-XML corpus structure: 1 corpus header –information about corpus and corpus markup; 1 bibliography file –information about single text documents; 4049 text documents

16 BNC-XML document structure: the header (teiHeader); the text: wtext (written) or stext (spoken)

17 BNC-XML: header Textual metadata –file description –source (bibliographic data) –selection and classification categorisations –participant data (speech) –other things …

18 Header of text A00 (markup stripped): [ACET factsheets & newsletters]. Sample containing about 6688 words of miscellanea (domain: social science). Data capture and transcription: Oxford University Press. 6688 tokens; 6708 w-units; 423 s-units. Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium. This material is protected by international copyright laws and may not be copied or redistributed in any way. A00. [ACET factsheets & newsletters]. Aids Care Education & Training, London. W nonAc: medicine. Health; Sex …

19 BNC-XML: text elements: wtext or stext; div = section; p = paragraph or u = utterance; s = sentence; w = word and c = punctuation; also: head, note, caption, event, gap, vocal … Word attributes: –c5 = CLAWS5 tag –pos = part-of-speech –hw = headword (lemma)
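The element and attribute names above can be read with any XML parser. A minimal sketch in Python, assuming a small hand-made fragment: the sample string, the tag values in it and the helper names `tokens` and `plain_text` are all illustrative, not copied from the corpus or from Xaira.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the element and attribute names listed above.
# The c5/pos/hw values are illustrative only.
SAMPLE = """
<wtext>
  <div>
    <head><s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET </w><w c5="DTQ" hw="what" pos="PRON">WHAT </w><w c5="VBZ" hw="be" pos="VERB">IS </w><w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head>
    <p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s></p>
  </div>
</wtext>
"""


def tokens(root):
    """Yield (form, c5, pos, headword) for every w element, in document order."""
    for w in root.iter("w"):
        yield (w.text or "").strip(), w.get("c5"), w.get("pos"), w.get("hw")


def plain_text(sentence):
    """Rebuild the readable sentence by joining the content of its w and c elements."""
    return "".join(sentence.itertext()).strip()


root = ET.fromstring(SAMPLE)
rows = list(tokens(root))
sentences = [plain_text(s) for s in root.iter("s")]

for row in rows:
    print(row)
for sentence in sentences:
    print(sentence)
```

The same tree gives both views: the tagged token list and the plain running text.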

20 FACTSHEET WHAT IS AIDS ? AIDS ( Acquired Immune Deficiency Syndrome ) is a condition caused by a virus called HIV ( Human Immuno Deficiency Virus ). … …

21 OK, so what? You don't have to see it like that. Do you prefer this? FACTSHEET WHAT IS AIDS? AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus). But the markup is what enables you to (a) see it like this; (b) do interesting things, e.g. –distinguish aids=SUBST from aids=VERB, aids=NN1 from aids=NN2 –distinguish occurrences in writing from ones in speech –distinguish occurrences in headings from ones in text paragraphs
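The three distinctions can be sketched as a single count keyed on text mode, containing element and word class. This is a rough illustration, not how Xaira does it: the `corpus` wrapper element, the sample sentences and the function name `count_form` are assumptions, and the sketch assumes that head, p and u elements do not nest inside one another.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-text sample; the <corpus> wrapper and all tag values are illustrative.
SAMPLE = """
<corpus>
  <wtext>
    <head><s><w c5="NN1" hw="aids" pos="SUBST">AIDS</w></s></head>
    <p><s><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AJ0" hw="serious" pos="ADJ">serious</w></s>
       <s><w c5="NN2" hw="aid" pos="SUBST">Aids </w><w c5="VVD" hw="arrive" pos="VERB">arrived</w></s></p>
  </wtext>
  <stext>
    <u><s><w c5="PNP" hw="it" pos="PRON">it </w><w c5="VVZ" hw="aid" pos="VERB">aids </w><w c5="NN1" hw="digestion" pos="SUBST">digestion</w></s></u>
  </stext>
</corpus>
"""


def count_form(root, form):
    """Count occurrences of a word form by (mode, container, pos, c5)."""
    counts = Counter()
    for mode in ("wtext", "stext"):            # writing vs speech
        for text in root.iter(mode):
            for container in ("head", "p", "u"):  # heading, paragraph, utterance
                for block in text.iter(container):
                    for w in block.iter("w"):
                        if (w.text or "").strip().lower() == form:
                            counts[(mode, container, w.get("pos"), w.get("c5"))] += 1
    return counts


root = ET.fromstring(SAMPLE)
counts = count_form(root, "aids")
for key, n in sorted(counts.items()):
    print(key, n)
```

Without the markup, all four occurrences of "aids" in the sample would be indistinguishable.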





26 Out of date? New/obsolete text types: –web pages / blogs –SMS –personal letters. New/obsolete topics: –globalization –internet –Elvis –Word Perfect. New/obsolete language (especially in speech age groups)

27 Out of date? Results always need interpreting, bearing in mind the composition of the corpus. There aren't many alternatives: –Web-as-corpus: 85% of written texts aren't on the web, and spoken texts? –Results from monitor corpora non-replicable –Copyright permissions unrepeatable. Surprising how few things don't occur in the BNC. Quantitative/qualitative evaluations will arrive …

28 Xaira: XML-aware indexing and retrieval application. Not corpus- or language-specific. Comes free with BNC-XML. Lots of alternatives, but not many that allow use of XML markup (Zurich, CWB)

29 Xaira improvements: standalone use (Windows XP); frequency/distribution data by text mode/class; query by text mode/class; exclude occurrences in headers (Oxford); easier use of word-class data: –Lemma queries (e.g. hw="be") –Addkey queries (e.g. pos="VERB") –Lemma collocations, POS colligations. Oh, and you can examine the whole text if you want
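To show what a lemma query and a word-class query select, here is a rough Python equivalent using attribute predicates on the w elements. It is not Xaira's query language or API: the function name `query`, the sample sentences and the tag values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; tag values are illustrative only.
SAMPLE = """
<p>
  <s><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VBD" hw="be" pos="VERB">was </w><w c5="DPS" hw="he" pos="PRON">his </w><w c5="NN1" hw="aim" pos="SUBST">aim </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="AV0" hw="there" pos="ADV">there</w></s>
  <s><w c5="PNP" hw="they" pos="PRON">They </w><w c5="VVB" hw="aim" pos="VERB">aim </w><w c5="AV0" hw="high" pos="ADV">high</w></s>
</p>
"""


def query(root, **attrs):
    """Return the word forms whose attributes match all the given values."""
    # Build a path such as .//w[@hw='be'][@pos='VERB'] from the keyword arguments.
    predicates = "".join(f"[@{name}='{value}']" for name, value in attrs.items())
    return [(w.text or "").strip() for w in root.iterfind(".//w" + predicates)]


root = ET.fromstring(SAMPLE)
print(query(root, hw="be"))               # every form of the lemma "be"
print(query(root, pos="VERB"))            # every verb
print(query(root, hw="aim", pos="VERB"))  # "aim" as a verb only
```

A lemma query retrieves all inflected forms at once, and combining hw with pos separates the noun from the verb.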

30 What can you do with it in teaching/learning? And why should you want to? A few examples

31 A problem-solving and problem-discovering resource for: preparation (teacher); classroom use (teacher/learner); self-study (learner). Complements (and corrects) intuition. Increases learner autonomy. Critiques the myth of the native speaker

32 The ins and outs of autonomous use: focus on patterns which recur, without necessarily trying to explain all the data (patterns not rules); notice memorable instances; DON'T overgeneralise, take something you can use; look for –collocations –colligations (including position in structural unit: Hoey) –semantic preferences –semantic prosodies/pragmatic associations (apathetic) –associations with particular genres/domains; be curious: browse the context, investigate exceptions

33 What are "ins and outs"? 50 occurrences, sort left 2. Colligation: (all) the ins and outs of. Semantic preference: know/learn/understand/keep up with/get to grips with/get down to/forget; explain/teach/guide through/give/look at. Semantic prosody: difficulty(?). Analysis: mainly spoken conversation, but numbers too small for reliable inference
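The concordance-and-sort step can be sketched in a few lines of Python. The three sentences are invented, the function names `kwic` and `sort_left` are hypothetical, and "sort left 2" is read here as sorting on the second word to the left of the node, which is an assumption about the slide's shorthand.

```python
from collections import Counter

# Invented sample sentences, already tokenised on whitespace.
TEXT = (
    "You need to know all the ins and outs of the trade . "
    "She explained the ins and outs of tax law . "
    "Nobody understands all the ins and outs of it ."
)


def kwic(tokens, phrase, width=3):
    """Return (left context, node, right context) for each occurrence of phrase."""
    n = len(phrase)
    lowered = [t.lower() for t in tokens]
    hits = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == phrase:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            hits.append((left, tokens[i:i + n], right))
    return hits


def sort_left(hits, position):
    """Sort hits on the word `position` places to the left of the node (1 = adjacent)."""
    def key(hit):
        left = hit[0]
        return left[-position].lower() if len(left) >= position else ""
    return sorted(hits, key=key)


hits = kwic(TEXT.split(), ["ins", "and", "outs"])
ordered = sort_left(hits, 2)
for left, node, right in ordered:
    print(f"{' '.join(left):>25} [{' '.join(node)}] {' '.join(right)}")

# Tally the word immediately left and immediately right of the node.
left1 = Counter(left[-1].lower() for left, _, _ in hits)
right1 = Counter(right[0].lower() for _, _, right in hits)
```

Sorting brings recurrent left contexts together, and the two tallies show the fixed frame "the ins and outs of" at a glance.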



36 Grammar: the aim to do or the aim of doing? aim (NN1): aim n. has the same frequency as aim v. aim + of / to. to: with possessive (POS/DPS) + be; or of NP. of: with the aim of VVG [good things!]. chief/main (is to), stated








44 aim + to colligation: possessive/the aim BE to INF semantic prosody: positively evaluated outcome (cf right collocates - next slide)


46 aim + of







53 with the aim of V+ing (colligation) main/sole/stated/specific (semantic preference)
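A colligation such as "aim of V+ing" is a pattern over tags as well as words, so it can be searched on (form, CLAWS5 tag) pairs. A minimal sketch, assuming invented example tokens and a hypothetical function name `aim_of_ving`; the tags assigned in the sample are illustrative.

```python
# Invented sample of (form, c5) pairs; the tag assignments are illustrative.
TAGGED = [
    ("with", "PRP"), ("the", "AT0"), ("stated", "AJ0"), ("aim", "NN1"),
    ("of", "PRF"), ("improving", "VVG"), ("safety", "NN1"), (".", "PUN"),
    ("with", "PRP"), ("the", "AT0"), ("aim", "NN1"), ("of", "PRF"),
    ("reducing", "VVG"), ("costs", "NN2"), (".", "PUN"),
    ("the", "AT0"), ("aim", "NN1"), ("of", "PRF"), ("the", "AT0"),
    ("project", "NN1"),
]


def aim_of_ving(tagged):
    """Find 'aim of' followed by a VVG token; return (premodifier, verb) pairs."""
    results = []
    for i in range(len(tagged) - 2):
        (w0, _), (w1, _), (w2, t2) = tagged[i], tagged[i + 1], tagged[i + 2]
        if w0.lower() == "aim" and w1.lower() == "of" and t2 == "VVG":
            # Record an adjective directly before "aim", if there is one.
            prev_word, prev_tag = tagged[i - 1] if i > 0 else ("", "")
            modifier = prev_word.lower() if prev_tag == "AJ0" else None
            results.append((modifier, w2.lower()))
    return results


matches = aim_of_ving(TAGGED)
print(matches)
```

The third "aim of" in the sample is followed by a noun phrase, so it is left out; the collected premodifiers are the raw material for the semantic preference noted above.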

