1
Lou Burnard
Humanities Computing Unit, Oxford University Computing Services
http://info.ox.ac.uk/bnc/
The British National Corpus: where did we go wrong?
2
What is the BNC?
100 million words of modern British English
produced by a consortium of dictionary publishers and academic researchers
  OUP, Longman, Chambers
  Oxford, Lancaster, British Library
funded as pre-competitive resource by DTI/SERC under JFIT, 1990-1994
3
Where did we go wrong? (if we did)
or, The Benefit of Hindsight
or, If I'd known then what I know now...
or, Wisdom After the Event
And, Where Do We Go From Here?
4
Production of the BNC
took three years (at least)
cost GBP 1.6 million (at least)
came about through an unusual coincidence of interests amongst:
  Lexicographical publishers
  Government (DTI)
  Science and Engineering Research Council
5
The Neotenous Nineties
WinWord or WP5? the choice is yours
On your desk... a 386 with 50 Mb diskspace (just about enough to run Windows 3)
In your lab... a VAX or a Sparc for serious work
On the WWW (maybe)... Mosaic for X
6
Intellectual currents
corpus linguistics
  the LOB school
  the Birmingham school
  the LDC view
text encoding theory
language engineering
the JFIT mentality, or Reconciling Town and Gown
7
Stated Project Goals
A synchronic (1990-4) corpus of samples
both spoken and written
from the full range of British English language production
of non-opportunistic design, for generic applicability
with word class annotation and contextual information
8
Actual (?) project goals
Better ELT dictionaries
  authoritative
  both speech and writing
A model for European corpus work
  design, and encoding
Industrial-academic co-operation
A REALLY BIG corpus
9
Consequences
industrial scale text production system
compromises in design and execution
IPR and profitability
The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy
10
The BNC “sausage machine”
[Diagram] Pipeline boxes: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS)
Stage labels: Selection, clearance, and capture; Enrichment and encoding; Documentation, distribution, maintenance
11
Task groups
permissions
selection, design criteria
encoding and markup
enrichment and annotation
retrieval software
12
Through-put (million words/quarter)
13
Tensions
desire to test annotation scheme
requirement to meet deliverables
slipping goal posts
quantity above quality
… an interesting learning experience for both sides!
14
BNC Selection Criteria
Written selection criteria
  predefined proportions of
    different media (books, newspapers, unpublished…)
    different domains (informative, entertaining…)
  maximum sample size 45,000 words
  all texts incomplete
Spoken selection criteria
  context-governed
  demographically-sampled
15
Word tagging
The Queen 's real annus horribilis began Sunday.
word-pos pair
white space problems
validation problems
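The word-POS pairing on this slide can be illustrated with a minimal Python sketch. It assumes a hypothetical `word_TAG` plain-text notation, not the corpus's actual markup, and the tags are illustrative C5-style labels rather than values copied from the BNC. The function name `parse_tagged` is likewise invented for the example. The sketch also hints at the white space problem: because the clitic 's is a separate token, rejoining tokens with single spaces does not give back the original sentence.

```python
from typing import List, Tuple


def parse_tagged(line: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split a line of word_TAG items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST separator so words containing "_" still work.
        word, found, tag = item.rpartition(sep)
        if not found or not word or not tag:
            raise ValueError(f"not a word{sep}TAG pair: {item!r}")
        pairs.append((word, tag))
    return pairs


# Hypothetical tagging of the slide's example sentence (tags are illustrative).
tagged = ("The_AT0 Queen_NN1 's_POS real_AJ0 annus_NN1 "
          "horribilis_NN1 began_VVD Sunday_NP0 ._PUN")
original = "The Queen's real annus horribilis began Sunday."

pairs = parse_tagged(tagged)
words = [word for word, _ in pairs]

# Naive detokenisation: the spacing of the source text is not recoverable
# from the token list alone.
naive = " ".join(words)
print(naive == original)
```

A validation pass in the same spirit would reject any item that does not split into a non-empty word and a non-empty tag, which is what the `ValueError` branch does here.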
16
Sample written text
CAMRA FACT SHEET No 1
How beer is brewed
Beer seems such a simple drink that we tend to take it for granted.
17
Transcription practice
Regionalised typists
Markup makes explicit
  changes of speaker and overlap
  words as perceived by transcriber
  plus indications of false starts, truncation, uncertainty
  some performance features e.g. pausing, stage directions etc.
  speaker details where available (always for respondents, sometimes for others)
18
Sample spoken text
Mm yes I told Paul that he can bring a lady up at Christmas-time.
Is he not going home then ?
No and erm I 'm leaving a turkey in the freezer
Paul is quite good at cooking standard cooking.
19
Metadata
each text has a TEI header
  identification and classification
  specific details (e.g. speakers)
  housekeeping information
all common data in the corpus header
  classification(s) in header pointed to by individual texts
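The header arrangement described on this slide can be sketched as follows: shared category definitions live once in a corpus header, and each text header only points at them by identifier. This is a minimal Python sketch. The class names, field names, identifiers, and category codes are all hypothetical and do not reproduce the TEI header elements used in the BNC.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorpusHeader:
    # Category definitions stored once for the whole corpus: code -> description.
    categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class TextHeader:
    text_id: str
    title: str
    # Pointers into CorpusHeader.categories, not copies of the definitions.
    category_refs: List[str] = field(default_factory=list)

    def resolve(self, corpus: CorpusHeader) -> List[str]:
        """Look up each referenced category; fail loudly on a dangling pointer."""
        missing = [ref for ref in self.category_refs
                   if ref not in corpus.categories]
        if missing:
            raise KeyError(f"{self.text_id}: undefined categories {missing}")
        return [corpus.categories[ref] for ref in self.category_refs]


# Hypothetical codes and identifiers, for illustration only.
corpus = CorpusHeader(categories={
    "med-book": "medium: book",
    "dom-inf": "domain: informative",
})
text = TextHeader("T001", "How beer is brewed", ["med-book", "dom-inf"])
resolved = text.resolve(corpus)
print(resolved)
```

Keeping definitions in one place means a classification can be corrected once, but it also means every pointer has to be validated, which is why `resolve` raises on an unknown code.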
20
Text classifications
spoken texts
  age, sex, class (of respondent)
  domain, region, type
written texts
  author age, sex, type
  audience, circulation, status
  medium, domain
Intention was to improve coverage, not accessibility
21
In retrospect…
Some classifications were poorly defined and only partially populated
  Domain or text-type?
  Dating: date of copy? first publication?
  Author age: when?
  Author ethnic origin, domicile
22
That famous BNC balance BNC-1
23
That famous BNC balance BNC-W
24
Written Domains BNC-1
25
Written Domains BNC-2
26
Written Domains
27
Spoken domains
28
Availability
BNC end-user licence
  commercial exploitation of the corpus is forbidden
  commercial exploitation of derived works is permitted
OUCS is sole agent for licensing, reporting to Consortium
Original restriction to EU has been lifted
29
Distribution methods
100 million words is (still) a lot of data
IPR agreements imply not-for-profit distribution (which has its downsides too)
The options are...
  install it yourself
  online access
  the sampler
30
Install it yourself (version 1)
You need...
  £220 for a licence and 3 CDs
  £2000 for a Unix box with min 6 Gb disk
  some Unix expertise
You get...
  access to the whole corpus
  using the tools of your choice
  configurable for a local network
Version 2 will be delivered to run “standalone” on a suitably configured PC
31
BNC Online service
You need...
  access to the Internet
You get...
  free (but limited) access using any web browser
  free (temporary) access using SARA (PC only)
  for an annual fee, SARA plus documentation
http://sara.natcorp.ox.ac.uk
32
Accesses per month
33
The BNC Sampler
You need...
  $50 for a CD
  a PC with a CD drive and (preferably) 90 Mb disk space
You get...
  2% sample, half written, half spoken
  four different search engines
  documentation
Available at this conference, at a special price!!
34
The BNC World Edition (aka BNC2)
has IPR clearance for world usage (we lose about 50 texts)
extensive set of revisions and corrections
catching up with the standards
accompanied with new enhanced version of SARA
… and it’s nearly ready (honest)
35
Error correction issues
Nothing can be added
Catching up with the standards
  CDIF … TEI … EAGLES … CES …
  headers are now in TEI-conformant XML
Indeterminacy of any transcription
  On the scale of the BNC, especially
If seven maids with seven mops…
36
Error Corrections in BNC2
POS correction
  Systematic
  uses improved rules derived from BNC Sampler
  significantly reduced error rate and indeterminacy
Major production errors fixed
  Semi-systematic
  duplicate texts
  wrongly labelled texts
  participant details
  classification errors and lacunae
Typos remain... and will do so!
37
The BNC as an Open Corpus
We chose SGML to encourage development of other tools
This is coming more slowly than we expected, e.g. the Sampler
But people still think the BNC and SARA are the same thing
38
New features in SARA
POS code searches
Collocation searches
Subcorpora
Lemmatization rules
Usable with any TEI conformant corpus
39
What lessons have we learned?
know your audience
technological blind spots
missed opportunities
40
Know your audience
Everyone knows you should research the market first...
  small, specialist research community, lexicographers
The actual market is immense:
  language learners
  applied linguists
  cultural historians
and technically unsophisticated
hence often misled or disappointed
41
Technological blind spots
we didn't expect the XML revolution!
  so we wasted time in format conversion and compromises
we didn't foresee PCs with 8 Gb disks and sound cards!
  so we didn't try to get rights to the audio
  and we focussed efforts on developing a client/server application
42
Missed opportunities: the R-word
Original design talks of Representativeness
This shifted to the idea of the BNC as a "fonds": a source of specialist corpora
This implies
  a clearer and agreed taxonomy of text types
  better access facilities for subcorpora
43
Missed opportunities: watching the river flow
The BNC as a monitor corpus
Diachronic sampling
But this implies a constant ability to fund and integrate
How long will we want to study the language of the nineties?
Will the web provide?