Lou Burnard
Humanities Computing Unit
Oxford University Computing Services
The British National Corpus: where did we go wrong?



What is the BNC?
• 100 million words of modern British English
• produced by a consortium of dictionary publishers and academic researchers
  • OUP, Longman, Chambers
  • Oxford, Lancaster, British Library
• funded as a pre-competitive resource by DTI/SERC under JFIT

Where did we go wrong?
• (if we did)
• or, The Benefit of Hindsight
• or, If I'd known then what I know now...
• or, Wisdom After the Event
• And, Where Do We Go From Here?

Production of the BNC
• took three years (at least)
• cost GBP 1.6 million (at least)
• came about through an unusual coincidence of interests amongst:
  • Lexicographical publishers
  • Government (DTI)
  • Science and Engineering Research Council

The Neotenous Nineties
• WinWord or WP5? the choice is yours
• On your desk... a 386 with 50 Mb disk space (just about enough to run Windows 3)
• In your lab... a VAX or a Sparc for serious work
• On the WWW (maybe)... Mosaic for X

Intellectual currents
• corpus linguistics
  • the LOB school
  • the Birmingham school
  • the LDC view
• text encoding theory
• language engineering
• the JFIT mentality, or Reconciling Town and Gown

Stated Project Goals
• A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production
• of non-opportunistic design, for generic applicability
• with word class annotation
• and contextual information

Actual (?) project goals
• Better ELT dictionaries
  • authoritative
  • both speech and writing
• A model for European corpus work
  • design, and encoding
• Industrial-academic co-operation
• A REALLY BIG corpus

Consequences
• industrial scale text production system
• compromises in design and execution
• IPR and profitability
The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy.

The BNC “sausage machine”
[Diagram: production pipeline. Stages: Selection, clearance, and capture; Enrichment and encoding; Documentation, distribution, maintenance. Boxes: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS).]

Task groups
• permissions
• selection, design criteria
• encoding and markup
• enrichment and annotation
• retrieval software

Through-put (million words/quarter)

Tensions
• desire to test annotation scheme
• requirement to meet deliverables
• slipping goal posts
• quantity above quality
• … an interesting learning experience for both sides!

BNC Selection Criteria
• Written selection criteria
  • predefined proportions of
    • different media (books, newspapers, unpublished…)
    • different domains (informative, entertaining…)
  • maximum sample size words
  • all texts incomplete
• Spoken selection criteria
  • context-governed
  • demographically-sampled

Word tagging
The Queen ‘s real annus horribilis began Sunday.
• word-pos pair
• white space problems
• validation problems
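To make the "word-pos pair" and "white space problems" bullets concrete, here is a minimal Python sketch. It assumes each token is stored as a (word, tag) pair and that the possessive 's is split off as its own token, as the spacing in the example sentence suggests. The tag names are illustrative guesses in the style of a word-class tagset, not copied from the corpus, and `naive_join` and `detokenize` are hypothetical helpers.

```python
# Illustrative only: these tags are assumed for the example,
# not taken from the corpus annotation itself.
tagged = [
    ("The", "AT0"),
    ("Queen", "NP0"),
    ("'s", "POS"),
    ("real", "AJ0"),
    ("annus", "NN1"),
    ("horribilis", "NN1"),
    ("began", "VVD"),
    ("Sunday", "NP0"),
    (".", "PUN"),
]


def naive_join(pairs):
    """Join every token with a single space, ignoring the tags."""
    return " ".join(word for word, _ in pairs)


def detokenize(pairs):
    """Rebuild the sentence, attaching possessive and punctuation tokens
    to the preceding word instead of inserting a space before them."""
    out = ""
    for word, tag in pairs:
        if out and tag not in {"POS", "PUN"}:
            out += " "
        out += word
    return out


print(naive_join(tagged))  # The Queen 's real annus horribilis began Sunday .
print(detokenize(tagged))  # The Queen's real annus horribilis began Sunday.
```

Joining tokens with single spaces does not give back the original sentence, which is one way the white space problem can show up; recovering it needs rules that depend on the tag.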

Sample written text
CAMRA FACT SHEET No 1
How beer is brewed
Beer seems such a simple drink that we tend to take it for granted.

Transcription practice
• Regionalised typists
• Markup makes explicit
  • changes of speaker and overlap
  • words as perceived by transcriber
  • plus indications of false starts, truncation, uncertainty
  • some performance features e.g. pausing, stage directions etc.
  • speaker details where available (always for respondents, sometimes for others)

Sample spoken text
Mm yes I told Paul that he can bring a lady up at Christmas-time. Is he not going home then ? No and erm I 'm leaving a turkey in the freezer Paul is quite good at cooking standard cooking.

Metadata
• each text has a TEI header
  • identification and classification
  • specific details (e.g. speakers)
  • housekeeping information
• all common data in the corpus header
• classification(s) in header pointed to by individual texts
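One way to picture "classification(s) in header pointed to by individual texts" is a shared lookup table plus per-text lists of pointers. The Python sketch below is only an illustration under stated assumptions: the category ids, labels, and text ids are invented, and plain dictionaries stand in for the TEI header structures.

```python
# Hypothetical ids and labels, invented for illustration only.
# The corpus header holds each category definition once.
CORPUS_HEADER = {
    "medium.book": "Written: book",
    "medium.periodical": "Written: periodical",
    "type.demographic": "Spoken: demographically sampled",
}

# Each text header stores pointers to categories, not the definitions.
TEXTS = {
    "T001": ["medium.book"],
    "T002": ["medium.periodical"],
    "T003": ["type.demographic"],
}


def describe(text_id):
    """Resolve a text's category pointers against the shared corpus header."""
    return [CORPUS_HEADER[cat] for cat in TEXTS[text_id]]


def select(category):
    """Return the ids of all texts that point at the given category."""
    return sorted(tid for tid, cats in TEXTS.items() if category in cats)


print(describe("T003"))       # ['Spoken: demographically sampled']
print(select("medium.book"))  # ['T001']
```

Keeping the definitions in one place means a label can be corrected once for every text that points at it, and the same pointers can be used to pick out a set of texts by category.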

Text classifications
• spoken texts
  • age, sex, class (of respondent)
  • domain, region, type
• written texts
  • author age, sex, type
  • audience, circulation, status
  • medium, domain
• Intention was to improve coverage, not accessibility

In retrospect…
• Some classifications were poorly defined and only partially populated
  • Domain or text-type?
  • Dating: date of copy? first publication?
  • Author age: when?
  • Author ethnic origin, domicile

That famous BNC balance BNC-1

That famous BNC balance BNC-W

Written Domains BNC-1

Written Domains BNC-2

Written Domains

Spoken domains

Availability
• BNC end-user licence
  • commercial exploitation of the corpus is forbidden
  • commercial exploitation of derived works is permitted
• OUCS is sole agent for licensing, reporting to Consortium
• Original restriction to EU has been lifted

Distribution methods
• 100 million words is (still) a lot of data
• IPR agreements imply not-for-profit distribution
  • (which has its downsides too)
• The options are...
  • install it yourself
  • online access
  • the sampler

Install it yourself (version 1)
• You need...
  • £220 for a licence and 3 CDs
  • £2000 for a Unix box with min 6 Gb disk
  • some Unix expertise
• You get...
  • access to the whole corpus
  • using the tools of your choice
  • configurable for a local network
Version 2 will be delivered to run “standalone” on a suitably configured PC

BNC Online service
• You need...
  • access to the Internet
• You get...
  • free (but limited) access using any web browser
  • free (temporary) access using SARA (PC only)
  • for an annual fee, SARA plus documentation

Accesses per month

The BNC Sampler
• You need...
  • $50 for a CD
  • A PC with a CD drive and (preferably) 90 Mb disk space
• You get...
  • 2% sample, half written, half spoken
  • four different search engines
  • documentation
Available at this conference, at a special price!!

The BNC World Edition (aka BNC2)
• has IPR clearance for world usage (we lose about 50 texts)
• extensive set of revisions and corrections
• catching up with the standards
• accompanied by a new, enhanced version of SARA
… and it’s nearly ready (honest)

Error correction issues
• Nothing can be added
• Catching up with the standards
  • CDIF … TEI … EAGLES … CES …
  • headers are now in TEI-conformant XML
• Indeterminacy of any transcription
  • On the scale of the BNC, especially
• If seven maids with seven mops…

Error Corrections in BNC2
• POS correction
  • Systematic
  • uses improved rules derived from BNC Sampler
  • significantly reduced error rate and indeterminacy
• Major production errors fixed
  • Semi-systematic
  • duplicate texts
  • wrongly labelled texts
  • participant details
  • classification errors and lacunae
• Typos remain... and will do so!

The BNC as an Open Corpus
• We chose SGML to encourage development of other tools
• This is coming more slowly than we expected, e.g. the Sampler
• But people still think the BNC and SARA are the same thing

New features in SARA
• POS code searches
• Collocation searches
• Subcorpora
• Lemmatization rules
• Usable with any TEI conformant corpus

What lessons have we learned?
• know your audience
• technological blind spots
• missed opportunities

Know your audience
• Everyone knows you should research the market first...
  • small, specialist research community, lexicographers
• The actual market is immense:
  • language learners
  • applied linguists
  • cultural historians
• and technically unsophisticated
  • hence often misled or disappointed

Technological blind spots
• we didn't expect the XML revolution!
  • so we wasted time in format conversion and compromises
• we didn't foresee PCs with 8 Gb disks and sound cards!
  • so we didn't try to get rights to the audio
  • and we focussed efforts on developing a client/server application

Missed opportunities: the R-word
• Original design talks of Representativeness
• This shifted to the idea of the BNC as a "fonds": a source of specialist corpora
• This implies
  • a clearer and agreed taxonomy of text types
  • better access facilities for subcorpora

Missed opportunities: watching the river flow
• The BNC as a monitor corpus
• Diachronic sampling
• But this implies a constant ability to fund and integrate
• How long will we want to study the language of the nineties?
• Will the web provide?