Investigating speech, thought and writing presentation in a corpus of spoken British English. An AHRB-funded project under the supervision of Mick Short, Elena Semino and Tony McEnery. Research Assistants: John Heywood and Dan McIntyre.
Project outline To compare speech, thought and writing presentation in spoken and written English. To build a new corpus of 260,000 words of spoken British English to compare with the ST&WP Written English Corpus ( ). To investigate the presentation of speech, thought and writing in the ST&WP Spoken Corpus by tagging with the Leech and Short (1981) category set. To further test and adapt the Leech and Short (1981) model of S&TP. The project is funded until February 2003.
Construction of the corpus 120 texts - approximately 260,000 words. Texts rich in ST&WP taken from the British National Corpus (BNC) and the Centre for North West Regional Studies (CNWRS) oral history archives at Lancaster University. CNWRS interview tapes digitised to be time-aligned with text.
Number and distribution of NWRS files in the corpus
NWRS Archive: Family and Social Life Archive; Childhood and Schooling Archive. Speakers: Male, Female.
Records per cell (age-range labels not preserved): (first count missing), 7, 8, 8, 15, 15
i.e. 60 files with an equal balance of male and female speakers in each age-range
Number and distribution of BNC files in the corpus
BNC spoken data: Spoken Demographic; Spoken Context-Governed. Speakers: Male, Female.
Files per cell: 5 (12 cells)
i.e. 60 files with an equal balance of male and female speakers in each age-range
The development of the tag-set
Leech & Short (1981):
NRA  NRSA  NRS/IS  FIS  NRS/DS  FDS
     NRTA  NRT/IT  FIT  NRT/DT  FDT
The ST&WP Written Project (1995…):
N  NV  NRSA-P  NRS/IS  FIS  NRS/DS  FDS
N  NI  NRTA-P  NRT/IT  FIT  NRT/DT  FDT
N  NW  NRWA-P  NRW/IW  FIW  NRW/DW  FDW
3 main genres: Fiction, Biography & Autobiography, and Newspaper Journalism, each divided into Serious/Popular sections.
Attributes: embedded, hypothetical, inferred, quote
The development of the tag-set: new tags
The ST&WP Spoken Project (2001): BNC spoken demographic data and NWRS oral history interviews
RM
A  RV  RSA-P  RS/IS  FIS  RS/DS  FDS
A  RI  RTA-P  RT/IT  FIT  RT/DT  FDT
A  RN  RWA-P  RW/IW  FIW  RW/DW  FDW
Attributes: embedded, negative / absence, hypothetical, inferred, quote, reiterated, interrogative, imperative, uncompleted, 2 / 3 / 4
A 15-field tag-set: 5 main categories
FIELD  CHARACTER               VALUE
1      x, A, F                 Anything!, Free
2      x, #, R, I, D           Representation, Indirect, Direct
3      x, S, T, W, V, I, N, M  Speech, Thought, Writing, Voice, Internal state, WritiNg, Mention
4      x, A                    Act
5      x, P                    toPic
A 15-field tag-set: 10 category attributes
FIELD  CHARACTER         VALUE
6      x, #, 1, 2, 3, 4  # = odd, interesting, borderline cases; numbers = repeated (-ing or -ed) adjacent categories
7      xe                embedded
8      xxg/a             negative action etc., e.g. 'we weren't allowed to go'; absence, e.g. 'I didn't say anything'
9      xxxh              hypothetical
10     xxxxi             inferred
11     xxxxxq            quote
12     xxxxxxr           iterative
13     xxxxxxxv/p        interrogative, imperative
14     xxxxxxxxu         uncompleted
15     xexxxxxxx2        level of embedding (2, 3, 4)
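To make the 15-field layout concrete, here is a minimal sketch of how such a tag might be decoded. It is not the project's own tooling. It assumes a tag is stored as a 15-character string with one character per field and 'x' for an unset field, which the slides suggest but do not state. The function name `decode_tag`, the `FIELDS` table, and the readings g = negative, a = absence, v = interrogative, p = imperative are assumptions made for illustration.

```python
# Hypothetical decoder for a 15-field ST&WP tag.
# Assumption: one character per field, 'x' = field not set.
# Field values follow the two tag-set slides; the meaning of '#' in field 2
# is not given there, so it is left unlabelled.

FIELDS = {
    1: {"A": "Anything", "F": "Free"},
    2: {"#": "# (meaning not given)", "R": "Representation",
        "I": "Indirect", "D": "Direct"},
    3: {"S": "Speech", "T": "Thought", "W": "Writing", "V": "Voice",
        "I": "Internal state", "N": "WritiNg", "M": "Mention"},
    4: {"A": "Act"},
    5: {"P": "toPic"},
    6: {"#": "odd/interesting/borderline case",
        **{str(n): f"repeated adjacent category {n}" for n in (1, 2, 3, 4)}},
    7: {"e": "embedded"},
    8: {"g": "negative", "a": "absence"},           # assumed split of 'g/a'
    9: {"h": "hypothetical"},
    10: {"i": "inferred"},
    11: {"q": "quote"},
    12: {"r": "iterative"},
    13: {"v": "interrogative", "p": "imperative"},  # assumed split of 'v/p'
    14: {"u": "uncompleted"},
    15: {str(n): f"embedding level {n}" for n in (2, 3, 4)},
}


def decode_tag(tag: str) -> dict[int, str]:
    """Return {field number: value} for every field that is not 'x'."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    decoded = {}
    for field, char in enumerate(tag, start=1):
        if char == "x":
            continue  # field not set
        if char not in FIELDS[field]:
            raise ValueError(f"unexpected character {char!r} in field {field}")
        decoded[field] = FIELDS[field][char]
    return decoded


if __name__ == "__main__":
    # RSA-P written out as a full 15-character tag (hypothetical example).
    print(decode_tag("xRSAP" + "x" * 10))
    # Direct speech, embedded at level 2 (hypothetical example).
    print(decode_tag("xDSxx" + "xe" + "x" * 7 + "2"))
```

A positional string keeps every tag the same length, so attributes such as embedding can be searched for by character position. The cost is that most positions in any given tag are 'x'.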
Issues arising Technical issues: Legibility. Comparability between NWRS and BNC data. Tagging issues: Comparability between written and spoken corpora. What counts as ST&WP? Functional and formal criteria. Embedding. Repetition (e.g. he said he said well he said). Report of mention. Reading, hearing, listening and singing dogs!