The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals


1 The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals Caroline Williams a, Andrew Thwaites b, Paula Buttery c, Jeroen Geertzen c, Billi Randall a, Meredith Shafto a, Barry Devereux a, Lorraine Tyler a a The Centre for Speech, Language and the Brain, University of Cambridge b The MRC Cognition and Brain Sciences Unit, Cambridge c Computation, Cognition and Language Group, RCEAL, University of Cambridge

2 Acknowledgments This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G ).

3 Outline of talk Motivation for Corpus Data collection Transcription Guidelines

4 Motivation To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients The brain-damaged patients have mainly left-lateral damage (known speech processing areas) Desire to characterise speech output in these populations. This characterisation has not been done before with respect to language generation

5 Description of corpus The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description Brief statistics: 232 healthy individuals, 110 patients, ≈23 hours of speech, ≈15000 ‘sentences’ Spontaneous speech task: 10-minute semi-prompted monologue

6 The ‘cookie-theft’ picture From the Boston Diagnostic Aphasia Examination - Goodglass & Kaplan, 1983

7 Participants Healthy individuals – volunteers who are part of a wider panel recruited for other behavioural and neuro-imaging studies. Patients – aetiology is varied but damage is mainly left lateralised – patients were selected from a number of sources Neuro-imaging scans available for a third, and growing

8 Participants

9 The recordings For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio. For patients: recordings were sometimes made at their home, normally with a family member present

10 Transcription Producing a machine-parseable transcription – XML based – retain prosodic information as far as possible – Paying special attention to speech phenomena (repetitions, hesitations, false-starts) Comparable corpora and existing guidelines

11 DTD validated XML Meta & participant data Interview transcription

12 Outline of the transcription schema Meta-data – Gender – Age – Aetiology – Type of damage – Broad location of damage – Date of recording – Who was in the room
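The meta-data fields listed on this slide could be serialised along the following lines. This is a minimal sketch only: the element names (`meta`, `gender`, `age` and so on) and all values are hypothetical placeholders, since the slides do not show the corpus DTD, and the sketch does not perform DTD validation.

```python
import xml.etree.ElementTree as ET


def build_meta(fields):
    """Build a <meta> element with one child element per meta-data field.

    The element names are illustrative assumptions, not the corpus DTD.
    """
    meta = ET.Element("meta")
    for name, value in fields.items():
        child = ET.SubElement(meta, name)
        child.text = value
    return meta


# Placeholder values for illustration only; not taken from the corpus.
example_fields = {
    "gender": "PLACEHOLDER",
    "age": "PLACEHOLDER",
    "aetiology": "PLACEHOLDER",
    "damage_type": "PLACEHOLDER",
    "damage_location": "PLACEHOLDER",
    "recording_date": "PLACEHOLDER",
    "people_present": "PLACEHOLDER",
}

meta = build_meta(example_fields)
xml_text = ET.tostring(meta, encoding="unicode")
print(xml_text)
```

Keeping one element per field makes each field easy to check against a DTD, although the real schema may well use attributes instead.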

13 Structural units – Utterance “And I’ve been in my van uhuh but i’ve been out all day” – Segment “(The kiddies are taking biscuits)(now one of them is falling off)” – Sub-segment “(erm)(mum)(washing up)”
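The nesting of segments inside an utterance can be sketched as follows. The sketch assumes that the parentheses in the slide examples mark segment boundaries and that the units map onto hypothetical `utterance` and `segment` elements; the actual tag names in the corpus are not given in the slides.

```python
import re
import xml.etree.ElementTree as ET


def utterance_to_xml(bracketed):
    """Turn '(seg one)(seg two)' into <utterance><segment>...</segment>...</utterance>.

    Tag names are assumptions made for illustration.
    """
    # Each non-nested parenthesised group is treated as one segment.
    segments = re.findall(r"\(([^()]*)\)", bracketed)
    utterance = ET.Element("utterance")
    for text in segments:
        segment = ET.SubElement(utterance, "segment")
        segment.text = text
    return utterance


utt = utterance_to_xml(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
segment_texts = [seg.text for seg in utt]
print(ET.tostring(utt, encoding="unicode"))
```

Sub-segments could be handled the same way, one level further down, by applying the same split to the text of each segment.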

14 Representing the nature of speech – Rep tag “it is is is falling over” – ‘…’ incompleteness “oh dear the sink is... and oh my the children” – Unclear tag etc. “and taps running” Suprasegmental features – Shifts Laughing Language change etc
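As an illustration of the rep tag, immediate word repetitions such as the one in "it is is is falling over" could be marked automatically. This is a sketch under assumptions: the `<rep>` element name and the choice to wrap every repeated word except the last are hypothetical, and the slides do not say whether repetitions were tagged by hand or by script.

```python
from itertools import groupby


def mark_repetitions(utterance):
    """Wrap immediately repeated words in a hypothetical <rep> tag.

    For a run of identical words, all but the last are wrapped, so the
    fluent reading of the utterance stays outside the tag.
    """
    out = []
    # Group consecutive words that are identical ignoring case.
    for _, run in groupby(utterance.split(), key=str.lower):
        words = list(run)
        if len(words) > 1:
            out.append("<rep>" + " ".join(words[:-1]) + "</rep>")
        out.append(words[-1])
    return " ".join(out)


marked = mark_repetitions("it is is is falling over")
print(marked)
```

This only catches single-word repetitions; repeated phrases and false starts would need a more careful treatment.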

15 Phonological information – phonological information “The sink is blAdin ” – IPA transcriptions Anonymisation – All personal names/places replaced with reference markers Misc – Kinetic – Vocal – Incident etc
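The anonymisation step, in which personal names and places are replaced with reference markers, could be sketched as below. The marker format (`[REF1]`, `[REF2]`, ...) and the idea of supplying a list of names to replace are assumptions for illustration, and the example sentence is invented rather than taken from the corpus.

```python
import re


def anonymise(text, names):
    """Replace each listed name with a stable, hypothetical reference marker.

    Returns the anonymised text and the name-to-marker mapping.
    """
    markers = {name: f"[REF{i}]" for i, name in enumerate(names, start=1)}
    if not markers:
        return text, markers
    # Try longer names first so 'Ann Smith' wins over 'Ann'.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return pattern.sub(lambda m: markers[m.group(1)], text), markers


# Invented example sentence with made-up names.
anonymised, mapping = anonymise(
    "Alice went to see Bob and Alice laughed", ["Alice", "Bob"]
)
print(anonymised)
```

Using the same marker for every mention of a name keeps references within a transcript traceable without revealing who is meant.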

16 The next phase On the corpus – Addressing gap in ages for healthy individuals with the cookie-theft task between 25 and 63 years – Addressing shortfall within each aetiology Work derived from the corpus – Identifying ages based on the cookie-theft description – Identifying damage based on the tasks – Speech production issues more generally

17 References Harold Goodglass and Edith Kaplan. 1983. Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.

18 Thank you Any questions?

