1 The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals
Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c), Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a)
(a) The Centre for Speech, Language and the Brain, University of Cambridge
(b) The MRC Cognition and Brain Sciences Unit, Cambridge
(c) Computation, Cognition and Language Group, RCEAL, University of Cambridge
Notes: Who I am. What I am going to talk about. Brain damaged, cookie theft & spontaneous speech. What the EPSRC grant
2 Acknowledgments
This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G ).
3 Outline of talk
Motivation for corpus
Data collection
Transcription guidelines
4 Motivation
To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients.
The brain-damaged patients have mainly left-lateral damage (known speech processing areas).
Desire to characterise speech output in these populations.
This characterisation has not been done before with respect to language generation.
Notes: X3 after the xxx, and what disciplines will be interested.
5 Description of corpus
The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description.
Brief statistics: 232 healthy individuals, 110 patients, ≈ 23 hours of speech, ≈ 15000 ‘sentences’.
Spontaneous speech task: 10-minute semi-prompted monologue.
Notes: Aim for ten minutes; brain-damaged patients don’t always get there. Outline of questions (one past, one describe, and a xxx).
6 The ‘cookie-theft’ picture
From the Boston Diagnostic Aphasia Examination (Goodglass & Kaplan, 1983).
Notes: Visual cue. No speaking. Constrain the context of the speech. Elicit particular structures of words.
7 Participants
Healthy individuals: volunteers, part of a wider panel recruited for other behavioural and neuro-imaging studies.
Patients: aetiology is varied but damage is mainly left lateralised; patients were selected from a number of sources; neuro-imaging scans are available for a third, and growing.
Gender balanced.
Notes: Aetiology is the origin of impairment.
8 Participants
Notes: Balance. From xxx, so what we can get. Gap. Older brain damage. It is still being added to; a growing resource.
9 The recordings
For healthy individuals: recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio.
For patients: sometimes at their home, normally with a family member present.
10 Transcription
Producing a machine-parseable transcription:
XML based
Retain prosodic information as far as possible
Paying special attention to speech phenomena (repetitions, hesitations, false starts)
Comparable corpora and existing guidelines
Notes: Praat; speech phenomena.
11 DTD-validated XML
Meta & participant data
Interview transcription
12 Outline of the transcription schema
Meta-data:
Gender
Age
Aetiology
Type of damage
Broad location of damage
Date of recording
Who was in the room
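As a rough illustration of how the meta-data fields listed on this slide might be held in a program, here is a minimal Python sketch. The class name, field names, date format and example values are all hypothetical; the slides do not give the actual element or attribute names used in the corpus schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class ParticipantMeta:
    """Hypothetical container for the meta-data fields named on the slide."""
    gender: str
    age: int
    recording_date: str                      # assumed to be an ISO date string
    aetiology: Optional[str] = None          # assumed None for healthy individuals
    damage_type: Optional[str] = None
    damage_location: Optional[str] = None    # broad location only
    people_present: List[str] = field(default_factory=list)

    def is_patient(self) -> bool:
        # Assumption: a record that carries an aetiology describes a patient.
        return self.aetiology is not None


# Invented example values, not taken from the corpus.
healthy = ParticipantMeta(gender="F", age=70, recording_date="2009-01-01",
                          people_present=["interviewer"])
patient = ParticipantMeta(gender="M", age=58, recording_date="2009-01-02",
                          aetiology="stroke", damage_type="lesion",
                          damage_location="left hemisphere",
                          people_present=["interviewer", "family member"])
```

Keeping the damage-related fields optional lets one record type cover both populations, which is one possible design; the real schema may separate them.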
13 Structural units
Utterance: “And I’ve been in my van uhuh but i’ve been out all day”
Segment: “(The kiddies are taking biscuits)(now one of them is falling off)”
Sub-segment: “(erm)(mum)(washing up)”
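The bracket notation in these examples can be unpacked mechanically. The sketch below assumes that each pair of round brackets delimits exactly one unit and that brackets are never nested; this is a reading of the slide's notation only, and the function name is hypothetical. How segments are actually encoded in the corpus XML is not shown in the slides.

```python
import re
from typing import List


def split_units(text: str) -> List[str]:
    """Return the bracketed units in `text`, in order of appearance."""
    # Match the content of each non-nested (...) group.
    return [unit.strip() for unit in re.findall(r"\(([^()]*)\)", text)]


segments = split_units(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
sub_segments = split_units("(erm)(mum)(washing up)")
```

Here `segments` holds the two segment strings and `sub_segments` holds `erm`, `mum` and `washing up` as separate items.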
14 Representing the nature of speech
Rep tag: “it is <rep no="1">is</rep> <rep no="2">is</rep> falling over”
‘…’ incompleteness: “oh dear the sink is ... and oh my the children”
Unclear tag etc.: “and <unclear reason="ambiguous">taps</unclear> running”
Suprasegmental features: shifts, laughing, language change, etc.
Notes: Go through unclear tag!
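To show why inline tags of this kind are convenient for machine processing, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The wrapper element `<u>` and both function names are hypothetical, and attribute values are quoted so that the strings are well-formed XML. It counts `rep` elements and recovers a "fluent" reading with the repeated material dropped.

```python
import xml.etree.ElementTree as ET

# <u> is a hypothetical wrapper; the slide shows only the inline tags.
SAMPLE = '<u>it is <rep no="1">is</rep> <rep no="2">is</rep> falling over</u>'


def count_repetitions(utterance_xml: str) -> int:
    """Number of <rep> elements anywhere in the utterance."""
    root = ET.fromstring(utterance_xml)
    return len(root.findall(".//rep"))


def fluent_text(utterance_xml: str) -> str:
    """Utterance text with top-level <rep> content dropped.

    Other inline elements (e.g. <unclear>) keep their text.
    Whitespace is normalised to single spaces.
    """
    root = ET.fromstring(utterance_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "rep":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")   # text following the element
    return " ".join("".join(parts).split())


print(count_repetitions(SAMPLE))  # 2
print(fluent_text(SAMPLE))        # it is falling over
```

The same function leaves the `unclear` example readable as "and taps running", since only `rep` content is removed.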
15 Phonological information
“The sink is <tr target="flooding">blAdin</tr>”
IPA transcriptions
Anonymisation: all personal names/places replaced with reference markers
Misc: kinetic, vocal, incident, etc.
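The `tr` example can be read as pairing what was actually said (the element content) with the intended word (the `target` attribute); that reading is an assumption based on the single example on the slide. Under that assumption, a minimal Python sketch for pulling out such pairs and for rebuilding the intended wording could look like this. The wrapper element `<u>` and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# <u> is a hypothetical wrapper; the slide shows only the <tr> element.
SAMPLE = '<u>The sink is <tr target="flooding">blAdin</tr></u>'


def produced_vs_target(utterance_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (produced form, target word) for every <tr> element."""
    root = ET.fromstring(utterance_xml)
    return [(tr.text or "", tr.get("target")) for tr in root.iter("tr")]


def intended_text(utterance_xml: str) -> str:
    """Utterance with each top-level <tr> replaced by its target word."""
    root = ET.fromstring(utterance_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "tr":
            parts.append(child.get("target") or "")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")   # text following the element
    return " ".join("".join(parts).split())


print(produced_vs_target(SAMPLE))  # [('blAdin', 'flooding')]
print(intended_text(SAMPLE))       # The sink is flooding
```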
16 The next phase
On the corpus:
Addressing the gap in ages (between 25 and 63 years) for healthy individuals with the cookie-theft task
Addressing the shortfall within each aetiology
Work derived from the corpus:
Identifying ages based on the cookie-theft description
Identifying damage based on the tasks
Speech production issues more generally
17 References
Harold Goodglass and Edith Kaplan (1983). Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.
18 Thank you
Any questions?
The data set is not yet available as it is a work in progress, but it will be released in the future with audio, annotations and brain scans.