1. The Cambridge Cookie-Theft Corpus: A Corpus of Directed and Spontaneous Speech of Brain-Damaged Patients and Healthy Individuals. Caroline Williams (a), Andrew Thwaites (b), Paula Buttery (c), Jeroen Geertzen (c), Billi Randall (a), Meredith Shafto (a), Barry Devereux (a), Lorraine Tyler (a). (a) The Centre for Speech, Language and the Brain, University of Cambridge; (b) The MRC Cognition and Brain Sciences Unit, Cambridge; (c) Computation, Cognition and Language Group, RCEAL, University of Cambridge. Notes: Who I am. What I am going to talk about. Brain damaged, cookie theft & spontaneous speech. What the EPSRC grant
2. Acknowledgments: This work is part of the Computational Natural Language Processing and the Neuro-Cognition of Language (COMPLEX) project, supported by EPSRC (grant EP/F030061/1) and by a Medical Research Council UK grant to LKT (grant G ).
3. Outline of talk: Motivation for corpus; Data collection; Transcription guidelines.
4. Motivation: To look at differences between speech populations: young and old, and healthy individuals and brain-damaged patients. The brain-damaged patients have mainly left-lateral damage (known speech-processing areas). Desire to characterise speech output in these populations. This characterisation has not been done before with respect to language generation. Notes: X3 after the xxx, and which disciplines will be interested.
5. Description of corpus: The finished corpus comprises machine-friendly transcriptions of two speech tasks: spontaneous speech and the cookie-theft picture description. Brief statistics: 232 healthy individuals, 110 patients, ≈23 hours of speech, ≈15000 ‘sentences’. Spontaneous speech task: 10-minute semi-prompted monologue. Notes: Aim for ten minutes; brain-damaged patients don’t always get there. Outline of questions (one past, one describe, and a xxx).
6. The ‘cookie-theft’ picture: Visual cue; no speaking; constrains the context of the speech; elicits particular structures of words. From the Boston Diagnostic Aphasia Examination (Goodglass & Kaplan, 1983).
7. Participants. Healthy individuals: volunteers, part of a wider panel recruited for other behavioural and neuro-imaging studies. Patients: aetiology is varied but damage is mainly left lateralised; patients were selected from a number of sources; neuro-imaging scans are available for a third of them, and growing; gender balanced. Notes: Aetiology is the origin of impairment.
8. Participants. Notes: Balance. From xxx, so what we can get. Gap. Older brain damage. It is still being added to; a growing resource.
9. The recordings: For healthy individuals, recordings were carried out in an isolated environment such as a sound-attenuated interview room. The recordings are stored as uncompressed audio. For patients, recordings were sometimes made at their home, normally with a family member present.
10. Transcription: Producing a machine-parseable transcription: XML based; retains prosodic information as far as possible; pays special attention to speech phenomena (repetitions, hesitations, false starts). Comparable corpora and existing guidelines. Notes: Praat; speech phenomena.
12. Outline of the transcription schema. Meta-data: gender; age; aetiology; type of damage; broad location of damage; date of recording; who was in the room.
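As an illustration only, the meta-data fields listed on this slide could be serialised as an XML header along the following lines. The element names (`meta`, `damage_type` and so on) and the example values are hypothetical; the slides do not show the corpus's actual element names or any real participant record.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, one per meta-data field listed on the slide.
METADATA_FIELDS = (
    "gender",
    "age",
    "aetiology",
    "damage_type",
    "damage_location",
    "recording_date",
    "present_in_room",
)


def build_metadata(values):
    """Build a <meta> element, refusing records with missing fields."""
    missing = [name for name in METADATA_FIELDS if name not in values]
    if missing:
        raise ValueError("missing meta-data fields: " + ", ".join(missing))
    meta = ET.Element("meta")
    for name in METADATA_FIELDS:
        ET.SubElement(meta, name).text = str(values[name])
    return meta


# Placeholder values for illustration; not taken from the corpus.
example = build_metadata({
    "gender": "F",
    "age": 70,
    "aetiology": "placeholder",
    "damage_type": "placeholder",
    "damage_location": "left",
    "recording_date": "2000-01-01",
    "present_in_room": "interviewer",
})
print(ET.tostring(example, encoding="unicode"))
```

Rejecting incomplete records at build time is one possible design choice; it keeps every transcription comparable on the same set of fields.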
13. Structural units: Utterance, Segment, Sub-segment. Utterance: “And I’ve been in my van uhuh but i’ve been out all day”. Segment: “(The kiddies are taking biscuits)(now one of them is falling off)”. Sub-segment: “(erm)(mum)(washing up)”.
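The bracketed examples on this slide suggest that units are delimited by parentheses. A minimal sketch of splitting such a string into its units follows, assuming non-nested parentheses as in the slide's examples; the function name `split_units` is hypothetical and this is not the corpus's own tooling.

```python
import re


def split_units(marked):
    """Return the text of each parenthesised unit, in order of appearance.

    Assumes units are not nested, as in the examples on the slide.
    """
    return [unit.strip() for unit in re.findall(r"\(([^()]*)\)", marked)]


segments = split_units(
    "(The kiddies are taking biscuits)(now one of them is falling off)"
)
sub_segments = split_units("(erm)(mum)(washing up)")
print(segments)
print(sub_segments)
```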
14. Representing the nature of speech. Rep tag: “it is <rep no=1 >is</rep> <rep no=2 >is</rep> falling over”. ‘…’ for incompleteness: “oh dear the sink is ... and oh my the children”. Unclear tag etc.: “and <unclear reason= ambiguous>taps</unclear> running”. Suprasegmental features: shifts, laughing, language change, etc. Notes: Go through unclear tag!
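One use of the rep tag is to recover the fluent wording of an utterance and to count repetitions. The sketch below assumes attribute values are quoted (the slide's examples leave them unquoted, which a standard XML parser would reject) and wraps each utterance in a hypothetical `<u>` element so it parses as a single tree. It illustrates the idea and is not the project's own processing code.

```python
import xml.etree.ElementTree as ET


def _parse(utterance_xml):
    # <u> is a hypothetical wrapper so the fragment has a single root.
    return ET.fromstring("<u>" + utterance_xml + "</u>")


def fluent_text(utterance_xml):
    """Drop the content of top-level <rep> tags; keep all other text."""
    root = _parse(utterance_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag != "rep":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    # Collapse the whitespace left behind by removed repetitions.
    return " ".join("".join(parts).split())


def count_repetitions(utterance_xml):
    """Count <rep> elements anywhere in the utterance."""
    return len(list(_parse(utterance_xml).iter("rep")))


repeated = 'it is <rep no="1">is</rep> <rep no="2">is</rep> falling over'
unclear = 'and <unclear reason="ambiguous">taps</unclear> running'
print(fluent_text(repeated), count_repetitions(repeated))
print(fluent_text(unclear))
```

Text inside other tags, such as `unclear`, is kept, since it is still part of what the speaker said.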
15. Phonological information: “The sink is <tr target=‘flooding’>blAdin</tr>”; IPA transcriptions. Anonymisation: all personal names/places replaced with reference markers. Misc: kinetic, vocal, incident, etc.
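The tr tag pairs what the speaker produced with the intended target word. A minimal sketch of pulling out those pairs follows, assuming straight-quoted attribute values (the slide shows typographic quotes) and a hypothetical `<u>` wrapper element; the function names are illustrative only.

```python
import xml.etree.ElementTree as ET


def _parse(utterance_xml):
    # <u> is a hypothetical wrapper so the fragment has a single root.
    return ET.fromstring("<u>" + utterance_xml + "</u>")


def target_pairs(utterance_xml):
    """Return (target word, produced form) for every <tr> element."""
    return [
        (element.get("target"), "".join(element.itertext()))
        for element in _parse(utterance_xml).iter("tr")
    ]


def normalised_text(utterance_xml):
    """Replace each top-level <tr> with its target word."""
    root = _parse(utterance_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "tr":
            parts.append(child.get("target") or "")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


example = 'The sink is <tr target="flooding">blAdin</tr>'
print(target_pairs(example))
print(normalised_text(example))
```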
16. The next phase. On the corpus: addressing the gap in ages, between 25 and 63 yrs, for healthy individuals with the cookie-theft task; addressing the shortfall within each aetiology. Work derived from the corpus: identifying ages based on the cookie-theft description; identifying damage based on the tasks; speech production issues more generally.
17. References: Harold Goodglass and Edith Kaplan (1983). Boston Diagnostic Aphasia Examination (BDAE). Lea and Febiger. Distributed by Psychological Assessment Resources, Odessa, FL.
18. Thank you. Any questions? The data set is not available yet, as it is a work in progress, but it will be released in the future with audio, annotations, and brain scans.