Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory
Many thanks to: Lou Burnard (re XML); Jiahong Yuan, UPenn (for P2FA aligner); Dave de Roure & Kevin Page (for discussions re linked data); John Pybus & Amir Nettler (for experiments with streamed audio fragments); for £££
Outline of our talk: Large audio corpora and their challenges Mining a Year of Speech Random access to audio snippets
Multimedia dominates the internet 2005: YouTube launched 2008: YouTube surpasses Yahoo as world’s No. 2 search engine 2011: video/audio dominates peak-time bandwidth in North America
Some browsable audio corpora: US Supreme Court recordings; whitehousetapes.net; Scottish Corpus of Texts and Speech; British Library Archival Sound Recordings
Challenges of very large audio collections of spoken language How does a researcher find audio segments of interest? How do audio corpus providers mark them up to facilitate searching and browsing? How to make very large scale audio collections accessible?
Server-side challenges: amount of material. Storage – CD-quality audio: 635 MB/hour – Uncompressed .wav files: 115 MB/hour, 1.02 TB/year – Library/archive .wav files: 1 GB/hr, 9 TB/yr. 1 TB (1000 GB) hard drive: c. £65. Now £39.95! Spoken audio = 250 times XML
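The storage figures above follow from simple PCM arithmetic. The sketch below is illustrative only: CD quality is taken as 44.1 kHz, 16-bit stereo, and the 115 MB/hour figure is assumed (not stated on the slide) to correspond to 16 kHz, 16-bit mono speech recordings. The function and constant names are invented for this example.

```python
MB = 10**6
TB = 10**12
HOURS_PER_YEAR = 24 * 365


def bytes_per_hour(sample_rate_hz, bits_per_sample, channels):
    """Size of one hour of uncompressed PCM audio, ignoring file headers."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * 3600


# CD quality: 44.1 kHz, 16-bit, stereo
cd_mb_per_hour = bytes_per_hour(44_100, 16, 2) / MB

# Assumed speech format: 16 kHz, 16-bit, mono
speech_mb_per_hour = bytes_per_hour(16_000, 16, 1) / MB

# One continuous year of the assumed speech format
speech_tb_per_year = bytes_per_hour(16_000, 16, 1) * HOURS_PER_YEAR / TB

print(f"CD quality: {cd_mb_per_hour:.0f} MB/hour")
print(f"Speech .wav: {speech_mb_per_hour:.0f} MB/hour")
print(f"Speech .wav: {speech_tb_per_year:.2f} TB/year")
```

Under these assumptions the hourly figures match the slide, and the yearly total comes out at roughly 1 TB, close to the quoted value.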
Server-side challenges: audio format issues – Uncompressed .wav files: 115 MB/hour – Temptation to use compressed formats – For speech analysis, low-bitrate compression (40 kbit/s) is pretty disastrous – Spectral centre-of-gravity measures are unreliable even at higher bitrates, but pitch and formant estimation are OK. van Son (2005) Acta Acustica united with Acustica 91:
Challenges Amount of material Computing – distance measures, etc. – alignment of labels – searching and browsing – Just reading or copying 9 TB takes >1 day – Download time: days or weeks
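To see why reading or downloading 9 TB takes so long, a back-of-envelope calculation helps. The throughput values below (100 MB/s sustained disk read, 100 Mbit/s network link) are assumptions chosen for illustration, not measurements from the project.

```python
TB = 10**12
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, rate_bytes_per_second):
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_second / SECONDS_PER_DAY


archive_bytes = 9 * TB

# Assumed sustained disk read speed: 100 MB/s
disk_days = transfer_days(archive_bytes, 100e6)

# Assumed network link: 100 Mbit/s = 12.5 MB/s
download_days = transfer_days(archive_bytes, 100e6 / 8)

print(f"Disk read: {disk_days:.1f} days")
print(f"Download:  {download_days:.1f} days")
```

With these assumed rates, a full read takes just over a day and a download takes more than a week, in line with the slide's estimates.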
How large? Some biggish transcribed corpora: Switchboard corpus: 13 days (included in MYS); Spoken Dutch: 1 month, only a fraction transcribed; Spoken Spanish: 110 hours; OSU Buckeye Corpus: 2 days; Wellington Corpus, NZ: 3 days; Mining a Year of Speech: 218 days so far, on track towards 3.6 years (>1200 days)
The “Year of Speech”: a grove of corpora, held at various sites with a common indexing scheme and search tools. US English: 2,240 hours of telephone conversations; 1,255 hours of broadcast news; talk show conversations (1,000 hrs); Supreme Court oral arguments (5,000 hrs); political speeches and debates. British English: spoken audio part of the British National Corpus; >7.4 million words of transcribed speech; 1,400 hours; digitized in collaboration with the British Library
Analogue audio in libraries British Library: >1m disks and tapes, 5% digitized Library of Congress Recorded Sound Reference Center: >2m items, including … International Storytelling Foundation: >8000 hrs of audio and video European broadcast archives: >20m hrs (2,283 years) cf. Large Hadron Collider 74% on ¼” tape 19% shellac and vinyl 7% digital
Analogue audio in libraries. Worldwide: ~100m hours (11,415 yrs) analogue, i.e. 4-5 Large Hadron Colliders! Cost of professional digitization and cataloguing: ~£20/$32 per tape (e.g. C-90 cassette). Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening
Why so large? Lopsided sparsity. Top ten words (I, You, it, the, 's, and, n't, a, That, Yeah) each occur 58,000 times; 12,400 words (23%) occur only once
Why so large? Lopsided sparsity
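The lopsided distribution described here (a handful of very frequent words, a long tail of words seen only once) can be measured on any tokenized transcript. The following is a minimal sketch; the helper names and the toy token list are invented for illustration and are not drawn from the corpus.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent word types with their counts."""
    return Counter(tokens).most_common(n)


def hapax_fraction(tokens):
    """Fraction of word types that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


# Toy sample, not corpus data
tokens = "i you it the i you i yeah well".split()
print(top_words(tokens, 2))
print(f"{hapax_fraction(tokens):.0%} of word types occur only once")
```

On a real corpus the token list would come from the transcriptions; the same two functions then give the "top ten" and "occur only once" figures.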
A rule of thumb: to catch most English sounds, you need minutes of audio; common words of English … a few hours; a typical person's vocabulary … >100 hrs; pairs of common words … >1000 hrs; arbitrary word-pairs … >100 years
Main problem in large corpora Finding needles in the haystack To address that challenge, we think there are two “killer apps” Forced alignment Data linking, or at least open exposure of digital material, coupled with cross-searching
Practicalities In order to be of much practical use, such very large corpora must be indexed at word and segment level All included speech corpora must therefore have associated text transcriptions We’re using P2FA, the Penn Phonetics Laboratory Forced Aligner, to associate each word and segment with the corresponding start and end points in the sound files
Mining (indexing by forced alignment) x 21 million
Mining (indexing by forced alignment)
Mining (a needle in a haystack)
Mining (a diamond in the rough)
Challenges for alignments Problems with documentation and records Transcription errors Long untranscribed portions Some transcribed regions with no audio (lost in copying)
Challenges for alignments Broadcast recordings may include untranscribed commercials Transcripts generally edit out dysfluencies Political speeches may extemporize, departing from the published script
Challenges for alignments Overlapping speakers Background noise/music/babble Variable signal loudness Reverberation Distortion Poor speaker vocal health/voice quality Unexpected accents: need multidialect pronouncing dictionary
Issues we’re still grappling with No standards for adding phonemic transcriptions and timing information to XML transcriptions Many different possible schemes How to decide?
Enabling other corpora to be brought in in future Promoting common standards for audio with linked transcription ? Well
Automatic Speech-to-Phoneme alignment
Aligner output to extended XML. HTK example: HTK output + XML -> extended XML. How to represent the obtained time information within the existing TEI-XML structure? "IH1" "T" "IT"
Integrating alignment information in the TEI-XML structure Time information Word level Phoneme level Phonemic representation of each word Timeline
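As a rough illustration of the step from aligner output to time-stamped XML, the sketch below parses HTK-style label lines (times in units of 100 ns) and attaches word-level and phoneme-level times to XML elements. The line layout (`start end phone [word]`, with the word on the line of its first phone), the element name `ph`, and the use of `start`/`end` attributes holding seconds are all assumptions made for this example; they are not the project's actual extended TEI scheme.

```python
import xml.etree.ElementTree as ET

HTK_UNITS_PER_SECOND = 10_000_000  # HTK label times are in units of 100 ns


def parse_htk_labels(text):
    """Group phone lines into words.

    Assumed layout: 'start end phone [word]', where the word label
    appears on the line of the word's first phone.
    """
    words = []
    for line in text.strip().splitlines():
        fields = line.split()
        start = int(fields[0]) / HTK_UNITS_PER_SECOND
        end = int(fields[1]) / HTK_UNITS_PER_SECOND
        if len(fields) > 3:  # a new word starts on this line
            words.append({"word": fields[3], "phones": []})
        words[-1]["phones"].append((fields[2], start, end))
    return words


def to_xml(words):
    """Build a <u> element with hypothetical <w> and <ph> timing attributes."""
    u = ET.Element("u")
    for w in words:
        phones = w["phones"]
        w_el = ET.SubElement(u, "w", start=f"{phones[0][1]:.2f}",
                             end=f"{phones[-1][2]:.2f}")
        w_el.text = w["word"]
        for label, start, end in phones:
            ET.SubElement(w_el, "ph", n=label,
                          start=f"{start:.2f}", end=f"{end:.2f}")
    return u


# Invented sample timings for the word "IT" (phones IH1, T)
sample = """
0 1200000 IH1 IT
1200000 2100000 T
"""
words = parse_htk_labels(sample)
utterance = to_xml(words)
print(ET.tostring(utterance, encoding="unicode"))
```

A timeline-based encoding would instead store the time points once and have words and phonemes point to them; the inline attributes here are simply the shortest way to show the mapping.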
Other representations: EXMARaLDA EXMARaLDA: “Extensible Markup Language for Discourse Annotation” Good evening. I have with me tonight Ann Elk Mistress Ann Elk.
Other representations: Voices of the Holocaust This is the first utterance of the interviewer. This is the first utterance of the interviewee.
Other representations: IFA Dialog Video corpus, Phonetic Sciences, University of Amsterdam van Son, R., Wesseling, W., Sanders, E., and van den Heuvel, H., The IFADV corpus: A free dialog video corpus, LREC’08, Marrakech, beginnen we weer opnieuw?
Other representations: Labb-Cat (ONZE Miner) Transcriber or Praat representation
Other representations: Transcriber so what do you know of your family ’s history like do you know when and why they came to Oxford
Other representations: COLT Corpus – Sentence Level But I must see Mr [smile again.] [ spoiled again?]... – Word level But I must see Mr...
Other representations: Summary Mostly sentence/word level time information representation No phoneme analysis No phoneme time information Timeline representation TEI standard? Extended TEI-XML with time and phoneme information
Wanted me to.
Q. When you have an indexing scheme and a big database, what do you want to do with it? A. Random access to audio snippets
Random access to audio snippets. Timing of fragments in URL, e.g. Gaudi (Google Labs), everyzing.com (ramp.com): pregame-show.htm#q=something&seek=
Random access to audio snippets Audio objects in HTML5 (in the browser) e.g. W3C media fragments protocol e.g. Demo:
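With word-level start and end times available, a link to a single word can be written as a temporal media fragment, which in the W3C Media Fragments URI syntax takes the form `#t=start,end` in seconds. The sketch below only builds such a URL; the function name and the example address are hypothetical, and whether a given browser or server honours the fragment is not something this code checks.

```python
def media_fragment_url(audio_url, start_s, end_s):
    """Append a temporal media fragment (#t=start,end, in seconds) to a URL."""
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    return f"{audio_url}#t={start_s:.3f},{end_s:.3f}"


# Hypothetical recording URL and word timings
url = media_fragment_url("http://example.org/recording.wav", 0.12, 0.21)
print(url)
```

The resulting URL can be used as the `src` of an HTML5 audio element, so that playback covers only the aligned word or phrase.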
URNs for audio snippets. Linked data/semantic web approach: refer to each specific word, phoneme, etc. as a specific audio object, not just a time range inside an audio file. Challenge: need for an ontology for sounds and sound timelines in audio recordings. Some progress in music ontologies
Conclusion Sound and multimedia corpora/collections are getting very big In fact multimedia, not text, dominates the internet So, we need some standard ways for representing audio structure and accessing its parts Forced alignment allows us to map transcriptions to audio, reasonably accurately For searching, there are several “demonstration” possibilities, but this is still work in progress