A Fully Annotated Corpus of Russian Speech

Slides:



Advertisements
Similar presentations
Information structuring in English dialogue class 4
Advertisements

Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand.
Sub-Project I Prosody, Tones and Text-To-Speech Synthesis Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan.
Analyses on IFA corpus Louis C.W. Pols Institute of Phonetic Sciences (IFA) Amsterdam Center for Language and Communication (ACLC) Project meeting INTAS.
Frequency, Pitch, Tone and Length October 15, 2012 Thanks to Chilin Shih for making some of these lecture materials available.
1 The Effect of Pitch Span on the Alignment of Intonational Peaks and Plateaux Rachael-Anne Knight University of Cambridge.
Nuclear Accent Shape and the Perception of Prominence Rachael-Anne Knight Prosody and Pragmatics 15 th November 2003.
AN ACOUSTIC PROFILE OF SPEECH EFFICIENCY R.J.J.H. van Son, Barbertje M. Streefkerk, and Louis C.W. Pols Institute of Phonetic Sciences / ACLC University.
Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence Sankaranarayanan Ananthakrishnan, Shrikanth S. Narayanan IEEE 2007 Min-Hsuan.
Tone, Accent and Stress February 14, 2014 Practicalities Production Exercise #2 is due at 5 pm today! For Monday after the break: Yoruba tone transcription.
Making & marking text for synthesis Caroline Henton 10 August 2006.
Dr. O. Dakkak & Dr. N. Ghneim: HIAST M. Abu-Zleikha & S. Al-Moubyed: IT fac., Damascus U. Prosodic Feature Introduction and Emotion Incorporation in an.
Bootstrapping a Language- Independent Synthesizer Craig Olinsky Media Lab Europe / University College Dublin 15 January 2002.
Spoken Language Technologies: A review of application areas and research issues Analysis and synthesis of F0 contours Agnieszka Wagner Department of Phonetics,
Recognition of Voice Onset Time for Use in Detecting Pronunciation Variation ● Project Description ● What is Voice Onset Time (VOT)? – Physical Realization.
Automatic Prosody Labeling Final Presentation Andrew Rosenberg ELEN Speech and Audio Processing and Recognition 4/27/05.
Stockholm 6. Feb -04Robust Methods for Automatic Transcription and Alignment of Speech Signals1 Course presentation: Speech Recognition Leif Grönqvist.
Pavel Skrelin (Saint-Petersburg State University) Some Principles and Methods of Measuring Fo and Tempo.
Chapter three Phonology
Text-To-Speech Synthesis An Overview. What is a TTS System  Goal A system that can read any text Automatic production of new sentences Not just audio.
A Corpus Search Methodology for Focus Realization Jonathan Howell and Mats Rooth Linguistics and CIS Cornell University.
An Elitist Approach to Articulatory-Acoustic Feature Classification in English and in Dutch Steven Greenberg, Shawn Chang and Mirjam Wester International.
1 Speech synthesis 2 What is the task? –Generating natural sounding speech on the fly, usually from text What are the main difficulties? –What to say.
Linguistics week 9 Phonology 2.
Phonology Katie Burns Title III Resource Teacher.
STUDY OF ENGLISH STRESS AND INTONATION
Text-To-Speech System for Marathi Miss. Deepa V. Kadam Indian Institute of Technology, Bombay.
Building High Quality Databases for Minority Languages such as Galician F. Campillo, D. Braga, A.B. Mourín, Carmen García-Mateo, P. Silva, M. Sales Dias,
Phonology, phonotactics, and suprasegmentals
Introduction Mel- Frequency Cepstral Coefficients (MFCCs) are quantitative representations of speech and are commonly used to label sound files. They are.
Data-driven approach to rapid prototyping Xhosa speech synthesis Albert Visagie Justus Roux Centre for Language and Speech Technology Stellenbosch University.
STANDARDIZATION OF SPEECH CORPUS Li Ai-jun, Yin Zhi-gang Phonetics Laboratory, Institute of Linguistics, Chinese Academy of Social Sciences.
How IPA is Used in SSML and PLS Paolo Baggia, Loquendo Wed. August 9 th, 2006.
Phonetics and Phonology
Perceived prominence and nuclear accent shape Rachael-Anne Knight LAGB 5 th September 2003.
1 Introducing The Buckeye Speech Corpus Kyuchul Yoon English Division, Kyungnam University March 21, 2008 School of English,
1 Speech Perception 3/30/00. 2 Speech Perception How do we perceive speech? –Multifaceted process –Not fully understood –Models & theories attempt to.
Suprasegmentals Segmental Segmental refers to phonemes and allophones and their attributes refers to phonemes and allophones and their attributes Supra-
Copyright 2007, Toshiba Corporation. How (not) to Select Your Voice Corpus: Random Selection vs. Phonologically Balanced Tanya Lambert, Norbert Braunschweiler,
Hands-on tutorial: Using Praat for analysing a speech corpus Mietta Lennes Palmse, Estonia Department of Speech Sciences University of Helsinki.
The Phonetic Patterning of Spontaneous American English Discourse Steven Greenberg, Hannah Carvey, Leah Hitchcock and Shuangyu Chang International Computer.
LREC 2008, Marrakech, Morocco1 Automatic phone segmentation of expressive speech L. Charonnat, G. Vidal, O. Boëffard IRISA/Cordial, Université de Rennes.
Rundkast at LREC 2008, Marrakech LREC 2008 Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen RUNDKAST: An Annotated.
Frequency, Pitch, Tone and Length October 16, 2013 Thanks to Chilin Shih for making some of these lecture materials available.
The Effect of Pitch Span on Intonational Plateaux Rachael-Anne Knight University of Cambridge Speech Prosody 2002.
Recognizing Discourse Structure: Speech Discourse & Dialogue CMSC October 11, 2006.
Introduction to Speech Neal Snider, For LIN110, April 12 th, 2005 (adapted from slides by Florian Jaeger)
Frequency, Pitch, Tone and Length February 12, 2014 Thanks to Chilin Shih for making some of these lecture materials available.
A quick walk through phonetic databases Read English –TIMIT –Boston University Radio News Spontaneous English –Switchboard ICSI transcriptions –Buckeye.
Roman Kálecký UČO: Segmental features  Sounds  Speach Trainer 3D Suprasegmental features and accents  Speak English  SpeakAP  Accentuate!
Tone, Accent and Quantity October 19, 2015 Thanks to Chilin Shih for making some of these lecture materials available.
ONZEminer Margaret Maclagan, ONZE director Robert Fromont, designer.
TEACHING PRONUNCIATION
Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning Speech Communication, 2000 Authors: S. M. Witt, S. J. Young Presenter:
Pitch Tracking + Prosody January 19, 2012 Homework! For Tuesday: introductory course project report Background information on your consultant and the.
Chapter 2: The variation problem 1: Inter-speaker variation J. Jenkins The phonology of English as an international language Presented by: Carrie Newdall.
Pronunciation Course Class # 1 WELCOME. Our Course Divided into three main parts: Pronunciation Pronunciation Stress Stress Rhythm and Intonation Rhythm.
11 How we organize the sounds of speech 12 How we use tone of voice 2009 년 1 학기 담당교수 : 홍우평 언어커뮤니케이션의 기 초.
Audio Books for Phonetics Research CatCod2008 Jiahong Yuan and Mark Liberman University of Pennsylvania Dec. 4, 2008.
Revisiting common errors Tutor: Prof. Cecilia A. Zemborain Student: Ma. De los Ángeles Svidersky.
A Text-free Approach to Assessing Nonnative Intonation Joseph Tepperman, Abe Kazemzadeh, and Shrikanth Narayanan Signal Analysis and Interpretation Laboratory,
Acoustics´08 Paris, 29 June – July 2008
an Introduction to English
Text-To-Speech System for English
An invitation for a study of the Enets prosody: Enets digital corpora
Job Google Job Title: Linguistic Project Manager
Audio Books for Phonetics Research
 
Phonetics: Sound Principles
Automatic Prosodic Event Detection
Presentation transcript:

A Fully Annotated Corpus of Russian Speech 25.04.2017 A Fully Annotated Corpus of Russian Speech Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova Department of Phonetics, Saint-Petersburg State University Ladies and Gentlemen. My name is Daniil Kocharov and I present here a new corpus of Russian speech, CORPRES, developed at the Department of Phonetics of Saint-Petersburg State University by my colleagues and me. kocharov@phonetics.pu.ru

CORPRES fully annotated COrpus of Russian Professionally REad Speech 25.04.2017 CORPRES fully annotated COrpus of Russian Professionally REad Speech developed at the Department of Phonetics, Saint-Petersburg State University developed for: unit-selection TTS possible linguistic use: research in the Russian phonetics and inter- and intra-speaker speech variability The corpus name is an acronym that means Corpus of Russian Professionally Read Speech. It was originally developed for a unit-selection text-to-speech synthesis of Russian, but our department also uses it as material for our phonetic research of Russian speech and also of inter- and intra-speaker speech variability. It is possible due to comprehensive and very precise segmentation of speech and its labeling. D. Kocharov, Fully Annotated Corpus of Russian Speech

Corpus Description 8 speakers (4 women and 4 men) 25.04.2017 Corpus Description 8 speakers (4 women and 4 men) 60 hours of read speech (7.5 hours from each speaker). Texts of different styles: fiction narrative texts, a play with emotionally expressive dialogues informational texts on IT, politics, economy 6 levels of annotation. The corpus consists of recordings of eight professional speakers, four women and four men. There are seven and a half hours of speech recorded from each speaker, which makes a total of sixty hours of speech. The corpus contains only read speech. Different styles of texts were selected for recording with specific characteristics of those styles in mind: fiction narrative, a play containing emotionally expressive dialogues, purely informational neutral texts on IT, politics and economy. We use six levels of annotation that allow us to comprehensively describe speech data. These are… D. Kocharov, Fully Annotated Corpus of Russian Speech

Annotation Level 1: pitch period boundaries Level 2: phonetic events 25.04.2017 Annotation Level 1: pitch period boundaries Level 2: phonetic events Level 3: real phonetic transcription Level 4: ideal phonetic transcription Level 5: orthographic transcription Level 6: prosodic transcription The first annotation level contains boundaries of fundamental frequency periods. The second annotation level contains boundaries and labels of various phonetic events: epenthetic vowels, voice onsets, voiced plosures, stationary parts of voiceless consonants, laryngalization, and glottalization. The third annotation level contains real phonetic transcription that reflects the sounds actually pronounced by the speakers. The forth annotation level contains ideal phonetic transcription which was automatically generated by a linguistic transcriber in accordance with a canonical set of rules. The fifth annotation level contains orthographic transcription. And the sixth annotation level contains prosodic transcription which includes labels for different types of pauses, types of tone unit, and non-speech events. D. Kocharov, Fully Annotated Corpus of Russian Speech

Annotated Speech Sample 25.04.2017 Annotated Speech Sample Annotation file format: 0,1, 0,2,h 0,8,хотелось 0,32,h 0,64,12 286,16,- 968,16, … This slide contains a sample of annotated speech signal and a piece of annotation file introducing annotation file format. Here you can see all the annotation levels I introduced earlier. D. Kocharov, Fully Annotated Corpus of Russian Speech

Labeling Periods of Fundamental Frequency and Phonetic Events 25.04.2017 Labeling Periods of Fundamental Frequency and Phonetic Events F0 periods were detected automatically. The efficiency of automatic F0 detection and F0 period labeling was up to 98%. The results of the automatic procedure were checked and corrected manually. Phonetic events were detected manually: epenthetic vowels, voice onsets, voiced plosures, stationary parts of voiceless consonants, glottalization. The fundamental frequency periods were detected automatically. A linear combination of a number of methods was used for this purpose The efficiency of automatic pitch detection and pitch periods labeling was about 98%. To make the labeling absolutely correct, the results of the automatic procedure were checked and corrected manually. D. Kocharov, Fully Annotated Corpus of Russian Speech

Phonetic Transcription 25.04.2017 Phonetic Transcription Version of SAMPA for Russian was used for transcription. 18 symbols were used to mark positional allophones of 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/. They contained indication of the vowel’s position regarding stress: 0 – stressed accented vowel, 1 – unstressed vowel in a pretonic syllable, 4 – unstressed one in a post-tonic syllable. The set of consonant symbols included 41 symbols: 36 Russian consonant phonemes 5 voiced allophones of voiceless consonants The transcription symbols used were a version of SAMPA for the Russian language. To mark positional allophones of 6 Russian vowel phonemes /a/, /o/, /i/, /u/, /e/, /y/ 18 symbols were used. Each vowel symbol contained indication of the sound’s position regarding stress. Thus 0 was used to for a stressed accented vowel, 1 - for an unstressed vowel in a pretonic syllable, 4 – an unstressed one in a post-tonic syllable. The set of consonant symbols included 41 symbols to cover 36 Russian consonant phonemes and 5 voiced allophones of voiceless consonants which occur frequently at word junctions. There are two levels of phonetic transcription in CORPRES. We called them “real phonetic transcription” and “ideal phonetic transcription”. We call the first one ‘real’ because it reflects the sounds actually pronounced by the speakers. It is a fully manual transcription, made by expert phoneticians. The ‘ideal’ transcription was generated from orthographic texts which were read by the speaker in accordance with a set of phonological rules. D. Kocharov, Fully Annotated Corpus of Russian Speech

Real Phonetic Transcription 25.04.2017 Real Phonetic Transcription Speech signal was manually: segmented transcribed peer-revised. Huge work! Time efficiency: ~1 sound per minute. => 1 minute of speech per 1-2 hours. To produce the real phonetic transcription, the speech signal was manually segmented, transcribed and peer-revised by expert phoneticians. This transcription is a fully manual transcription made by means of acoustic and perceptual analysis of sound by sound. It was a REALLY huge work. It takes from one to two hours to transcribe one minute of speech. One minute of speech contains about 80 sounds. D. Kocharov, Fully Annotated Corpus of Russian Speech

Ideal Phonetic Transcription 25.04.2017 Ideal Phonetic Transcription Transcription: is generated from texts. Labels: placed automatically to coincide with the label positions produced manually on the real transcription level. Automatic labeling: not perfect due to the mismatch of ideal and real phonetic transcriptions. => results of the automatic procedure were further manually corrected. The ‘ideal’ transcription was generated from the texts read by the speakers in accordance with a set of phonological rules without reference to the actual sound. Thus it is made without reference to the actual sound. The labels were placed automatically to coincide with the label positions produced manually on the real transcription level. Procedure of automatic labeling is based on calculating the Levenshtein distance. Automatic labeling is not perfect due to the mismatch of ideal and real phonetic transcriptions. Therefore, the results of the automatic procedure were further manually corrected. D. Kocharov, Fully Annotated Corpus of Russian Speech

Orthographic and Prosodic Transcription 25.04.2017 Orthographic and Prosodic Transcription Orthographic transcription (Level 5) contains the boundaries of words and word labels. prosodically prominent words are labeled with special symbols. Prosodic transcription (Level 6) contains boundaries of tone units and pauses and their labels. Prosodic information was marked by expert phoneticians on the basis of perceptual and acoustic analysis of the speech data in a text file containing orthographic transcription. Labels were later automatically transferred from the text file to the annotation files to coincide with the phonetic transcription levels. Orthographic transcription was stored on Level 5, it contains the boundaries of words and word labels. Besides the prosodically prominent words are labeled with special symbols. Prosodic information was stored on Level 6, it contains the boundaries of tone units and pauses and their labels. Prosodic information was marked by expert phoneticians on the basis of perceptual and acoustic analysis of the speech data in a text file containing orthographic transcription. Labels were later automatically transferred from the text file to the annotation files to coincide with the phonetic transcription levels. D. Kocharov, Fully Annotated Corpus of Russian Speech

Corpus Data Description 25.04.2017 Corpus Data Description Fully Annotated Data Partly Annotated Data Total Amount Phonemes 1 048 867 – Words 211 437 317 021 528 458 Tone Units 64 055 86 546 150 601 Hours 24 36 60 40% of the corpus: manually segmented and fully annotated on all six levels. 60% of the corpus: partly annotated labels for pitch period and phonetic event labels: no manual phonetic transcription orthographic, prosodic transcription, ideal phonetic transcription are done, but not aligned with speech signal. 40% of the corpus (24 hours of speech) was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated; there are labels for pitch period and phonetic event labels. Orthographic and prosodic transcription, as well as the ideal phonetic transcription (see Section 3 for detail) for this part was generated and then stored as text files, but was not transferred to sound file labels. The fully annotated part of the corpus covers all speaking styles included in the corpus and all speakers. Table 1 shows general corpus statistics. D. Kocharov, Fully Annotated Corpus of Russian Speech

Mismatch Between Ideal and Real Transcriptions 25.04.2017 Mismatch Between Ideal and Real Transcriptions Total Correctly Mispronounced Elided Count 1 118 833 947 508 101 292 70 033 Percent 100 84.7 9.05 6.25 The table reveals that despite the fact that as many as 84.7% of the ideal transcription reflects the actual pronunciation, 9.05% of the expected sounds are replaced by other sounds, and 6.25% of the expected sounds are actually not pronounced at all. SIGNIFICANT MISMATCH! About a half of the mismatches is more or less consistent and could be formalized, but the other part is almost unpredictable and depends on instant speaker preference. D. Kocharov, Fully Annotated Corpus of Russian Speech

Conclusions It is the only available corpus for Russian TTS. 25.04.2017 Conclusions It is the only available corpus for Russian TTS. Precise annotation provides an especially valuable resource for both linguistic research and speech applications development. The corpus is large enough for: high-quality TTS phonetic research of intra- and inter-speaker pronunciation variability Our experience shows: manual transcription is very expensive, but worth doing. D. Kocharov, Fully Annotated Corpus of Russian Speech

25.04.2017 Thank you! D. Kocharov, Fully Annotated Corpus of Russian Speech