Presentation transcript:

EE2F1 Speech & Audio Technology, Sept. 26, 2002
THE UNIVERSITY OF BIRMINGHAM, ELECTRONIC, ELECTRICAL & COMPUTER ENGINEERING, Digital Systems & Vision Processing

Slide 1: EE2F1 Multimedia (1): Speech & Audio Technology
Lecture 7: Speech Synthesis (1)
Martin Russell
Electronic, Electrical & Computer Engineering, School of Engineering, The University of Birmingham

Slide 2: Stages in "text-to-speech" synthesis
- Text normalisation
- Text-to-phone conversion
- Linguistic analysis
- Semantic analysis
- Conversion of the phone sequence to a sequence of synthesiser control parameters
- Synthesis of the acoustic speech signal
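The stages above can be sketched as a toy pipeline. Every name here (ABBREVIATIONS, LEXICON, normalise, and so on) is an illustrative invention; real systems use large rule sets or trained models at each stage, and the linguistic and semantic analysis stages are omitted entirely.

```python
# Toy text-to-speech front end: normalisation -> phones -> control parameters.
# All tables and helper names are illustrative, not from any real system.

ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}    # toy normalisation table
LEXICON = {"doctor": ["d", "aa", "k", "t", "er"],     # toy pronunciation lexicon
           "street": ["s", "t", "r", "iy", "t"]}

def normalise(text):
    """Text normalisation: expand abbreviations, lower-case the rest."""
    return " ".join(ABBREVIATIONS.get(w, w.lower()) for w in text.split())

def text_to_phones(text):
    """Text-to-phone conversion by lexicon lookup (fallback: spell out)."""
    phones = []
    for word in text.split():
        phones.extend(LEXICON.get(word, list(word)))
    return phones

def phones_to_parameters(phones):
    """Map each phone to placeholder synthesiser control parameters."""
    return [{"phone": p, "duration_ms": 80, "f0_hz": 120} for p in phones]

params = phones_to_parameters(text_to_phones(normalise("Dr. street")))
```

The final stage, synthesis of the acoustic signal from the parameter frames, is the subject of the rest of the lecture.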

Slide 3: Approaches to synthesis
- The final stage is to convert the 'phone' or word sequence into a sequence of synthesiser control parameters
- Two main approaches:
  - Waveform concatenation
  - Model-based speech synthesis (includes articulatory synthesis)

Slide 4: Waveform concatenation
- Join together, or concatenate, stored sections of real speech
- Sections may correspond to whole words, or to sub-word units
- Early systems were based on whole words (e.g. the speaking clock, UK telephone system, 1936)
- Storage and access are major issues
- Good speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
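At the quoted data rates, the storage cost of a whole-word inventory follows directly; the word count and average duration below are illustrative assumptions, not figures from the lecture.

```python
# Rough storage cost of a whole-word inventory at the quoted data rates.
# 500 words averaging 0.6 s each are illustrative assumptions.

def inventory_bytes(n_words, avg_word_seconds, bits_per_second):
    """Bytes needed to store n_words recordings at a given bit rate."""
    return n_words * avg_word_seconds * bits_per_second / 8

low  = inventory_bytes(500, 0.6, 16_000)   # at 16,000 bps
high = inventory_bytes(500, 0.6, 32_000)   # at 32,000 bps
```

Even a modest 500-word vocabulary lands in the hundreds of kilobytes to megabytes, a serious cost for 1980s-era hardware and one reason storage and access were major issues.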

Slide 5: 1936 "Speaking Clock"
From John Holmes, "Speech synthesis and recognition", courtesy of British Telecommunications plc

Slide 6: Whole word concatenation (1)
- Whole word concatenation can give good quality speech (as in the speaking clock), but has many disadvantages:
  - the pronunciation of a word is influenced by neighbouring words (co-articulation)
  - prosodic effects such as intonation, rate of speaking and amplitude are also influenced by context
  - the interpretation of a sentence is strongly influenced by the details of the individual words used ("Mary didn't buy Sam a pizza")

Slide 7: Whole word concatenation (2)
- Disadvantages (continued):
  - words must be extracted from the right sort of sentence
  - most suitable for applications where the structure of the sentence is constrained, e.g. announcements, lists
  - may need to record more than one example of each word, e.g. raised pitch at the end of a list, pre-pause lengthening

Slide 8: Example – original recording
"The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link"

Slide 9: Example – trivial concatenative synthesis
"The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove"
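The reordering trick in this example can be sketched as naive splicing of stored word waveforms. The sample values below are placeholders for real PCM audio; the point is that the joins are abrupt, which is exactly what the co-articulation and prosody disadvantages listed earlier predict.

```python
# Trivial concatenative synthesis: splice stored word recordings together
# in a new order. Waveforms here are placeholder sample lists.

recordings = {
    "malvern":    [0.2, 0.2, 0.3],
    "droitwich":  [0.3, 0.1],
    "bromsgrove": [0.1, 0.2, 0.1],
}

def concatenate(words, recordings, gap=1):
    """Join word waveforms back to back with a crude silent gap between them."""
    out = []
    for w in words:
        out.extend(recordings[w])
        out.extend([0.0] * gap)          # inter-word pause: no co-articulation!
    return out[:-gap] if gap else out    # drop the trailing gap

signal = concatenate(["malvern", "droitwich", "bromsgrove"], recordings)
```

Nothing here adjusts pitch, duration or spectral shape at the joins, so each word keeps the prosody of the sentence it was originally cut from.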

Slide 10: Example repeated
- Original recording
- 'Concatenative synthesis'

Slide 11: Whole word concatenation (3)
- Disadvantages (continued):
  - to add new words, the original speaker must be found, or all words must be re-recorded
  - even with specialist facilities, selection and extraction of suitable words is labour-intensive and time-consuming

Slide 12: Sub-word concatenation (1)
- The limitations of word-based methods suggest concatenative speech synthesis based on sub-word units
- Needs a well-annotated, phonetically-balanced corpus of speech recordings
- Extract fragments from waveforms in the corpus which represent 'basic units' of speech, and can be concatenated and used for speech synthesis

Slide 13: Sub-word concatenation (2)
- Difficulties include:
  - identification of a set of suitable units
  - careful annotation of large amounts of data
  - derivation of a good method for concatenation

Slide 14: Sub-word concatenation (3)
- Sub-word concatenation overcomes the difficulty of adding new words to the application vocabulary
- But other problems are exacerbated
- In particular, coarticulation and pitch-continuity problems occur within, as well as between, words
- It is necessary to use several examples of each phone (corresponding roughly to different allophones)

Slide 15: Sub-word concatenation (4)
- It is natural to select fragments that characterise the phone target values, but modelling the transitions between these targets is a significant problem

Slide 16: Example: sub-word concatenation
- "stack" (original)
- "task" (sub-word concatenative synthesis)

Slide 17: Transitional units (1)
- The central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects
- Hence select fragments which characterise the transitions between phones, rather than the phone targets
- e.g. the diphone: the transition between two phones
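A diphone inventory contains one unit per ordered phone pair, so its size grows quadratically with the phone set. A sketch, assuming an illustrative 44-phone set plus a silence symbol (English phone counts vary by analysis):

```python
# Size of a diphone inventory: one unit per ordered phone pair, including
# transitions to and from silence. The 44-phone set is illustrative.
from itertools import product

phones = [f"p{i}" for i in range(44)] + ["sil"]          # 44 phones + silence
diphones = [f"{a}-{b}" for a, b in product(phones, repeat=2)]
```

A few thousand units is a manageable inventory to record and annotate, compared with the open-ended word list a whole-word system would need.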

Slide 18: Transitional units (2)
- There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to
- Possible solutions:
  - use several different examples of each diphone
  - store short transition regions, and interpolate between their end values
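The "store short transition regions and interpolate between end values" idea can be shown on a single synthesiser parameter; the formant frequencies and frame count below are hypothetical.

```python
# Bridging the gap between two stored transition fragments by linear
# interpolation of a control parameter, avoiding a discontinuity.

def interpolate_gap(end_value, start_value, n_frames):
    """Linearly interpolate n_frames values between the end of one
    fragment and the start of the next."""
    if n_frames == 0:
        return []
    step = (start_value - end_value) / (n_frames + 1)
    return [end_value + step * (i + 1) for i in range(n_frames)]

# e.g. a formant at 700 Hz at the end of one fragment, 900 Hz at the
# start of the next, with 3 interpolated frames in between:
bridge = interpolate_gap(700.0, 900.0, 3)
```

Interpolation like this works on smoothly varying parameters (formants, spectral envelopes), which is one argument for synthesising from parameters rather than raw waveforms.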

Slide 19: Transitional units (3)
- Coping with coarticulation effects by modelling transitions and:
  - (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
  - (b) interpolating between short transition regions

Slide 20: More on prosody
- Discontinuity in the fundamental frequency is exacerbated for sub-word methods
- A source-filter model can be used to separate the excitation signal from the vocal-tract shape
- Vocal-tract shape descriptions can then be concatenated, and an appropriately smooth fundamental frequency pattern added separately
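A minimal source-filter sketch: an impulse-train excitation at a chosen F0 is passed through a fixed resonator standing in for the vocal-tract filter. A single two-pole resonator is a drastic simplification of a real vocal tract; the point is only that pitch and vocal-tract shape are controlled independently, so a smooth F0 contour can be imposed on concatenated vocal-tract descriptions.

```python
# Source-filter separation: the same 'vocal tract' (one resonator) driven
# by excitations at two different pitches.
import math

def impulse_train(f0_hz, n_samples, fs=8000):
    """Voiced excitation: one impulse per pitch period."""
    period = int(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, centre_hz, r=0.95, fs=8000):
    """Two-pole resonator standing in for one vocal-tract formant."""
    theta = 2 * math.pi * centre_hz / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2      # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y.append(yn)
        y1, y2 = yn, y1
    return y

low  = resonator(impulse_train(100, 160), 700)   # same filter, F0 = 100 Hz
high = resonator(impulse_train(200, 160), 700)   # same filter, F0 = 200 Hz
```

Changing `f0_hz` leaves the filter coefficients, and hence the spectral envelope, untouched.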

Slide 21: PSOLA: Pitch Synchronous Overlap and Add
- PSOLA (Charpentier, 1986)
- The most successful current approach to concatenative synthesis
- In PSOLA, the end regions of windowed waveform segments are overlapped pitch-synchronously and added
- BT's Laureate is an example
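The overlap-add operation can be sketched as follows, assuming the pitch marks are already known (finding them reliably is itself hard, as the final slide notes). This is a bare simplification of time-domain PSOLA, not BT's actual implementation: Hann-windowed two-period segments, each centred on a pitch mark, are overlap-added at a chosen output spacing.

```python
# Minimal TD-PSOLA sketch: window two-period segments at pitch marks and
# overlap-add them at a new period. Pitch marks here are synthetic.
import math

def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def psola(signal, pitch_marks, period, out_period):
    """Overlap-add two-period windowed segments at spacing out_period."""
    win = hann(2 * period)
    out = [0.0] * (len(signal) + 2 * period)
    out_pos = 0
    for m in pitch_marks:
        seg = signal[max(0, m - period): m + period]
        for i, s in enumerate(seg):
            out[out_pos + i] += s * win[i]
        out_pos += out_period            # smaller spacing -> higher pitch
    return out

in_period = 40                           # samples per input pitch period
marks = list(range(40, 400, in_period))  # synthetic pitch marks
x = [math.sin(2 * math.pi * n / in_period) for n in range(400)]
raised = psola(x, marks, in_period, 30)  # 40/30, roughly a 1.33x higher pitch
```

Passing `out_period=in_period` reconstructs the input (up to windowing); other values re-space the pitch periods, which is how the pitch modifications on the next slides work.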

Slide 22: PSOLA
From: John Holmes and Wendy Holmes, "Speech synthesis and recognition", Taylor & Francis, 2001

Slide 23: Speech modification using PSOLA
- In addition to speech synthesis from segments, there are two other common applications of PSOLA:
  - pitch modification
  - duration modification
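Where pitch modification changes the spacing of the overlap-added segments, duration modification repeats or drops pitch-period segments while keeping their spacing, and hence the pitch, the same. A sketch on the pitch-mark sequence alone, with illustrative numbers:

```python
# Duration modification in the PSOLA framework: choose which input pitch
# marks to resynthesise so the output has round(len * factor) periods.

def stretch_marks(pitch_marks, factor):
    """Select input pitch marks for an output stretched by `factor`:
    factor > 1 repeats periods, factor < 1 drops them."""
    n_out = round(len(pitch_marks) * factor)
    return [pitch_marks[min(len(pitch_marks) - 1, int(i / factor))]
            for i in range(n_out)]

marks = list(range(0, 400, 40))          # 10 pitch marks, 40 samples apart
longer  = stretch_marks(marks, 1.5)      # 15 marks: some periods repeated
shorter = stretch_marks(marks, 0.5)      # 5 marks: every other period dropped
```

The selected marks would then be fed through the same overlap-add step at the unchanged output period, lengthening or shortening the speech without shifting its pitch.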

Slide 24: Increasing pitch using PSOLA
From: John Holmes and Wendy Holmes, "Speech synthesis and recognition", Taylor & Francis, 2001

Slide 25: Decreasing pitch using PSOLA
From: John Holmes and Wendy Holmes, "Speech synthesis and recognition", Taylor & Francis, 2001

Slide 26: The 'Laureate' system
- The BT "Laureate" system is a modern, PSOLA-based synthesiser
- See Edington et al. (1996a); also look at the web site
- Demonstration

Slide 27: PSOLA strengths and weaknesses
- Strengths:
  - produces good quality speech
- Weaknesses:
  - a large, annotated corpus is needed for each 'voice'
  - requires accurate pitch peak detection
  - inflexible: new voices can only be produced by recording and labelling significant speech corpora from new speakers
- Corpora can be annotated automatically using techniques from speech recognition

Slide 28: Summary
- Concatenative speech synthesis
- Whole word concatenation
- Importance of prosody
- Sub-word concatenation
- Choice of sub-word units
- PSOLA