1st and 2nd Generation Synthesis


1 1st and 2nd Generation Synthesis
Speech synthesis generations
First generation: ground-up (rule-based) synthesis
Second generation: data-driven synthesis by concatenation
Input (both generations): a sequence of phonetic symbols, durations, F0 contours, and amplification factors
Data: the first generation uses rule-based parameters; the second generation uses Linear Prediction (stored diphone parameters)

2 Early Synthesis History
Klatt, 1987, "Review of text-to-speech conversion for English" (with audio examples)
Milestones
1939 World's Fair: the Voder, Dudley
First TTS: Umeda, 1968
Low-rate resynthesis: Speak & Spell, Wiggins, 1980
Natural-sounding resynthesis: multi-pulse Linear Prediction, Atal, 1982
Natural-sounding synthesis: Klatt, 1986

3 Formant Synthesizer Design
Concept
Create individual components for each synthesizer unit
Feed the system with a set of parameters
Advantage
If the parameters are set properly, natural-sounding speech is produced
Disadvantages
The interaction among the parameters becomes obscure
Parameter settings do not lend themselves to an automated algorithm
Demo Program:

4 Formant Synthesizer
Design for individual formant components
IIR filter: y[n] = b0*x[n] - a1*y[n-1] - a2*y[n-2]
Transfer function: H(z) = b0 / (1 - a1*z^-1 - a2*z^-2)
Factored form: H(z) = 1 / ((1 - p1*z^-1)(1 - p2*z^-1))
Because the poles form a conjugate pair:
H(z) = 1 / ((1 - r*e^(iθ)*z^-1)(1 - r*e^(-iθ)*z^-1))
     = 1 / (1 - r*(e^(iθ) + e^(-iθ))*z^-1 + r^2*z^-2)
     = 1 / (1 - 2r*cos(θ)*z^-1 + r^2*z^-2)
The filter: y[n] = x[n] + 2r*cos(θ)*y[n-1] - r^2*y[n-2]
Parameters: θ controls the formant frequency, r controls the bandwidth
θ = 2πf/F, r = e^(-πβ/F), where β = desired bandwidth, F = sampling rate, f = formant frequency
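A minimal numpy sketch of this resonator, assuming a mono float signal x; the function name and looping style are illustrative, not from the slides:

```python
import numpy as np

def formant_resonator(x, f, bandwidth, fs):
    """One second-order formant resonator: y[n] = x[n] + 2r*cos(theta)*y[n-1] - r^2*y[n-2]."""
    theta = 2 * np.pi * f / fs            # pole angle: sets the formant frequency
    r = np.exp(-np.pi * bandwidth / fs)   # pole radius: sets the formant bandwidth
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += 2 * r * np.cos(theta) * y[n - 1]
        if n >= 2:
            y[n] -= r ** 2 * y[n - 2]
    return y
```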

5 Parallel or Cascade
Cascaded connections
Lose control over individual components because the skirts of the poles interact
Parallel connections
Add the filtered signals together to maintain control over each component
System input parameters
A1,2,3 = amplitudes, F1,2,3 = frequencies, BW1,2,3 = bandwidths, Gain = output multiplier
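A rough sketch of the two wiring options, using scipy's lfilter and the same resonator coefficients derived on the previous slide; the function names are mine, not from the slides:

```python
import numpy as np
from scipy.signal import lfilter

def resonator_denominator(f, bw, fs):
    """Denominator of H(z) = 1 / (1 - 2r*cos(theta)*z^-1 + r^2*z^-2) for one formant."""
    theta = 2 * np.pi * f / fs
    r = np.exp(-np.pi * bw / fs)
    return [1.0, -2 * r * np.cos(theta), r ** 2]

def cascade(x, formants, fs):
    """Cascade: each resonator filters the output of the previous one."""
    y = np.asarray(x, dtype=float)
    for f, bw in formants:
        y = lfilter([1.0], resonator_denominator(f, bw, fs), y)
    return y

def parallel(x, formants, amps, fs):
    """Parallel: each resonator filters the source; the scaled outputs are summed."""
    x = np.asarray(x, dtype=float)
    return sum(a * lfilter([1.0], resonator_denominator(f, bw, fs), x)
               for (f, bw), a in zip(formants, amps))
```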

6 Periodic Source
Glottis approximation formulas
Flanagan model: an explicit periodic function
u[n] = ½(1 - cos(πn/L)) if 0 ≤ n ≤ L
u[n] = cos(π(n - L)/(2M)) if L < n ≤ M
u[n] = 0 otherwise
Liljencrants-Fant (LF) model (figure)
Flow rises from 0 to amplitude Av at time Tp
The derivative reaches its extremum E at time Te, the glottal closing instant
The open quotient Oq = Te / T0
The ratio between the opening and closing phases is αm
Abrupt closure after maximum excitation, between Oq·T0 and T0
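A direct transcription of the Flanagan formulas above into numpy (a sketch; L and M are the sample indices of maximum opening and of closure, as in the slide):

```python
import numpy as np

def flanagan_pulse(L, M):
    """One glottal pulse: rise 0.5*(1 - cos(pi*n/L)) for n <= L, fall cos(pi*(n-L)/(2M)) for L < n <= M."""
    n = np.arange(M + 1)
    u = np.zeros(M + 1)
    rise = n <= L
    fall = (n > L) & (n <= M)
    u[rise] = 0.5 * (1 - np.cos(np.pi * n[rise] / L))
    u[fall] = np.cos(np.pi * (n[fall] - L) / (2 * M))
    return u
```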

7 Radiation From the Lips
Actual modeling of the lips is very complicated
Rule-based synthesizers therefore rely on simple approximation formulas
Experiments show that lip radiation contains at least one anti-resonance (a zero in the transfer function)
The approximation often used: R(z) = 1 - α*z^-1, where 0.95 ≤ α ≤ 0.98
This turns out to be the same formula used for preemphasis
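Since R(z) = 1 - α*z^-1 is just a first difference, a sketch is a one-line filter; the default alpha is an assumed value within the range quoted above:

```python
import numpy as np

def lip_radiation(u, alpha=0.97):
    """Apply R(z) = 1 - alpha*z^-1, i.e. y[n] = u[n] - alpha*u[n-1] (same form as preemphasis)."""
    u = np.asarray(u, dtype=float)
    y = u.copy()
    y[1:] -= alpha * u[:-1]
    return y
```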

8 Consonants and Nasals
Nasals
One resonator models the oral cavity
Another resonator models the nasal cavity
Add a zero in series with the resonators
The outputs are added to generate the final output
Fricatives
The source is either noise, the glottis, or both
One set of resonators models the cavity in front of the place of constriction
Another set models the cavity behind the constriction
The outputs are added together

9 The Klatt Synthesizer

10 Klatt Parameters

11 Evaluation of Formant Synthesizers
Quality
The speech produced is understandable
The output sounds metallic (not natural)
Problems
The system uses lumped parameters (like the components of a spring); it is not distributed (like the vocal tract)
Assumptions that are individually valid become invalid when joined together in a system
Speech subtleties are too complex for the formant model
Transitions between sounds are not modeled
Formants are not present in obstruent sounds

12 Classical Linear Prediction (LP Synthesis)
Concept
Use the all-pole tube model of Linear Prediction
Y(z) = X(z) / (1 - a1*z^-1 - a2*z^-2 - … - ap*z^-p)
This leads to the linear prediction formula: y[n] = x[n] + a1*y[n-1] + a2*y[n-2] + … + ap*y[n-p]
Improvements over formant synthesis
Parameters are obtained directly from speech, not from experimentation or human intervention
The glottal filter is subsumed in the LP equation, so synthesizing the glottal source becomes unnecessary
Tradeoffs
Modularity and the physical interpretation of the coefficients are lost
The lack of zeros makes modeling nasals and fricatives difficult
Modeling transitions between sounds is problematic
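A minimal sketch of the all-pole synthesis recursion, assuming the predictor coefficients a1..ap are already available (e.g. from a prior LP analysis):

```python
import numpy as np

def lp_synthesize(excitation, a):
    """All-pole LP synthesis: y[n] = x[n] + a1*y[n-1] + ... + ap*y[n-p]."""
    excitation = np.asarray(excitation, dtype=float)
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        y[n] = excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                y[n] += a[k - 1] * y[n - k]
    return y
```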

13 LP diphone-concatenation Synthesis
Diphone definition: the unit that starts at the middle of one phone and ends at the middle of the next phone
Concept
Capture and store the vocal-tract dynamics of each frame
Alter F0 by changing the impulse rate
Alter the duration as needed
Concatenate the stored frames to accomplish synthesis
Input: an array of {phone symbol, F0 value, duration} objects
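One way to represent the input objects described above; the type and field names are my own illustration, not part of the slides:

```python
from dataclasses import dataclass

@dataclass
class SynthesisUnit:
    phone: str        # phonetic symbol
    f0: float         # target fundamental frequency in Hz
    duration: float   # target duration in seconds

# An utterance is then just a list of these objects, e.g.:
utterance = [SynthesisUnit("d", 120.0, 0.08), SynthesisUnit("ow", 110.0, 0.15)]
```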

14 LP difficulties
Boundary-point transition artifacts
Approach: interpolate the LP parameters between adjacent frames
The output has a metallic or buzzy quality because the LP filter does not entirely capture the character of the source; the residual contains spikes at each pitch period
Experiment: resynthesize a speech waveform
Resynthesize with the residual: the speech sounds perfect
Resynthesize without the residual:
Same pitch and duration: sounds degraded but acceptable
Altered pitch: the speech becomes buzzy
Altered duration: degraded but acceptable

15 Articulatory Synthesis
The oldest approach: mimic the vocal tract components
Von Kempelen: a mechanical device with tubes, bellows, and pipes, played as one plays a musical instrument
Digital version
The controls are the tubes, not the formants
The LP tube parameters can be obtained from the LP filter
Difficulties
It is difficult to obtain values that shape the tubes
The glottis and lip radiation still need to be modeled
Existing models produce poor speech
Current applicable research: articulatory physiology, gestures, audio-visual synthesis, talking heads

16 2nd Generation Synthesis by Concatenation
An extension of 1st generation LP concatenation
Comparisons to 1st generation models
Input: still explicitly defines the phonetic symbols, F0 contour, and duration
Output: the source waveform is generated from a database of diphones (one diphone per phone); the impulse and noise generators are discarded
Concatenation: pitch and duration algorithms glue the diphones together

17 Diphone Inventory Requirements
With 40 phonemes, the 40 left halves and 40 right halves can combine in 1600 ways
A phonotactic grammar can reduce the database size
Pick long units rather than short ones (it is easier to shorten a duration than to lengthen it)
Normalize the phases of the diphones
All diphones should have equal pitch
Finding diphone sound waves to build the inventory
Search a corpus (if one exists)
Specifically record words containing the diphones
Record nonsense words (logatomes) with the desired features

18 Pitch-synchronous overlap and add (PSOLA)
Purpose: modify the pitch or timing of a signal
PSOLA is a time-domain algorithm
Pseudo code
Find the exact pitch periods in the speech signal
Create overlapping frames centered on the epochs, extending back and forward one pitch period
Apply a Hamming window to each frame
Overlap and add the windowed frames
Place them closer together for higher pitch, further apart for lower pitch
Remove frames to shorten, or insert frames to lengthen
Why is this undetectable if the epochs are accurately found? We are not altering the vocal-tract filter, only the amplitude and spacing of its input
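A heavily simplified TD-PSOLA pitch-modification sketch; it assumes the epoch array is already known and reuses each analysis frame exactly once, whereas a real implementation maps analysis epochs to synthesis epochs more carefully:

```python
import numpy as np

def psola_pitch_shift(x, epochs, pitch_scale):
    """Respace pitch-synchronous frames: pitch_scale > 1 raises F0, < 1 lowers it."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(int(np.ceil(len(x) / pitch_scale)) + len(x))
    out_epoch = float(epochs[1])                  # position of the first synthesis epoch
    for i in range(1, len(epochs) - 1):
        left = epochs[i] - epochs[i - 1]          # one pitch period back
        right = epochs[i + 1] - epochs[i]         # one pitch period forward
        frame = x[epochs[i] - left: epochs[i] + right] * np.hamming(left + right)
        start = int(round(out_epoch)) - left
        if start < 0 or start + len(frame) > len(y):
            break
        y[start:start + len(frame)] += frame      # overlap-add at the new epoch position
        out_epoch += right / pitch_scale          # closer together = higher pitch
    return y[:int(round(out_epoch))]
```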

19 PSOLA Illustrations Pitch (window and add) Duration (insert or remove)

20 PSOLA Epochs
PSOLA requires an exact marking of pitch points in a time-domain signal
Pitch marks
Marking any point within a pitch period is acceptable, as long as the algorithm marks the same point in every period
The most common marking point is the instant of glottal closure, identified by a rapid descent in the time-domain signal
Create an array of sample numbers that comprises the analysis epoch sequence P = {p1, p2, …, pn}
Estimate the pitch period at pk as (pk+1 - pk-1)/2

21 PSOLA pseudo code
Identify the epochs in the input signal as an array of sample indices, P
For each input object:
    Extract the desired F0, phoneme, and duration
    speech = the phoneme sound wave looked up from the stored data
    Identify the epochs in the phoneme as an array, P
    Break the phoneme into pitch-synchronous frames
    If the F0 value differs from that of the stored phoneme:
        Window each frame into an array of frames
        speech = overlap-and-add the frames using the desired F0
    If the stored duration is longer than desired:
        Delete frames from speech at regular intervals
    Else if the stored duration is shorter than desired:
        Duplicate frames at regular intervals in speech
Note: multiple F0 points within a phoneme require multiple input objects
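The duration branch of this pseudo code can be sketched as a resampling of the frame list, deleting or duplicating frames at regular intervals; this assumes `frames` is a list of windowed pitch-synchronous frames and the function name is mine:

```python
import numpy as np

def resample_frames(frames, duration_scale):
    """Delete (scale < 1) or duplicate (scale > 1) frames at regular intervals."""
    n_out = max(1, int(round(len(frames) * duration_scale)))
    idx = np.linspace(0, len(frames) - 1, n_out).round().astype(int)
    return [frames[i] for i in idx]
```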

22 PSOLA Evaluation
Advantages
As a time-domain algorithm running in O(N), it is unlikely that any other approach is more efficient
If the pitch and timing changes are within 25%, listeners cannot detect the alterations
Disadvantages
Epoch marking must be exact
Only pitch and timing changes are possible
If used with unit selection, several hundred megabytes of storage can be needed

23 LP - PSOLA Algorithm Analysis
If the synthesizer uses linear prediction to compress the phoneme sound waves, the residual portion of the signal is already available for further waveform modification
Mark the epoch points of the LP residual and overlap-and-add it with the PSOLA approach
Analysis: the resulting speech is competitive with PSOLA, but not superior
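A sketch of the combination, reusing the psola_pitch_shift function from the PSOLA slide above and filtering the modified residual through the all-pole LP filter; lp_coeffs are assumed to be the predictor coefficients a1..ap:

```python
import numpy as np
from scipy.signal import lfilter

def lp_psola(residual, lp_coeffs, epochs, pitch_scale):
    """Modify the LP residual with PSOLA, then resynthesize through 1/A(z)."""
    excitation = psola_pitch_shift(residual, epochs, pitch_scale)
    a = np.concatenate(([1.0], -np.asarray(lp_coeffs)))   # A(z) = 1 - sum a_k z^-k
    return lfilter([1.0], a, excitation)
```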

24 Sinusoidal Models
Find the contributing sinusoids in a signal using linear regression techniques
Linear regression: statistically estimate relationships between variables that are related in a linear fashion
Advantage: the algorithm is less sensitive to finding exact pitch points
General approach
Filter the noise component from the signal
Successively match the signal against a high-frequency sinusoid and subtract the match from the wave
The lowest remaining sinusoid is F0
Use a PSOLA-type algorithm to alter pitch and duration
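The "linear regression" step can be illustrated by a least-squares fit of one sinusoid at a known frequency, whose fitted component is then subtracted from the signal; this is a sketch of a single matching step, not the full analysis loop:

```python
import numpy as np

def fit_sinusoid(x, freq, fs):
    """Solve x[n] ~= a*cos(w*n) + b*sin(w*n) by least squares; return the fitted component."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    w = 2 * np.pi * freq / fs
    basis = np.column_stack([np.cos(w * n), np.sin(w * n)])
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return basis @ coeffs          # subtract this from x and repeat at the next frequency
```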

25 MBROLA Overview
PSOLA synthesis has very poor (hoarse) quality if the pitch points are not correctly marked
MBROLA addresses this issue by preprocessing the database of phonemes
Ensure that all phonemes have the same phase
Force all phonemes to have the same pitch
Overlap-and-add synthesis then works with complete accuracy
Home Page:

26 Issues and Discussion: Concatenation Synthesis
Micro-concatenation problems
Joining phonemes can cause clicks at the boundary
  Solution: taper the waveforms at the edges
Joining segments with mismatched phases
  Solution: force all segments to be phase aligned
Finding optimal coupling points
  Solution: algorithms for matching trajectories
  Solution: interpolate the LP parameters
Macro-concatenation: ensure a natural spectral envelope
Requires an accurate F0 contour
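A sketch of the tapering fix for boundary clicks: crossfade the two segments over a short overlap region; the segment names and overlap length are illustrative, not from the slides:

```python
import numpy as np

def crossfade_join(a, b, overlap):
    """Join segments a and b with a linear crossfade of `overlap` samples at the boundary."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    fade = np.linspace(1.0, 0.0, overlap)
    middle = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], middle, b[overlap:]])
```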

