1 Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007

2 Outline
• Introduction
• Topics in speech processing
  – Speech coding
  – Speech recognition
  – Speech synthesis
  – Speaker verification/recognition
• Conclusion

3 Introduction
• Speech is our basic communication tool.
• We have long hoped to be able to communicate with machines using speech (think of C-3PO and R2D2).

4 Speech Production Model (figure: anatomical structure and the corresponding mechanical model of speech production)

5 Characteristics of Digital Speech (figure: speech waveform and its spectrogram)

6 Voiced and Unvoiced Speech (figure: waveform segments labeled silence, unvoiced, and voiced)

7 Short-time Parameters (figure: speech waveform with its short-time power envelope)

8 Short-time Parameters (figures: zero-crossing rate and pitch period)
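
The short-time power and zero-crossing rate from slides 7-8 are easy to compute frame by frame. Below is a minimal NumPy sketch, assuming a 1-D array of 8 kHz speech samples; the frame length, hop size, and function name are illustrative choices, not anything specified in the slides.

```python
import numpy as np

def short_time_features(x, frame_len=200, hop=80):
    """Frame-wise short-time power and zero-crossing rate.

    Assumes x is a 1-D NumPy float array of speech samples (e.g. 8 kHz mono);
    frame_len=200 samples is 25 ms at 8 kHz, hop=80 samples is 10 ms.
    """
    powers, zcrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        powers.append(np.mean(frame ** 2))                 # short-time power
        signs = np.sign(frame)
        zcrs.append(np.mean(np.abs(np.diff(signs)) > 0))   # fraction of sign changes
    return np.array(powers), np.array(zcrs)
```

High power with a low zero-crossing rate typically indicates voiced speech, while low power with a high zero-crossing rate suggests unvoiced speech or noise, which is why these two parameters appear together on the slides.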

9 Speech Coding
• Similar to images, we can also compress speech to make it smaller and easier to store and transmit.
• General compression methods such as DPCM can also be used (a toy example follows).
• More compression can be achieved by taking advantage of the speech production model.
• There are two classes of speech coders:
  – Waveform coders
  – Vocoders
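
As a concrete illustration of the "general compression methods such as DPCM" mentioned above, here is a toy first-order DPCM encoder and decoder with a fixed uniform quantization step. The step size and function names are assumptions made for the sketch; a real coder would adapt both the predictor and the step size.

```python
import numpy as np

def dpcm_encode(x, step=0.01):
    """First-order DPCM: quantize the difference between each sample and the
    previous *reconstructed* sample, so encoder and decoder stay in sync."""
    codes = np.empty(len(x), dtype=np.int32)
    pred = 0.0
    for i, s in enumerate(x):
        codes[i] = int(round((s - pred) / step))   # quantized prediction error
        pred = pred + codes[i] * step              # reconstructed sample
    return codes

def dpcm_decode(codes, step=0.01):
    """Rebuild the waveform by accumulating the dequantized differences."""
    recon = np.empty(len(codes))
    pred = 0.0
    for i, c in enumerate(codes):
        pred = pred + c * step
        recon[i] = pred
    return recon
```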

10 LPC Speech Coder (block diagram: buffered speech frames, frame n, frame n+1, …, pass through speech analysis, which extracts the pitch, the voiced/unvoiced decision, the vocal-tract parameters, and the energy parameter; these are quantized and packed by the code generator into the output code stream)

11 LPC and the Vocal Tract
• Mathematically, speech can be modeled with the following generation model:
  x(n) = Σ_{p=1}^{k} a_p x(n−p) + e(n)
• {a_1, a_2, …, a_k} are called the Linear Prediction Coefficients (LPC); they model the shape of the vocal tract.
• e(n) is the excitation that generates the speech.
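
One common way to estimate {a_1, …, a_k} is the autocorrelation method, which solves a Toeplitz system built from the frame's autocorrelation. The sketch below assumes NumPy/SciPy and a non-silent frame; the Hamming window and the order of 10 (matching LPC10) are assumptions, not something the slides prescribe.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=10):
    """Estimate LPC coefficients {a_1..a_k} for one speech frame with the
    autocorrelation method: solve R a = r, where R is Toeplitz.
    Assumes a non-silent frame so the system is well conditioned."""
    frame = frame * np.hamming(len(frame))                       # taper frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # r[0..N-1]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    # prediction residual e(n) = x(n) - sum_p a_p x(n-p)
    predicted = np.convolve(frame, np.concatenate(([0.0], a)), mode="full")[:len(frame)]
    e = frame - predicted
    return a, e
```

The residual e(n) is what the excitation model on the next slides has to approximate: roughly periodic pulses for voiced frames and noise-like for unvoiced frames.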

12 Decoding and Speech Synthesis (block diagram: depending on the voiced/unvoiced flag, either an impulse-train/glottal-pulse generator driven by the pitch period or a random-noise generator provides the excitation; it is scaled by the gain and passed through the vocal-tract model and the radiation model to produce speech)

13 An Example of Synthesizing Speech (figure: glottal pulses with blending regions are passed through the vocal-tract filter with gain control and then through the radiation filter)
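
A minimal version of the decoder structure above, assuming SciPy is available: the all-pole vocal-tract filter 1/(1 − Σ a_p z^{−p}) is driven by an impulse train for voiced frames or by white noise for unvoiced frames. The glottal-pulse shaping and the radiation filter from the diagram are omitted to keep the sketch short.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a, gain, pitch_period, n_samples, voiced=True):
    """Drive the all-pole vocal-tract filter with either an impulse train
    (voiced, pitch_period in samples) or white noise (unvoiced)."""
    if voiced:
        excitation = np.zeros(n_samples)
        excitation[::pitch_period] = 1.0          # impulse train generator
    else:
        excitation = np.random.randn(n_samples)   # random noise generator
    # all-pole filter: denominator polynomial [1, -a_1, ..., -a_k]
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), excitation)
```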

14 LPC10 (FS1015)
• LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps.
• LPC10 works on 8 kHz speech, using a 22.5 ms frame and 10 LPC coefficients.
(audio examples: original speech vs. LPC-decoded speech)
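
A quick arithmetic check of that frame layout: at an 8 kHz sampling rate, a 22.5 ms frame is 180 samples. The constant names below are just for illustration.

```python
FS = 8000                                    # LPC10 sampling rate in Hz
FRAME_MS = 22.5                              # frame length in milliseconds
FRAME_SAMPLES = int(FS * FRAME_MS / 1000)    # 180 samples per frame
LPC_ORDER = 10                               # LPC coefficients per frame
```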

15 Mixed Excitation LP
• For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.
• The newer 2.4 kbps standard (MELP) addresses this problem.
(block diagram: band-pass filtered pulses and band-pass filtered noise are weighted by w and 1−w, summed, and passed with gain through the vocal-tract and radiation models to produce speech; audio examples: original speech vs. MELP-decoded speech)
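
The mixed-excitation idea can be sketched as a weighted sum of band-passed pulses and band-passed noise, as in the diagram. The band edges, filter order, and single weight w below are illustrative assumptions and not the actual MELP band structure, which splits the spectrum into several bands with per-band voicing strengths.

```python
import numpy as np
from scipy.signal import butter, lfilter

def mixed_excitation(pitch_period, n_samples, w, fs=8000):
    """Blend band-passed pulses (weight w) and band-passed noise (weight 1-w)
    to form the excitation that feeds the vocal-tract filter."""
    pulses = np.zeros(n_samples)
    pulses[::pitch_period] = 1.0
    noise = np.random.randn(n_samples)
    b_lo, a_lo = butter(4, 1000 / (fs / 2), btype="low")    # pulses dominate the low band
    b_hi, a_hi = butter(4, 1000 / (fs / 2), btype="high")   # noise dominates the high band
    return w * lfilter(b_lo, a_lo, pulses) + (1 - w) * lfilter(b_hi, a_hi, noise)
```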

16 Hybrid Speech Codecs
• At higher bit rates, hybrid speech codecs have an advantage over vocoders.
• FS1016: CELP (Code-Excited Linear Prediction).
• G.723.1: a dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
• G.729: a CELP-based codec at 8 kbps.
(block diagram of analysis by synthesis: candidate model parameters drive speech synthesis, and a "perceptual" comparison against the input speech selects the transmitted code; audio examples at 5.3 kbps, 6.3 kbps, and 8 kbps)
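
A toy version of the analysis-by-synthesis loop shown in the diagram: synthesize each candidate excitation through the LPC filter and keep the candidate whose output is closest to the input frame. Real CELP codecs use adaptive plus fixed codebooks and a perceptual weighting filter; the plain squared error and the function name here are simplifications for illustration.

```python
import numpy as np
from scipy.signal import lfilter

def best_codebook_entry(target, codebook, a, gain=1.0):
    """Return the index of the codebook excitation whose synthesized output
    best matches the target speech frame (squared error, no perceptual weighting)."""
    denom = np.concatenate(([1.0], -np.asarray(a)))   # all-pole LPC filter denominator
    errors = [np.sum((target - gain * lfilter([1.0], denom, c)) ** 2) for c in codebook]
    return int(np.argmin(errors))
```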

17 Speech Recognition
• Speech recognition is the foundation of human-computer interaction using speech.
• Speech recognition in different contexts:
  – Speaker dependent or speaker independent.
  – Discrete words or continuous speech.
  – Small vocabulary or large vocabulary.
  – Quiet environment or noisy environment.
(block diagram: speech passes through a parameter analyzer; a comparison and decision algorithm, guided by reference patterns and a language model, outputs words)

18 How Does Speech Recognition Work?
Words: "grey whales"
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, its power distribution).

19 Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?

20 Hidden Markov Model (diagram: states S1, S2, S3 with transition probabilities such as P12, each state emitting symbols from the output alphabet {a, b, c, …})

21 Dynamic Programming in Decoding
(figure: trellis of states over time)
We can find the path through the states that most probably generated the observed “feature” sequence (one feature vector extracted per speech frame).
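
The max-probability path is usually found with the Viterbi algorithm, a dynamic program over the state-time trellis. Below is a minimal sketch assuming discrete observation symbols and log-probability matrices; the parameter names are illustrative rather than taken from the slides.

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Dynamic-programming (Viterbi) decoding of the most probable state path.

    log_A[i, j] : log transition probability from state i to state j
    log_B[j, o] : log probability of observation symbol o in state j
    log_pi[j]   : log initial probability of state j
    obs         : sequence of observation symbol indices
    """
    n_states = log_A.shape[0]
    T = len(obs)
    delta = np.full((T, n_states), -np.inf)        # best log score ending in each state
    back = np.zeros((T, n_states), dtype=int)      # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # score of every (prev, next) pair
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(n_states)] + log_B[:, obs[t]]
    # trace back the best path from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```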

22 HMM for a Unigram Language Model (diagram: a start state s0 branches with probabilities p1, p2, p3, … into the per-word HMMs HMM1 (word1), HMM2 (word2), …, HMMn (wordn))

23 Speech Synthesis
• Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
• Speech synthesis is widely used in text-to-speech systems and various telephone services.
• The easiest and most often used speech synthesis method is waveform concatenation (a sketch follows below).
(figure/audio example: increasing the pitch without changing the speed)
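
A minimal sketch of waveform concatenation, assuming NumPy and speech units longer than the overlap: consecutive units are joined with a short linear cross-fade so the seams are not audible as clicks. Changing pitch without changing speed, as in the example above, is usually done with a pitch-synchronous overlap-add scheme, which is beyond this sketch.

```python
import numpy as np

def concatenate_units(units, blend=80):
    """Join speech units with a linear cross-fade of `blend` samples.
    Assumes every unit is a 1-D array longer than `blend`."""
    out = np.array(units[0], dtype=float)
    fade_in = np.linspace(0.0, 1.0, blend)
    fade_out = 1.0 - fade_in
    for unit in units[1:]:
        unit = np.asarray(unit, dtype=float)
        out[-blend:] = out[-blend:] * fade_out + unit[:blend] * fade_in  # blending region
        out = np.concatenate([out, unit[blend:]])
    return out
```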

24 Speaker Recognition
• Identifying or verifying the identity of a speaker is an application where computers exceed human beings.
• Vocal-tract parameters can be used as features for speaker recognition (a toy comparison is sketched below).
(figure: LPC covariance features for speaker one vs. speaker two)
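
To illustrate the idea of comparing speakers by their vocal-tract features, here is a toy comparison that averages per-frame feature vectors (e.g. LPC-derived coefficients) and measures their Euclidean distance. The function name and the distance choice are assumptions; a real system would model the feature distribution statistically rather than compare single averages.

```python
import numpy as np

def speaker_distance(features_a, features_b):
    """Distance between two speakers' average feature vectors.
    Each input is shaped (n_frames, n_coeffs); smaller means more similar."""
    mean_a = np.mean(features_a, axis=0)
    mean_b = np.mean(features_b, axis=0)
    return float(np.linalg.norm(mean_a - mean_b))
```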

25 Applications
• Speech recognition: call routing, directory assistance, operator services, document input.
• Speaker recognition: personalized services, fraud control.
• Text-to-speech synthesis: speech interfaces, document correction, voice commands.
• Speech coding: wireless telephony, voice over Internet.

