VoiceSauce: A program for voice analysis

Yen-Liang Shue (yshue@ee.ucla.edu), Speech Processing and Auditory Perception Laboratory, Department of Electrical Engineering, UCLA

Patricia Keating & Chad Vicenik, Phonetics Lab, Linguistics, UCLA

Overview
VoiceSauce is a new application, currently implemented in Matlab, which provides automated voice measurements over time from audio recordings. VoiceSauce computes many voice measures, including spectral magnitudes corrected for the effects of formant frequencies and bandwidths. It outputs values as plain text or in a format for the Emu speech database. VoiceSauce is available as a free download.

VoiceSauce algorithms: F0 estimation
By default, the STRAIGHT algorithm (Kawahara et al. 1998) is used to find F0 at 1 ms intervals. The Snack Sound Toolkit (Sjölander 2004) can also be used to estimate F0 at variable intervals. Future versions will allow integration with other F0 estimators.
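For illustration only: STRAIGHT and Snack are external tools, but the sketch below shows the general shape of fixed-hop F0 estimation using simple autocorrelation. This is a stand-in, not the STRAIGHT algorithm; the function name, window length, pitch range, and voicing threshold are all assumptions, except the 1 ms hop, which matches the text above.

```python
import numpy as np

def estimate_f0(x, fs, hop_ms=1.0, win_ms=40.0, fmin=60.0, fmax=400.0):
    """Crude fixed-hop autocorrelation F0 tracker (illustrative stand-in;
    VoiceSauce actually calls STRAIGHT or Snack)."""
    hop, win = int(fs * hop_ms / 1000), int(fs * win_ms / 1000)
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    f0 = []
    for start in range(0, len(x) - win, hop):
        frame = x[start:start + win] - np.mean(x[start:start + win])
        ac = np.correlate(frame, frame, mode="full")[win - 1:]
        if ac[0] <= 0:                    # silent frame
            f0.append(0.0)
            continue
        ac = ac / ac[0]                   # normalized autocorrelation
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0.append(fs / lag if ac[lag] > 0.3 else 0.0)  # crude voicing gate
    return np.array(f0)                   # one F0 value per hop
```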

VoiceSauce algorithms: Harmonic magnitudes
Harmonic spectral magnitudes are computed pitch-synchronously, over a 3-cycle window; this eliminates much of the variability in spectra computed over a fixed time window. Harmonics are found using standard optimization techniques to locate the maximum of the spectrum around the peak locations predicted by F0. This yields a much more accurate measure without relying on large FFT calculations.
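A minimal sketch of the idea, assuming a mono signal x, sample rate fs, and an F0 estimate at time t0: window three pitch periods, then numerically maximize the spectrum magnitude near k*F0 by evaluating the DTFT at continuous frequencies, which is what makes a huge FFT unnecessary. This is illustrative, not VoiceSauce's actual code; the +/-10% search range is an assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def harmonic_magnitude_db(x, fs, t0, f0, k):
    """Magnitude (dB) of the k-th harmonic, measured pitch-synchronously
    over a 3-cycle window centred at time t0 (seconds)."""
    n_win = int(round(3 * fs / f0))                 # three pitch periods
    c = int(round(t0 * fs))
    frame = x[c - n_win // 2: c - n_win // 2 + n_win] * np.hanning(n_win)
    n = np.arange(len(frame))

    def neg_mag(f):  # continuous-frequency DTFT magnitude at f (Hz)
        return -abs(np.sum(frame * np.exp(-2j * np.pi * f * n / fs)))

    # Refine the peak near k*f0 (the +/-10% bounds are assumed).
    res = minimize_scalar(neg_mag, bounds=(0.9 * k * f0, 1.1 * k * f0),
                          method="bounded")
    return 20 * np.log10(-res.fun)                  # dB at the refined peak
```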

VoiceSauce algorithms: Formant estimation
The Snack Sound Toolkit (Sjölander 2004) is used to find the frequencies and bandwidths of the first four formants, using as defaults the covariance method, pre-emphasis of 0.96, a window length of 25 ms, and a frame shift of 1 ms (to match STRAIGHT). Future versions will allow integration with other formant estimators.
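As a rough illustration of LPC-root formant tracking (not Snack itself): here librosa's Burg-method LPC stands in for Snack's covariance method, and only the 0.96 pre-emphasis matches the defaults above; the LPC order and the near-DC cutoff are invented for the sketch.

```python
import numpy as np
import librosa  # librosa.lpc (Burg) stands in for Snack's covariance LPC

def formants_lpc(frame, fs, n_formants=4, order=12):
    """Rough formants (frequency, bandwidth) from the roots of an LPC
    polynomial, after 0.96 pre-emphasis."""
    frame = np.asarray(frame, dtype=np.float64)
    pre = np.append(frame[0], frame[1:] - 0.96 * frame[:-1])  # pre-emphasis
    a = librosa.lpc(pre * np.hamming(len(pre)), order=order)
    roots = np.array([r for r in np.roots(a) if np.imag(r) > 0])
    freqs = np.angle(roots) * fs / (2 * np.pi)                # pole frequencies
    bws = -np.log(np.abs(roots)) * fs / np.pi                 # pole bandwidths
    idx = np.argsort(freqs)
    pairs = [(f, b) for f, b in zip(freqs[idx], bws[idx]) if f > 90]
    return pairs[:n_formants]                                 # [(F1,B1), ...]
```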

VoiceSauce algorithms: Formant corrections
In previous work, Iseli & Alwan (2000, 2004) developed an algorithm that estimates the voice source parameters F0, H1*-H2*, and H1*-A3*, where the asterisk denotes that the corresponding spectral magnitudes (H1, H2, and A3) are corrected for the effects of formants (frequencies and bandwidths). Further developments are reported in Iseli et al. (2006, 2007).

In VoiceSauce, the harmonic amplitudes for all spectral magnitude measures are corrected every frame using the Snack formant frequencies and bandwidths. (For H1*-H2*, only F1 and F2 are used in the correction; for H1*-A3*, for example, F1 through F3 are used.) Finally, the measures are smoothed with a moving-average filter with a default length of 20 samples.
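A sketch of the correction and smoothing steps, following our reading of the formula in Iseli & Alwan (2004): each formant contributes a pole pair whose dB boost at the harmonic frequency is subtracted from the raw magnitude. Treat the exact expression as an assumption and consult the paper or the VoiceSauce source for the definitive version.

```python
import numpy as np

def formant_correction_db(f, formants, fs):
    """Vocal-tract boost (dB) at frequency f (Hz) from the given
    (frequency, bandwidth) formant pairs, following our reading of
    Iseli & Alwan (2004); subtracting it gives the starred measures."""
    w = 2 * np.pi * f / fs
    corr = 0.0
    for fi, bi in formants:
        r = np.exp(-np.pi * bi / fs)                 # pole radius
        wi = 2 * np.pi * fi / fs                     # pole angle
        num = (1 - 2 * r * np.cos(wi) + r ** 2) ** 2
        den = ((1 - 2 * r * np.cos(w - wi) + r ** 2) *
               (1 - 2 * r * np.cos(w + wi) + r ** 2))
        corr += 10 * np.log10(num / den)
    return corr

def smooth(track, n=20):
    """Moving-average smoothing with the default 20-sample window."""
    return np.convolve(track, np.ones(n) / n, mode="same")

# e.g. H1_star = H1 - formant_correction_db(f0, [(F1, B1), (F2, B2)], fs)
```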

Variables computed
- F0 from Snack; F0 from STRAIGHT
- F1-F4 and B1-B4 from Snack
- H1, H2, H4
- A1, A2, A3
- Cepstral Peak Prominence
- Energy
- H1-H2(*), H1-A1(*), H1-A2(*), H1-A3(*), H2-H4(*)
All harmonic measures come both corrected (*) and uncorrected.

VoiceSauce: User Interface

Module: Parameter estimation
User specifies:
- where to find .wav files to analyze, and where to save .mat files of results
- which acoustic parameters to calculate (next slide)
- whether to limit analysis to segments in a Praat textgrid for each file (next slide)
- whether to display the waveform of each file during its processing

Parameter selection
The default setting is that all available parameters will be calculated.

Using Praat textgrids in parameter estimation
Praat textgrids (Boersma 2001) are used to delimit and label segments of interest. These can then guide VoiceSauce's acoustic analysis (only labeled segments are analyzed), and they are also used to structure VoiceSauce's output.
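For readers scripting around VoiceSauce, a minimal and deliberately naive way to pull labeled intervals out of a long-format TextGrid is sketched below. A real pipeline should use a proper parser (e.g. the praatio package) and select a specific tier; this regex version is only an assumption about typical long-format files.

```python
import re

def labeled_intervals(path):
    """Naive extraction of (start, end, label) for non-empty intervals
    from a long-format Praat TextGrid. Note that TextGrids are
    sometimes UTF-16 encoded; adjust the encoding if needed."""
    with open(path, encoding="utf-8") as fh:
        tg = fh.read()
    triple = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "(.*?)"')
    return [(float(a), float(b), lab)
            for a, b, lab in triple.findall(tg) if lab.strip()]
```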

Module: Settings
Change how parameters are calculated and how textgrids are used.

Module: Output to text
From Praat textgrids, VoiceSauce identifies all labeled segments and writes out the results for those segments: either all values (at the frame interval) or averages over N sub-intervals. The user specifies which parameters to output. Output can be one large text file with all parameters, or separate smaller text files with subsets of parameters.
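To make the "averages over N intervals" option concrete, here is a hedged sketch: given a parameter track sampled every 1 ms, average it over N equal sub-intervals of a labeled segment (e.g. thirds of a vowel, as in the comparison study later in this deck). The function name and defaults are invented; this mirrors, rather than reproduces, VoiceSauce's output step.

```python
import numpy as np

def segment_means(track, seg_start, seg_end, n_sub=3, frame_s=0.001):
    """Average a 1 ms parameter track over n_sub equal sub-intervals
    of a labeled segment given by (seg_start, seg_end) in seconds."""
    i0, i1 = int(seg_start / frame_s), int(seg_end / frame_s)
    chunks = np.array_split(np.asarray(track[i0:i1], dtype=float), n_sub)
    return [float(np.nanmean(c)) if c.size else float("nan") for c in chunks]
```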

Including EGG data
EGGWorks, a free program by Henry Tehrani based on PCQuirerX, computes several EGG measures in batch mode from .wav files of EGG channels. Its output file can be included as an input to VoiceSauce's output step.

Module: Output to Emu
- For use in Emu speech databases (Harrington, in press)
- Writes Emu trackdata files in SSFF format
- One track file per parameter per audio file, as in Emu
- Data can be viewed, queried, and analyzed in Emu, or in R using the Emu library

Sample display in Emu

Comparing VoiceSauce to other methods
VoiceSauce's outputs were compared to:
- by-hand measurements taken from FFT spectra (the traditional method)
- Praat (Boersma 2001)
The same speech materials were used for all three methods, taken from Vicenik (2008).

Speech corpus
- Voice measures made for the low vowel [a] after 9 Georgian stops
- Three stop types: voiceless aspirated, voiced, ejective
- Three places of articulation: bilabial, alveolar, velar
- Measurements averaged over the first third of the vowel (Praat, VoiceSauce), or measured immediately after vowel onset (by hand)
- Five speakers: middle-aged women from Tbilisi, Georgia

H1-H2 by hand
- Measured in PCQuirer
- FFT spectrum created with 21 Hz bandwidth and a 40 ms window
- Spectrum taken at vowel onset
- H1 and H2 manually marked and logged using the cursor
- Very slow; many files could not be analyzed

H1-H2 by Praat
- Uses a new script based on one by Bert Remijsen (his "msr&check_f1f2_indiv_interv.psc")
- Measures H1-H2, H1-A1, H1-A2, and H1-A3 for each labeled segment on a tier
- Not pitch-synchronous, not corrected
- If Praat cannot find F0 or all three formants, the file is skipped

H1-H2 by VoiceSauce
- All VoiceSauce measures were computed, but some are not available from the other methods
- Here, the uncorrected spectral magnitude measures were used

H1-H2 for three consonant manners: by hand vs. VoiceSauce vs. Praat

Differences
Overall results from the three methods are similar. Measurements made by hand have the smallest H1-H2 range; Praat has the largest H1-H2 range (larger category differences). But Praat measurements also show greater variation than those from VoiceSauce or by hand, about twice as much for non-modal phonation. There are practical differences as well: measuring by hand is very slow, while VoiceSauce and Praat took similar amounts of time to run through the data (about 1.5 hours for this set of 676 vowels).

What makes Praat more variable?
The H1 and H2 measures are both more variable. Some possible reasons:
- the STRAIGHT pitch tracker used in VoiceSauce is very good
- having F0 values every millisecond avoids discontinuities
- harmonic amplitudes are found by optimization, which is equivalent to using a very long FFT window
[Chad] This was my guess, too. Praat calculates Hx by measuring the pitch at the middle of the chunk (in this case there were three chunks, and we are only looking at the first chunk, so Praat measured the pitch 1/6th of the way through the vowel), then creates a spectrum (To Ltas (1-to-1)...) and takes the maximum at the measured pitch frequency (+/- 10%). VoiceSauce averages together all the 1 ms point measurements within the first third of the vowel. This might create a difference in Hx values. Also, if the spectrum-generation algorithm differs between Praat and VoiceSauce, that would give different Hx values. Or, depending on the calculated pitch and the search range (F0 +/- some tolerance), Praat and VoiceSauce could be looking for H1 in slightly different places.

Conclusions
We hope that VoiceSauce will be a useful and easy-to-use tool for researchers interested in multiple voice measures over running speech. Future versions will incorporate additional features.

Download VoiceSauce
- VoiceSauce is free
- Currently requires Matlab to run
- Future versions will be compiled executables
- Available now from: http://www.ee.ucla.edu/~spapl/voicesauce/

References
Boersma, P. (2001) "Praat, a system for doing phonetics by computer", Glot International 5: 341-345.
Harrington, J. (2010, in press) Phonetic Analysis of Speech Corpora. John Wiley and Sons.
Iseli, M. and A. Alwan (2000) "Inter- and Intra-speaker Variability of Glottal Flow Derivative using the LF Model", 6th International Conference on Spoken Language Processing, ICSLP 2000, vol. 1, pp. 477-480.
Iseli, M. and A. Alwan (2004) "An Improved Correction Formula for the Estimation of Harmonic Magnitudes and Its Application to Open Quotient Estimation", Proceedings of IEEE ICASSP, May 2004, pp. 669-672.
Iseli, M., Y.-L. Shue and A. Alwan (2006) "Age- and Gender-Dependent Analysis of Voice Source Characteristics", Proceedings of IEEE ICASSP, vol. I, May 2006, pp. 389-392.
Iseli, M., Y.-L. Shue and A. Alwan (2007) "Age, sex, and vowel dependencies of acoustic measures related to the voice source", J. Acoust. Soc. Am. 121: 2283-2295.
Kawahara, H., A. de Cheveigné and R. D. Patterson (1998) "An instantaneous-frequency-based pitch extraction method for high quality speech transformation: revised TEMPO in the STRAIGHT-suite", Proceedings of ICSLP '98, Sydney, Australia, December 1998.
Remijsen, B., Praat scripts: http://www.ling.ed.ac.uk/~bert/praatscripts.html
Sjölander, K. (2004) "Snack sound toolkit", KTH, Stockholm, Sweden. http://www.speech.kth.se/snack
Tehrani, H. (2009) EGGWorks, a program for automated analysis of EGG signals.
Vicenik, C. (2008) "An Acoustic Analysis of Georgian Stops", unpublished UCLA Masters thesis.

Acknowledgments
NSF grant BCS-0720304.
Code contributors: Henry Tehrani and Markus Iseli.
VoiceSauce beta users: Kristine Yu, Christina Esposito, Sameer Khan, Marc Garellek, Jianjing Kuang.
Co-PIs: Abeer Alwan and Jody Kreiman.