Multimedia Communication Signal Processing Group Analysis, Modelling and Synthesis of British, Australian and American Accents Qin Yan Saeed Vaseghi Multimedia.


2 Multimedia Communication Signal Processing Group Analysis, Modelling and Synthesis of British, Australian and American Accents Qin Yan Saeed Vaseghi Multimedia Communication Signal processing Lab Department of Electronic and Computer Engineering Brunel University Supported by EPSRC

3 Content
1. Introduction to Phonetics and Acoustics of Accents
2. Research Issues in Modelling Acoustics of Accents of English
3. Current Research Problems
4. Accent Analysis and Models
5. Accent Morphing
6. Audio Demo

4 1. Introduction to Phonetics and Acoustics of Accents
1.1 Background
Accents are acoustic manifestations of differences in pronunciation and intonations by a community of people from a national, regional or socio-economic grouping. Accents are dynamic processes in that they evolve over time, influenced by large-scale immigration, socio-economic changes and cultural trends.
Applications of accent models include:
- speech recognition,
- text to speech synthesis,
- voice editing,
- accent morphing in broadcasting and films,
- toys and computer games,
- accent coaching, education.

5 The importance of an accent feature depends on its distance from that of the ‘standard’ or ‘received’ pronunciation and on the frequency with which that feature occurs in the acoustics of speech.
1.2 Basic Structure of Accents
Generally, the structural differences between accents can be divided into two broad parts:
(a) Differences in phonetic transcriptions.
(b) Differences in acoustic correlates and intonations of accents.

6 1.3 Phonetics of Accents
A dominant aspect of accents is the difference in pronunciation as transcribed by a phonetic dictionary. The differences in phonetic transcription can be categorized into two classes:
a) Differences in the number and identity of the phonemes. For example, British English as transcribed by Cambridge University’s BEEP dictionary has five extra vowels, /ax (ə), ea (ɛə), ia (iə), ua (uə), ah (ɒ)/, compared to American English as transcribed by the Carnegie Mellon University (CMU) dictionary. /iə ɛə uə/ are allophones of /i ɛ u/. American /ɒ/ is merged with /a/, in contrast to the British accent. The American transcription has three different levels of stress for vowels and diphthongs. Australian English also has distinctive vowels, such as /æi/ instead of /ei/ and /æɔ/ for /au/.
b) Differences in phonetic realizations: phoneme substitution, deletion, insertion. For example, ‘JOHN’ is pronounced as /ʤʌn/ in American but as /ʤɔn/ in British and Australian English. The word ‘SAY’ is pronounced as /sei/ in British and American but as /sæi/ in Australian.
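The substitution, deletion and insertion classes in (b) can be found automatically by aligning two phoneme sequences. The following is a minimal sketch, assuming each pronunciation is available as a list of phoneme symbols; the function name `pronunciation_differences` and the hand-typed ‘SAY’ transcriptions are illustrative assumptions, and Python's standard `difflib` alignment stands in for whatever alignment the authors used.

```python
from difflib import SequenceMatcher

def pronunciation_differences(reference, other):
    """Return the non-matching alignment operations between two phoneme lists.

    Each item is (operation, reference_phonemes, other_phonemes), where the
    operation is 'replace' (substitution), 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, reference[i1:i2], other[j1:j2]))
    return differences

# Transcriptions of 'SAY' as given on the slide (illustrative input).
british_say = ["s", "ei"]
australian_say = ["s", "æi"]

say_differences = pronunciation_differences(british_say, australian_say)
print(say_differences)
```

Running the same function over every word shared by two pronunciation dictionaries would give counts of each substitution, which is one way to rank phonetic accent features by how often they occur.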

7 1.4 Acoustics of Accents
Perceived acoustic differences of accents are due to differences, during the production of sound, in the configuration, positioning, tension and movement of laryngeal and supra-laryngeal articulatory parameters, namely the vocal folds, vocal tract, tongue and lips.
Four aspects of the acoustic correlates of accents are considered essential for accent models and accent synthesis. These are:
(a) Formant (i.e. frequency of vocal tract resonance) correlates of accents, including:
(i) Formant trajectories $F_{kj}(t)$, where k is the formant index and j is the phoneme index.
(ii) Timing and magnitude of the formant target point(s) in formant space for each phonetic unit.

8 (b) Pitch prosody correlates of accents, including:
(i) Pitch trajectory at various linguistic contexts and positions, e.g. pitch rise at the beginning of a voiced group or phrase, and pitch fall at the end of a phrase.
(ii) Pitch nucleus, i.e. the timing and magnitude of the prominent pitch event in a voiced group.
(c) Duration and timing correlates of accents:
(i) Duration of vowels and diphthongs.
(ii) Relative duration and timings of the two constituent vowels of diphthongs.
(d) Laryngeal (glottal) correlates of accents, i.e. the voice quality of speech segments in certain contexts as a function of accent.

9 2. Research Issues in Modelling Acoustics of Accents of English
- Definition of an accent ‘feature set’ composed of formant trajectories, formant target points, pitch trajectory, power trajectory and duration.
- Separation, normalisation, or averaging out of speakers’ characteristics from accent characteristics; this is required for modelling the parameters of an accent.
- Modelling formants of vowels and diphthongs; the latter are composed of two connected elementary sounds.
- Modelling the duration of vowels and diphthongs and the relative duration of the two halves of diphthongs.
- Modelling pitch trajectory in different phonetic/linguistic positions and contexts.
- Modelling voice quality correlates of an accent in different phonetic/linguistic positions and contexts.
- Integration of all accent features within a coherent generative model.

10 Accent Profile (AP) Parameters | Comments | Rank
Phonetic Parameters
Substitution, insertion, deletion | Pronunciation differences obtained from phonetic transcription dictionaries | *****
Supra-laryngeal and Laryngeal Correlates
Formants & their trajectories | 2nd formant with largest variance is most sensitive to accent | ****
Glottal pulse (Voice Quality) | Durations and shapes of opening and closing of glottal folds | **
Prosody Correlates
F0 mean | Average of pitch | *
F0 range | Range of pitch | *
Pitch Nucleus | Prominent point (stressed) within an intonation group (Tone Unit) | ***
Initial Pitch Rise | First pitch slope of a narrative utterance | ***
Final Pitch Lowering | Final fall pitch slope of a narrative utterance | ***
Final Pitch Rise | Final rise pitch slope of a narrative utterance | ***
Timing and Delivery Correlates
Speaking Rate | Phonemes or words per second | *
Phoneme Duration | Vowel duration elongation and complete pronunciation all affect | ***
Excessive Co-articulation | Clipped or short duration sounds | ****

11 Speech Accent Feature Analysis Method
The basic processes involved in accent analysis include:
- speech phonetic labelling and boundary segmentation using HMMs,
- pitch trajectory and pitch nucleus estimation,
- formant models and formant track estimation,
- duration and power trajectory analysis.
Figure: Block diagram illustration of the processes involved in accent analysis. Blocks: Input Speech; HMM Training; Labeling & Segmentation; Formants & Trajectories; Pitch Contour Tracker; Pitch Marker; Tone Nucleus Features; F0 Range/Mean; Pitch Accents; Speaking Rate & Durations; Accent Profile.

12 Analysis of Duration Correlate of AU, US and UK Accent Speech
Figure: Comparison of speaking rates of British, Australian and American.
Figure: Comparison of phoneme durations of British, Australian and American. Vertical axis: duration (sec), 0.02 to 0.2; phonemes: aa, ae, ah, ao, aw, ay, eh, er, ey, ih, iy, ow, oy, uh, uw; series: Australian, British, American.

13 Table: Word error rate (%) of speech recognition across British, American and Australian accents.
Input | British Model | American Model | Australian Model
British | 12.8 | 29.3 | 34.9
American | 30.6 | 8.8 | 29.94
Australian | 33.1 | 27.3 | 7.28
Table: Comparison of speaking rates (number/sec) of British, American and Australian accents.
Accent | Phone | Word
British | 12.1 | 3.64
American | 11.6 | 3.1
Australian | 10.8 | 2.8
The Australian speaking (word) rate is 23% slower than British. The American speaking (word) rate is 15% slower than British.
There is an apparent correlation between automatic speech recognition accuracy and speaking rate: Australian, with the slowest speaking rate, obtains the best recognition results with its own accent model, followed by American and British.
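The two percentages quoted on this slide follow directly from the word rates in the speaking-rate table. A short check, using only the slide's figures; the helper name `percent_slower` is an illustrative assumption.

```python
def percent_slower(rate, reference_rate):
    """Percentage by which `rate` falls below `reference_rate`."""
    return 100.0 * (reference_rate - rate) / reference_rate

# Word rates (words per second) as listed on the slide.
word_rate = {"British": 3.64, "American": 3.1, "Australian": 2.8}

slowdown = {
    accent: percent_slower(rate, word_rate["British"])
    for accent, rate in word_rate.items()
}

for accent, percent in slowdown.items():
    print(f"{accent}: {percent:.1f}% slower than British")
```

The same helper applied to the phone rates (12.1, 11.6, 10.8) gives smaller gaps, which suggests the accents differ more in words per second than in phones per second.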

14 Formant Estimation with 2D-HMM
Figure: LP-based formant-candidate feature extraction method. Blocks: Speech; Segmentation & window; LPC Model; Polynomial roots; Frequency, Bandwidth, Intensity Calculation; Formant candidate feature vector.
Formant feature extraction, as illustrated, consists of three main functions: (1) an LP model, (2) a polynomial root finder, and (3) a contour trend estimator.
Consider the z-transfer function of an LP model with K real poles, I complex pole pairs and a gain factor G (equation not reproduced in the transcript), where $A_k$ is the pole radius, $F_i$ the pole frequency and $F_s$ the sampling frequency.
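The LP model and polynomial root finder stages can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the LP denominator is given as coefficients [1, a1, ..., aP] of A(z) = 1 + a1 z^-1 + ... + aP z^-P, and it uses the standard relations frequency = angle * Fs / (2*pi) and bandwidth = -ln(radius) * Fs / pi for a complex pole. The function name `formant_candidates` and the demo pole values are assumptions; the intensity calculation and the contour trend estimator are not covered.

```python
import numpy as np

def formant_candidates(lp_coeffs, fs):
    """Convert LP polynomial coefficients into (frequency, bandwidth) candidates.

    lp_coeffs: [1, a1, ..., aP] for A(z) = 1 + a1 z^-1 + ... + aP z^-P.
    fs: sampling frequency in Hz.
    Returns a list of (frequency_hz, bandwidth_hz), sorted by frequency,
    with one entry per complex-conjugate pole pair.
    """
    poles = np.roots(lp_coeffs)
    poles = poles[np.imag(poles) > 0]          # keep one pole of each conjugate pair
    radius = np.abs(poles)
    freq_hz = np.angle(poles) * fs / (2.0 * np.pi)
    bandwidth_hz = -np.log(radius) * fs / np.pi
    order = np.argsort(freq_hz)
    return [(float(freq_hz[i]), float(bandwidth_hz[i])) for i in order]

# Demo with two hypothetical resonances (values chosen only for illustration).
fs = 8000.0
demo_poles = [0.95 * np.exp(2j * np.pi * 500.0 / fs),
              0.90 * np.exp(2j * np.pi * 1500.0 / fs)]
all_poles = demo_poles + [np.conj(p) for p in demo_poles]
lp_coeffs = np.real(np.poly(all_poles))        # build A(z) from its roots

candidates = formant_candidates(lp_coeffs, fs)
for freq, bw in candidates:
    print(f"frequency {freq:.1f} Hz, bandwidth {bw:.1f} Hz")
```

In practice the coefficients would come from LP analysis of a windowed speech frame, and the resulting candidates would be passed to the formant classifier; real poles are discarded here because they do not correspond to resonances.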

15 Figure: Illustration of the LP spectrum and the modelling of 6 complex pole pairs of a speech segment with an HMM composed of 4 formant-states. Axes: frequency (Hz) and time (s).
- 2D HMMs span the time and frequency dimensions.
- Left-right HMM states across frequency model formants such that the first state models the first formant, the second state the second formant, and so on.
- The distribution of formants in each state is modelled by a mixture Gaussian density.

16 Three spectrogram examples of formant tracks superimposed on LPC spectrum of speech

17 Figure: Comparison of histograms (thin solid line) and Gaussian HMMs (bold dashed line) of the formants of Australian English. X axis: frequency (Hz); Y axis: probability. The figures show that HMMs are excellent models of the distribution of the formants.

18 Comparison of Formant Spaces of American, Australian and British Accents
Figure: F1 vs F2 space of British, Australian and American English.
Note the following features:
- Rising of vowels /ae/ and /eh/ in Australian.
- Fronting of the open vowel /aa/ and high vowel /uw/ in Australian.
- Fronting and rising of the vowel /er/ in Australian.
- The vowels /iy/, /eh/ and /ae/ in Australian are closer.

19 Figure: Comparison of formant trajectories and formant target times of British, Australian and American accents

20 Formant Ranking using a normalised distance
Accent Pairs | Formant Ranking Order: 1 | 2 | 3 | 4
British & Australian | 1st | 2nd | 4th | 3rd
British & American | 2nd | 1st | 3rd | 4th
Australian & American | 2nd | 1st | 3rd | 4th
The 2nd formant has the widest frequency range and is the most sensitive to accent.
Figure: Comparison of formants of Australian, British and American (female)

21 Accent Morphing Method
Figure: Diagram of a voice morphing system used for accent conversion. Blocks: Source Speech; Speech Labeling & Segmentation; Formant Estimation; Pitch Tracker; HMM Training/Adaptation; Accent Model; Formant Mapping; Prosody Modification; Accent Synthesised Speech.
Formant Mapping: transformation of the formants of the source towards those of the target accent is based on non-uniform frequency warping of the linear prediction model.
Prosody Modification: based on the time domain pitch synchronous overlap and add (TD-PSOLA) method. Prosody modification includes pitch slope, duration and power trajectory.
Applications: text to speech synthesis; broadcasting systems, e.g. accent modification in films; education software such as language teaching; speech interfaces in mobile, call centre and other electronic products.

22 Formant Transformation via Non-Uniform LP Frequency Warping
Figure: Illustration of non-uniform frequency warping using the LP model frequency response. The spectrum is divided into a number of bands centered on the formants and a different set of warping parameters is applied to each band. Axes: magnitude (dB) versus frequency (normalised, 0 to 1); labelled bands F01, F12, F23, F34, F45; bandwidth BW; formant peaks 1 to 4; intervals I12, I23, I34.
Figure: Illustration of the modification of the spectrum towards the formants of the target accent. Blocks: Speech; Linear Prediction Model; Pole estimation; Polynomial roots; Formant Estimation; Formant HMMs; Formant Transformation Ratios; LP Spectrum Mapping; Accent modified spectrum.

23 The frequency bands of the source speaker [F01 F12 F23 F34 F45] are mapped to the target accent using a set of warping ratios derived from differences in the formants of phonetic segments of speech across accents as
$\beta_{i(i+1)} = \dfrac{f^{T}_{i+1} - f^{T}_{i}}{f^{S}_{i+1} - f^{S}_{i}}$
where $f^{T}_{i}$ and $f^{S}_{i}$ are the $i$-th formants of the target and source accents, respectively. The frequency mapping can be expressed as
$f'_{i(i+1)} = \beta_{i(i+1)} \, f_{i(i+1)}$
Figure: Illustration of warped (solid line) and original (dash-dot line) formant trajectories of /aa/ in accent conversion from Australian to British.
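The band-wise mapping can be sketched as a piecewise-linear frequency warp whose slope in the band between formants i and i+1 equals the ratio of the target formant spacing to the source formant spacing. This is an illustration under stated assumptions, not the authors' code: the formant values are hypothetical, the function names `warping_ratios` and `warp_frequency` are invented for the example, each band is offset so that the warp stays continuous and sends every source formant exactly to its target, and 0 Hz and the Nyquist frequency are held fixed.

```python
import numpy as np

def warping_ratios(source_formants, target_formants):
    """Per-band ratio: target formant spacing divided by source formant spacing."""
    source = np.asarray(source_formants, dtype=float)
    target = np.asarray(target_formants, dtype=float)
    return np.diff(target) / np.diff(source)

def warp_frequency(freq_hz, source_formants, target_formants, fs):
    """Piecewise-linear warp sending each source formant to its target formant.

    Assumption: 0 Hz and fs/2 are fixed points, so the lowest and highest
    bands are stretched or compressed to fit.
    """
    source_knots = np.concatenate(([0.0], source_formants, [fs / 2.0]))
    target_knots = np.concatenate(([0.0], target_formants, [fs / 2.0]))
    return np.interp(freq_hz, source_knots, target_knots)

# Hypothetical formant frequencies in Hz, for illustration only.
fs = 8000.0
source_formants = np.array([700.0, 1200.0, 2500.0, 3500.0])
target_formants = np.array([650.0, 1400.0, 2600.0, 3400.0])

ratios = warping_ratios(source_formants, target_formants)
warped_formants = warp_frequency(source_formants, source_formants, target_formants, fs)
print("band ratios:", ratios)
print("warped formants:", warped_formants)
```

Applying `warp_frequency` to the frequency axis of an LP spectrum (or to the pole angles) would move the spectral peaks towards the target accent while leaving the band edges continuous.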

24 Pitch Modification Using Time Domain PSOLA (TD-PSOLA)
TD-PSOLA is applied to each corresponding voiced speech segment to modify the pitch slope and duration of the segments.
Figure: Illustration of the mapping of pitch periods of a source speech to a target. Panels: source speech pitch marks; target speech pitch marks.
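The mapping from source pitch marks to target pitch marks can be sketched in a few lines. This is a deliberately simplified illustration and not the system described in the slides: it applies one constant pitch scale factor to a voiced segment, keeps the duration unchanged, and does not modify pitch slope or power. The function names `target_pitch_marks` and `td_psola`, and the pulse-train test signal, are assumptions made for the example.

```python
import numpy as np

def target_pitch_marks(source_marks, pitch_scale):
    """Place target marks so each local pitch period is divided by pitch_scale."""
    source_marks = np.asarray(source_marks, dtype=float)
    periods = np.diff(source_marks)
    marks = [source_marks[0]]
    while True:
        # Local source period at the current position (held constant at the ends).
        local_period = np.interp(marks[-1], source_marks[:-1], periods)
        next_mark = marks[-1] + local_period / pitch_scale
        if next_mark > source_marks[-1]:
            break
        marks.append(next_mark)
    return np.array(marks)

def td_psola(signal, source_marks, pitch_scale):
    """Overlap-add two-period Hann-windowed segments at the target marks."""
    signal = np.asarray(signal, dtype=float)
    source_marks = np.asarray(source_marks, dtype=int)
    target_marks = target_pitch_marks(source_marks, pitch_scale)
    output = np.zeros(len(signal))
    for mark in target_marks:
        k = int(np.argmin(np.abs(source_marks - mark)))   # nearest source mark
        if k + 1 < len(source_marks):
            period = int(source_marks[k + 1] - source_marks[k])
        else:
            period = int(source_marks[k] - source_marks[k - 1])
        centre = int(round(mark))
        src_lo, src_hi = source_marks[k] - period, source_marks[k] + period + 1
        out_lo, out_hi = centre - period, centre + period + 1
        if src_lo < 0 or src_hi > len(signal) or out_lo < 0 or out_hi > len(output):
            continue                                      # skip segments at the edges
        output[out_lo:out_hi] += signal[src_lo:src_hi] * np.hanning(2 * period + 1)
    return output, target_marks

# Demo: a pulse train with a 100-sample period, raised by one octave.
pulse_train = np.zeros(1000)
source_marks = np.arange(100, 1000, 100)
pulse_train[source_marks] = 1.0
shifted, target_marks = td_psola(pulse_train, source_marks, pitch_scale=2.0)
print(np.diff(target_marks))
```

A duration change would additionally stretch or compress the time axis when choosing which source mark feeds each target mark, and a pitch slope change would make `pitch_scale` a function of time.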

25 Figure: Examples of changes in accent/duration modulation of pitch: (a) ‘article’ in Australian; (b) Australian-accent ‘article’ transformed to British accent; (c) ‘asked’ in Australian; (d) Australian-accent ‘asked’ transformed to British accent.

26 An Outline of Voice-Morph: A System for Voice and Accent Conversion
Figure: Block diagram of the system. Blocks: Source Speech; Target Speech; Model Estimation; LP Model; Formant Trajectory; Formant Tracking; Source Speaker HMM Model; Target Speaker HMM Model; Warping Factors; Formant Mapping; LPC-Spectrum Warping / Pole Rotation; Speech Reconstruction; Mapped Speech.
An example of voice conversion: American male; American female; Transformed (AM m->f).

27 Accent Conversion Demonstration
Columns: Source Accent; Target Accent; Transformed; Spoken word.
Accent pairs: Australian, British, Transformed; British, American, Transformed.
Spoken words: ‘Article’, ‘Claim’, ‘Cooperation’, ‘Beige’; ‘Boston’, ‘Opposition’, ‘The occupied’.

28 The End

