Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent Gina-Anne Levow University of Chicago June 6, 2006.



Roadmap
Challenges for Tone and Pitch Accent
–Variation and learning
Data collections & processing
Learning with less
–Semi-supervised learning
–Unsupervised clustering
Approaches, structure, and context
Conclusion

Challenges: Tone and Variation
Tone and pitch accent recognition
–Key component of language understanding
  Lexical tone carries word meaning
  Pitch accent carries semantic, pragmatic, and discourse meaning
–Non-canonical form (Shen 90, Shih 00, Xu 01)
  Tonal coarticulation modifies the surface realization
  –In extreme cases, a fall becomes a rise
–Tone is relative
  To speaker range: high for a male speaker may be low for a female
  To phrase range and other tones, e.g. downstep

Challenges: Training Demands
Tone and pitch accent recognition
–Exploits data-intensive machine learning
  SVMs (Thubthong 01, Levow 05, SLX 05)
  Boosted and bagged decision trees (X. Sun 02)
  HMMs (Wang & Seneff 00, Zhou et al. 04, Hasegawa-Johnson et al. 04, …)
–Can achieve good results with large sample sets
  ~10K lab syllable samples -> >90% accuracy
–Training data is expensive to acquire
  Time: pitch accent labeling runs at tens of times real-time
  Money: requires skilled labelers
  Limits investigation across domains, styles, etc.
–Human language acquisition doesn't use labels

Strategy: Overall
Tone and intonation across languages
–Common machine learning classifiers
–Acoustic-prosodic model
  No word label, POS, or lexical stress information
  No explicit tone-label sequence model
–English and Mandarin Chinese

Strategy: Training
Challenge:
–Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data?
Exploit semi-supervised and unsupervised learning
–Semi-supervised Laplacian SVM
–K-means and asymmetric k-lines clustering
–Both substantially outperform baselines and can approach supervised levels

Data Collections I: English
English (Ostendorf et al. 95)
–Boston University Radio News Corpus, f2b
–Manually ToBI annotated, aligned, and syllabified
–Pitch accent aligned to syllables
  4-way: Unaccented, High, Downstepped High, Low (Sun 02; Ross & Ostendorf 95)
  Binary: Unaccented vs. Accented

Data Collections II: Mandarin Mandarin: –Lexical tones: High, Mid-rising, Low, High falling, Neutral

Data Collections III: Mandarin
Mandarin Chinese:
–Lab speech data (Xu, 1999)
  5-syllable utterances varying tone and focus position
  –In-focus, pre-focus, post-focus
–TDT2 Voice of America Mandarin Broadcast News
  Automatically force-aligned to anchor scripts
  –Automatically segmented; pinyin pronunciation lexicon
  –Manually constructed pinyin-ARPABET mapping
  –CU Sonic – language porting
–4-way: High, Mid-rising, Low, High-falling

Local Feature Extraction
Motivated by the pitch target approximation model
–Tone/pitch accent target is exponentially approached
–Linear target: height, slope (Xu et al. 99)
Scalar features:
–Pitch and intensity max and mean (Praat, speaker-normalized)
–Pitch at 5 points across the voiced region
–Duration
–Initial/final position in phrase
Slope:
–Linear fit to the last half of the pitch contour
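The scalar features above can be sketched in a few lines of numpy. This is a minimal illustration, not the talk's actual pipeline: the original work extracted speaker-normalized pitch and intensity with Praat, while here we assume a plain f0 contour sampled at a fixed frame rate.

```python
import numpy as np

def pitch_features(f0, frame_dur=0.01):
    """Minimal sketch of the scalar features listed above, given a
    voiced-region f0 contour (Hz) at a fixed frame rate. Hypothetical
    helper; the original system used Praat with speaker normalization."""
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    # Pitch sampled at 5 evenly spaced points across the voiced region
    idx = np.linspace(0, n - 1, 5).round().astype(int)
    five_points = f0[idx]
    # Linear fit to the last half of the contour gives the slope feature
    half = f0[n // 2:]
    t = np.arange(len(half)) * frame_dur
    slope = np.polyfit(t, half, 1)[0]
    return {
        "pitch_max": f0.max(),
        "pitch_mean": f0.mean(),
        "five_points": five_points,
        "duration": n * frame_dur,
        "slope": slope,
    }
```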

Context Features
Local context:
–Extended features
  Pitch max, mean, and adjacent points of the adjacent syllable
–Difference features w.r.t. the adjacent syllable
  Differences in pitch max, mean, mid, slope and intensity max, mean
Phrasal context:
–Compute the collection-average phrase slope
–Compute scalar pitch values adjusted for that slope

Experimental Configuration
English pitch accent:
–Proportionally sampled: 1000 examples
–4-way and binary classification
–Contextualized representation, preceding syllables
Mandarin tone:
–Balanced tone sets: 400 examples
–Vary data set difficulty: clean lab speech -> broadcast news
–4-tone classification
–Simple local pitch-only features
  Prior lab speech experiments were effective with local features

Semi-supervised Learning
Approach:
–Employ a small amount of labeled data
–Exploit information from additional – presumably more available – unlabeled data
–Few prior examples: EM, co- and self-training (Ostendorf 05)
Classifier:
–Laplacian SVM (Sindhwani, Belkin & Niyogi 05)
  Semi-supervised variant of the SVM that exploits unlabeled examples
  RBF kernel, typically 6 nearest neighbors
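As a compact illustration of the graph-based semi-supervised idea, the sketch below substitutes scikit-learn's LabelSpreading (a related graph-based learner, not the Laplacian SVM used in the talk) over a 6-nearest-neighbor graph. The data, label counts, and resulting accuracy are synthetic stand-ins, not the talk's results.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Two well-separated clusters standing in for two prosodic classes
X, y = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Hide all but 10 labels per class: -1 marks unlabeled points
rng = np.random.RandomState(0)
y_train = np.full_like(y, -1)
for c in (0, 1):
    idx = rng.choice(np.where(y == c)[0], size=10, replace=False)
    y_train[idx] = c

# Graph-based semi-supervised learner over a 6-nearest-neighbor graph
model = LabelSpreading(kernel="knn", n_neighbors=6)
model.fit(X, y_train)
acc = (model.transduction_ == y).mean()
```

With only 20 labels, the graph structure over the unlabeled points lets the learner recover nearly the full labeling, which is the effect the slide describes.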

Experiments
Pitch accent recognition:
–Binary classification: Unaccented/Accented
–1000 instances, proportionally sampled
  Labeled training: 200 unaccented, 100 accented
–>80% accuracy (cf. 84% for an SVM with 15x the labeled data)
Mandarin tone recognition:
–4-way classification: n(n-1)/2 binary classifiers
–400 instances, balanced; 160 labeled
–Clean lab speech, in-focus: 94%
  cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160 training samples
–Broadcast news: 70%
  cf. <50% for a supervised SVM with 160 training samples; 74% with 4x the training data

Unsupervised Learning
Question:
–Can we identify the tone structure of a language from the acoustic space without training?
  Analogous to language acquisition
Significant recent research in unsupervised clustering
–Established approaches: k-means
–Spectral clustering: eigenvector decomposition of an affinity matrix
  (Shi & Malik 2000; Fischer & Poland 2004; BNS 2004)
Little research for tone
–Self-organizing maps (Gauthier et al. 2005)
  Tones identified in lab speech using f0 velocities

Unsupervised Pitch Accent
Pitch accent clustering:
–4-way distinction: 1000 samples, proportional
–2-16 clusters constructed
  Assign the most frequent class label to each cluster
Learner:
–Asymmetric k-lines clustering (Fischer & Poland 05)
  Context-dependent kernel radii, non-spherical clusters
–>78% accuracy
–Context effects: vectors with and without context are comparable
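The cluster-then-label evaluation above can be sketched as follows. This uses plain k-means as a stand-in for asymmetric k-lines (which the talk actually used), and synthetic data rather than the pitch accent features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def cluster_label_accuracy(X, y, n_clusters):
    """Cluster without labels, then assign each cluster its most
    frequent true label and score the induced labeling: the
    evaluation scheme described on the slide above."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    assign = km.fit_predict(X)
    pred = np.empty_like(y)
    for c in range(n_clusters):
        members = assign == c
        # Majority true class within cluster c becomes its label
        pred[members] = np.bincount(y[members]).argmax()
    return (pred == y).mean()
```

Note that with more clusters than classes, several clusters may map to the same label; this is why the slide sweeps 2 to 16 clusters.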

Contrasting Clustering Approaches
3 spectral approaches:
–Asymmetric k-lines (Fischer & Poland 2004)
–Symmetric k-lines (Fischer & Poland 2004)
–Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004): binary weights, k-lines clustering
K-means: standard Euclidean distance
# of clusters: 2-16
Best results: >78%
–2 clusters: asymmetric k-lines; >2 clusters: k-means
–With larger numbers of clusters, results grow more similar

Contrasting Learners

Tone Clustering
Mandarin four tones; 400 samples, balanced
2-phase clustering: 2-3 clusters each
Asymmetric k-lines:
–Clean read speech:
  In-focus syllables: 87% (cf. 99% supervised)
  In-focus and pre-focus: 77% (cf. 93% supervised)
–Broadcast news: 57% (cf. 74% supervised)
Contrast – k-means:
–In-focus syllables: 74.75%
–Requires more clusters to reach the asymmetric k-lines level

Tone Structure
The first phase of clustering splits high/rising from low/falling tones by slope
The second phase splits by pitch height or slope

Conclusions
Exploiting unlabeled examples for tone and pitch accent
–Semi- and unsupervised approaches
–Best cases approach supervised levels with less training
–Leveraging both labeled & unlabeled examples works best
–Both spectral approaches and k-means are effective
  Contextual information is less well exploited than in the supervised case
Exploit the acoustic structure of the tone and accent space

Future Work
Additional languages and tone inventories
–Cantonese: 6 tones
–Bantu family languages: truly rare data
Language acquisition
–Use of child-directed speech as input
–Determination of the number of clusters

Thanks
V. Sindhwani, M. Belkin & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
Dinoj Surendran, Siwei Wang, Yi Xu
This work was supported by NSF Grant #

Spectral Clustering in a Nutshell
Basic spectral clustering:
–Build an affinity matrix
–Determine the dominant eigenvectors and eigenvalues of the affinity matrix
–Compute a clustering based on them
Approaches differ in:
–Affinity matrix construction: binary weights, conductivity, heat weights
–Clustering: cut, k-means, k-lines

K-Lines Clustering Algorithm
Due to Fischer & Poland
1. Initialize vectors m1...mK (e.g. randomly, or as the first K eigenvectors of the spectral data yi)
2. For j=1...K: define Pj as the set of indices of all points yi closest to the line defined by mj, and create the matrix Mj whose columns are the corresponding vectors yi
3. Compute the new value of every mj as the first eigenvector of MjMjT
4. Repeat from 2 until the mj's do not change
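The steps above can be sketched in numpy as follows. This is a minimal reading of the algorithm, assuming lines through the origin and random initialization; it is not Fischer & Poland's reference implementation.

```python
import numpy as np

def k_lines(Y, K, n_iter=100, seed=0):
    """Sketch of the k-lines algorithm above: assign points to the
    nearest line through the origin, then update each line direction
    as the first eigenvector of its cluster's scatter matrix Mj Mj^T."""
    rng = np.random.RandomState(seed)
    n, d = Y.shape
    # Step 1: initialize unit direction vectors m1..mK (randomly here)
    M = rng.randn(K, d)
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Step 2: distance of y to the line spanned by unit vector m is
        # ||y||^2 - (m . y)^2; assign each point to its nearest line
        proj = np.abs(Y @ M.T)                            # (n, K)
        dist = (Y ** 2).sum(1, keepdims=True) - proj ** 2
        new_labels = dist.argmin(1)
        if (new_labels == labels).all():
            break                                         # step 4: converged
        labels = new_labels
        # Step 3: mj <- first eigenvector of Mj Mj^T for each cluster
        for j in range(K):
            pts = Y[labels == j]
            if len(pts):
                _, vecs = np.linalg.eigh(pts.T @ pts)
                M[j] = vecs[:, -1]   # eigenvector of largest eigenvalue
    return labels, M
```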

Asymmetric Clustering
Replace the Gaussian kernel of fixed width with a context-dependent kernel radius (Fischer & Poland, TR-IDSIA-12-04, p. 12)
–where tau = 2d+1 or 10; results are largely insensitive to tau

Laplacian SVM
Manifold regularization framework
–Hypothesis: the intrinsic (true) data lies on a low-dimensional manifold; the ambient (observed) data lies in a possibly high-dimensional space
Preserves locality:
–Points close in ambient space should be close in intrinsic space
–Use labeled and unlabeled data to warp the function space
–Run an SVM on the warped space
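The warping described above corresponds to the manifold regularization objective of Belkin, Niyogi & Sindhwani; the general form (given here from that framework, hedged, since the slide's own equations are not in the transcript) augments the usual loss-plus-norm objective with a graph Laplacian penalty over labeled and unlabeled points:

```latex
f^{*} = \arg\min_{f \in \mathcal{H}_K} \;
  \frac{1}{l}\sum_{i=1}^{l} V\!\left(x_i, y_i, f\right)
  + \gamma_A \lVert f \rVert_K^2
  + \frac{\gamma_I}{(l+u)^2}\, \mathbf{f}^{\top} L\, \mathbf{f}
```

Here $V$ is the hinge loss for the SVM case, $\gamma_A$ controls ambient smoothness, $\gamma_I$ controls intrinsic (manifold) smoothness, and $L$ is the graph Laplacian built over all $l+u$ examples.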

Laplacian SVM (Sindhwani)

Input: l labeled and u unlabeled examples
Output:
Algorithm:
–Construct the adjacency graph; compute the Laplacian
–Choose a kernel K(x,y); compute the Gram matrix K
–Compute
–And

Current and Future Work
Interactions of tone and intonation
–Recognition of topic and turn boundaries
–Effects of topic and turn cues on tone realization
Child-directed speech & tone learning
Support for computer-assisted tone learning
Structured sequence models for tone
–Sub-syllable segmentation & modeling
Feature assessment
–Band energy and intensity in tone recognition

Related Work
Tonal coarticulation:
–Xu & Sun 02; Xu 97; Shih & Kochanski 00
English pitch accent:
–X. Sun 02; Hasegawa-Johnson et al. 04; Ross & Ostendorf 95
Lexical tone recognition:
–SVM recognition of Thai tone: Thubthong 01
–Context-dependent tone models: Wang & Seneff 00; Zhou et al. 04

Pitch Target Approximation Model
Pitch target:
–Linear model:
–Exponentially approximated:
–In practice, assume the target is well approximated by its mid-point (Sun, 02)
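The slide's equations are not preserved in the transcript; the commonly cited form of the pitch target approximation model (after Xu and colleagues, given here as background rather than as the slide's exact notation) is a linear underlying target that the surface f0 contour approaches exponentially:

```latex
% Underlying linear pitch target (height b, slope a)
T(t) = a\,t + b

% Surface f0 decays exponentially from its onset toward the target
f_0(t) = \bigl(f_0(0) - T(0)\bigr)\,e^{-\lambda t} + T(t)
```

This motivates the feature set used earlier: target height and slope, plus a linear fit to the latter part of the contour where the target is most closely approached.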

Classification Experiments
Classifier: Support Vector Machine
–Linear kernel
–Multiclass formulation
–SVMlight (Joachims); LibSVM (Chang & Lin 01)
–4:1 training/test splits
Experiments: effects of
–Context position: preceding, following, none, both
–Context encoding: extended/difference
–Context type: local, phrasal
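The experimental setup above can be sketched with scikit-learn, which wraps the same LibSVM solver. The data here is a synthetic stand-in for the tone feature vectors; the 4:1 split and linear kernel follow the slide, and SVC's default one-vs-one multiclass scheme matches the n(n-1)/2 binary classifiers mentioned earlier.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 4-class stand-in for the Mandarin tone feature vectors
X, y = make_classification(n_samples=500, n_features=12, n_informative=8,
                           n_classes=4, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

# 4:1 training/test split as in the experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

# Linear-kernel SVM; multiclass is handled one-vs-one internally
clf = SVC(kernel="linear").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```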

Results: Local Context

Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74.0%           80.7%
Extend Pre       74.0%           79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69.0%           77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%


Discussion: Local Context
Any context information improves over none
–Preceding context information consistently improves over none or following context
English: generally, more context features are better
Mandarin: following context can degrade performance
–Little difference between encodings (Extend vs. Diffs)
Consistent with the phonological analysis (Xu) that carryover coarticulation is greater than anticipatory coarticulation

Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72.0%           79.9%

Phrase contour compensation enhances recognition
–A simple strategy; non-linear slope compensation may improve further

Context: Summary
Employ a common acoustic representation
–Tone (Mandarin), pitch accent (English)
–Cantonese: ~64%; 68% with an RBF kernel
SVM classifiers, linear kernel: 76%, 81%
Local context effects:
–Up to >20% relative reduction in error
–Preceding context makes the greatest contribution (carryover vs. anticipatory)
Phrasal context effects:
–Compensation for phrasal contour improves recognition


Aside: More Tones
Cantonese:
–CUSENT corpus of read broadcast news text
–Same feature extraction & representation
–6 tones: high level, high rise, mid level, low fall, low rise, low level
–SVM classification: linear kernel 64%, Gaussian kernel 68%
  Tones 3 and 6: 50% pairwise – mutually indistinguishable
  Human levels: no context 50%; with context 68%
Augment with the syllable phone sequence:
–86% accuracy; for 90% of syllables with tone 3 or 6, one tone dominates

Aside: Voice Quality & Energy
By Dinoj Surendran
Assess local voice quality and energy features for tone
–Not typically associated with Mandarin
Considered:
–VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
Useful: band energy significantly improves results
–Especially for the neutral tone
–Supports identification of unstressed syllables
  Spectral balance predicts stress in Dutch

Roadmap
Challenges for Tone and Pitch Accent
–Contextual effects
–Training demands
Modeling Context for Tone and Pitch Accent
–Data collections & processing
–Integrating context
–Context in recognition
Reducing Training Demands
–Data collections & structure
–Semi-supervised learning
–Unsupervised clustering
Conclusion

Strategy: Context
Exploit contextual information
–Features from adjacent syllables
  Height, shape: direct, relative
–Compensate for the phrase contour
–Analyze the impact of context position, context encoding, and context type
>20% relative improvement over no context