Presentation transcript: "Brian King, Advised by Les Atlas, Electrical Engineering, University of Washington. This research was funded by the Air Force Office of Scientific Research."


2 Brian King (bbking@uw.edu), Advised by Les Atlas, Electrical Engineering, University of Washington. This research was funded by the Air Force Office of Scientific Research.

3 Problem Statement
 Develop a theoretical framework for complex probabilistic latent semantic analysis (CPLSA) and its application to single-channel source separation

4 Outline
 Introduction
 Background
 My current contributions
 Proposed work

5 Nonnegative Matrix Factorization (NMF)
NMF approximates a nonnegative matrix as the product of two nonnegative factors, $X_{f,t} \approx B_{f,k} W_{k,t}$: $X$ is the spectrogram (frequency $f$ by time $t$), $B$ holds the spectral bases (frequency $f$ by basis index $k$), and $W$ holds their time-varying weights (basis index $k$ by time $t$).
[1] D.D. Lee and H.S. Seung, "Algorithms for Non-Negative Matrix Factorization," Neural Information Processing Systems, 2001, pp. 556-562.
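As a concrete illustration (not part of the original slides), the factorization can be computed with scikit-learn's NMF; the random spectrogram and the choice of 20 bases are placeholder assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder magnitude spectrogram: frequency x time (must be nonnegative).
X = np.abs(np.random.randn(257, 400))

n_bases = 20  # k: number of basis vectors (assumed value)
model = NMF(n_components=n_bases, init='random', random_state=0, max_iter=500)
B = model.fit_transform(X)   # B_{f,k}: spectral bases, shape (257, 20)
W = model.components_        # W_{k,t}: time-varying weights, shape (20, 400)

print(np.linalg.norm(X - B @ W) / np.linalg.norm(X))  # relative reconstruction error
```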

6 Using Matrix Factorization for Source Separation
[Block diagram: individual training signals pass through the STFT* to give X_indiv, from which bases B are found. The mixed signal x_mixed passes through the STFT to give X_mixed, from which weights W are found with the bases held fixed. Separation then yields Y1 and Y2, which the ISTFT** converts back to time-domain signals y1 and y2.]
*Short Time Fourier Transform  **Inverse Short Time Fourier Transform
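A minimal sketch of this pipeline, under several assumptions: the signals are placeholders, scikit-learn's NMF stands in for the base-learning step, a simple multiplicative update learns the mixture weights with the bases fixed, and a Wiener-style soft mask stands in for the slides' unspecified separation block:

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def learn_weights(X, B, n_iter=200, eps=1e-12):
    """Multiplicative updates for W with the bases B held fixed."""
    W = np.abs(np.random.rand(B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        W *= (B.T @ X) / (B.T @ (B @ W) + eps)
    return W

fs, n = 16000, 512
y1 = np.random.randn(fs)          # placeholder clean training signal, talker 1
y2 = np.random.randn(fs)          # placeholder clean training signal, talker 2
y_mix = y1 + y2                   # placeholder mixture

_, _, X1 = stft(y1, fs, nperseg=n)
_, _, X2 = stft(y2, fs, nperseg=n)
_, _, Xm = stft(y_mix, fs, nperseg=n)

# Find bases: learn each talker's spectral bases from clean spectrograms.
B1 = NMF(n_components=10, init='random', random_state=0).fit_transform(np.abs(X1))
B2 = NMF(n_components=10, init='random', random_state=1).fit_transform(np.abs(X2))

# Find weights: hold the concatenated bases fixed on the mixture.
B = np.hstack([B1, B2])
W = learn_weights(np.abs(Xm), B)

# Separation: a Wiener-style soft mask (one common choice), then ISTFT.
M1 = (B1 @ W[:10]) / (B @ W + 1e-12)
_, y1_hat = istft(M1 * Xm, fs, nperseg=n)
_, y2_hat = istft((1 - M1) * Xm, fs, nperseg=n)
```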

7 Using Matrix Factorization for Synthesis / Source Separation
[Block diagram: matrix factorization decomposes the spectrogram X (f by t) into bases (f by k) and weights (k by t). For synthesis, a single B and W produce one synthesized signal Y1. For source separation, the bases and weights partition into (B1, W1) and (B2, W2), producing separated signals Y1 and Y2.]

8 NMF Cost Function: Frobenius Norm with Sparsity
$$\min_{B, W \ge 0} \; \tfrac{1}{2}\,\| X - B W \|_F^2 + \lambda \| W \|_1$$
where $X_{f,t}$ is the observed magnitude spectrogram, $B_{f,k}$ the bases, and $W_{k,t}$ the weights: the first term is the squared Frobenius norm of the reconstruction error, and the second is an $L_1$ sparsity penalty with weight $\lambda$.
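One standard way to minimize this cost (a common choice, not necessarily the exact algorithm behind the slides) is Lee-Seung-style multiplicative updates, where the L1 penalty appears as a constant in the denominator of the W update:

```python
import numpy as np

def sparse_nmf(X, k, lam=0.1, n_iter=300, eps=1e-12):
    """NMF minimizing 0.5*||X - BW||_F^2 + lam*||W||_1 via multiplicative updates."""
    f, t = X.shape
    B = np.abs(np.random.rand(f, k))
    W = np.abs(np.random.rand(k, t))
    for _ in range(n_iter):
        # The L1 penalty on W adds the constant `lam` to W's denominator.
        W *= (B.T @ X) / (B.T @ (B @ W) + lam + eps)
        B *= (X @ W.T) / ((B @ W) @ W.T + eps)
        # Normalizing bases is a common trick so the sparsity acts only on W.
        B /= (B.sum(axis=0, keepdims=True) + eps)
    return B, W
```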

9 Probabilistic Latent Semantic Analysis (PLSA)
 Views the magnitude spectrogram as a joint probability distribution
[2] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic Latent Variable Models as Nonnegative Factorizations," Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9.

10 Probabilistic Latent Semantic Analysis (PLSA)
 Uses the following generative model:
○ Pick a time, P(t)
○ Pick a base from that time, P(k|t)
○ Pick a frequency of that base, P(f|k)
○ Increment the chosen (f,t) bin by one
○ Repeat
 Can be written as $P(f,t) = P(t) \sum_k P(k|t)\, P(f|k)$
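The generative model can be simulated directly. In this sketch the distribution sizes and parameters are arbitrary assumptions; as the number of draws grows, the histogram approaches P(f,t):

```python
import numpy as np

rng = np.random.default_rng(0)
F, K, T, n_draws = 64, 5, 100, 50000  # assumed sizes

# Random model parameters (each column a valid probability distribution).
P_t = rng.dirichlet(np.ones(T))                    # P(t)
P_k_given_t = rng.dirichlet(np.ones(K), size=T).T  # P(k|t), shape (K, T)
P_f_given_k = rng.dirichlet(np.ones(F), size=K).T  # P(f|k), shape (F, K)

X = np.zeros((F, T))
for _ in range(n_draws):
    t = rng.choice(T, p=P_t)                # pick a time
    k = rng.choice(K, p=P_k_given_t[:, t])  # pick a base from that time
    f = rng.choice(F, p=P_f_given_k[:, k])  # pick a frequency of that base
    X[f, t] += 1                            # increment the chosen (f, t) bin

# As n_draws grows, X/n_draws approaches P(f,t) = P(t) * sum_k P(k|t) P(f|k).
```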

11 Probabilistic Latent Semantic Analysis (PLSA)
 Relationship to NMF:
○ P(t) is the (normalized) sum of all magnitudes at time t
○ P(k|t) is similar to the weight matrix W_{k,t}
○ P(f|k) is similar to the base matrix B_{f,k}
 NMF: $X_{f,t} \approx \sum_k B_{f,k} W_{k,t}$; PLSA: $P(f,t) = P(t) \sum_k P(k|t) P(f|k)$
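The correspondence can be made concrete by renormalizing an NMF factorization into PLSA probabilities; a sketch (the normalization convention here is one common choice):

```python
import numpy as np

def nmf_to_plsa(B, W, eps=1e-12):
    """Convert an NMF factorization X ~= BW into PLSA probabilities, so that
    P(f,t) = P(t) * sum_k P(f|k) P(k|t) equals BW / sum(BW)."""
    scale = B.sum(axis=0) + eps          # per-base scale (column sums of B)
    P_f_given_k = B / scale              # columns sum to 1
    W_scaled = W * scale[:, None]        # absorb the base scales into the weights
    col = W_scaled.sum(axis=0) + eps
    P_k_given_t = W_scaled / col         # columns sum to 1
    P_t = col / col.sum()                # P(t): share of total magnitude at time t
    return P_f_given_k, P_k_given_t, P_t
```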

12 Probabilistic Latent Semantic Analysis
 Advantage of PLSA over NMF: extensibility
○ A tremendous amount of applicable literature on generative models:
○ Entropic priors [2]
○ HMMs with state-dependent dictionaries [6]
[2] M. Shashanka, B. Raj, and P. Smaragdis, "Probabilistic Latent Variable Models as Nonnegative Factorizations," Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9.
[6] G.J. Mysore, "A Non-Negative Framework for Joint Modeling of Spectral Structures and Temporal Dynamics in Sound Mixtures," PhD Thesis, Stanford University, 2010.

13 … but superposition?
[Figure: spectrograms of two original sources (#1 and #2), their mixture, a proper separation, and the NMF separation; the NMF result differs noticeably from the proper separation where the sources overlap.]

14 CMF Cost Function: Frobenius Norm with Sparsity
$$\min \; \tfrac{1}{2} \sum_{f,t} \Big| X_{f,t} - \sum_k B_{f,k} W_{k,t}\, e^{j\phi_{f,k,t}} \Big|^2 + \lambda \| W \|_1$$
where $X_{f,t}$ is the observed complex spectrogram, $B_{f,k}$ and $W_{k,t}$ are nonnegative bases and weights, and $\phi_{f,k,t}$ is a phase term for each element.
[3] H. Kameoka, N. Ono, K. Kashino, and S. Sagayama, "Complex NMF: A New Sparse Representation for Acoustic Signals," International Conference on Acoustics, Speech, and Signal Processing, 2009.
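A direct transcription of this objective into numpy, useful for checking candidate solutions (the update rules of [3] are not reproduced here):

```python
import numpy as np

def cmf_cost(X, B, W, phi, lam):
    """CMF objective: 0.5 * ||X - X_hat||^2 + lam * ||W||_1, where
    X_hat[f,t] = sum_k B[f,k] * W[k,t] * exp(j * phi[f,k,t])."""
    X_hat = np.einsum('fk,kt,fkt->ft', B, W, np.exp(1j * phi))
    return 0.5 * np.sum(np.abs(X - X_hat) ** 2) + lam * np.sum(np.abs(W))

# Toy usage with placeholder shapes.
F, K, T = 4, 2, 3
X = np.random.randn(F, T) + 1j * np.random.randn(F, T)
B, W = np.abs(np.random.rand(F, K)), np.abs(np.random.rand(K, T))
phi = np.random.uniform(-np.pi, np.pi, (F, K, T))
print(cmf_cost(X, B, W, phi, lam=0.1))
```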

15 Comparing NMF and CMF via ASR: Introduction
 Data
○ Boston University radio news corpus [7]
○ 150 utterances (72 minutes)
○ Two talkers synthetically mixed at 0 dB target/masker ratio
○ 1 minute each of clean speech used for training
 Recognizers
○ Sphinx-3 (CMU)
○ SRI
[7] M. Ostendorf, "The Boston University Radio Corpus," 1995.

16 Comparing NMF and CMF via ASR: Results
[Bar chart: word accuracy (%) for unprocessed, non-negative (NMF), and complex (CMF) separation; higher is better. Error bars mark the 95% confidence level.]

17 Comparing NMF and CMF via ASR: Conclusion
 Incorporating phase estimates into matrix factorization can improve source separation performance
 Complex matrix factorization is worth further research
[4] B. King and L. Atlas, "Single-Channel Source Separation Using Complex Matrix Factorization," IEEE Transactions on Audio, Speech, and Language Processing (submitted).
[5] B. King and L. Atlas, "Single-Channel Source Separation Using Simplified-Training Complex Matrix Factorization," International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, 2010.

18 … but overparameterization?
 CMF can yield a potentially infinite number of solutions, which is not a good thing!
 Example: estimating a single complex observation with 3 bases admits many phase configurations (#1, #2, #3) that reconstruct it exactly (a numeric demonstration follows)
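The non-uniqueness is easy to show numerically. In this toy version of the slide's example (an assumption: unit magnitudes and target 1+0j), three fixed-magnitude bases reconstruct the same complex observation for a continuum of phase choices:

```python
import numpy as np

# Solve e^{j p1} + e^{j p2} + e^{j p3} = x for many different p1:
# each choice yields an exact reconstruction of the same observation x.
x = 1.0 + 0.0j
for p1 in np.linspace(0.5, 2.5, 5):
    r = x - np.exp(1j * p1)           # remainder the other two bases must reach
    d, a = np.abs(r), np.angle(r)
    if d <= 2:                        # reachable by two unit vectors
        half = np.arccos(d / 2)       # split the remainder symmetrically
        p2, p3 = a + half, a - half
        recon = np.exp(1j * p1) + np.exp(1j * p2) + np.exp(1j * p3)
        print(f"p1={p1:.2f}: reconstruction error {abs(recon - x):.2e}")
# Every printed error is ~0: infinitely many (p1, p2, p3) fit the same x.
```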

19 Review of Current Methods
[Table: NMF, PLSA, CMF, and a yet-to-be-proposed method ("?") rated on four properties: superposition, additive, unique, and extensible. NMF and PLSA do not model superposition; CMF models superposition but is overparameterized (not unique) and difficult to extend.]

20 Proposed Solution: Complex Probabilistic Latent Semantic Analysis (CPLSA)
 Goal: incorporate phase observation and estimation into the current nonnegative PLSA framework
 Implicitly solves:
○ Extensibility
○ Superposition
 Proposal to solve:
○ Overparameterization

21 Proposed Solution: Outline
 Transform complex to nonnegative data
 3 CPLSA variants
 Phase constraints for STFT consistency
○ Unique solution

22 Transform Complex to Nonnegative Data
 Why is this important?
○ The observed data X_{f,t} is modeled as a probability mass function (PMF)
○ PMFs are nonnegative and real
○ So the observation needs to be nonnegative and real

23 Transform Complex to Nonnegative Data
 Starting point: Shashanka [8]
○ N real → N+1 nonnegative
○ Algorithm: N+1-length orthogonal vectors (A_{N+1,N}), affine transform (for nonnegativity), normalize
 My new, proposed method
○ N complex → 2N real
○ 2N real → 2N+1 nonnegative
[8] M. Shashanka, "Simplex Decompositions for Real-Valued Datasets," IEEE International Workshop on Machine Learning for Signal Processing, 2009, pp. 1-6.
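The slide only outlines the steps, so the following is one plausible reading rather than the published algorithm: stack real and imaginary parts, map through an orthonormal basis of the hyperplane orthogonal to the all-ones vector (so the result sums to zero), shift affinely to nonnegativity, and normalize. Under this reading the transform is invertible:

```python
import numpy as np
from scipy.linalg import null_space

def complex_to_nonnegative(x):
    """One plausible reading of the slide's outline (an assumption, not the
    published algorithm): N complex -> 2N real -> 2N+1 nonnegative, normalized."""
    y = np.concatenate([x.real, x.imag])   # N complex -> 2N real
    n = y.size
    A = null_space(np.ones((1, n + 1)))    # (n+1, n): orthonormal columns, each _|_ 1
    z = A @ y                              # sums to zero by construction
    c = np.max(np.abs(z)) + 1e-12          # affine shift for nonnegativity
    p = (z + c) / ((n + 1) * c)            # nonnegative, sums to 1 (a PMF)
    return p, A, c

def nonnegative_to_complex(p, A, c):
    """Invert the transform to recover the original complex vector."""
    n = A.shape[1]
    y = A.T @ (p * (n + 1) * c - c)
    return y[: n // 2] + 1j * y[n // 2 :]

x = np.array([1 + 2j, -0.5 + 0.3j, 2 - 1j])
p, A, c = complex_to_nonnegative(x)
print(np.allclose(nonnegative_to_complex(p, A, c), x))  # True
```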

24 Transform Complex to Nonnegative Data (continued)

25 3 Variants of CPLSA
 #1: Complex bases
○ Phase is associated with the bases
○ Not a good model for the STFT
 #2: Nonnegative bases + base-dependent phases
○ A good model for audio, but overparameterized

26 3 Variants of CPLSA
 #3: Nonnegative bases + source-dependent phases
○ Additive source model
○ A good model for audio
○ Fewer parameters
○ Simplifies to NMF in the single-source case
 Compare with CPLSA #2

27 Phase Constraints for STFT Consistency
 A spectrogram X is consistent when it is the STFT of some time-domain signal, i.e. STFT(ISTFT(X)) = X
 Incorporate STFT consistency [9] into the phase estimation step for the separated sources
 Unique solution!
[9] J. Le Roux, N. Ono, and S. Sagayama, "Explicit Consistency Constraints for STFT Spectrograms and Their Application to Phase Reconstruction," 2008.
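A hedged sketch of what the consistency condition means in practice, using scipy's STFT/ISTFT: a genuine STFT survives the round trip unchanged, while a spectrogram with arbitrary phases does not, and iterating the magnitude-preserving projection (Griffin-Lim style, one way to impose the constraint) moves it back toward the consistent set:

```python
import numpy as np
from scipy.signal import stft, istft

fs, nperseg = 16000, 512
x = np.random.randn(fs)                       # placeholder 1-second signal
_, _, X = stft(x, fs, nperseg=nperseg)

def roundtrip(Z):
    """STFT(ISTFT(Z)); equals Z exactly when Z is a consistent spectrogram."""
    _, z = istft(Z, fs, nperseg=nperseg)
    _, _, Z2 = stft(z, fs, nperseg=nperseg)
    return Z2[:, :Z.shape[1]]                 # guard against a framing off-by-one

print(np.linalg.norm(X - roundtrip(X)))       # ~0: a true STFT is consistent

# Arbitrary per-bin phases break consistency; iterating the magnitude-keeping
# projection drives the spectrogram back toward the consistent set.
Z = np.abs(X) * np.exp(1j * np.random.uniform(-np.pi, np.pi, size=X.shape))
print(np.linalg.norm(Z - roundtrip(Z)))       # large: inconsistent
for _ in range(50):
    Z = np.abs(X) * np.exp(1j * np.angle(roundtrip(Z)))
print(np.linalg.norm(Z - roundtrip(Z)))       # reduced after the iterations
```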

28 Summary of Proposed Theory
 Goal: incorporate phase observation and estimation into the current nonnegative PLSA framework (extensible, additive, unique)
 Theory:
○ Transform complex to nonnegative data
○ 3 CPLSA variants
○ Phase constraints for STFT consistency

29 Proposed Experiments
 Separating speech in structured, nonstationary noise
 Methods: CPLSA, PLSA, CMF
 Noise:
○ Babble noise
○ Automotive noise
 Measurements:
○ Objective perceptual
○ ASR

30 Objective Measurement Tests
 Goal: explore the parameter space
○ How parameters affect CPLSA performance
○ Find the best-performing parameters
○ Compare the performance of CPLSA with PLSA and CMF
 Data: TIMIT corpus [10]
 Measurements:
○ Blind Source Separation Evaluation Toolbox [11]
○ Perceptual Evaluation of Speech Quality (PESQ) [12]
[10] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, and N.L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus, NIST, 1993.
[11] E. Vincent, R. Gribonval, and C. Fevotte, "Performance Measurement in Blind Audio Source Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, 2006, pp. 1462-1469.
[12] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs," ICASSP, 2001, pp. 749-752.
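The cited toolbox [11] is distributed for MATLAB; as a tooling note (not part of the proposal), the same SDR/SIR/SAR measures are available in Python via the mir_eval package:

```python
import numpy as np
from mir_eval.separation import bss_eval_sources

# reference and estimated sources: arrays of shape (n_sources, n_samples).
reference = np.random.randn(2, 16000)                     # placeholder clean sources
estimated = reference + 0.1 * np.random.randn(2, 16000)   # placeholder estimates

sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
print(sdr, sir, sar)  # in dB, per source; higher is better
```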

31 Automatic Speech Recognition Tests
 Goal: test the robustness of the parameters
○ Use the best-performing parameters from the objective measurements
○ Compare the performance of CPLSA with PLSA and CMF
 Data: Wall Street Journal corpus [13]
 ASR system: Sphinx-3 (CMU)
[13] D.B. Paul and J.M. Baker, "The Design for the Wall Street Journal-Based CSR Corpus," Proceedings of the Workshop on Speech and Natural Language, Stroudsburg, PA, USA: Association for Computational Linguistics, 1992, pp. 357-362.

32 Examples

33 Subway Noise, NMF: 4.3 dB improvement
[Spectrograms, frequency (Hz) vs. time (s), before and after separation.]

34 Subway Noise, NMF: 4.2 dB improvement
[Spectrograms, frequency (Hz) vs. time (s), before and after separation.]

35 Fountain Noise Example #1
 Target speaker synthetically added at -3 dB SNR
 Speaker model trained on 60 seconds of clean speech

36 Fountain Noise Example #2
 No "clean speech" available for training of the target talker
○ A generic speaker model is used instead

37 Mixed Speech (0 dB, no reverb)

38 Mixed Speech (0 dB, reverb)

39 Thank you!


41 Why not encode phase into bases? Individual phase term
$$X = \begin{bmatrix} 1e^{j\pi/1} & 1e^{j\pi/5} \\ 2e^{j\pi/2} & 2e^{j\pi/6} \\ 3e^{j\pi/3} & 3e^{j\pi/7} \\ 4e^{j\pi/4} & 4e^{j\pi/8} \end{bmatrix} = \left( \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} \right) \circ \begin{bmatrix} e^{j\pi/1} & e^{j\pi/5} \\ e^{j\pi/2} & e^{j\pi/6} \\ e^{j\pi/3} & e^{j\pi/7} \\ e^{j\pi/4} & e^{j\pi/8} \end{bmatrix} = (BW) \circ e^{j\theta}$$
With an elementwise (individual) phase term $e^{j\theta}$, nonnegative $B$ and $W$ capture the magnitudes exactly.

42 Why not encode phase into bases? Complex B, W
$$X = \begin{bmatrix} 1e^{j\pi/1} & 1e^{j\pi/5} \\ 2e^{j\pi/2} & 2e^{j\pi/6} \\ 3e^{j\pi/3} & 3e^{j\pi/7} \\ 4e^{j\pi/4} & 4e^{j\pi/8} \end{bmatrix} \ne \begin{bmatrix} 1e^{j?} \\ 2e^{j?} \\ 3e^{j?} \\ 4e^{j?} \end{bmatrix} \begin{bmatrix} 1 & 1 \end{bmatrix} = BW$$
If the phase lives in a complex base, every time frame must share that base's phase; no choice of $e^{j?}$ can match the differing phases of the two columns.
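These two slides' point can be verified numerically: the magnitudes of the example matrix factor as rank one, but the complex matrix itself has rank two, so no single complex base (one phase per frequency) can represent it:

```python
import numpy as np

# The slide's example: magnitudes factor perfectly as rank one ([1,2,3,4]^T [1,1]),
# but each element carries its own phase pi/1 ... pi/8.
phases = np.pi / np.arange(1, 9).reshape(2, 4).T   # [[pi/1, pi/5], ..., [pi/4, pi/8]]
B = np.array([[1.0], [2.0], [3.0], [4.0]])         # nonnegative bases (4 x 1)
W = np.array([[1.0, 1.0]])                         # nonnegative weights (1 x 2)
X = (B @ W) * np.exp(1j * phases)                  # individual phase term: exact

print(np.linalg.matrix_rank(np.abs(X)))  # 1: the magnitudes fit a single base
print(np.linalg.matrix_rank(X))          # 2: no complex rank-one BW can fit X
```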

43 BSS Evaluation Measures

44 … but superposition?

