EE 516 Lecture 1 Geoffrey Zweig Microsoft Research 4/2/2009
Our Topics From JHU 2002 SuperSID Final Presentation – Reynolds et al. Introducing today!
Topic Coverage By Day Data Representations and Models (4/23) – Vector Quantization – Gaussian Mixtures – The EM Algorithm Speaker Identification (5/7) Language Identification (5/7) Hidden Markov Models (5/14) – Dynamic Programming Building a Speech Recognizer (5/14)
Language Identification – Why Do it? Multi-lingual society – Applications should be able to deal with anyone Businesses – Automated help systems – Reservations, account access, etc. – Travel Airport Kiosks Train stations Government – Funds research to identify languages – Runs evaluations in it
How Do You Do it? English Acoustic Model French Acoustic Model Tamil Acoustic Model … Output Likeliest Gaussian Mixture Models - 4/23
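A minimal sketch of the "one acoustic model per language, output the likeliest" idea on this slide, assuming each acoustic model is a diagonal-covariance Gaussian mixture over feature frames. The toy models, their parameters, and the function names are illustrative assumptions, not taken from the lecture; real models would be trained on speech features.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of one frame under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of frames; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)  # log-sum-exp over mixture components
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total

def identify_language(frames, models):
    """Score the frames under every language model and return the likeliest."""
    scores = {lang: gmm_log_likelihood(frames, gmm) for lang, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical 2-D, two-component models (made-up parameters for illustration).
TOY_MODELS = {
    "English": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "French":  [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])],
    "Tamil":   [(0.5, [-3.0, -3.0], [1.0, 1.0]), (0.5, [-4.0, -4.0], [1.0, 1.0])],
}

if __name__ == "__main__":
    best, scores = identify_language([[2.9, 3.1], [3.2, 2.8]], TOY_MODELS)
    print(best, scores)
```

Summing per-frame log-likelihoods treats frames as independent, which is the usual simplification when a GMM is used as the acoustic model.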
How Do You Do It? (2) After Zissman 1996 Simple HMMs – 5/14 Language Models – 4/30 “p ih n s” – probably English… “k r p s t” – probably Czech…
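A rough sketch of the language-model half of this slide: given a phone string from a recognizer, score it under a per-language phone bigram model and pick the likeliest language. The training strings below are tiny made-up examples chosen only to echo the slide's "p ih n s" and "k r p s t"; the add-one smoothing and all function names are assumptions for illustration.

```python
import math
from collections import Counter

def train_bigram(phone_strings, vocab_size):
    """Count phone bigrams; vocab_size is the number of distinct phones."""
    bigrams, contexts = Counter(), Counter()
    for s in phone_strings:
        seq = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    # Possible next symbols: every phone plus the end marker.
    return bigrams, contexts, vocab_size + 1

def log_prob(phones, model):
    """Add-one-smoothed bigram log-probability of a phone string."""
    bigrams, contexts, v = model
    seq = ["<s>"] + phones.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(seq, seq[1:]))

def classify(phones, models):
    """Return the language whose phone bigram model scores the string highest."""
    return max(models, key=lambda lang: log_prob(phones, models[lang]))

# Toy, invented training data (not real corpora).
TOY_DATA = {
    "English": ["p ih n s", "s ih t", "p ih t s", "t ih n"],
    "Czech": ["k r p s t", "s t r k", "p r s t", "k r k"],
}
VOCAB = {p for strings in TOY_DATA.values() for s in strings for p in s.split()}
MODELS = {lang: train_bigram(strings, len(VOCAB)) for lang, strings in TOY_DATA.items()}

if __name__ == "__main__":
    for test in ["p ih n s", "k r p s t", "t ih n s"]:
        print(test, "->", classify(test, MODELS))
```

The smoothing keeps unseen phone pairs from receiving zero probability, so a single unfamiliar bigram does not rule a language out.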
How Do You Do It (3) After Zissman 1996 Same methods multiple times Acero et al., Chapter 4 4/23
How Do You Do It? (4) And we will see several other ways, and combinations! Run a complete speech recognizer in each language After Zissman 1996
Gauging Progress – The NIST Evaluations National Institute of Standards and Technology Has sponsored benchmark tests in multiple language processing areas for over a decade – Topic Detection & Tracking – Content Extraction – Video Analysis – Speech Recognition – Language Identification – Speaker Identification – Machine Translation – Coordination with site funding by Defense Advanced Research Projects Agency (DARPA) Along with business interest, the driving force in advancing the State-of-the-Art
For Example, Progress in Speech Recognition
Language Identification - How Well Can It Be Done – Who Salutes?
Organization | Location
Beijing Naphoo Technology Company+ | China
Brno University of Technology | Czech Republic
Georgia Institute of Technology | USA
Groupe des Ecoles des Telecommunication, Ecole Nationale Superieure des Telecommunications | France
IBM | USA
IKERLAN Technological Research Center | Spain
Institut de Recherche en Informatique de Toulouse | France
Institute for Infocomm Research | Singapore
Institute of Acoustics, Chinese Academy of Sciences+ | China
Institut National de Recherche sur les Transports et Leur Securite | France
International Computer Science Institute (USA) | USA
Laboratoire d'Informatique pour la Mecanique et les Sciences de l'Ingenieur | France
MIT Lincoln Laboratory | USA
Nanyang Technological University | Singapore
Politecnico di Torino | Italy
Spescom Datavoice | South Africa
Telefonica I & D | Spain
TNO Human Factors | The Netherlands
Tsinghua University | China
Universidad Autónoma de Madrid | Spain
University of the Basque Country | Spain
University of Stellenbosch | South Africa
University of Science and Technology of China+ | China
From NIST 2007 LRE Website
How Well Can it Be Done – What Languages? From NIST 2007 LRE Website
How Well Can It Be Done? – Testing Conditions 26 languages and dialects Telephone speech Multiple duration conditions – 3, 10, 30 seconds Detection Error Tradeoff (DET) Curves used to measure performance
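To make the DET curve on this slide concrete, here is a small sketch that sweeps a decision threshold over detector scores and records the miss and false-alarm rates at each setting. The score lists in the usage example are invented. The sketch leaves out the normal-deviate axis warping conventionally used when a DET curve is plotted, and its equal-error-rate estimate is only a coarse nearest-point approximation.

```python
def det_points(target_scores, nontarget_scores):
    """Return (threshold, P_miss, P_false_alarm) for every observed score.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, p_miss, p_fa))
    return points

def approximate_eer(points):
    """Average miss/false-alarm rate at the point where the two are closest."""
    _, p_miss, p_fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (p_miss + p_fa) / 2.0

if __name__ == "__main__":
    # Hypothetical scores: higher means "more likely the target language".
    targets = [0.9, 0.8, 0.4]
    nontargets = [0.7, 0.3, 0.2, 0.1]
    pts = det_points(targets, nontargets)
    for t, miss, fa in pts:
        print(f"threshold={t:.1f}  miss={miss:.3f}  false_alarm={fa:.3f}")
    print("approximate EER:", approximate_eer(pts))
```

Raising the threshold trades false alarms for misses, which is the tradeoff the curve displays.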
How Well Can it Be Done – Some Numbers From NIST 2007 LRE Website
Language Identification Project Build a language ID system with the Call Friend Data set Implement several of the main techniques Set up a demo on your laptop that will recognize someone’s language
Flavors of Speaker Recognition From JHU 2002 SuperSID Final Presentation – Reynolds et al. Our Focus!
Speaker Recognition – Why Do It? Personal Applications – Voice-print passwords – Voicemail transcription – who left that message? Business Applications – Calling your bank Government – Is that Osama calling from Pakistan? – Prison call monitoring – Automated parolee calling – is he where you think?
How Do You Do It? The most basic approach: Gaussian Mixture Models - 4/23 More recently: Support vector machines operating on GMMs (!)
How Do You Do It? (2) Also use high-level information! From JHU 2002 SuperSID Final Presentation – Reynolds et al.
How Well Can It Be Done – Who Salutes? From NIST 2008 SRE Presentation, Martin & Greenberg
More Salutes From NIST 2008 SRE Presentation, Martin & Greenberg
From Europe From NIST 2008 SRE Presentation, Martin & Greenberg
More From Europe From NIST 2008 SRE Presentation, Martin & Greenberg
U.S. Entries From NIST 2008 SRE Presentation, Martin & Greenberg
How Well Can It Be Done – Testing Conditions Conditions for different amounts of data – 10 sec. – 3-5 minutes – 8 minutes – Separate channel and summed channel conditions English-speakers, non-English speakers, multilingual speakers
How Well Can It Be Done?
Speaker Verification Project Implement a Speaker-ID system – Template based – GMM based – SVM based – Vector space model Demonstrate it: – NIST data, e.g Evaluation – Your own voice – implement on laptop
Speech Recognition Project Implement an HMM based recognition system Use, e.g., Phonebook isolated word data set or Aurora digit set Write features with existing front-end Build your own HMM trainer/decoder Set it up on your laptop for online word recognition (?!)
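As a hedged illustration of what the decoder half of this project might look like, the sketch below runs Viterbi dynamic programming on a discrete-output HMM and recognizes an isolated word by picking the word model with the best path score. It assumes the features have already been reduced to discrete symbols (for example by vector quantization); the two toy word models and all names are made up, and a real system would use trained models and likely continuous densities.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi(observations, start, trans, emit):
    """Return (best_log_prob, best_state_path) for a discrete-output HMM."""
    n = len(start)
    score = [safe_log(start[s]) + safe_log(emit[s][observations[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev, pointers, score = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + safe_log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + safe_log(trans[best][s]) + safe_log(emit[s][obs]))
        back.append(pointers)
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

def recognize(observations, word_models):
    """Pick the word whose HMM gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi(observations, *word_models[w])[0])

# Hypothetical two-state left-to-right models over symbols {0, 1}.
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
WORD_MODELS = {
    "A": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.9, 0.1], [0.1, 0.9]]),
    "B": ([1.0, 0.0], LEFT_TO_RIGHT, [[0.1, 0.9], [0.9, 0.1]]),
}

if __name__ == "__main__":
    obs = [0, 0, 1, 1]
    print(viterbi(obs, *WORD_MODELS["A"]))
    print(recognize(obs, WORD_MODELS))
```

Working in the log domain avoids numerical underflow on long observation sequences.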
Highlights of Syllabus Required Texts: – Huang, Acero, Hon: Spoken Language Processing – Deng and O’Shaughnessy, Speech Processing – EE516 Reader, at Professional Copy ‘n Print, 4200 University Way Grading: – Projects: 50% – Final Exam: 30% – Homework: 20% Projects: – Small team or individual Teams are self-forming – Presentation times TBD – Read ahead & pick an area!!! Talk to relevant instructor – Suggest deciding no later than 4/30 Office Hours at end of class and by appointment Please sign in on list!