AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED SPEECH CORPUS Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira.


AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED SPEECH CORPUS Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira IST, Instituto Superior Técnico INESC, Instituto de Engenharia de Sistemas e Computadores

Summary Motivation System Architecture –Module 1: Grapheme-to-phone converter (G2P) –Module 2: Alternative transcriptions generator (ATG) –Module 3: Acoustic signal processor –Module 4: Phonetic decoder and aligner Training and Test Corpora Results –Transcription and alignment (Development phase) –Test corpus annotation (Evaluation phase) Conclusions and Future Work

Motivation Time-consuming, repetitive task (over 60 x real time) Large corpora processing No expert intervention –Non-existence of widely adopted standard procedures –Error prone –Inconsistencies among human annotators

System Architecture Orthographically transcribed speech corpus -> Grapheme-to-Phone Converter (Rules, Lexicon) -> Alternative Transcriptions Generator -> Phonetic Decoder/Aligner (fed by the Acoustic Signal Processor) -> Phonetically annotated speech corpus

- Module 1 - Grapheme-to-Phone Converter Modules of the Portuguese TTS system (DIXI) Text normalisation –Special symbols, numerals, abbreviations and acronyms Broad Phonetic Transcription –Careful pronunciation of each word –Set of 200 rules –Small exceptions dictionary (364 entries) –SAMPA phonetic alphabet
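The rule-plus-exceptions pipeline of Module 1 can be sketched as follows; the rules and the exceptions entry below are illustrative toys, not the actual DIXI rule set (real G2P rules for Portuguese are context-dependent, while these are simple left-to-right rewrites):

```python
# Hypothetical exceptions dictionary: words whose SAMPA transcription is
# looked up directly instead of being derived by rule.
EXCEPTIONS = {"oito": '"ojtu'}

# Ordered, context-free grapheme -> SAMPA rewrite rules (toy examples;
# longer graphemes are listed first so digraphs win over single letters).
RULES = [
    ("nh", "J"),
    ("lh", "L"),
    ("ch", "S"),
    ("a", "6"),
    ("e", "@"),
    ("o", "u"),
    ("m", "m"),
    ("s", "S"),
]

def g2p(word):
    """Broad phonetic transcription: exceptions dictionary first,
    then greedy left-to-right application of the rewrite rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            phones.append(word[i])  # pass unknown graphemes through
            i += 1
    return "".join(phones)
```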

- Module 2 - Alternative Transcriptions Generator Transformation of phone sequences into lattices Based on optional rules: –Accounting for: »Sandhi »Vowel reduction –Specified using finite-state grammars and simple transduction operators: A (B C) D

Examples: Vowel reduction: oito ["ojtu] -> ["ojt] Alternative pronunciations: viagens [vj"aZ6~j~S] -> [vj"aZe~S]

Example (rule application): Phrase: vou para a praia. Canonical P.T.: [v"o p6r6 6 pr"aj6] Narrow P.T. (most freq.): [v"o pr"a pr"aj6] = sandhi + vowel reduction Rules: DEF_RULE 6a, ( (6 NULL) (sil NULL) (6 a) ) DEF_RULE pra, ( p ("6 NULL) r 6 ) (lattice diagram omitted)
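A minimal sketch of Module 2's expansion step: optional rewrite rules turn one canonical phone string into the set of alternative pronunciations (string rewriting here stands in for the finite-state lattice; the single rule used in the test is a rendering of the slide's oito vowel-reduction example, not an actual DEF_RULE):

```python
def apply_optional_rules(seq, rules):
    """Expand a phone string into the set of alternative pronunciations
    obtained by optionally applying each rewrite rule at every position.
    Assumes rules are non-cyclic (no rule's output re-creates its input),
    so the expansion terminates."""
    alts = {seq}
    frontier = [seq]
    while frontier:
        s = frontier.pop()
        for lhs, rhs in rules:
            start = 0
            while True:
                idx = s.find(lhs, start)
                if idx == -1:
                    break
                alt = s[:idx] + rhs + s[idx + len(lhs):]
                if alt not in alts:
                    alts.add(alt)       # keep both the original and the
                    frontier.append(alt)  # rewritten form, as in a lattice
                start = idx + 1
    return alts
```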

- Module 3 - Acoustic Signal Processor Extraction of acoustic signal characteristics Sampling: 16 kHz, 16 bits Parameterisation: MFCC (Mel-Frequency Cepstral Coefficients) –Decoding: 14 coefficients, energy, 1st and 2nd order differences, 25 ms Hamming windows, updated every 10 ms –Alignment: 14 coefficients, energy, 1st and 2nd order differences, 16 ms Hamming windows, updated every 5 ms
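The windowing arithmetic implied by these settings can be checked with a short sketch; it only computes frame boundaries (the actual MFCC extraction is done by the toolkit):

```python
def frame_indices(n_samples, sr=16000, win_ms=25, shift_ms=10):
    """Start/end sample indices of the analysis frames for the decoding
    configuration above: 25 ms windows every 10 ms at 16 kHz, i.e.
    400-sample windows shifted by 160 samples."""
    win = int(sr * win_ms / 1000)      # 400 samples per window
    shift = int(sr * shift_ms / 1000)  # 160 samples per shift
    frames = []
    start = 0
    while start + win <= n_samples:    # keep only complete windows
        frames.append((start, start + win))
        start += shift
    return frames
```

For the alignment configuration one would pass win_ms=16, shift_ms=5, which explains its finer time resolution for boundary placement.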

- Module 4 - Phonetic Decoder and Aligner Selection of the phonetic transcription closest to the utterance Viterbi algorithm 2 x 60 HMM models –Architecture »left-to-right »3-state »3-mixture NOTE: modules 3 and 4 use the Hidden Markov Model Toolkit (HTK, Entropic Research Labs)
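A generic Viterbi decoder over log-probabilities, as a sketch of the selection step (this is the textbook dynamic program, not HTK's token-passing implementation; a left-to-right topology is obtained by forbidding backward transitions in the transition matrix):

```python
import math  # used by callers to build log-probability inputs

def viterbi(obs_logprobs, trans, init):
    """Most likely state sequence for a sequence of frames.
    obs_logprobs[t][s]: log-likelihood of frame t under state s.
    trans[s][s2]: log transition probability s -> s2.
    init[s]: log initial probability of state s."""
    T, S = len(obs_logprobs), len(init)
    delta = [[0.0] * S for _ in range(T)]  # best log-score ending in s at t
    back = [[0] * S for _ in range(T)]     # backpointers for traceback
    for s in range(S):
        delta[0][s] = init[s] + obs_logprobs[0][s]
    for t in range(1, T):
        for s in range(S):
            best = max(range(S), key=lambda p: delta[t - 1][p] + trans[p][s])
            back[t][s] = best
            delta[t][s] = delta[t - 1][best] + trans[best][s] + obs_logprobs[t][s]
    # Trace back from the best final state.
    path = [max(range(S), key=lambda s: delta[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```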

Training and Test Corpora Subset of the EUROM 1 multilingual corpus –European Portuguese –Collected in an anechoic room, 16 kHz, 16 bits –5 male + 5 female speakers (few talkers) –Prompt texts »Passages: paragraphs of 5 related sentences, either free translations of the English version of EUROM 1 or adapted from books and newspaper text »Filler sentences: 50 sentences grouped in blocks of 5 sentences each, built to increase the number of different diphones in the corpus –Manually annotated

Training and Test Corpora (cont.) Corpora: Training Corpus, Test Corpus 1, Test Corpus 2 Passages: O0-O9, P0-P9 (English translations); Q0-Q9, R0-R9 (books and newspaper text) Filler sentences: F0-F9

Transcription and alignment results Transcription: –Precision = ((correct - inserted) / total) x 100% Alignment: –% of cases in which the absolute error is < 10 ms –absolute error bound that includes 90% of the cases
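Both evaluation measures can be written down directly; a short sketch (the 90% bound is read off the sorted absolute errors, which is one reasonable reading of the slide's definition):

```python
import math

def transcription_precision(correct, inserted, total):
    """Precision = ((correct - inserted) / total) x 100%, as defined above."""
    return (correct - inserted) / total * 100.0

def alignment_stats(errors_ms):
    """From a list of boundary errors in ms, return
    (% of cases with |error| < 10 ms, absolute error bound covering 90%)."""
    errs = sorted(abs(e) for e in errors_ms)
    pct_under_10 = 100.0 * sum(e < 10 for e in errs) / len(errs)
    idx = max(0, math.ceil(0.9 * len(errs)) - 1)  # 90th-percentile index
    return pct_under_10, errs[idx]
```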

Annotation strategies and Results Strategy / Transcription / Alignment: Strategy 1: - / HMM alignment Strategy 2: HMM recognition / - Strategy 3: HMM recognition / HMM alignment NOTE: Alignment evaluated only in places where the decoded sequence matched the manual sequence

Annotation results - Transcription - Precision (Rules / Test 1 / Test 2): Canonical: 74% / 76.9% Sandhi: 77.1% / 79.4% Vowel reduction and alternative pronunciation: 85.1% / 84.5% Comments –Better precision achieved for canonical transcriptions of Test 2 –Highest global precision achieved in Test 1 –Successive application of the rules leads to better precision

Annotation results - Alignment - Comments –Better alignment obtained with the best decoder –Some problematic transitions: vowels, nasal vowels and liquids

Conclusions Better annotation results with: –Alternative transcriptions (compared to canonical) –Use of different models for alignment and recognition About 84% precision in transcription and 22 ms maximum alignment error for 90% of the cases

Future Work Automatic rule inference –1st phase: comparison and selection of rules –2nd phase: validation or phonetic-linguistic interpretation Annotation of other speech corpora to build better acoustic models Assignment of probabilistic information to the alternative pronunciations generated by rule

TOPIC ANNOTATION IN BROADCAST NEWS Rui Amaral, Isabel Trancoso IST, Instituto Superior Técnico INESC, Instituto de Engenharia de Sistemas e Computadores

Preliminary work System Architecture –Two-stage unsupervised clustering algorithm »nearest-neighbour search method »Kullback-Leibler distance measure –Topic language models »smoothed unigram statistics –Topic Decoder »based on Hidden Markov Models (HMM) NOTE: topic models created with the CMU-Cambridge Statistical Language Modelling Toolkit
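A sketch of the topic-model comparison: smoothed unigram models scored with the Kullback-Leibler distance (add-alpha smoothing is an illustrative choice here; the toolkit's actual smoothing method may differ):

```python
import math

def smooth_unigram(counts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model: every vocabulary word gets a
    nonzero probability, which the KL distance below requires."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def kl_divergence(p, q, vocab):
    """Kullback-Leibler distance D(p||q) between two unigram models
    given as word -> probability dicts over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```

Note that D(p||q) is asymmetric; clustering implementations often use a symmetrised variant such as D(p||q) + D(q||p).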

System Architecture

Training and Test Corpora Subset of the BD_PUBLICO newspaper text corpus –20000 stories –6-month period (September 1995 - February 1996) –topic annotated –size between 100 and 2000 words –normalised text