Advances in Phonetics-based Sub-Unit Modeling for Transcription, Alignment and Sign Language Recognition

Vassilis Pitsikalis 1, Stavros Theodorakis 1, Christian Vogler 2 and Petros Maragos 1
1 School of Electrical and Computer Engineering, National Technical University of Athens
2 Institute for Language and Speech Processing / Gallaudet University

Workshop on Gesture Recognition, June 20, 2011

Overview
1. Gestures, signs, and goals
2. Sign language data and visual processing
3. Data-driven sub-units without phonetic evidence for recognition
4. Phonetic modeling: what is it?; annotations vs. phonetics; conversion of annotations to a structured phonetic description; training and alignment
5. Recognition experiments
6. Conclusions

1. Gestures versus Signs

Gestures:
- Isolated hand, body, and facial movements
- Can be broken down into primitives (but rarely are in gesture recognition work)
- Few constraints, other than convention

Signs:
- Hand, body, and facial movements, both in isolation and as part of sentences
- Can be broken down into primitives (cheremes/phonemes/phones)
- Numerous phonetic, morphological, and syntactic constraints

SL Recognition vs. Gesture Recognition

Continuous SL recognition is invariably more complex than gesture recognition, but:
- Isolated sign recognition (i.e. the forms found in a dictionary) is essentially the same task as gesture recognition
- Methods that work well on isolated sign recognition should work well on gesture recognition
- We can exploit 30+ years of research into the structure of signs

Subunit Modeling

Two fundamentally different ways to break signs down into parts:
- Data-driven
- Phonetics-based (i.e. linguistic)

Similar benefits:
- Scalability
- Robustness
- Reduced training-data requirements

Goals of this Presentation
- Work with a large vocabulary (1,000 signs)
- Compare data-driven and phonetic breakdowns of signs into subunits
- Advance the state of the field in phonetic breakdown of signs

2. Sign Language Data

Corpus of 1,000 Greek Sign Language lemmata:
- 5 repetitions per sign
- Signer-dependent, 2 signers (only 1 used for this paper)
- HD video, 25 fps, interlaced

Tracking and feature extraction:
- Pre-processing, configuration, statistics, skin-color training

Interlaced Data and Pre-processing

[Figure: interlaced vs. de-interlaced frames and refined skin-color masks; a 2nd version of the data keeps full resolution and frame rate]

Tracking

[Video: tracking results on the GSL lemmata corpus]

3. Data-Driven Subunit Modeling
- Extremely popular in SL recognition lately, with good results
- Different approaches exist; ours is based on distinguishing between dynamic and static subunits

Dynamic-Static SU Recognition
- Segmentation into dynamic (movement) and static (position) segments: intuitive, yields segments + labels
- Separate modeling of SUs: clustering w.r.t. feature type (e.g. static vs. dynamic features), parameters (e.g. model order), and architecture (HMM, GMM); feature normalization
- Training with a data-driven lexicon
- Recognition of SUs and signs
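To make the pipeline concrete, here is a minimal sketch of the dynamic-static split, assuming 2D wrist trajectories from the tracker. The speed threshold, the feature choices, and the use of k-means in place of the HMM/GMM-based clustering of the actual system are illustrative assumptions; all function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_dynamic_static(traj, speed_thresh=0.5):
    """Label each frame of a hand trajectory as dynamic (moving) or static.

    traj: (T, 2) array of wrist positions per frame.
    speed_thresh: speed (units/frame) above which a frame counts as dynamic.
    Returns a list of (start_frame, end_frame, is_dynamic) segments.
    """
    vel = np.gradient(traj, axis=0)            # per-frame velocity estimate
    speed = np.linalg.norm(vel, axis=1)
    is_dynamic = speed > speed_thresh
    segments, start = [], 0
    for t in range(1, len(traj) + 1):          # group runs of equal labels
        if t == len(traj) or is_dynamic[t] != is_dynamic[start]:
            segments.append((start, t, bool(is_dynamic[start])))
            start = t
    return segments

def cluster_subunits(segment_feats, n_dynamic=20, n_static=10):
    """Cluster dynamic and static segments separately into SU labels.

    segment_feats: list of (feature_vector, is_dynamic) pairs, one per segment.
    """
    dyn = np.array([f for f, d in segment_feats if d])
    sta = np.array([f for f, d in segment_feats if not d])
    dyn_labels = KMeans(n_clusters=n_dynamic, n_init=10).fit_predict(dyn)
    sta_labels = KMeans(n_clusters=n_static, n_init=10).fit_predict(sta)
    return dyn_labels, sta_labels
```

Clustering the two segment types separately is the point of the dynamic-static design: movement segments are described by dynamic features (e.g. direction), position segments by static ones, so a single shared codebook would mix incompatible feature spaces.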

Dynamic-Static SU Extraction

V. Pitsikalis, S. Theodorakis, and P. Maragos, "Data-Driven Sub-Units and Modeling Structure for Continuous Sign Language Recognition with Multiple Cues", LREC, 2010.

Dynamic-Static SU Extraction

[Figure: examples of dynamic clusters and static clusters]

4. Phonetic Modeling
- Based on modeling signs linguistically
- Little recent work

Phonetics

Phonetics: the study of the sounds that constitute a word. Equivalently, for sign languages: the study of the elements that constitute a sign (i.e. its "pronunciation").

The Role of Phonetics

Words consist of smaller parts, e.g.:
- cat → /k/ /æ/ /t/

So do signs, e.g.:
- CHAIR → (handshape, orientation, location, movement)

The parts are well known after 30+ years of research; less clear is a good structured model. Gestures can borrow from the sign inventory.

The Role of Phonetics in Recognition

The most successful speech recognition systems model words in terms of their constituent phones/phonemes, not in terms of data-driven subunits:
- New words can be added to the dictionary easily
- Linguistic knowledge adds robustness

Why don't sign language recognition systems do this? Phonetics is complex, and phonetic annotations and lexica are expensive to create.

Annotation vs. Phonetic Structure

There is a difference between an annotation (a written-down form) of a word and its phonetic structure, which is what recognition requires:
- Annotations cannot be applied directly to recognition, although an expert can infer the full pronunciation and structure from an annotation
- Annotating signed languages is much less time-consuming than writing out the full phonetic structure

Annotation of a Sign

Basic HamNoSys annotation of CHAIR:

[Figure: HamNoSys symbol string for CHAIR, annotated with its parts: symmetry, handshape, orientation, location, movement, repetition]

Phonetic Structure of Signs

Postures, Detentions, Transitions, Steady Shifts (PDTS):
- Improved over the 1989 Movement-Hold model
- Alternating postures and transitions

CHAIR:

Posture  | Trans         | Posture | Trans   | Posture  | Trans         | Det
shoulder | straight down | chest   | back up | shoulder | straight down | chest
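As a data structure, a PDTS sequence is just an alternating list of typed segments. A minimal sketch, with hypothetical class and field names, encoding the CHAIR table above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    kind: str   # "P" posture, "D" detention, "T" transition, "S" steady shift
    spec: str   # symbolic content, e.g. a location or a movement description

# CHAIR as the alternating PDTS sequence from the table above
CHAIR: List[Segment] = [
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("P", "chest"),
    Segment("T", "back up"),
    Segment("P", "shoulder"),
    Segment("T", "straight down"),
    Segment("D", "chest"),
]
```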

How Expensive is Phonetic Modeling?

Basic HamNoSys annotation of CHAIR: just 8 symbols.

The same sign with full phonetic structure takes over 70 characters:
- (starting posture)
- (downward transition & ending posture)
- (transition back & starting posture, due to the repetition)
- (downward transition & ending posture)

[The HamNoSys symbol strings themselves were rendered as images on the original slide.]

Automatic Extraction of Phonetic Structure

First contribution: automatically extract the phonetic structure of a sign from HamNoSys.
- Combines the convenience of annotations with the detail required for recognition
- Recovers segmentation, postures, transitions, and the relative timing of the hands
- Based on symbolic analysis of movements, symmetries, etc.
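The actual conversion performs a full symbolic analysis of the HamNoSys string; as a toy illustration of one piece of it, here is a sketch that unrolls a single movement with a repetition marker into alternating postures and transitions ending in a detention. The function and its arguments are hypothetical, with values chosen to reproduce the CHAIR example.

```python
def annotation_to_pdts(start_loc, movement, end_loc, repeated=False):
    """Unroll one parsed HamNoSys-style movement into a PDTS sequence.

    A movement symbol implies a transition between two postures; a
    repetition symbol implies a return transition and a second execution
    of the movement. The final, held posture becomes a detention.
    """
    seq = [("P", start_loc), ("T", movement)]
    if repeated:
        seq += [("P", end_loc), ("T", "back to " + start_loc),
                ("P", start_loc), ("T", movement)]
    seq.append(("D", end_loc))
    return seq

# CHAIR: starts at the shoulder, moves straight down to the chest, repeated once
print(annotation_to_pdts("shoulder", "straight down", "chest", repeated=True))
# [('P', 'shoulder'), ('T', 'straight down'), ('P', 'chest'),
#  ('T', 'back to shoulder'), ('P', 'shoulder'), ('T', 'straight down'),
#  ('D', 'chest')]
```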

Training and Alignment of Phonetic SUs

Second contribution: train classifiers based on the phonetic structure, and align them with the data to recover frame boundaries.
- Frame boundaries are not needed for recognition, but can be used for further data-driven analysis

Classifiers based on HMMs; why?
- Proven track record for this type of task
- No explicit segmentation required: just concatenate SUs and use Baum-Welch training
- Trivial to scale the lexicon up to 1000s of signs
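A pure-NumPy sketch of the two ideas on this slide: chaining per-subunit left-to-right HMMs into one sign model, and Viterbi alignment to recover the state (and hence subunit) occupied at each frame. Baum-Welch re-estimation is omitted for brevity; in practice a standard HMM toolkit would handle both. The state counts, the diagonal-Gaussian emission model, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lr_transitions(n_states, stay=0.6):
    """Left-to-right transition matrix: each state stays or advances by one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i], A[i, i + 1] = stay, 1.0 - stay
    A[-1, -1] = 1.0
    return A

def concat_subunit_hmms(subunit_models):
    """Chain per-subunit models (means, variances per state) into one sign HMM."""
    means = np.vstack([m for m, _ in subunit_models])
    varis = np.vstack([v for _, v in subunit_models])
    return lr_transitions(len(means)), means, varis

def log_gauss(x, means, varis):
    """Diagonal-Gaussian log-likelihood of one frame under every state."""
    return -0.5 * np.sum(np.log(2 * np.pi * varis) + (x - means) ** 2 / varis,
                         axis=1)

def viterbi_align(frames, A, means, varis):
    """Best state path through the sign HMM -> frame-level segment boundaries."""
    T, N = len(frames), len(means)
    logA = np.log(A + 1e-300)
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0, 0] = log_gauss(frames[0], means, varis)[0]   # must start in state 0
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA             # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_gauss(frames[t], means, varis)
    path = [N - 1]                                        # must end in last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]                                     # state index per frame
```

Because each state belongs to exactly one subunit in the concatenation, reading off where the state path crosses a subunit boundary yields the frame boundaries mentioned on the slide.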

Phonetic Models to HMM

[Figure: the PDTS sequence for CHAIR (Posture: shoulder, Trans: straight down, Posture: chest, Trans: back up, Posture: shoulder, Trans: straight down, Det: chest) mapped onto a chain of HMMs, one per segment]

Phonetic Subunit Training and Alignment

[Figure: alignment results; transition/epenthesis segments (E, T) are visualized as superimposed initial and end frames with an arrow, posture/detention segments (P) as a single frame]

Phonetic Sub-Units

[Figure: example transition/epenthesis sub-units]

Phonetic Sub-Units

[Figure: example posture sub-units]

5. Recognition Based on Both Data-Driven SUs and Phonetic Transcriptions

1. Data-driven subunits (based on dynamic-static): visual front-end → segmentation → sub-units + training → decoding → recognized signs with data-driven SUs

2. Data + phonetic subunits: visual front-end → PDTS system → PDTS sequence → segmentation + alignment → sub-units + training → decoding → recognized signs, plus PDTS sequence and alignment
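Both pipelines end in decoding over a lexicon of concatenated sub-unit HMMs. For the isolated-sign setting, a minimal sketch of that final stage, reusing `log_gauss` from the alignment sketch above; the forward algorithm here stands in for the full decoder a toolkit would provide, and the function names are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def decode_sign(frames, lexicon):
    """Isolated-sign decoding: return the gloss whose HMM assigns the frame
    sequence the highest forward log-likelihood.

    lexicon: {gloss: (A, means, varis)}, one concatenated sub-unit HMM per
    sign, built as in the alignment sketch above.
    """
    best_gloss, best_score = None, -np.inf
    for gloss, (A, means, varis) in lexicon.items():
        logA = np.log(A + 1e-300)
        alpha = np.full(len(means), -np.inf)
        alpha[0] = log_gauss(frames[0], means, varis)[0]  # start in first state
        for x in frames[1:]:                              # forward recursion
            alpha = logsumexp(alpha[:, None] + logA, axis=0) \
                    + log_gauss(x, means, varis)
        if alpha[-1] > best_score:                        # end in last state
            best_gloss, best_score = gloss, alpha[-1]
    return best_gloss, best_score
```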

Data-Driven vs. Phonetic Subunit Recognition

[Plots: recognition accuracy (a) varying the number of dynamic SUs and the method, with static SUs = 300 and 961 signs; (b) varying the number of signs and the method]

6. Conclusions and the Big Picture

- Rich SL corpus annotations are rare (in contrast to speech)
- Human annotations of sign language (HamNoSys) are expensive, subjective, and contain errors and inconsistencies
- HamNoSys contains no time structure
- Data-driven approaches are efficient but construct abstract subunits

This work:
- Converts HamNoSys to PDTS, gaining time structure and sequentiality
- Constructs meaningful phonetics-based SUs

Further exploitation of PDTS + phonetic SUs:
- Correct human annotations automatically
- Valuable for SU-based SL recognition, continuous SLR, adaptation, and the integration of multiple streams

Thank You!

Workshop on Gesture Recognition, June 20, 2011

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n°. Theodore Goulas contributed the HamNoSys annotations for GSL.