Inducing Structure for Perception. Slav Petrov. Advisors: Dan Klein, Jitendra Malik. Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg. (a.k.a. Slav's split & merge hammer)

The Main Idea. (Figure: a complex underlying process generates an observation, "He was right."; candidate structures: the true structure, a manually specified structure, the MLE structure.)

The Main Idea. (Figure: the same setup, but the manually specified structure is automatically refined with EM.)

Why Structure? (Figure: the same sentence scrambled into an unordered bag of words, "the the the food cat dog ate and", and into an unordered bag of letters.)

Structure is important: The dog ate the cat and the food. The dog and the cat ate the food. The cat ate the food and the dog.

Syntactic Ambiguity Last night I shot an elephant in my pajamas.

Visual Ambiguity Old or young?

Three Peaks? Machine Learning Computer Vision Natural Language Processing

No, One Mountain! Machine Learning Computer Vision Natural Language Processing

Three Domains: Speech, Scenes, Syntax

Timeline. (Figure: timeline from '07 through '08-'09 to now, across the three domains. Syntax: learning, inference, syntactic machine translation (summer at ISI), Bayesian and conditional learning. Speech: learning, decoding, synthesis. Scenes: learning, inference, TrecVid.)

Syntax (overview). Split & Merge Learning, Coarse-to-Fine Inference, Nonparametric Bayesian Learning, Generative vs. Conditional Learning, Syntactic Machine Translation, Language Modeling. (Other domains: Speech, Scenes.)

Learning Accurate, Compact, and Interpretable Tree Annotation. Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein.

Motivation (Syntax). Task: parse "He was right." Why? Information Extraction; Syntactic Machine Translation.

Treebank Parsing. From a treebank we read off a grammar: S → NP VP . (1.0), NP → PRP (0.5), NP → DT NN (0.5), ..., PRP → She (1.0), DT → the (1.0), ...

Non-Independence. Independence assumptions are often too strong. (Figure: expansion histograms for all NPs, NPs under S, and NPs under VP.)

The Game of Designing a Grammar  Annotation refines base treebank symbols to improve statistical fit of the grammar  Parent annotation [Johnson ’98]

The Game of Designing a Grammar  Annotation refines base treebank symbols to improve statistical fit of the grammar  Parent annotation [Johnson ’98]  Head lexicalization [Collins ’99, Charniak ’00]

The Game of Designing a Grammar  Annotation refines base treebank symbols to improve statistical fit of the grammar  Parent annotation [Johnson ’98]  Head lexicalization [Collins ’99, Charniak ’00]  Automatic clustering?

Learning Latent Annotations. EM algorithm (forward and backward passes over the tree): brackets are known, base categories are known, only the subcategories are induced. (Figure: a parse tree over "He was right." with latent subcategories X1 ... X7.) Just like Forward-Backward for HMMs.

Inside/Outside Scores. (Figure: for a rule A_x → B_y C_z, the inside score of A_x is built from the inside scores of B_y and C_z; the outside score of B_y combines the outside score of A_x with the inside score of C_z. The recursions are written out below.)
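The recursions behind these diagrams are standard; here is a reconstruction in span notation (not copied from the slide), where IN and OUT are the inside and outside scores of a subcategory over a span:

```latex
% Inside and outside recursions for a latent-variable PCFG (reconstruction).
% IN(X, i, j): inside score of subcategory X over span (i, j);
% OUT(X, i, j): outside score of X over span (i, j).
\begin{align*}
\mathrm{IN}(A_x, i, j)  &= \sum_{A_x \to B_y C_z} \sum_{k=i+1}^{j-1}
   P(A_x \to B_y C_z)\,\mathrm{IN}(B_y, i, k)\,\mathrm{IN}(C_z, k, j) \\
\mathrm{OUT}(B_y, i, k) &= \sum_{A_x \to B_y C_z} \sum_{j>k}
   P(A_x \to B_y C_z)\,\mathrm{OUT}(A_x, i, j)\,\mathrm{IN}(C_z, k, j) \\
  &\quad + \sum_{A_x \to C_z B_y} \sum_{h<i}
   P(A_x \to C_z B_y)\,\mathrm{OUT}(A_x, h, k)\,\mathrm{IN}(C_z, h, i)
\end{align*}
```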

Learning Latent Annotations (Details). E-Step: compute posterior counts of each annotated rule A_x → B_y C_z using inside and outside scores. M-Step: re-estimate rule probabilities from these expected counts (sketch below).
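A minimal sketch of these two steps in Python, assuming the inside and outside tables are already filled; the dictionary layout and function names are illustrative, not the Berkeley parser's actual code:

```python
from collections import defaultdict

def expected_rule_count(rule_prob, inside, outside, sent_prob,
                        A_x, B_y, C_z, i, k, j):
    """Posterior count of the anchored rule A_x -> B_y C_z over (i, k, j).

    inside[(X, i, j)]  : inside score of subcategory X over span (i, j)
    outside[(X, i, j)] : outside score of subcategory X over span (i, j)
    sent_prob          : inside score of the root over the whole sentence
    """
    return (outside[(A_x, i, j)] * rule_prob
            * inside[(B_y, i, k)] * inside[(C_z, k, j)]) / sent_prob

def m_step(rule_counts):
    """Relative-frequency re-estimation from the accumulated expected counts."""
    totals = defaultdict(float)
    for (lhs, rhs), c in rule_counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in rule_counts.items()}
```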

Overview. Limit of computational resources; approach: Hierarchical Training, Adaptive Splitting, Parameter Smoothing.

Refinement of the DT tag. (Figure: DT split directly into DT-1, DT-2, DT-3, DT-4.)

Refinement of the DT tag. (Figure.)

Hierarchical refinement of the DT tag. (Figure: DT is split in two repeatedly, forming a binary tree of subcategories; see the sketch below.)
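A sketch of one split round as described in the talk, under my own assumptions about data layout and noise level (roughly 1% jitter): each subcategory is split in two and the copies are perturbed slightly so that EM can pull them apart.

```python
import itertools
import random

def split_grammar(rules, noise=0.01, rng=random.Random(0)):
    """One split round: every symbol X becomes X_0 and X_1.

    `rules` maps (lhs, rhs_tuple) -> probability.  Each rule expands into one
    copy per combination of split children; the probability is divided by the
    number of right-hand-side combinations, and a small random jitter breaks
    the symmetry so EM can specialize the halves.  For simplicity this sketch
    treats every right-hand-side symbol as a splittable nonterminal.
    """
    new_rules = {}
    for (lhs, rhs), p in rules.items():
        for lhs_i in (0, 1):
            for rhs_bits in itertools.product((0, 1), repeat=len(rhs)):
                key = (f"{lhs}_{lhs_i}",
                       tuple(f"{s}_{b}" for s, b in zip(rhs, rhs_bits)))
                jitter = 1.0 + noise * (2.0 * rng.random() - 1.0)
                new_rules[key] = p / (2 ** len(rhs)) * jitter
    return new_rules
```

After splitting, the grammar is re-trained with EM before the next split (or merge) round.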

Hierarchical Estimation Results. Baseline: F1 87.3. Hierarchical Training: F1 88.4.

Refinement of the, tag  Splitting all categories the same amount is wasteful:

The DT tag revisited Oversplit?

Adaptive Splitting  Want to split complex categories more  Idea: split everything, roll back splits which were least useful

Adaptive Splitting. Evaluate the loss in likelihood from removing each split: loss = (data likelihood with the split reversed) / (data likelihood with the split). No loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting (Details). True data likelihood; approximate likelihood with the split at node n reversed; approximate loss in likelihood. (The formulas are reconstructed below.)
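The formulas on this slide were lost in extraction; the following reconstruction follows the published description of the merge criterion, where p_in and p_out are inside and outside scores at a node n whose category A was split into A_1 and A_2 with relative frequencies p_1 and p_2:

```latex
% Merge criterion (reconstruction; notation follows the corresponding paper).
\begin{align*}
P(w, T) &= \sum_{x} p_{\mathrm{out}}(n, A_x)\, p_{\mathrm{in}}(n, A_x) \\
P_n(w, T) &= \Big(\sum_{x \in \{1,2\}} p_{\mathrm{out}}(n, A_x)\Big)
             \Big(p_1\, p_{\mathrm{in}}(n, A_1) + p_2\, p_{\mathrm{in}}(n, A_2)\Big)
           + \sum_{x \notin \{1,2\}} p_{\mathrm{out}}(n, A_x)\, p_{\mathrm{in}}(n, A_x) \\
\Delta_{\text{split}} &\approx \prod_{n} \frac{P_n(w, T)}{P(w, T)}
\end{align*}
```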

Adaptive Splitting Results. Previous: F1 88.4. With 50% Merging: F1 89.5.

Number of Phrasal Subcategories

Number of Phrasal Subcategories. (Chart: highlighted categories NP, VP, and PP, which receive the most subcategories.)

Number of Phrasal Subcategories. (Chart: highlighted categories X and NAC, which receive few subcategories.)

Number of Lexical Subcategories. (Chart: highlighted tags TO, ",", and POS.)

Number of Lexical Subcategories. (Chart: highlighted tags IN, DT, RB, and VBx.)

Number of Lexical Subcategories. (Chart: highlighted tags NN, NNS, NNP, and JJ.)

Smoothing  Heavy splitting can lead to overfitting  Idea: Smoothing allows us to pool statistics

Linear Smoothing
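The formula itself is not in the transcript; a reconstruction of linear smoothing as interpolation of each subcategory's rule probability with the average over the n subcategories of the same base category (alpha is a small interpolation weight):

```latex
% Linear smoothing across the n subcategories A_1..A_n of a category A
% (reconstruction; alpha is a small interpolation weight).
\begin{align*}
\bar{p} &= \frac{1}{n} \sum_{x=1}^{n} P(A_x \to B_y C_z) \\
\hat{P}(A_x \to B_y C_z) &= (1 - \alpha)\, P(A_x \to B_y C_z) + \alpha\, \bar{p}
\end{align*}
```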

Result Overview. Previous: F1 89.5. With Smoothing: F1 90.7.

Linguistic Candy. Proper nouns (NNP): NNP-14: Oct., Nov., Sept.; NNP-12: John, Robert, James; NNP-2: J., E., L.; NNP-1: Bush, Noriega, Peters; NNP-15: New, San, Wall; NNP-3: York, Francisco, Street. Personal pronouns (PRP): PRP-0: It, He, I; PRP-1: it, he, they; PRP-2: it, them, him.

Linguistic Candy (continued). Relative adverbs (RBR): RBR-0: further, lower, higher; RBR-1: more, less, More; RBR-2: earlier, Earlier, later. Cardinal numbers (CD): CD-7: one, two, Three; CD-11: million, billion, trillion. (Remaining CD rows not recovered.)

Nonparametric PCFGs using Dirichlet Processes Percy Liang, Slav Petrov, Dan Klein and Michael Jordan

Improved Inference for Unlexicalized Parsing Slav Petrov and Dan Klein

1621 min (parsing with the fully refined grammar, no pruning)

Coarse-to-Fine Parsing [Goodman '97, Charniak & Johnson '05]. (Figure: a treebank yields a coarse grammar (NP, VP, ...); parse with it, prune, then parse with a refined grammar. The refined grammar can be lexicalized (NP-dog, NP-cat, NP-apple, VP-run, VP-eat, ...) or, here, subcategory-split (NP-17, NP-12, NP-1, VP-6, VP-31, ...).)

Prune? For each chart item X[i, j], compute its posterior probability under the coarse grammar; e.g. consider the span 5 to 12 (figure: coarse items ... QP, NP, VP ... and their refined versions). Items whose posterior falls below a threshold are pruned, along with their refinements (sketch below).
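A minimal sketch of this pruning test, assuming the coarse inside and outside tables are already computed; the threshold value and table layout are illustrative:

```python
def prune_chart(coarse_inside, coarse_outside, sent_prob,
                spans, coarse_symbols, threshold=1e-5):
    """Return the set of (coarse symbol, span) cells that survive pruning.

    A refined item X-k over [i, j] is only built later if its coarse
    projection X over [i, j] has posterior mass above the threshold.
    """
    allowed = set()
    for (i, j) in spans:
        for X in coarse_symbols:
            posterior = (coarse_outside.get((X, i, j), 0.0)
                         * coarse_inside.get((X, i, j), 0.0)) / sent_prob
            if posterior >= threshold:
                allowed.add((X, i, j))
    return allowed
```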

1621 min 111 min (no search error)

Hierarchical Pruning. Consider again the span 5 to 12: coarse (... QP, NP, VP ...), split in two (... QP1, QP2, NP1, NP2, VP1, VP2 ...), split in four (... QP1 ... QP4, NP1 ... NP4, VP1 ... VP4 ...), split in eight, and so on; each pass prunes items for the next.

Intermediate Grammars. X-Bar = G0, and learning produces the sequence G1, G2, G3, G4, G5, G6 = G. (Figure: DT is refined into DT1, DT2, then DT1 ... DT4, then DT1 ... DT8 along the way.)

1621 min 111 min 35 min (no search error)

State Drift (DT tag). (Figure: across EM iterations, the words most associated with each DT subcategory (some, this, That, these, the, that, This, ...) drift from one subcategory to another.)

Projected Grammars. X-Bar = G0 and learning produces G1 ... G6 = G; instead of keeping the intermediate grammars, apply projections π_i to the final grammar G to obtain π_0(G), π_1(G), ..., π_5(G).

Estimating Projected Grammars. Nonterminals? Easy: the nonterminals of G (S0, S1, NP0, NP1, VP0, VP1) map directly onto the nonterminals of π(G) (S, NP, VP) under the projection π.

Estimating Projected Grammars. Rules? The refined rules S1 → NP1 VP1, S1 → NP1 VP2, S1 → NP2 VP1, S1 → NP2 VP2, S2 → NP1 VP1, ..., S2 → NP2 VP2 all project to S → NP VP, but with what probability?

Estimating Projected Grammars [Corazza & Satta '06]. The probability of a projected rule (e.g. S → NP VP: 0.56) is estimated not from the treebank but from the infinite tree distribution defined by the refined grammar G and its rules S1 → NP1 VP1, ..., S2 → NP2 VP2.

Calculating Expectations. Nonterminals: c_k(X) is the expected count of nonterminal X in trees of depth up to k; it converges within 25 iterations (a few seconds). Rules: weight each refined rule by the expected count of its left-hand side and renormalize within each projected category (sketch below).
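A sketch of both computations, assuming the refined grammar is stored as a dict from (lhs, rhs) to probability and that `proj` maps every refined symbol to its coarse projection (all names here are mine):

```python
def expected_counts(rules, root, iters=25):
    """Expected nonterminal counts c(X) in the tree distribution defined by
    `rules` (maps (lhs, rhs_tuple) -> prob), rooted at `root`.

    c(X) = [X == root] + sum over rules A -> beta of c(A) * P(A -> beta) * (#X in beta),
    iterated up to depth `iters` (the slide reports convergence in ~25 iterations).
    """
    symbols = {lhs for lhs, _ in rules} | {s for _, rhs in rules for s in rhs}
    c = {X: 0.0 for X in symbols}
    for _ in range(iters):
        new_c = {X: (1.0 if X == root else 0.0) for X in symbols}
        for (lhs, rhs), p in rules.items():
            for s in rhs:
                new_c[s] += c[lhs] * p
        c = new_c
    return c

def project_rules(rules, counts, proj):
    """Projected rule probabilities: weight each refined rule by the expected
    count of its refined left-hand side, then renormalize within each projected
    left-hand side.  `proj` maps refined symbols (and words to themselves) to
    coarse symbols."""
    num, den = {}, {}
    for (lhs, rhs), p in rules.items():
        key = (proj[lhs], tuple(proj[s] for s in rhs))
        num[key] = num.get(key, 0.0) + counts[lhs] * p
        den[proj[lhs]] = den.get(proj[lhs], 0.0) + counts[lhs] * p
    return {k: v / den[k[0]] for k, v in num.items()}
```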

1621 min 111 min 35 min 15 min (no search error)

Parsing times. (Figure: the grammar hierarchy X-Bar = G0, G1, ..., G6 = G, with the share of total parsing time spent at the levels: 60%, 12%, 7%, 6%, 5%, 4%.)

Bracket Posteriors (after G0)

Bracket Posteriors (after G1)

Bracket Posteriors (Movie) (Final Chart)

Bracket Posteriors (Best Tree)

Parse Selection. Computing the most likely unsplit tree is NP-hard. Options: settle for the best derivation; rerank an n-best list; use an alternative objective function. (Figure: a single parse corresponds to many derivations.)
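One concrete choice of alternative objective (my example; the slide does not name one) is a max-rule-product criterion, scoring an unsplit tree by the product of the posteriors of its rules, each obtained by summing inside/outside scores over the rule's refinements:

```latex
% Max-rule-product parse selection (illustrative choice of objective);
% r ranges over the anchored coarse rules of tree T.
T^* = \arg\max_{T} \prod_{r \in T} P(r \mid w)
```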

Final Results (Efficiency). Berkeley Parser: 15 min, 91.2 F-score, implemented in Java. Charniak & Johnson '05 parser: 19 min, 90.7 F-score, implemented in C.

Final Results (Accuracy). (Table: F1 on sentences of ≤ 40 words and on all sentences, for English (Charniak & Johnson '05, generative, vs. this work), German (Dubey vs. this work), and Chinese (Chiang et al. vs. this work); table values not recovered.)

Conclusions (Syntax). Split & Merge Learning (hierarchical training, adaptive splitting, parameter smoothing); Hierarchical Coarse-to-Fine Inference (projections, marginalization); Multi-lingual Unlexicalized Parsing.

Generative vs. Discriminative. Conditional estimation (L-BFGS, iterative scaling); conditional structure; alternative merging criterion.

How much supervision?

Syntactic Machine Translation  Collaboration with ISI/USC:  Use parse trees  Use annotated parse trees  Learn split synchronous grammars

Speech (overview). Split & Merge Learning, Coarse-to-Fine Decoding, Combined Generative + Conditional Learning, Speech Synthesis. (Other domains: Scenes, Syntax.)

Learning Structured Models for Phone Recognition Slav Petrov, Adam Pauls, Dan Klein

Motivation (Speech). Words: "and you couldn't care less". Phones: ae n d y uh k uh d n t k ae r l eh s.

Traditional Models. (Figure: the word "dad" with begin-middle-end structure between Start and End; triphones #-d-a, d-a-d, a-d-#; triphones with decision-tree clustering, e.g. d17 = c(#-d-a), a1 = c(d-a-d), d9 = c(a-d-#); emissions are mixtures of Gaussians.)

Model Overview. (Figure: the traditional model next to our model.)

Differences to Grammars. (Figure: the grammar case vs. the phone-HMM case, side by side.)

Refinement of the ih-phone
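A sketch of how a subphone state might be split in the same split & merge spirit; the array shapes, the perturbation size, and the way transition mass is divided are my assumptions, not details from the talk:

```python
import numpy as np

def split_hmm_state(means, covs, trans, s, eps=0.01, rng=np.random.default_rng(0)):
    """Split subphone state s into two copies.

    means: (S, D) Gaussian means; covs: (S, D, D) covariances;
    trans: (S, S) row-stochastic transition matrix.  The new copy gets a
    slightly perturbed mean, the same covariance, a copy of s's outgoing
    transitions, and half of every transition that used to enter s.
    """
    S, D = means.shape
    jitter = eps * rng.standard_normal(D)
    means2 = np.vstack([means, means[s] + jitter])
    covs2 = np.concatenate([covs, covs[s:s + 1]], axis=0)

    trans2 = np.zeros((S + 1, S + 1))
    trans2[:S, :S] = trans
    trans2[S, :S] = trans[s, :]        # the copy inherits s's outgoing row
    trans2[:, S] = trans2[:, s] / 2.0  # incoming mass is shared evenly...
    trans2[:, s] = trans2[:, s] / 2.0  # ...between the two copies
    return means2, covs2, trans2
```

As in the grammar case, the split states would then be re-estimated with EM (forward-backward), and unhelpful splits merged back.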

Inference  Coarse-To-Fine  Variational Approximation

Phone Classification Results (method: error rate). GMM Baseline (Sha and Saul, 2006): 26.0%. HMM Baseline (Gunawardana et al., 2005): 25.1%. SVM (Clarkson and Moreno, 1999): 22.4%. Hidden CRF (Gunawardana et al., 2005): 21.7%. This Paper: 21.4%. Large Margin GMM (Sha and Saul, 2006): 21.1%.

Phone Recognition Results (method: error rate). State-Tied Triphone HMM (HTK) (Young and Woodland, 1994): 27.1%. Gender Dependent Triphone HMM (Lamel and Gauvain, 1993): 27.1%. This Paper: 26.1%. Bayesian Triphone HMM (Ming and Smith, 1998): 25.6%. Heterogeneous classifiers (Halberstadt and Glass, 1998): 24.4%.

Confusion Matrix

How much supervision? Hand-aligned: exact phone boundaries are known. Automatically aligned: only the sequence of phones is known.

Generative + Conditional Learning  Learn structure generatively  Estimate Gaussians conditionally  Collaboration with Fei Sha

Speech Synthesis  Acoustic phone model:  Generative  Accurate  Models phone internal structure well  Use it for speech synthesis!

Large Vocabulary ASR. ASR System = Acoustic Model + Decoder. Coarse-to-Fine Decoder over a hierarchy: subphone, phone, syllable, word, bigram, ...

Scenes (overview). Split & Merge Learning, Decoding. (Other domains: Syntax, Speech.)

Motivation (Scenes). (Figure: a seascape image with regions labeled Sky, Water, Grass, and Rock; the whole scene is labeled Seascape.)

Motivation (Scenes)

Learning  Oversegment the image  Extract vertical stripes  Extract features  Train HMMs

Inference  Decode stripes  Enforce horizontal consistency
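A minimal sketch of this decode-then-smooth pipeline; the per-row majority vote stands in for the horizontal-consistency step, which the slide does not specify, and all stripes are assumed to have the same number of segments:

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Standard Viterbi over one vertical stripe.
    log_emit: (T, K) per-segment class log-likelihoods (top to bottom),
    log_trans: (K, K) vertical transition log-probs, log_prior: (K,)."""
    T, K = log_emit.shape
    delta = log_prior + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (K, K)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def decode_scene(stripe_log_emits, log_trans, log_prior):
    """Decode each vertical stripe independently, then (crudely) enforce
    horizontal consistency with a per-row majority vote over a 3-stripe window."""
    labels = np.stack([viterbi(e, log_trans, log_prior) for e in stripe_log_emits])
    smoothed = labels.copy()
    for x in range(1, labels.shape[0] - 1):
        for y in range(labels.shape[1]):
            window = labels[x - 1:x + 2, y]
            smoothed[x, y] = np.bincount(window).argmax()
    return smoothed
```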

Alternative Approach  Conditional Random Fields  Pro:  Vertical and horizontal dependencies learnt  Inference more natural  Contra:  Computationally more expensive

Timeline (recap). (Figure: the same timeline as before, from '07 through '08-'09 to now. Syntax: learning, inference, syntactic machine translation (summer at ISI), Bayesian and conditional learning. Speech: learning, decoding, synthesis. Scenes: learning, inference, TrecVid.)

Results so far. State-of-the-art parser for different languages: automatically learnt, simple and compact, fast and accurate, available for download. Phone recognizer: automatically learnt, competitive performance, a good foundation for a speech recognizer.

Proposed Deliverables  Syntax Parser  Speech Recognizer  Speech Synthesizer  Syntactic Translation Machine  Scene Recognizer

Thank You!