Presentation transcript:

Factor Graphs, Structured Prediction
Yann LeCun, The Courant Institute of Mathematical Sciences, New York University

Energy-Based Model for Decision-Making
Model: measures the compatibility between an observed variable X and a variable to be predicted Y through an energy function E(Y,X).
Inference: search for the Y that minimizes the energy within a candidate set. If the set has low cardinality, we can use exhaustive search.
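In symbols (the symbol \mathcal{Y} for the candidate set is assumed notation, not taken from the slide):

    Y^* = \arg\min_{Y \in \mathcal{Y}} E(Y, X)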

Complex Tasks: Inference is Non-Trivial
When the cardinality or dimension of Y is large, exhaustive search is impractical.
We need to use a "smart" inference procedure: min-sum, Viterbi, ...

Decision-Making versus Probabilistic Modeling
Energies are uncalibrated (measured in arbitrary units), so the energies of two separately-trained systems cannot be combined.
How do we calibrate energies? We turn them into probabilities (positive numbers that sum to 1).
Simplest way: the Gibbs distribution, with a partition function and an inverse temperature beta. Other ways can be reduced to Gibbs by a suitable redefinition of the energy.
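Written out, the Gibbs distribution referred to above (beta is the inverse temperature; the denominator is the partition function):

    P(Y \mid X) = \frac{e^{-\beta E(Y, X)}}{\int_y e^{-\beta E(y, X)}}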

Perceptron Loss for Binary Classification
Energy, inference, loss, and learning rule for the two-class case; when G_W(X) is linear in W, this reduces to the classical perceptron algorithm.
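A sketch of the standard formulas behind these labels in the energy-based learning literature (assuming a trainable discriminant function G_W and labels Y in {-1, +1}; the exact notation on the slide may differ):

    \text{Energy:}\quad E(W, Y, X) = -\, Y\, G_W(X)
    \text{Inference:}\quad Y^* = \operatorname{sign}(G_W(X))
    \text{Loss:}\quad L_{\text{perceptron}}(W) = \frac{1}{P} \sum_{i=1}^{P} \big( \operatorname{sign}(G_W(X^i)) - Y^i \big)\, G_W(X^i)
    \text{Learning rule:}\quad W \leftarrow W + \eta\, \big( Y^i - \operatorname{sign}(G_W(X^i)) \big)\, \frac{\partial G_W(X^i)}{\partial W}

If G_W(X) = W^\top X, the learning rule is the classical perceptron update.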

Examples of Loss Functions: Generalized Margin Losses
First, we need to define the Most Offending Incorrect Answer:
- Most Offending Incorrect Answer, discrete case
- Most Offending Incorrect Answer, continuous case
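The standard definitions (Y^i is the correct answer for sample i; the margin radius epsilon in the continuous case is assumed notation):

    \text{Discrete:}\quad \bar{Y}^i = \arg\min_{Y \in \mathcal{Y},\, Y \neq Y^i} E(W, Y, X^i)
    \text{Continuous:}\quad \bar{Y}^i = \arg\min_{Y \in \mathcal{Y},\, \|Y - Y^i\| > \epsilon} E(W, Y, X^i)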

Examples of Generalized Margin Losses
Hinge Loss [Altun et al. 2003], [Taskar et al. 2003]: with the linearly-parameterized binary classifier architecture, we get linear SVMs.
Log Loss ("soft hinge" loss): with the linearly-parameterized binary classifier architecture, we get linear Logistic Regression.
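In energy form (standard expressions; m is an assumed positive margin and \bar{Y}^i is the most offending incorrect answer defined above):

    L_{\text{hinge}}(W, Y^i, X^i) = \max\big( 0,\; m + E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i) \big)
    L_{\text{log}}(W, Y^i, X^i) = \log\big( 1 + e^{\, E(W, Y^i, X^i) - E(W, \bar{Y}^i, X^i)} \big)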

Negative Log-Likelihood Loss
Conditional probability of the samples (assuming independence), with the Gibbs distribution for each sample.
We get the NLL loss by taking the negative log of this probability and dividing by P and beta.
Reduces to the perceptron loss when beta -> infinity.
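Written out (standard form; P is the number of training samples and beta the inverse temperature):

    L_{\text{nll}}(W) = \frac{1}{P} \sum_{i=1}^{P} \left[ E(W, Y^i, X^i) + \frac{1}{\beta} \log \int_y e^{-\beta E(W, y, X^i)} \right]

As beta -> infinity, the log-integral term tends to -\min_y E(W, y, X^i), which gives the perceptron loss.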

What Makes a "Good" Loss Function
Good and bad loss functions.

Latent Variable Models
The energy includes "hidden" variables Z whose value is never given to us.

Latent Variables in Weakly Supervised Learning
Variables that would make the task easier if they were known:
- Scene Analysis: segmentation of the scene into regions or objects.
- Part-of-Speech Tagging: the segmentation of the sentence into syntactic units, the parse tree.
- Speech Recognition: the segmentation of the sentence into phonemes or phones.
- Handwriting Recognition: the segmentation of the line into characters.
In general, we will search for the value of the latent variable that allows us to get an answer (Y) of smallest energy (written out below).
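In symbols (standard energy-based treatment of latent variables):

    (Y^*, Z^*) = \arg\min_{Y \in \mathcal{Y},\, Z \in \mathcal{Z}} E(W, Y, Z, X), \qquad E(W, Y, X) = \min_{Z \in \mathcal{Z}} E(W, Y, Z, X)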

Probabilistic Latent Variable Models
Marginalizing over latent variables instead of minimizing over them.
Equivalent to traditional energy-based inference with a redefined energy function.
Reduces to traditional minimization when beta -> infinity.
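The redefined energy is the free energy obtained by marginalizing over Z at inverse temperature beta (standard form):

    E_\beta(W, Y, X) = -\frac{1}{\beta} \log \int_z e^{-\beta E(W, Y, z, X)}

As beta -> infinity, this tends to \min_z E(W, Y, z, X), recovering the minimization of the previous slide.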

Energy-Based Factor Graphs
When the energy is a sum of partial energy functions (or when the probability is a product of factors), efficient inference algorithms can be used (without the normalization step).
Example (chain over X, Z1, Z2, Z3, Y): E(Y, Z, X) = E1(X,Z1) + E2(Z1,Z2) + E3(Z2,Z3) + E4(Z3,Y).

Efficient Inference: Energy-Based Factor Graphs
The energy is a sum of "factor" functions.
Example: Z1, Z2, Y1 are binary; Y2 is ternary.
A naive exhaustive inference would require 2x2x2x3 = 24 energy evaluations (= 96 factor evaluations).
BUT: Ea only has 2 possible input configurations, Eb and Ec have 4 each, and Ed has 6. Hence we can precompute the 16 factor values and put them on the arcs in a trellis.
A path in the trellis is a configuration of the variables; the cost of the path is the energy of that configuration.
(Figure: factor graph and equivalent trellis.)
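A minimal sketch of trellis-based min-sum inference for a chain of pairwise factors (illustrative only; the function name, data layout, and toy energies are assumptions, not code from the slides):

    # Minimal min-sum (Viterbi-style) inference over a chain factor graph:
    # factors[k][i][j] is the energy of the arc assigning value i to variable k
    # and value j to variable k+1; the energy of a path is the sum of its arcs.

    def min_sum_chain(factors, cardinalities):
        """Return (lowest total energy, minimizing configuration) for the chain."""
        best = [0.0] * cardinalities[0]   # best partial energy per value of variable 0
        back = []                         # backpointers for recovering the argmin path
        for k, factor in enumerate(factors):
            new_best = [float("inf")] * cardinalities[k + 1]
            new_back = [0] * cardinalities[k + 1]
            for j in range(cardinalities[k + 1]):
                for i in range(cardinalities[k]):
                    cost = best[i] + factor[i][j]
                    if cost < new_best[j]:
                        new_best[j], new_back[j] = cost, i
            best = new_best
            back.append(new_back)
        # Backtrack from the best value of the last variable.
        j = min(range(len(best)), key=best.__getitem__)
        config = [j]
        for pointers in reversed(back):
            j = pointers[j]
            config.append(j)
        return min(best), list(reversed(config))

    # Toy usage with the cardinalities from the slide (three binary variables and
    # one ternary variable) and made-up factor energies:
    factors = [
        [[0.5, 1.2], [0.3, 2.0]],            # arcs between variable 0 and variable 1
        [[1.0, 0.1], [0.7, 0.9]],            # arcs between variable 1 and variable 2
        [[0.2, 0.6, 1.5], [0.4, 0.8, 0.3]],  # arcs between variable 2 and variable 3
    ]
    print(min_sum_chain(factors, cardinalities=[2, 2, 2, 3]))

The dynamic program visits each arc of the trellis exactly once, which is why precomputing the 16 factor values is enough, instead of the 96 factor evaluations of exhaustive search.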

Example: The Conditional Random Field Architecture
A CRF is an energy-based factor graph in which:
- the factors are linear in the parameters (shallow factors),
- the factors take neighboring output variables as inputs,
- the factors are often all identical.
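For a chain over output variables Y_1, ..., Y_n, the energy then takes the familiar linear form (a sketch; the shared feature function f is assumed notation):

    E(W, Y, X) = \sum_i W^\top f(X, Y_i, Y_{i+1})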

Example: The Conditional Random Field Architecture
Applications:
- X is a sentence, Y is a sequence of part-of-speech tags (there is one Yi for each possible group of words).
- X is an image, Y is a set of labels for each window in the image (vegetation, building, sky, ...).

Shallow Factors / Deep Graph
Linearly parameterized factors (shallow factors):
- with the NLL loss: Lafferty's Conditional Random Fields, Kumar & Hebert's DRFs
- with the hinge loss: Taskar's Max-Margin Markov Nets
- with the perceptron loss: Collins's sequence labeling model
- with the log loss: Altun/Hofmann's sequence labeling model

Energy-Based Belief Prop
The previous picture shows a chain graph of factors with 2 inputs.
The extension of this procedure to trees, with factors that can have more than 2 inputs, is the "min-sum" algorithm (a non-probabilistic form of belief propagation).
Basically, it is the sum-product algorithm with a different semi-ring algebra (min instead of sum, sum instead of product), and no normalization step.
[Kschischang, Frey, Loeliger, 2001] [MacKay's book]
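A sketch of the min-sum message updates on a tree-structured factor graph (standard textbook form; the notation ∂F for the set of variables attached to factor F is an assumption, not taken from the slide):

    \mu_{F \to x}(x) = \min_{x_{\partial F \setminus x}} \Big[ E_F(x_{\partial F}) + \sum_{x' \in \partial F \setminus x} \mu_{x' \to F}(x') \Big]
    \mu_{x \to F}(x) = \sum_{G \ni x,\; G \neq F} \mu_{G \to x}(x)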

Feed-Forward, Causal, and Bi-directional Models
EBFGs are all "undirected", but the architecture determines the complexity of inference in certain directions:
- Feed-forward: predicting Y from X is easy; predicting X from Y is hard.
- "Causal": predicting Y from X is hard; predicting X from Y is easy.
- Bi-directional: X->Y and Y->X are both hard if the two factors don't agree; both easy if the factors agree.
(Figure: three architectures built from a "complicated function" factor and a "distance" factor.)

Deep Factors / Deep Graph: ASR with TDNN/DTW
Trainable Automatic Speech Recognition system with convolutional nets (TDNN) and dynamic time warping (DTW).
Training the feature extractor as part of the whole process.
- with the LVQ2 loss: Driancourt and Bottou's speech recognizer (1991)
- with NLL: Bengio's speech recognizer (1992), Haffner's speech recognizer (1993)

Deep Factors / Deep Graph: ASR with TDNN/HMM
Discriminative Automatic Speech Recognition system with an HMM and various acoustic models.
Training the acoustic model (feature extractor) and a (normalized) HMM in an integrated fashion.
- with the Minimum Empirical Error loss: Ljolje and Rabiner (1990)
- with NLL: Bengio (1992), Haffner (1993), Bourlard (1994)
- with MCE: Juang et al. (1997)
Late normalization scheme (un-normalized HMM):
- Bottou pointed out the label bias problem (1991)
- Denker and Burges proposed a solution (1995)

Really Deep Factors / Really Deep Graph
Handwriting recognition with Graph Transformer Networks: un-normalized hierarchical HMMs.
- Trained with the perceptron loss [LeCun, Bottou, Bengio, Haffner 1998]
- Trained with the NLL loss [Bengio, LeCun 1994], [LeCun, Bottou, Bengio, Haffner 1998]
Answer = sequence of symbols; latent variable = segmentation.