The Sequence Memoizer
Frank Wood (Gatsby), Cédric Archambeau (UCL), Jan Gasthaus (Gatsby), Lancelot James (HKUST), Yee Whye Teh (Gatsby)

Executive Summary
Model
- Smoothing Markov model of discrete sequences
- Extension of the hierarchical Pitman-Yor process [Teh 2006], with unbounded depth (context length)
Algorithms and estimation
- Linear-time suffix-tree graphical model identification and construction
- Standard Chinese restaurant franchise sampler
Results
- Maximum contextual information used during inference
- Competitive language modelling results: the limit of an n-gram language model as n → ∞
- Same computational cost as a Bayesian interpolating 5-gram language model

Executive Summary
Uses
- Any situation in which a low-order Markov model of discrete sequences is insufficient
- Drop-in replacement for a smoothing Markov model
Name?
- "A Stochastic Memoizer for Sequence Data" → Sequence Memoizer (SM)
- Describes posterior inference [Goodman et al '08]

Statistically Characterizing a Sequence
Sequence Markov models are usually constructed by treating a sequence as a set of (exchangeable) observations in fixed-length contexts (unigram, bigram, trigram, 4-gram, ...). As the context length / order of the Markov model increases:
- the number of observations per context decreases
- the number of conditional distributions to estimate (indexed by context) increases
- the power of the model increases
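To make the fixed-length-context view concrete, here is a minimal Python sketch (mine, not from the slides; the helper name `ngram_observations` is made up) that splits a sequence into (context, symbol) observation pairs for a given Markov order.

```python
from collections import defaultdict

def ngram_observations(sequence, order):
    """Treat each symbol as an observation in its preceding fixed-length context."""
    obs = defaultdict(list)
    for i in range(order, len(sequence)):
        context = tuple(sequence[i - order:i])  # the `order` preceding symbols
        obs[context].append(sequence[i])
    return obs

# "cacao" read as a character sequence; order 2 gives trigram-style observations.
print(dict(ngram_observations("cacao", 2)))
# {('c', 'a'): ['c', 'o'], ('a', 'c'): ['a']}
```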

Finite Order Markov Model Example

Learning Discrete Conditional Distributions
A discrete distribution corresponds to a vector of parameters.
Counting / maximum likelihood estimation
- Training sequence $x_{1:N}$
- Predictive inference
Example
- Non-smoothed unigram model ($u = \epsilon$, the empty context)
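As a concrete instance of the counting estimate (a sketch of mine, not the authors' code): the maximum likelihood estimate of a unigram model is just the normalized symbol counts in the training sequence.

```python
from collections import Counter

def mle_unigram(training_sequence):
    """Non-smoothed unigram: P(s) = count(s) / N for the empty context u = ''."""
    counts = Counter(training_sequence)
    total = sum(counts.values())
    return {symbol: c / total for symbol, c in counts.items()}

probs = mle_unigram("cacao")
print(probs)  # {'c': 0.4, 'a': 0.4, 'o': 0.2}
```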

Bayesian Smoothing Estimation
Predictive inference with priors over distributions.
Net effect
- Inference is "smoothed" with respect to uncertainty about the unknown distribution
Example
- Smoothed unigram model ($u = \epsilon$)
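As one concrete example of this kind of smoothing (not an equation taken from the slides): placing a Dirichlet prior with base distribution $G_0$ and concentration $c$ on the unigram distribution gives the posterior predictive

\[
P(x_{N+1} = s \mid x_{1:N}) \;=\; \frac{n_s + c\, G_0(s)}{N + c},
\qquad n_s = \#\{i \le N : x_i = s\},
\]

which interpolates between the empirical counts and the prior base distribution $G_0$.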

A Way To Tie Together Distributions: the Pitman-Yor Process [Pitman and Yor '97]
- A tool for tying together related distributions in hierarchical models
- A measure over measures; the base measure is the "mean" measure
- A distribution drawn from a Pitman-Yor process is related to its base distribution (equal when c = ∞ or d = 1)
- Parameters: concentration c, discount d, base distribution G_0

Pitman-Yor Process Continued
- Generalization of the Dirichlet process (recovered when d = 0)
- Different (power-law) properties; better for text [Teh, 2006] and images [Sudderth and Jordan, 2009]
- The posterior predictive distribution forms the basis for straightforward, simple samplers, and gives the rule for stochastic memoization
- (Slide annotation: the integral defining the predictive cannot actually be computed directly this way)
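The predictive rule referred to here is not legible in the transcript; the standard Chinese-restaurant form for a Pitman-Yor process with discount $d$, concentration $c$, and base distribution $G_0$ (written from the literature, not copied from the slide) is

\[
P(x_{N+1} = s \mid x_{1:N}) \;=\; \frac{n_s - d\, t_s}{c + N} \;+\; \frac{c + d\, T}{c + N}\, G_0(s),
\]

where $n_s$ is the number of previous observations of symbol $s$, $t_s$ the number of "tables" serving $s$, and $T = \sum_s t_s$ the total number of tables.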

Hierarchical Bayesian Smoothing Estimation
Predictive inference with naturally related distributions tied together.
Net effect
- Observations in one context affect inference in other contexts
- Statistical strength is shared between similar contexts
Example
- Smoothed bigram ($w = \epsilon$; $u, v \in \Sigma$)
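One way to see how strength is shared (a sketch based on the hierarchical Pitman-Yor literature, not an equation taken from these slides): the predictive probability in context $u$ backs off recursively to the predictive in the shorter context $\sigma(u)$ obtained by dropping the most distant symbol,

\[
P(s \mid u) \;=\; \frac{n_{us} - d_{|u|}\, t_{us}}{c_{|u|} + n_{u\cdot}}
\;+\; \frac{c_{|u|} + d_{|u|}\, t_{u\cdot}}{c_{|u|} + n_{u\cdot}}\; P\bigl(s \mid \sigma(u)\bigr),
\]

so counts observed in one context influence predictions in every context that shares a suffix with it.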

SM/HPYP Sharing in Action
(Figure: conditional distributions, posterior predictive probabilities, and observations.)

Chinese Restaurant Franchise (CRF) Particle Filter Posterior Update
(Animation: conditional distributions, posterior predictive probabilities, observations, CPU.)

HPYP LM Sharing Architecture
Share statistical strength between sequentially related predictive conditional distributions:
- estimates of highly specific conditional distributions
- are coupled with related ones
- through a single common, more general shared ancestor.
This corresponds intuitively to back-off.
(Figure: hierarchy over unigram, 2-gram, 3-gram, and 4-gram contexts.)

Hierarchical Pitman-Yor Process
- Bayesian generalization of the smoothing n-gram Markov model
- As a language model it outperforms interpolated Kneser-Ney (KN) smoothing
- Efficient inference algorithms exist [Goldwater et al '05; Teh '06; Teh, Kurihara, Welling '08]
- Sharing only between contexts that differ in the most distant symbol
- Finite depth [Goldwater et al '05; Teh '06]

Alternative Sequence Characterization
A sequence can instead be characterized by a set of single observations in unique contexts of growing length: as the context length grows, there is always a single observation per context. Foreshadowing: all suffixes of the string "cacao".
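To make this concrete (my illustration, not from the slides): in this view every symbol of "cacao" is a single observation in the unique, ever-growing context of everything that precedes it.

```python
def growing_context_observations(sequence):
    """Each symbol is a single observation in the full context preceding it."""
    return [(sequence[:i], sequence[i]) for i in range(len(sequence))]

for context, symbol in growing_context_observations("cacao"):
    print(f"context {context!r:8} -> observation {symbol!r}")
# context ''       -> observation 'c'
# context 'c'      -> observation 'a'
# context 'ca'     -> observation 'c'
# context 'cac'    -> observation 'a'
# context 'caca'   -> observation 'o'
```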

"Non-Markov" Model Example
Smoothing is essential: there is only one observation in each context!
Solution: hierarchical sharing à la HPYP.

Sequence Memoizer
- Eliminates Markov order selection
- Always uses the full context when making predictions
- Linear-time, linear-space (in the length of the observation sequence) graphical model identification
- Performance is the limit of an n-gram model as n → ∞
- Same or lower overall cost than a 5-gram interpolating Kneser-Ney model

Graphical Model Trie
(Figure: observations and latent conditional distributions with Pitman-Yor priors / stochastic memoizers.)

Suffix Trie Data Structure
(Figure: all suffixes of the string "cacao".)

Suffix Trie Data Structure
- A deterministic finite automaton that recognizes all suffixes of an input string
- Requires O(N²) time and space to build and store [Ukkonen, 95]
- Too intensive for any practical sequence modelling application
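For illustration only (a sketch of mine, not code from the authors), a naive suffix-trie builder; the quadratic growth in node count is already visible on a string as short as "cacao".

```python
def build_suffix_trie(s):
    """Naively insert every suffix of s into a trie of nested dicts: O(N^2) nodes."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

trie = build_suffix_trie("cacao")
print(count_nodes(trie))  # 13 nodes (including the root) for a 5-symbol string
```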

Suffix Tree
- A deterministic finite automaton that recognizes all suffixes of an input string
- Uses path compression to reduce storage and construction complexity
- Requires only O(N) time and space to build and store [Ukkonen, 95]
- Practical for large-scale sequence modelling applications
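A minimal sketch of the path-compression idea (mine; this is not Ukkonen's algorithm, which builds the compressed tree directly in linear time): collapse the unary chains of the trie above so that each edge is labelled by a substring rather than a single symbol.

```python
def build_suffix_trie(s):
    """As before: O(N^2) trie containing every suffix of s."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root

def path_compress(node):
    """Collapse chains of single-child nodes into edges labelled by substrings."""
    compressed = {}
    for label, child in node.items():
        while len(child) == 1:                      # follow the unary chain
            (next_label, next_child), = child.items()
            label, child = label + next_label, next_child
        compressed[label] = path_compress(child)
    return compressed

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

tree = path_compress(build_suffix_trie("cacao"))
print(tree)
# {'ca': {'cao': {}, 'o': {}}, 'a': {'cao': {}, 'o': {}}, 'o': {}}
print(count_nodes(tree))  # 8 nodes, down from 13 in the uncompressed trie
```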

Suffix Trie Data Structure

Suffix Tree Data Structure

Graphical Model Identification
Under the covers this is a graphical model transformation. Path compression requires being able to analytically marginalize out nodes from the graphical model. The result of this marginalization can be thought of as providing a different set of caching rules to the memoizers on the path-compressed edges.

Marginalization
Theorem 1 (Coagulation) [Pitman '99; Ho, James, Lau '06; Wood, Archambeau, Gasthaus, James, Teh '09]
(Figure: the chain G1 → G2 → G3 collapses to G1 → G3 when G2 is marginalized out.)
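The coagulation property, as stated in the Sequence Memoizer literature (reproduced here from memory rather than from the slide, so worth checking against the paper): with concentration parameters fixed to zero, Pitman-Yor draws nested along a chain can be collapsed, and the discounts simply multiply,

\[
G_2 \mid G_1 \sim \mathrm{PY}(d_1,\, 0,\, G_1), \quad
G_3 \mid G_2 \sim \mathrm{PY}(d_2,\, 0,\, G_2)
\;\;\Longrightarrow\;\;
G_3 \mid G_1 \sim \mathrm{PY}(d_1 d_2,\, 0,\, G_1).
\]

This is what allows the intermediate nodes on a non-branching path of the trie to be marginalized away, leaving a tree whose size is linear in the sequence length.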

Graphical Model Trie

Graphical Model Tree

Graphical Model Initialization
Given a single input sequence:
- Ukkonen's linear-time suffix tree construction algorithm is run on its reverse to produce a prefix tree
- This identifies the nodes in the graphical model that we need to represent
- The tree is traversed and the path-compressed Pitman-Yor parameters are assigned to each remaining Pitman-Yor process
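A schematic of this initialization, under loudly stated assumptions: the function names (`build_suffix_tree`, `assign_py_parameters`) are mine, a naive quadratic build stands in for Ukkonen's linear-time construction, no terminator symbol is used (so the tree is only approximate), and the constant per-depth discount is a placeholder. The discount assigned to a compressed edge is the product of the per-symbol discounts along it, as licensed by the coagulation result above.

```python
def build_suffix_tree(s):
    """Naive stand-in: build a suffix trie, then path-compress unary chains."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})

    def compress(node):
        out = {}
        for label, child in node.items():
            while len(child) == 1:
                (next_label, next_child), = child.items()
                label, child = label + next_label, next_child
            out[label] = compress(child)
        return out

    return compress(root)

def assign_py_parameters(tree, depth_discount, depth=0, prefix=""):
    """Give every remaining node a Pitman-Yor prior whose discount is the product
    of the per-symbol discounts along its path-compressed incoming edge."""
    params = {}
    for label, child in tree.items():
        discount = 1.0
        for k in range(depth, depth + len(label)):
            discount *= depth_discount(k)
        context = prefix + label
        params[context] = {"discount": discount, "concentration": 0.0}
        params.update(assign_py_parameters(child, depth_discount,
                                           depth + len(label), context))
    return params

sequence = "cacao"
tree = build_suffix_tree(sequence[::-1])            # run on the reversed sequence
params = assign_py_parameters(tree, lambda k: 0.8)  # placeholder constant discount
for context, p in sorted(params.items()):
    print(context, p)
```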

Nodes In The Graphical Model

Never build more than a 5-gram

Sequence Memoizer Bounds N-Gram Performance
(Figure annotation: the HPYP's computational complexity exceeds the SM's.)

Language Modelling Results

The Sequence Memoizer
- The Sequence Memoizer is a deep (unbounded-depth) smoothing Markov model.
- It can be used to learn a joint distribution over discrete sequences in time and space linear in the length of a single observation sequence.
- It is equivalent to a smoothing ∞-gram but costs no more to compute than a 5-gram.