Forgetting Counts : Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process Nicholas Bartlett, David Pfau, Frank Wood Presented by Yingjian Wang Nov. 17, 2010

Outline: Background; The sequence memoizer; Forgetting; The dependent HPY; Experiment results.

Background
2006, Teh, 'A hierarchical Bayesian language model based on Pitman-Yor processes': an n-gram Markov chain language model with an HPY prior.
2009, Wood et al., 'A Stochastic Memoizer for Sequence Data': the Sequence Memoizer (SM), with a linear space/time inference scheme (lossless).
2010, Gasthaus et al., 'Lossless compression based on the Sequence Memoizer': combines the SM with an arithmetic coder to build the PLUMP/dePLUMP compressor.
2010, Bartlett et al., 'Forgetting Counts: Constant Memory Inference for a Dependent HPY': develops constant-memory inference for the SM by using a dependent HPY (lossy).

SM - two concepts
Memoizer (Donald Michie, 1968): a device which returns previously computed results for the same input instead of recalculating them, in order to save time.
Stochastic Memoizer (Wood et al., 2009): the returned results can change, since the predictive probability is based upon a stochastic process.
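As a reminder of the deterministic idea being generalized, here is a minimal memoizer sketch in Python (illustrative only; not from the paper or the slides):

```python
from functools import lru_cache

# A deterministic memoizer: the result for a given input is computed once,
# cached, and returned unchanged on every later call with the same input.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(40))  # fast, because intermediate results are reused rather than recomputed
```

The stochastic memoizer replaces the cached deterministic value with a draw from a distribution indexed by the input, so repeated queries of the same context can return different predictions while still reusing the stored state.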

SM - model and prefix trie
[Slide figure: the graphical model and the prefix trie of contexts; each node of the trie is a restaurant.]
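To make the trie concrete, here is a small Python sketch (the names and the simple count bookkeeping are stand-ins, not the paper's data structure) that builds a prefix trie of contexts, one restaurant per context:

```python
from collections import defaultdict


class Restaurant:
    """One node of the context trie. As a simple stand-in for the full
    Chinese-restaurant seating state, it just stores next-symbol counts."""
    def __init__(self):
        self.children = {}               # older context symbol -> child restaurant
        self.counts = defaultdict(int)   # next-symbol counts observed in this context


def insert(root, context, next_symbol):
    """Walk the trie along the context from most recent to oldest symbol,
    creating restaurants as needed, and record next_symbol along the path
    (so each context shares statistical strength with its shorter suffixes)."""
    node = root
    node.counts[next_symbol] += 1
    for sym in reversed(context):
        node = node.children.setdefault(sym, Restaurant())
        node.counts[next_symbol] += 1


root = Restaurant()
seq = "abracadabra"
for i, s in enumerate(seq):
    insert(root, seq[:i], s)   # the full preceding string is the context
```

Built this way, the trie grows with the number of distinct contexts, which is why the collapsing step on the next slides, and ultimately the forgetting step, are needed for long sequences.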

SM - the NSP (1)
The Normalized Stable Process (NSP; Perman, 1990) and the Dirichlet process are both special cases of the Pitman-Yor process: the NSP is a Pitman-Yor process with concentration parameter c = 0, and the Dirichlet process is a Pitman-Yor process with discount parameter d = 0.
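For reference, a sketch of the Pitman-Yor predictive (Chinese-restaurant) rule in which these two parameters appear, written in standard notation rather than copied from the slides:

```latex
% Predictive rule for G ~ PY(d, c, G_0), given t occupied tables with
% customer counts n_1, ..., n_t and n = n_1 + ... + n_t:
\[
P(\text{join table } k) = \frac{n_k - d}{n + c},
\qquad
P(\text{new table, dish drawn from } G_0) = \frac{c + d\,t}{n + c}.
\]
% Setting d = 0 recovers the Dirichlet process; setting c = 0 gives the
% normalized stable process used in the Sequence Memoizer hierarchy.
```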

SM - the NSP (2)
Collapse the middle restaurants. Theorem: if G_1 | G_0 ~ NSP(d_1, G_0) and G_2 | G_1 ~ NSP(d_2, G_1), then marginally G_2 | G_0 ~ NSP(d_1 d_2, G_0). Collapsing the non-branching interior restaurants of the prefix trie yields a suffix-tree-shaped structure with a linear number of nodes (Weiner, 1973; Ukkonen, 1995).
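Applied repeatedly along a non-branching chain of restaurants, the theorem lets the whole chain be marginalized into a single edge (a sketch in the notation above, assuming discounts d_1, ..., d_k along the chain):

```latex
% A non-branching path of restaurants with discounts d_1, ..., d_k
% collapses into a single conditional distribution:
\[
G_k \mid G_0 \sim \mathrm{NSP}\!\Big(\prod_{i=1}^{k} d_i,\; G_0\Big),
\]
% so only the branching nodes of the prefix trie need to be stored
% explicitly, giving the linear-space representation.
```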

SM - linear space inference

Forgetting
Motivation: to achieve constant-memory inference on the basis of the SM.
How? Forget (delete) restaurants, the basic memory units in the context tree.
How to delete? Two deletion schemes: random deletion and greedy deletion.

Deletion schemes
Random deletion: delete one leaf restaurant chosen uniformly at random.
Greedy deletion: delete the leaf restaurant whose removal least negatively impacts the estimated likelihood of the observed sequence.
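A minimal Python sketch of the two schemes. The tree layout, the add-one-smoothed surrogate for the likelihood change, and all names here are illustrative assumptions, not the paper's implementation:

```python
import math
import random


class Restaurant:
    """A node of the context tree: next-symbol counts plus child contexts."""
    def __init__(self):
        self.children = {}   # context symbol -> child Restaurant
        self.counts = {}     # next symbol -> count


def leaves(node, path=()):
    """Yield (path, leaf, parent) for every leaf restaurant in the tree."""
    for sym, child in node.children.items():
        if child.children:
            yield from leaves(child, path + (sym,))
        else:
            yield path + (sym,), child, node


def smoothed(node, alphabet):
    """Add-one-smoothed next-symbol distribution at a node."""
    total = sum(node.counts.values()) + len(alphabet)
    return {s: (node.counts.get(s, 0) + 1) / total for s in alphabet}


def likelihood_drop(leaf, parent, alphabet):
    """Crude surrogate for how much the sequence log-likelihood drops if
    `leaf` is forgotten and its predictions fall back to its parent."""
    p_leaf, p_parent = smoothed(leaf, alphabet), smoothed(parent, alphabet)
    return sum(c * (math.log(p_leaf[s]) - math.log(p_parent[s]))
               for s, c in leaf.counts.items())


def delete_leaf(root, path):
    """Unlink the leaf restaurant reached by following `path` from the root."""
    node = root
    for sym in path[:-1]:
        node = node.children[sym]
    del node.children[path[-1]]


def random_deletion(root):
    """Scheme 1: delete one leaf restaurant chosen uniformly at random."""
    path, _, _ = random.choice(list(leaves(root)))
    delete_leaf(root, path)


def greedy_deletion(root, alphabet):
    """Scheme 2: delete the leaf whose removal is estimated to hurt the
    likelihood of the observed sequence the least."""
    path, _, _ = min(leaves(root),
                     key=lambda t: likelihood_drop(t[1], t[2], alphabet))
    delete_leaf(root, path)
```

Either scheme frees one restaurant; triggering a deletion whenever a new restaurant would exceed the budget is what keeps the memory footprint constant.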

The SMC algorithm
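The algorithm itself appears on this slide as pseudocode and figures that the transcript does not capture. As a point of reference only, here is a generic sequential Monte Carlo skeleton for a single Pitman-Yor restaurant, an illustrative stand-in whose particle state, parameters, and resampling rule are assumptions rather than the paper's algorithm:

```python
import copy
import random


class PYParticle:
    """One particle: a Pitman-Yor Chinese-restaurant seating state with a
    uniform base distribution over the alphabet. A toy stand-in for the
    context-tree state tracked per particle in the real model; the
    stochastic table choice below is what makes particles differ."""

    def __init__(self, discount=0.5, concentration=1.0):
        self.d, self.c = discount, concentration
        self.tables = {}          # symbol -> list of table sizes

    def _totals(self):
        n = sum(sum(ts) for ts in self.tables.values())   # customers
        t = sum(len(ts) for ts in self.tables.values())   # tables
        return n, t

    def predict(self, symbol, alphabet):
        """Predictive probability of `symbol` under the current seating."""
        n, t = self._totals()
        ts = self.tables.get(symbol, [])
        base = 1.0 / len(alphabet)
        return ((sum(ts) - self.d * len(ts))
                + (self.c + self.d * t) * base) / (n + self.c)

    def update(self, symbol, alphabet):
        """Seat a new customer for `symbol` at an existing or a new table."""
        _, t = self._totals()
        ts = self.tables.setdefault(symbol, [])
        base = 1.0 / len(alphabet)
        weights = [cnt - self.d for cnt in ts] + [(self.c + self.d * t) * base]
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k < len(ts):
            ts[k] += 1
        else:
            ts.append(1)


def smc(sequence, alphabet, num_particles=50, ess_frac=0.5):
    """Generic SMC loop: weight each particle by its predictive probability
    of the incoming symbol, update its state, and resample when the
    effective sample size drops too low."""
    particles = [PYParticle() for _ in range(num_particles)]
    weights = [1.0 / num_particles] * num_particles
    for symbol in sequence:
        weights = [w * p.predict(symbol, alphabet)
                   for w, p in zip(weights, particles)]
        z = sum(weights)
        weights = [w / z for w in weights]
        for p in particles:
            p.update(symbol, alphabet)
        ess = 1.0 / sum(w * w for w in weights)
        if ess < ess_frac * num_particles:
            particles = [copy.deepcopy(p) for p in
                         random.choices(particles, weights=weights,
                                        k=num_particles)]
            weights = [1.0 / num_particles] * num_particles
    return particles, weights


particles, weights = smc("abracadabra", alphabet="abcdr")
```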

The dependent HPY
But wait: what do we get after the deletion-addition step? Will the processes still be independent? No, since the seating arrangement in the parent restaurant has been changed.

The experiment results