Gibbs Sampling and Hidden Markov Models in the Event Detection Problem By Marc Sobel.

Event Detection Problems A process, e.g., traffic flow, crowd formation, or electronic financial transactions, unfolds in time. We can monitor and observe the flow frequencies at many fixed time points. Typically, there are many causes influencing changes in these frequencies.

Causes for Change Possible causes for change include: a) changes due to noise; i.e., those best modeled by e.g., a Gaussian error distribution. b) periodic changes; i.e., those expected to happen over periodic intervals. c) changes not due to either of the above: these are usually the changes we would like to detect.

Examples: Examples include: 1) Detecting 'Events', which are not pre-planned, involving large numbers of people at a particular location. 2) Detecting 'Fraudulent transactions'. We observe a variety of electronic transactions over many time intervals. We would like to detect when the number of transactions is significantly different from what is expected.

Model for Changes Due to Noise, Periodic Changes, or Other Causes We model changes in 'flow frequency' due to all possible known causes. This is done using latent Poisson processes. The frequency count N(t) at time t is observed. N_0(t) and N_E(t) are independent latent Poisson processes. N_0(t) denotes the frequency due to periodic and noise changes at time 't'; we write λ(t) for the average rate of such changes, so N_0(t) ~ Poisson(λ(t)). N_E(t) denotes the frequency due to causes other than periodic and noise changes; it has rate function γ(t), so N_E(t) ~ Poisson(γ(t)). The rate function λ(t) is regressed on a parametric function of the periodic effects as follows:

The Process N_0(t): We focus on the first example given above and consider the problem of modeling the frequencies of people entering a building, with the eventual purpose of modeling special 'events' connected with these frequencies. We let (a) 'd' stand for the day, (b) 'hh' for the half-hour time interval, and (c) 'b' for the base rate.

Rate Function Due to Periodic and Noise Changes. The rate function due to periodic and noise changes factors into a base rate, a relative day rate, and a relative half-hour rate within the day: λ(t) = λ_b · λ_d(t) · λ_hh(t), where d(t) is the day containing time t and hh(t) is the half-hour period containing time t.

Rate Function Explained This makes sense because, for a time 't' in day 'd' and half-hour period 'h', the rate decomposes (by a chain-rule argument analogous to Bayes' rule) as the base rate times the relative rate for day 'd' times the relative rate for period 'h' within day 'd'. In the sequel, we assume time 't' has been broken up into half-hour periods without re-indexing.
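As a concrete illustration of this decomposition, the sketch below simulates one week of half-hour counts N(t) = N_0(t) + N_E(t) from the multiplicative rate and adds event counts over a fixed window. Every numerical value (base rate, day and half-hour multipliers, event rate and timing) is invented for illustration, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rate components (all values invented for illustration).
lam_base = 5.0                                              # base rate per half hour
lam_day = np.array([0.5, 1.2, 1.2, 1.2, 1.2, 1.2, 0.5])    # relative day rates
lam_day = 7 * lam_day / lam_day.sum()                       # enforce the sum-to-7 constraint
D = 48                                                      # half-hour periods per day
lam_hh = rng.dirichlet(np.ones(D)) * D                      # relative half-hour rates, sum = D

# Rate and non-event counts N_0(t) for one week of half-hour slots.
rate = np.array([lam_base * lam_day[d] * lam_hh[h] for d in range(7) for h in range(D)])
N0 = rng.poisson(rate)

# Add a hypothetical 'event' lasting 4 half-hours on day 3 with extra rate gamma(t) = 8.
NE = np.zeros_like(N0)
event_slots = slice(3 * D + 20, 3 * D + 24)
NE[event_slots] = rng.poisson(8.0, 4)

N = N0 + NE                                                 # the observed frequency count
```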

Example: Work week Say you work 21 hours per week on average, so your base work rate (per week) is λ_b = 21 and your daily base work rate is 3. Your average work rate for Sunday relative to this base is λ_Sunday = (total Sunday rate)/3. The sum of your relative work rates for Sunday, …, Saturday is λ_Sunday + … + λ_Saturday = 7.
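A quick check of this normalization in code; the hours worked per day are invented so that they total 21 hours per week.

```python
import numpy as np

hours_per_day = np.array([1.5, 4.0, 4.0, 3.0, 3.5, 4.0, 1.0])  # hypothetical, totals 21 h/week
lam_base_daily = hours_per_day.sum() / 7          # daily base rate: 3 hours per day
relative_day_rates = hours_per_day / lam_base_daily
print(relative_day_rates)                          # relative rates for Sunday..Saturday
print(relative_day_rates.sum())                    # 7.0 by construction
```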

Modeling Occasional changes in event flow and Noise Where does the noise come in? How do we model occasional changes in the periodic rate parameters? The missing piece is (dramatic pause) ??????!!!!!!!!!

Priors Come to the Rescue Priors serve the purpose of modeling noise and occasional changes in the values of parameters. Thus spake the prior. The base parameter is given a gamma prior, λ_base ~ π(λ) = λ^(α−1) δ^α exp(−λδ)/Γ(α). By flexibly assigning values to the hyperparameters α, δ we can build a distribution which properly characterizes the base rate.
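For instance, one way to pick α and δ (a sketch, not the talk's actual values) is to match a target prior mean and standard deviation for the base rate; with a Gamma(shape α, rate δ) density the mean is α/δ and the variance is α/δ².

```python
from scipy import stats

# Hypothetical prior beliefs about the base rate (people per half hour).
prior_mean, prior_sd = 5.0, 2.0

# Gamma(alpha, rate delta): mean = alpha/delta, variance = alpha/delta^2.
delta = prior_mean / prior_sd**2      # 1.25
alpha = prior_mean * delta            # 6.25

base_prior = stats.gamma(a=alpha, scale=1.0 / delta)
print(base_prior.mean(), base_prior.std())   # recovers 5.0 and 2.0
print(base_prior.interval(0.95))             # central 95% prior interval for the base rate
```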

Interpretation The λ_day's, being conditional rates, satisfy ∑_days λ_day = ∑_i [(average day 'i' total)/λ_base] = 7. Similarly, summing over the periods within a day, ∑_j λ_{j'th period during day i} = ∑_j [(average j'th period frequency in day i)/λ_{day i}] = D, where 'D' stands for the number of half-hour intervals in a day.

A Simple Example Illustrating the Issue What do these mean? Assume, for purposes of simplification, that there are a total of 2 days in a 'week': Sunday and Monday. Daily rates are measured in events per day. The base rate is the average rate for Sundays and Mondays combined. A) The Sunday and Monday relative rates add up to 2. B) Suppose we observe 10 people (total) on Sundays and 30 on Mondays, over a total of 10 'weeks'.

(continued) C) Maximum likelihood dictates estimating the base rate (per week) as 40/10 = 4 people per week (or 2 people per day); Sunday's relative rate is 10 (people)/[2 × 10 (weeks)] = 0.5 and Monday's relative rate is 1.5. D) But this is incorrect because (i) it turns out that one week out of 10, the conditional Monday rate shoots up to 1.90 while the Sunday rate decreases to 0.10, and (ii) it turns out that usually the conditional Sunday rate is 1 rather than 0.5.

The Bayesian Formulation Wins Out: We can build a model with this new information by assuming a beta prior for half the Sunday relative rate (and hence for the Monday rate, since the two sum to 2): (0.5)·λ_Sunday ~ λ^(0.66−1)(1−λ)^(0.66−1)/B(0.66, 0.66), i.e., a Beta(0.66, 0.66) prior. This prior has the required properties that the Sunday rate dips down to 0.10 about 10 percent of the time, but averages 1 over the entire interval.
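A small numerical check of these two claims; this is only a sketch using scipy, with the Beta(0.66, 0.66) parameters taken from the slide.

```python
from scipy import stats

half_rate_prior = stats.beta(0.66, 0.66)   # prior on 0.5 * lambda_Sunday

# Prior mean of lambda_Sunday = 2 * E[Beta(0.66, 0.66)] = 2 * 0.5 = 1.
print(2 * half_rate_prior.mean())

# Probability that the Sunday relative rate is at or below 0.1,
# i.e. that 0.5 * lambda_Sunday <= 0.05; the slide puts this near 10 percent.
print(half_rate_prior.cdf(0.05))
```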

The Failure of Classical Theory The MLE of λ_Sunday is 0.5, as computed above. The Bayes estimator of λ_Sunday is the posterior mean under the beta prior. But even more importantly, the posterior distribution of the parameter provides information useful to all other inference and prediction in the problem. Medicare for classical statistics?

Illustration Posterior distribution for twice the Sunday relative rate (figure).

Actual Priors Used For our example, we have seven rather than 2 days in a week. We use scaled Dirichlet priors (extensions of beta priors) for this: the vector of relative day rates, scaled as p = (λ_Sunday/7, …, λ_Saturday/7), is given a Dirichlet(α_1, …, α_7) prior with density proportional to ∏ p_i^(α_i − 1). Smaller α's indicate smaller a priori relative frequencies. A smaller sum of the α's indicates greater relative-frequency variance for the p's. This provides a flexible way to model the daily rates.
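A minimal sketch of such a scaled Dirichlet prior on the seven relative day rates; the α values below are invented for illustration (weekends assumed quieter than weekdays).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Dirichlet hyperparameters for Sunday..Saturday.
alpha = np.array([2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0])

# p ~ Dirichlet(alpha); the relative day rates are the scaled version 7 * p.
p = rng.dirichlet(alpha, size=1000)
day_rates = 7 * p

print(day_rates.mean(axis=0))      # prior means of the relative day rates
print(day_rates.sum(axis=1)[:3])   # each draw sums to 7 by construction
print(day_rates.std(axis=0))       # spread shrinks as sum(alpha) grows
```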

Events: The Process N_E Events signify times during which there are higher frequencies that are not due to periodic or noise causes. We model this by setting z(t) = 1 during such events and z(t) = 0 otherwise, with transition probabilities P(z(t)=1 | z(t−1)=0) = 1 − z_00; P(z(t)=0 | z(t−1)=0) = z_00; P(z(t)=1 | z(t−1)=1) = z_11; P(z(t)=0 | z(t−1)=1) = 1 − z_11. That is, if there is no event at time t−1, the chance of an event at time t is 1 − z_00.

The Need for a Bayesian Treatment of Events This gives latent 'geometric' distributions for event and non-event durations. Assume z_00 = 0.8 and z_11 = 0.1. Then non-events tend to last an average of 1/0.2 = 5 half-hours, while events tend to last an average of 1/0.9 ≈ 1.11 half-hours. Classical statistics would dictate direct estimation of the z's, but this says nothing about the tendency of events to exhibit non-average behavior, and it doesn't provide information for prediction and estimation.
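A quick simulation confirming these average durations; z_00 = 0.8 and z_11 = 0.1 are the values from the slide, and the chain length is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
z00, z11 = 0.8, 0.1

# Simulate the two-state event chain z(t) for many half-hour steps.
T = 100_000
z = np.zeros(T, dtype=int)
for t in range(1, T):
    stay = z00 if z[t - 1] == 0 else z11
    z[t] = z[t - 1] if rng.random() < stay else 1 - z[t - 1]

# Average run lengths: about 1/(1 - z00) = 5 for non-events, 1/(1 - z11) ~ 1.11 for events.
changes = np.flatnonzero(np.diff(z)) + 1
runs = np.diff(np.r_[0, changes, T])
states = z[np.r_[0, changes]]
print(runs[states == 0].mean(), runs[states == 1].mean())
```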

Priors for Event Probabilities Beta distributions serve as priors for the z's: z_00 ~ z_00^(a_0−1)(1−z_00)^(b_0−1), i.e., a Beta(a_0, b_0) prior, and z_11 analogously. This characterizes the behavior of the underlying latent process; the hyperparameters a, b are designed to model that behavior. Recall that N_0(t) (the non-event process) characterizes periodic and noise changes, while the event process N_E(t) characterizes other changes. N_E(t) is 0 if z(t) = 0 and Poisson with rate γ(t) if z(t) = 1. So, if there is no event, N(t) = N_0(t); if there is an event, the observed frequency is N(t) = N_0(t) + N_E(t). The rate γ(t) is itself gamma with parameters a_E and b_E; hence N_E(t) is marginally negative binomial with p = b_E/(1+b_E) and n = N.
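The Poisson-gamma mixture behind that last statement can be checked by simulation; this is a sketch with invented values of a_E and b_E.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

a_E, b_E = 3.0, 0.5          # hypothetical gamma hyperparameters (shape a_E, rate b_E)

# gamma(t) ~ Gamma(a_E, rate b_E), then N_E(t) | gamma ~ Poisson(gamma).
gammas = rng.gamma(shape=a_E, scale=1.0 / b_E, size=100_000)
NE = rng.poisson(gammas)

# Marginally N_E ~ NegativeBinomial(a_E, p = b_E / (1 + b_E)).
p = b_E / (1 + b_E)
print(NE.mean(), stats.nbinom(a_E, p).mean())   # simulated vs. theoretical mean
print(NE.var(), stats.nbinom(a_E, p).var())     # simulated vs. theoretical variance
```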

Gibbs Sampling Gibbs sampling works by simulating each parameter/latent variable conditional on all the rest. The λ's are the parameters; the z's and N's are the latent variables. The resulting simulated values have an empirical distribution that approximates the true posterior distribution. It works because the joint distribution of the parameters is determined by the set of all such conditional distributions.

Gibbs Sampling Given z(t) = 0 and the remaining parameters, put N_0(t) = N(t) and N_E(t) = 0. If z(t) = 1, simulate N_E(t) as negative binomial with parameters N(t) and b_E/(1+b_E), and put N_0(t) = N(t) − N_E(t). To simulate z(t), define the conditional probability of z(t) = 1 given the neighboring states z(t−1), z(t+1) and the observed count N(t).
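A minimal sketch of this count-splitting step. It assumes the negative binomial draw is restricted to 0, …, N(t) and renormalized so that N_0(t) = N(t) − N_E(t) stays nonnegative; that restriction, and the value of b_E, are assumptions of the sketch rather than details from the slide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
b_E = 0.5                            # hypothetical hyperparameter of the event-rate prior
p = b_E / (1.0 + b_E)                # negative binomial probability from the slide

def split_count(N_t, z_t):
    """Split the observed count N(t) into (N0(t), NE(t)) given the event indicator z(t)."""
    if z_t == 0:
        return N_t, 0                # no event: the whole count is background
    # Negative binomial weights with parameters N(t) and p, restricted to 0..N(t)
    # and renormalized so that N0(t) cannot go negative (an assumption of this sketch).
    k = np.arange(N_t + 1)
    w = stats.nbinom.pmf(k, N_t, p)
    w = w / w.sum()
    NE_t = int(rng.choice(k, p=w))
    return N_t - NE_t, NE_t

print(split_count(12, 1))            # e.g. 12 people observed during an event half hour
print(split_count(12, 0))
```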

More of Gibbs Sampling Then, if the previous state was 0, the update for z(t) combines the transition probabilities out of state 0 (and into the following state) with the likelihood of the observed count N(t) under each candidate state.
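The slide's exact expression is not reproduced here; the following is a minimal sketch of a standard single-site Gibbs update for a binary Markov state, under the assumption that the count likelihood during an event is the Poisson/negative-binomial convolution implied by the previous slides. The hyperparameters a_E, b_E and the numerical inputs are all invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def count_likelihood(N_t, lam_t, state, a_E=3.0, b_E=0.5):
    """Likelihood of the observed count N(t) under z(t)=state (hypothetical hyperparameters)."""
    if state == 0:
        return stats.poisson.pmf(N_t, lam_t)
    # Under an event, N(t) = N0(t) + NE(t): sum over all possible splits of the count.
    k = np.arange(N_t + 1)
    p = b_E / (1.0 + b_E)
    return np.sum(stats.poisson.pmf(N_t - k, lam_t) * stats.nbinom.pmf(k, a_E, p))

def gibbs_update_z(z_prev, z_next, N_t, lam_t, z00, z11):
    """Draw z(t) given its neighbours in the chain and the observed count."""
    trans = np.array([[z00, 1 - z00], [1 - z11, z11]])   # trans[i, j] = P(z(t)=j | z(t-1)=i)
    w = np.array([
        trans[z_prev, s] * trans[s, z_next] * count_likelihood(N_t, lam_t, s)
        for s in (0, 1)
    ])
    return int(rng.random() < w[1] / w.sum())

# Example: previous and next states 0, an unusually large count for a background rate of 4.
print(gibbs_update_z(0, 0, N_t=15, lam_t=4.0, z00=0.8, z11=0.1))
```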

Gibbs Sampling (Continued) Having simulated z(t), we can simulate the parameters from their conditional posterior distributions, where 'N_day' denotes the number of 'day' units in the data and 'N_hh' denotes the number of 'hh' periods in the data.
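For the base rate, for example, the conditional posterior is a standard gamma conjugate update. The sketch below assumes the relative day and half-hour rates are held fixed at their current values and uses invented hyperparameters; it is not the talk's exact update.

```python
import numpy as np

rng = np.random.default_rng(6)

def update_lambda_base(N0, lam_day, lam_hh, day_idx, hh_idx, alpha=6.25, delta=1.25):
    """Gamma conjugate draw for the base rate given the background counts N0(t)
    and the current relative day / half-hour rates (a sketch, not the slide's exact step)."""
    exposure = np.sum(lam_day[day_idx] * lam_hh[hh_idx])   # sum_t lam_d(t) * lam_hh(t)
    shape = alpha + N0.sum()
    rate = delta + exposure
    return rng.gamma(shape=shape, scale=1.0 / rate)

# Tiny synthetic check: two days, three half-hour slots per day.
day_idx = np.repeat([0, 1], 3)
hh_idx = np.tile([0, 1, 2], 2)
lam_day = np.array([0.8, 1.2])
lam_hh = np.array([0.5, 1.5, 1.0])
N0 = np.array([2, 7, 5, 4, 9, 6])
print(update_lambda_base(N0, lam_day, lam_hh, day_idx, hh_idx))
```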

Gibbs Sampling (conclusion) We can simulate from the remaining conditional posterior distributions using standard MCMC techniques.

END – Thank You

Polya Tree Priors A more general methodology for introducing multiple prior levels is Polya tree priors (see Michael Lavine). For these priors, we divide the time interval (e.g., a week) into parts with relative frequencies p_1, …, p_k, where 'p' has a Dirichlet distribution. Given p, we further divide each of these parts into sub-parts with corresponding conditional Dirichlet distributions, and we can continue to subdivide until it is no longer useful.
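A minimal two-level sketch of this construction, with simple Dirichlet hyperparameters at each level; everything here is illustrative rather than taken from Lavine's construction in detail.

```python
import numpy as np

rng = np.random.default_rng(7)

# Level 1: split the week into 7 days with a Dirichlet prior on the day shares.
day_share = rng.dirichlet(np.full(7, 2.0))

# Level 2: conditionally on each day, split that day into 48 half hours
# with its own Dirichlet prior; deeper levels would continue the same way.
halfhour_share = np.vstack([rng.dirichlet(np.full(48, 1.0)) for _ in range(7)])

# The probability that a count falls in day d and half hour h is the product
# of the level-1 share and the level-2 conditional share.
joint = day_share[:, None] * halfhour_share
print(joint.sum())          # 1.0: a proper distribution over the 7 * 48 cells
```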