Lecture 11 Probability and Time


Computer Science CPSC 502, Lecture 11: Probability and Time (Ch. 6.5)

Where are we? (Big-picture map of the course: environments are deterministic or stochastic, and problems are static or sequential. For stochastic static problems we answer queries with belief networks, using variable elimination or approximate inference; this lecture extends belief networks to cover temporal processes, i.e., Markov processes and temporal inference. Stochastic sequential problems use decision networks and Markov processes, with variable elimination and value iteration.)

Overview: Modelling Evolving Worlds with DBNs; Markov Chains; Hidden Markov Models; Inference in Temporal Models.

Modeling Evolving Worlds. So far we have looked at techniques for probabilistic reasoning in a static world. E.g., we keep collecting evidence to diagnose the cause of a fault in a system: the true cause does not change as one gathers new evidence; what changes is the posterior distribution over the possible causes.

Dynamic Bayesian Networks (DBNs). DBNs are an extension of Bayesian networks devised for reasoning under uncertainty in dynamic environments. Basic approach: the world's dynamics are captured via a series of snapshots, or time slices, each representing the state of the world at a specific point in time. Each time slice contains a set of random variables representing the state of the world at time t: the state variables Xt. E.g., a student's knowledge and morale during a tutoring session (state variables Knows-Add, Knows-Sub and Morale in each slice). This assumes discrete time; the step size depends on the problem. Notation: Xa:b = Xa, Xa+1, ..., Xb-1, Xb.

Stationary Processes. How do we build a Bnet from these time slices and their variables? We could use the procedure we defined for building static Bnets: order the variables (temporally), insert them in the network one at a time, and find suitable parents by checking conditional dependencies given their predecessors. First problem: we could have a very long sequence of time slices; how do we specify CPTs for all of them? Assumption of stationary process: the mechanism that regulates how state variables change over time is stationary, that is, it can be described by a single transition model P(Xt | Xt-1). Note that Xt is a vector representing a set of state variables.

Markov Assumption. Second problem: there could be an infinite number of parents for each node, coming from all previous time slices. Markov assumption: the current state Xt depends on a bounded subset of the previous states X0:t-1. Processes satisfying this assumption are called Markov processes or Markov chains.

Simplest Possible DBN. One random variable for each time slice: let's assume St represents the state at time t, with domain {s1 ... sn}. Each random variable depends only on the previous one, thus P(St | St-1, ..., S0) = P(St | St-1). Intuitively, St conveys all of the information about the history that can affect the future states: "the future is independent of the past given the present."

Simplest Possible DBN (cont'). How many CPTs do we need to specify? Under the stationary process assumption, the mechanism that regulates how state variables change over time can be described by a single transition model P(St | St-1), so only two CPTs are needed: P(S0) and P(St | St-1).

Stationary Markov Chain (SMC). A stationary Markov chain satisfies, for all t > 0: P(St+1 | S0, ..., St) = P(St+1 | St) (the first-order Markov assumption), and P(St+1 | St) is the same for every t (stationary process). We only need to specify P(S0) and P(St+1 | St): a simple model, easy to specify, often the natural one, and the network can extend indefinitely. Variations of SMCs are at the core of most Natural Language Processing (NLP) applications!

Stationary Markov Chain: Example. The domain of variable Si is {t, q, p, a, h, e}. We only need to specify P(S0), the probability distribution over the initial state (e.g., P(S0 = t) = 0.6 and P(S0 = q) = 0.4), and the stochastic transition matrix P(St+1 | St), with one row for each value of St (the slide shows, e.g., the rows P(St+1 | St = q) and P(St+1 | St = a)).

Markov Chain: Inference. Probability of a sequence of states S0, ..., ST: P(S0, ..., ST) = P(S0) P(S1 | S0) P(S2 | S1) ... P(ST | ST-1). Example: P(t, q, p) = P(S0 = t) P(S1 = q | S0 = t) P(S2 = p | S1 = q), read off the initial distribution and the transition matrix above.

Key problems in NLP: assign a probability to a sentence, e.g. P("I made her duck") = P(w1, w2, w3, w4), needed for part-of-speech tagging, word-sense disambiguation and probabilistic parsing; and predict the next word, needed for speech recognition, hand-writing recognition, and augmentative communication for the disabled; also summarization, machine translation, and more. But the required joint probabilities are impossible to estimate directly!

Impossible to estimate! Assuming 10^5 words in the dictionary and an average sentence of 10 words, we would need to specify (10^5)^10 = 10^50 possible worlds (entries in the JPD). The Google language repository (22 Sept. 2006) contained "only" 95,119,665,584 sentences, roughly 10^11: not nearly enough to learn the probabilities from frequencies in this dataset (or corpus), since most sentences will not appear, or will appear only once.

What can we do? Make a strong simplifying assumption: sentences are generated by a Markov chain! P(The big red dog barks) = P(The | <S>) * P(big | The) * P(red | big) * P(dog | red) * P(barks | dog). These probabilities can be assessed in practice!
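For concreteness, here is a minimal sketch of such a bigram chain in Python; the probabilities in the table are made-up illustrative values, not estimates from any corpus.

```python
# A bigram Markov-chain model of sentences. The probabilities below are
# made-up illustrative values; in practice they are estimated from a corpus.
BIGRAM = {
    ("<S>", "The"): 0.2,
    ("The", "big"): 0.05,
    ("big", "red"): 0.02,
    ("red", "dog"): 0.1,
    ("dog", "barks"): 0.3,
}

def sentence_probability(words):
    """P(w1 ... wn) ~ product over i of P(w_i | w_{i-1}), starting from <S>."""
    p, prev = 1.0, "<S>"
    for w in words:
        p *= BIGRAM.get((prev, w), 0.0)   # unseen bigrams get probability 0 here
        prev = w
    return p

print(sentence_probability(["The", "big", "red", "dog", "barks"]))  # 6e-06
```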

How can we minimally extend Markov chains? A useful situation to model is one in which the reasoning system does not have access to the states, but can make observations that give some information about the current state.

Hidden Markov Model. A Hidden Markov Model (HMM) starts with a Markov chain and adds a noisy observation about the state at each time step. With |domain(S)| = k and |domain(O)| = h: P(S0) specifies the initial conditions (k probabilities); P(St+1 | St) specifies the dynamics (a k × k matrix of probabilities); and P(Ot | St) specifies the sensor model (a k × h matrix of probabilities). Markov assumption on evidence: P(Ot | S0:t, O0:t-1) = P(Ot | St).

Simple Example (we'll use this as a running example). A guard is stuck in a high-security bunker and would like to know if it is raining outside. He can only tell by observing whether his boss comes into the bunker with an umbrella each day. The transition model is over the state variable Rain; the observation model is over the observable variable Umbrella.

Discussion. Note that the first-order Markov assumption implies that the state variables contain all the information necessary to characterize the probability distribution over the next time slice. Sometimes this assumption is only an approximation of reality: whether it rains or not today may depend on the weather on more days than just the previous one. Possible fixes: increase the order of the Markov chain (e.g., add Raint-2 as a parent of Raint), or add state variables that can compensate for the missing temporal information. Such as?

Rain Network. We could add Month to each time slice to include season statistics (nodes Montht and Raint in each slice, with Umbrellat as the observation).

Rain Network. Or we could add Temperature, Humidity and Pressure to each time slice to include meteorological knowledge in the network.

Rain Network. However, adding more state variables may require modelling their temporal dynamics in the network. A trick to get away without doing so: add sensors that can tell us the value of each new variable at each specific point in time (e.g., a thermometer for Temperature, a barometer for Pressure). The more reliable a sensor, the less important it is to model temporal dynamics to get accurate estimates of the corresponding variable.

Overview: Modelling Evolving Worlds with DBNs; Markov Chains; Hidden Markov Models; Inference in Temporal Models.

Inference Tasks in Temporal Models. Filtering (or monitoring): P(Xt | e0:t). Compute the posterior distribution over the current state given all evidence to date. In the rain example, this would mean computing the probability that it rains today given all the umbrella observations made so far. Important if a rational agent needs to make a decision in the current situation. Prediction: P(Xt+k | e0:t). Compute the posterior distribution over a future state given all evidence to date. In the rain example, this would mean computing the probability that it rains in two days given all the umbrella observations made so far. Useful for an agent to evaluate possible courses of action.

Inference Tasks in Temporal Models. Smoothing: P(Xt-k | e0:t). Compute the posterior distribution over a past state given all evidence to date. In the rain example, this would mean computing the probability that it rained five days ago given all the umbrella observations made so far. Useful to better estimate what happened, by incorporating evidence that arrived after the time in question. Most Likely Explanation: argmax X0:t P(X0:t | e0:t). Given a sequence of observations, find the sequence of states that is most likely to have generated them. Useful in many applications, e.g., speech recognition: find the most likely sequence of words given a sequence of sounds.

Filtering. Idea: recursive approach. Compute filtering up to time t-1, and then include the evidence for time t (recursive estimation).

Filtering, recursive approach: compute filtering up to time t-1, then include the evidence for time t.
P(St | o0:t) = P(St | o0:t-1, ot)   (dividing up the evidence)
= α P(ot | St, o0:t-1) P(St | o0:t-1)   (Bayes rule)
= α P(ot | St) P(St | o0:t-1)   (Markov assumption on evidence)
The second factor is the prediction of the current state given evidence up to t-1; the first factor is the inclusion of the new evidence, which is available from the sensor model. So we only need to compute P(St | o0:t-1).

Filtering: compute P(St | o0:t-1).
P(St | o0:t-1) = ∑st-1 P(St, st-1 | o0:t-1)   (marginalization)
= ∑st-1 P(St | st-1, o0:t-1) P(st-1 | o0:t-1)   (product rule: P(A,B) = P(A|B) P(B))
= ∑st-1 P(St | st-1) P(st-1 | o0:t-1)   (Markov assumption)
where P(St | st-1) is the transition model and P(st-1 | o0:t-1) is filtering at time t-1. Putting it all together, we have the desired recursive formulation:
P(St | o0:t) = α P(ot | St) ∑st-1 P(St | st-1) P(st-1 | o0:t-1)
i.e., inclusion of new evidence (sensor model) × propagation to time t × filtering at time t-1. P(st-1 | o0:t-1) can be seen as a message f0:t-1 that is propagated forward along the sequence, modified by each transition and updated by each observation.

Filtering. Thus, the recursive definition of filtering at time t in terms of filtering at time t-1 can be expressed as a FORWARD procedure: f0:t = α FORWARD(f0:t-1, ot), which implements the update P(St | o0:t) = α P(ot | St) ∑st-1 P(St | st-1) P(st-1 | o0:t-1).

Analysis of Filtering. Because of the recursive definition in terms of the forward message, when all variables are discrete the time for each update is constant (i.e., independent of t). The constant depends, of course, on the size of the state space and the type of temporal model.

Rain Example. Suppose our security guard comes with a prior belief of 0.5 that it rained on day 0, just before the observation sequence starts. Without loss of generality, this can be modeled with a fictitious state R0 with no associated observation and P(R0) = <0.5, 0.5>. The CPTs are: transition model P(Rt = t | Rt-1 = t) = 0.7, P(Rt = t | Rt-1 = f) = 0.3; sensor model P(Ut = t | Rt = t) = 0.9, P(Ut = t | Rt = f) = 0.2. The one-step prediction is P(R1 | o0:0) = P(R1) = ∑r0 P(R1 | r0) P(r0) = <0.7, 0.3> * 0.5 + <0.3, 0.7> * 0.5 = <0.5, 0.5>. Day 1: the umbrella appears (u1). Thus:

Rain Example. Updating P(R1) with the evidence for t = 1 (umbrella appeared) gives P(R1 | u1) = α P(u1 | R1) P(R1) = α <0.9, 0.2> <0.5, 0.5> = α <0.45, 0.1> ≈ <0.818, 0.182> (an element-wise vector product, then normalization). Day 2: the umbrella appears again (u2). The prediction step gives P(R2 | u1) = ∑r1 P(R2 | r1) P(r1 | u1) = <0.7, 0.3> * 0.818 + <0.3, 0.7> * 0.182 ≈ <0.627, 0.373>.

Rain Example. Updating this with the evidence for t = 2 (umbrella appeared) gives P(R2 | u1, u2) = α P(u2 | R2) P(R2 | u1) = α <0.9, 0.2> <0.627, 0.373> = α <0.564, 0.075> ≈ <0.883, 0.117>. Intuitively, the probability of rain increases because the umbrella appears twice in a row.
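The whole recursive update fits in a few lines of code. Here is a minimal sketch, assuming the transition and sensor CPTs above, that reproduces these numbers:

```python
# Filtering (the FORWARD procedure) for the umbrella HMM.
# States: index 0 = rain, index 1 = no-rain.
T = [[0.7, 0.3],   # P(R_t | R_{t-1} = rain)
     [0.3, 0.7]]   # P(R_t | R_{t-1} = no-rain)
SENSOR = {True: [0.9, 0.2],   # P(umbrella | rain), P(umbrella | no-rain)
          False: [0.1, 0.8]}  # P(no umbrella | rain), P(no umbrella | no-rain)

def forward(f, umbrella):
    """One filtering step: propagate through the transition model,
    then include the new evidence and normalize (the alpha factor)."""
    predicted = [sum(T[s][j] * f[s] for s in range(2)) for j in range(2)]
    unnorm = [SENSOR[umbrella][j] * predicted[j] for j in range(2)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * p for p in unnorm]

f = [0.5, 0.5]              # P(R0)
for u in [True, True]:      # umbrella observed on day 1 and day 2
    f = forward(f, u)
    print([round(x, 3) for x in f])   # [0.818, 0.182], then [0.883, 0.117]
```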

Prediction: P(St+k+1 | o0:t). Can be seen as filtering without the addition of new evidence; in fact, filtering already contains a one-step prediction: P(St | o0:t) = α P(ot | St) ∑st-1 P(St | st-1) P(st-1 | o0:t-1). We just need to show how to recursively predict the state at time t+k+1 from a prediction for state t+k:
P(St+k+1 | o0:t) = ∑st+k P(St+k+1, st+k | o0:t) = ∑st+k P(St+k+1 | st+k, o0:t) P(st+k | o0:t) = ∑st+k P(St+k+1 | st+k) P(st+k | o0:t),
i.e., the transition model applied to the prediction for state t+k. Let's continue with the rain example and compute the probability of rain on day four after having seen the umbrella on days one and two: P(R4 | u1, u2).

Rain Example. Prediction from day 2 to day 3: P(R3 | u1, u2) = ∑r2 P(R3 | r2) P(r2 | u1, u2) = <0.7, 0.3> * 0.883 + <0.3, 0.7> * 0.117 = <0.618, 0.265> + <0.035, 0.082> = <0.653, 0.347>. Prediction from day 3 to day 4: P(R4 | u1, u2) = ∑r3 P(R4 | r3) P(r3 | u1, u2) = <0.7, 0.3> * 0.653 + <0.3, 0.7> * 0.347 = <0.457, 0.196> + <0.104, 0.243> = <0.561, 0.439>.

Rain Example. Intuitively, the probability that it will rain decreases for each successive day, as the influence of the observations from the first two days decays. What happens if we try to predict further and further into the future? It can be shown that the predicted distribution converges to the stationary distribution of the Markov process defined by the transition model (<0.5, 0.5> for the rain example). Once convergence happens, we have essentially lost all the information provided by the existing observations and cannot generate any meaningful prediction on states from that point on. The time necessary to reach this point is called the mixing time. The more uncertainty there is in the transition model, the shorter the mixing time will be: the more uncertain what happens at t+1 is given what happens at t, the faster the information gained from evidence about the state at t dissipates with time.
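As a quick check of this convergence, here is a small sketch that repeatedly applies the transition model (with no new evidence) to the day-2 estimate:

```python
# k-step prediction is filtering without evidence: repeatedly applying the
# transition model to P(R2 | u1, u2) = <0.883, 0.117> converges to the
# stationary distribution <0.5, 0.5>.
T = [[0.7, 0.3], [0.3, 0.7]]

def predict(p):
    return [sum(T[s][j] * p[s] for s in range(2)) for j in range(2)]

p = [0.883, 0.117]
for day in range(3, 10):
    p = predict(p)
    print(day, [round(x, 3) for x in p])
# day 3: [0.653, 0.347], day 4: [0.561, 0.439], ..., approaching [0.5, 0.5]
```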

Another Example: Localization for a "Pushed Around" Robot. Localization (where am I?) is a fundamental problem in robotics. Suppose a robot is in a circular corridor with 16 locations, with doors at positions 2, 4, 7 and 11. The robot initially doesn't know where it is. The robot is pushed around: after a push it can stay in the same location or move left or right. The robot has a noisy sensor telling it whether or not it is in front of a door.

This scenario can be represented with the following stochastic dynamics: when pushed, the robot stays in the same location with probability 0.2, and moves left or right with probability 0.4 each; this specifies P(Loct+1 | Loct). The initial distribution is uniform: P(Loc0) = 1/16 for each location in the corridor.


The scenario also includes a noisy sensor telling whether the robot is in front of a door: if it is in front of a door, P(Ot = t) = 0.8; if not, P(Ot = t) = 0.1. This specifies the sensor model P(Ot | Loct) for each of the 16 locations.

Useful inference in this problem: localization. The robot starts at an unknown location and is pushed around t times; it wants to determine where it is, P(Loct | o0, o1, ..., ot). This is an instance of filtering: compute the posterior distribution over the current state given all evidence to date, P(St | o0:t).

More Complex Example: Robot Localization. Suppose a robot wants to determine its location based on its actions and its sensor readings. It has three actions: goRight, goLeft, and Stay. This can be represented by an augmented HMM, in which the transition model is also conditioned on the action.


Robot Localization: Sensor and Dynamics Model. Sample sensor model: assume the same as for the pushed-around robot. Sample stochastic dynamics P(Loct+1 | Actiont, Loct): P(Loct+1 = L | Actiont = goRight, Loct = L) = 0.1; P(Loct+1 = L+1 | Actiont = goRight, Loct = L) = 0.8; P(Loct+1 = L+2 | Actiont = goRight, Loct = L) = 0.074; P(Loct+1 = L' | Actiont = goRight, Loct = L) = 0.002 for each other location L'. All location arithmetic is modulo 16. The action goLeft works the same but to the left; the action Stay is deterministic.

Dynamics Model, More Details: the slide illustrates the goRight dynamics above on the circular corridor of 16 locations.
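As a sketch, these goRight dynamics can be assembled into a full 16 × 16 stochastic transition matrix (the helper name go_right_row is ours, for illustration); note that each row sums to 0.1 + 0.8 + 0.074 + 13 * 0.002 = 1.0.

```python
# Transition model P(Loc_{t+1} | Action_t = goRight, Loc_t) for the circular
# corridor with 16 locations; all location arithmetic is modulo 16.
N = 16

def go_right_row(L):
    """Distribution over Loc_{t+1} given goRight from location L (0-indexed)."""
    row = [0.002] * N           # every other location
    row[L] = 0.1                # stay put
    row[(L + 1) % N] = 0.8      # move one step right
    row[(L + 2) % N] = 0.074    # overshoot by one
    return row

T_right = [go_right_row(L) for L in range(N)]
assert all(abs(sum(row) - 1.0) < 1e-9 for row in T_right)
```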

Example Inference: what is the probability distribution for the robot's location at time 2, given the sequence of observations and actions shown on the slide?

Robot Localization with an Additional Sensor. Additional light sensor: there is light coming through an opening at location 10, and the light sensor detects whether or not there is light at the robot's current location. What do we need to specify?

Additional light sensor: there is light coming through an opening at location 10, so we need to specify P(Lt | Loct), i.e., P(Lt = t) and P(Lt = f) for each location. Information from the two sensors is combined: "sensor fusion". Do we need to do anything special to make this sensor fusion happen?

The robot starts at an unknown location and must determine where it is. The model appears to be too ambiguous: the sensors are too noisy and the dynamics too stochastic to infer anything. But inference actually works pretty well; check the demo at http://artint.info/demos/localization/localization.html. It uses a generalized form of filtering: not just a sequence of observations, but a sequence of observation pairs (one from each of the two sensors) plus the actions.

HMMs have many other applications… Natural Language Processing, e.g., speech recognition (states: phonemes \ words; observations: acoustic signal \ phonemes). Bioinformatics, e.g., gene finding (states: coding / non-coding regions; observations: DNA sequences). For these problems the critical inference is: find the most likely sequence of states given a sequence of observations.

Most Likely Sequence. Suppose that in the rain example we have the umbrella observation sequence [true, true, false, true, true]. Is it a perfect reflection of the rain situation, [rain, rain, no-rain, rain, rain]? Or perhaps it did rain on the third day but the boss forgot to bring the umbrella? If it did not rain on day 3, perhaps it also did not rain on day 4, but the boss brought the umbrella just in case. There are 2^5 possible sequences of states to consider.

Most Likely Sequence (Explanation). Most likely sequence: argmax x0:t P(X0:t | e0:t). General idea: search in a graph whose nodes are the possible states at each time step.

Most Likely Sequence (to a single state). Suppose we want to find the most likely path to state St+1. Because of the Markov assumption, this can be found from: the most likely path to each state st at step t, and the state st at step t that maximizes the path to St+1. This gives a recursive relationship between the most likely path to St+1 and the most likely path to St:
max s1,...,st P(s1, ..., st, St+1 | o1:t+1) = α P(ot+1 | St+1) max st [P(St+1 | st) max s1,...,st-1 P(s1, ..., st-1, st | o1:t)]
See the derivation on the next slide if you need it.

Most Likely Sequence: derivation.
max s1,...,st P(s1, ..., st, St+1 | o1:t+1)
= max s1,...,st P(s1, ..., st, St+1 | o1:t, ot+1)
= max s1,...,st α P(ot+1 | o1:t, s1, ..., st, St+1) P(s1, ..., st, St+1 | o1:t)   (Bayes rule)
= max s1,...,st α P(ot+1 | St+1) P(s1, ..., st, St+1 | o1:t)   (Markov assumption on evidence)
= max s1,...,st α P(ot+1 | St+1) P(St+1 | s1, ..., st, o1:t) P(s1, ..., st | o1:t)   (product rule)
= max s1,...,st α P(ot+1 | St+1) P(St+1 | st) P(s1, ..., st-1, st | o1:t)   (Markov assumption)
= α P(ot+1 | St+1) max st [P(St+1 | st) max s1,...,st-1 P(s1, ..., st-1, st | o1:t)]

Most Likely Sequence. The recursion max s1,...,st P(s1, ..., st, St+1 | o1:t+1) = α P(ot+1 | St+1) max st [P(St+1 | st) max s1,...,st-1 P(s1, ..., st-1, st | o1:t)] is identical to filtering, P(St | o0:t) = α P(ot | St) ∑st-1 P(St | st-1) P(st-1 | o0:t-1), except that: the forward message f0:t-1 = P(St-1 | o0:t-1) is replaced by the message m1:t = max s1,...,st-1 P(s1, ..., st-1, St | o1:t) (*), which plays the role of the recursive call; and the summation in the filtering equation is replaced by maximization.

Viterbi Algorithm. Computes the most likely sequence to St+1 by running forward along the sequence, computing the m message at each time step using (*) from the previous slide. At the end it has the most likely sequence to each of the final states, from which we can pick the most likely one overall. See the derivation of the first three m messages (m1:1 to m1:3) in the next couple of slides.

Rain Example. Recall: max s1,...,st P(s1, ..., st, St+1 | o1:t+1) = P(ot+1 | St+1) max st [P(St+1 | st) m1:t], with m1:t = max s1,...,st-1 P(s1, ..., st-1, St | o1:t). m1:1 is just P(R1 | u1) = <0.818, 0.182>. Then m1:2 = P(u2 | R2) <max[P(r2 | r1) * 0.818, P(r2 | ¬r1) * 0.182], max[P(¬r2 | r1) * 0.818, P(¬r2 | ¬r1) * 0.182]> = <0.9, 0.2> <max(0.7 * 0.818, 0.3 * 0.182), max(0.3 * 0.818, 0.7 * 0.182)> = <0.9, 0.2> * <0.573, 0.245> = <0.515, 0.049>.

Rain Example (continued). m1:3 = P(¬u3 | R3) <max[P(r3 | r2) * 0.515, P(r3 | ¬r2) * 0.049], max[P(¬r3 | r2) * 0.515, P(¬r3 | ¬r2) * 0.049]> = <0.1, 0.8> <max(0.7 * 0.515, 0.3 * 0.049), max(0.3 * 0.515, 0.7 * 0.049)> = <0.1, 0.8> * <0.361, 0.155> = <0.036, 0.124>.

Viterbi Algorithm. Time complexity is linear in the length of the sequence, O(t). Space is also linear in t, because the algorithm needs to save the pointers indicating the best sequence to each state.
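Below is a runnable sketch of this recursion on the umbrella example, assuming the same CPTs as before; it reproduces m1:2 = <0.515, 0.049> and m1:3 = <0.036, 0.124> and recovers the most likely sequence with backpointers.

```python
# Viterbi on the umbrella HMM. States: index 0 = rain, index 1 = no-rain.
T = [[0.7, 0.3], [0.3, 0.7]]                    # T[prev][next]
SENSOR = {True: [0.9, 0.2], False: [0.1, 0.8]}  # P(umbrella obs | state)

def viterbi(observations):
    # m_{1:1} is just the filtered P(R1 | u1) = <0.818, 0.182>
    pred = [sum(T[s][j] * 0.5 for s in range(2)) for j in range(2)]
    m = [SENSOR[observations[0]][j] * pred[j] for j in range(2)]
    m = [x / sum(m) for x in m]
    backpointers = []
    for obs in observations[1:]:
        # for each next state j, pick the best predecessor s
        best = [max(range(2), key=lambda s: T[s][j] * m[s]) for j in range(2)]
        m = [SENSOR[obs][j] * T[best[j]][j] * m[best[j]] for j in range(2)]
        backpointers.append(best)
        print([round(x, 3) for x in m])   # [0.515, 0.049], then [0.036, 0.124]
    # recover the most likely sequence by following backpointers from the best end state
    seq = [max(range(2), key=lambda j: m[j])]
    for best in reversed(backpointers):
        seq.append(best[seq[-1]])
    return list(reversed(seq))

print(viterbi([True, True, False]))  # [0, 0, 1]: rain, rain, no-rain
```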

DBN and HMM. Every HMM is a DBN with only one state variable. Every DBN can be turned into an HMM: just create a mega-variable whose values represent all possible combinations of values for the variables in the DBN.

DBN and HMM. The student learning example in DBN format has state variables Knows-Add (Boolean), Knows-Sub (Boolean) and Morale ({low, high}), each depending on its value in the previous time slice. The equivalent HMM has only one variable, StudentState, with 8 possible states representing all possible combinations of knowing/not knowing addition and subtraction and having high/low morale.

DBN and HMM. If they are equivalent, why bother making a distinction? The main difference is that decomposing the state into variables allows one to explicitly represent dependencies among them; sparse dependencies mean exponentially fewer parameters. Suppose we have a Bnet with 20 Boolean state variables, each with 3 parents in the previous time slice: the transition model has 20 × 2^3 = 160 probabilities. The equivalent HMM has 1 state variable with 2^20 states, and 2^20 × 2^20 = 2^40 ≈ 10^12 entries in the transition matrix: really bad space and time complexity, and really bad having to specify so many parameters.

DBN and HMM. Even when the number of states in the equivalent HMM is not that large (e.g., the student learning model), specifying the numbers in the transition matrix can be less intuitive than specifying the more modular conditional probabilities in the Bnet. E.g., what is the probability that the student goes from knowing addition, not knowing subtraction and having low morale, to knowing addition, not knowing subtraction and having high morale?

Exact Inference in DBNs. Since DBNs are Bayesian networks, we can use any of the existing algorithms for exact inference. Given a sequence of observations, construct the full Bayesian network by creating as many time slices as are needed to fit the whole sequence; this technique is called unrolling. But this is not such a great idea, as it would require O(t) space to unroll the network; besides, if we update the network anew for each new piece of evidence, time complexity would also be O(t).

Exact Inference in DBNs. It is better to use the recursive definition of filtering that we saw earlier: P(Xt | e0:t) = α P(et | Xt) ∑xt-1 P(Xt | xt-1) P(xt-1 | e0:t-1), where Xt is now a set of state variables and e is a set of observations (filtering at time t-1, propagation to time t, inclusion of new evidence via the sensor model). The filtering step basically sums out the state variables in the previous time slice to get the distribution in the current time slice: this is exactly variable elimination run with the variables in temporal order.

Rollup Filtering. This algorithm keeps only two time slices in memory at any given time: start with slice 0; add slice 1 and sum out slice 0; add slice 2 and sum out slice 1; ...; add slice n and sum out slice n-1.

Rollup Filtering. Rollup filtering requires "constant" (independent of t) space and time for each update. Unfortunately the "constant" is, in most cases, exponential in the number of state variables: as variable elimination proceeds, the factors grow to include all state variables that have parents in the previous time slice. Thus, while in principle DBNs allow us to represent temporal processes with many sparsely connected variables, in practice we cannot reason efficiently and exactly about those processes. We need to use sampling algorithms; the most promising is particle filtering.

Why not Likelihood Weighting? LW avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e: it fixes the values of the evidence variables E and samples only the remaining variables Z, i.e., the query X and the hidden variables Y. It still needs to account for the influence of the given evidence on the probability of the samples: LW approximates the posterior distribution by weighting each sample by the probability it affords to the evidence.

Example: P(Rain | sprinkler, wet-grass), with CPTs P(C) = 0.5; P(S | C) = 0.1 and P(S | ¬C) = 0.5; P(R | C) = 0.8 and P(R | ¬C) = 0.2; P(W | S, R) = 0.99. Cloudy is sampled: random number 0.4 < 0.5, so sample = cloudy; w1 = 1. Sprinkler is fixed by the evidence: no sampling, but adjust the weight, w2 = w1 * P(sprinkler | cloudy) = 0.1. Rain is sampled: random number 0.4 < 0.8, so sample = rain. Wet Grass is fixed by the evidence: no sampling, but adjust the weight, w3 = w2 * P(wet-grass | sprinkler, rain) = 0.1 * 0.99 = 0.099.
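Here is a minimal sketch of likelihood weighting on this network; P(C), P(S|C), P(R|C) and P(W|s,r) = 0.99 are from the slide, while the remaining P(W|S,R) entries are assumed values for completeness.

```python
import random

# Likelihood weighting for P(Rain | sprinkler = true, wet-grass = true).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                     # P(sprinkler | Cloudy)
P_R = {True: 0.8, False: 0.2}                     # P(rain | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,   # P(wet-grass | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.0}  # entries besides 0.99 are assumed

def weighted_sample():
    w = 1.0
    cloudy = random.random() < P_C          # Cloudy is sampled
    w *= P_S[cloudy]                        # Sprinkler fixed to true: adjust weight
    rain = random.random() < P_R[cloudy]    # Rain is sampled
    w *= P_W[(True, rain)]                  # Wet Grass fixed to true: adjust weight
    return rain, w

totals = {True: 0.0, False: 0.0}
for _ in range(100_000):
    rain, w = weighted_sample()
    totals[rain] += w
print("P(rain | sprinkler, wet-grass) ~", totals[True] / (totals[True] + totals[False]))
```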

Why not Likelihood Weighting? Two problems. 1) The standard algorithm runs each sample in turn, all the way through the network: time and space requirements grow with t!

2) LW does not work well when evidence comes "late" in the network topology, because the samples are then largely independent of the evidence. This is true in HMMs, where the state variables don't have evidence as parents: in the rain network, we could observe umbrella every day and still get samples that are sequences of sunny days. The number of samples necessary to get good approximations increases exponentially with t!

Particle Filtering. Designed to fix both problems: run all N samples together through the network, one slice at a time. This is a form of filtering where the N samples are the forward message: essentially, the samples themselves are used as an approximation of the current state distribution. There is no need to unroll the network; all that is needed is the current and next slice, giving a "constant" update per time step. STEP 0: generate a population of N initial-state samples by sampling from the initial state distribution P(X0) (the slides illustrate the steps with N = 10).

Particle Filtering, STEP 1: propagate each sample for xt forward by sampling the next state value xt+1 based on the transition model P(Xt+1 | xt).

Particle Filtering, STEP 2: weight each sample by the likelihood it assigns to the evidence. E.g., assume we observe no umbrella at t+1: each sample xt+1 gets weight P(¬ut+1 | xt+1) from the sensor model.

Now we need to take care of problem #2: generate samples that are a better simulation of reality. Idea: focus the set of samples on the high-probability regions of the state space, throwing away samples with very low weight according to the evidence and replicating those with high weight, to obtain a sample population closer to reality. Particle Filtering, STEP 3: create a new population from the samples at Xt+1, i.e., resample the population so that the probability that each sample is selected is proportional to its weight.

Then start the particle filtering cycle again from the new sample population.
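Putting STEPs 0 to 3 together, here is a minimal runnable sketch of particle filtering on the umbrella model (Python's random.choices handles the weighted resampling); with enough particles the estimate matches the exact filtering result P(R2 | u1, u2) ≈ 0.883 computed earlier.

```python
import random

# One particle-filtering pass over the umbrella model: propagate (STEP 1),
# weight by the evidence (STEP 2), resample in proportion to weight (STEP 3).
P_RAIN_NEXT = {True: 0.7, False: 0.3}   # P(Rain_{t+1} = true | Rain_t)
P_UMBRELLA = {True: 0.9, False: 0.2}    # P(Umbrella = true | Rain)

def particle_filter_step(particles, umbrella_observed):
    propagated = [random.random() < P_RAIN_NEXT[r] for r in particles]      # STEP 1
    weights = [P_UMBRELLA[r] if umbrella_observed else 1.0 - P_UMBRELLA[r]
               for r in propagated]                                         # STEP 2
    return random.choices(propagated, weights=weights, k=len(particles))    # STEP 3

N = 10_000
particles = [random.random() < 0.5 for _ in range(N)]   # STEP 0: sample P(X0)
for u in [True, True]:                                  # umbrella on days 1 and 2
    particles = particle_filter_step(particles, u)
print(sum(particles) / N)   # ~0.883, the exact filtering result for P(R2 | u1, u2)
```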

Is PF Efficient? In practice, the approximation error of particle filtering remains bounded over time. It is also possible to prove that the approximation maintains bounded error with high probability (under specific assumptions).

Representational Dimensions. This concludes the module on answering queries in stochastic environments. (Big-picture map, as at the start of the lecture: constraint satisfaction via search and arc consistency; logic queries via search; belief networks, a.k.a. Bayesian networks, via variable elimination and approximate inference, now extended with temporal inference; planning via search and STRIPS; decision networks via variable elimination; Markov processes via value iteration.)

Learning Goals for Probability and Time. Explain what a dynamic Bayesian network is and when it is a useful R&R tool. Identify real-world scenarios that can be modeled via DBNs. State the stationary process assumption, the Markov assumption, and the Markov assumption on evidence, and explain why they are made. Define Markov chains and explain when they are useful. Define Hidden Markov Models and explain when they are useful. Explain/write in pseudocode/implement/trace/debug algorithms for temporal inference: filtering, prediction, Viterbi. Explain the relation between DBNs and HMMs. Translate a problem represented as an HMM into a corresponding DBN (and vice versa). Compare the pros and cons of the two representations. Explain/implement rollup filtering. Explain particle filtering for discrete networks.