
1 Search and Planning for Inference and Learning in Computer Vision
Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu

2 Markov Decision Processes & Reinforcement Learning
Sinisa Todorovic and Iasonas Kokkinos June 7, 2015

3 Multi-Armed Bandit Problem
A gambler faces K slot machines ("one-armed bandits"). Each machine provides a random reward drawn from an unknown distribution specific to that machine.
Problem: in which order to play the machines so as to maximize the sum of rewards over a sequence of lever pulls.
[Figure: a state s with arms a1, ..., aK and rewards R(s, a1), ..., R(s, aK)]
Robbins 1952
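A minimal simulation of this setup (a sketch, not taken from the slides): the UCB1 index rule, one standard bandit strategy, played against Bernoulli arms. The arm probabilities in `true_means`, the horizon, and the seed are illustrative assumptions.

```python
import math, random

def ucb1_bandit(true_means, horizon=10000, seed=0):
    """Play K Bernoulli bandit arms with the UCB1 index rule."""
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K          # n(a): number of pulls of each arm
    values = [0.0] * K        # empirical mean reward of each arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            a = t - 1         # pull every arm once first
        else:
            # index = exploitation term + exploration bonus
            a = max(range(K),
                    key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental mean update
        total_reward += r
    return total_reward, counts

# Example with made-up arm probabilities: the best arm should dominate the pull counts.
print(ucb1_bandit([0.2, 0.5, 0.7]))
```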

4 Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning

5 Discrete Stochastic Process
A collection of indexed random variables with a well-defined ordering, characterized by the probabilities that the variables take given values, called states.
[Figure: portrait of Andrey Markov]

6 Stochastic Process Example
Classic example: the random walk.
Start at state X0 at time t0.
At time ti, take a step Zi, where P(Zi = -1) = p and P(Zi = +1) = 1 - p.
At time ti, the state is Xi = X0 + Z1 + ... + Zi.
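A short sketch of this random walk in code; the step probability p, the number of steps, and the seed are arbitrary choices for illustration.

```python
import random

def random_walk(x0=0, p=0.5, steps=100, seed=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z = -1) = p, P(Z = +1) = 1 - p."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if rng.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk(steps=10))  # one sample trajectory X_0, ..., X_10
```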

7 Markov Property
Also thought of as the "memoryless" property: a process has the Markov property if the probability that Xn+1 takes any given value depends only on Xn, not on the earlier history.
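In symbols (the standard statement of the property, added here for reference):

```latex
P(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \dots, X_0 = x_0) \;=\; P(X_{n+1} = x \mid X_n = x_n)
```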

8 Markov Chain
A discrete-time stochastic process with the Markov property.
Example: Google's PageRank, the likelihood that random link-following ends up on a given page.
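A toy sketch of the PageRank idea as a Markov chain: power iteration on a "Google matrix". The 3-page link graph and the damping factor 0.85 are invented for illustration.

```python
import numpy as np

# Hypothetical link graph as a column-stochastic matrix:
# P[j, i] = probability that a random surfer on page i follows a link to page j.
P = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
d = 0.85                      # damping factor
n = P.shape[0]
G = d * P + (1 - d) / n       # follow a link w.p. d, teleport uniformly w.p. 1-d

rank = np.full(n, 1.0 / n)    # start from the uniform distribution
for _ in range(100):          # power iteration converges to the stationary distribution
    rank = G @ rank
print(rank)                   # likelihood of ending up on each page
```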

9 Markov Decision Process (MDP)
A discrete-time stochastic control process, an extension of Markov chains.
Differences: the addition of actions (choice) and of rewards (motivation).
If the actions are fixed, an MDP reduces to a Markov chain.

10 Description of MDPs
Tuple (S, A, Pa(., .), R(.)):
S: state space
A: action space
Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
R(s): immediate reward at state s
Goal: maximize a cumulative function of the rewards (the utility function).
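One possible encoding of such a tuple, as a sketch; the two-state, two-action numbers below are invented, not part of the slides.

```python
# A tiny MDP specification: S, A, transition probabilities P_a(s, s'), rewards R(s).
# All numbers are hypothetical, just to make the tuple concrete.
mdp = {
    "states": ["s0", "s1"],
    "actions": ["stay", "go"],
    # P[(s, a)] maps s' -> Pr(s_{t+1} = s' | s_t = s, a_t = a)
    "P": {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
        ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
    },
    "R": {"s0": 0.0, "s1": 1.0},   # immediate reward at each state
    "gamma": 0.9,                  # discount factor for the cumulative reward
}
```

The value-iteration sketch after slide 13 assumes this same representation.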

11 Example MDP
[Figure: example MDP graph with state nodes and action nodes]

12 Solution to an MDP = Policy π
Given a state, the policy selects the optimal action regardless of history. The quality of a policy is measured by its value function.
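Written out (standard definitions, added for reference; γ is the discount factor):

```latex
\pi : S \to A, \qquad
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t) \;\middle|\; s_0 = s,\ a_t = \pi(s_t)\right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s.
```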

13 Learning Policy
Value Iteration
Policy Iteration
Modified Policy Iteration
Prioritized Sweeping

Notable variants

Value iteration: In value iteration (Bellman 1957), the π array is not used; instead, the value of π(s) is calculated whenever it is needed. Substituting the calculation of π(s) into the calculation of V(s) gives the combined step:
V(s) := R(s) + γ max_a Σ_{s'} Pa(s, s') V(s')

Policy iteration: In policy iteration (Howard 1960), step one is performed once, and then step two is repeated until it converges; then step one is performed again, and so on. Instead of repeating step two to convergence, it may be formulated and solved as a set of linear equations. This variant has the advantage of a definite stopping condition: when the array π does not change in the course of applying step one to all states, the algorithm has completed.

Modified policy iteration: In modified policy iteration (Puterman and Shin 1978), step one is performed once, then step two is repeated several times; then step one is performed again, and so on.

Prioritized sweeping: The steps are preferentially applied to states that are in some way important, whether based on the algorithm (there were recently large changes in V or π around those states) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
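A compact sketch of value iteration, implementing the combined update V(s) := R(s) + γ max_a Σ_{s'} Pa(s, s') V(s') quoted above. It reuses the dictionary-style MDP encoding sketched after slide 10; all numbers are hypothetical.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Bellman backups V(s) := R(s) + gamma * max_a sum_{s'} P_a(s, s') V(s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * V[s2] for s2, p in P[(s, a)].items()) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Greedy policy extraction once V has converged.
    pi = {s: max(actions,
                 key=lambda a: sum(p * V[s2] for s2, p in P[(s, a)].items()))
          for s in states}
    return V, pi

# Toy two-state example (same hypothetical numbers as the earlier MDP sketch).
states, actions = ["s0", "s1"], ["stay", "go"]
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "go"): {"s0": 0.7, "s1": 0.3}}
R = {"s0": 0.0, "s1": 1.0}
print(value_iteration(states, actions, P, R))
```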

14 Value Iteration
Example computation of V2(RU):
For action A: (.5)(10) + (.5)(0) = 5
For action S: (.5)(0) + (.5)(0) = 0
So V2(RU) = 10 + (.9)(5) = 14.5

k   Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0       10       10
2    0        4.5     14.5     19
3    2.03     8.55    18.55    24.18
4    4.76    11.79    19.26    29.23
5    7.45    15.30    20.81    31.82
6   10.23    17.67    22.72    33.68

15 Why So Interesting?
Solving an MDP is straightforward if the transition probabilities are known, but...
If the transition probabilities are unknown, the problem becomes reinforcement learning.

16 A Typical Agent In reinforcement learning (RL), an agent observes a state and takes an action. Afterwards, the agent receives a reward.
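A minimal version of that loop in code. The `CoinFlipEnv` and `RandomAgent` classes and their reset/step/act/learn interface are invented stand-ins, loosely following the common observe-act-reward pattern rather than any particular library.

```python
import random

class CoinFlipEnv:
    """Tiny made-up environment: reward 1 if the action matches a coin flip, 10 steps long."""
    def reset(self):
        self.t = 0
        return 0                                     # single dummy state
    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        return 0, reward, self.t >= 10               # (next_state, reward, done)

class RandomAgent:
    def act(self, state):
        return random.randint(0, 1)                  # observe a state, take an action
    def learn(self, s, a, r, s_next):
        pass                                         # a learning agent would update here

def run_episode(env, agent):
    """Generic RL loop: observe state, take action, receive reward, (optionally) learn."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total += reward
        state = next_state
    return total

print(run_episode(CoinFlipEnv(), RandomAgent()))
```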

17 Mission: Optimize Reward
Rewards are computed in the environment and are used to teach the agent how to reach a goal state.
The reward must signal what we ultimately want achieved, not necessarily subgoals.
Rewards may be discounted over time.
In general, the agent seeks to maximize the expected return.
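For the discounted case, the return of one observed reward sequence can be computed as below; the discount factor 0.9 is an arbitrary example value.

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one observed reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 10]))  # 0 + 0.9*0 + 0.81*10 = 8.1
```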

18 Monte Carlo Methods
Instead of computing expected returns exactly from a known model, estimate them by averaging returns observed in sampled episodes:
Qπ(s, a): expected (cumulative) reward when starting in state s, taking action a, and thereafter following policy π
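A first-visit Monte Carlo sketch for estimating Qπ(s, a) by averaging sampled returns. The episode format (lists of (state, action, reward) triples generated by following π) and the discount factor are assumptions made for illustration.

```python
from collections import defaultdict

def mc_q_estimate(episodes, gamma=0.9):
    """Estimate Q_pi(s, a) as the average first-visit return over sampled episodes."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        g = 0.0
        # Walk backwards so g accumulates the return that follows each time step.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:            # first-visit Monte Carlo
                returns_sum[(s, a)] += g
                returns_cnt[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}

# Two hypothetical episodes, each a list of (state, action, reward) triples.
episodes = [[("s0", "a", 0.0), ("s1", "a", 1.0)],
            [("s0", "b", 0.0), ("s1", "a", 0.5)]]
print(mc_q_estimate(episodes))
```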

19 Monte-Carlo Tree Search
Builds a tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy".
Key idea: use the statistics of previous trajectories to expand the tree in the most promising direction.
Unlike A* and branch-and-bound methods, it requires no heuristic functions.
Kocsis & Szepesvari, 2006; Browne et al., 2012

20 Monte-Carlo Tree Search
Selection: select the best state so far.
Expansion: take an action and move to a new state.
Simulation: roll out from the new state and observe the total reward.
Backpropagation: propagate the total reward of the simulation back up the tree.
Repeated until the maximum tree depth is reached.

21 Monte-Carlo Tree Search
During construction, each tree node s stores:
state-visitation count n(s)
action counts n(s, a)
action values Q(s, a)
Repeat until time is up:
select action a
update the statistics of each node s on the trajectory:
increment n(s) and, for the selected action a, n(s, a)
update Q(s, a) by the total reward of the simulation (see the code sketch below)
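A skeleton of this loop in code. The statistics n(s), n(s,a), Q(s,a) and the UCT selection rule follow the Kocsis & Szepesvari (2006) formulation cited on these slides, but the `ChainGame` environment and its interface are invented so the sketch can run.

```python
import math, random

class ChainGame:
    """Toy deterministic environment (an assumption): start at 0, actions move -1/+1,
    the episode ends at |s| >= 5, and only reaching +5 gives reward 1."""
    def actions(self, s):
        return (-1, +1)
    def step(self, s, a):
        return s + a
    def is_terminal(self, s):
        return abs(s) >= 5
    def reward(self, s):
        return 1.0 if s >= 5 else 0.0

def mcts(game, root, iters=2000, c=1.4, horizon=20, seed=0):
    rng = random.Random(seed)
    n_s, n_sa, q_sa = {}, {}, {}              # n(s), n(s,a), Q(s,a) as on the slide

    def uct(s, a):                            # exploitation + exploration bonus
        if n_sa.get((s, a), 0) == 0:
            return float("inf")
        return q_sa[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])

    for _ in range(iters):
        s, path = root, []
        # Selection / expansion: descend while states are already in the tree.
        while not game.is_terminal(s) and s in n_s and len(path) < horizon:
            a = max(game.actions(s), key=lambda a: uct(s, a))
            path.append((s, a))
            s = game.step(s, a)
        n_s.setdefault(s, 0)                  # expand the newly reached state
        # Simulation: random rollout policy until terminal or horizon.
        depth = len(path)
        while not game.is_terminal(s) and depth < horizon:
            s = game.step(s, rng.choice(game.actions(s)))
            depth += 1
        total = game.reward(s)
        # Backpropagation: update statistics of every (s, a) on the trajectory.
        for (si, ai) in path:
            n_s[si] = n_s.get(si, 0) + 1
            n_sa[(si, ai)] = n_sa.get((si, ai), 0) + 1
            q_sa[(si, ai)] = q_sa.get((si, ai), 0.0) \
                + (total - q_sa.get((si, ai), 0.0)) / n_sa[(si, ai)]
    # Recommend the most-visited root action.
    return max(game.actions(root), key=lambda a: n_sa.get((root, a), 0))

print(mcts(ChainGame(), root=0))   # should prefer the +1 action
```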


23 Monte-Carlo Tree Search
[Figure: UCT action-selection rule, trading off exploration against exploitation]
Theoretically, MCTS is guaranteed to converge to the optimal solution if run long enough. Practically, it often shows good anytime behavior.
Kocsis & Szepesvari, 2006; Browne et al., 2012
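The rule behind that trade-off is the UCT selection formula of Kocsis & Szepesvari (2006): pick the action maximizing an exploitation term plus an exploration bonus,

```latex
a^{*} \;=\; \arg\max_{a}\; \underbrace{Q(s,a)}_{\text{exploitation}} \;+\; \underbrace{c\,\sqrt{\frac{\ln n(s)}{n(s,a)}}}_{\text{exploration}}
```

where the constant c controls how strongly rarely-tried actions are favored.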

24 Acknowledgements NSF IIS DARPA MSEE FA

