MAKING COMPLEX DECISIONS

Outline
MDPs (Markov Decision Processes)
Sequential decision problems
Value iteration and policy iteration
POMDPs (partially observable MDPs)
Decision-theoretic agents
Decisions with multiple agents: game theory
Mechanism design

Sequential decision problems
An example

Sequential decision problems
Game rules:
The 4 x 3 environment shown in Figure 17.1(a)
Beginning in the start state, the agent chooses an action at each time step
The interaction ends when the agent reaches one of the goal states, marked +1 or -1
Actions are {Up, Down, Left, Right}
The environment is fully observable
Terminal states have rewards +1 and -1, respectively
All other states have a reward of -0.04

Sequential decision problems
The particular model of stochastic motion that we adopt is illustrated in Figure 17.1(b). Each action achieves the intended effect with probability 0.8; the rest of the time, the action moves the agent at right angles to the intended direction (probability 0.1 each way). If the agent bumps into a wall, it stays in the same square.
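To make these dynamics concrete, here is a minimal Python sketch of the 4 x 3 world; the state encoding, helper names, and terminal positions are my assumptions based on AIMA Figure 17.1, not something taken from the slides.

```python
# Minimal sketch of the 4 x 3 world (illustrative encoding).
# States are (column, row) pairs, 1-indexed; (2, 2) is the wall,
# (4, 3) and (4, 2) are the +1 and -1 terminal states.

STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}

def reward(s):
    """+1 / -1 in the terminal states, -0.04 everywhere else."""
    return TERMINALS.get(s, -0.04)

def move(s, delta):
    """Deterministic move; bumping into a wall or the grid edge keeps the agent in place."""
    nxt = (s[0] + delta[0], s[1] + delta[1])
    return nxt if nxt in STATES else s

def transition(s, a):
    """T(s, a, s') as a dict {s': probability}: 0.8 intended, 0.1 each right-angle slip."""
    if s in TERMINALS:
        return {s: 1.0}                          # terminal states are absorbing
    dx, dy = ACTIONS[a]
    outcomes = [(move(s, (dx, dy)), 0.8),        # intended direction
                (move(s, (dy, dx)), 0.1),        # slip 90 degrees one way
                (move(s, (-dy, -dx)), 0.1)]      # slip 90 degrees the other way
    T = {}
    for s2, p in outcomes:
        T[s2] = T.get(s2, 0.0) + p
    return T
```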

Sequential decision problems
Transition model: a specification of the outcome probabilities for each action in each possible state
Environment history: a sequence of states
Utility of an environment history: the sum of the rewards (positive or negative) received

Sequential decision problems
Definition of MDP
Markov Decision Process: the specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards
An MDP is defined by:
Initial state: S0
Transition model: T(s, a, s')
Reward function: R(s)

Sequential decision problems
Policy (denoted by π): a solution that specifies what the agent should do in any state the agent might reach
Optimal policy (denoted by π*): a policy that yields the highest expected utility

Sequential decision problems
An optimal policy for the world of Figure 17.1

Sequential decision problems
The balance of risk and reward changes depending on the value of R(s) for the nonterminal states; Figure 17.2(b) shows optimal policies for four different ranges of R(s).

Sequential decision problems
A finite horizon:
(i) A finite horizon means that there is a fixed time N after which nothing matters: the game is over
(ii) The optimal policy for a finite horizon is nonstationary (the optimal action in a given state could change over time)
(iii) More complex

Sequential decision problems
An infinite horizon:
(i) An infinite horizon means that there is no fixed time N after which nothing matters
(ii) The optimal policy for an infinite horizon is stationary (the optimal action depends only on the current state)
(iii) Simpler

Sequential decision problems
Calculating the utility of state sequences:
(i) Additive rewards: the utility of a state sequence is
U([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
(ii) Discounted rewards: the utility of a state sequence is
U([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ^2 R(s2) + ...
where the discount factor γ is a number between 0 and 1.

Sequential decision problems
Infinite horizons:
A policy that is guaranteed to reach a terminal state is called a proper policy; with proper policies we can use γ = 1
Another possibility is to compare infinite sequences in terms of the average reward obtained per time step

Sequential decision problems
How to choose between policies:
The value of a policy is the expected sum of discounted rewards obtained, where the expectation is taken over all possible state sequences that could occur, given that the policy is executed.
An optimal policy π* satisfies
π* = argmax_π E[ Σt γ^t R(st) | π ]

Value iteration
The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state.

Value iteration
Utilities of states: let st be the state the agent is in after executing π for t steps (note that st is a random variable). Then
Uπ(s) = E[ Σt γ^t R(st) | π, s0 = s ]
and the true utility of a state, U(s), is Uπ*(s).

Value iteration
Figure 17.3 shows the utilities for the 4 x 3 world.

Value iteration
The agent chooses the action that maximizes the expected utility of the subsequent state:
π*(s) = argmax_a Σs' T(s, a, s') U(s')        (17.4)
The utility of a state is given by the Bellman equation:
U(s) = R(s) + γ max_a Σs' T(s, a, s') U(s')

Value iteration
Let us look at one of the Bellman equations for the 4 x 3 world. The equation for the state (1,1) is
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),      (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                   (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                   (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }     (Right)

Value iteration
The value iteration algorithm applies a Bellman update to every state simultaneously:
Ui+1(s) ← R(s) + γ max_a Σs' T(s, a, s') Ui(s')
The VALUE-ITERATION algorithm is as follows.

Value iteration
[Figure: the VALUE-ITERATION algorithm]
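Since the algorithm figure did not survive the transcript, here is a minimal Python sketch of value iteration under the slide's definitions; the function names and the reuse of the hypothetical 4 x 3 helpers sketched earlier are my assumptions.

```python
def value_iteration(states, actions, transition, reward, terminals=frozenset(),
                    gamma=0.9, epsilon=1e-4):
    """Return utilities U(s) with max-norm error below epsilon (assumes gamma < 1).

    transition(s, a) -> {s': probability}, reward(s) -> float, as sketched earlier.
    """
    U = {s: 0.0 for s in states}                       # start from all-zero utilities
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if s in terminals:
                U_new[s] = reward(s)                   # a terminal's utility is its reward
            else:
                # Bellman update: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
                best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                           for a in actions)
                U_new[s] = reward(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:      # termination test from the slides
            return U

def best_policy(states, actions, transition, U):
    """Greedy policy: in each state pick the action maximizing expected utility."""
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()))
            for s in states}
```

With the earlier helpers, `value_iteration(STATES, list(ACTIONS), transition, reward, terminals=set(TERMINALS))` produces utilities qualitatively like Figure 17.3; the exact numbers depend on the choice of γ.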

Value iteration
Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a).

Value iteration
Two important properties of contractions:
(i) A contraction has only one fixed point; if there were two fixed points, they would not get closer together when the function was applied, so it would not be a contraction.
(ii) When the function is applied to any argument, the value must get closer to the fixed point (because the fixed point does not move), so repeated application of a contraction always reaches the fixed point in the limit.

Value iteration
Let Ui denote the vector of utilities for all the states at the ith iteration. Then the Bellman update equation can be written as
Ui+1 ← B Ui
where B denotes the Bellman update operator.

Value iteration
We use the max norm, which measures the length of a vector by the absolute value of its biggest component:
||U|| = max_s |U(s)|
Let Ui and Ui' be any two utility vectors. Then we have
||B Ui - B Ui'|| ≤ γ ||Ui - Ui'||        (17.7)
that is, the Bellman update is a contraction by a factor of γ on the space of utility vectors.
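A small numerical check of the contraction property, reusing the hypothetical 4 x 3 helpers sketched earlier (an assumption on my part): apply one Bellman update B to two arbitrary utility vectors and verify that their max-norm distance shrinks by at least a factor of γ.

```python
import random

def bellman_backup(U, states, actions, transition, reward, terminals, gamma):
    """One application of the Bellman update operator B to the utility vector U."""
    return {s: reward(s) if s in terminals
            else reward(s) + gamma * max(sum(p * U[s2]
                                             for s2, p in transition(s, a).items())
                                         for a in actions)
            for s in states}

def max_norm_distance(U, V):
    return max(abs(U[s] - V[s]) for s in U)

gamma = 0.9
U1 = {s: random.uniform(-1, 1) for s in STATES}
U2 = {s: random.uniform(-1, 1) for s in STATES}
BU1 = bellman_backup(U1, STATES, list(ACTIONS), transition, reward, set(TERMINALS), gamma)
BU2 = bellman_backup(U2, STATES, list(ACTIONS), transition, reward, set(TERMINALS), gamma)

# Contraction: ||B U1 - B U2|| <= gamma * ||U1 - U2|| (small slack for float rounding)
assert max_norm_distance(BU1, BU2) <= gamma * max_norm_distance(U1, U2) + 1e-12
```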

Value iteration
In particular, we can replace Ui' in Equation (17.7) with the true utilities U, for which B U = U. Then we obtain the inequality
||B Ui - U|| ≤ γ ||Ui - U||
so the error is reduced by a factor of at least γ on each iteration; running for N = ⌈ log(2 Rmax / ε(1 - γ)) / log(1/γ) ⌉ iterations guarantees an error of at most ε. Figure 17.5(b) shows how N varies with γ, for different values of the ratio ε/Rmax.
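As a quick worked example of this bound (my illustration, not from the slides), the snippet below computes N for a few values of γ, written in terms of the ratio ε/Rmax.

```python
from math import ceil, log

def iterations_needed(gamma, eps_over_rmax):
    """N = ceil( log(2*Rmax / (eps*(1-gamma))) / log(1/gamma) ), using the ratio eps/Rmax."""
    return ceil(log(2.0 / (eps_over_rmax * (1.0 - gamma))) / log(1.0 / gamma))

for gamma in (0.5, 0.9, 0.99):
    # N grows rapidly as gamma approaches 1, as in Figure 17.5(b)
    print(gamma, iterations_needed(gamma, 0.001))
```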


Value iteration
From the contraction property (Equation (17.7)), it can be shown that if the update is small (i.e., no state's utility changes by much), then the error, compared with the true utility function, is also small. More precisely,
if ||Ui+1 - Ui|| < ε(1 - γ)/γ then ||Ui+1 - U|| < ε
This is the termination condition used in the VALUE-ITERATION algorithm.

Value iteration
Policy loss: Uπi(s) is the utility obtained if πi is executed starting in s; the policy loss ||Uπi - U|| is the most the agent can lose by executing πi instead of the optimal policy π*.

Value iteration
The policy loss of πi is connected to the error in Ui by the following inequality:
if ||Ui - U|| < ε then ||Uπi - U|| < 2εγ/(1 - γ)

Policy iteration
The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0:
Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were to be executed.
Policy improvement: calculate a new MEU policy πi+1, using one-step look-ahead based on Ui (as in Equation (17.4)).

Policy iteration
[Figure: the POLICY-ITERATION algorithm]
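Here is a minimal Python sketch of policy iteration with exact policy evaluation via a linear solve; it reuses the hypothetical 4 x 3 helpers from earlier and uses NumPy, both of which are my assumptions.

```python
import numpy as np

def policy_evaluation(pi, states, transition, reward, terminals, gamma):
    """Solve U = R + gamma * T_pi U exactly: n linear equations in n unknowns."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([reward(s) for s in states])
    for s in states:
        if s in terminals:
            continue                                   # row stays U(terminal) = R(terminal)
        for s2, p in transition(s, pi[s]).items():
            A[idx[s], idx[s2]] -= gamma * p
    U = np.linalg.solve(A, b)
    return {s: U[idx[s]] for s in states}

def policy_iteration(states, actions, transition, reward, terminals, gamma=0.9):
    pi = {s: actions[0] for s in states}               # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, states, transition, reward, terminals, gamma)
        unchanged = True
        for s in states:
            if s in terminals:
                continue
            # one-step look-ahead: pick the action with the highest expected utility
            best = max(actions, key=lambda a: sum(p * U[s2]
                                                  for s2, p in transition(s, a).items()))
            if sum(p * U[s2] for s2, p in transition(s, best).items()) > \
               sum(p * U[s2] for s2, p in transition(s, pi[s]).items()):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U
```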

Policy iteration
For n states, we have n linear equations with n unknowns, which can be solved exactly in time O(n^3) by standard linear algebra methods. For large state spaces, O(n^3) time might be prohibitive.
Modified policy iteration: instead of exact policy evaluation, perform some number of simplified value iteration steps. The simplified Bellman update for this process is
Ui+1(s) ← R(s) + γ Σs' T(s, πi(s), s') Ui(s')
(simplified because the policy is fixed, so there is no max over actions).
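A minimal sketch of this approximate evaluation step, assuming the same hypothetical helpers as before: k sweeps of the simplified update replace the exact linear solve.

```python
def approximate_policy_evaluation(pi, U, states, transition, reward, terminals,
                                  gamma=0.9, k=20):
    """k sweeps of the simplified Bellman update with the policy pi held fixed."""
    for _ in range(k):
        U = {s: reward(s) if s in terminals
             else reward(s) + gamma * sum(p * U[s2]
                                          for s2, p in transition(s, pi[s]).items())
             for s in states}
    return U
```

Substituting this for `policy_evaluation` in the earlier sketch gives modified policy iteration, which is often more efficient than either plain value iteration or exact policy iteration.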

Policy iteration
In fact, on each iteration we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration.

Partially observable MDPs
When the environment is only partially observable, the MDP becomes a partially observable MDP (POMDP, pronounced "pom-dee-pee").

Partially observable MDPs
An example of a POMDP

Partially observable MDPs
A POMDP's elements:
The elements of an MDP (transition model T(s, a, s') and reward function R(s))
An observation model O(s, o), the probability of perceiving observation o in state s

Partially observable MDPs
How is the belief state calculated? Given the prior belief state b, an action a, and an observation o, the new belief state is
b'(s') = α O(s', o) Σs T(s, a, s') b(s)
where α is a normalizing constant (this is the FORWARD filtering operation).
Decision cycle of a POMDP agent:
Given the current belief state b, execute the action a = π*(b).
Receive observation o.
Set the current belief state to FORWARD(b, a, o) and repeat.
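A minimal sketch of the belief update (the FORWARD step) for a discrete state space; the function and parameter names here are illustrative assumptions, with the observation model supplied as a function observe(s, o).

```python
def forward(belief, action, observation, states, transition, observe):
    """POMDP belief update: b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s).

    belief: {s: probability}; transition(s, a) -> {s': p}; observe(s, o) -> probability.
    Assumes the observation is possible in at least one predicted state (alpha > 0).
    """
    # Prediction step: push the belief through the transition model.
    predicted = {s2: 0.0 for s2 in states}
    for s, b in belief.items():
        for s2, p in transition(s, action).items():
            predicted[s2] += p * b
    # Correction step: weight by the observation likelihood and normalize.
    unnormalized = {s2: observe(s2, observation) * predicted[s2] for s2 in states}
    alpha = sum(unnormalized.values())
    return {s2: v / alpha for s2, v in unnormalized.items()}
```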

Decision-theoretic agents
Basic elements of the approach to agent design:
A dynamic Bayesian network for the transition and observation models
A dynamic decision network (DDN)
A filtering algorithm
A decision-making mechanism
A dynamic decision network is shown below.

Decision-theoretic agents
[Figure: a dynamic decision network]

Decision-theoretic agents
Part of the look-ahead solution of the DDN

Game theory
Where can it be used?
Agent design
Mechanism design
Components of a game in game theory:
Players
Actions
A payoff matrix

Game theory
Strategies of players:
Pure strategy (a deterministic policy)
Mixed strategy (a randomized policy)
Strategy profile: an assignment of a strategy to each player
Solution: a strategy profile in which each player adopts a rational strategy

Game theory
Game theory describes rational behavior for agents in situations where multiple agents interact simultaneously. Solutions of games are Nash equilibria: strategy profiles in which no agent has an incentive to deviate from its specified strategy, given that the other players keep theirs fixed.
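To illustrate, here is a small sketch that checks which pure-strategy profiles of a two-player game are Nash equilibria; the payoff matrix is the classic prisoner's dilemma, used here only as an example of my choosing.

```python
# Payoff matrix: profile -> (payoff to player 0, payoff to player 1).
# Classic prisoner's dilemma with actions "testify" / "refuse".
PLAYER_ACTIONS = ["testify", "refuse"]
PAYOFFS = {
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"):  (0, -10),
    ("refuse", "testify"):  (-10, 0),
    ("refuse", "refuse"):   (-1, -1),
}

def is_nash(profile):
    """A profile is a pure-strategy Nash equilibrium if no single player can gain
    by unilaterally switching to a different action."""
    for player in (0, 1):
        current = PAYOFFS[profile][player]
        for alt in PLAYER_ACTIONS:
            deviated = list(profile)
            deviated[player] = alt
            if PAYOFFS[tuple(deviated)][player] > current:
                return False
    return True

equilibria = [p for p in PAYOFFS if is_nash(p)]
print(equilibria)   # for the prisoner's dilemma: [('testify', 'testify')]
```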

Mechanism design
Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.
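As a concrete illustration of that last point (my example, not from the slides): in a sealed-bid second-price (Vickrey) auction, bidding one's true value is a dominant strategy, so no bidder needs to reason about the others' bids.

```python
def second_price_auction(bids):
    """Sealed-bid second-price (Vickrey) auction.

    bids: {bidder: bid}. The highest bidder wins but pays the second-highest bid,
    which makes truthful bidding a dominant strategy regardless of what others do.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

# Example: bidders report their true values; the winner pays the runner-up's bid.
print(second_price_auction({"A": 10.0, "B": 7.0, "C": 12.5}))   # ('C', 10.0)
```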