Making Simple Decisions


1 Making Simple Decisions

2 Outline
Combining Beliefs & Desires
Basis of Utility Theory
Utility Functions
Multi-attribute Utility Functions
Decision Networks
Value of Information

3 Combining Beliefs & Desires (1)
Rational decisions are based on beliefs and desires, in settings with uncertainty and conflicting goals.
A utility function assigns a single number to each state, expressing its desirability.
Combining utilities with outcome probabilities gives an expected utility for each action.

4 Combining Beliefs & Desires (2)
Notation
U(S) : utility of state S
S : a snapshot of the world (a state)
A : an action of the agent
Result_i(A) : the i-th possible outcome state of doing A
E : the available evidence
Do(A) : the proposition that action A is executed in the current state

5 Combining Beliefs & Desires (3)
Expected Utility
Maximum Expected Utility (MEU): choose the action that maximizes the agent's expected utility.
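The expected-utility equations themselves appeared as images on the slide; in the textbook's notation they are
EU(A | E) = Σ_i P(Result_i(A) | Do(A), E) · U(Result_i(A))
and the MEU principle selects the action argmax_A EU(A | E).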

6 Basis of Utility Theory
Notation
Lottery L = [p, A; 1-p, B] : a chance scenario in which outcomes A and B occur with probabilities p and 1-p
A ≻ B : A is preferred to B
A ~ B : indifference between A and B
A ≿ B : B is not preferred to A
Multi-outcome lottery : L = [p1, S1; p2, S2; … ; pn, Sn]

7 Basis of Utility Theory (2)
Constraints: Orderability, Transitivity, Continuity

8 Basis of Utility Theory (3)
Constraints (cont.): Substitutability, Monotonicity, Decomposability
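The formal statements of these six constraints were shown as images; from the textbook they are:
Orderability: exactly one of A ≻ B, B ≻ A, or A ~ B holds.
Transitivity: (A ≻ B) and (B ≻ C) imply (A ≻ C).
Continuity: A ≻ B ≻ C implies there is a p such that [p, A; 1-p, C] ~ B.
Substitutability: A ~ B implies [p, A; 1-p, C] ~ [p, B; 1-p, C].
Monotonicity: A ≻ B implies (p ≥ q iff [p, A; 1-p, B] ≿ [q, A; 1-q, B]).
Decomposability: compound lotteries reduce to simple ones: [p, A; 1-p, [q, B; 1-q, C]] ~ [p, A; (1-p)q, B; (1-p)(1-q), C].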

9 (figure-only slide)

10 Basis of Utility Theory (4)
Utility principle and Maximum Expected Utility principle.
Utility represents what the agent's actions are trying to achieve; it can be constructed by observing the agent's preferences.
The resulting utility function is not unique: any positive affine transformation of it represents the same preferences.
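Stated formally (shown as images on the slide), the utility principle is
U(A) > U(B) iff A ≻ B, and U(A) = U(B) iff A ~ B
and the MEU principle says a rational agent should choose the action maximizing EU(A | E) as defined on slide 5.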

11 Utility Functions (1)
Utility: a mapping from states to real numbers.
Approach: compare an outcome A to a standard lottery L_p, which gives
  the best possible prize u_top with probability p
  the worst possible catastrophe u_bottom with probability 1 - p
Adjust p until the agent is indifferent between A and L_p.
e.g. $30 ~ [0.9, continue as usual; 0.1, death]

12 Utility Functions (2)
Utility scales
Utilities are unique only up to a positive linear transform: U'(S) = k1 U(S) + k2 with k1 > 0.
Normalized utility: u_top = 1.0 for the best outcome, u_bottom = 0.0 for the worst.
Micromort ("micro-death"): a one-in-a-million chance of death; used in studies of Russian roulette, insurance, etc.
QALY: quality-adjusted life years.

13 Utility Functions (3)
Utility of Money
TV game: take $1 million for sure, or gamble several million on a coin flip (probability 0.5); most people take the sure million even when the gamble has a higher expected monetary value.
Example: Grayson (1960), an empirical study of the utility of money.

14 Utility Functions (4)
Money does NOT behave as a utility function.
Given a lottery L with expected monetary value EMV(L):
  risk-averse: the agent prefers the sure amount, U(L) < U(S_EMV(L))
  risk-seeking: the agent prefers the gamble, U(L) > U(S_EMV(L))
In reality the true probabilities are not easy to estimate, so we may estimate the utility function by learning.
(Figure: utility U plotted against money $.)
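As a toy illustration of risk aversion (my own sketch, not from the slides), a concave utility such as log makes the sure $1 million preferable to a coin-flip gamble for $2.5 million, even though the gamble has the higher expected monetary value:

```python
import math

def utility(money, wealth=10_000):
    # A concave (risk-averse) utility of money; log utility is one common
    # illustrative choice, not the curve used on the slide.
    return math.log(wealth + money)

# Lottery: 50% chance of $0, 50% chance of $2,500,000, vs. a sure $1,000,000.
eu_gamble = 0.5 * utility(0) + 0.5 * utility(2_500_000)
eu_sure = utility(1_000_000)
emv_gamble = 0.5 * 0 + 0.5 * 2_500_000            # $1,250,000 > $1,000,000

print(f"EU(gamble)       = {eu_gamble:.3f}")      # about 11.97
print(f"EU(sure million) = {eu_sure:.3f}")        # about 13.83, so preferred
```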

15 Outline
Combining Beliefs & Desires
Basis of Utility Theory
Utility Functions
Multi-attribute Utility Functions
Decision Networks
Value of Information
Expert Systems

16 Multi-attribute Utility Functions (1)
Multi-Attribute Utility Theory (MAUT)
Outcomes are characterized by two or more attributes.
e.g. siting a new airport: disruption by construction, cost of land, noise, …
Approach: identify regularities in the preference behavior.

17 Multi-attribute Utility Functions (2)
Notation
Attributes: X1, …, Xn
Attribute value vector: x = ⟨x1, …, xn⟩
Utility function: U(x1, …, xn)

18 Multi-attribute Utility Functions (3)
Dominance
Certain case (strict dominance, Fig. 1): e.g. airport site S1 costs less, generates less noise, and is safer than S2, so S1 strictly dominates S2.
Uncertain case (Fig. 2): strict dominance can still hold between the regions of possible outcome values.
(Figures 1 and 2 were on-slide images.)

19 Multi-attribute Utility Functions (4)
Dominance (cont.)
Strict dominance occurs rarely in real-world problems, so we use stochastic dominance instead.
e.g. if S1's cost is uniformly distributed between $2.8 billion and $4.8 billion, and S2's cost between $3 billion and $5.2 billion, then S1 stochastically dominates S2 on cost.

20 Multi-attribute Utility Functions (5)
Dominance (cont.) (the formal condition and cumulative-distribution figure were on-slide images)
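The formal condition, from the textbook, is the cumulative-distribution test: if actions A1 and A2 give probability distributions p1 and p2 over attribute X (higher values better), then A1 stochastically dominates A2 on X iff
for all x:  ∫_{-∞}^{x} p1(x') dx'  ≤  ∫_{-∞}^{x} p2(x') dx'
and in that case A1 has at least as high an expected utility as A2 for any monotonically nondecreasing U(x).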

21 Multi-attribute Utility Functions (6)
Preferences without Uncertainty
Preferences between concrete outcome values.
Preference structure: X1 and X2 are preferentially independent of X3 iff the preference between ⟨x1, x2, x3⟩ and ⟨x1', x2', x3⟩ does not depend on the particular value x3.
e.g. Airport site ⟨Noise, Cost, Safety⟩:
⟨20,000 people suffer, $4.6 billion, 0.06 deaths/mpm⟩ vs. ⟨70,000 people suffer, $4.2 billion, 0.06 deaths/mpm⟩: the preference does not depend on the shared Safety value.

22 Multi-attribute Utility Functions (7)
Preferences without Uncertainty (cont.)
Mutual preferential independence (MPI): every pair of attributes is preferentially independent of its complement.
e.g. Airport site ⟨Noise, Cost, Safety⟩:
  Noise & Cost are P.I. of Safety
  Noise & Safety are P.I. of Cost
  Cost & Safety are P.I. of Noise
so ⟨Noise, Cost, Safety⟩ exhibits MPI, which constrains the agent's preference behavior (next slide).

23 Multi-attribute Utility Functions (7)
Preferences without Uncertainty (cont.)
Mutual preferential independence (MPI): airport site selection ⟨Noise, Cost, Safety⟩ exhibits MPI, so the agent's preference behavior can be described by an additive value function.
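The additive value function itself (shown as an image on the slide) has the textbook form
V(x1, …, xn) = Σ_i V_i(x_i)
i.e. the agent's overall value is a sum of single-attribute value functions.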

24 Multi-attribute Utility Functions (8)
Preferences with Uncertainty
Preferences between lotteries over attribute values.
Utility independence (U.I.): a set of attributes X is utility independent of a set Y iff preferences between lotteries on the attributes in X do not depend on the particular values of the attributes in Y.
Mutual utility independence (MUI): each subset of attributes is U.I. of the remaining attributes.
MUI implies that the agent's behavior can be described by a multiplicative utility function.
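For three attributes, the multiplicative utility function (shown as an image on the slide) has the textbook form
U = k1 U1 + k2 U2 + k3 U3 + k1 k2 U1 U2 + k2 k3 U2 U3 + k3 k1 U3 U1 + k1 k2 k3 U1 U2 U3
so n single-attribute utilities and n constants suffice to describe the whole function.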

25 Value of Information (1)
Idea: compute the value gained by acquiring each possible piece of evidence.
Example: buying oil drilling rights.
Three blocks A, B and C; exactly one contains oil, worth k dollars; the prior probabilities are 1/3 each (mutually exclusive).
The current price of each block is k/3.
A consultant offers an accurate survey of block A. What is a fair price for the survey?

26 Value of Information (2)
Solution: compute the expected value of the information
  = expected value of the best action given the information
  - expected value of the best action without the information.
The survey says "oil in A" with probability 1/3, or "no oil in A" with probability 2/3.
With probability 1/3, A has oil: buy A and profit k - k/3 = 2k/3.
With probability 2/3, A has no oil: buy B or C (each now has probability 1/2 of oil), so the expected profit is k/2 - k/3 = k/6.
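Carrying the computation through (the final step is left implicit on the slide): the expected profit given the survey is
1/3 · 2k/3 + 2/3 · k/6 = 2k/9 + k/9 = k/3
while without the survey the expected profit of any purchase is 0 (each block's price equals its expected value), so the survey is worth k/3, which is the fair price to pay the consultant.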

27 General Formula
Notation
Current evidence E; current best action α
Possible action outcomes Result_i(A) = S_i
Potential new evidence E_j
Value of perfect information (VPI)
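The general formula appeared as an image; in the textbook's notation, with
EU(α | E) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | E, Do(A))
EU(α_ejk | E, E_j = e_jk) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | E, E_j = e_jk, Do(A))
the value of perfect information about E_j is
VPI_E(E_j) = ( Σ_k P(E_j = e_jk | E) · EU(α_ejk | E, E_j = e_jk) ) - EU(α | E).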

28 Properties of VPI Nonnegative Nonadditive Order-Independent
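Spelled out (the formal statements were images on the slide):
Nonnegative: VPI_E(E_j) ≥ 0 for all j and E.
Nonadditive: in general VPI_E(E_j, E_k) ≠ VPI_E(E_j) + VPI_E(E_k).
Order-independent: VPI_E(E_j, E_k) = VPI_E(E_j) + VPI_{E,E_j}(E_k) = VPI_E(E_k) + VPI_{E,E_k}(E_j).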

29 Three generic cases for VoI
a) Choice is obvious, information worth little b) Choice is nonobvious, information worth a lot c) Choice is nonobvious, information worth little

30 Summary
Combining Beliefs & Desires
Basis of Utility Theory
Utility Functions
Multi-attribute Utility Functions
Decision Networks
Value of Information

31 MAKING COMPLEX DECISIONS

32 Outline
MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration
POMDPs
  Partially observable MDPs
  Decision-theoretic agents
Game Theory
  Decisions with multiple agents
  Mechanism design

33 Sequential decision problems
An example

34 Sequential decision problems
Game rules: the 4 x 3 environment shown in the figure.
Beginning in the start state, the agent chooses an action at each time step; the game ends when a goal state, marked +1 or -1, is reached.
Actions: {Up, Down, Left, Right}
The environment is fully observable.
The terminal states have rewards +1 and -1, respectively; all other states have a reward of -0.04.

35 Sequential decision problems
Each action achieves the intended effect with probability 0.8; the rest of the time (probability 0.1 each way) the action moves the agent at right angles to the intended direction. If the agent bumps into a wall, it stays in the same square. A sketch of this transition model follows.
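A minimal Python sketch of this transition model (my own encoding, assuming (column, row) coordinates from (1,1) to (4,3), the blocked square at (2,2), and terminals at (4,3) and (4,2); none of these names come from the slides):

```python
ACTIONS = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
PERPENDICULAR = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                 "Left": ("Up", "Down"), "Right": ("Up", "Down")}
BLOCKED = {(2, 2)}                       # the wall square in the 4 x 3 grid

def move(state, action):
    """Deterministic effect of one move; bumping into the wall or the
    grid boundary leaves the agent where it is."""
    x, y = state
    dx, dy = ACTIONS[action]
    nxt = (x + dx, y + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def transition(state, action):
    """T(s, a): returns {s': P(s' | s, a)} -- 0.8 for the intended
    direction, 0.1 for each of the two right-angle directions."""
    probs = {}
    for a, p in [(action, 0.8),
                 (PERPENDICULAR[action][0], 0.1),
                 (PERPENDICULAR[action][1], 0.1)]:
        s2 = move(state, a)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```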

36 Sequential decision problems
Transition model: a specification of the outcome probabilities for each action in each possible state.
Environment history: a sequence of states.
Utility of an environment history: the sum of the rewards (positive or negative) received.

37 Sequential decision problems
Definition of an MDP
Markov Decision Process: the specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards.
An MDP is defined by
  Initial state: S0
  Transition model: T(s, a, s')
  Reward function: R(s)

38 Sequential decision problems
Policy (denoted by π): a solution that specifies what the agent should do in any state the agent might reach; π(s) is the action recommended by the policy for state s.
Optimal policy (denoted by π*): a policy that yields the highest expected utility.

39 Sequential decision problems
An optimal policy for the world of Figure 17.1

40 Sequential decision problems
A finite horizon: there is a fixed time N after which nothing matters (the game is over). The optimal policy for a finite horizon is nonstationary (the optimal action in a given state can change over time), which makes it more complex.
An infinite horizon: there is no fixed time N. The optimal policy for an infinite horizon is stationary, which is simpler.

41 Sequential decision problems
Calculating the utility of state sequences
Additive rewards: the utility of a state sequence is the sum of the states' rewards.
Discounted rewards: the utility of a state sequence is the discounted sum of the states' rewards, where the discount factor γ is a number between 0 and 1.
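The two formulas, which appeared as images on the slide, are (in the textbook's notation):
U_h([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …                  (additive)
U_h([s0, s1, s2, …]) = R(s0) + γ R(s1) + γ² R(s2) + …             (discounted)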

42 Sequential decision problems
Infinite horizons
Definition: a policy that is guaranteed to reach a terminal state is called a proper policy. With a proper policy we may use γ = 1.
Another possibility is to compare infinite sequences in terms of the average reward obtained per time step.

43 Sequential decision problems
How to choose between policies
The value of a policy is the expected sum of discounted rewards obtained, where the expectation is taken over all possible state sequences that could occur, given that the policy is executed.
An optimal policy satisfies the equation below.
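Written out (the slide showed this as an image), the value of executing policy π starting in state s is
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) ]       with s0 = s and each s_t reached by following π
and an optimal policy satisfies
π* = argmax_π U^π(s).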

44 Outline
MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration
POMDPs
  Partially observable MDPs
  Decision-theoretic agents
Game Theory
  Decisions with multiple agents
  Mechanism design

45 Value iteration
The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state.
Definition: utilities of states (given a specific policy π). Let s_t be the state the agent is in after executing π for t steps (note that s_t is a random variable); U^π(s) is then the expected discounted sum of the rewards R(s_t).
Note the difference between the short-term reward R(s) and the long-term utility U(s).

46 Value iteration The utilities for the 4 x 3 world

47 Value iteration
Choose the action that maximizes the expected utility of the subsequent state.
The utility of a state is given by the Bellman equation.
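Both statements were equation images on the slide; in the notation of slide 37 they are
π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')        (the Bellman equation)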

48 Value iteration
Let us look at one of the Bellman equations for the 4 x 3 world. The equation for the state (1,1), with one term per action (Up, Left, Down, Right), is shown below.
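From the textbook, that equation is
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),        (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                      (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                      (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }        (Right)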

49 Value iteration
The value iteration algorithm repeatedly applies a Bellman update; the update is given below, and the VALUE-ITERATION algorithm itself appears on the next slide.
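The Bellman update, shown as an image on the slide, is
U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U_i(s')
applied simultaneously to all states at each iteration.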

50 Value iteration
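Slide 50 presumably contained the book's VALUE-ITERATION pseudocode as an image; the following is my own Python sketch of the same idea (function names and signature are assumptions; transition(s, a) is the dictionary-returning model sketched earlier):

```python
def value_iteration(states, actions, transition, R, gamma=0.9, epsilon=1e-4):
    """Compute state utilities by repeated Bellman updates.
    actions(s) lists the actions available in s; transition(s, a) returns
    {s': P(s' | s, a)}; R(s) is the reward. Terminal states should expose a
    single no-op action with no successors."""
    U = {s: 0.0 for s in states}
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                       for a in actions(s))
            U_new[s] = R(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        # Termination test from the convergence analysis (requires gamma < 1).
        if delta <= epsilon * (1 - gamma) / gamma:
            return U
```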

51 Convergence of Value iteration
Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a)

52 Value iteration
Two important properties of contractions:
A contraction has only one fixed point.
When the function is applied to any argument, the result is closer to the fixed point.
Let U_i denote the vector of utilities for all states at the i-th iteration. Then the Bellman update can be written as U_{i+1} ← B U_i, where B is the Bellman update operator.

53 Value iteration
Use the max norm, which measures the length of a vector by the magnitude of its biggest component. Let U_i and U_i' be any two utility vectors; then the Bellman update is a contraction, as stated below.
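The two formulas shown on the slide are, in the textbook's notation,
||U|| = max_s |U(s)|
||B U_i - B U_i'|| ≤ γ ||U_i - U_i'||
i.e. the Bellman update is a contraction by a factor of γ on the space of utility vectors, which is what guarantees convergence of value iteration.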

54 Value iteration
The number of value iterations k required to guarantee an error of at most ε = c · R_max, for different values of c, as a function of the discount factor γ (Figure 17.5(b)).

55 Policy iteration The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0 : Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were to be executed. Policy improvement: Calculate a new MEU policy πi+1, using one-step look-ahead based on Ui (as in Equation (17.4)).

56 Policy iteration
For n states, policy evaluation gives n linear equations with n unknowns, which can be solved exactly in time O(n^3) by standard linear algebra methods.
For large state spaces O(n^3) time may be prohibitive, so modified policy iteration instead performs some number of simplified Bellman updates:
U_{i+1}(s) ← R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s')
(simplified because the policy fixes the action, so there is no max). See the sketch after this slide.
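A minimal Python sketch of policy iteration with iterative (modified) policy evaluation; the names and the fixed number of evaluation sweeps are my own choices, not the book's pseudocode:

```python
import random

def policy_iteration(states, actions, transition, R, gamma=0.9, eval_sweeps=20):
    """Alternate policy evaluation and policy improvement until the policy
    no longer changes. transition(s, a) returns {s': P(s' | s, a)}."""
    pi = {s: random.choice(actions(s)) for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: a few simplified Bellman updates under pi.
        for _ in range(eval_sweeps):
            U = {s: R(s) + gamma * sum(p * U[s2]
                                       for s2, p in transition(s, pi[s]).items())
                 for s in states}
        # Policy improvement: one-step look-ahead using the current utilities.
        unchanged = True
        for s in states:
            best = max(actions(s),
                       key=lambda a: sum(p * U[s2]
                                         for s2, p in transition(s, a).items()))
            if best != pi[s]:
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```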

57 Policy iteration
Example: see the figure. Suppose π_i is the policy shown in the figure.

58 Policy iteration

59 Policy iteration In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration.

60 Outline
MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration
POMDPs
  Partially observable MDPs
  Decision-theoretic agents
Game Theory
  Decisions with multiple agents
  Mechanism design

61 Partially observable MDPs
When the environment is only partially observable, the MDP becomes a partially observable MDP (POMDP, pronounced "pom-dee-pee").
A POMDP has the elements of an MDP (transition model, reward function) plus an observation model.
A POMDP is defined by
  Initial state: S0 (the agent's actual state is also unknown to it)
  Transition model: P(s' | s, a)
  Reward function: R(s)
  Sensor model: P(e | s)

62 Partially observable MDPs
An example of a POMDP (figure).

63 Partially observable MDPs
How is the belief state calculated? The updated belief is
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s)
where α is a normalizing constant.
Suppose the agent moves Left and its sensor reports one adjacent wall; then it is quite likely (though not certain, because both the motion and the sensor are noisy) that the agent is now in (3,1).

64 Partially observable MDPs
Decision cycle of a POMDP agent:
1. Given the current belief state b, execute the action a = π*(b).
2. Receive observation e.
3. Set the current belief state to FORWARD(b, a, e) and repeat (a sketch of FORWARD follows).

65 The probability of an observation
Given that action a was performed starting in belief state b, the probability of perceiving e is
P(e | a, b) = Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s)
and hence the probability of reaching belief state b' is
P(b' | b, a) = Σ_e P(b' | e, a, b) P(e | a, b)
66 The probability of an observation (cont.)
where P(b' | e, a, b) = 1 if b' = FORWARD(b, a, e), and 0 otherwise.
Reward function: ρ(b) = Σ_s b(s) R(s)
Together, P(b' | b, a) and ρ(b) define an observable MDP on the space of belief states. Solving a POMDP on a physical state space can thus be reduced to solving an MDP on the corresponding belief-state space.

67 Value Iteration for POMDPs
Let α_p(s) denote the utility of executing a fixed conditional plan p starting in physical state s.
The expected utility of executing p in belief state b is Σ_s b(s) α_p(s), i.e. b · α_p.
The expected utility of b under the optimal policy is the utility of the best conditional plan: U(b) = max_p b · α_p.
Value iteration for POMDPs operates on these α vectors.

68 Outline
MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration
POMDPs
  Partially observable MDPs
  Decision-theoretic agents
Game Theory
  Decisions with multiple agents
  Mechanism design

69 Decision-theoretic Agents
Basic elements of this approach to agent design:
A dynamic decision network (DDN = DBN + utility)
A filtering algorithm to track the belief state, plus a procedure for making decisions
A dynamic decision network is shown in the figure below.

70 Decision-theoretic Agents
Dynamic Decision Network (DDN = DBN + utility)
Transition model: P(X_{t+1} | X_t, A_t)
Sensor model: P(E_t | X_t)
Reward: R_t for each time slice
Utility: U on the final slice of the look-ahead

71 Decision-theoretic Agents
Part of the look-ahead solution of the DDN.

72 Outline
MDPs (Markov Decision Processes)
  Sequential decision problems
  Value iteration & policy iteration
POMDPs
  Partially observable MDPs
  Decision-theoretic agents
Game Theory
  Decisions with multiple agents
  Mechanism design

73 Game Theory
Components of a game in game theory
Players: Alice, Bob (or E and O)
Actions: e.g. one, two, testify, refuse
A payoff function, e.g. the payoff matrix for two-finger Morra:

            O: one              O: two
E: one      E = +2, O = -2      E = -3, O = +3
E: two      E = -3, O = +3      E = +4, O = -4

74 Game Theory
Agent design: analyze the agent's decisions and compute the expected utility for each decision, under the assumption that the other agents are acting optimally according to game theory.
Mechanism design: when an environment is inhabited by many agents, it may be possible to define the rules of the environment so that the collective good of all agents is maximized when each agent adopts the game-theoretic solution that maximizes its own utility.
Example: routing Internet traffic.

75 Game Theory
Strategy of a player: a pure strategy (deterministic policy) or a mixed strategy (randomized policy).
Strategy profile: an assignment of a strategy to each player.
Solution: a strategy profile in which each player adopts a rational strategy.

76 Game Theory Game theory describes rational behavior for agents in situations where multiple agents interact simultaneously. Solutions of games are Nash equilibria - strategy profiles in which no agent has an incentive to deviate from the specified strategy.

77 Prisoner's Dilemma

                Alice: testify        Alice: refuse
Bob: testify    A = -5, B = -5        A = -10, B = 0
Bob: refuse     A = 0, B = -10        A = -1, B = -1

Suppose Bob testifies: Alice gets 5 years if she testifies, 10 years if she refuses.
Suppose Bob refuses: Alice gets 0 years if she testifies, 1 year if she refuses.

78 Dominant strategy
Dominant strategy: for player p, a strategy s strongly dominates strategy s' if the outcome for s is better for p than the outcome for s', for every choice of strategies by the other players.
Strategy s weakly dominates s' if s is better than s' for at least one strategy profile and no worse for any other.
Pareto optimal: an outcome is Pareto optimal if there is no other outcome that all players would prefer.

79 Equilibrium
Alice's reasoning: Bob's dominant strategy is "testify", so she testifies as well, and both get five years. This outcome is a dominant strategy equilibrium.
John Nash proved that every game has at least one equilibrium (possibly in mixed strategies), now known as a Nash equilibrium.
Every dominant strategy equilibrium is a Nash equilibrium, but a Nash equilibrium need not be a dominant strategy equilibrium.

80 Mechanism Design Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.

81 The End of the Talk

