1
CS 416 Artificial Intelligence
Lecture 23: Making Complex Decisions (Chapter 17)
2
Partially observable Markov decision processes (POMDPs)
Relationship to MDPs
Value and Policy Iteration assume you know a lot about the world:
– the current state, the action taken, the next state, the reward for each state, …
In the real world, you don't know exactly what state you're in:
– Is the car in front braking hard or braking lightly?
– Can you successfully kick the ball to your teammate?
3
Partially observable
Consider not knowing what state you're in…
Go left, left, left, left, left
Go up, up, up, up, up
– You're probably in the upper-left corner
Go right, right, right, right, right
4
Extending the MDP model
MDPs have an explicit transition function T(s, a, s′)
We add an observation model O(s, o)
– the probability of observing o when in state s
We add the belief state, b
– a probability distribution over all possible states
– b(s) = the belief that you are in state s
5
Two parts to the problem
Figure out what state you're in
– use filtering from Chapter 15
Figure out what to do in that state
– Bellman's equation is useful again
The optimal action depends only on the agent's current belief state
Update b(s) and π(s) after each iteration
6
Selecting an action
The filtering update gives the new belief state:
b′(s′) = α O(s′, o) Σ_s T(s, a, s′) b(s)
where α is a normalizing constant that makes the belief state sum to 1
b′ = FORWARD(b, a, o)
The optimal policy maps belief states to actions
– note that the n-dimensional belief state is continuous: each belief value is a number between 0 and 1
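The update above can be sketched in a few lines. This is a minimal, runnable illustration of b′ = FORWARD(b, a, o); the two-state world and all the numbers in T and O are made up, chosen only so the example executes, and the function names are mine rather than the lecture's.

```python
# A minimal sketch of the belief update b' = FORWARD(b, a, o).
# The two-state world and the numbers in T and O are hypothetical.

def forward(b, a, o, T, O, states):
    """b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)."""
    unnorm = {s2: O[s2][o] * sum(T[s][a][s2] * b[s] for s in states)
              for s2 in states}
    alpha = 1.0 / sum(unnorm.values())  # makes the new belief sum to 1
    return {s2: alpha * p for s2, p in unnorm.items()}

states = ["s1", "s2"]
# T[s][a][s']: transition model (hypothetical numbers)
T = {"s1": {"go": {"s1": 0.7, "s2": 0.3}},
     "s2": {"go": {"s1": 0.2, "s2": 0.8}}}
# O[s][o]: observation model (hypothetical numbers)
O = {"s1": {"wall": 0.9, "open": 0.1},
     "s2": {"wall": 0.2, "open": 0.8}}

b = {"s1": 0.5, "s2": 0.5}            # uniform initial belief
b2 = forward(b, "go", "wall", T, O, states)
```

Observing "wall" shifts the belief toward s1, the state where that observation is more likely.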
7
A slight hitch
The previous slide required knowing the observation o that follows action a in order to update the belief state. If the policy is supposed to navigate through belief space, we want to know what belief state we're moving into before executing action a.
8
Predicting future belief states
Suppose you know action a was performed in belief state b. What is the probability of receiving observation o?
– b provides a guess about the initial state
– a is known
– any observation could be realized… any subsequent state could be realized… any new belief state could be realized
9
Predicting future belief states
The probability of perceiving o, given action a and belief state b, is found by summing over all the actual states s′ the agent might reach:
P(o | a, b) = Σ_s′ O(s′, o) Σ_s T(s, a, s′) b(s)
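This sum can be computed directly. The sketch below reuses a hypothetical two-state transition model T and observation model O (all numbers invented for illustration) and checks that the probabilities over possible observations add up to 1.

```python
# A sketch of P(o | a, b): the probability of perceiving observation o
# after doing action a in belief state b, summing over the states s'
# the agent might actually reach. The two-state model is hypothetical.

def prob_obs(o, a, b, T, O, states):
    """P(o | a, b) = sum_{s'} O(s', o) * sum_s T(s, a, s') * b(s)."""
    return sum(O[s2][o] * sum(T[s][a][s2] * b[s] for s in states)
               for s2 in states)

states = ["s1", "s2"]
T = {"s1": {"go": {"s1": 0.7, "s2": 0.3}},
     "s2": {"go": {"s1": 0.2, "s2": 0.8}}}
O = {"s1": {"wall": 0.9, "open": 0.1},
     "s2": {"wall": 0.2, "open": 0.8}}
b = {"s1": 0.5, "s2": 0.5}

p_wall = prob_obs("wall", "go", b, T, O, states)
p_open = prob_obs("open", "go", b, T, O, states)
# p_wall + p_open == 1: some observation must be received
```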
10
Predicting future belief states
We just computed the probability of receiving o; now we want the new belief state.
Let τ(b, a, b′) be the belief transition function, built from
P(b′ | b, a, o) = 1 if b′ = FORWARD(b, a, o), and 0 otherwise
11
Predicted future belief states
Combining the previous two slides:
τ(b, a, b′) = P(b′ | b, a) = Σ_o P(b′ | b, a, o) P(o | a, b)
This is a transition model through belief states.
12
Relating POMDPs to MDPs
We've found a model for transitions through belief states
– note that MDPs had transitions through states (the real things)
We also need a model for rewards based on beliefs, the expected reward
ρ(b) = Σ_s b(s) R(s)
– note that MDPs had a reward function based on state
13
Bringing it all together
We've constructed a representation of POMDPs that makes them look like MDPs
– Value and Policy Iteration can be used for POMDPs
– the optimal policy π*(b) of the belief-state MDP representation is also optimal for the physical-state POMDP representation
14
Continuous vs. discrete
Our POMDP in MDP form is continuous
– cluster the continuous space into regions and try to solve for approximations within those regions
15
Final answer to the POMDP problem
[l, u, u, r, u, u, r, u, u, r, …]
– it's deterministic (it already takes into account the absence of observations)
– it has an expected utility of 0.38 (compared with 0.08 for the simple l, l, l, u, u, u, r, r, r, …)
– it is successful 86.6% of the time
In general, POMDPs with even a few dozen states are nearly impossible to solve optimally
16
Game Theory
Multiagent games with simultaneous moves
First, study games with one move:
– bankruptcy proceedings
– auctions
– economics
– war gaming
17
Definition of a game
– the players
– the actions
– the payoff matrix: provides the utility to each player for each combination of actions
Example: two-finger Morra
18
Game theory strategies
Strategy == policy (as in policy iteration)
What do you do?
– pure strategy: you do the same thing all the time
– mixed strategy: you rely on some randomized policy to select an action
Strategy profile
– the assignment of strategies to players
19
Game theoretic solutions
What's a solution to a game?
– all players select a "rational" strategy
– note that we're not analyzing one particular game, but the outcomes that accumulate over a series of played games
20
Prisoner's Dilemma
Alice and Bob are caught red-handed at the scene of a crime
– both are interrogated separately by the police
– the penalty if they both confess is 5 years each
– the penalty if they both refuse to confess is 1 year each
– if one confesses and the other doesn't, the one who confesses (testifies) goes free (0 years) and the one who refuses gets 10 years
What do you do to act selfishly?
21
Prisoner's dilemma payoff matrix
(entries are (Alice, Bob) utilities, in negated years of prison)

                 Bob: testify   Bob: refuse
Alice: testify   (-5, -5)       (0, -10)
Alice: refuse    (-10, 0)       (-1, -1)
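The matrix and the dominance reasoning that follows can be sketched in a few lines of Python; the dict layout and function names are my own, not from the lecture.

```python
# Prisoner's dilemma payoffs: payoff[(alice, bob)] = (Alice's utility,
# Bob's utility), in negated years of prison.
payoff = {
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"):  (0, -10),
    ("refuse",  "testify"): (-10, 0),
    ("refuse",  "refuse"):  (-1, -1),
}
actions = ["testify", "refuse"]

def strongly_dominant_for_alice():
    """Return Alice's strongly dominant strategy, if one exists: a
    strategy strictly better than every alternative no matter what
    Bob does. Returns None when no such strategy exists."""
    for s in actions:
        if all(payoff[(s, b)][0] > payoff[(s2, b)][0]
               for s2 in actions if s2 != s
               for b in actions):
            return s
    return None
```

Running the check confirms the next two slides: testifying strictly beats refusing against either choice Bob makes.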
22
Prisoner's dilemma strategy
Alice's strategy:
– if Bob testifies, her best option is to testify (-5)
– if Bob refuses, her best option is to testify (0)
Testifying is a dominant strategy
23
Prisoner's dilemma strategy
Bob's strategy:
– if Alice testifies, his best option is to testify (-5)
– if Alice refuses, his best option is to testify (0)
Testifying is a dominant strategy
24
Rationality
Both players seem to have clear strategies
Both testify
– game outcome would be (-5, -5)
25
Dominance of strategies
Comparing strategies:
– strategy s strongly dominates s′ if the outcome of s is always better than the outcome of s′, no matter what the other player does (testifying strongly dominates refusing for both Bob and Alice)
– strategy s weakly dominates s′ if the outcome of s is better than the outcome of s′ for at least one action of the opponent and no worse for the others
26
Pareto Optimal
Pareto optimality comes from economics
An outcome is Pareto optimal if
– textbook: there is no alternative outcome that all players would prefer
– I prefer: it is the best that can be accomplished without disadvantaging at least one group
Is the (testify, testify) outcome (-5, -5) Pareto optimal?
27
Is (-5, -5) Pareto optimal?
Is there an outcome that improves on it without disadvantaging any player?
How about (-1, -1) from (refuse, refuse)?
28
Dominant strategy equilibrium
(-5, -5) represents a dominant strategy equilibrium
– neither player has an incentive to deviate from the dominant strategy
– if Alice assumes Bob keeps his current strategy, she will only lose more by switching, and likewise for Bob
Imagine this as a local optimum in outcome space
– each dimension of outcome space is one player's choice
– any unilateral move away from the dominant strategy equilibrium in this space results in a worse outcome for the mover
29
Thus the dilemma…
Now we see the problem
– outcome (-5, -5) is Pareto-dominated by outcome (-1, -1)
– achieving the Pareto optimal outcome requires diverging from the local optimum at the strategy equilibrium
Tough situation… the Pareto optimal outcome would be nice, but it is unlikely because each player risks losing more
30
Nash Equilibrium
John Nash studied game theory in the 1950s
– he proved that every finite game has at least one equilibrium (possibly in mixed strategies)
– if there is a set of strategies such that no player can benefit by changing her strategy while the other players keep theirs unchanged, then that set of strategies and the corresponding payoffs constitute a Nash equilibrium
– every dominant strategy equilibrium is a Nash equilibrium
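The definition can be checked mechanically for small games. Below is a brute-force sketch over the prisoner's dilemma payoffs from the earlier slides: a profile (a, b) is a pure-strategy Nash equilibrium if neither player gains by a unilateral deviation. Dict layout and names are illustrative.

```python
# Brute-force search for pure-strategy Nash equilibria in a
# two-player game. payoff[(alice, bob)] = (Alice's utility,
# Bob's utility) -- the prisoner's dilemma from earlier slides.
payoff = {
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"):  (0, -10),
    ("refuse",  "testify"): (-10, 0),
    ("refuse",  "refuse"):  (-1, -1),
}
actions = ["testify", "refuse"]

def pure_nash_equilibria():
    eq = []
    for a in actions:
        for b in actions:
            ua, ub = payoff[(a, b)]
            # Alice cannot improve by switching her own action...
            alice_ok = all(payoff[(a2, b)][0] <= ua for a2 in actions)
            # ...and neither can Bob.
            bob_ok = all(payoff[(a, b2)][1] <= ub for b2 in actions)
            if alice_ok and bob_ok:
                eq.append((a, b))
    return eq

equilibria = pure_nash_equilibria()
```

The search finds exactly one equilibrium, (testify, testify), matching the dominant strategy analysis; note that (refuse, refuse) is not an equilibrium even though both players prefer it.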
31
Another game
– Acme: a hardware manufacturer, chooses between CD and DVD format for its next game platform
– Best: a software manufacturer, chooses between CD and DVD format for its next title
32
No dominant strategy
Verify that there is no dominant strategy
33
Yet two Nash equilibria exist
Outcome 1: (DVD, DVD) … (9, 9)
Outcome 2: (CD, CD) … (5, 5)
If either player unilaterally changes strategy, that player will be worse off
34
We still have a problem
Two Nash equilibria, but which is selected?
If the players fail to select the same strategy, both will lose
– they could "agree" to select the Pareto optimal solution, which seems reasonable
– they could coordinate
35
Repeated games
Imagine the same game played multiple times
– payoffs accumulate for each player
– the optimal strategy is a function of the game history: you must select the optimal action for each possible history
Strategies:
– perpetual punishment: cross me once and I'll take us both down forever
– tit for tat: cross me once and I'll cross you on the subsequent move
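These history-dependent strategies can be sketched as functions of the two players' move histories. The simulation below pits tit-for-tat (cooperate, i.e. "refuse", first, then mirror the opponent's last move) against an always-defect strategy on the prisoner's dilemma payoffs; every name here is illustrative, not from the lecture.

```python
# Repeated prisoner's dilemma: a strategy maps the game history to
# the next move. Payoffs accumulate over rounds.
payoff = {  # payoff[(row, col)] = (row player utility, col utility)
    ("testify", "testify"): (-5, -5),
    ("testify", "refuse"):  (0, -10),
    ("refuse",  "testify"): (-10, 0),
    ("refuse",  "refuse"):  (-1, -1),
}

def tit_for_tat(mine, theirs):
    """Cooperate (refuse) first; then copy the opponent's last move."""
    return "refuse" if not theirs else theirs[-1]

def always_defect(mine, theirs):
    return "testify"

def play(strat_a, strat_b, rounds):
    """Run the repeated game, returning each player's total payoff."""
    ha, hb, tot_a, tot_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(ha, hb), strat_b(hb, ha)
        ua, ub = payoff[(a, b)]
        ha.append(a); hb.append(b)
        tot_a += ua; tot_b += ub
    return tot_a, tot_b
```

Over 10 rounds, two tit-for-tat players cooperate throughout and score (-10, -10), far better than the (-50, -50) of two always-defectors: repetition makes cooperation sustainable.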
36
The design of games
Let's invert the strategy selection process to design fair and effective games
Tragedy of the commons:
– individual farmers bring their livestock to the town commons to graze
– the commons is destroyed and all experience negative utility
– all behaved rationally: refraining would not have saved the commons, as someone else's livestock would have eaten it
Externalities are a way to place a value on changes in global utility
– power utilities pay for the utility they deprive neighboring communities of (yet another Nobel prize in economics for this: Coase, a professor at UVa)
37
Auctions
English auction:
– the auctioneer incrementally raises the bid price until one bidder remains
– the winning bidder gets the item at the highest price of another bidder plus the increment (perhaps the highest bidder would have spent more?)
– the strategy is simple: keep bidding until the price is higher than your utility for the item
– the strategies of the other bidders are irrelevant
38
Auctions
Sealed-bid auction:
– place your bid in an envelope; the highest bid is selected
– say your value for the item is v and you believe the highest competing bid is b: bid min(v, b + ε)
– the player who places the highest value on the good may not win it, and players must contemplate the other players' values
39
Auctions
Vickrey auction (a sealed-bid auction):
– the winner pays the price of the second-highest bid
– the dominant strategy is to bid exactly what the item is worth to you
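A minimal sketch of the second-price rule; the bidder names and values below are made up for illustration.

```python
# Vickrey (second-price, sealed-bid) auction: the highest bidder wins
# but pays only the second-highest bid.

def vickrey(bids):
    """bids: dict mapping bidder -> sealed bid.
    Returns (winner, price the winner pays)."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    return ranked[0], bids[ranked[1]]

# Because the price is set by the runner-up, bidding your true value
# is a dominant strategy, so each bidder simply bids their valuation.
values = {"ann": 70, "bea": 50, "cal": 90}
winner, price = vickrey(values)  # "cal" wins and pays 70
```

Note that cal's own bid does not affect the price he pays, only whether he wins; that is exactly why shading or inflating the bid cannot help.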
40
Auctions
These auction algorithms can find their way into computer-controlled systems:
– networking (routers, Ethernet)
– thermostat control in offices (Xerox PARC)