Convergence, Targeted Optimality, and Safety in Multiagent Learning


1 Convergence, Targeted Optimality, and Safety in Multiagent Learning
Doran Chakraborty and Peter Stone, Learning Agents Research Group, University of Texas at Austin

2 Multiple Autonomous Agents
Non-stationarity in the environment
Overlap of spheres of influence
Ignoring other agents and treating everything else as the environment can be suboptimal
Each agent needs to learn the behavior of the other agents in its sphere of influence
Hence: MULTIAGENT LEARNING

3 Multiagent Learning from a Game-Theoretic Perspective
Agents are involved in a repeated matrix game: an n-player, n-action matrix game
On each time step, each agent sees only the joint action, and hence the payoffs of every agent
Is there any way for an agent to guarantee certain payoffs (if not the best possible) against unknown opponents?

4 Contributions
First multiagent learning algorithm, called Convergence with Model Learning and Safety (CMLeS), that in an n-player, n-action repeated game achieves:
Convergence: converges to a Nash equilibrium with probability 1 in self play
Targeted optimality: against a set of memory-bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability; this also holds for opponents that eventually become memory bounded, and is achieved in the best reported time complexity
Safety: against every other unknown agent, ensures the maximin payoff

5 High level overview of CMLeS
Try to coordinate to a Nash equilibrium, assuming all other agents are CMLeS agents
If all other agents are indeed CMLeS agents: Convergence achieved
If the other agents are not CMLeS agents: try to model them as memory bounded with maximum memory size Kmax (play MLeS)
If the other agents are memory bounded with memory size at most Kmax: Targeted optimality achieved
If the other agents are arbitrary: Safety achieved
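To make the flowchart concrete, here is a minimal sketch of the top-level decision logic it describes; the function and argument names are illustrative placeholders rather than the paper's notation, and the real algorithm interleaves these modes with testing and signaling instead of choosing once:

```python
def cmles_mode(others_follow_protocol, memory_bounded_model):
    """Which guarantee governs play, per the flowchart above (sketch only)."""
    if others_follow_protocol:
        # Self play: every agent runs CMLeS, so coordinate toward a Nash equilibrium.
        return "coordinate to a Nash equilibrium (Convergence)"
    if memory_bounded_model is not None:
        # Opponents look memory bounded with memory <= Kmax: best-respond via MLeS.
        return "best-respond to the learned opponent model via MLeS (Targeted optimality)"
    # Opponents are arbitrary: fall back to the maximin strategy.
    return "play the maximin strategy (Safety)"
```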

6 A motivating example: Battle of Sexes
Payoff matrix (Alice's payoff, Bob's payoff); Alice chooses the row, Bob the column:
           Bob: B   Bob: S
Alice: B    1, 2     0, 0
Alice: S    0, 0     2, 1
3 Nash equilibria: 2 in pure strategies, 1 in mixed strategies (each player goes to its preferred event 2/3 of the time)
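As a quick check of the mixed equilibrium quoted above, the snippet below derives it from the indifference conditions of this particular 2x2 game; the dictionaries simply encode the payoff matrix, and this is illustrative arithmetic, not part of CMLeS:

```python
# Payoffs (Alice, Bob): (B,B)=(1,2), (B,S)=(0,0), (S,B)=(0,0), (S,S)=(2,1).
a = {("B", "B"): 1, ("B", "S"): 0, ("S", "B"): 0, ("S", "S"): 2}  # Alice's payoffs
b = {("B", "B"): 2, ("B", "S"): 0, ("S", "B"): 0, ("S", "S"): 1}  # Bob's payoffs

# p = P(Alice plays B), chosen so Bob is indifferent between B and S:
#   p*b[(B,B)] + (1-p)*b[(S,B)] = p*b[(B,S)] + (1-p)*b[(S,S)]
p = (b[("S", "S")] - b[("S", "B")]) / (
    b[("B", "B")] - b[("S", "B")] - b[("B", "S")] + b[("S", "S")])

# q = P(Bob plays B), chosen so Alice is indifferent between B and S:
#   q*a[(B,B)] + (1-q)*a[(B,S)] = q*a[(S,B)] + (1-q)*a[(S,S)]
q = (a[("S", "S")] - a[("B", "S")]) / (
    a[("B", "B")] - a[("B", "S")] - a[("S", "B")] + a[("S", "S")])

print(p, q)  # 1/3 and 2/3: each player attends its preferred event 2/3 of the time
```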

7 Assume Hypothesis H0 = Bob is a CMLeS agent.

8 p = 0
Assume hypothesis H0: Bob is a CMLeS agent
Use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob; assume the agents choose the mixed-strategy Nash equilibrium, so Alice plays B with probability 1/3 while Bob plays B with probability 2/3
Compute a schedule Np and εp, e.g. Np = 100 and εp = 0.1

9 p = 0
Assume hypothesis H0: Bob is a CMLeS agent
As on the previous slide: compute the Nash strategy (Alice plays B with probability 1/3, Bob with probability 2/3) and the schedule Np = 100, εp = 0.1
Then play your own part of the Nash strategy for Np episodes

10 p = 0, 1, 2, …
Assume hypothesis H0: Bob is a CMLeS agent
In each phase p: compute the Nash strategy and the schedule Np, εp, and play your own part of the Nash strategy for Np episodes
Then ask: did any agent deviate by more than εp from its Nash strategy?
Here Alice played a1 31% of the time and Bob played a1 65% of the time, so the answer is NO
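The deviation test on this slide amounts to comparing each agent's empirical action frequency against its Nash probability. A minimal sketch with the slide's numbers follows; the function name and the per-action check are illustrative simplifications of the test:

```python
def deviated(empirical_freq, nash_prob, eps):
    """True if an empirical action frequency is more than eps away from the Nash probability."""
    return abs(empirical_freq - nash_prob) > eps

eps_p = 0.1
# Alice played B 31% of the time (Nash: 1/3); Bob played B 65% of the time (Nash: 2/3).
print(deviated(0.31, 1 / 3, eps_p), deviated(0.65, 2 / 3, eps_p))  # False False, so the answer is NO
```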

11 p = 0, 1, 2, …
Assume hypothesis H0: Bob is a CMLeS agent
In each phase p: use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob, compute the schedule Np, εp, and play your own part of the Nash strategy for Np episodes
If no agent deviated by more than εp from its Nash strategy: move on to the next phase
If some agent deviated: signal by playing according to a fixed behavior, then check for consistency; if the check passes, resume coordinating

12 When Bob is a memory bounded agent
Alice still assumes hypothesis H0: Bob is a CMLeS agent, and plays the Nash coordination phases p = 0, 1, 2, … as before
When a deviation by more than εp is detected, Alice signals, keeping a counter C of signaling rounds (C++ after each round):
C == 0: play a1 Kmax+1 times
C == 1: play a1 Kmax times followed by another random action apart from a1
C > 1: play a1 Kmax+1 times and check for consistency
If the consistency check fails (NO): reject H0 and play MLeS
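The branching on the counter C can be written out as the probe sequence Alice commits to in each signaling round. A hedged sketch follows; the consistency test itself, i.e. what Bob's responses are compared against, is not shown on the slide and is omitted here:

```python
import random

def probe_actions(c, a1, other_actions, k_max, rng=random):
    """Actions Alice plays in signaling round c, following the slide's branching (sketch)."""
    if c == 0:
        return [a1] * (k_max + 1)                           # C == 0: a1 repeated Kmax+1 times
    if c == 1:
        return [a1] * k_max + [rng.choice(other_actions)]   # C == 1: Kmax times a1, then a different action
    return [a1] * (k_max + 1)                               # C > 1: a1 repeated Kmax+1 times again

# Example with the Battle of Sexes actions and Kmax = 3:
print(probe_actions(0, "B", ["S"], 3))  # ['B', 'B', 'B', 'B']
print(probe_actions(1, "B", ["S"], 3))  # ['B', 'B', 'B', 'S']
```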

13 Contributions of CMLeS
First MAL algorithm that, in an n-player, n-action repeated game:
Converges to a Nash equilibrium with probability 1 in self play (Convergence)
Against a set of memory-bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability (Targeted optimality); this also holds for opponents that eventually become memory bounded, and is achieved in the best reported time complexity
Against every other unknown agent, eventually ensures safety (Safety)

14 How to play against memory bounded opponents?
Playing against a memory-bounded opponent can be modeled as a Markov Decision Process (MDP) (Chakraborty and Stone, ECML'08)
The adversary induces the MDP, hence it is known as an Adversary Induced MDP (AIM)
The state space of the AIM is the set of all feasible joint histories of size K
The transition and reward functions of the AIM are determined by the opponent's strategy
Both K and the opponent's strategy are unknown and hence need to be learned; a sketch of the induced dynamics follows
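Under the assumption that the opponent plays a fixed strategy over joint histories of size K, the AIM dynamics can be sketched as below. The function and argument names are illustrative, not the paper's notation:

```python
def aim_outcomes(state, my_action, opponent_policy, my_payoff):
    """All (probability, next_state, reward) triples from `state` when I play `my_action`.

    state           -- tuple of the last K joint actions, e.g. (("B", "B"), ("S", "S"))
    opponent_policy -- dict mapping each state to a dict {opponent_action: probability}
    my_payoff       -- dict mapping (my_action, opponent_action) to my immediate payoff
    """
    outcomes = []
    for opp_action, prob in opponent_policy[state].items():
        next_state = state[1:] + ((my_action, opp_action),)  # slide the length-K window forward
        outcomes.append((prob, next_state, my_payoff[(my_action, opp_action)]))
    return outcomes
```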

15 Adversary Induced MDP (AIM)
Assume Bob is a memory-bounded opponent with K = 2
At time t the joint history, i.e. the AIM state, is (B,B)(S,S), and Alice plays action S

16 Adversary Induced MDP (AIM)
Assume Bob is a memory-bounded opponent with K = 2
At time t the state is (B,B)(S,S) and Alice plays action S; the next state will be (S,S)(S,?), where ? is whatever action Bob plays

17 Adversary Induced MDP (AIM)
From the memory (B,B)(S,S), Bob plays S with probability 0.3, leading to next state (S,S)(S,S) with reward 2 for Alice, and plays B with probability 0.7, leading to next state (S,S)(S,B) with reward 0
The optimal policy for this AIM is the optimal way of playing against Bob
How to achieve it? Use MLeS
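Plugging the slide's numbers into that picture; Bob's 0.3/0.7 behavior at memory (B,B)(S,S) is the illustrative assumption from the slide, and the rewards are Alice's Battle of Sexes payoffs:

```python
state = (("B", "B"), ("S", "S"))
bob_at_state = {"S": 0.3, "B": 0.7}            # Bob's assumed behavior for this memory
alice_payoff = {("S", "S"): 2, ("S", "B"): 0}  # Alice plays S; her payoff depends on Bob's reply

for bob_action, prob in bob_at_state.items():
    next_state = state[1:] + (("S", bob_action),)
    print(prob, next_state, alice_payoff[("S", bob_action)])
# 0.3 (('S', 'S'), ('S', 'S')) 2
# 0.7 (('S', 'S'), ('S', 'B')) 0
# Expected immediate reward for Alice playing S in this state: 0.3*2 + 0.7*0 = 0.6
```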

18 Flowchart of MLeS
At the start of episode t:
Compute the best estimate of K using the Find-K algorithm; let that be k
Is k a valid value?
YES: run RMax assuming that the true underlying AIM is of size k
NO: play the safety strategy
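The flowchart's episode-level branching is small enough to write down directly. A sketch, assuming Find-K hands back either an estimate or None when no value is valid yet; the names are illustrative, not the paper's pseudocode:

```python
def mles_step(k_estimate, k_max):
    """What MLeS does at the start of an episode, per the flowchart (sketch only)."""
    if k_estimate is not None and k_estimate <= k_max:
        # Valid estimate: plan in the induced MDP of memory size k_estimate with RMax.
        return f"run RMax on the AIM with memory size {k_estimate}"
    # No valid estimate: fall back to the maximin (safety) strategy.
    return "play the safety strategy"
```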


20 Find-K algorithm
Goal: figure out the opponent's memory size, from candidates 0, 1, 2, …, K, K+1, …, Kmax
For each candidate memory size k, maintain a statistic Δkt: the amount of information lost by modeling Bob as a k-memory-sized opponent as opposed to a (k+1)-memory-sized opponent

21 Find-K algorithm
Illustrative Δ values at time t, as laid out on the slide: Δ0t = 0.4, Δ1t = 0.3, ΔKt = 0.05, ΔK+1t = 0.001, ΔKmaxt = 0.01

22 Find-K algorithm
Each Δkt is compared against a threshold σkt; illustrative values as laid out on the slide: σ0t = 0.0001, σ1t = 0.0002, σKt = 0.002, σK+1t = 0.02, σKmaxt = 0.07

23 Picks K with probability at least 1 - δ
Each threshold comparison is allowed to fail with probability at most δ/Kmax; taking a union bound over the comparisons, Find-K picks the true memory size K with probability at least 1 - δ
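One plausible reading of the selection rule sketched on these slides, written as code: pick the smallest memory size whose information-loss estimates stay below their confidence thresholds from that point on. This is an assumption made for illustration; the paper's exact statistic, thresholds, and test differ in detail:

```python
def find_k_sketch(delta, sigma):
    """delta[k], sigma[k] for k = 0..Kmax; return the estimated memory size, or None if no value is valid."""
    k_max = len(delta) - 1
    for k in range(k_max + 1):
        # Accept k if the information lost beyond memory size k is below threshold everywhere.
        if all(delta[j] <= sigma[j] for j in range(k, k_max + 1)):
            return k
    return None  # no valid value yet: MLeS would fall back to the safety strategy
```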

24 Theoretical properties of MLeS
Find-K needs only a polynomial number of visits to every feasible joint history of size K to find the true opponent memory size K with probability at least 1 - δ; polynomial in 1/δ and Kmax
The overall time complexity of computing an ε-best response against a memory-bounded opponent is then polynomial in the number of feasible joint histories of size K, Kmax, 1/δ, and 1/ε
For opponents that cannot be modeled as a Kmax-memory-bounded opponent, MLeS converges to the safety strategy with probability 1, in the limit


26 Conclusion and Future Work
A new multiagent learning algorithm, CMLeS, that achieves:
Convergence
Targeted optimality against memory-bounded adversaries, in the best reported time complexity
Safety
Future work: what if there is a mixed population of agents? how to incorporate no-regret or bounded regret? agents in graphical games


