Cooperation via Policy Search and Unconstrained Minimization Brendan and Yifang Feb 24 2015.


1 Cooperation via Policy Search and Unconstrained Minimization Brendan and Yifang Feb

2 Paper: Learning to Cooperate via Policy Search, by Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling

3 Introduction
- Previously we’ve been concerned with single agents, oblivious environments and one-to-one value functions
  - Impractical for large/diverse environments
  - Impractical for complicated/reactive systems
  - More generalizable approach desirable
- RL Chapter 8 introduced function approximation to build more accurate value functions with small samples
- Peshkin et al. introduce multiple agents and world state information asymmetry
  - Wide range of potential new applications
  - Particularly relevant to game theory
  - Compatible with connectionist models

4 Game Theory
- “Game theory is the study of the ways in which interacting choices of economic agents produce outcomes with respect to the preferences (or utilities) of those agents, where the outcomes in question might have been intended by none of the agents.” - Stanford Encyclopedia of Philosophy
- Example – Volunteer’s Dilemma: N agents share a common reward if at least one agent volunteers, but volunteering has an associated cost (e.g. the power goes out and at least one person has to make an expensive sat-phone call to the electric company to get power restored)
  - Cooperative
  - Multi-agent
  - Imperfect world state information
  - Analytic solutions exist
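As one illustration of an analytic solution, the sketch below computes the symmetric mixed-strategy equilibrium of a textbook volunteer's dilemma. The payoff parametrization is an assumption, not taken from the slides: every agent receives `benefit` if at least one agent volunteers, a volunteer additionally pays `cost`, and everyone receives 0 if nobody volunteers. In equilibrium each agent must be indifferent between volunteering and abstaining, which gives an abstain probability of (cost / benefit)^(1 / (N - 1)).

```python
def abstain_probability(n_agents, benefit, cost):
    """Equilibrium probability that a single agent does NOT volunteer.

    Assumes the textbook payoffs described above (hypothetical parameters):
    requires 0 < cost < benefit and at least two agents.
    """
    if n_agents < 2 or not 0.0 < cost < benefit:
        raise ValueError("need n_agents >= 2 and 0 < cost < benefit")
    return (cost / benefit) ** (1.0 / (n_agents - 1))


def expected_payoffs(n_agents, benefit, cost, q):
    """Expected payoff of each pure action when all others abstain w.p. q."""
    volunteer = benefit - cost                          # reward is certain
    abstain = benefit * (1.0 - q ** (n_agents - 1))     # someone else must act
    return volunteer, abstain


q = abstain_probability(5, benefit=10.0, cost=2.0)
print("abstain probability:", q)
print("payoffs (volunteer, abstain):", expected_payoffs(5, 10.0, 2.0, q))
print("probability nobody volunteers:", q ** 5)
```

The two expected payoffs coincide at the computed `q`, which is the indifference condition that defines the mixed equilibrium.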

5 Multi-Agent Environments

6 Paper Specifics
- Paper adopts the Partially Observable Identical Payoff Stochastic Game model with some caveats
- Picked to be interesting
  - Allows theoretical guarantees
  - Requires a non-trivial distributed algorithm
  - Some game theory applications

7 Identical Payoff Stochastic Game (IPSG)

8 Reward Questions

9 Reward Questions Cont.
- Question: “Its title is about cooperation. Does the reward function r reflect the idea of cooperation?” – Jiyun Luo
- Paper’s definition: “Cooperative games are those in which both agents share the same payoff structure.”
  - Identical payoff stochastic games meet this requirement by definition
  - Cooperation effect seemingly evident in some examples, e.g. soccer
- More generally: “a cooperative game is a game where groups of players ("coalitions") may enforce cooperative behavior, hence the game is a competition between coalitions of players, rather than between individual players.”
  - Typically entails communication and enforcement not present in the model
  - Similarly, joint strategies are not explicit in this model

10 Partially Observable IPSG (POIPSG)

11 Asymmetric Information Question
- Question: “The paper investigates the problem in which "agents all receive the shared reward signal, but have incomplete, unreliable, and generally different perceptions of the world state." How important is this problem? What are the real world applications to this model?” – Yuankai
- Economics: Equal shareholders in a company all benefit from its success, but may have different world state information
- Politics: Citizens all have a common goal, but differ in knowledge
- Shared rewards are common (e.g. volunteer’s dilemma), but identical rewards are somewhat unrealistic
- Model arguably lacks the communication and enforcement necessary for more complex economic problems; potentially representable with the observation function?

12 Model Assumptions
- State space is discrete
- Agents have finite memory
  - Important for realism
  - Paper outlines a finite state controller for each agent, but uses fully reactive policies in practice
- Factored controller (i.e. distributed as opposed to centralized): each agent has its own sub-policy
  - Not compatible with simple communication models
- Learning is simultaneous
- Actions happen simultaneously, coordinated by the reward result
- Policy defines the probability of each action as a continuously differentiable function of the policy weights

13 Joint Controller vs. Factored Controller

14 Control Question
- “What is the major difference (mathematically) between central control vs. distributed control of factored actions?” – Dr. Yang
- In general, see the previous slide
- In the case of gradient descent, none
  - Action probabilities of each agent (and consequently the distributed partial derivatives) are independent
  - As such, weight updates can be distributed without cooperation
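The independence argument can be written out in one line. The notation here is an assumption for illustration: agent $i$ has weights $w_i$, sees observation $o_i$, and picks action $a_i$ with probability $\pi_i$; reactive (memoryless) policies are assumed for brevity.

```latex
\begin{align*}
\pi(a \mid o; w) &= \prod_{i=1}^{n} \pi_i(a_i \mid o_i; w_i) \\
\frac{\partial}{\partial w_i} \log \pi(a \mid o; w)
  &= \sum_{j=1}^{n} \frac{\partial}{\partial w_i} \log \pi_j(a_j \mid o_j; w_j)
   = \frac{\partial}{\partial w_i} \log \pi_i(a_i \mid o_i; w_i)
\end{align*}
```

Every term with $j \ne i$ vanishes because $\pi_j$ does not depend on $w_i$, so each agent can compute its block of the joint gradient from its own observations, its own actions, and the shared reward.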

15 Algorithm Question

16 Simple Artificial Neuron

17 Algorithm - REINFORCE

18 Algorithm – Gradient Descent for Policy Search

19 Gradient Descent for Policy Search Cont.

20 Algorithm – Distributed Gradient Descent
- Now that we can calculate the gradient for each trial, using gradient descent is straightforward
  - Gradient descent implementations will be covered later
- To distribute the action determination to individual agents (consistent with our model), simply have each agent perform gradient descent locally
  - Each agent is given some set of initial weights
  - Each agent adjusts its own weights given its local observation and the reward for each round
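A minimal sketch of the distributed update, under assumptions that are not from the paper: a stateless two-agent matching game (shared reward 1 when both agents pick the same action, 0 otherwise), one weight per agent, and a Bernoulli policy with P(action = 1) = sigmoid(w). It is written as gradient ascent on the reward, which is the same as descent on the negative reward. The function names and the default `lr`, `episodes`, and `seed` values are hypothetical.

```python
import math
import random


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def run_distributed_reinforce(episodes=5000, lr=0.1, seed=0):
    """Each agent updates only its own weight from the shared reward."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]  # one logit per agent
    for _ in range(episodes):
        probs = [sigmoid(w) for w in weights]
        # Agents sample their actions independently of each other.
        actions = [1 if rng.random() < p else 0 for p in probs]
        reward = 1.0 if actions[0] == actions[1] else 0.0  # identical payoff
        for i in range(2):
            # For a Bernoulli-sigmoid policy, d/dw log pi_i(a_i) = a_i - p_i.
            # Agent i needs only its own action, its own weight, and the reward.
            weights[i] += lr * reward * (actions[i] - probs[i])
    return weights


def expected_reward(weights):
    """Exact expected shared reward under the current factored policy."""
    p1, p2 = (sigmoid(w) for w in weights)
    return p1 * p2 + (1.0 - p1) * (1.0 - p2)


final_weights = run_distributed_reinforce()
print("weights:", final_weights)
print("expected reward:", expected_reward(final_weights))
```

No agent ever reads the other agent's weight or action; the only coupling is the shared reward signal, which is the point of the factored controller.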

21 Guarantees
- For factored controllers, distributed gradient descent is equivalent to joint gradient descent
- Thus, distributed gradient descent finds a local optimum
  - Centralized gradient descent finds a local optimum simply because it’s a descent algorithm that stops at an optimum
  - Distributed is equivalent to joint
- Every strict Nash equilibrium is a local optimum for gradient descent
  - The converse does not hold (discussed later)

22 Nash Equilibrium

23 NE Question
- “If we are modeling two agents in a system, say a user and a machine. To reach Nash equilibrium, can we restrict what the user can or should do in the process? Is this reasonable? If not, that means that the user can do whatever he/she wants, and acts not optimal, any comments on the Nash equilibrium under such condition? How can the machine still optimize for both parties?”
- A NE is two-sided by definition. If a user can improve their return (i.e. isn’t acting optimally), it isn’t a NE.
- Rational agents are typically a good assumption, though it often requires a very sophisticated model to reflect real life
- Partial observability may be a good way to model mostly rational agents, e.g. a user interacting with a search engine
- The machine may have a dominant strategy irrespective of the user’s actions, but it’s generally not interesting to model such problems with multiple agents

24 NE Equivalence Question
- “Can you explain when a local optima for gradient descent would not be a Nash equilibrium?” – Brad
- Gradient descent is only guaranteed to search locally
- One agent may have a better strategy (far removed) that improves the payoff while the other agents keep their strategies fixed
- “We can construct a value function V (w1, w2) such that for some c, V (·, c) has two modes, one at V (a, c) and the other at V (b, c), such that V (b, c)> V (a, c). Further assume that V (a, ·) and V (b, ·) each have global maxima V (a, c) and V (b, c). Then V (a, c) is a local optimum that is not a Nash equilibrium”
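The quoted construction can be made concrete with a small numeric sketch. The specific value function below is a hypothetical choice that fits the description (double-peaked in w1, single-peaked in w2 at c = 0); it is not the paper's example. Local gradient ascent started near the lower peak stops there, even though agent 1 could raise the shared value by jumping to the higher peak while agent 2 stays put.

```python
def value(w1, w2, c=0.0):
    """Hypothetical shared value: two modes in w1, one mode in w2 at w2 = c."""
    return -(w1 * w1 - 1.0) ** 2 + 0.3 * w1 - (w2 - c) ** 2


def grad(w1, w2, c=0.0):
    """Partial derivatives of value() with respect to w1 and w2."""
    d1 = -4.0 * w1 * (w1 * w1 - 1.0) + 0.3
    d2 = -2.0 * (w2 - c)
    return d1, d2


def ascend(w1, w2, lr=0.05, steps=500):
    """Plain gradient ascent; each coordinate uses only its own partial."""
    for _ in range(steps):
        d1, d2 = grad(w1, w2)
        w1 += lr * d1
        w2 += lr * d2
    return w1, w2


a, c_a = ascend(-1.2, 0.5)   # converges to the lower mode in w1
b, c_b = ascend(1.2, 0.5)    # converges to the higher mode in w1

print("local optimum (a, c):", (a, c_a), "value:", value(a, c_a))
print("better reply  (b, c):", (b, c_a), "value:", value(b, c_a))
```

At (a, c) the gradient is zero, so ascent halts, but it is not a Nash equilibrium because agent 1 has a strictly better unilateral reply at b.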

25 Example - Toy Coordination problem [Figure: state diagram; only the state labels s1 through s5 survive from the slide]

26 Experimental Results

27 Example - Soccer
- 6x5 grid with 2 randomly placed agents and 1 defender
- Agent actions: {North, South, East, West, Stay, Pass}
- Agents observe who possesses the ball and the status of their surroundings
- Defenders can be Random, Greedy (goes out to block the ball), or Defensive (stays in goal to block)

28 Soccer optimal solution question
- “The paper shows a distributed learning algorithm in cooperative multi-agent domain by a 2-agents soccer game example. It points out that, an algorithm which can achieve optimal payoff for all agents may not be possible in general. Why not? Can you explain this in the class?” – Sicong
- Solving a POIPSG completely is intractable
- Analytical solutions are sometimes possible

29 Q-learning

30 Experimental Results – Defensive Opponents

31 Experimental Results – Greedy Opponents

32 Experimental Results – Random Opponents

33 Experimental Results – Mixed Opponents

34 Tractability Question
- “The paper gives the example of tractable soccer game with two learning agents playing against one opponent with fixed strategy and then it mentions that it becomes difficult to even store the Q table. The question is, what can be the possible to solve complex environments that are useful for practical applications, if any? Discussion on this?” – Tavish
- Recall: Q-learning was mainly introduced as motivation
- Replacing the Q table with a compact function approximation based on sampling is much more practical for large examples, and the loss in accuracy is often acceptable

35 Conclusions
- Reinforcement learning can easily be generalized to include multiple agents and applied to game theory
- Distributed gradient descent has nice guarantees and performs well in POIPSG
- Distributed gradient descent can produce “cooperation” against consistent and coordinated opponents

36 Convex Optimization: Section 9.1 Stephen Boyd and Lieven Vandenberghe

37 Problem
- Our goal is to solve the unconstrained minimization problem: minimize $f(x)$
- Where $f(x)$ is convex and twice continuously differentiable
- In this case, a necessary and sufficient condition for $x^\star$ to be optimal is $\nabla f(x^\star) = 0$
- So what is convex?

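38 Convex, strictly convex
- $f(x)$ is called convex if $f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y)$ for all $x, y \in \operatorname{dom} f$ and $0 \le \theta \le 1$
- $f(x)$ is called strictly convex if $f(\theta x + (1-\theta) y) < \theta f(x) + (1-\theta) f(y)$ for all $x \ne y$ in $\operatorname{dom} f$ and $0 < \theta < 1$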

39 Strongly convex
- $f(x)$ is twice continuously differentiable, and it is strongly convex on a set $S$ if there is an $m > 0$ satisfying the condition $\nabla^2 f(x) \succeq mI$ for all $x \in S$, where $\nabla^2 f(x)$ is the Hessian matrix

40 Strongly convex, cont’d
- Thus, $m$ must be no larger than the smallest eigenvalue of the Hessian matrix at every point $x$ under consideration.

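41 Lower bound of $p^\star$
- According to Taylor’s theorem, we will have $f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)$ for some $z$ on the segment between $x$ and $y$, so strong convexity gives $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{m}{2} \|y - x\|_2^2$
- Then we have $p^\star \ge f(x) - \tfrac{1}{2m} \|\nabla f(x)\|_2^2$
- It means that when $\|\nabla f(x)\|_2$ is small, $f(x)$ is near its optimal value.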

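42 Upper bound of $p^\star$
- Similar to the lower bound, $p^\star$ has an upper bound: if $\nabla^2 f(x) \preceq MI$ for some constant $M$, then $p^\star \le f(x) - \tfrac{1}{2M} \|\nabla f(x)\|_2^2$.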
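The two bounds from Section 9.1 of Boyd and Vandenberghe, $f(x) - \|\nabla f(x)\|_2^2 / (2m) \le p^\star \le f(x) - \|\nabla f(x)\|_2^2 / (2M)$, can be checked numerically. The sketch below uses a hypothetical test function, f(x) = (x1^2 + GAMMA * x2^2) / 2, whose Hessian is diag(1, GAMMA), so m = 1, M = GAMMA, and p* = 0.

```python
GAMMA = 10.0                     # hypothetical conditioning parameter
M_LOWER, M_UPPER = 1.0, GAMMA    # smallest / largest Hessian eigenvalue
P_STAR = 0.0                     # minimum value, attained at the origin


def f(x):
    return 0.5 * (x[0] ** 2 + GAMMA * x[1] ** 2)


def grad_norm_sq(x):
    """Squared Euclidean norm of the gradient (x1, GAMMA * x2)."""
    return x[0] ** 2 + (GAMMA * x[1]) ** 2


def p_star_bounds(x):
    """Return (lower, upper) bounds on p* implied by m and M at the point x."""
    g2 = grad_norm_sq(x)
    lower = f(x) - g2 / (2.0 * M_LOWER)
    upper = f(x) - g2 / (2.0 * M_UPPER)
    return lower, upper


for point in [(1.0, 1.0), (0.1, -0.3), (2.0, 0.0), (0.0, 1.0)]:
    lo, hi = p_star_bounds(point)
    print(point, "->", lo, "<= p* <=", hi)
```

Both bounds tighten as the gradient norm shrinks, which is why a small gradient is a usable stopping criterion.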

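43 Condition number of sublevel sets
- We define the width of a convex set $C$ in the direction $q$, where $q$ is a unit vector: $W(C, q) = \sup_{z \in C} q^T z - \inf_{z \in C} q^T z$
- Further, we have the minimum width $W_{\min} = \inf_{\|q\|_2 = 1} W(C, q)$ and the maximum width $W_{\max} = \sup_{\|q\|_2 = 1} W(C, q)$
- The condition number of a convex set $C$ is $\mathbf{cond}(C) = W_{\max}^2 / W_{\min}^2$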
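A worked instance may help: for an ellipsoid, the width-based condition number reduces to the familiar matrix condition number. The symbols $\mathcal{E}$, $A$, and $x_0$ below are introduced only for this illustration.

```latex
\begin{align*}
\mathcal{E} &= \{\, x \mid (x - x_0)^T A^{-1} (x - x_0) \le 1 \,\}, \qquad A \succ 0 \\
W(\mathcal{E}, q) &= 2\, \| A^{1/2} q \|_2 \\
W_{\min} &= 2\, \lambda_{\min}(A)^{1/2}, \qquad W_{\max} = 2\, \lambda_{\max}(A)^{1/2} \\
\mathbf{cond}(\mathcal{E}) &= \frac{W_{\max}^2}{W_{\min}^2}
  = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} = \kappa(A)
\end{align*}
```

A long, thin ellipsoid therefore has a large condition number, and a ball has condition number 1.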

44 Question 1: What is the Hessian matrix? (From Brad)
Answer: It is the matrix of second-order partial derivatives of the function.
Question 2: Given convexity, is a local optimum guaranteed to be a global optimum?
Answer: Yes. For a convex objective, every local minimum is also a global minimum; this holds whether or not the objective is differentiable.

45 Question 3: Convexity is a strong assumption. Does the real-world problem always satisfy this assumption? What if it does not stand?
Answer: Real-world problems do not always satisfy this assumption. When they do not, we adopt other strategies, such as occasionally (with a small probability) following a direction other than the gradient-descent direction.
Question 4: Can you explain the math symbol in Section 9.1? (From Jiyun)

46 Convex Optimization: Sections 9.2 and 9.3 Stephen Boyd and Lieven Vandenberghe

47 Descent Methods

48 General descent method

49 Descent Direction Question

50 Exact Line Search

51 Backtracking Line Search

52 Backtracking Stopping Condition Question

53 Backtracking Line Search Ex.

54 Backtracking Stopping Condition Question “Could you show us graphically what it looks like when the stopping condition of the backtracking line search holds?” - Brad

55 Backtracking Stop Ex.

56 Backtracking Parameter Choice Question

57 Line Search Question

58 Gradient Descent

59 Line Search Questions Cont.
- “Since the setting is convex and twice continuously differentiable, can you compare the exact line search, backtracking line search, and the Newton’s Method?” – Sicong
  - Newton’s method is a descent method, but its step uses the Hessian as well as the gradient, so it follows a different search vector than gradient descent
  - Convergence guarantees rely on additional assumptions, such as a Lipschitz-continuous Hessian or self-concordance
- “When would one use exact, vs backtracking line search?” - Brad
  - Exact line search typically reduces the required number of steps (not guaranteed)
  - However, computing the exact minimum is often not feasible or efficient
- “What are the major differences between gradient method with exact line search and gradient method with backtracking line search? Please illustrate with an example how they differ in reaching the point of convergence.” - Tavish
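To make the exact versus backtracking comparison concrete, here is a minimal sketch on a hypothetical quadratic, f(x) = (x1^2 + GAMMA * x2^2) / 2, where exact line search has a closed form. The parameter values (`GAMMA`, `alpha`, `beta`, `tol`) and the starting point are illustrative assumptions, not settings from the slides.

```python
import math

GAMMA = 10.0  # hypothetical conditioning parameter; Hessian is diag(1, GAMMA)


def f(x):
    return 0.5 * (x[0] ** 2 + GAMMA * x[1] ** 2)


def grad(x):
    return (x[0], GAMMA * x[1])


def exact_step(x, g):
    """Exact minimizer of f(x - t*g) over t; closed form for this quadratic."""
    return (g[0] ** 2 + g[1] ** 2) / (g[0] ** 2 + GAMMA * g[1] ** 2)


def backtracking_step(x, g, alpha=0.3, beta=0.8):
    """Shrink t until f(x - t*g) <= f(x) - alpha * t * ||g||^2 holds."""
    t = 1.0
    g_sq = g[0] ** 2 + g[1] ** 2
    while f((x[0] - t * g[0], x[1] - t * g[1])) > f(x) - alpha * t * g_sq:
        t *= beta
    return t


def gradient_descent(step_rule, x0=(GAMMA, 1.0), tol=1e-6, max_iter=10000):
    """Gradient descent; returns the final point and the iteration count."""
    x = x0
    for k in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) <= tol:   # stop on a small gradient norm
            return x, k
        t = step_rule(x, g)
        x = (x[0] - t * g[0], x[1] - t * g[1])
    return x, max_iter


x_exact, k_exact = gradient_descent(exact_step)
x_back, k_back = gradient_descent(backtracking_step)
print("exact line search:  iterations =", k_exact, " f =", f(x_exact))
print("backtracking:       iterations =", k_back, " f =", f(x_back))
```

The only difference between the two runs is the step-size rule; the search direction is the negative gradient in both. For a general objective the closed-form `exact_step` is unavailable, which is the practical argument for backtracking.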

60 Newton’s Method

61 Gradient Descent with Exact Line Search

62 Gradient Descent with Backtracking Line Search

63 Exact vs. Backtracking Speed

64 Conclusions
- Based on toy experimental results
  - Gradient method offers roughly linear convergence
  - Choice of backtracking parameters has a noticeable but not dramatic effect
- Gradient descent is advantageous due to its simplicity
- Gradient descent may converge slowly in certain cases, e.g. when the condition number is large
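The "roughly linear convergence" observation matches the bound in Section 9.3.1 of Boyd and Vandenberghe for strongly convex $f$ with $mI \preceq \nabla^2 f(x) \preceq MI$. Here $\alpha$ and $\beta$ are the backtracking parameters and $x^{(k)}$ is the $k$-th iterate.

```latex
\begin{align*}
f(x^{(k)}) - p^\star &\le c^{k} \left( f(x^{(0)}) - p^\star \right), \qquad 0 \le c < 1 \\
c &= 1 - \frac{m}{M} && \text{(exact line search)} \\
c &= 1 - \min\left\{ 2 m \alpha,\; \frac{2 \beta \alpha m}{M} \right\} && \text{(backtracking line search)}
\end{align*}
```

When $M/m$ is large, $c$ is close to 1, which is the slow-convergence case.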

