
1 Space-Indexed Dynamic Programming: Learning to Follow Trajectories. J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway. Computer Science Department, Stanford University. ICML, July 2008.

2 Outline: Reinforcement Learning and Following Trajectories; Space-indexed Dynamical Systems and Space-indexed Dynamic Programming; Experimental Results.

3 Reinforcement Learning and Following Trajectories

4 Trajectory Following Consider the task of following a trajectory in a vehicle such as a car or helicopter. The state space is too large to discretize, so we can't apply tabular RL / dynamic programming.

5 Trajectory Following Dynamic programming algorithms with non-stationary policies seem well-suited to this task: Policy Search by Dynamic Programming (Bagnell et al.) and Differential Dynamic Programming (Jacobson and Mayne).

6 Dynamic Programming t=1 Divide control task into discrete time steps

7 Dynamic Programming t=1 Divide control task into discrete time steps t=2

8 Dynamic Programming t=1 Divide control task into discrete time steps t=2 t=3 t=4 t=5

9 Dynamic Programming t=1 t=2 t=3 t=4 t=5 Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

10 Dynamic Programming t=1 t=2 t=3 t=4 t=5 Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

11 Dynamic Programming t=1 t=2 t=3 t=4 t=5 Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

12 Dynamic Programming t=1 t=2 t=3 t=4 t=5 Proceeding backwards in time, learn policies for t = T, T-1, …, 2, 1

13 Dynamic Programming t=1 t=2 t=3 t=4 t=5 Key Advantage: Policies are local (only need to perform well over small portion of state space)
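
To make the backwards recursion concrete, here is a minimal sketch of time-indexed DP in the style of PSDP. The helpers sample_states, step, cost, and fit_policy are hypothetical placeholders standing in for a state-distribution sampler, the simulator, the cost function, and a policy-learning subroutine; this illustrates the structure only, not the paper's implementation.

def time_indexed_dp(T, sample_states, step, cost, fit_policy):
    """Learn one local policy per time step, working backwards from t = T."""
    policies = {}
    for t in range(T, 0, -1):          # t = T, T-1, ..., 2, 1
        states = sample_states(t)      # states the vehicle may occupy at time t

        def rollout_cost(x, u, t=t):
            # Cost of taking action u now, then following the policies
            # already learned for steps t+1, ..., T.
            total = cost(x, u, t)
            x_next = step(x, u)
            for s in range(t + 1, T + 1):
                a = policies[s](x_next)
                total += cost(x_next, a, s)
                x_next = step(x_next, a)
            return total

        # Fit a policy that only has to do well on this time step's states.
        policies[t] = fit_policy(states, rollout_cost)
    return policies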

14 Problems with Dynamic Programming Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

15 Problems with Dynamic Programming Suppose we learned a policy assuming this distribution over states

16 Problems with Dynamic Programming But, due to the natural stochasticity of the environment, the car is actually here at t = 5

17 Problems with Dynamic Programming Resulting policy will perform very poorly

18 Problems with Dynamic Programming Partial solution: re-indexing. Execute the policy closest to the current location, regardless of time.

19 Problems with Dynamic Programming Problem #2: Uncertainty over future states makes it hard to learn any good policy

20 Problems with Dynamic Programming Due to stochasticity, there is large uncertainty over states in the distant future. (Figure: distribution over states at time t = 5)

21 Problems with Dynamic Programming DP algorithms require learning a policy that performs well over the entire distribution. (Figure: distribution over states at time t = 5)

22 Space-Indexed Dynamic Programming Basic idea of Space-Indexed Dynamic Programming (SIDP): Perform DP with respect to space indices (planes tangent to trajectory)

23 Space-Indexed Dynamical Systems and Dynamic Programming

24 Difficulty with SIDP There is no guarantee that taking a single action will move the vehicle to the next plane along the trajectory. We therefore introduce the notion of a space-indexed dynamical system.

25 Time-Indexed Dynamical System Creating time-indexed dynamical systems:

26 Time-Indexed Dynamical System Creating time-indexed dynamical systems: current state

27 Time-Indexed Dynamical System Creating time-indexed dynamical systems: control action current state

28 Time-Indexed Dynamical System Creating time-indexed dynamical systems: control action current state time derivative of state

29 Time-Indexed Dynamical System Creating time-indexed dynamical systems: Euler integration, x_{t+1} = x_t + Δt · f(x_t, u_t), where x_t is the current state, u_t the control action, and f(x_t, u_t) the time derivative of the state.
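
As a concrete illustration of this construction, below is a minimal sketch of a time-indexed transition obtained by Euler integration; the kinematic car model used for f is an assumption made for illustration, not the simulator from the paper.

import numpy as np

def euler_step(x, u, f, dt=0.05):
    """Time-indexed transition x_{t+1} = x_t + dt * f(x_t, u_t), where x is
    the current state, u the control action, and f the time derivative of
    the state."""
    return x + dt * f(x, u)

def f_car(x, u):
    """Illustrative kinematic car model (an assumption, not the paper's).
    State x = [px, py, heading]; control u = steering angle; constant speed."""
    v = 1.0
    px, py, theta = x
    return np.array([v * np.cos(theta), v * np.sin(theta), v * np.tan(u)])

# Example: one Euler step from the origin with a small left steering angle.
x0 = np.array([0.0, 0.0, 0.0])
x1 = euler_step(x0, 0.1, f_car)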

30 Space-Indexed Dynamical Systems Creating space-indexed dynamical systems: simulate forward until the vehicle hits the next tangent plane (from space index d to space index d+1).

31 Space-Indexed Dynamical Systems Creating space-indexed dynamical systems: space index d space index d+1

32 Space-Indexed Dynamical Systems Creating space-indexed dynamical systems (space index d to space index d+1): a positive solution exists as long as the controller makes some forward progress.

33 Space-Indexed Dynamical Systems The result is a dynamical system indexed by the spatial-index variable d rather than by time. Space-indexed dynamic programming runs DP directly on this system.
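
Below is a minimal sketch of one transition of such a space-indexed dynamical system, assuming a continuous dynamics function f, a pos helper that extracts the vehicle's position from the state, and a plane at the next space index represented by a point on the trajectory together with a normal pointing along the direction of travel; these representational choices are assumptions for illustration, not the paper's exact construction.

import numpy as np

def space_indexed_step(x, u, f, pos, plane_point, plane_normal,
                       dt=0.01, max_steps=100000):
    """Hold control u fixed and integrate the continuous dynamics forward in
    small Euler steps until the vehicle's position crosses the plane attached
    to the next point on the trajectory."""
    for _ in range(max_steps):
        x = x + dt * f(x, u)                                   # Euler integration
        if np.dot(pos(x) - plane_point, plane_normal) >= 0.0:  # crossed plane d+1
            return x
    # The controller made no forward progress, so no positive solution exists.
    raise RuntimeError("vehicle never reached the next space index")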

34 Space-Indexed Dynamic Programming Divide trajectory into discrete space planes d=1

35 Space-Indexed Dynamic Programming Divide trajectory into discrete space planes d=1 d=2

36 Space-Indexed Dynamic Programming Divide trajectory into discrete space planes d=1 d=2 d=3 d=4 d=5

37 Space-Indexed Dynamic Programming d=1 d=2 d=3 d=4 d=5 Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

38 Space-Indexed Dynamic Programming d=1 d=2 d=3 d=4 d=5 Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

39 Space-Indexed Dynamic Programming d=1 d=2 d=3 d=4 d=5 Proceeding backwards, learn policies for d = D, D-1, …, 2, 1

40 Space-Indexed Dynamic Programming d=1 d=2 d=3 d=4 d=5 Proceeding backwards, learn policies for d = D, D-1, …, 2, 1
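
A hedged sketch of this space-indexed backwards recursion is given below; it mirrors the time-indexed sketch earlier, with the space index d in place of the time step and a space-indexed transition (for example, space_indexed_step above, wrapped to look up the d-th plane) in place of the time-step function. The helpers sample_states_on_plane, space_step, cost, and fit_policy are again hypothetical placeholders.

def space_indexed_dp(D, sample_states_on_plane, space_step, cost, fit_policy):
    """Learn one local policy per space plane, working backwards from d = D."""
    policies = {}
    for d in range(D, 0, -1):                 # d = D, D-1, ..., 2, 1
        states = sample_states_on_plane(d)    # states lying on plane d

        def rollout_cost(x, u, d=d):
            # Cost of acting at plane d, then following the policies already
            # learned for planes d+1, ..., D; space_step(x, u, k) advances
            # the state from plane k to plane k+1.
            total, x_next = cost(x, u, d), space_step(x, u, d)
            for k in range(d + 1, D + 1):
                a = policies[k](x_next)
                total += cost(x_next, a, k)
                x_next = space_step(x_next, a, k)
            return total

        policies[d] = fit_policy(states, rollout_cost)
    return policies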

41 Problems with Dynamic Programming Problem #1: Policies from traditional dynamic programming algorithms are time-indexed

42 Space-Indexed Dynamic Programming Time-indexed DP may execute a policy that was learned for a different location. Space-indexed DP always executes the policy for the vehicle's current spatial index.
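
To illustrate the difference at execution time, here is a hedged sketch of running a space-indexed policy online: at each control cycle the vehicle's position is mapped to its current space index, and the policy learned for that index is applied. The nearest-trajectory-point rule used here is an assumption for illustration.

import numpy as np

def current_space_index(position, trajectory_points):
    """Assumed rule: the current space index is the index of the nearest
    point on the reference trajectory (planes numbered 1..D)."""
    dists = np.linalg.norm(trajectory_points - position, axis=1)
    return int(np.argmin(dists)) + 1

def run_space_indexed_policy(x0, policies, trajectory_points, pos, step, n_steps):
    """Execute by spatial index: at every control cycle, apply the policy
    learned for the plane the vehicle is currently closest to, regardless of
    how much time has elapsed."""
    x = x0
    for _ in range(n_steps):
        d = current_space_index(pos(x), trajectory_points)
        x = step(x, policies[d](x))
    return x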

43 Problems with Dynamic Programming Problem #2: Uncertainty over future states makes it hard to learn any good policy

44 Space-Indexed Dynamic Programming Time-indexed DP: wide distribution over future states. Space-indexed DP: much tighter distribution over future states. (Figures: distribution over states at time t = 5 vs. distribution over states at index d = 5)

45 Space-Indexed Dynamic Programming Time-indexed DP: wide distribution over future states. Space-indexed DP: much tighter distribution over future states. (Figures: distribution over states at time t = 5 vs. distribution over states at index d = 5)

46 Experiments

47 Experimental Domain Task: following a race-track trajectory in an RC car, with randomly placed obstacles.

48 Experimental Setup Implemented a space-indexed version of the PSDP algorithm. The policy chooses a steering angle using an SVM classifier (constant velocity); a simple textbook model of the car dynamics served as the simulator for learning the policy (a sketch of such a policy class appears below). Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP.
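
The slide's policy class (a steering-angle choice at constant velocity, made by an SVM classifier) could look roughly like the sketch below, which uses scikit-learn's SVC over a small set of discretized steering angles; the feature representation, the angle set, and the labeling scheme are assumptions, not the paper's setup.

import numpy as np
from sklearn.svm import SVC

STEERING_ANGLES = np.radians([-20, -10, 0, 10, 20])  # assumed discretization

class SVMSteeringPolicy:
    """Policy that picks one of a few discrete steering angles with an SVM,
    matching the slide's 'SVM classifier, constant velocity' description.
    Features and labels here are illustrative assumptions."""

    def __init__(self):
        self.clf = SVC(kernel="rbf")

    def fit(self, features, best_angle_indices):
        # features: per-state feature vectors (e.g., cross-track error,
        # heading error, nearby obstacle offsets); labels: index of the
        # steering angle judged best for that state.
        self.clf.fit(features, best_angle_indices)
        return self

    def __call__(self, features):
        idx = self.clf.predict(np.atleast_2d(features))[0]
        return STEERING_ANGLES[int(idx)]

Under the PSDP-style recursion sketched earlier, fit_policy could label each sampled state with the discrete angle achieving the lowest rollout cost and then train this classifier on those labels.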

49 Time-Indexed PSDP

50 Time-Indexed PSDP w/ Re-indexing

51 Space-Indexed PSDP

52 Empirical Evaluation. Methods compared: Time-indexed PSDP / Time-indexed PSDP with Re-indexing / Space-indexed PSDP. Reported costs: 49.32 / Infinite (no trajectory succeeds) / 59.74.

53 Additional Experiments In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP

54 Related Work Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005. Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008. Gain scheduling, model predictive control: Leith and Leithead, 2000; Garcia et al., 1989.

55 Summary Trajectory following calls for non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed. In this paper we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming. We demonstrated the usefulness of these methods on real-world control tasks.

56 Thank you! Videos available online at http://cs.stanford.edu/~kolter/icml08videos

