Published by Britton Townsend. Modified about 1 year ago.

1
1 Special Topics in Learning. Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2012

2
2 Objective of This Lecture
• Reinforcement learning:
  – Eligibility traces.
  – Generalization and function approximation.
• Today's lecture: chapters 7 and 8 of Sutton & Barto.

3
3 Generalization and Function Approximation. Chapter 8 of Sutton & Barto.

4
4 Objectives
• Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
• Overview of function approximation (FA) methods and how they can be adapted to RL.
• Not in as much depth as in the book (Bianchi's comment).

5
5 Value Prediction with Function Approximation
As usual: policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
In earlier chapters, value functions were stored in lookup tables.

6
6 Adapt Supervised Learning Algorithms
[Diagram: Inputs → Supervised Learning System → Outputs]
Training info = desired (target) outputs
Error = (target output - actual output)
Training example = {input, target output}

7
7 Backups as Training Examples
As a training example: {input, target output}
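As one concrete case (an assumption here: the TD(0) backup written in Sutton & Barto's notation, which this lecture follows), a backup and the training example it corresponds to could be written as:

```latex
% TD(0) backup
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

% The same backup read as a supervised training example {input, target output}
\left\{ \text{description of } s_t, \;\; r_{t+1} + \gamma V(s_{t+1}) \right\}
```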

8
8 Any FA Method?
• In principle, yes:
  – artificial neural networks
  – decision trees
  – multivariate regression methods
  – etc.
• But RL has some special requirements:
  – usually want to learn while interacting
  – ability to handle nonstationarity
  – other?

9
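9 Gradient Descent Methods
The parameters form a column vector: $\vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$, where $T$ denotes the transpose.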

10
10 Performance Measures
• A common and simple one is the mean-squared error (MSE) over a distribution $P$:
  $MSE(\vec{\theta}_t) = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2$
• Let us assume that $P$ is always the distribution of states at which backups are done.
• The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.

11
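11 Gradient Descent
Iteratively move down the gradient of a function $f$ of the parameters:
$\vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \nabla_{\vec{\theta}_t} f(\vec{\theta}_t)$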

12
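12 Gradient Descent Cont.
For the MSE given above and using the chain rule:
$\vec{\theta}_{t+1} = \vec{\theta}_t - \frac{1}{2} \alpha \nabla_{\vec{\theta}_t} \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2 = \vec{\theta}_t + \alpha \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\vec{\theta}_t} V_t(s)$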

13
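13 Gradient Descent Cont.
Use just the sample gradient instead:
$\vec{\theta}_{t+1} = \vec{\theta}_t - \frac{1}{2} \alpha \nabla_{\vec{\theta}_t} \left[ V^\pi(s_t) - V_t(s_t) \right]^2 = \vec{\theta}_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t)$
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if $\alpha$ decreases appropriately with $t$.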
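A minimal sketch of the sample-gradient update, assuming a linear approximator and a toy problem in which the true values are known. The feature map, the toy targets and the step-size schedule are illustrative assumptions, not part of the lecture.

```python
import random

def features(s, n_states):
    """Hypothetical feature map: a bias term plus the state index scaled to [0, 1]."""
    return [1.0, s / (n_states - 1)]

def value(theta, phi):
    """Linear approximator V(s) = theta^T phi(s)."""
    return sum(t * p for t, p in zip(theta, phi))

def sgd_fit(true_values, steps=20000, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    n = len(true_values)
    theta = [0.0, 0.0]
    for t in range(1, steps + 1):
        s = rng.randrange(n)                        # sample s_t from P (uniform here)
        phi = features(s, n)
        error = true_values[s] - value(theta, phi)  # V^pi(s_t) - V_t(s_t)
        alpha = alpha0 / (1.0 + t / 1000.0)         # step size decreasing with t
        # For a linear approximator, the gradient of V_t(s) w.r.t. theta is phi(s).
        theta = [th + alpha * error * p for th, p in zip(theta, phi)]
    return theta

# Toy targets V^pi(s) = s / 10, exactly representable with theta = [0, 1].
true_values = [0.1 * s for s in range(11)]
theta = sgd_fit(true_values)
```

Because the toy targets are exactly linear in the chosen features, the learned parameters should end up very close to [0, 1]. With targets that the features cannot represent, the same loop would settle near the minimum-MSE fit instead.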

14
14 But We Don’t have these Targets
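A sketch of the usual way around this, assuming the notation of Sutton & Barto's chapter 8 (the symbol $v_t$ for a generic target comes from the book, not from the slide text): replace the unknown $V^\pi(s_t)$ with whatever target $v_t$ is available.

```latex
\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - V_t(s_t) \right] \nabla_{\vec{\theta}_t} V_t(s_t)

% If each v_t is an unbiased estimate of V^\pi(s_t), i.e. E\{v_t\} = V^\pi(s_t),
% gradient descent still converges to a local minimum
% (provided \alpha decreases appropriately).
% Example: the Monte Carlo target v_t = R_t, the observed return.
```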

15
15 What about TD(λ) Targets?

16
16 On-Line Gradient-Descent TD(λ)
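A minimal sketch of on-line gradient-descent TD(λ) in its backward view, assuming a linear approximator with accumulating eligibility traces. The five-state random walk, the two-component feature map and the step-size schedule are illustrative assumptions chosen so that the true values s/6 are exactly representable.

```python
import random

N_STATES = 5   # non-terminal states 1..5; states 0 and 6 are terminal

def phi(s):
    """Hypothetical features: bias plus scaled position; terminal states map to zeros."""
    if s == 0 or s == N_STATES + 1:
        return [0.0, 0.0]
    return [1.0, s / (N_STATES + 1)]

def v(theta, s):
    """Linear value estimate V(s) = theta^T phi(s)."""
    return sum(t * f for t, f in zip(theta, phi(s)))

def td_lambda(episodes=5000, alpha0=0.05, gamma=1.0, lam=0.8, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for ep in range(episodes):
        alpha = alpha0 / (1.0 + ep / 200.0)     # slowly decreasing step size
        e = [0.0, 0.0]                          # eligibility trace, reset each episode
        s = (N_STATES + 1) // 2                 # start in the middle state
        while 0 < s < N_STATES + 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == N_STATES + 1 else 0.0
            # TD error: delta = r + gamma * V(s') - V(s)
            delta = r + gamma * v(theta, s_next) - v(theta, s)
            # Accumulating trace: e = gamma * lambda * e + grad V(s), with grad V(s) = phi(s)
            e = [gamma * lam * ei + fi for ei, fi in zip(e, phi(s))]
            theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]
            s = s_next
    return theta

theta = td_lambda()
estimates = [v(theta, s) for s in range(1, N_STATES + 1)]
```

The estimates should come out near 1/6, 2/6, ..., 5/6, the true probabilities of terminating on the right. Setting lam=0.0 gives gradient-descent TD(0).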

17
17 Linear Methods

18
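18 Nice Properties of Linear FA Methods
• The gradient is very simple: $\nabla_{\vec{\theta}_t} V_t(s) = \vec{\phi}_s$, the feature vector of state $s$.
• For MSE, the error surface is simple: a quadratic surface with a single minimum.
• Linear gradient-descent TD(λ) converges if:
  – the step size decreases appropriately;
  – sampling is on-line (states sampled from the on-policy distribution).
• It converges to a parameter vector $\vec{\theta}_\infty$ with the property:
  $MSE(\vec{\theta}_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} MSE(\vec{\theta}^*)$, where $\vec{\theta}^*$ is the best parameter vector.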

19
19 Most Commonly Used Linear Methods
• Coarse coding
• Tile coding (CMAC)
• Radial basis functions
• Kanerva coding

20
20 Coarse Coding
Generalization from state X to state Y depends on the number of their features whose receptive fields overlap.

21
21 Coarse Coding Generalization in linear function approximation methods is determined by the sizes and shapes of the features' receptive fields. All three of these cases have roughly the same number and density of features.
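A small sketch of this effect in one dimension, assuming interval-shaped receptive fields on evenly spaced centers (the centers, widths and test points are illustrative). One gradient-descent update at state x changes the estimate at state y in proportion to the number of features the two states share.

```python
def coarse_features(x, centers, width):
    """Binary features: 1 if x lies inside the interval of the given width around a center."""
    return [1 if abs(x - c) <= width / 2.0 else 0 for c in centers]

def shared_features(x, y, centers, width):
    """Number of features active for both x and y."""
    fx = coarse_features(x, centers, width)
    fy = coarse_features(y, centers, width)
    return sum(a * b for a, b in zip(fx, fy))

def generalization(x, y, centers, width, alpha=0.1, target=1.0):
    """Change in V(y) after one update toward `target` at x, starting from theta = 0."""
    fx = coarse_features(x, centers, width)
    theta = [alpha * (target - 0.0) * f for f in fx]   # single linear update
    fy = coarse_features(y, centers, width)
    return sum(t * f for t, f in zip(theta, fy))

centers = [i / 10.0 for i in range(11)]
narrow = generalization(0.42, 0.58, centers, width=0.1)   # no shared features
broad = generalization(0.42, 0.58, centers, width=0.5)    # three shared features
```

With narrow fields the update at 0.42 leaves the estimate at 0.58 untouched; with broad fields it moves by alpha times the number of shared features.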

22
22 Coarse Coding

23
23 Learning and Coarse Coding
Example of feature width's strong effect on initial generalization (first row) and weak effect on asymptotic accuracy (last row).

24
24 Tile Coding
• Binary feature for each tile
• Number of features present at any one time is constant
• Binary features mean the weighted sum is easy to compute
• Easy to compute the indices of the features present
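A minimal sketch of these properties for a two-dimensional input, assuming uniformly offset grid tilings over the unit square. The function names, the offset scheme and the default sizes are illustrative assumptions, not a reference implementation.

```python
import math

def active_tiles(x, y, n_tilings=5, tiles_per_dim=9, lo=0.0, hi=1.0):
    """Return one active tile index per tiling for a point (x, y) in [lo, hi]^2."""
    size = tiles_per_dim + 1            # extra row/column so shifted tilings stay in range
    scale = tiles_per_dim / (hi - lo)
    indices = []
    for k in range(n_tilings):
        offset = k / n_tilings          # each tiling is shifted by a fraction of a tile
        col = min(int(math.floor((x - lo) * scale + offset)), size - 1)
        row = min(int(math.floor((y - lo) * scale + offset)), size - 1)
        indices.append(k * size * size + row * size + col)
    return indices

def tile_value(theta, x, y):
    """With binary features the weighted sum is just a sum of the active weights."""
    return sum(theta[i] for i in active_tiles(x, y))

n_weights = 5 * 10 * 10
theta = [0.0] * n_weights
for i in active_tiles(0.30, 0.70):
    theta[i] += 0.1                      # one update touches exactly n_tilings weights

near = set(active_tiles(0.30, 0.70)) & set(active_tiles(0.31, 0.71))
far = set(active_tiles(0.30, 0.70)) & set(active_tiles(0.90, 0.10))
```

Exactly one tile per tiling is active, so the number of active features is constant, and nearby points share tiles while distant points share none.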

25
25 Tile Coding

26
26 Example: Simulated Soccer
• How does the agent decide what to do with the ball?
• Complexities:
  – Continuous inputs
  – High dimensionality
From the paper: Reinforcement Learning in Simulated Soccer with Kohonen Networks, by Chris White and David Brogan (University of Virginia)

27
27 Problems
• State space explodes exponentially in terms of dimensionality
• Current methods of managing state space explosion lack automation
RL does not scale well to problems with the complexities of simulated soccer…

28
28 Quantization
• Divide the state space into regions of interest
  – Tile coding (Sutton & Barto, 1998)
• No automated method for choosing the regions'
  – granularity
  – heterogeneity
  – location
• Prefer a learned abstraction of the state space

29
29 Kohonen Networks
• Clustering algorithm
• Data driven
[Figure labels: Agent near opponent goal; Teammate near opponent goal; No nearby opponents]

30
30 State Space Reduction
• 90 continuous-valued inputs describe the state of a soccer game
  – Naïve discretization: $2^{90}$ states
  – Filter out unnecessary inputs: still $2^{18}$ states
  – Clustering algorithm: only 5000 states
Big Win!!!

31
31 Two Pass Algorithm
• Pass 1:
  – Use a Kohonen network and a large training set to learn the state space
• Pass 2:
  – Use reinforcement learning to learn utilities for the states (SARSA)
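A rough sketch of what pass 1 could look like, assuming a tiny one-dimensional Kohonen map over two-dimensional inputs. This is not the authors' implementation: the map size, the learning-rate and neighborhood schedules, the diagonal initialisation and the synthetic data are all illustrative assumptions. The index of the best-matching unit then serves as the discrete state for a tabular learner such as SARSA in pass 2.

```python
import math
import random

def bmu(codebook, x):
    """Index of the best-matching unit (nearest prototype) for input x."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - xi) ** 2 for c, xi in zip(codebook[i], x)))

def train_som(data, n_units=4, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Linear initialisation along the diagonal of the unit cube (illustrative choice).
    codebook = [[i / (n_units - 1)] * dim for i in range(n_units)]
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                      # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.1)      # shrinking neighborhood
            w = bmu(codebook, x)
            for i, unit in enumerate(codebook):
                h = math.exp(-((i - w) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    unit[d] += lr * h * (x[d] - unit[d])
            step += 1
    return codebook

# Synthetic training set: two clusters of continuous inputs (made-up data).
rng = random.Random(1)
cluster_a = [[rng.gauss(0.1, 0.03), rng.gauss(0.1, 0.03)] for _ in range(100)]
cluster_b = [[rng.gauss(0.9, 0.03), rng.gauss(0.9, 0.03)] for _ in range(100)]
codebook = train_som(cluster_a + cluster_b)

state_a = bmu(codebook, [0.1, 0.1])   # discrete state index for an input near cluster A
state_b = bmu(codebook, [0.9, 0.9])   # discrete state index for an input near cluster B
```

Inputs from the two clusters map to different units, so the continuous input space has been reduced to a handful of learned discrete states.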

32
32 Fragility of Learned Actions
What happens to the attacker's utility if the goalie crosses the dotted line?

33
33 Results
• Evaluate three systems:
  – Control: random action selection
  – SARSA
  – Forcing function
• Evaluation criteria:
  – Goals scored
  – Time of possession

34
34 Cumulative Score

35
35 Team with Forcing Functions

36
36 Can you beat the “curse of dimensionality”?
• Can you keep the number of features from going up exponentially with the dimension?
• “Lazy learning” schemes:
  – Remember all the data
  – To get a new value, find the nearest neighbors and interpolate
  – e.g., locally-weighted regression
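A minimal sketch of the lazy-learning idea, assuming an inverse-distance-weighted average over the k nearest stored samples. This is a simpler stand-in for locally-weighted regression, and the memory contents and parameter names are illustrative.

```python
import math

def distance(a, b):
    """Euclidean distance between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_value(query, memory, k=2, eps=1e-12):
    """Estimate a value by interpolating the k nearest (state, value) pairs in memory."""
    nearest = sorted(memory, key=lambda item: distance(query, item[0]))[:k]
    weights = [1.0 / (distance(query, state) + eps) for state, _ in nearest]
    weighted = sum(w * val for w, (_, val) in zip(weights, nearest))
    return weighted / sum(weights)

# Remember all the data: here, samples of the made-up function value(x) = x.
memory = [([i / 10.0], i / 10.0) for i in range(11)]
estimate = knn_value([0.55], memory)
```

Nothing is fitted in advance. All the work happens at query time, and the cost grows with the amount of stored data rather than with a fixed number of features.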

37
37 Can you beat the “curse of dimensionality”?
• Function complexity, not dimensionality, is the problem.
• Kanerva coding:
  – Select a bunch of binary prototypes
  – Use Hamming distance as the distance measure
  – Dimensionality is no longer a problem, only complexity
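A small sketch of Kanerva coding as described above, assuming randomly drawn binary prototypes and a fixed Hamming-distance threshold. The prototype count, bit width and radius are illustrative parameters.

```python
import random

def make_prototypes(n_prototypes, n_bits, seed=0):
    """Select a bunch of random binary prototypes."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_prototypes)]

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def kanerva_features(state_bits, prototypes, radius):
    """Binary feature per prototype: active if the state is within `radius` bits of it."""
    return [1 if hamming(state_bits, p) <= radius else 0 for p in prototypes]

prototypes = make_prototypes(n_prototypes=50, n_bits=32)
features = kanerva_features(prototypes[3], prototypes, radius=8)
```

The number of features equals the number of prototypes, whatever the number of bits in the state, which is the sense in which dimensionality stops being the problem.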

38
38 Algorithms using Function Approximators
• We now extend value prediction methods using function approximation to control methods, following the pattern of GPI.
• First we extend the state-value prediction methods to action-value prediction methods, then we combine them with policy improvement and action selection techniques.
• As usual, the problem of ensuring exploration is solved by pursuing either an on-policy or an off-policy approach.

39
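39 Control with FA
• Learning state-action values: $Q_t(s, a) \approx Q^\pi(s, a)$
• Training examples of the form: {description of $(s_t, a_t)$, $v_t$}
• The general gradient-descent rule:
  $\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - Q_t(s_t, a_t) \right] \nabla_{\vec{\theta}_t} Q_t(s_t, a_t)$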

40
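40 Control with FA
Gradient-descent Sarsa(λ) (backward view):
$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t$
where
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla_{\vec{\theta}_t} Q_t(s_t, a_t)$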

41
41 Linear Gradient Descent Sarsa(λ)

42
42 Linear Gradient Descent Q(λ)

43
43 Mountain-Car Task

44
44 Mountain-Car Results The effect of alpha, lambda and the kind of traces on early performance on the mountain-car task. This study used five 9 x 9 tilings.
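A compact sketch of linear gradient-descent Sarsa(λ) on the mountain-car task, assuming the dynamics given in Sutton & Barto, five 9 x 9 tilings as in the study above, and a simple form of replacing traces. The tile-indexing scheme, the function names and the default parameters are illustrative assumptions and do not reproduce the experiment behind the figure.

```python
import math
import random

N_TILINGS, TILES = 5, 9                 # five 9 x 9 tilings
X_MIN, X_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07
ACTIONS = (-1, 0, 1)                    # reverse, coast, forward
SIZE = TILES + 1                        # extra row/column so offset tilings stay in range
N_FEATURES = N_TILINGS * SIZE * SIZE * len(ACTIONS)

def active_features(x, v, a_idx):
    """Indices of the binary features (one per tiling) active for (state, action)."""
    xs = (x - X_MIN) / (X_MAX - X_MIN) * TILES
    vs = (v - V_MIN) / (V_MAX - V_MIN) * TILES
    feats = []
    for k in range(N_TILINGS):
        off = k / N_TILINGS
        col = min(int(math.floor(xs + off)), SIZE - 1)
        row = min(int(math.floor(vs + off)), SIZE - 1)
        feats.append(((a_idx * N_TILINGS + k) * SIZE + row) * SIZE + col)
    return feats

def step(x, v, a):
    """Mountain-car dynamics; returns (x, v, reached_goal)."""
    v = min(max(v + 0.001 * a - 0.0025 * math.cos(3 * x), V_MIN), V_MAX)
    x += v
    if x < X_MIN:
        x, v = X_MIN, 0.0
    return x, v, x >= X_MAX

def q(theta, feats):
    return sum(theta[i] for i in feats)

def choose(theta, x, v, rng, epsilon):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = [q(theta, active_features(x, v, a)) for a in range(len(ACTIONS))]
    best = max(values)
    return rng.choice([a for a, val in enumerate(values) if val == best])

def sarsa_lambda(episodes=20, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.0,
                 max_steps=2000, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * N_FEATURES
    steps_per_episode = []
    for _ in range(episodes):
        e = {}                                        # sparse eligibility traces
        x, v = rng.uniform(-0.6, -0.4), 0.0
        a = choose(theta, x, v, rng, epsilon)
        for t in range(1, max_steps + 1):
            feats = active_features(x, v, a)
            x2, v2, done = step(x, v, ACTIONS[a])
            delta = -1.0 - q(theta, feats)            # reward is -1 on every step
            for i in feats:
                e[i] = 1.0                            # replacing traces
            if not done:
                a2 = choose(theta, x2, v2, rng, epsilon)
                delta += gamma * q(theta, active_features(x2, v2, a2))
            for i, ei in list(e.items()):
                theta[i] += alpha * delta * ei        # theta += alpha * delta * e
                e[i] = gamma * lam * ei               # decay the traces
            if done:
                break
            x, v, a = x2, v2, a2
        steps_per_episode.append(t)
    return theta, steps_per_episode
```

Calling `theta, steps = sarsa_lambda()` returns the learned weights and the number of steps taken in each episode, capped at `max_steps`. Varying `alpha`, `lam` and the trace rule is one way to explore the kind of comparison the slide describes.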

45
45 Summary
• Generalization
• Adapting supervised-learning function approximation methods
• Gradient-descent methods
• Linear gradient-descent methods
  – Radial basis functions
  – Tile coding
  – Kanerva coding

46
46 Summary
• Nonlinear gradient-descent methods? Backpropagation?
• Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction

47
47 Conclusion

48
48 Conclusion
• We saw two important methods in today's lecture:
  – Eligibility traces, which generalize learning over time.
  – Function approximators, which generalize the learned value function.
• Both generalize learning.

49
49 The End.
