1 Tutorial: Echo State Networks. Dan Popovici, University of Montreal (UdeM), MITACS 2005

2 Overview 1. Recurrent neural networks: a 1-minute primer 2. Echo state networks 3. Examples, examples, examples 4. Open Issues

3 1 Recurrent neural networks

4 Feedforward vs. recurrent NN. Feedforward: connections only "from left to right", no connection cycle; activation is fed forward from input to output through "hidden layers"; no memory. Recurrent: at least one connection cycle; activation can "reverberate", persist even with no input; a system with memory.

5 Recurrent NNs, main properties: input time series → output time series; can approximate any dynamical system (universal approximation property); mathematical analysis difficult; learning algorithms computationally expensive and difficult to master; few application-oriented publications, little research.
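
The following minimal numpy sketch (sizes, weight scales, and the input pulse are illustrative assumptions, not taken from the slides) shows the recurrent update that gives such a network memory: the state at time n depends on the previous state, so activity persists even after the input is gone.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 1, 20
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))   # input weights
W = rng.normal(scale=0.1, size=(n_hid, n_hid))     # recurrent weights: the connection cycle

x = np.zeros(n_hid)                                # internal state = the network's memory
norms = []
for n in range(50):
    u = np.array([1.0]) if n == 0 else np.array([0.0])   # single input pulse, then silence
    x = np.tanh(W_in @ u + W @ x)                  # activation "reverberates" through W @ x
    norms.append(np.linalg.norm(x))

# activity persists (and slowly decays) long after the only nonzero input at step 0
print(norms[0], norms[10], norms[40])
```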

6 Supervised training of RNNs. A. Training: the teacher provides input and correct output; the model is adapted to reproduce the teacher output from the input. B. Exploitation: the model is driven by new input and produces its own output; the correct output is unknown.

7 Backpropagation through time (BPTT). Most widely used general-purpose supervised training algorithm. Idea: 1. stack network copies, 2. interpret as a feedforward network, 3. use the backprop algorithm. (Figure: original RNN vs. stack of copies.)
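
As a sketch of the idea, the loop below unrolls a small tanh RNN into a "stack of copies" (the forward pass stores one state per copy) and then backpropagates a squared error through the stack. All sizes and signals are illustrative assumptions, and this is plain BPTT rather than any particular published variant.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 20, 1, 5
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.5, size=(1, n_hid))
u = rng.normal(size=(T, n_in))
target = np.sin(np.arange(T, dtype=float))[:, None]

# forward pass: run the RNN and keep every state (one per "copy" in the stack)
xs, ys = [np.zeros(n_hid)], []
for t in range(T):
    xs.append(np.tanh(W_in @ u[t] + W @ xs[-1]))
    ys.append(W_out @ xs[-1])

# backward pass: propagate the error back through the unrolled copies
dW, dW_in, dW_out = np.zeros_like(W), np.zeros_like(W_in), np.zeros_like(W_out)
dx_next = np.zeros(n_hid)
for t in reversed(range(T)):
    e = ys[t] - target[t]                 # gradient of 0.5*(y - target)^2 w.r.t. y
    dW_out += np.outer(e, xs[t + 1])
    dx = W_out.T @ e + dx_next            # error arriving at state x(t)
    dpre = (1 - xs[t + 1] ** 2) * dx      # back through the tanh nonlinearity
    dW += np.outer(dpre, xs[t])
    dW_in += np.outer(dpre, u[t])
    dx_next = W.T @ dpre                  # hand the error on to the previous copy
# dW, dW_in, dW_out can now be used for a gradient-descent step
```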

8 What are ESNs? A training method for recurrent neural networks; black-box modelling of nonlinear dynamical systems; supervised training, offline and online; exploits linear methods for nonlinear modeling. (Figure: previous RNN training vs. ESN training.)

9 Introductory example: a tone generator. Goal: train a network to work as a tuneable tone generator. Input: frequency setting; output: sine wave of the desired frequency.

10 Tone generator, sampling. During the sampling period, drive the fixed "reservoir" network with the teacher input and output. Observation: the internal states of the dynamical reservoir reflect both the input and the output teacher signals.

11 Tone generator: compute weights Determine reservoir-to-output weights such that training output is optimally reconstituted from internal "echo" signals.

12 Tone generator: exploitation. With the new output weights in place, drive the trained network with input. Observation: the network continues to function as in training: internal states reflect input and output; the output is reconstituted from the internal states; internal states and output create each other (echo / reconstitute).

13 Tone generator: generalization. The trained generator network also works with input different from the training input. (Figure: A. step input; B. teacher and learned output; C. some internal states.)

14 Dynamical reservoir. A large recurrent network (on the order of 100 or more units) works as a "dynamical reservoir" or "echo chamber"; units in the DR respond differently to excitation; output units combine the different internal dynamics into the desired dynamics.

15 Rich excited dynamics. Unit impulse responses should vary greatly. Achieve this by, e.g., inhomogeneous connectivity, random weights, different time constants, ... (Figure: excitation and responses.)

16 Notation and Update Rules
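
The equations on this slide were rendered as images; the standard ESN state and output update (in the notation of Jaeger's reports, which this tutorial follows; treat this as a reconstruction rather than the slide's exact content) is:

```latex
% ESN update rules (reconstruction; f is typically tanh, applied element-wise)
x(n+1) = f\bigl(W^{\mathrm{in}}\, u(n+1) + W\, x(n) + W^{\mathrm{back}}\, y(n)\bigr)
\qquad
y(n+1) = f^{\mathrm{out}}\bigl(W^{\mathrm{out}}\, [\, u(n+1);\; x(n+1) \,]\bigr)
```

where x(n) is the reservoir state, u(n) the input, y(n) the output, and [·;·] denotes concatenation.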

17 Learning: basic idea. Every stationary deterministic dynamical system can be defined by an equation of the form y(n) = h(u(n), u(n-1), ...; y(n-1), y(n-2), ...), where the system function h might be a monster. Combine h from the I/O echo functions by selecting suitable DR-to-output weights w_out.

18 Offline training: task definition. Let y_teach(n) be the teacher output. Compute output weights w_out such that the mean square error between the teacher output and the network output is minimized. Recall that the network output is a linear combination of the reservoir ("echo") signals.
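
In symbols (a reconstruction consistent with the slide's wording, writing w_out for the DR-to-output weights and x(n) for the reservoir state):

```latex
\min_{w^{\mathrm{out}}}\;\frac{1}{T}\sum_{n=1}^{T}\Bigl(y_{\mathrm{teach}}(n) - \bigl(w^{\mathrm{out}}\bigr)^{\top} x(n)\Bigr)^{2}
```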

19 Offline training: how it works. 1. Let the network run with the training signal teacher-forced. 2. During this run, collect the network states in a matrix M. 3. Compute output weights such that the mean square error is minimized. The MSE-minimizing weight computation (step 3) is a standard operation; many efficient implementations are available, offline/constructive and online/adaptive.
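
A compact numpy sketch of these three steps, assuming a single output channel with output feedback and using ridge regression as the (standard, but here merely illustrative) MSE-minimizing solver; the washout length and regularization strength are assumptions.

```python
import numpy as np

def train_esn_readout(W_in, W, W_back, u, y_teach, washout=100, ridge=1e-8):
    """Offline ESN training: (1) teacher-force the reservoir, (2) collect states in M,
    (3) solve a linear regression for the DR-to-output weights.
    Shapes assumed: u (T, n_in), y_teach (T,), W (n_hid, n_hid), W_back (n_hid,)."""
    n_hid = W.shape[0]
    x = np.zeros(n_hid)
    M, D = [], []
    for n in range(len(u)):
        y_prev = y_teach[n - 1] if n > 0 else 0.0     # teacher forcing of the output feedback
        x = np.tanh(W_in @ u[n] + W @ x + W_back * y_prev)
        if n >= washout:                              # discard the initial transient
            M.append(x.copy())                        # collect states in matrix M
            D.append(y_teach[n])
    M, D = np.array(M), np.array(D)
    # MSE-minimizing weights via ridge regression (any linear solver would do)
    return np.linalg.solve(M.T @ M + ridge * np.eye(n_hid), M.T @ D)
```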

20 Practical considerations. The reservoir weight matrix W is chosen randomly; the spectral radius of W should be < 1; W should be sparse; input and feedback weights have to be scaled "appropriately"; adding noise in the update rule can increase generalization performance.
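
A sketch of a reservoir set up along these lines; every number below (size, density, spectral radius, scaling, noise level) is an illustrative default, not a value prescribed by the slides.

```python
import numpy as np

def make_reservoir(n=100, density=0.1, spectral_radius=0.8,
                   input_scale=0.5, feedback_scale=0.5, n_in=1, seed=0):
    """Random, sparse reservoir with spectral radius < 1 and scaled input/feedback weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (n, n)) * (rng.random((n, n)) < density)   # sparse random W
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))        # rescale spectral radius
    W_in = rng.uniform(-input_scale, input_scale, (n, n_in))           # scaled "appropriately"
    W_back = rng.uniform(-feedback_scale, feedback_scale, n)
    return W_in, W, W_back

def noisy_update(x, u, y_prev, W_in, W, W_back, noise=1e-4, rng=np.random.default_rng()):
    """One state update; the small uniform noise term can improve generalization."""
    return np.tanh(W_in @ u + W @ x + W_back * y_prev
                   + noise * rng.uniform(-1, 1, len(x)))
```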

21 Echo state network training, summary. Use a large recurrent network as an "excitable dynamical reservoir (DR)"; the DR is not modified through learning; adapt only the DR → output weights; thereby combine the desired system function from the I/O history echo functions; use any offline or online linear regression algorithm to minimize the error.

22 3 Examples, examples, examples

23 3.1 Short-term memories

24 Delay line: scheme

25 Delay line: example. Network size 400; delays: 1, 30, 60, 70, 80, 90, 100, 103, 106, 120 steps; training sequence length N = 2000; training signal: random walk with resting states.
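
A sketch of how the multi-delay teacher signals can be built (the background signal here is plain uniform noise rather than the random walk with resting states used on the slide):

```python
import numpy as np

delays = [1, 30, 60, 70, 80, 90, 100, 103, 106, 120]
N = 2000
rng = np.random.default_rng(0)
u = rng.uniform(-0.5, 0.5, N)                 # stand-in for the training input signal

# one teacher output channel per delay: y_d(n) = u(n - d)
targets = np.zeros((N, len(delays)))
for k, d in enumerate(delays):
    targets[d:, k] = u[:N - d]
# the ESN readout is then trained to reproduce all delayed copies from the reservoir state
```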

26 Results. (Figure: correct delayed signals and network outputs for delays 1, 30, 60, 90, 100, 103, 106, 120; traces of some DR internal units.)

27 Delay line: test with different input. (Figure: correct delayed signals and network outputs for delays 1, 30, 60, 90, 100, 103, 106, 120; traces of some DR internal units.)

28 3.2 Identification of nonlinear systems

29 Identifying higher-order nonlinear systems A tenth-order system... Training setup

30 Results: offline learning. Augmented ESN (800 parameters): NMSE_test = 0.006. Previous published state of the art 1): NMSE_train = 0.24. D. Prokhorov, pers. communication 2): NMSE_test = 0.004. 1) Atiya & Parlos (2000), IEEE Trans. Neural Networks 11(3), 697-708. 2) EKF-RNN, 30 units, 1000 parameters.

31 The Mackey-Glass equation: a delay differential equation, dx/dt = β x(t-τ) / (1 + x(t-τ)^n) - γ x(t). For delay τ > 16.8 the dynamics are chaotic; it is a standard benchmark for time series prediction. (Figure: series for τ = 17 and τ = 30.)
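
The series can be generated with a simple Euler integration of the delay differential equation; the parameter values below (β = 0.2, γ = 0.1, exponent 10) are the ones commonly used for this benchmark and are an assumption here, as is the crude step size.

```python
import numpy as np

def mackey_glass(T=5000, tau=17, beta=0.2, gamma=0.1, exponent=10, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = beta*x(t-tau)/(1 + x(t-tau)**exponent) - gamma*x(t)."""
    hist = int(tau / dt)                       # number of past samples needed for the delay
    x = np.full(hist + T, x0)
    for t in range(hist, hist + T - 1):
        x_tau = x[t - hist]
        x[t + 1] = x[t] + dt * (beta * x_tau / (1 + x_tau ** exponent) - gamma * x[t])
    return x[hist:]

series = mackey_glass(tau=17)                  # tau > 16.8 gives chaotic behaviour
```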

32 Learning setup network size 1000 training sequence N = 3000 sampling rate 1

33 Results for τ = 17. Error for 84-step prediction: NRMSE = 1E-4.2 (averaged over 100 training runs on independently created data). With a refined training method: NRMSE = 1E-5.1. Previous best: NRMSE = 1E-1.7. (Figure: original vs. learned model.)

34 Prediction with model visible discrepancy after about 1500 steps...

35 Comparison: NRMSE for 84-step prediction. (Figure: log10(NRMSE) for different methods; *) data from the survey in Gers/Eck/Schmidhuber 2000.)

36 3.3 Dynamic pattern recognition

37 Dynamic pattern detection 1). Training signal: output jumps to 1 after each occurrence of a pattern instance in the input. 1) See GMD Report Nr. 152 for detailed coverage.

38 Single-instance patterns, training setup. 1. A single-instance, 10-step pattern is randomly fixed. 2. It is inserted into a 500-step random signal at positions 200 (for training) and 350, 400, 450, 500 (for testing). 3. A 100-unit ESN is trained on the first 300 steps (a single positive instance: "single-shot learning") and tested on the remaining 200 steps. Test data: 200 steps with 4 occurrences of the pattern on random background; desired output: impulses at the pattern occurrences. (Figure: the pattern.)
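
A sketch of this data construction (amplitude ranges are assumptions; positions are taken as the step at which each pattern instance ends):

```python
import numpy as np

rng = np.random.default_rng(0)
pattern = rng.uniform(-1, 1, 10)            # the single, randomly fixed 10-step pattern
signal = rng.uniform(-1, 1, 500)            # 500-step random background signal
target = np.zeros(500)

for pos in [200, 350, 400, 450, 500]:       # 200 for training; the rest for testing
    signal[pos - 10:pos] = pattern          # pattern occupies the 10 steps ending at pos
    target[pos - 1] = 1.0                   # output jumps to 1 after the occurrence

train_u, train_y = signal[:300], target[:300]   # single positive instance ("single-shot")
test_u, test_y = signal[300:], target[300:]     # 200 test steps with 4 occurrences
```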

39 Single-instance patterns, results. 1. Trained network response on test data. 2. Network response after training 800 more pattern-free steps ("negative examples"). 3. Like 2., but 5 positive examples in the training data. (Discrimination ratios DR for 1.-3.: 12.4, 12.1, 6.4.) 4. Comparison: optimal linear filter, DR = 3.5.

40 Event detection for robots (joint work with J. Hertzberg & F. Schönherr). A robot runs through an office environment and experiences data streams (27 channels) like: infrared distance sensor, left motor speed, activation of "goThruDoor", and an external teacher signal marking the event category. (Figure: 10 sec of example channels.)

41 Learning setup. 27 (raw) data channels → an unlimited number of event detector channels; 100-unit RNN; simulated robot (rich simulation); training run spans 15 simulated minutes; event categories like: pass through door, pass by 90° corner, pass by smooth corner.

42 Results. Easy to train event hypothesis signals; "boolean" categories possible; single-shot learning possible.

43 Network setup in training. 29 input channels code the symbols (space, the letters a-z, and punctuation); 29 output channels for next-symbol hypotheses; 400 units.

44 Trained network in "text" generation. A decision mechanism, e.g. winner-take-all, selects the output symbol; the winning symbol is fed back as the next input.
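
A sketch of that generation loop; `reservoir_step` and `w_out` are hypothetical placeholders for a network trained as on the previous slide, and the symbol set is assumed to be space plus letters plus punctuation.

```python
import numpy as np

def generate_text(reservoir_step, w_out, symbols, n_steps=200, greedy=True, seed=0):
    """Autonomous "text" generation: feed the chosen symbol back as the next input.
    reservoir_step(x, u) -> new state; w_out maps the state to one score per symbol."""
    rng = np.random.default_rng(seed)
    n_sym = len(symbols)
    x = np.zeros(w_out.shape[1])
    u = np.eye(n_sym)[0]                    # start from an arbitrary symbol
    out = []
    for _ in range(n_steps):
        x = reservoir_step(x, u)            # advance the reservoir one tick
        scores = w_out @ x                  # next-symbol hypotheses
        if greedy:                          # winner-take-all selection
            k = int(np.argmax(scores))
        else:                               # random draw according to the outputs
            p = np.clip(scores, 1e-9, None)
            k = int(rng.choice(n_sym, p=p / p.sum()))
        out.append(symbols[k])
        u = np.eye(n_sym)[k]                # winning symbol becomes the next input
    return "".join(out)
```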

45 Results Selection by random draw according to output yth_upsghteyshhfakeofw_io,l_yodoinglle_d_upeiuttytyr_hsymua_doey_sa mmusos_trll,t.krpuflvek_hwiblhooslolyoe,_wtheble_ft_a_gimllveteud_... Winner-take-all selection sdear_oh,_grandmamma,_who_will_go_and_the_wolf_said_the_wolf_said _the_wolf_said_the_wolf_said_the_wolf_said_the_wolf_said_the_wolf...

46 4 Open Issues

47 4.2 Multiple timescales 4.3 Additive mixtures of dynamics 4.4 "Switching" memory 4.5 High-dimensional dynamics

48 Multiple time scales This is hard to learn (Laser benchmark time series): Reason: 2 widely separated time scales Approach for future research: ESNs with different time constants in their units

49 Additive dynamics This proved impossible to learn: Reason: requires 2 independent oscillators; but in ESN all dynamics are mutually coupled. Approach for future research: modular ESNs and unsupervised multiple expert learning

50 "Switching" memory This FSA has long memory "switches": Generating such sequences not possible with monotonic, area- bounded forgetting curves! a a b c b aaa....aaa c aaa...aaa b aaa...aaa c aaa...aaa... bounded area unbounded width An ESN simply is not a model for long-term memory!

51 High-dimensional dynamics. High-dimensional dynamics would require a very large ESN. Example: 6-DOF nonstationary time series, one-step prediction. 200-unit ESN: RMS = 0.2; 400-unit network: RMS = 0.1; best other training technique 1): RMS = 0.02. Approach for future research: task-specific optimization of the ESN. 1) Prokhorov et al., extended Kalman filtering BPTT; network size 40, 1400 trained links, training time 3 weeks.

52 Spreading trouble... The signals x_i(n) of the reservoir can be interpreted as vectors in an (infinite-dimensional) signal space. The correlation E[xy] yields an inner product on this space. The output signal y(n) is a linear combination of these x_i(n). The more orthogonal the x_i(n), the smaller the output weights. (Figure: nearly parallel x_1, x_2 force large weights, e.g. y = 30 x_1 - 28 x_2; nearly orthogonal x_1, x_2 allow small weights, e.g. y = 0.5 x_1 + 0.7 x_2.)

53 The eigenvectors v_k of the correlation matrix R = (E[x_i x_j]) are orthogonal signals. The eigenvalues λ_k indicate what "mass" of the reservoir signals x_i (all together) is aligned with v_k. The eigenvalue spread λ_max / λ_min indicates the overall "non-orthogonality" of the reservoir signals. (Figure: λ_max / λ_min ≈ 20 vs. λ_max / λ_min ≈ 1.)
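
A small helper for measuring this spread on collected reservoir states (a sketch; `states` is assumed to be a T×N matrix holding the signals x_i(n) as columns):

```python
import numpy as np

def eigenvalue_spread(states):
    """states: array of shape (T, N) with the reservoir signals x_i(n) as columns.
    Returns lambda_max / lambda_min of the empirical correlation matrix R = E[x_i x_j]."""
    R = states.T @ states / len(states)     # empirical correlation matrix (N x N)
    lam = np.linalg.eigvalsh(R)             # R is symmetric, so eigenvalues are real
    return lam.max() / lam.min()

# a spread near 1 means nearly orthogonal reservoir signals; a large spread signals
# the trouble discussed on the next slide (large output weights, poor conditioning)
```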

54 Large eigenvalue spread → large output weights. This is harmful for generalization, because slight changes in the reservoir signals induce large changes in the output; harmful for model accuracy, because estimation error contained in the reservoir signals is magnified (this does not apply to deterministic systems); and it renders LMS online adaptive learning useless. (Figure: λ_max / λ_min ≈ 20.)

55 Summary Basic idea: dynamical reservoir of echo states + supervised teaching of output connections. Seemed difficult: in nonlinear coupled systems, every variable interacts with every other. BUT seen the other way round, every variable rules and echoes every other. Exploit this for local learning and local system analysis. Echo states shape the tool for the solution from the task.

56 Thank you.

57 References. H. Jaeger (2002): Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. GMD Report 159, German National Research Center for Information Technology, 2002. Slides used by Herbert Jaeger at IK2002.

