Recurrent Neural Networks


Recurrent Networks
- Some problems require previous history/context in order to give the proper output (speech recognition, stock forecasting, target tracking, etc.)
- One option is to provide all the necessary context in one "snapshot" and use standard learning
- How big should the snapshot be? It varies for different instances of the problem

Recurrent Networks
- Another option is to use a recurrent neural network, which lets the network dynamically learn how much context it needs in order to solve the problem
- Speech example: vowels vs. consonants, etc.
- Acts like a state machine that gives different outputs for the current input depending on the current state
- Recurrent nets must learn and use this state/context information in order to get high accuracy on the task

Recurrent Networks
- Partially and fully recurrent networks: feedforward vs. relaxation nets
- Elman training: simple recurrent networks; can use standard BP training
- BPTT (backpropagation through time): can learn further back, but must pick a depth k
- Real-time recurrent learning, etc.
- Review: BP equations, saturation, inductive bias intuition

Recurrent Network Variations
- This network can theoretically learn contexts arbitrarily far back
- Many structural variations:
  - Elman/simple net
  - Jordan net
  - Mixed
  - Context sub-blocks, etc.
  - Multiple hidden/context layers, etc.
  - Generalized row representation
- How do we learn the weights? (The standard Elman and Jordan recurrences are written out below for reference.)
Notes: Draw the different variations on the board. For the generalized row representation, show an SRN and other variations (2 inputs, 3 hidden nodes, 1 output).
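For reference, a common way to write the Elman and Jordan recurrences (standard formulations; the notation below is assumed here rather than taken from the slides):

```latex
% Elman (SRN): context is a copy of the previous hidden activations.
% Jordan: context is a copy of the previous output activations.
\begin{align*}
\text{Elman:}\quad  h_t &= f(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h), & y_t &= g(W_{hy}\, h_t + b_y) \\
\text{Jordan:}\quad h_t &= f(W_{xh}\, x_t + W_{yh}\, y_{t-1} + b_h), & y_t &= g(W_{hy}\, h_t + b_y)
\end{align*}
```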

Simple Recurrent Training – Elman Training
- Can think of the net as just a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations
- The MLP itself does not even need to be aware that the context inputs are coming from the hidden layer
- Can then train with standard BP training (a sketch of this follows below)
- While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, so it is limited in the context it can learn
- What if the current output depended on the input 2 time steps back?
- Can still be useful for applications with short-term dependencies
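A minimal sketch of Elman training, assuming sigmoid units and squared error; the class and weight names (ElmanNet, W_h, W_o, context) are illustrative, not from the course code. The context is treated as just another (constant) input during the weight update, which is why the gradient only reaches one step back.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ElmanNet:
    """Simple recurrent net trained as an ordinary MLP whose 'extra inputs'
    are a copy of the previous hidden activations (gradient looks back only 1 step)."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden layer sees the real inputs plus the copied context, plus a bias.
        self.W_h = rng.normal(0, 0.1, (n_hidden, n_in + n_hidden + 1))
        self.W_o = rng.normal(0, 0.1, (n_out, n_hidden + 1))
        self.context = np.zeros(n_hidden)  # common choice: start context at 0 (or 0.5)
        self.lr = lr

    def step(self, x, target=None):
        h_in = np.concatenate([x, self.context, [1.0]])   # input + context + bias
        h = sigmoid(self.W_h @ h_in)
        o_in = np.concatenate([h, [1.0]])
        o = sigmoid(self.W_o @ o_in)
        if target is not None:                            # standard BP update
            delta_o = (target - o) * o * (1 - o)
            delta_h = (self.W_o[:, :-1].T @ delta_o) * h * (1 - h)
            self.W_o += self.lr * np.outer(delta_o, o_in)
            self.W_h += self.lr * np.outer(delta_h, h_in) # context treated as a fixed input
        self.context = h                                  # copy hidden state for the next step
        return o
```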

BPTT – Backprop through time
- BPTT allows us to look back further as we train
- However, we have to pre-specify a value k, the maximum number of steps that learning will look back
- During training we unfold the network in time as if it were a standard feedforward network with k layers, but where the weights of each unfolded layer are exact copies of each other
- We then train the unfolded k-layer feedforward net with standard BP
- Execution still happens with the actual recurrent version
- Is not knowing k a priori that bad? How do you choose it?
  - Cross-validation, just like finding the best number of hidden nodes, etc. – thus we can find a good k fairly reasonably for a given task
Notes: Choosing k via CV is just like picking the number of hidden nodes; this could be automated.

- Note that k=1 is just standard BP with no feedback (no feedback state): k=1 means one regular net (the top block of the unfolded figure, with its input), and assuming 0-valued initial state activations it is just standard BP (even non-zero constants would simply act as more bias weights)
- k=2 is just Elman training
- k is the number of feedback blocks in the unfolded net
Notes: I(k) goes into H(k) with possible output O(k); the hidden layer next to I(k) is H(k-1). I would normally start the first input as I(1) with the state next to it as H(0); the last input is I(k), going into H(k) and O(k). Show this.

[Figure: BPTT – unfolding in time (k=3) with output connections. The recurrent net is unrolled into copies Input1/Output1 through Inputk/Outputk, connected by one-step time delays; the weights at each unfolded layer are maintained as exact copies.]

Synthetic Data Set: Delayed Parity Task – Dparity
- This task has a single time-series input of random bits. The output label is the (even) parity of n arbitrarily delayed (but consistent) previous inputs.
- For example, for Dparity(0,2,5) the label of each instance is set to the parity of the current input, the input 2 steps back, and the input 5 steps back.
- Dparity(0,1) is the simplest version, where the output is the XOR of the current input and the previous input.
- Dparity-to-ARFF app:
  - The user enters the number of instances wanted, a random seed, and a vector of the n delays (and optionally a noise parameter?)
  - The app returns an ARFF file of this task with a random input stream based on the seed, with proper labels
Notes: Do this here so we can look at specific examples. (A sketch of such a generator follows.)
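A minimal sketch of such a generator, assuming the task definition above; the function names, the zero-padding of the pre-stream bits, and the ARFF header are illustrative, not the course's actual Dparity-to-ARFF app.

```python
import numpy as np

def dparity(n_instances, delays, seed=0):
    """Generate the delayed parity task: label = even parity (XOR) of the input
    bits at the given delays, e.g. delays=(0, 1) is the XOR of the current and
    previous bit. Bits 'before the stream' are padded with zeros here."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_instances)
    max_d = max(delays)
    padded = np.concatenate([np.zeros(max_d, dtype=int), bits])
    labels = np.zeros(n_instances, dtype=int)
    for t in range(n_instances):
        labels[t] = np.bitwise_xor.reduce([padded[t + max_d - d] for d in delays])
    return bits, labels

def to_arff(bits, labels, path="dparity.arff"):
    """Write a simple ARFF file: one input bit and its parity label per line."""
    with open(path, "w") as f:
        f.write("@relation dparity\n@attribute bit {0,1}\n@attribute parity {0,1}\n@data\n")
        for b, y in zip(bits, labels):
            f.write(f"{b},{y}\n")

# Dparity(0,1): label is the XOR of the current and the most recent input
bits, labels = dparity(10000, delays=(0, 1), seed=42)
to_arff(bits, labels)
```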

BPTT Learning/Execution
- Consider Dparity(0,1) and Dparity(0,2)
- For Dparity(0,1), what would k need to be?
- For learning and execution we need to start the input stream at least k steps back to get reasonable context
- How do we fill in the initial activations of the context nodes? A 0 vector is common; a .5 vector or a typical/average vector are also used
- For Dparity(0,2), what would k need to be?
- Note that k=1 is just standard non-feedback BP, and k=2 is simple Elman training looking back one step
- Let's do an example and walk through it (HW): Dparity(0,1) needs k=2, Dparity(0,2) needs k=3
- No extra burn-in is needed beyond that, because inputs more than k steps back play no part in setting the output label, so anything could have happened before those k steps
Notes: Do an example with 1 input, 2 hidden/context nodes, 1 output, k=2. Make a table of values and draw the unfolded net while stepping through the first BP execution. (A sketch of this unfolded forward pass follows.)
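A minimal sketch of that unfolded forward pass (k=2, 1 input, 2 hidden/context nodes, 1 output), assuming sigmoid units, zero initial context activations, and illustrative weight names. Only the final output O(k) is compared against a target, matching the training rules on the next slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unfolded_forward(x_seq, W_h, W_o, k=2, h0=None):
    """Forward pass of the net unfolded k steps.
    x_seq : the last k inputs, x_seq[0] oldest ... x_seq[k-1] current (each shape (n_in,))
    W_h   : hidden weights, shape (n_hidden, n_in + n_hidden + 1)  (+1 for bias)
    W_o   : output weights, shape (n_out, n_hidden + 1)
    h0    : initial context activations H(0); zeros are a common choice."""
    n_hidden = W_h.shape[0]
    h = np.zeros(n_hidden) if h0 is None else h0
    hs = [h]                                        # H(0), H(1), ..., H(k), kept for the backward pass
    for t in range(k):
        h_in = np.concatenate([x_seq[t], h, [1.0]])
        h = sigmoid(W_h @ h_in)
        hs.append(h)
    o = sigmoid(W_o @ np.concatenate([h, [1.0]]))   # only the k-th output O(k) gets a target
    return o, hs

# k=2 example: 1 input, 2 hidden/context nodes, 1 output
rng = np.random.default_rng(0)
W_h = rng.normal(0, 0.1, (2, 1 + 2 + 1))
W_o = rng.normal(0, 0.1, (1, 2 + 1))
out, hidden_states = unfolded_forward([np.array([1.0]), np.array([0.0])], W_h, W_o, k=2)
```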

BPTT Training Example/Notes
- How to select instances from the training set:
  - Use random start positions
  - Input and process for k steps (could start a few steps further back to get a more representative example of the initial context node activations – burn-in)
  - Use the kth label as the target
- Any advantage in starting the next sequence at the last start + 1? It would already have good approximations for the initial context activations
- Don't shuffle the training set (targets of the first k-1 instances are ignored)
- Unfold and propagate error through the k layers
- Backpropagate error starting only from the kth target – otherwise hidden node weight updates would be dominated by the earlier, less attenuated target errors
- Accumulate the weight changes and make one update at the end – thus all unfolded weights remain proper exact copies (a training-step sketch follows)
Notes: Draw the training set with inputs in one column and outputs in the next (it is a time series with separate outputs); never start with any of the last k-1 instances. On reusing the last start + 1: no, because the next start will go through updated weights, so the old calculations would not be reusable (though they wouldn't be far off, and might still work). Fill in initial state activations with 0 or .5.
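A minimal sketch of one BPTT weight update under those rules (error started only from the kth target, weight changes accumulated across the unfolded copies and applied once), reusing the assumed sigmoid/squared-error setup and the unfolded_forward() helper from the sketch above.

```python
import numpy as np  # unfolded_forward() is defined in the previous sketch

def bptt_update(x_seq, target, W_h, W_o, k, lr=0.2, h0=None):
    """One BPTT weight update on the unfolded net."""
    n_in = x_seq[0].shape[0]
    o, hs = unfolded_forward(x_seq, W_h, W_o, k, h0)

    # Error starts only at the kth (final) target; squared error with sigmoid outputs.
    delta_o = (target - o) * o * (1 - o)
    dW_o = lr * np.outer(delta_o, np.concatenate([hs[k], [1.0]]))

    dW_h = np.zeros_like(W_h)                    # accumulate changes over the unfolded copies
    delta_h = (W_o[:, :-1].T @ delta_o) * hs[k] * (1 - hs[k])
    for t in range(k, 0, -1):
        h_in = np.concatenate([x_seq[t - 1], hs[t - 1], [1.0]])
        dW_h += lr * np.outer(delta_h, h_in)
        # Pass error back through the recurrent (context) portion of W_h to the previous copy.
        W_hh = W_h[:, n_in:n_in + W_h.shape[0]]
        delta_h = (W_hh.T @ delta_h) * hs[t - 1] * (1 - hs[t - 1])

    # One update at the end keeps all unfolded weights as proper exact copies.
    W_o += dW_o
    W_h += dW_h
    return o
```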

BPTT Issues/Notes
- There is typically an exponential drop-off in the effect of prior inputs – there is only so much that a few context nodes can be expected to remember
- The error attenuation issues of multi-layer BP learning arise as k gets larger
- Learning is less stable and it is more difficult to get good results; local optima are more common with recurrent nets
- BPTT is a common approach; finding a proper depth k is important

BPTT Project
- Implement BPTT
- Experiment with the delayed parity task
  - First test with Dparity(0,1) to make sure that works, then try other variations, including ones which stress BPTT
- Analyze the results of learning a real-world recurrent task of your choice
- Don't re-create the assignment version here; go to the assignment page for full details
Notes: Dparity requires lots of training – thousands of epochs (depending on training set size); learning is often flat for a long time and then suddenly picks up. Some burn-in is needed (1 more than k should do, but results improved when using a bit more). Use k one more than you would think? You can do well with k equal to the furthest delay back and a burn-in of at least 1, but k+1 can be faster/better; an even bigger k can slow things down. Sensitive to learning parameters (number of instances (needs lots), LR, momentum, etc.). The number of nodes is critical – more is usually better (state vs. hidden nodes), but with too many, learning slows way down (and note that fewer epochs with more hidden nodes can still mean longer wall-clock time). Real-world problems are not completely dependent on recurrence (unlike Dparity), so also try k=1 (standard BP) as the baseline comparison. Use a fast language – there is a lot of simulation time.

BPTT Project
- Sequential/time-series data with and without separate labels
  - These series often do not have separate labels
  - Recurrent nets can support both variations
- Possibilities in the Irvine (UCI) data repository
- Detailed example: Localization Data for Person Activity data set – let's set this one up exactly; there are some subtleties
  - Which features should we use as inputs?
Notes: Look at the Irvine set together. Note time-series vs. sequential – a time series has consistent time steps, vs. a sequence, which is less tied to time (except that one thing happens after another). Note classification, regression, clustering, etc. Look at the Names and Text versions (both windows up) for localization. Note that you should only use the tag ID and the corresponding x, y, z coordinates as inputs. Why? Also note that the events are not completely regular.

Localization Example
- Time stamps are not that regular
  - Thus there is just one sensor reading per time stamp
  - Could try to separate out learning of one sensor at a time, but the combination of sensors is critical, and just keeping the examples in temporal order is sufficient for learning
- What would the network structure and data representation look like?
- What value for k? Typical CV graph? Stopping criteria (e.g. validation set, etc.)
- Remember the basic BP issues: normalization, nominal value encoding, don't-know values, etc.
Notes: Localist vs. encoded representation for nominals, and don't-know values? This set doesn't have them, but normally they come up.

Localization Example
- Note that you might think there would be a synchronized time stamp showing the x, y, z coordinates for each of the 4 sensors – in which case, what would the feature vector look like?
- With that representation you could do k ≈ 3, vs. k ≈ 10 for the current version (and k ≈ 10 will struggle due to error attenuation)
(A sketch of both representations is given below.)
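A minimal sketch of the two representations, assuming the UCI Localization Data for Person Activity columns are (sequence name, tag ID, timestamp, date, x, y, z, activity); the tag IDs, the file name, and the one-hot encoding below are assumptions for illustration, so check them against the data's Names file.

```python
import csv

# The 4 sensor tag IDs (assumed values; verify against the data description).
TAGS = ["010-000-024-033", "010-000-030-096", "020-000-033-111", "020-000-032-221"]

def per_reading_instance(row):
    """Current representation: one sensor reading per time stamp.
    Inputs = one-hot tag ID + x, y, z; label = activity. With this encoding the
    unfolded net needs k on the order of ~10 to see recent readings from all 4 sensors."""
    _, tag, _, _, x, y, z, activity = row
    one_hot = [1.0 if tag == t else 0.0 for t in TAGS]
    return one_hot + [float(x), float(y), float(z)], activity

def synchronized_instance(rows_for_one_timestamp):
    """Hypothetical synchronized representation: if all 4 sensors reported at the same
    time stamp, the feature vector would just be the 12 coordinates (x, y, z per sensor),
    and k ≈ 3 could suffice."""
    features = []
    for row in sorted(rows_for_one_timestamp, key=lambda r: TAGS.index(r[1])):
        features += [float(row[4]), float(row[5]), float(row[6])]
    return features, rows_for_one_timestamp[0][7]

with open("ConfLongDemo_JSI.txt") as f:   # assumed file name for the UCI data
    instances = [per_reading_instance(row) for row in csv.reader(f)]
```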

BPTT Project Hints – Dparity(0,1)
- Needs LOTS of weight updates (e.g. 20 epochs with 10,000 instances, 10 epochs with 20,000 instances, etc.)
- Learning can be negligible for a long time, and then suddenly rise to 100% accuracy
- k must be at least 2
  - Larger k should just slow things down and could lead to overfit
- Need enough hidden nodes
  - Struggles to learn with fewer than 4; 4 or more does well
  - More hidden nodes can bring down the number of epochs, but may still increase wall-clock time (i.e. total number of weight updates)
  - Not all hidden nodes need to be state nodes
- Explore a bit

BPTT Project Hints – Dparity(x,y,z)
- More weight updates needed
- k should be at least z+1; try different values
- Burn-in helpful? Inconclusive so far
- Need enough hidden nodes: at least 2*z might be a conservative heuristic; more can be helpful, but too many can slow things down
- LR between .1 and .5 seemed reasonable
- Could try momentum to speed things up
- Use a fast computer language/system!

BPTT Project Hints – Real-world task
- Unlike Dparity(), the recurrence requirement for different instances may vary
  - Sometimes the task may need to look back 4-5 steps
  - Other times it may not need to look back at all
- Thus, first train with k=1 (standard BP) as the baseline, so you can see how much improvement is obtained when using recurrence
- Then try k = 2, 3, 4, etc.
- Too big a k (e.g. > 10) will usually take too long to show any benefit, since the error is too attenuated to gain much

Other Recurrent Approaches
- LSTM – Long short-term memory
- RTRL – Real-time recurrent learning
  - Does not require specifying a k; it can look arbitrarily far back
  - But note that with an expectation of looking arbitrarily far back, you create a very difficult learning problem
  - Looking back further requires more data, or else overfit – there are lots of irrelevant options which could lead to only minor accuracy improvements
  - Have reasonable expectations
  - The n^4 (and n^3) versions are too expensive in practice
- Relaxation networks – Hopfield, Boltzmann, Multcons, etc. – later