Recurrent Neural Networks


1 Recurrent Neural Networks
Recurrent Networks
- Some problems require previous history/context in order to give the proper output (speech recognition, stock forecasting, target tracking, etc.).
- One way to do that is to just provide all the necessary context in one "snapshot" and use standard learning (a sketch of this windowing is shown below).
- How big should the snapshot be? It varies for different instances of the problem.
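A minimal sketch of the fixed "snapshot" idea (my own illustration, not from the slides): slice the series into fixed-width windows so a standard feedforward learner can be used. The window width w must be chosen up front, which is exactly the limitation the recurrent approach on the next slide avoids.

    import numpy as np

    def make_snapshots(series, labels, w):
        """Turn a time series into fixed-width "snapshot" examples.
        Each example is the last w inputs; its label is the label at the
        current time step. w has to be picked in advance."""
        X, y = [], []
        for t in range(w - 1, len(series)):
            X.append(series[t - w + 1 : t + 1])
            y.append(labels[t])
        return np.array(X), np.array(y)

For example, make_snapshots(bits, parities, w=3) would give each instance the current bit and the two before it.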

2 Recurrent Neural Networks
Recurrent Networks
- Another option is to use a recurrent neural network, which lets the network dynamically learn how much context it needs in order to solve the problem.
- Speech example: vowels vs. consonants, etc.
- Acts like a state machine that will give different outputs for the current input depending on the current state.
- Recurrent nets must learn and use this state/context information in order to get high accuracy on the task.

3 Recurrent Neural Networks
Recurrent Networks
- Partially and fully recurrent networks – feedforward vs. relaxation nets
- Elman training – simple recurrent networks, can use standard BP training
- BPTT – backpropagation through time – can learn further back, must pick a depth
- Real-time recurrent learning, etc.
- Review: BP equations, saturation, inductive bias intuition

4 Recurrent Neural Networks

5 Recurrent Network Variations
- This network can theoretically learn contexts arbitrarily far back.
- Many structural variations: Elman/simple net, Jordan net, mixed, context sub-blocks, multiple hidden/context layers, etc. (the two basic update rules are written out below).
- Generalized row representation.
- How do we learn the weights?
(Instructor notes: draw the different variations on the board; for the generalized row representation, show an SRN and other variations – 2 inputs, 3 hidden, 1 output.)
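For reference, the standard textbook forms of the two basic variations (not copied from the slide; f is the hidden activation function, W the input weights, U the context weights, b the bias):

    h_t = f(W x_t + U h_{t-1} + b)    (Elman/simple net: feed back the previous hidden state)
    h_t = f(W x_t + U y_{t-1} + b)    (Jordan net: feed back the previous output)

Mixed and sub-block variations just change which pieces of h_{t-1} or y_{t-1} are fed back and where.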

6 Simple Recurrent Training – Elman Training
- Can think of the net as just a normal MLP structure where part of the input happens to be a copy of the last set of state/hidden node activations.
- The MLP itself does not even need to be aware that the context inputs are coming from the hidden layer.
- Can then train with standard BP training (see the sketch below).
- While the network can theoretically look back arbitrarily far in time, the Elman learning gradient goes back only 1 step in time, so it is limited in the context it can learn.
- What if the current output depended on an input 2 time steps back?
- Can still be useful for applications with short-term dependencies.
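A minimal sketch of the Elman view described above (assumptions: one sigmoid hidden layer and sigmoid outputs; the weight names are mine). The MLP just sees [current input, context, bias] as its input vector and never needs to know the context is a copy of last step's hidden activations.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def elman_step(x_t, context, W_ih, W_ho):
        """One time step of a simple recurrent (Elman) net."""
        inp = np.concatenate([x_t, context, [1.0]])            # input + context + bias
        hidden = sigmoid(W_ih @ inp)                           # new state/hidden activations
        out = sigmoid(W_ho @ np.concatenate([hidden, [1.0]]))
        return out, hidden                                     # hidden becomes the next context

The context is typically initialized to a 0 (or .5) vector and then simply fed back from step to step; training the weights of this structure with plain BP is Elman training.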

7 BPTT – Backprop through time
- BPTT allows us to look back further as we train.
- However, we have to pre-specify a value k, which is the maximum number of steps that learning will look back.
- During training we unfold the network in time as if it were a standard feedforward network with k layers, but where the weights of each unfolded layer are exact copies of each other.
- We then train the unfolded k-layer feedforward net with standard BP.
- Execution still happens with the actual recurrent version.
- Is not knowing k a priori that bad? How do you choose it? Cross validation, just like finding the best number of hidden nodes, etc., so we can find a good k fairly reasonably for a given task, and the search could be automated (a sketch of such a search follows).
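A minimal sketch of automating that search (train_fn and eval_fn are hypothetical stand-ins for your own BPTT trainer and validation-set evaluator, passed in by the caller):

    def choose_k(train_fn, eval_fn, candidate_ks=(1, 2, 3, 4, 5)):
        """Pick the unfolding depth k by validation, just as you would pick
        the number of hidden nodes: train one model per candidate k and keep
        the k whose model scores best on a held-out validation set."""
        best_k, best_acc = None, float("-inf")
        for k in candidate_ks:
            model = train_fn(k)     # e.g. run BPTT on the training stream with depth k
            acc = eval_fn(model)    # e.g. accuracy on a held-out validation stream
            if acc > best_acc:
                best_k, best_acc = k, acc
        return best_k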

8 Recurrent Neural Networks
- Note k=1 is just standard BP with no feedback state: one regular net (the top block in the figure above), and assuming 0-valued initial state activations it is just standard BP (even with non-zero constants, they would just act as more bias weights).
- k=2 is just Elman training.
- k is the number of feedback blocks in the unfolded net.
- Indexing: input I(t) goes into hidden H(t), with a possible output O(t); the context next to I(t) is H(t-1). Normally the first input is I(1) with state H(0) next to it, and the last input is I(k) going into H(k) and O(k).
(Instructor note: show this on the unfolded diagram.)

9 Recurrent Neural Networks
[Figure: BPTT – unfolding in time (k=3) with output connections. Unfolded layers Input1/Output1 through Inputk/Outputk, separated by one-step time delays; the weights at each layer are maintained as exact copies.]

10 Recurrent Neural Networks
[Figure: BPTT – unfolding in time (k=3) with output connections (second view, with the one-step time delays marked explicitly). The weights at each layer are maintained as exact copies.]

11 Recurrent Neural Networks
Synthetic Data Set: Delayed Parity Task – Dparity
- This task has a single time-series input of random bits. The output label is the (even) parity of n arbitrarily delayed (but consistent) previous inputs.
- For example, for Dparity(0,2,5) the label of each instance is the parity of the current input, the input 2 steps back, and the input 5 steps back.
- Dparity(0,1) is the simplest version, where the output is the XOR of the current input and the most recent previous input.
- Dparity-to-ARFF app: the user enters the number of instances wanted, a random seed, and a vector of the n delays (and optionally a noise parameter?). The app returns an ARFF file of this task with a random input stream based on the seed, with proper labels. (A sketch of such a generator is given below.)
(Instructor note: do this here so we can look at specific examples.)
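A minimal sketch of such a generator (not the course's Dparity-to-ARFF app; the function names, the optional label-noise behavior, and the use of '?' for the first max(delays) undefined labels are my own choices):

    import random

    def dparity(n_instances, seed, delays, noise=0.0):
        """Generate the delayed parity task as (bit, label) pairs.
        The label at time t is the XOR (even parity) of the inputs at t - d
        for each d in delays; the first max(delays) labels are left as '?'
        because they depend on inputs that do not exist."""
        rng = random.Random(seed)
        bits = [rng.randint(0, 1) for _ in range(n_instances)]
        rows = []
        for t, bit in enumerate(bits):
            if t < max(delays):
                label = '?'
            else:
                label = 0
                for d in delays:
                    label ^= bits[t - d]
                if noise and rng.random() < noise:   # optional label noise
                    label ^= 1
            rows.append((bit, label))
        return rows

    def to_arff(rows, path):
        """Write the (bit, label) rows as a small ARFF file."""
        with open(path, "w") as f:
            f.write("@relation dparity\n")
            f.write("@attribute bit {0,1}\n")
            f.write("@attribute parity {0,1}\n")
            f.write("@data\n")
            for bit, label in rows:
                f.write(f"{bit},{label}\n")

    # Example: Dparity(0,1) -- each label is the XOR of the current and previous bit.
    to_arff(dparity(10000, seed=42, delays=[0, 1]), "dparity_0_1.arff")

Note that leaving the first max(delays) labels undefined matches the training procedure later: the targets of the first k-1 instances are ignored.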

12 BPTT Learning/Execution
- Consider Dparity(0,1) and Dparity(0,2). For Dparity(0,1), what would k need to be? For Dparity(0,2)?
- For learning and execution we need to start the input stream at least k steps back to get reasonable context.
- How do you fill in the initial activations of the context nodes? A 0 vector is common; a .5 vector or a typical/average vector are other options.
- Note k=1 is just standard non-feedback BP, and k=2 is simple Elman training looking back one step.
- Let's do an example and walk through it (a small worked trace is given below) – HW.
(Instructor notes: Dparity(0,1) needs k=2, Dparity(0,2) needs k=3. No extra burn-in is needed beyond that, because inputs more than k steps back play no part in setting the output label, so anything could have happened before those k steps. Do an example with 1 input, 2 hidden/context nodes, 1 output, k=2; make a table of the values and draw the unfolded net as we go through the first BP execution.)
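A small worked trace (bit values chosen here purely for illustration): take the Dparity(0,1) stream x = 1, 0, 0, 1, 1. The labels from t=2 on are XOR(x_t, x_{t-1}) = 1, 0, 1, 0. With k=2, predicting the label at t=5 unfolds the net over the inputs x_4=1 and x_5=1, with the context nodes next to x_4 initialized to 0 (or .5); the only target used for backpropagation is the label at t=5, which is 1 XOR 1 = 0.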

13 BPTT Training Example/Notes
How to select instances from the training set:
- Pick random start positions; input and process for k steps (could start a few steps further back to get a more representative set of initial context node activations – burn-in).
- Use the kth label as the target.
- Any advantage in starting the next sequence at the last start + 1? You would already have good approximations for the initial context activations. (No – the next start would go through updated weights, so the old calculations would not be reusable, though they wouldn't be far off and might still work.)
- Don't shuffle the training set (the targets of the first k-1 instances are ignored).
- Unfold and propagate through the k layers.
- Backpropagate error starting only from the kth target – otherwise the hidden node weight updates would be dominated by the earlier, less attenuated target errors.
- Accumulate the weight changes and make one update at the end – thus all unfolded weights remain proper exact copies. (A sketch of one such training step is given below.)
(Instructor notes: draw the training set with inputs in one column and outputs in the next – it is a time series with separate outputs; never start a sequence at one of the last k-1 instances; fill in the initial state activations with 0 or .5.)
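A minimal numpy sketch of one such training step (assumptions: one sigmoid hidden layer, a sigmoid output layer, squared error on the single kth target; all variable names are mine). It unfolds k steps with shared weights, backpropagates only from the kth target, accumulates the weight changes, and applies them in one update so the unfolded copies stay identical:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bptt_step(xs, target, context0, W_ih, W_ho, lr=0.1):
        """One BPTT update on a window xs = [x_1, ..., x_k] with target = the kth label."""
        k, n_hidden = len(xs), len(context0)
        inputs, hiddens = [], []
        context = context0
        # ---- forward pass through the unfolded net (same weights at every layer) ----
        for t in range(k):
            inp = np.concatenate([xs[t], context, [1.0]])     # input + context + bias
            h = sigmoid(W_ih @ inp)
            inputs.append(inp)
            hiddens.append(h)
            context = h                                       # becomes the next context
        out_in = np.concatenate([hiddens[-1], [1.0]])
        out = sigmoid(W_ho @ out_in)
        # ---- backward pass, starting only from the kth target ----
        dW_ih, dW_ho = np.zeros_like(W_ih), np.zeros_like(W_ho)
        delta_out = (target - out) * out * (1 - out)
        dW_ho += np.outer(delta_out, out_in)
        delta_h = (W_ho[:, :n_hidden].T @ delta_out) * hiddens[-1] * (1 - hiddens[-1])
        for t in range(k - 1, -1, -1):
            dW_ih += np.outer(delta_h, inputs[t])             # accumulate over the copies
            if t > 0:
                n_in = len(xs[t])
                W_ctx = W_ih[:, n_in:n_in + n_hidden]         # context portion of W_ih
                delta_h = (W_ctx.T @ delta_h) * hiddens[t - 1] * (1 - hiddens[t - 1])
        # ---- one update at the end, so all unfolded copies stay exact ----
        W_ih += lr * dW_ih
        W_ho += lr * dW_ho
        return out

For Dparity(0,1) with k=2, xs would be the previous and current bit (each as a length-1 array), context0 a 0 or .5 vector, and target the label of the second instance; W_ih and W_ho can start as small random matrices, e.g. np.random.uniform(-0.1, 0.1, size=...).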

14 Recurrent Neural Networks
BPTT Issues/Notes
- Typically an exponential drop-off in the effect of prior inputs – there is only so much that a few context nodes can be expected to remember (see the note below).
- Error attenuation issues of multi-layer BP learning as k gets larger.
- Learning is less stable and it is more difficult to get good results; local optima are more common with recurrent nets.
- BPTT is a common approach; finding the proper depth k is important.
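One standard back-of-the-envelope way to see that drop-off (my addition, assuming sigmoid hidden units as elsewhere in these notes): each unfolded step multiplies the backpropagated delta by the context weights U and by the sigmoid derivative, which is at most 1/4, so

    \delta_{t-1} = (U^T \delta_t) \odot h_{t-1}(1 - h_{t-1}),
    \|\delta_{t-k}\| \lesssim (\|U\| / 4)^k \, \|\delta_t\|

and unless the context weights are large, the influence of an input k steps back shrinks roughly geometrically in k.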

15 Recurrent Neural Networks
BPTT Project
- Implement BPTT.
- Experiment with the Delayed Parity Task: first test with Dparity(0,1) to make sure that works, then try other variations, including ones which stress BPTT.
- Analyze the results of learning a real-world recurrent task of your choice.
- The assignment version is not re-created here – go to the assignment page for full details.
(Instructor notes: Dparity requires lots of training – thousands of epochs, depending on training set size – and learning is often flat for a long time before it suddenly picks up. Some burn-in is needed (1 more than k should do, but results improved a bit when using more). Consider using a k one larger than you would think: you can do well with k equal to the furthest delay back plus a burn-in of at least 1, but k+1 can be faster/better, while an even bigger k can slow things down. Sensitive to the learning parameters (number of instances – needs lots, LR, momentum, etc.). The number of hidden nodes is critical – more is usually better (state vs. hidden nodes), but too many slows learning way down (and note that fewer epochs with more hidden nodes could still mean longer wall clock time). Real-world problems are not completely dependent on recurrence (unlike Dparity), so also try k=1 (standard BP) as the baseline comparison. Use a fast language – lots of simulation time.)

16 Recurrent Neural Networks
BPTT Project
- Sequential/time-series data, with and without separate labels: these series often do not have separate labels, and recurrent nets can support both variations.
- Possibilities in the UC Irvine (UCI) Machine Learning Repository.
- Detailed example – Localization Data for Person Activity Data Set. Let's set this one up exactly – there are some subtleties. Which features should we use as inputs?
(Instructor notes: look at the UCI data set together. Note time-series vs. sequential – a time series has consistent time steps, while a sequence is less tied to time, except that one thing happens after another. Note classification, regression, clustering, etc. Look at the Names and Text versions (both windows up) for the localization set. Note that you should only use the Tag ID and the corresponding x, y, z coordinates as inputs – why? Also note that the events are not completely regular.)

17 Recurrent Neural Networks
Localization Example
- Time stamps are not that regular, and there is just one sensor reading per time stamp.
- Could try to separate out learning of one sensor at a time, but the combination of sensors is critical, and just keeping the examples in temporal order is sufficient for learning.
- What would the network structure and data representation look like? (One possible input encoding is sketched below.) What value for k? What does a typical CV graph look like? Stopping criteria (e.g. validation set, etc.)?
- Remember basic BP issues: normalization, nominal value encoding, don't-know values, etc.
(Instructor note: localist vs. encoded representation for nominals; don't-know values – this set doesn't have them, but normally you need a plan for them.)
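A minimal sketch of one possible input encoding (my assumption, not the assignment's required representation): a localist/one-hot encoding of the nominal Tag ID plus min-max normalized x, y, z coordinates. The tag ID strings below are placeholders – substitute the four real IDs from the data file.

    import numpy as np

    TAG_IDS = ["TAG_1", "TAG_2", "TAG_3", "TAG_4"]   # placeholder IDs, one per sensor

    def encode_instance(tag_id, x, y, z, coord_min, coord_max):
        """One-hot the nominal tag ID and min-max normalize x, y, z to [0, 1],
        giving 4 + 3 = 7 network inputs per time step."""
        one_hot = [1.0 if tag_id == t else 0.0 for t in TAG_IDS]
        coords = (np.array([x, y, z]) - coord_min) / (coord_max - coord_min)
        return np.concatenate([one_hot, coords])

coord_min and coord_max would be per-axis vectors computed from the training data; the activity class label would be encoded separately as the target.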

18 Recurrent Neural Networks
Localization Example
- Note that you might think there would be a synchronized time stamp giving the x, y, z coordinates of all 4 sensors at once – in which case what would the feature vector look like? (All 4 sensors' coordinates in one 12-value instance, with no need for the Tag ID input.)
- You could then use k ≈ 3, vs. k ≈ 10 for the current one-reading-per-time-stamp version (and k ≈ 10 will struggle due to error attenuation).

19 Recurrent Neural Networks
BPTT Project Hints: Dparity(0,1)
- Needs LOTS of weight updates (e.g. 20 epochs with 10,000 instances, 10 epochs with 20,000 instances, etc.).
- Learning can be negligible for a long time, and then suddenly rise to 100% accuracy.
- k must be at least 2; larger k should just slow things down and can lead to overfit.
- Need enough hidden nodes: it struggles to learn with fewer than 4, while 4 or more does well. More hidden nodes can bring down the number of epochs (i.e. weight updates), but may still increase wall clock time.
- Not all hidden nodes need to be state nodes.
- Explore a bit.

20 Recurrent Neural Networks
BPTT Project Hints: Dparity(x,y,z)
- More weight updates needed.
- k should be at least z+1; try different values.
- Is burn-in helpful? Inconclusive so far.
- Need enough hidden nodes: at least 2*z might be a conservative heuristic; more can be helpful, but too many can slow things down.
- A learning rate between .1 and .5 seemed reasonable; could try momentum to speed things up.
- Use a fast computer language/system!

21 Recurrent Neural Networks
BPTT Project Hints: Real-world task
- Unlike Dparity(), the recurrence requirement may vary across instances: sometimes you may need to look back 4-5 steps, other times you may not need to look back at all.
- Thus, first train with k=1 (standard BP) as the baseline, so you can see how much improvement is obtained when using recurrence. Then try k = 2, 3, 4, etc.
- Too big a k (e.g. > 10) will usually take too long to show any benefit, since the error is too attenuated by then to gain much.

22 Other Recurrent Approaches
- LSTM – Long Short-Term Memory
- RTRL – Real-Time Recurrent Learning
- These do not have to specify a k; they will look arbitrarily far back.
- But note that with an expectation of looking arbitrarily far back, you create a very difficult problem: looking back further requires more data, else overfit – lots of irrelevant options which could lead to only minor accuracy improvements. Have reasonable expectations.
- O(n^4) and O(n^3) versions – too expensive in practice.
- Relaxation networks – Hopfield, Boltzmann, Multcons, etc. – covered later.

