Lecture 16: Recurrent Neural Networks (RNNs)

1 Lecture 16: Recurrent Neural Networks (RNNs)
CS480/680: Intro to ML, Lecture 16: Recurrent Neural Networks (RNNs). Some slides are adapted from the Stanford CS231n course slides. 11/27/18, Yaoliang Yu

2 Recap: MLPs and CNNs
MLPs: fixed length of input and output; no parameter sharing; units fully connected to the previous layer.
CNNs: parameter sharing; units connected only to a small region of the previous layer.

3 RNN: examples (1) Variable length of input/output
e.g., image captioning

4 RNN: examples (2) Variable length of input/output
e.g., sentiment classification: ``There is nothing to like in this movie.'' → 0; ``This movie is fantastic!'' → 1

5 RNN: examples (3) Variable length of input/output
e.g., machine translation: 你想和我一起唱歌吗? → Do you want to sing with me?

6 RNN: examples (4) Variable length of input/output
e.g., video classification at the frame level

7 RNN: parameter sharing
Parameter sharing: each hidden state is produced by the same update rule applied to the previous state, $h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$, where $h^{(t-1)}$ is the old state at time t-1, $x^{(t)}$ is the new input at time t, and $\theta$ are the parameters. The same function f() is used for every time step. Through these recurrent connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far apart in the data.
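As a concrete illustration of this shared update rule, here is a minimal NumPy sketch; the names rnn_step, W_hh, W_xh, and b are illustrative assumptions, not the slide's notation:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One recurrent update: the same f() and the same parameters at every time step."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

def rnn_forward(x_seq, h0, W_hh, W_xh, b):
    """Unrolling just applies the same update repeatedly over the sequence."""
    h, states = h0, []
    for x_t in x_seq:
        h = rnn_step(h, x_t, W_hh, W_xh, b)
        states.append(h)
    return states
```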

8 Unfolding computational graphs (1)
RNN with no output time delay

9 Unfolding computational graphs (2)
Example: an RNN that produces a (discrete) output at each time step and has recurrent connections between hidden units (many to many). The same parameters are used for every step. Vanilla RNN: one single hidden vector h.
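A hedged sketch of this many-to-many vanilla RNN, producing an (unnormalized) output score vector at every time step; the parameter names W_xh, W_hh, W_hy are assumptions rather than the slide's notation:

```python
import numpy as np

def vanilla_rnn_forward(x_seq, h0, W_xh, W_hh, W_hy, b_h, b_y):
    """Many-to-many vanilla RNN: a single hidden vector h, one output per step."""
    h, hidden_states, outputs = h0, [], []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # same parameters at every step
        hidden_states.append(h)
        outputs.append(W_hy @ h + b_y)             # output scores at step t
    return hidden_states, outputs
```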

10 Language model: training
[Figure: training example; predictions that do not match the target are marked ``wrong''.] Hidden state update: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$.
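A minimal sketch of the training loss for such a language model, assuming one-hot token inputs, the update rule above, and a softmax cross-entropy loss over the next token (W_hy and the bias-free form are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def lm_loss(x_seq, targets, h0, W_xh, W_hh, W_hy):
    """Sum of cross-entropy losses for predicting the next token at each step."""
    h, loss = h0, 0.0
    for x_t, target_ix in zip(x_seq, targets):   # x_t: one-hot input, target_ix: next-token index
        h = np.tanh(W_hh @ h + W_xh @ x_t)       # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        p = softmax(W_hy @ h)                    # distribution over the vocabulary
        loss += -np.log(p[target_ix])
    return loss
```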

11 Language model: testing
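At test time the trained model generates text one token at a time, feeding each sampled token back in as the next input. A hedged sketch, with names consistent with the training sketch above (the sampling loop itself is an assumption):

```python
import numpy as np

def sample(seed_ix, h, steps, W_xh, W_hh, W_hy, vocab_size):
    """Generate a sequence by feeding each sampled token back as the next input."""
    x = np.zeros(vocab_size)
    x[seed_ix] = 1.0
    generated = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h + W_xh @ x)
        logits = W_hy @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()                               # softmax over the vocabulary
        ix = np.random.choice(vocab_size, p=p)     # sample the next token
        x = np.zeros(vocab_size)
        x[ix] = 1.0                                # feed it back in at the next step
        generated.append(ix)
    return generated
```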

12 Vanilla RNN gradient problems
Challenge of long-term backprop: the backward pass repeatedly multiplies by the same matrix W (and by tanh derivatives). Assume W is real symmetric with eigendecomposition $W = Q \Lambda Q^\top$ (Q orthogonal); then $W^t = Q \Lambda^t Q^\top$.
Largest eigenvalue > 1: exploding gradient. Largest eigenvalue < 1: vanishing gradient. Either way it becomes practically impossible to learn correlations between temporally distant events, e.g., the long-range dependency in ``Jack, who is …., is a great guy.''
One proposed remedy: a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem.
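To see why the largest eigenvalue matters, consider for simplicity a linear recurrence $h_t = W h_{t-1}$ with the symmetric W above (the tanh derivatives only shrink the product further); this is a sketch of the standard argument, not the slide's derivation:

```latex
\frac{\partial h_T}{\partial h_0}
  = \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
  = W^{T} = Q \Lambda^{T} Q^{\top},
\qquad
\Big\| \frac{\partial h_T}{\partial h_0} \Big\|_2 = |\lambda_{\max}|^{T}
\;\to\;
\begin{cases}
\infty, & |\lambda_{\max}| > 1 \quad (\text{exploding}),\\
0, & |\lambda_{\max}| < 1 \quad (\text{vanishing}),
\end{cases}
```

where $T$ is the number of time steps.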

13 Exploding gradient
Common solution: gradient clipping. Let g be the gradient and v the threshold: if ‖g‖ > v, rescale g ← (v/‖g‖) g. Without clipping, a single huge gradient step can overshoot and throw the parameters far off.
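A minimal sketch of this norm-clipping rule, with g and v as on the slide:

```python
import numpy as np

def clip_gradient(g, v):
    """If the gradient norm exceeds the threshold v, rescale g to have norm v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g
```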

14 Vanishing gradient
Common solution: create paths along which the product of gradients stays near 1, e.g., LSTM, GRU. A sigmoid gate outputs values near 0 or 1 and acts like a switch (off/on); a tanh nonlinearity outputs values in [-1, 1] and carries the useful information. The flow of information into and out of the unit is guarded by learned input and output gates, applied by element-wise product.

15 Long short-term memory (LSTM)
ct: cell state (the memory cell, which is the thing that is able to remember information over time); ht: hidden state. Gradient w.r.t. a parameter w: at each time step, first take the gradient w.r.t. the hidden state and cell state, then the local gradient of the hidden and cell state w.r.t. w, and finally sum the contributions over all time steps. The hidden state h is a function of the cell state c, so we can focus on the cell state. Backprop through the cell state is element-wise, and the forget gate f is different at each time step, so we do not multiply by the same factor over and over again. Because f comes from a sigmoid and lies between 0 and 1, the gradient will not explode; taken together, these factors also avoid extreme cases of vanishing gradients (though the problem is still possible), giving a smoother gradient flow.
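A hedged sketch of one LSTM step in the common formulation with gates i, f, o and candidate g; the stacked-weight layout and names are assumptions, and the exact parametrization on the slide may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate weight matrices: shape [4H, H + D]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0*H:1*H])            # input gate
    f = sigmoid(z[1*H:2*H])            # forget gate
    o = sigmoid(z[2*H:3*H])            # output gate
    g = np.tanh(z[3*H:4*H])            # candidate cell update
    c_t = f * c_prev + i * g           # cell state: element-wise gating
    h_t = o * np.tanh(c_t)             # hidden state is a function of the cell state
    return h_t, c_t
```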

16 LSTM: cell state
The update of the cell state can simulate a counter: the ability to add to the previous value means these units can accumulate information over time. You are asked to show that if the forget gate is on and the input and output gates are off, the cell just passes the memory cell gradients through unmodified at each time step. Therefore, the LSTM architecture is resistant to exploding and vanishing gradients, although mathematically both phenomena are still possible.
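A one-line check of that claim, assuming the standard cell update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ from the sketch above (with the output gate off at every step, h carries no signal, so the only path is through c):

```latex
i_t = 0,\; o_t = 0 \;\Rightarrow\; c_t = f_t \odot c_{t-1},
\qquad
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t) = I
\;\text{ when the forget gate is fully on } (f_t = \mathbf{1}),
```

so the memory-cell gradient passes through unmodified at each such step.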

17 Gated recurrent unit (GRU)
GRU: 2 gates (reset r, update z). LSTM: 3 gates (input i, forget f, output o).
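A hedged sketch of one GRU step in a common parametrization (the names Wr, Ur, Wz, Uz, Wh, Uh are assumptions; conventions for which of z / 1-z multiplies the old state vary across papers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step with reset gate r and update gate z (no separate cell state)."""
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde             # interpolate old and candidate state
```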

18 Bidirectional RNNs
Idea: combine an RNN that moves forward through time with another RNN that moves backward through time. Motivation: sometimes future information is needed, e.g., named entity recognition: He said: ``Teddy bears are on sale!'' (``Teddy'' is not a person's name) vs. He said: ``Teddy Roosevelt was a great President!'' (it is).

19 Illustration
h(t): propagates information forward. g(t): propagates information backward. o(t): the output uses both h(t) and g(t).
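A hedged sketch of the bidirectional forward pass, assuming the vanilla update from earlier for both directions and an output that simply concatenates h(t) and g(t) (W_out and the parameter tuples are assumptions):

```python
import numpy as np

def step(h, x, W_hh, W_xh, b):
    return np.tanh(W_hh @ h + W_xh @ x + b)     # same vanilla update as before

def birnn_forward(x_seq, h0, g0, fwd, bwd, W_out):
    """Bidirectional RNN: h(t) runs left-to-right, g(t) runs right-to-left."""
    hs, h = [], h0
    for x_t in x_seq:                            # forward pass over time
        h = step(h, x_t, *fwd)
        hs.append(h)
    gs, g = [], g0
    for x_t in reversed(x_seq):                  # backward pass over time
        g = step(g, x_t, *bwd)
        gs.append(g)
    gs = gs[::-1]                                # realign with forward time order
    # o(t) uses both h(t) and g(t)
    return [W_out @ np.concatenate([h_t, g_t]) for h_t, g_t in zip(hs, gs)]
```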

20 Deep RNNs
Usually 2-3 hidden layers. l: l-th hidden layer; t: t-th time step.
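A hedged sketch of stacking: at each time step, layer l reads the state of layer l-1 at the same time step (the per-layer vanilla update and parameter layout are assumptions; 2-3 layers is typical, as noted above):

```python
import numpy as np

def deep_rnn_forward(x_seq, h0_per_layer, params_per_layer):
    """Stacked RNN: layer l at time t takes layer l-1's state at time t as input."""
    h = list(h0_per_layer)                       # one hidden state per layer
    top_states = []
    for x_t in x_seq:
        layer_input = x_t
        for l, (W_hh, W_xh, b) in enumerate(params_per_layer):
            h[l] = np.tanh(W_hh @ h[l] + W_xh @ layer_input + b)
            layer_input = h[l]                   # feed upward to the next layer
        top_states.append(h[-1])                 # top layer's state at time t
    return top_states
```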

21 Questions?

