
1 Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Kai Sheng Tai, Richard Socher, Christopher D. Manning
Presentation by: Reed Coke

2 Neural Nets in NLP As we know, neural networks are taking NLP by storm. [Figures: left, a visualization of word embeddings; right, sentiment analysis using RNTNs.]

3 Long Short-Term Memory (LSTM)
One particular type of network architecture has become the de facto way to model sentences: the long short-term memory network (Hochreiter and Schmidhuber, 1997). LSTMs are a type of recurrent neural network that is good at remembering information over long periods of time within a sequence. Cathy Finegan-Dollak recently gave a presentation about LSTMs and kindly allowed me to borrow many of her examples and images.

4 Long Short-Term Memory (LSTM)
“My dog [eat/eats] rawhide” “My dog, who I rescued in 2009, [eat/eats] rawhide” Why might sentence 2 be harder to handle than sentence 1?

5 Long Short-Term Memory (LSTM)
“My dog [eat/eats] rawhide” “My dog, who I rescued in 2009, [eat/eats] rawhide” Why might sentence 2 be harder to handle than sentence 1? Long-term dependencies: the network needs to remember that “dog” was singular even though there were five words in the way.

6 RNN: Deep Learning for Sequences
[Figure: an unrolled RNN. Inputs x1 ... xt feed hidden states h1 ... ht through the weights wx and wh and a sigmoid; each hidden state produces an output through wy, and the predicted sequence (ŷ1 ... ŷt) is compared to the correct sequence (y1 ... yt).] Figure credit: Cathy Finegan-Dollak

7 RNN: Deep Learning for Sequences
[Figure: the same unrolled RNN on the example sentence “My dog, who I rescued in 2009, ...”; at each step the predicted next word (e.g. “eat” vs. “eats”) is compared to the correct one.] Figure credit: Cathy Finegan-Dollak
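
To make the recurrence concrete, here is a minimal sketch of the RNN update in the figure (my own illustration, not code from the talk), written in NumPy. The weight names wx, wh, wy mirror the diagram; the dimensions and random initialization are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes, chosen only for illustration.
input_dim, hidden_dim, vocab_size = 8, 16, 100
rng = np.random.default_rng(0)
wx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
wy = rng.normal(scale=0.1, size=(vocab_size, hidden_dim))  # hidden -> output

def rnn_forward(xs):
    """Run the unrolled RNN over a list of input word vectors xs."""
    h = np.zeros(hidden_dim)          # h0
    predictions = []
    for x in xs:
        h = sigmoid(wx @ x + wh @ h)  # h_t depends on x_t and h_{t-1}
        y = wy @ h                    # scores over the vocabulary
        predictions.append(int(y.argmax()))
    return predictions

# One random six-word "sentence".
sentence = [rng.normal(size=input_dim) for _ in range(6)]
print(rnn_forward(sentence))
```

The key point for the dog example is that information about “dog” can only reach the prediction at “[eat/eats]” by surviving every intermediate application of wh, which is exactly where plain RNNs struggle.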

8 Modern LSTM Diagram
[Figure: an LSTM cell. The input (here, “My”) and the previous hidden state h0 feed a candidate ĉ1 (weights wxc, whc) and three gates i1, f1, o1 (weights wxi/whi, wxf/whf, wxo/who); these combine with the previous cell state c0 to produce the new cell state c1 and hidden state h1.] The solution is to add gates. i: “input gate”, f: “forget gate”, o: “output gate”. Figure credit: Cathy Finegan-Dollak

9 Modern LSTM Diagram
[Same LSTM cell figure as slide 8.] The Input Gate: the input gate takes a value between 0 and 1, depending on the input, and the candidate value ĉ1 is “discounted” by that amount before it enters the cell state. Figure credit: Cathy Finegan-Dollak

10 Modern LSTM Diagram
[Same LSTM cell figure as slide 8.] The Forget Gate: the forget gate determines how much of the previous cell state, c0, should be kept. Figure credit: Cathy Finegan-Dollak

11 Modern LSTM Diagram
[Same LSTM cell figure as slide 8.] The Output Gate: the output gate modulates, based on the input, how much of the cell state is exposed as the hidden state h1 that gets sent to the output. Figure credit: Cathy Finegan-Dollak
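
Putting the three gates together, here is a compact sketch of a single LSTM step in NumPy (my own illustration, not the talk's or the paper's code), following the standard LSTM formulation the slides describe. The weight names wxi/whi, wxf/whf, wxo/who, wxc/whc mirror the labels in the diagram; sizes and initialization are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns (h_t, c_t). params holds the weight matrices
    named after the diagram (wxi, whi, wxf, whf, wxo, who, wxc, whc) plus biases."""
    p = params
    i = sigmoid(p["wxi"] @ x + p["whi"] @ h_prev + p["bi"])      # input gate
    f = sigmoid(p["wxf"] @ x + p["whf"] @ h_prev + p["bf"])      # forget gate
    o = sigmoid(p["wxo"] @ x + p["who"] @ h_prev + p["bo"])      # output gate
    c_hat = np.tanh(p["wxc"] @ x + p["whc"] @ h_prev + p["bc"])  # candidate ĉ_t
    c = f * c_prev + i * c_hat      # keep part of the old cell, add part of the new
    h = o * np.tanh(c)              # expose part of the cell as the hidden state
    return h, c

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
params = {k: rng.normal(scale=0.1, size=(d_hid, d_in if k.startswith("wx") else d_hid))
          for k in ["wxi", "whi", "wxf", "whf", "wxo", "who", "wxc", "whc"]}
params.update({b: np.zeros(d_hid) for b in ["bi", "bf", "bo", "bc"]})
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

The forget gate multiplies c_prev, the input gate multiplies the candidate ĉ, and the output gate multiplies tanh(c), which are exactly the three roles described on slides 9-11.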

12 LSTMs and Beyond LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015). But, is language really just a flat sequence of words?

13 LSTMs and Beyond LSTMs are experiencing wild success in language modeling (Filippova et al. 2015, Sutskever et al. 2014, Graves 2013, Sundermeyer et al. 2010, Tang et al. 2015). But, is language really just a flat sequence of words? If so, I kind of regret studying linguistics in college.

14 Tree-LSTMs One of the things that makes language so interesting is its tree structure. To exploit this structure while keeping the performance of an LSTM, the LSTM is generalized into the Tree-LSTM. Instead of taking input from a single previous node, a Tree-LSTM node takes input from all the children of that node. Today’s paper discusses two variations: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM.

15 Tree-LSTMs
[Figures: an ordinary Tree-LSTM and a Binary Tree-LSTM.] All hail Cathy, who made these figures.

16 Binary Tree-LSTMs
The Forget Gates: A Binary Tree-LSTM can be used on a tree that has a maximum branching factor of 2. f1 is the forget gate for all left children and f3 is the forget gate for all right children. This scheme works well if the children are ordered in a predictable way. All hail Cathy, who made these figures.
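
As a rough sketch of the idea (my own reconstruction in NumPy, not the paper's exact parameterization or code), a binary (N = 2) Tree-LSTM node can be written like this: the left and right children get separate forget gates f_l and f_r, and the ordered pair of child hidden states feeds the other gates. All names and sizes below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_treelstm_node(x, left, right, p):
    """One node of a binary (N = 2) Tree-LSTM, as a rough sketch.
    left and right are (h, c) pairs from the two children; p is a dict of
    weights. The point of the N-ary variant is that the left and right
    children get their own forget gates, f_l and f_r."""
    (h_l, c_l), (h_r, c_r) = left, right
    hs = np.concatenate([h_l, h_r])                        # ordered children
    i   = sigmoid(p["wxi"] @ x + p["uhi"] @ hs + p["bi"])  # input gate
    f_l = sigmoid(p["wxf"] @ x + p["ufl"] @ hs + p["bf"])  # forget gate, left child
    f_r = sigmoid(p["wxf"] @ x + p["ufr"] @ hs + p["bf"])  # forget gate, right child
    o   = sigmoid(p["wxo"] @ x + p["uho"] @ hs + p["bo"])  # output gate
    u   = np.tanh(p["wxu"] @ x + p["uhu"] @ hs + p["bu"])  # candidate
    c = i * u + f_l * c_l + f_r * c_r                      # each child gated separately
    h = o * np.tanh(c)
    return h, c

# Toy usage with made-up sizes; leaves pass zero (h, c) pairs for missing children.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
p = {k: rng.normal(scale=0.1, size=(d_hid, d_in)) for k in ["wxi", "wxf", "wxo", "wxu"]}
p.update({k: rng.normal(scale=0.1, size=(d_hid, 2 * d_hid)) for k in ["uhi", "ufl", "ufr", "uho", "uhu"]})
p.update({k: np.zeros(d_hid) for k in ["bi", "bf", "bo", "bu"]})
zero_child = (np.zeros(d_hid), np.zeros(d_hid))
left = binary_treelstm_node(rng.normal(size=d_in), zero_child, zero_child, p)
right = binary_treelstm_node(rng.normal(size=d_in), zero_child, zero_child, p)
root = binary_treelstm_node(rng.normal(size=d_in), left, right, p)
print(root[0].shape)
```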

17 Child-Sum Tree-LSTMs The children of a Child-Sum Tree-LSTM are unordered. Instead of a single h from the previous time step, the h value that is combined with xt to form the candidate ut and the gates is the sum of the h values of all the child nodes. The cell states of the children, each passed through its own forget gate, are likewise summed into the cell state for xt. A sketch of the node update follows below.
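
Here is a matching sketch of a Child-Sum Tree-LSTM node (again my own reconstruction in NumPy, with made-up names and sizes): the child hidden states are summed before they feed the input gate, output gate, and candidate, while each child's cell state passes through its own forget gate (computed from that child's hidden state with shared parameters) before being summed into the new cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_treelstm_node(x, children, p):
    """One node of a Child-Sum Tree-LSTM (a sketch of the idea, not the
    paper's code). children is a list of (h, c) pairs in no particular
    order; p is a dict of weight matrices and bias vectors."""
    d_hid = p["bi"].shape[0]
    h_tilde = sum((h for h, _ in children), np.zeros(d_hid))   # sum of child hidden states
    i = sigmoid(p["wxi"] @ x + p["uhi"] @ h_tilde + p["bi"])   # input gate
    o = sigmoid(p["wxo"] @ x + p["uho"] @ h_tilde + p["bo"])   # output gate
    u = np.tanh(p["wxu"] @ x + p["uhu"] @ h_tilde + p["bu"])   # candidate
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(p["wxf"] @ x + p["uhf"] @ h_k + p["bf"])  # one forget gate per child
        c = c + f_k * c_k                                       # (shared parameters)
    h = o * np.tanh(c)
    return h, c

# Toy usage with made-up sizes: a leaf has no children, a parent takes an unordered list.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
p = {k: rng.normal(scale=0.1, size=(d_hid, d_in)) for k in ["wxi", "wxf", "wxo", "wxu"]}
p.update({k: rng.normal(scale=0.1, size=(d_hid, d_hid)) for k in ["uhi", "uhf", "uho", "uhu"]})
p.update({k: np.zeros(d_hid) for k in ["bi", "bf", "bo", "bu"]})
leaf1 = child_sum_treelstm_node(rng.normal(size=d_in), [], p)
leaf2 = child_sum_treelstm_node(rng.normal(size=d_in), [], p)
root = child_sum_treelstm_node(rng.normal(size=d_in), [leaf1, leaf2], p)
print(root[0].shape)
```

Because the children are summed and each forget gate shares parameters, the node works for any number of children in any order, which is why this variant fits dependency trees.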

18 Summary
N-ary Tree-LSTMs: children are discrete and ordered, and each of the N child slots has its own forget gate (e.g. the same gate for all left children), so the forget gate is determined by the ordering. Restriction: the branching factor may not exceed N. Also called Constituency Tree-LSTMs.
Child-Sum Tree-LSTMs: children are lumped together and unordered, because they are all summed anyway. Also called Dependency Tree-LSTMs.

19 Summary
Sentiment Classification: two settings, binary and 1-5. Stanford Sentiment Treebank (Socher et al. 2013); the dataset includes parse trees.
Semantic Relatedness: given two sentences, predict a relatedness score on a [1, 5] scale (1 = least related, 5 = most related). Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al. 2014); the final label of a sentence pair is the average of 10 annotators.
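
For the relatedness task, the paper compares the two Tree-LSTM root representations through their elementwise product and absolute difference, predicts a distribution over the ratings 1-5, and takes its expectation as the score. Here is a hedged sketch of that scoring head in NumPy (my own reconstruction; the weight names, sizes, and initialization are made up, and the hidden-layer details may differ from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def relatedness_score(h_left, h_right, p):
    """Combine two sentence representations into a similarity rating in [1, 5],
    using the elementwise product and absolute difference of the two root
    hidden states; weight names here are invented for the sketch."""
    h_mul = h_left * h_right             # elementwise product
    h_abs = np.abs(h_left - h_right)     # elementwise absolute difference
    h_s = sigmoid(p["w_mul"] @ h_mul + p["w_abs"] @ h_abs + p["b_s"])
    probs = softmax(p["w_p"] @ h_s + p["b_p"])   # distribution over ratings 1..5
    return float(np.arange(1, 6) @ probs)        # expected rating

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
d_hid, d_s = 16, 10
p = {"w_mul": rng.normal(scale=0.1, size=(d_s, d_hid)),
     "w_abs": rng.normal(scale=0.1, size=(d_s, d_hid)),
     "b_s": np.zeros(d_s),
     "w_p": rng.normal(scale=0.1, size=(5, d_s)),
     "b_p": np.zeros(5)}
print(relatedness_score(rng.normal(size=d_hid), rng.normal(size=d_hid), p))
```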

20 Sentiment Classification Results
The Constituency Tree-LSTM is trained on more data than the Dependency Tree-LSTM. Continuing to train the GloVe vectors yields a noticeable improvement. There is no mention of why CNN-multichannel performs so well on the binary task.

21 Semantic Relatedness Results
Supervision is provided only at the root of the tree. The maximum depth of a dependency tree is smaller than that of a binarized constituency tree.

22 Example Code
Various LSTM Examples:
LSTM for sentiment analysis of tweets - deeplearning.net
Character-level LSTM for sequence generation - Andrej Karpathy
RNNLIB - Alex Graves
LSTM with peepholes - Felix Gers
Tree-LSTMs:
N-ary and Child-Sum Tree-LSTMs - Kai Sheng Tai
Other:
GloVe vectors - Jeffrey Pennington

