1 Recursive Neural Networks
Hazem Nomer

2 Introduction
Recursive neural networks are non-linear adaptive models that learn deep, structured information. They operate on structured input (e.g. binary trees, graphs, and sequences). Put simply, the network of hidden units takes the same shape as the input structure (e.g. a tree). They are a generalization of recurrent neural networks and have been applied to parsing, sentiment analysis, and protein structure prediction.

3 Recurrent vs. Recursive
The operation of a recursive neural network. Black, orange, and red dots represent the input, hidden, and output layers, respectively. We begin by computing the representation of each word vector (the leaf nodes) and then compute the internal nodes. Figure (C) shows a recurrent neural network folded through time.

4 Recurrent vs. Recursive
Recurrent neural networks are feed-forward neural networks with recurrent edges that span time steps: the activation of a unit is stored and used as an input at the next time step.
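As an illustration, the sketch below unrolls that stored-activation idea over a short input sequence; the weight names W_xh and W_hh and the tanh nonlinearity are illustrative assumptions, not notation from these slides.

```python
import numpy as np

# The hidden activation h is stored and fed back as an input at the next step.
def recurrent_step(x_t, h_prev, W_xh, W_hh, b):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W_xh = 0.1 * rng.normal(size=(d_h, d_in))
W_hh = 0.1 * rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # unroll over 5 time steps
    h = recurrent_step(x_t, h, W_xh, W_hh, b)
```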

5 Recursive Neural Network
Given a positional directed acyclic graph, the recursive neural network visits the nodes in topological order and recursively applies transformations to generate further representations from the previously computed representations of the children. Given a binary tree whose leaves carry the initial representations, e.g. a parse tree with word vector representations at the leaves, the network computes the representation at each internal node η as

x_η = f(W_L x_{l(η)} + W_R x_{r(η)} + b)

where l(η) and r(η) are the left and right children of η, W_L and W_R are the weight matrices that connect the left and right children to the parent, and b is a bias vector.
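A minimal sketch of this bottom-up composition over a binary parse tree; the tanh nonlinearity and the small Node class are illustrative choices, not prescribed by the slides.

```python
import numpy as np

class Node:
    def __init__(self, word_vec=None, left=None, right=None):
        self.word_vec = word_vec        # set for leaves only
        self.left, self.right = left, right
        self.x = None                   # representation computed bottom-up

def compose(node, W_L, W_R, b, f=np.tanh):
    if node.word_vec is not None:       # leaf: use its word vector
        node.x = node.word_vec
    else:                               # internal node: combine the children
        compose(node.left, W_L, W_R, b, f)
        compose(node.right, W_L, W_R, b, f)
        node.x = f(W_L @ node.left.x + W_R @ node.right.x + b)
    return node.x

d = 5
rng = np.random.default_rng(0)
W_L = 0.1 * rng.normal(size=(d, d))
W_R = 0.1 * rng.normal(size=(d, d))
b = np.zeros(d)
# a two-word phrase such as "(very good)" as a two-leaf tree
tree = Node(left=Node(word_vec=rng.normal(size=d)),
            right=Node(word_vec=rng.normal(size=d)))
print(compose(tree, W_L, W_R, b))
```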

6 Recursive Neural Network
The previous definition shows that initial representations and intermediate representations lie in the same space. A task-specific output layer is then placed above the representation layer. As an example, for the task of sentiment classification, yη is the predicted sentiment label of the phrase given by the subtree rooted at η. Thus, during supervised learning, external errors are incurred on y and backpropagated from the root toward the leaves.
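A sketch of such an output layer, assuming a softmax classifier y_η = softmax(U x_η + c) with five sentiment classes; U and c are hypothetical names for the output weights, not notation taken from the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_label(x_eta, U, c):
    """Predicted sentiment class for the node representation x_eta."""
    return int(np.argmax(softmax(U @ x_eta + c)))

d, n_classes = 5, 5
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(n_classes, d))
c = np.zeros(n_classes)
print(predict_label(rng.normal(size=d), U, c))
```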

7 Untying Leaves and internals
The previous definition treated the leaves and the internal nodes the same (words and phrases lie in the same meaning space). We can instead use an untied version that distinguishes between leaf nodes and internal nodes. Benefits: small but powerful models can be trained by using pretrained word vectors with a large dimensionality, and separating leaves and internal nodes allows the use of rectifiers in a more natural manner.
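Building on the Node class from the earlier sketch, the untied variant below maps leaf word vectors into the hidden space with a separate matrix W_x (an illustrative name), while internal nodes keep W_L and W_R; the ReLU choice is also an assumption of the sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def compose_untied(node, W_x, W_L, W_R, b):
    if node.word_vec is not None:
        # Leaf: project the (possibly high-dimensional) word vector into the
        # hidden space with its own matrix W_x.
        node.h = relu(W_x @ node.word_vec + b)
    else:
        compose_untied(node.left, W_x, W_L, W_R, b)
        compose_untied(node.right, W_x, W_L, W_R, b)
        node.h = relu(W_L @ node.left.h + W_R @ node.right.h + b)
    return node.h
```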

8 Deep Recursive Neural Network
Recursive neural networks are deep in structure, but they lack a hierarchical interpretation of the data. In stacked deep learners, depth means a hierarchy among hidden representations: every hidden layer lies in a different representation space and is a more abstract representation of the input than the previous layer. A deep recursive neural network is constructed by stacking multiple layers of individual recursive nets:

h_η^(i) = f(W_L^(i) h_{l(η)}^(i) + W_R^(i) h_{r(η)}^(i) + V^(i) h_η^(i−1) + b^(i))

where i indexes the stacked layers and V^(i) is the weight matrix that connects the (i − 1)th hidden layer to the ith hidden layer. For prediction, the output layer is connected only to the final hidden layer. Learning in the deep recursive network is done by backpropagation through structure; a node receives error terms both from its parent (through structure) and from its counterpart in the higher layer (through space).
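A rough sketch of this stacking, reusing the Node class from the earlier sketch: each node keeps one hidden vector per layer; layer i combines the children's layer-i vectors and the node's own layer-(i−1) vector through V^(i). The leaf handling here (the word vector serves as layer 0 and is only passed upward through V) is a simplifying assumption, not the authors' exact formulation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def deep_compose(node, layers):
    """layers: list of dicts {'W_L','W_R','V','b'}, one per stacked layer."""
    if node.word_vec is not None:
        # Leaf: layer 0 is the word vector; higher layers look only downward.
        node.h = [node.word_vec]
        for p in layers[1:]:
            node.h.append(relu(p["V"] @ node.h[-1] + p["b"]))
    else:
        deep_compose(node.left, layers)
        deep_compose(node.right, layers)
        node.h = []
        for i, p in enumerate(layers):
            z = p["W_L"] @ node.left.h[i] + p["W_R"] @ node.right.h[i] + p["b"]
            if i > 0:
                z = z + p["V"] @ node.h[i - 1]   # connection from the layer below
            node.h.append(relu(z))
    return node.h[-1]                            # final-layer representation
```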

9 Deep Recursive Neural Network

10 Deep vs Shallow Recursive Neural Networks
In a shallow recursive neural network, a single layer is responsible for learning a representation of the composition that is both useful and sufficient for the final decision. In a deep recursive neural network, a layer can learn some parts of the composition and pass this intermediate representation to the next layer for further processing of the remaining parts of the overall composition. Irsoy and Cardie (2014) showed that deep recursive neural networks outperform shallow recursive nets of the same size on fine-grained sentiment prediction on the Stanford Sentiment Treebank and outperform multiplicative recursive neural network variants, achieving new state-of-the-art performance on the task.

11 Deep Recursive Neural Networks for natural language compositionality
The Stanford Sentiment Treebank (SST) includes labels for 215,154 phrases in the parse trees of 11,855 sentences, with an average sentence length of 19.1 tokens. Real-valued sentiment labels are converted to an integer ordinal label in {0, ..., 4} by simple thresholding, so the supervised task is posed as a 5-class classification problem. The output layer uses the standard softmax activation; the hidden layers use the rectified linear activation f(x) = max{0, x}. Experimentally, the rectifier activation gives better performance, faster convergence, and sparse representations.
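A small sketch of the label thresholding, assuming the conventional SST cutoffs at 0.2, 0.4, 0.6, and 0.8; the exact boundary handling is an assumption of the sketch.

```python
import numpy as np

def to_ordinal(score, cutoffs=(0.2, 0.4, 0.6, 0.8)):
    """Map a real-valued sentiment score in [0, 1] to a class in {0, ..., 4}."""
    return int(np.searchsorted(cutoffs, score, side="left"))

print([to_ordinal(s) for s in (0.05, 0.3, 0.5, 0.7, 0.95)])   # [0, 1, 2, 3, 4]
```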

12 Deep Recursive Neural Networks for natural language compositionality
Regularization: the dropout technique is used with a dropout rate (probability of dropped neurons) chosen from {0, 0.1, 0.5}. Dropout prevents learned features from co-adapting. Note that dropped units are shared: for a single sentence and a given layer, the same units of the hidden layer are dropped at each node. Training: stochastic gradient descent with a learning rate of 0.01 and AdaGrad. Recursive weights within a layer (W_hh) are initialized as 0.5I + ε, where ε is small uniformly random noise. This means that, initially, the representation of each node is approximately the mean of its two children. All other weights are initialized as small uniform random noise.
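A sketch of the shared dropout mask and the 0.5I + ε initialization; the inverted-dropout scaling and the noise magnitude are assumptions of the sketch, not details stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared dropout: for one sentence and one layer, sample a single mask and
# reuse it at every node of the tree.
def sample_shared_mask(d_hidden, rate, rng):
    return (rng.random(d_hidden) >= rate).astype(float)

def apply_dropout(h, mask, rate):
    return h * mask / (1.0 - rate)      # inverted-dropout scaling (assumption)

mask = sample_shared_mask(d_hidden=8, rate=0.1, rng=rng)
h_node1, h_node2 = rng.normal(size=8), rng.normal(size=8)
h_node1 = apply_dropout(h_node1, mask, 0.1)   # same mask at both nodes
h_node2 = apply_dropout(h_node2, mask, 0.1)

# Initialization of the recursive weights within a layer: 0.5*I + eps, so a
# parent starts out roughly as the mean of its two children.
d = 8
W_hh = 0.5 * np.eye(d) + rng.uniform(-0.01, 0.01, size=(d, d))
```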

13 Deep Recursive Neural Network

14 Deep Recursive Neural Network

15 Long Short-Term Memory Over Recursive Structures
Introduced by Hochreiter and Schmidhuber, the LSTM overcomes the problem of vanishing gradients. It has the same form as a standard recurrent neural network with a hidden layer, but each ordinary node in the hidden layer is replaced by a memory cell. Each memory cell contains a node with a self-connected recurrent edge of fixed weight one, ensuring that the gradient can pass across many time steps without vanishing or exploding. The LSTM model introduces an intermediate type of storage via the memory cell.
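A minimal sketch of one LSTM step with input, forget, and output gates; the weight layout and the sigmoid/tanh choices follow the common formulation rather than any specific detail on these slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b: dicts keyed by 'i', 'f', 'o', 'g' for the gates and candidate."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate update
    c = f * c_prev + i * g      # self-connected cell state carries the gradient
    h = o * np.tanh(c)
    return h, c
```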

16 Long Short-Term Memory Over Recursive Structures
On the left, the long short-term memory cell as introduced by Hochreiter and Schmidhuber. On the right, a long short-term memory cell with a forget gate, as introduced by Gers et al.; the forget gate is used to flush the contents of the internal state.

17 Long Short-Term Memory Over Recursive Structures
A recurrent neural network with a hidden layer consisting of two memory cells. The network is shown unfolded across two time steps.

18 Long Short-Term Memory Over Recursive Structures
Recursion is a fundamental process associated with many problems: a recursive process and the structure it forms are common in different modalities. The semantics of sentences in human languages is arguably not just a linear concatenation of words; instead, sentences often have structure. Image understanding, as another example, may benefit from recursive modeling over structures. Zhu et al. extended the LSTM to tree structures, learning memory blocks that can reflect the history memories of multiple child cells and multiple descendant cells. They call the model S-LSTM.

19 Long Short-Term Memory Over Recursive Structures
An example of S-LSTM, a long short-term memory network on tree structures. A tree node can consider information from multiple descendants. Information from the other nodes (shown in white) is blocked. The small circle ("◦") or short line ("−") at each arrowhead indicates a pass or a block of information, respectively; in the real model the gating is soft.

20 Long Short-Term Memory Over Recursive Structures
Each node in the network is composed of an S-LSTM memory block. Each memory block contains one input gate and one output gate. The number of forget gates depends on the structure, i.e. on the number of children of a node. The hidden vectors of the two children, denoted h^L_{t−1} for the left child and h^R_{t−1} for the right child, are inputs to the current block. The input gate i_t considers four sources of information: the hidden vectors and the cell vectors of its two children. The left and right forget gates can be controlled independently, allowing selective pass-through of information from the children's cell vectors.

21 Long Short-Term Memory Over Recursive Structures
The output gate o_t considers the hidden vectors from the children and the current cell vector. The hidden vector h_t and the cell vector c_t of the current block are passed to the parent and are used depending on whether the current block is the left or right child of its parent. By merging the gated cell vectors of the children, the memory cell can reflect multiple direct or indirect descendant cells, so long-distance interplay over the structure can be captured.
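A sketch of one S-LSTM block over a binary tree node, following this description; the parameter names and the exact set of cell-vector connections into the gates are assumptions of the sketch, not a definitive rendering of Zhu et al.'s equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def s_lstm_block(hL, cL, hR, cR, P):
    """P: dict of weight matrices/biases keyed by gate and input (assumed layout)."""
    i  = sigmoid(P["i_hL"] @ hL + P["i_hR"] @ hR +
                 P["i_cL"] @ cL + P["i_cR"] @ cR + P["i_b"])     # input gate
    fL = sigmoid(P["fL_hL"] @ hL + P["fL_hR"] @ hR +
                 P["fL_cL"] @ cL + P["fL_cR"] @ cR + P["fL_b"])  # left forget gate
    fR = sigmoid(P["fR_hL"] @ hL + P["fR_hR"] @ hR +
                 P["fR_cL"] @ cL + P["fR_cR"] @ cR + P["fR_b"])  # right forget gate
    g  = np.tanh(P["x_hL"] @ hL + P["x_hR"] @ hR + P["x_b"])     # candidate input
    c  = fL * cL + fR * cR + i * g      # merge the gated cell vectors of the children
    o  = sigmoid(P["o_hL"] @ hL + P["o_hR"] @ hR + P["o_c"] @ c + P["o_b"])
    h  = o * np.tanh(c)
    return h, c                         # passed up to the parent block
```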

22 Long Short-Term Memory Over Recursive Structures
An S-LSTM memory block, consisting of an input gate, two forget gates, and an output gate. Hidden vectors h*_{t−1} and cell vectors c*_{t−1} from the left (red arrows) and right (blue arrows) children are deployed to compute c_t and h_t. ⊗ denotes the Hadamard product.

23 Long Short-Term Memory Over Recursive Structures

24 Long Short-Term Memory Over Recursive Structures
During training, the gradient of the objective function with respect to each parameter can be calculated efficiently via backpropagation over structures. They use LSTM-like backpropagation, except that, unlike in a regular LSTM, the error passed down must discriminate between the left and right children.

