1 Learning linguistic structure with simple and more complex recurrent neural networks
Psychology 209, February 2, 2017

2 Elman’s Simple Recurrent Network (Elman, 1990)
What is the best way to represent time? Slots? Or time itself? What is the best way to represent language? Units and rules? Or connectionist learning? Is grammar learnable? If so, are there any necessary constraints?

3 The Simple Recurrent Network
The network is trained on a stream of elements with sequential structure. At step n, the target for the output is the next element. The pattern on the hidden units is copied back to the context units. After learning, the network comes to retain information about preceding elements of the string, allowing expectations to be conditioned by an indefinite window of prior context.
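Below is a minimal NumPy sketch of one such step, under assumptions of my own (layer sizes, weight names, and the sigmoid/softmax choices are illustrative, not taken from the slides): the hidden pattern is copied into the context for the next step, and the output is a prediction over the next element.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 26, 50, 26            # illustrative sizes

W_ih = rng.normal(0, 0.1, (n_hid, n_in))   # input  -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(0, 0.1, (n_out, n_hid))  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def srn_step(x, context):
    """One time step: hidden depends on the current input and the copied-back context."""
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    prediction = softmax(W_ho @ hidden)      # distribution over the next element
    return prediction, hidden                # hidden becomes the next context

# Run over a toy one-hot sequence.
context = np.zeros(n_hid)
for t in rng.integers(n_in, size=5):
    x = np.zeros(n_in); x[t] = 1.0
    prediction, context = srn_step(x, context)   # copy hidden -> context
```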

4 Learning about words from streams of letters (200 sentences of 4-9 words)
Similarly, SRNs have also been used to model learning to segment words in speech (e.g., Christiansen, Allen and Seidenberg, 1998)

5 Learning about sentence structure from streams of words

6 Learned and imputed hidden-layer representations (average vectors over all contexts)
The ‘Zog’ representation was derived by averaging the vectors obtained by inserting the novel item in place of each occurrence of ‘man’.

7 Within-item variation by context

8 Analysis of SRNs using Simpler Sequential Structures (Servan-Schreiber, Cleeremans, & McClelland)
The Grammar / The Network
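Servan-Schreiber et al. trained the SRN on strings generated by a small finite-state grammar. The sketch below generates such strings; the transition table is the standard Reber grammar, which is assumed here, and the slide's figure is the authority for the exact grammar used.

```python
import random

# Standard Reber grammar: state -> list of (symbol, next_state); None marks the exit.
TRANSITIONS = {
    0: [('B', 1)],
    1: [('T', 2), ('P', 3)],
    2: [('S', 2), ('X', 4)],
    3: [('T', 3), ('V', 5)],
    4: [('X', 3), ('S', 6)],
    5: [('P', 4), ('V', 6)],
    6: [('E', None)],
}

def generate_string():
    """Walk the finite-state machine from the start state to the exit."""
    state, symbols = 0, []
    while state is not None:
        symbol, state = random.choice(TRANSITIONS[state])
        symbols.append(symbol)
    return ''.join(symbols)

print(generate_string())   # e.g. 'BTSSXXTVVE'
```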

9 Hidden unit representations with 3 hidden units
True Finite State Machine / Graded State Machine

10 Training with Restricted Set of Strings
21 of the 43 valid strings of length 3-8

11 Progressive Deepening of the Network’s Sensitivity to Prior Context
Note: Prior Context is only maintained if it is prediction-relevant at intermediate points.

12 Elman (1991)

13 NV Agreement and Verb successor prediction
Histograms show summed activation for classes of words: W = 'who'; S = end-of-sentence period; N1/N2/PN = singular, plural, or proper nouns; V1/V2 = singular or plural verbs. For verbs: N = no direct object (DO), O = optional DO, R = required DO.

14 Prediction with an embedded clause

15 Rules or Connections? How is it that we can process sentences we've never seen before, such as 'Colorless green ideas sleep furiously'? Chomsky, Fodor, Pinker, ...: abstract, symbolic rules, e.g., S -> NP VP; NP -> (Adj)* N; VP -> V (Adv). The connectionist alternative: function approximation using distributed representations and knowledge in connection weights.
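To make the symbolic-rule side concrete, the sketch below expands the rewrite rules quoted above using a toy, invented lexicon; it illustrates rule-based generation only and is not part of the course materials.

```python
import random

# Toy lexicon, invented for illustration.
LEXICON = {'Adj': ['colorless', 'green'],
           'N':   ['ideas', 'dogs'],
           'V':   ['sleep', 'bark'],
           'Adv': ['furiously', 'quietly']}

def expand_NP():
    adjs = random.sample(LEXICON['Adj'], k=random.randint(0, 2))   # (Adj)*
    return adjs + [random.choice(LEXICON['N'])]                    # ... N

def expand_VP():
    vp = [random.choice(LEXICON['V'])]
    if random.random() < 0.5:                                      # optional (Adv)
        vp.append(random.choice(LEXICON['Adv']))
    return vp

def expand_S():
    return ' '.join(expand_NP() + expand_VP())                     # S -> NP VP

print(expand_S())   # e.g. 'colorless green ideas sleep furiously'
```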

16 Going Beyond the SRN
Back-propagation through time
The vanishing gradient problem
Solving the vanishing gradient problem with LSTMs
The problem of generalization and overfitting
Solutions to the overfitting problem
Applying LSTMs with dropout to a full-scale version of Elman's prediction task

17 An RNN for character prediction
We can see this as several copies of an Elman net placed next to each other. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do 'back-propagation through time': gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all these gradient paths together before we change the weights. What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don't backpropagate through it. You can think of this as just an Elman net unrolled for several time steps!
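A hedged PyTorch sketch of this unrolling (character-set size, window length, and learning rate are arbitrary choices of mine): gradients from every path through the unrolled window are summed into the same weight arrays, and when we move on to the next chunk we keep the hidden state but detach it so no gradient flows through it.

```python
import torch
import torch.nn as nn

vocab, hidden = 65, 128                          # illustrative sizes
rnn = nn.RNN(vocab, hidden, batch_first=True)    # input->hidden and hidden->hidden weights
readout = nn.Linear(hidden, vocab)               # the third array: hidden -> output
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

h = torch.zeros(1, 1, hidden)                    # (layers, batch, hidden)
data = torch.randint(vocab, (1, 101))            # a toy character stream

for start in range(0, 100, 20):                  # unroll 20 steps at a time
    chunk  = data[:, start:start + 20]
    target = data[:, start + 1:start + 21]       # next-character targets
    x = nn.functional.one_hot(chunk, vocab).float()

    out, h = rnn(x, h)                           # forward through the unrolled window
    loss = loss_fn(readout(out).reshape(-1, vocab), target.reshape(-1))

    opt.zero_grad()
    loss.backward()                              # gradient paths are summed per weight array
    opt.step()

    h = h.detach()                               # keep the state, but don't backprop through it
```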

18 Parallelizing the computation
Create 20 copies of the whole thing and call each one a stream. Process your text starting at 20 different points in the data. Add up the gradient across all the streams at the end of processing a batch, then take one gradient step! The forward and backward computations are farmed out to a GPU, so they actually occur in parallel, using the weights from the last update.
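A sketch of the streams idea in the same toy PyTorch setup (all sizes illustrative): the streams are just a batch dimension, so a single forward/backward pass accumulates the gradient across every stream before the one weight update.

```python
import torch
import torch.nn as nn

vocab, hidden, streams, steps = 65, 128, 20, 35
rnn = nn.RNN(vocab, hidden, batch_first=True)
readout = nn.Linear(hidden, vocab)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)

# Pretend 'text' is one long character stream; cut it at 20 evenly spaced points
# so each stream is one row of a (streams, steps + 1) index array.
text = torch.randint(vocab, (streams * (steps + 1),))
batch = text.reshape(streams, steps + 1)

x = nn.functional.one_hot(batch[:, :-1], vocab).float()   # inputs for every stream
targets = batch[:, 1:]                                     # next-character targets

out, _ = rnn(x)                                            # all streams processed in parallel
loss = nn.CrossEntropyLoss()(readout(out).reshape(-1, vocab), targets.reshape(-1))

opt.zero_grad()
loss.backward()      # gradient is accumulated over streams and time steps
opt.step()           # then take one gradient step
```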

19 Some problems and solutions
Classic RNN at right; note the superscript l for the layer. Problem: the vanishing gradient. Solution: the LSTM, which has its own internal state 'c' and weights to gate input and output and to allow it to forget (note the dot-product notation).
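A minimal NumPy sketch of the LSTM step just described, with biases omitted and all names and sizes illustrative: learned gates decide what enters the internal state c, what is forgotten, and what is exposed as the new hidden state; the starred products are elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 20                      # illustrative sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on the concatenated [input, previous hidden].
W_i, W_f, W_o, W_g = (rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(W_i @ z)          # input gate
    f = sigmoid(W_f @ z)          # forget gate
    o = sigmoid(W_o @ z)          # output gate
    g = np.tanh(W_g @ z)          # candidate values
    c = f * c_prev + i * g        # internal cell state 'c'; * is elementwise
    h = o * np.tanh(c)            # gated output becomes the new hidden state
    return h, c

h = c = np.zeros(n_hid)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=n_in), h, c)
```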

20 Some problems and solutions
Classic RNN at right; note the superscript l for the layer. Problem: the vanishing gradient. Solution: the LSTM, which has its own internal state 'c' and weights to gate input and output and to allow it to forget (note the dot-product notation). Problem: overfitting. Solution: dropout. The TensorFlow RNN tutorial does the prediction task using stacked LSTMs with dropout (dotted lines).
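For the dropout part, a hedged PyTorch sketch of the stacked-LSTM-with-dropout arrangement (sizes are illustrative): dropout is applied on the non-recurrent connections between layers, the dotted lines on the slide, while the recurrent state passes through undropped.

```python
import torch
import torch.nn as nn

vocab, hidden, layers = 10000, 200, 2

# Two stacked LSTM layers; dropout=0.5 is applied between the layers
# (and we also drop out the inputs and the top-layer outputs below).
lstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
               num_layers=layers, dropout=0.5, batch_first=True)
drop = nn.Dropout(0.5)
readout = nn.Linear(hidden, vocab)

x = torch.randn(20, 35, hidden)        # (batch, steps, features), e.g. embedded words
out, (h, c) = lstm(drop(x))            # recurrent state flows through undropped
logits = readout(drop(out))            # next-word predictions at every step
```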

21 Word Embeddings
Use a learned word vector instead of one unit per word.
Similar to the Rumelhart model from Tuesday and the first hidden layer of 10 units from Elman (1991). We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation.
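A small sketch of the equivalence being claimed: multiplying a one-hot input by an input-to-hidden weight matrix just selects one row of that matrix, so a learned embedding table and a one-hot-times-weights layer are the same computation. Names and sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab, dim = 10000, 10                 # 10 mirrors Elman's small first layer
embedding = nn.Embedding(vocab, dim)   # a (vocab x dim) table of learned word vectors

word_ids = torch.tensor([42, 7, 42])   # arbitrary word indices
vecs = embedding(word_ids)             # look up one row per word

# The same thing written as a one-hot input times a weight matrix:
one_hot = nn.functional.one_hot(word_ids, vocab).float()
vecs_again = one_hot @ embedding.weight

assert torch.allclose(vecs, vecs_again)   # identical computation; backprop updates the rows
```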

22 Zaremba et al. (2016)
Uses stacked LSTMs with dropout. Learns its own word vectors as it goes. Shows performance gains compared with other network variants. Was used in the TensorFlow RNN tutorial. Could be used to study lots of interesting cognitive science questions. Will be available on the lab server soon!
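Putting the pieces together, a hedged sketch of a Zaremba-style word-level predictor, learned embeddings feeding stacked LSTMs with dropout and then a softmax readout over the vocabulary; this is an illustrative reconstruction, not the code that will be on the lab server.

```python
import torch
import torch.nn as nn

class WordLM(nn.Module):
    """Embeddings -> stacked LSTMs with dropout -> next-word logits."""
    def __init__(self, vocab=10000, dim=200, layers=2, p=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)          # learns its own word vectors
        self.drop = nn.Dropout(p)
        self.lstm = nn.LSTM(dim, dim, num_layers=layers,
                            dropout=p, batch_first=True)
        self.readout = nn.Linear(dim, vocab)

    def forward(self, word_ids, state=None):
        x = self.drop(self.embed(word_ids))
        out, state = self.lstm(x, state)
        return self.readout(self.drop(out)), state

model = WordLM()
words = torch.randint(10000, (20, 35))       # (streams, steps) of word indices
logits, state = model(words)                 # logits: (20, 35, 10000) next-word scores
```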

