Learning linguistic structure with simple and more complex recurrent neural networks (Psychology 209, February 2, 2017)

Elman's Simple Recurrent Network (Elman, 1990)
What is the best way to represent time? Slots? Or time itself?
What is the best way to represent language? Units and rules? Or connectionist learning?
Is grammar learnable? If so, are there any necessary constraints?

The Simple Recurrent Network
The network is trained on a stream of elements with sequential structure. At step n, the target for the output is the next element. The pattern on the hidden units is copied back to the context units. After learning, the network comes to retain information about preceding elements of the string, allowing expectations to be conditioned by an indefinite window of prior context.
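As a minimal sketch of the idea (not Elman's exact implementation), the NumPy code below runs an SRN over a toy sequence; the vocabulary size, hidden size, and random data are made up for illustration. The copy-back of the hidden pattern is simply the returned `h` being fed in as `context` on the next step.

```python
import numpy as np

# Toy sizes and data, purely for illustration.
n_items, n_hidden = 10, 20
rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.1, (n_hidden, n_items))    # input   -> hidden
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))   # context -> hidden
W_ho = rng.normal(0, 0.1, (n_items, n_hidden))    # hidden  -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x, context):
    """One SRN step: combine the current input with the context units,
    predict the next element, and return the hidden pattern, which is
    copied back to serve as the context for the next step."""
    h = np.tanh(W_ih @ x + W_ch @ context)
    y = softmax(W_ho @ h)          # predicted distribution over the next element
    return y, h

# Run over a toy sequence; at step n the target is element n + 1.
sequence = rng.integers(0, n_items, size=6)
context, total_loss = np.zeros(n_hidden), 0.0
for n in range(len(sequence) - 1):
    y, context = srn_step(np.eye(n_items)[sequence[n]], context)
    total_loss += -np.log(y[sequence[n + 1]])   # cross-entropy with next element
```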

Learning about words from streams of letters (200 sentences of 4-9 words)
Similarly, SRNs have also been used to model learning to segment words in speech (e.g., Christiansen, Allen, & Seidenberg, 1998).

Learning about sentence structure from streams of words

Learned and imputed hidden-layer representations (average vectors over all contexts)
The 'Zog' representation is derived by averaging the vectors obtained by inserting the novel item in place of each occurrence of 'man'.
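A sketch of that averaging procedure, with made-up weights and contexts standing in for a trained SRN:

```python
import numpy as np

def item_representation(item_vec, contexts, hidden_for):
    """Average the hidden-layer vectors obtained by presenting an item in
    each of a set of contexts."""
    return np.mean([hidden_for(item_vec, c) for c in contexts], axis=0)

# Stand-in hidden function and contexts (a trained SRN would supply these).
rng = np.random.default_rng(1)
W_in, W_ctx = rng.normal(size=(8, 5)), rng.normal(size=(8, 8))
hidden_for = lambda x, c: np.tanh(W_in @ x + W_ctx @ c)
man_vec, zog_vec = np.eye(5)[0], np.eye(5)[1]
man_contexts = [rng.normal(size=8) for _ in range(4)]

man_rep = item_representation(man_vec, man_contexts, hidden_for)
# 'Zog' is novel: reuse every context in which 'man' occurred, insert the
# novel item's input vector in its place, and average.
zog_rep = item_representation(zog_vec, man_contexts, hidden_for)
```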

Within-item variation by context

Analysis of SRNs using Simpler Sequential Structures (Servan-Schreiber, Cleeremans, & McClelland): The Grammar / The Network
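To make the prediction task concrete, here is a generator for strings from a small finite-state grammar of this kind. The transition table is a hypothetical reconstruction, so the arcs and labels may not match the paper exactly.

```python
import random

# Hypothetical transition table for a small finite-state grammar. Each state
# lists its legal (symbol, next_state) choices; reaching 'END' terminates.
GRAMMAR = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
    5: [('E', 'END')],
}

def generate_string():
    """Walk the grammar from the start state, picking legal arcs at random.
    The network's task is to predict, at each step, which symbols may follow."""
    state, symbols = 0, ['B']                 # strings begin with 'B'
    while state != 'END':
        symbol, state = random.choice(GRAMMAR[state])
        symbols.append(symbol)
    return ''.join(symbols)

print([generate_string() for _ in range(5)])
```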

Hidden unit representations with 3 hidden units: True Finite State Machine vs. Graded State Machine

Training with a Restricted Set of Strings: 21 of the 43 valid strings of length 3-8

Progressive Deepening of the Network's Sensitivity to Prior Context
Note: prior context is only maintained if it is prediction-relevant at intermediate points.

Elman (1991)

Noun-Verb Agreement and Verb Successor Prediction
Histograms show summed activation for classes of words: W = 'who'; S = the end-of-sentence period; N1/N2/PN = singular, plural, or proper nouns; V1/V2 = singular or plural verbs. For verbs: N = no direct object, O = optional direct object, R = required direct object.

Prediction with an embedded clause

Rules or Connections?
How is it that we can process sentences we've never seen before? ('Colorless green ideas sleep furiously.')
Chomsky, Fodor, Pinker, …: abstract, symbolic rules such as S -> NP VP; NP -> (Adj)* N; VP -> V (Adv) (a toy expansion of these rules is sketched below).
The connectionist alternative: function approximation using distributed representations and knowledge in connection weights.
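Purely for illustration, a few lines of Python can expand the rewrite rules on the slide into sentences; the tiny lexicon is made up.

```python
import random

# Toy expansion of the rules on the slide: S -> NP VP ; NP -> (Adj)* N ;
# VP -> V (Adv). The lexicon is made up purely for illustration.
LEXICON = {'Adj': ['colorless', 'green'], 'N': ['ideas'],
           'V': ['sleep'], 'Adv': ['furiously']}

def expand_NP():
    adjs = random.sample(LEXICON['Adj'], k=random.randint(0, 2))   # (Adj)*
    return adjs + [random.choice(LEXICON['N'])]

def expand_VP():
    vp = [random.choice(LEXICON['V'])]
    if random.random() < 0.5:                                      # optional Adv
        vp.append(random.choice(LEXICON['Adv']))
    return vp

def expand_S():                                                    # S -> NP VP
    return ' '.join(expand_NP() + expand_VP())

print(expand_S())   # e.g. "colorless green ideas sleep furiously"
```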

Going Beyond the SRN
Back-propagation through time
The vanishing gradient problem
Solving the vanishing gradient problem with LSTMs
The problem of generalization and overfitting
Solutions to the overfitting problem
Applying LSTMs with dropout to a full-scale version of Elman's prediction task

An RNN for character prediction
We can see this as several copies of an Elman net placed next to each other; you can think of it as just an Elman net unrolled for several time steps. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do 'back-propagation through time': gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all these gradient paths together before we change the weights. What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don't backpropagate through it.
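A compact NumPy sketch of truncated back-propagation through time for a character-prediction net; the vocabulary size, hidden size, unroll length, and character data are hypothetical.

```python
import numpy as np

# Truncated BPTT for a character-prediction Elman net (hypothetical sizes).
# The net is unrolled for T steps; gradients from every step and every path
# are added into the same three weight arrays before a single update.
V, H, T, lr = 27, 50, 8, 0.1
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, V))    # input  -> hidden
Whh = rng.normal(0, 0.01, (H, H))    # hidden -> hidden (recurrent)
Why = rng.normal(0, 0.01, (V, H))    # hidden -> output

def bptt_segment(inputs, targets, h_prev):
    """Forward over one segment, then backprop through the unrolled copies,
    summing the gradient contributions into shared dWxh, dWhh, dWhy."""
    xs, hs, ps, loss = {}, {-1: h_prev}, {}, 0.0
    for t, (i, j) in enumerate(zip(inputs, targets)):
        xs[t] = np.eye(V)[i]
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
        e = np.exp(Why @ hs[t]); ps[t] = e / e.sum()
        loss -= np.log(ps[t][j])
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1        # softmax / cross-entropy
        dWhy += np.outer(dy, hs[t])
        dh = Why.T @ dy + dh_next                     # output path + later steps
        dh_raw = (1 - hs[t] ** 2) * dh                # back through tanh
        dWxh += np.outer(dh_raw, xs[t])
        dWhh += np.outer(dh_raw, hs[t - 1])
        dh_next = Whh.T @ dh_raw
    return loss, (dWxh, dWhh, dWhy), hs[len(inputs) - 1]

# One update on a made-up character sequence; the returned hidden state seeds
# the next segment but is not backpropagated through.
seq = rng.integers(0, V, size=T + 1)
loss, grads, h_carry = bptt_segment(seq[:-1], seq[1:], np.zeros(H))
for W, dW in zip((Wxh, Whh, Why), grads):
    W -= lr * dW
```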

Parallelizing the computation
Create 20 copies of the whole thing and call each one a stream. Process your text starting at 20 different points in the data. Add up the gradients across all the streams at the end of processing a batch, then take one gradient step! The forward and backward computations are farmed out to a GPU, so they actually occur in parallel, using the weights from the last update.
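The sketch below shows only the bookkeeping: the per-stream gradient computation is a stand-in (in practice it would be the truncated-BPTT pass above, run as one batched GPU computation), but it shows how each stream keeps its own carried hidden state while gradients are summed into a single update of the shared weights.

```python
import numpy as np

# Bookkeeping for 20 data-parallel streams (sizes hypothetical).
n_streams, H, lr = 20, 50, 0.1
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (27, H))        # stand-in for the shared weight arrays

def stream_grad(h_carry):
    """Stand-in for one stream's truncated-BPTT pass over its next segment:
    returns a gradient for the shared weights and the stream's final state."""
    dW = rng.normal(0, 0.01, W.shape)   # placeholder gradient
    return dW, np.tanh(h_carry + rng.normal(0, 0.1, H))

h_carries = [np.zeros(H) for _ in range(n_streams)]
dW_total = np.zeros_like(W)
for s in range(n_streams):              # conceptually parallel on the GPU
    dW, h_carries[s] = stream_grad(h_carries[s])
    dW_total += dW
W -= lr * dW_total                      # one gradient step per batch
```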

Some problems and solutions
Classic RNN at right (note the superscript l for layer).
The vanishing gradient problem. Solution: the LSTM, which has its own internal state 'c' and has weights to gate input and output and to allow it to forget. Note the dot-product notation.
Overfitting. Solution: dropout. The TensorFlow RNN tutorial does the prediction task using stacked LSTMs with dropout (dotted lines).
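Below is a sketch of a standard LSTM cell and of dropout applied on the non-recurrent connection between two stacked layers (the dotted lines). The gate layout and sizes are the common textbook formulation, not necessarily the exact variant used in the tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM step: gates control what enters the internal state c, what is
    forgotten, and what is exposed as the hidden state h. W has shape
    (4H, X + H); '*' is the elementwise product."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])                    # candidate input
    c = f * c_prev + i * g                  # forget some old state, gate in new
    h = o * np.tanh(c)                      # gate what the cell outputs
    return h, c

def dropout(h, rate, rng):
    """Inverted dropout, applied on the non-recurrent connections between
    stacked layers (the dotted lines)."""
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

# Two stacked layers with dropout between them (sizes hypothetical).
X, Hs = 10, 20
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (4 * Hs, X + Hs)), np.zeros(4 * Hs)
W2, b2 = rng.normal(0, 0.1, (4 * Hs, Hs + Hs)), np.zeros(4 * Hs)
h1 = c1 = h2 = c2 = np.zeros(Hs)
h1, c1 = lstm_cell(rng.normal(size=X), h1, c1, W1, b1)
h2, c2 = lstm_cell(dropout(h1, 0.5, rng), h2, c2, W2, b2)
```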

Word Embeddings
Use a learned word vector instead of one unit per word. This is similar to the Rumelhart model from Tuesday and to the first hidden layer of 10 units from Elman (1991). We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation.
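A small NumPy illustration of why the two views are the same computation: looking up a word's row in an embedding matrix equals multiplying a one-hot input by the input-to-word-vector weights, and the gradient touches only that row. Sizes and the gradient vector are made up.

```python
import numpy as np

# Hypothetical vocabulary and embedding sizes.
vocab_size, embed_dim = 1000, 50
rng = np.random.default_rng(0)
E = rng.normal(0, 0.1, (vocab_size, embed_dim))   # one learned vector per word

word_id = 42
one_hot = np.eye(vocab_size)[word_id]

# Looking up a row and multiplying a one-hot input by the weights give the same
# vector, so "changing the word vector" and "changing the input-to-word-vector
# weights" are the same update.
assert np.allclose(E[word_id], one_hot @ E)

# During training the gradient touches only that word's row.
grad_wrt_embedding = rng.normal(size=embed_dim)   # made-up gradient from backprop
E[word_id] -= 0.1 * grad_wrt_embedding
```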

Zaremba et al. (2016)
Uses stacked LSTMs with dropout
Learns its own word vectors as it goes
Shows performance gains compared with other network variants
Was used in the TensorFlow RNN tutorial
Could be used to study lots of interesting cognitive science questions
Will be available on the lab server soon!