CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 8: Modeling text using a recurrent neural network trained with a really fancy optimizer

Presentation transcript:

CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 8: Modeling text using a recurrent neural network trained with a really fancy optimizer Geoffrey Hinton

The error surface for a linear neuron
The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error.
–It is a quadratic bowl, i.e. the height can be expressed as a function of the weights without using powers higher than 2. Quadratics have constant curvature (because the second derivative must be a constant).
–Vertical cross-sections are parabolas.
–Horizontal cross-sections are ellipses.
[Figure: a quadratic bowl with the error E on the vertical axis and weights w1, w2 on the horizontal axes.]
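As a sketch of why the surface is exactly quadratic (this standard formula for a linear neuron is not reproduced on the slide): for training cases with inputs x^n and targets t^n, the squared error is

E(\mathbf{w}) = \tfrac{1}{2}\sum_n \bigl(t^n - \mathbf{w}^{\top}\mathbf{x}^n\bigr)^2
= \text{const} \;-\; \Bigl(\sum_n t^n \mathbf{x}^n\Bigr)^{\top}\mathbf{w} \;+\; \tfrac{1}{2}\,\mathbf{w}^{\top}\Bigl(\sum_n \mathbf{x}^n \mathbf{x}^{n\top}\Bigr)\mathbf{w}

so the curvature matrix \sum_n \mathbf{x}^n \mathbf{x}^{n\top} is constant and no power of the weights higher than 2 ever appears.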

Convergence speed
The direction of steepest descent does not point at the minimum unless the ellipse is a circle.
–The gradient is big in the direction in which we only want to travel a small distance.
–The gradient is small in the direction in which we want to travel a large distance.
This equation is sick: the RHS needs to be multiplied by a term of dimension w^2 to make the dimensions balance.
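The equation the slide calls "sick" is not reproduced in this transcript; it is presumably the plain gradient descent update

\Delta w_i = -\,\varepsilon\, \frac{\partial E}{\partial w_i}

With a dimensionless learning rate, the right-hand side has dimensions of error per weight while the left-hand side has dimensions of weight, so (treating the error as dimensionless, as the slide implicitly does) the right-hand side is missing a factor with the dimensions of w^2. That missing factor is exactly what the inverse curvature supplies in Newton's method below.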

How the learning goes wrong
If the learning rate is big, it sloshes to and fro across the ravine. If the rate is too big, this oscillation diverges.
How can we move quickly in directions with small gradients without getting divergent oscillations in directions with big gradients?
[Figure: an oscillating descent path across a ravine-shaped cross-section of the error E against a weight w.]

Five ways to speed up learning
Use an adaptive global learning rate.
–Increase the rate slowly if it's not diverging.
–Decrease the rate quickly if it starts diverging.
Use a separate adaptive learning rate on each connection.
–Adjust it using the consistency of the gradient on that weight axis.
Use momentum.
–Instead of using the gradient to change the position of the weight “particle”, use it to change the velocity.
Use a stochastic estimate of the gradient from a few cases.
–This works very well on large, redundant datasets.
Don’t go in the direction of steepest descent.
–The gradient does not point at the minimum. Can we preprocess the data or do something to the gradient so that we move directly towards the minimum?
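A minimal numpy sketch of two of these tricks, momentum combined with per-connection adaptive gains; the particular update rule, function name, and hyperparameter values are illustrative assumptions rather than the exact scheme from the lecture.

import numpy as np

def speedy_step(w, grad, state, lr=0.01, momentum=0.9,
                gain_up=0.05, gain_down=0.95):
    """One update combining momentum with a per-weight adaptive gain.

    w and grad are float arrays of the same shape; state is a dict that
    persists across calls and holds the velocity and per-connection gains.
    """
    v = state.setdefault('velocity', np.zeros_like(w))
    g = state.setdefault('gains', np.ones_like(w))

    # Grow a weight's gain slowly while its gradient keeps agreeing with
    # its velocity; shrink the gain quickly when the sign flips.
    agree = np.sign(grad) == np.sign(v)
    g[agree] += gain_up
    g[~agree] *= gain_down

    # The gradient changes the velocity; the velocity changes the weight.
    v[:] = momentum * v - lr * g * grad
    return w + v

Calling this repeatedly (with state = {} created once before training) moves quickly along directions where the gradient is consistent and damps the to-and-fro sloshing across the ravine.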

Newton’s method
The basic problem is that the gradient is not the direction we want to go in.
–If the error surface had circular cross-sections, the gradient would be fine.
–So maybe we should apply a linear transformation that turns ellipses into circles.
Instead of turning the error surface into a circular bowl, we could achieve the same thing by transforming the gradient vector.
–Transform it in exactly the way it would be transformed by turning the error surface into a circular bowl.

How to transform the gradient vector
We would like to multiply the vector of gradients by the inverse of the curvature matrix.
–This produces a vector that takes us straight to the minimum in one step for a quadratic surface.
Unfortunately, the inverse curvature matrix has too many terms to be of use in a big neural network (the number of weights squared!).
This equation is dimensionally correct.
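Written out, the dimensionally correct update the slide refers to is the Newton step, with H the curvature (Hessian) matrix of the error:

\Delta \mathbf{w} = -\,\varepsilon\, H^{-1}\,\nabla E(\mathbf{w}),
\qquad H_{ij} = \frac{\partial^2 E}{\partial w_i\,\partial w_j}

For a quadratic surface with \varepsilon = 1 this lands on the minimum in a single step, but H has (number of weights)^2 entries, which is why storing and inverting it is hopeless for a big network.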

Conjugate gradient
There is an alternative to going to the minimum in one step by multiplying by the inverse of the curvature matrix.
Use a sequence of steps, each of which finds the minimum along one direction.
Make sure that each new direction is “conjugate” to the previous directions.
–This means that as you go in the new direction, you do not change the gradients in the previous directions.

A picture of conjugate gradient
The gradient in the direction of the first step is zero at all points on the green line. So if we move along the green line we don’t mess up the minimization we already did in the first direction.
[Figure: elliptical error contours, with the set of points where the first direction’s gradient is zero shown as a green line.]

What does conjugate gradient achieve?
After N steps, conjugate gradient is guaranteed to find the minimum of an N-dimensional quadratic surface.
–After many fewer than N steps it has typically got the error close to the minimum value.
Conjugate gradient can be applied to a non-quadratic error surface and it usually works quite well, but the HF (Hessian-free) optimizer is much better.
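A minimal numpy sketch of conjugate gradient on a quadratic surface E(w) = 0.5*w'Aw - b'w; this is the generic textbook method, not the HF optimizer itself.

import numpy as np

def conjugate_gradient(A, b, w=None, tol=1e-10, max_steps=None):
    """Minimize 0.5*w'Aw - b'w for a symmetric positive-definite matrix A.

    Each step does an exact line search along the current direction, and
    each new direction is conjugate to the previous ones, so at most N
    steps are needed for an N-dimensional quadratic surface.
    """
    n = len(b)
    w = np.zeros(n) if w is None else w.astype(float)
    r = b - A @ w                      # negative gradient at w
    d = r.copy()                       # first direction: steepest descent
    for _ in range(max_steps or n):
        rr = r @ r
        if rr < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)          # exact minimizer along d
        w = w + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr            # Fletcher-Reeves coefficient
        d = r + beta * d               # conjugate to the earlier directions
    return w

On a quadratic bowl this reaches the minimum in at most N steps and, as the slide says, typically gets the error close to the minimum in far fewer.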

Training recurrent neural nets with a powerful optimizer
Here is a good problem to demonstrate the power of RNNs: understand all the text on the web.
The only way to predict the next word really well is to learn a model of what the source believes.
–Stephen Harper suspended parliament because he wanted to avoid the danger of ….
We are still a long way from learning this level of understanding!
–But machine learning will get there eventually and programming rules by hand never will.

Advantages of working with characters
The web is composed of character strings. Any learning method powerful enough to understand the world by reading the web ought to find it trivial to learn which strings make words.
Pre-processing text to get words is a big hassle.
–What about prefixes, suffixes, etc.?
–What about New York?
–What about subtle effects like “sn” words?

The obvious recurrent neural net
[Diagram: the current character, coded as a 1-of-86 vector, feeds 1500 hidden units; the hidden units feed back to themselves and produce the predicted distribution for the next character.]
It is a lot easier to predict 86 characters than 100,000 words.
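A hedged numpy sketch of the forward step of such a net, using the layer sizes from the slide; the tanh and softmax choices, the initialization scale, and the weight names are my assumptions rather than details given in the lecture.

import numpy as np

N_CHARS, N_HIDDEN = 86, 1500

rng = np.random.default_rng(0)
W_xh = 0.01 * rng.standard_normal((N_HIDDEN, N_CHARS))   # character -> hidden
W_hh = 0.01 * rng.standard_normal((N_HIDDEN, N_HIDDEN))  # hidden -> hidden
W_hy = 0.01 * rng.standard_normal((N_CHARS, N_HIDDEN))   # hidden -> next-character logits

def step(char_index, h_prev):
    """Consume one character; return the new hidden state and the
    predicted distribution over the 86 possible next characters."""
    x = np.zeros(N_CHARS)
    x[char_index] = 1.0                     # 1-of-86 coding of the character
    h = np.tanh(W_xh @ x + W_hh @ h_prev)   # additive input to the hidden units
    logits = W_hy @ h
    p = np.exp(logits - logits.max())       # softmax over the 86 characters
    return h, p / p.sum()

The next slides replace the fixed hidden-to-hidden matrix W_hh with one chosen by the current character.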

Another view of the recurrent net
There are exponentially many nodes! Different nodes can share structure because they use distributed representations.
The next hidden representation needs to depend on the conjunction of the current character and the current hidden representation.
Each node is a hidden state vector; the next character must transform this to a new node.
[Diagram: a tree of hidden states such as “fix”, “fixi”, “fixin”, each reached by consuming one more character like ‘i’, ‘e’ or ‘n’.]

Multiplicative connections
Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix.
–But this requires 86x1500x1500 parameters.
–This would make the net overfit.
Can we achieve the same kind of multiplicative interaction using fewer parameters?
–Is there some way to allow the 86 character-specific weight matrices to share parameters?

Using a few thousand 3-way factors to allow a character to create a whole transition matrix
[Diagram: the current character (1-of-86) and the 1500 hidden units both connect to a layer of factors; the factors drive the 1500 hidden units that produce the predicted distribution for the next character.]
Each factor, f, defines a rank-one matrix. Each character, c, determines a gain for each of these rank-one matrices.
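One way to write the factorization the slide describes (the symbol names are mine): each factor f contributes a rank-one matrix built from two vectors u_f and v_f, and each character c supplies a gain w_{cf}, so the effective hidden-to-hidden transition matrix for character c is

W_{hh}(c) \;=\; \sum_{f} w_{cf}\, \mathbf{u}_f \mathbf{v}_f^{\top}

With a few thousand factors this gives every one of the 86 characters its own transition matrix while sharing almost all of the parameters, instead of the 86x1500x1500 of the unfactored version.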

Training the character model
Ilya Sutskever used about a million strings of 250 characters taken from Wikipedia. For each string he starts predicting at the 51st character.
It takes 5 days on 8 GPU boards, each with 240 processors, to get a really good model. It needs very big mini-batches (it's lucky he didn't start in 1980).
Ilya's best model beats the state of the art for character prediction, but works in a very different way from the best other models.
–It can balance quotes and brackets over long distances. The rival models cannot do this.

Some text generated by the model:
In 1974 Northern Denver had been overshadowed by CNL, and several Irish intelligence agencies in the Mediterranean region. However, on the Victoria, Kings Hebrew stated that Charles decided to escape during an alliance. The mansion house was completed in 1882, the second in its bridge are omitted, while closing is the proton reticulum composed below it aims, such that it is the blurring of appearing on any well-paid type of box printer.