CS249: Neural Language Model

CS249: Neural Language Model Professor Junghoo “John” Cho

Today's Topics
Yoshua Bengio, et al.: A Neural Probabilistic Language Model
High-level overview of machine learning

Language Model
Key question: When we hear English, how likely are we to hear a particular "sentence"?
"John could not sleep yesterday"
"Poop grew went therefore"
Q: Where is this useful? Why do we care?
A: Many different applications!
Spell/grammar correction: "John went there" vs. "John went their"
Speech recognition: again, "John went there"
Sentence generation
And many others…
Q: How can we make computers answer the key question? How can we formalize our goal?

Language Model: Formalization
Assume a "language machine": when asked, it randomly generates a syntactically correct and semantically meaningful sentence.
Given a sequence of words $w_1 w_2 \dots w_n$, what is the probability that the next sentence generated by the language machine is $w_1 w_2 \dots w_n$?
Example:
P("UCLA is best") ≈ 0.001
P("Poop grew would") ≈ 0.000000001
Q: How can we estimate this probability? Any ideas?

Estimating the Language Model
Q: How do we compute $P(w_1 w_2 \dots w_n)$?
A: In principle, look at an infinitely large language corpus and count how many times $w_1 w_2 \dots w_n$ appears.
Example: a corpus of 1,000,000,000 words
Q: "UCLA" appears 100,000 times. What should P("UCLA") be?
Q: "UCLA is the best" appears 10,000 times. What should P("UCLA is the best") be?
Q: Is the problem solved?
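To make the counting idea concrete, here is a minimal Python sketch of the relative-frequency estimate; the toy corpus and the `ngram_prob` helper are illustrative assumptions, not part of the slides:

```python
from collections import Counter

def ngram_prob(corpus_tokens, ngram):
    """Maximum-likelihood estimate: relative frequency of the n-gram among
    all windows of the same length in the corpus."""
    n = len(ngram)
    windows = [tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)]
    counts = Counter(windows)
    return counts[tuple(ngram)] / len(windows) if windows else 0.0

# Toy corpus standing in for the (ideally infinite) corpus on the slide.
corpus = "ucla is the best ucla is great".split()
print(ngram_prob(corpus, ["ucla"]))                       # P("UCLA")
print(ngram_prob(corpus, ["ucla", "is", "the", "best"]))  # P("UCLA is the best")
```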

Curse of Dimensionality
Q: Assume 10,000 words in English. How many possible 4-word sequences (= 4-grams) are there?
Q: If we have a corpus of 1,000,000,000 words, are we likely to see most 4-grams?
Even for a small n, we are unlikely to see most n-grams.
Q: How do we estimate P(sentence) when the sentence was never seen?
"UCLA is located in a very expensive and safe neighborhood that everyone loves to visit"
Assign P(sentence) = 0?
We need ways to "estimate" P(sentence) for unseen sentences. Q: How?
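A quick back-of-the-envelope check with the slide's numbers: there are $10{,}000^4 = 10^{16}$ possible 4-grams, while a $10^9$-word corpus contains fewer than $10^9$ 4-gram occurrences, so at most about one in ten million possible 4-grams can appear even once.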

Estimating the Language Model
Simple: 1-gram. How do we measure $P(w)$?
Next: 2-gram. How do we measure $P(w_1 w_2)$?
Difficult: n-gram. How do we measure $P(w_1 w_2 \dots w_n)$?
$P(w_1 w_2 \dots w_n) = P(w_1) P(w_2) \dots P(w_n)$: the "independence assumption"
The simplest language model and the easiest to analyze
Less likely to be accurate, but better than no language model
Q: Any better way?
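A minimal sketch of this independence-assumption model, assuming maximum-likelihood unigram estimates; the helper names and the toy corpus are my own additions:

```python
from collections import Counter

def unigram_probs(corpus_tokens):
    """Estimate P(w) for every word by its relative frequency in the corpus."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: c / total for w, c in counts.items()}

def sentence_prob_independent(sentence, p_w):
    """Independence assumption: P(w_1 ... w_n) = P(w_1) * P(w_2) * ... * P(w_n)."""
    prob = 1.0
    for w in sentence:
        prob *= p_w.get(w, 0.0)   # an unseen word still forces the product to 0 here
    return prob

p_w = unigram_probs("ucla is the best ucla is great".split())
print(sentence_prob_independent("ucla is the best".split(), p_w))
```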

Chain Rule
$P(w_1 w_2 \dots w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) \dots P(w_n | w_1 \dots w_{n-1})$
If we can estimate $P(w_i | w_1 \dots w_{i-1})$ correctly for every $i$, we can compute $P(w_1 \dots w_n)$ exactly!
Q: How do we estimate $P(w_i | w_1 \dots w_{i-1})$?
A: Locality. Two words are unlikely to be correlated if they are far apart!
$P(w_i | w_1 \dots w_{i-1}) \approx P(w_i | w_{i-n+1} w_{i-n+2} \dots w_{i-1})$ for a reasonably small $n$.
Q: But even for a small $n$, say 4, we are unlikely to see all possible combinations! How do we still estimate $P(w_i | w_{i-n+1} w_{i-n+2} \dots w_{i-1})$?
Many different techniques exist.
[Bengio 2003]: use a neural network to estimate the conditional probability!
We will see more techniques from other papers in two weeks.
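As a rough illustration of the chain rule combined with the locality approximation, here is a count-based sketch; the helper names, the toy corpus, and the default $n = 2$ are illustrative assumptions, and [Bengio 2003] replaces the counting with a neural network:

```python
from collections import Counter

def conditional_prob(corpus_tokens, word, context, n=2):
    """Estimate P(word | context) from relative counts, keeping only the
    last n-1 context words (the locality approximation above)."""
    ctx = tuple(context[-(n - 1):]) if n > 1 else ()
    k = len(ctx) + 1
    windows = Counter(tuple(corpus_tokens[i:i + k])
                      for i in range(len(corpus_tokens) - k + 1))
    prefix_count = sum(c for g, c in windows.items() if g[:-1] == ctx)
    return windows[ctx + (word,)] / prefix_count if prefix_count > 0 else 0.0

def sentence_prob_chain(sentence, corpus_tokens, n=2):
    """Chain rule: P(w_1 ... w_k) = prod_i P(w_i | w_1 ... w_{i-1})."""
    prob = 1.0
    for i, word in enumerate(sentence):
        prob *= conditional_prob(corpus_tokens, word, sentence[:i], n)
    return prob

corpus = "ucla is the best ucla is great".split()
print(sentence_prob_chain("ucla is the best".split(), corpus))
```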

[Bengio 2003] Intuition
If we see "A cat is walking in the bedroom", we know "A dog is running in a room" is also likely. Q: Why?
Paradigmatic relationship: "cat", "dog", … are words that often appear in similar contexts.
$P(\text{walks} \mid \text{cat}) \approx P(\text{walks} \mid \text{dog})$
Q: How can we ensure that $P(\text{walks} \mid \text{cat}) \approx P(\text{walks} \mid \text{dog})$?

[Bengio 2003] Key Formulation
Map each word $w_i$ to a vector $v_i$, so that $v_i$ and $v_j$ are close to each other when the words $w_i$ and $w_j$ are similar.
Represent $P(w_i | w_1 \dots w_{i-1}) = f_i(w_1, \dots, w_{i-1})$ as a function of the input word vectors, $f_i(v_1, \dots, v_{i-1})$.
Note that we have one function $f_i$ for every word.
Equivalently, $f = (f_1, \dots, f_V)$ is a function that outputs a $V$-dimensional vector.
Intuition: when similar words $w_a \approx w_b$ are mapped to similar vectors $v_a \approx v_b$, then $f_i(v_a, \dots, v_{i-1}) \approx f_i(v_b, \dots, v_{i-1})$, as long as $f_i$ is a smooth function.

Probability Function $f(v_1, \dots, v_n)$
A function from $n$ $m$-dimensional vectors to a $V$-dimensional probability vector.
Each dimension of the output vector represents the probability of one word $w_i$.
[Diagram: $f$ takes the input vectors $v_1, v_2, \dots, v_n$ and outputs one probability per word, e.g. $w_1: 0.03$, $w_2: 0.01$, …, $w_V: 0.05$.]
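A minimal NumPy sketch of such a function, assuming one hidden layer followed by a softmax output; the layer sizes, the tanh nonlinearity, and the random (untrained) parameters are illustrative assumptions, and the original [Bengio 2003] architecture also allows direct input-to-output connections that are omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

V, m, n, h = 10, 2, 3, 16   # vocabulary size, word-vector dim, context length, hidden units

# Word vectors and network parameters (random here; learned jointly in practice).
C = rng.normal(size=(V, m))        # one m-dimensional vector per word
H = rng.normal(size=(h, n * m))    # hidden-layer weights
d = np.zeros(h)                    # hidden-layer bias
U = rng.normal(size=(V, h))        # output-layer weights
b = np.zeros(V)                    # output-layer bias

def f(context_word_ids):
    """Map n context words to a V-dimensional probability vector over the next word."""
    x = C[context_word_ids].reshape(-1)     # concatenate the n context word vectors
    hidden = np.tanh(H @ x + d)             # smooth nonlinearity
    scores = U @ hidden + b
    exp = np.exp(scores - scores.max())     # softmax turns scores into probabilities
    return exp / exp.sum()

probs = f([3, 7, 1])    # a V-dimensional distribution over the next word
print(probs.sum())      # 1.0
```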

Example
10 words, 2-dimensional vector representation:
$w_1 = (0.1, 0.3)$, $w_2 = (0.3, 0.2)$, $w_3 = (0.7, 0.1)$, $w_4 = (0.2, 0.6)$, $w_5 = (0.7, 0.7)$,
$w_6 = (0.5, 0.2)$, $w_7 = (0.4, 0.1)$, $w_8 = (0.3, 0.5)$, $w_9 = (0.9, 0.1)$, $w_{10} = (0.4, 0.3)$
$P(w_i | w_3 w_7 w_1)$: $f((0.7, 0.1), (0.4, 0.1), (0.1, 0.3)) = (p_1, p_2, \dots, p_{10})$
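Continuing the sketch above (still with random, untrained weights), we could plug the slide's ten 2-dimensional vectors into the embedding matrix and evaluate the same call:

```python
# The ten 2-d word vectors from the example, one row per word (w_1 ... w_10).
C = np.array([[0.1, 0.3], [0.3, 0.2], [0.7, 0.1], [0.2, 0.6], [0.7, 0.7],
              [0.5, 0.2], [0.4, 0.1], [0.3, 0.5], [0.9, 0.1], [0.4, 0.3]])

p = f([2, 6, 0])    # 0-based indices of w_3, w_7, w_1
print(p)            # (p_1, ..., p_10) = P(w_i | w_3 w_7 w_1) for each word in the vocabulary
```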

Remaining Questions
Q: How can we map words to vectors so that similar words map to similar vectors?
Q: How do we obtain the function $f()$?
A: Use a neural network to find them together!

Machine Learning as Function Approximation
Claim: Most (if not all) machine learning problems are function approximation problems!
Q: What exactly does that mean?
Claim: Given input $x$, we want to find a function $y = f(x)$ that predicts the output variable $y$.
Example: Face recognition
Input: image pixels (a matrix of numbers)
Output: 1/0 – Is John in the picture?
Example: Weather prediction
Input: sensor readings
Output: [0, 1] – What is the chance of rain tomorrow?

Machine Learning as Function Approximation
Q: How can the computer "learn" the function $f(x)$ from data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ automatically?
Approach:
Pick a class of functions $f_\theta(x)$ (with parameter $\theta$) from which we will find the true $f(x)$
Linear function: $f_\theta(x) = \theta \cdot x = \theta_1 x_1 + \dots + \theta_n x_n$
Log-linear function: $f_\theta(x) = \log(\theta \cdot x) = \log(\theta_1 x_1 + \dots + \theta_n x_n)$
…
Find $\theta$ that minimizes the "difference" between $f_\theta(x)$ and the true $f(x)$
Q: But we don't know the true $f(x)$. How do we compute the difference?
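As a small, purely illustrative sketch of a hypothesis space, here are the linear and log-linear classes from the slide in NumPy; each concrete $\theta$ picks out one candidate function:

```python
import numpy as np

def f_linear(theta, x):
    """Linear hypothesis: f_theta(x) = theta . x = theta_1*x_1 + ... + theta_n*x_n."""
    return np.dot(theta, x)

def f_loglinear(theta, x):
    """Log-linear hypothesis: f_theta(x) = log(theta . x)."""
    return np.log(np.dot(theta, x))

x = np.array([1.0, 2.0, 3.0])
print(f_linear(np.array([0.5, -0.2, 0.1]), x))     # one candidate function, evaluated at x
print(f_loglinear(np.array([0.5, 0.2, 0.1]), x))   # theta . x must be positive here
```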

Loss Function $L(y, y')$
The "error" between our estimated function $f_\theta(x_i)$ and the true function $f(x_i) = y_i$
Many popular loss functions are used:
L1 norm: $\sum_j |f_\theta(x_j) - f(x_j)|$
L2 norm: $\sum_j (f_\theta(x_j) - f(x_j))^2$
KL-divergence: $\sum_j f(x_j) \log\frac{f(x_j)}{f_\theta(x_j)}$
…
Sum up the "loss" over all training data: $\sum_i L(f_\theta(x_i), y_i)$
Find the parameter $\theta$ that minimizes the loss on the training data
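A small NumPy sketch of these loss functions for a single prediction; the vector shapes and the clipping constant used to keep the logarithm finite are my own assumptions:

```python
import numpy as np

def l1_loss(y_pred, y_true):
    """L1 norm: sum of absolute differences."""
    return np.sum(np.abs(y_pred - y_true))

def l2_loss(y_pred, y_true):
    """L2 norm: sum of squared differences."""
    return np.sum((y_pred - y_true) ** 2)

def kl_loss(p_true, p_pred, eps=1e-12):
    """KL divergence between the true and predicted probability vectors."""
    p_true = np.clip(p_true, eps, 1.0)
    p_pred = np.clip(p_pred, eps, 1.0)
    return np.sum(p_true * np.log(p_true / p_pred))

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.1])
print(l1_loss(y_pred, y_true), l2_loss(y_pred, y_true), kl_loss(y_true, y_pred))
```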

Finding $\theta$
Q: How do we find the parameter $\theta$ that minimizes the loss on the training data?
Machine learning as an optimization problem:
Given $L(\theta)$, say $\sum_i (f_\theta(x_i) - y_i)^2$, find $\theta$ that minimizes $L(\theta)$ on the training data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Many different optimization techniques exist for function minimization:
Linear programming
Convex optimization
Gradient descent
…
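As a toy illustration of gradient descent (not the paper's training procedure), the sketch below minimizes the L2 loss of a linear $f_\theta$ on synthetic data; the data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (x_i, y_i) generated from a known "true" theta.
true_theta = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(100, 3))
y = X @ true_theta + 0.01 * rng.normal(size=100)

theta = np.zeros(3)          # initial guess
lr = 0.001                   # step size
for _ in range(500):
    residual = X @ theta - y             # f_theta(x_i) - y_i for every example
    grad = 2 * X.T @ residual            # gradient of L(theta) = sum_i (f_theta(x_i) - y_i)^2
    theta -= lr * grad                   # move against the gradient

print(theta)   # should end up close to true_theta
```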

Summary: Machine Learning
Machine learning requires:
Training data: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Choice of parameterized function (hypothesis space): $f_\theta(x)$
Choice of loss function
Optimization technique

Questions on Machine Learning
Q: Where do we get training data? How much do we need?
A: Collecting training data is often hard and critical. The more, the better.
Q: What class of functions $f_\theta(x)$ should we assume? How do we know whether it includes the true function $f(x)$?
A: Mainly trial and error. Before the early 2010s, mostly linear models and decision trees were used; "neural networks" became hugely popular in the last decade.
Q: What loss function should we use?
A: It depends on the goal of the application; different loss functions lead to different learned functions.
Q: What optimization technique should we use?
A: It depends on the choice of $f_\theta(x)$ and loss function:
Linear programming for a linear function
Convex optimization techniques if $f_\theta(x)$ is convex
Stochastic gradient descent (SGD) for neural networks

"Understanding" an ML Paper
What are the input and output of the problem?
How is the problem mapped into a function approximation problem?
What hypothesis space was assumed?
What loss function was used?
What technique was used to solve the loss optimization problem?
What data was obtained?
How were the results evaluated?
Your lecture should include the answers to the above, together with higher-level motivation: why it is important, why it is difficult, etc.

Announcements
The second paper summary is due before Monday's lecture: Tomas Mikolov, et al.: Distributed Representations of Words and Phrases and their Compositionality
Please sign up for Piazza: http://piazza.com/ucla/spring2019/cs249
Please form a group and submit your group information
Please sign up for a paper presentation