
A Tutorial on ML Basics and Embedding Chong Ruan


1 A Tutorial on ML Basics and Embedding Chong Ruan ruanchong_ruby@163.com

2 Machine Learning Basics What is machine learning? Traditionally: Input + Model => Output What ML does: Input (+ Output) => Model

3 Machine Learning Basics Example: How to write a classifier that can distinguish between a watermelon and an apple? The traditional way: use hand-crafted features and manually code how they relate to the expected output. Feature: express a sample as some values (typically a vector), e.g. the color, shape, and weight of a fruit. If (weight > 2kg) print “Watermelon”; else print “Apple”. Difficulty: sometimes the relation between inputs and outputs is too complicated to be formulated by hand. Say, how would you write a program to recognize a cat?

4 Machine Learning Basics Example: How to write a classifier that can distinguish between a watermelon and an apple? The ML way: collect some data with labels, say: (7.5, watermelon), (8.1, watermelon), (0.5, apple), (0.6, apple). Propose a hypothesis (model space): if (weight > T) print “Class A”; else print “Class B” (T: unknown parameter). You may want to use a more complicated hypothesis for more difficult tasks. Training/learning the model: typically achieved by optimizing an objective function, i.e. find the best threshold T such that most samples are classified correctly.
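
A minimal sketch of this training step, using the toy labelled data from the slide: it simply scans candidate thresholds and keeps the one with the highest accuracy (an illustration, not the author's actual code).

    # Learn the threshold T for the rule "weight > T => watermelon"
    # from the toy labelled data on the slide.
    data = [(7.5, "watermelon"), (8.1, "watermelon"), (0.5, "apple"), (0.6, "apple")]

    def accuracy(T):
        # Fraction of samples classified correctly by "weight > T => watermelon".
        correct = sum((w > T) == (label == "watermelon") for w, label in data)
        return correct / len(data)

    # Candidate thresholds: midpoints between consecutive sorted weights.
    weights = sorted(w for w, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(weights, weights[1:])]

    best_T = max(candidates, key=accuracy)
    print(best_T, accuracy(best_T))   # T = 4.05, accuracy 1.0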

5 Machine Learning Basics The difficulties of the ML approach: how to collect (labelled) data, what kind of hypothesis to use, … The first (and most important) step is: how to choose features.

6 Machine Learning Basics Look back at the aforementioned fruit example. How to represent a fruit: its weight, color, flavor, size, etc. Feature selection: which subset of features is useful for our purpose (classification)? In this example, weight (and/or size) is a great choice (thanks to our prior knowledge).

7 Machine Learning Basics More on feature selection: sometimes we do not need to select (or construct) proper features; just feed all available features to a model and let the model learn automatically! But too many features may slow down the training and predicting procedure, and cause the model to overfit or otherwise misfit the dataset. So we want to reduce the dimension of the collected data…

8 Machine Learning Basics Examples of huge feature sets: In NLP: use TF-IDF features to express a document. In music recommendation: use the user-music matrix to predict users’ preferences, typically a really huge matrix. …

9 Dimension Reduction PCA: Principal Component Analysis A well-known dimension reduction algorithm An example:

10 Dimension Reduction PCA: Principal Component Analysis A well-known dimension reduction algorithm An example:

11 Dimension Reduction PCA: Principal Component Analysis Algorithm:
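
A short NumPy sketch of the PCA steps used on these slides: center the data, form the covariance matrix C, eigendecompose it, and project onto the top-k eigenvectors. The 2 x 5 example matrix below is arbitrary (not necessarily the one on the slide), chosen so that each dimension already has mean 0.

    import numpy as np

    def pca(X, k):
        """Rows of X are dimensions, columns are samples (the slides' convention)."""
        X = X - X.mean(axis=1, keepdims=True)     # 1. zero-center each dimension
        C = X @ X.T / X.shape[1]                  # 2. covariance matrix C
        eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigen-decomposition (C is symmetric)
        order = np.argsort(eigvals)[::-1]         # 4. sort eigenvectors by eigenvalue, descending
        P = eigvecs[:, order[:k]].T               # 5. top-k eigenvectors as the rows of P
        return P @ X                              # 6. project: Y = P X, a k x n matrix

    X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
                  [-2.0,  0.0, 0.0, 1.0, 1.0]])
    print(pca(X, k=1))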

12 Dimension Reduction PCA: Principal Component Analysis For the aforementioned example: Matrix X: the mean of each dimension is already 0. Covariance matrix C:

13 Dimension Reduction PCA: Principal Component Analysis For the aforementioned example: Eigenvalue decomposition. Choose the top k rows of the matrix P and left-multiply X; in this case k = 1.

14 Dimension Reduction

15 Embedding

16 Graph embedding Suppose you have a graph with m vertices, where each vertex represents a data point (say, a person, a word, etc.), and a similarity matrix W, where Wij measures to what extent vertices i and j are similar (say, the number of mutual friends, mutual information, etc.). Using the row itself to represent each sample is too cumbersome: a social network may have millions of users; a corpus may have tens of thousands of words. The purpose: assign a low-dimensional vector to each vertex of the graph that preserves the similarities between vertex pairs.
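
The slides do not name a specific algorithm here; as one concrete illustration (a Laplacian-eigenmaps-style spectral embedding, not necessarily the method the tutorial has in mind), low-dimensional coordinates can be read off the eigenvectors of the graph Laplacian built from W:

    import numpy as np

    def spectral_embedding(W, k):
        """Embed each of the m vertices into R^k.
        W: symmetric m x m similarity matrix (W[i, j] = similarity of vertices i and j)."""
        d = W.sum(axis=1)
        L = np.diag(d) - W                     # unnormalized graph Laplacian
        eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
        # Skip the trivial constant eigenvector (eigenvalue ~0); take the next k.
        return eigvecs[:, 1:k + 1]             # row i = k-dimensional embedding of vertex i

    # Toy example: two loosely connected triangles; vertices in the same
    # triangle end up with nearby embeddings.
    W = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    print(spectral_embedding(W, k=2))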

17 Embedding Word embedding (word vector) Suppose you have a corpus. The patterns in which words occur in the corpus reflect their meanings: similar/related words are likely to appear together. The purpose: assign a low-dimensional vector to each word which preserves the information of the word, say, its polarity, grammatical function, semantic properties, etc. If two words are similar, their word embeddings are similar.

18 Embedding The advantages of embedding: Compact (word embedding vs. one-hot expression). Handy for numerical operations, say, calculating the similarity between samples. Easy to visualize: use PCA/LDA/… to transform data points to a 2- or 3-dimensional space and plot them.
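
As a small illustration of the "numerical operations" point, the similarity between two samples is just a vector computation; the 4-dimensional vectors below are made up for the example.

    import numpy as np

    def cosine_similarity(u, v):
        # Cosine of the angle between two embedding vectors: near 1 = very similar.
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Made-up embeddings, purely for illustration.
    v_cat = np.array([0.9, 0.1, 0.3, 0.0])
    v_dog = np.array([0.8, 0.2, 0.4, 0.1])
    v_car = np.array([0.0, 0.9, 0.0, 0.8])
    print(cosine_similarity(v_cat, v_dog))   # high: related words
    print(cosine_similarity(v_cat, v_car))   # low: unrelated words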

19 Embedding Obtain embeddings with neural networks. Wishful thinking (consider: how do you write a recursive function?): suppose we already have an embedding (from random initialization or some heuristics), set a proper objective function, and optimize it; the embeddings will converge to reasonable positions. An example: PageRank. When considering a (probabilistic) model, be sure to distinguish between: Representation (Modelling), Learning (Training), Inference (Predicting).

20 Neural Networks (Modelling) What is a neural network: a special kind of function, inspired by neurons and their connections. To recap, what ML is: collect data, propose a hypothesis, fit the model to the data by optimizing an objective function. Say, minimize the misclassification rate (for a classification problem), minimize intra-cluster distance while maximizing inter-cluster distance (for a clustering problem), minimize reconstruction error (for representation learning), etc.

21 Neural Networks (Modelling)

22

23 Illustration (Photo credit: Andrew Ng):

24 Neural Networks (Modelling)

25

26 Validation: What if the relation between inputs and outputs is not of this form? No need to worry: a 3-layer neural network is a universal approximator; it can approximate any continuous function to any accuracy. Analogy: if you have some 2-dimensional data and you want to fit them, you can always use a polynomial; any accuracy can be achieved provided the degree of the polynomial is high enough. So we can always use a neural network (of 3 or more layers) to fit any data, say, the function that predicts the next word given its previous words, if it exists.
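
The polynomial analogy as a quick NumPy sketch, on arbitrary toy data: higher-degree polynomials drive the fitting error down, mirroring the claim that a big enough network can fit the data arbitrarily well.

    import numpy as np

    # Toy 2-dimensional data: noisy samples of an unknown curve.
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 20)
    y = np.sin(3 * x) + 0.05 * rng.standard_normal(20)

    # Fit polynomials of increasing degree; the training error shrinks with the degree
    # (although, as noted earlier, too flexible a model may overfit).
    for degree in (1, 3, 7):
        coeffs = np.polyfit(x, y, degree)
        error = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(degree, error)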

27 Neural Networks (Training) Set an objective function (cost function), say, mean squared error, cross entropy, etc. Find a set of parameters which can best explain the data, using numerical optimization methods, typically gradient descent. We need to compute the cost function and its gradient.
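
A bare-bones gradient descent loop under these assumptions; the learning rate, step count, and the quadratic example cost are arbitrary choices for illustration.

    import numpy as np

    def gradient_descent(grad, theta0, lr=0.1, n_steps=1000):
        """Minimize a cost function by repeatedly stepping against its gradient."""
        theta = np.array(theta0, dtype=float)
        for _ in range(n_steps):
            theta -= lr * grad(theta)   # move a small step downhill
        return theta

    # Example: minimize a simple quadratic bowl with minimum at (1, -2).
    grad = lambda t: np.array([2 * (t[0] - 1), 2 * (t[1] + 2)])
    print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approaches [1, -2]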

28 Neural Networks (Training) Gradient descent (illustration)

29 Neural Networks (Training) Calculate the cost function: Forward Propagation. z: value before activation; a: value after activation.
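
A forward-propagation sketch for a small fully connected network, using the slide's notation (z: pre-activation, a: post-activation); the layer sizes, sigmoid activation, and random weights are placeholder assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weights, biases):
        """Compute the output layer by layer: z = W a_prev + b, a = sigmoid(z)."""
        a = x
        zs, activations = [], [x]
        for W, b in zip(weights, biases):
            z = W @ a + b          # value before activation
            a = sigmoid(z)         # value after activation
            zs.append(z)
            activations.append(a)
        return zs, activations     # keep both: back propagation needs them

    # A 3-4-2 network with random placeholder parameters.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    biases = [rng.standard_normal(4), rng.standard_normal(2)]
    zs, activations = forward(rng.standard_normal(3), weights, biases)
    print(activations[-1])   # the network's output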

30 Neural Networks (Training)

31 Calculate the gradient of the objective function: Back Propagation (the chain rule).
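
Continuing the forward-propagation sketch above: back propagation applies the chain rule layer by layer, reusing the stored z and a values. A squared-error cost is assumed here; it plugs into the zs/activations returned by the forward() sketch.

    import numpy as np

    def sigmoid_prime(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1.0 - s)

    def backward(zs, activations, weights, y):
        """Gradients of the cost 0.5 * ||a_out - y||^2 w.r.t. every W and b."""
        grads_W = [None] * len(weights)
        grads_b = [None] * len(weights)
        # Error at the output layer, then propagate it backwards (the chain rule).
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        for l in range(len(weights) - 1, -1, -1):
            grads_W[l] = np.outer(delta, activations[l])   # dC/dW = delta * a_prev^T
            grads_b[l] = delta                             # dC/db = delta
            if l > 0:
                delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])
        return grads_W, grads_b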

32 Neural Networks (Training)

33

34 Remark on Back Propagation Intuition: propagate the error over the graph. A more formal and neat way to understand Back Propagation: write the network as a composition of per-layer functions, then use the chain rule.

35 Neural Networks (Training) Take-home message: Modelling: propose a hypothesis that the relation between the inputs and outputs can be approximated by a special kind of parameterized function. Training: find a set of parameters which explains the observed data best, achieved by optimization algorithms, typically gradient descent; the gradient can be computed efficiently by Back Propagation (the chain rule); use random initialization (not all zeros!).

36 Neural Networks (Inference) Given a new example, feed it to the trained neural network Use forward propagation to compute the output Take the output as your prediction

37 Neural Networks (Application) Multi-class classification

38 Neural Networks (Application) What is flexible: The network architecture

39 Neural Networks (Application)

40 Neural Networks in NLP How to express a word as a vector: A unique integer as word ID: can be used for indexing, but is not a semantic embedding, because an improper order is introduced. One-hot expression: use a vector of length |V| to represent a word; only one component is 1, the rest are 0. Compare: sour, sweet, bitter, spicy (no natural order) vs. not spicy, mildly spicy, medium spicy, extra spicy (naturally ordered).
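
A tiny illustration of the two encodings just described, using a toy four-word vocabulary:

    import numpy as np

    vocab = ["sour", "sweet", "bitter", "spicy"]          # toy vocabulary, |V| = 4
    word_to_id = {w: i for i, w in enumerate(vocab)}      # integer IDs: good for indexing only

    def one_hot(word):
        # Length-|V| vector: a single 1 at the word's index, 0 elsewhere.
        v = np.zeros(len(vocab))
        v[word_to_id[word]] = 1.0
        return v

    print(word_to_id["bitter"])   # 2  (the ID itself carries no meaning or order)
    print(one_hot("bitter"))      # [0. 0. 1. 0.]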

41 Neural Networks in NLP Distributed expression Express a word as an n-dimensional real vector (more on this later). Pros: compact, a dimensionality of several hundred works well; can encode semantic properties of words: similar words have similar vectors. Cons: hard to interpret the “meaning” of its components.

42 Neural Networks in NLP How to obtain word vectors? Hypothesis: words in a sentence are not chosen randomly; there is a formula that, given the first n words, gives the distribution of the next word, a.k.a. a Language Model (which models the probability of a given sentence). And the formula can be approximated by a neural network, a.k.a. a neural network language model (NNLM).

43 Neural Networks in NLP More on language models: suppose we have “我 爱 …” (“I love …”). What should be the next word? P(你 “you”) = 0.003, P(北京 “Beijing”) = 0.00065, P(睡觉 “sleep”) = 0.00087, P(我 “I”) = 3.8e-9, P(在 “at”) = 1.2e-10, … Useful for machine translation, input methods, spell checking, etc.
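
The neural approach comes next; for intuition, here is the simplest possible count-based estimate of such next-word probabilities (a bigram model over a toy English corpus, not the slide's actual numbers):

    from collections import Counter, defaultdict

    corpus = ["i love you", "i love beijing", "i love you", "you love sleep"]

    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[prev][nxt] += 1

    def next_word_distribution(prev):
        counts = bigram_counts[prev]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    print(next_word_distribution("love"))   # {'you': 0.5, 'beijing': 0.25, 'sleep': 0.25}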

44 Neural Networks in NLP

45 Neural network language model: Use a neural network to define the distribution of the next word The network structure is flexible (up to you)

46 Neural Networks in NLP

47 An example (Bengio et al., 2003): Hidden layer: concatenate the input word embeddings

48 Neural Networks in NLP

49 An example (Bengio et al., 2003): Objective function: maximize the likelihood of the corpus

50 Neural Networks in NLP Don’t be too afraid of it. Keep in mind: a neural network is a special kind of function; we want to find a set of parameters that optimizes the objective function. Now we have two kinds of parameters: the network weights and the word embeddings. The learning algorithm: random initialization, then gradient descent to update the parameters (note: both the network weights and the word embeddings), with back propagation to calculate the gradients. Pros: automatic smoothing.

51 Neural Networks in NLP Visualize word embeddings using t-SNE: http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png
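
A generic recipe for plots like this with scikit-learn and matplotlib; the embeddings matrix and word list below are random placeholders standing in for a trained model's vectors.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Placeholders: rows of `embeddings` are word vectors, `words` the matching tokens.
    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((200, 50))
    words = [f"word{i}" for i in range(200)]

    # Project the 50-dimensional vectors to 2-D and label each point with its word.
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

    plt.figure(figsize=(10, 10))
    plt.scatter(points[:, 0], points[:, 1], s=5)
    for (x, y), w in zip(points, words):
        plt.annotate(w, (x, y), fontsize=6)
    plt.savefig("embeddings_tsne.png")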

52 Neural Networks in NLP Visualize word embeddings using t-SNE: http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png

53 Neural Networks in NLP Visualize word embeddings using t-SNE: Bilingual embedding Socher et al. 2013

54 Neural Networks in NLP A more efficient network Word2vec (Mikolov et al., 2013) Continuous bag of words (CBOW) + hierarchical softmax (HS)
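
If you just want to use word2vec rather than re-implement it, the gensim library exposes roughly these options; this is a sketch with gensim 4.x parameter names and a placeholder toy corpus, not part of the original tutorial.

    from gensim.models import Word2Vec

    # Placeholder corpus: a list of tokenized sentences.
    sentences = [["i", "love", "natural", "language", "processing"],
                 ["word", "embeddings", "are", "useful"],
                 ["i", "love", "machine", "learning"]]

    model = Word2Vec(
        sentences,
        vector_size=100,   # embedding dimension
        window=5,          # context size on each side
        sg=0,              # 0 = CBOW (1 would be skip-gram)
        hs=1,              # use hierarchical softmax
        negative=0,        # disable negative sampling since HS is on
        min_count=1,       # keep even rare words in this tiny toy corpus
    )
    print(model.wv["love"][:5])   # first few components of one learned vector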

55 Neural Networks in NLP CBOW Given context, predict current word

56 Neural Networks in NLP

57 HS An example: For the leaf “足球” (“football”): Then:

58 Neural Networks in NLP HS Use the Huffman tree: intuitively, frequent words should have larger probabilities, so make their paths shorter.
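
The Huffman intuition in code, with made-up word frequencies: frequent words end up closer to the root, so their paths, and hence the number of binary decisions the model makes for them, are shorter.

    import heapq

    def huffman_code_lengths(freqs):
        """Return each word's Huffman code length (tree depth) given its frequency."""
        heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)           # merge the two rarest subtrees
            f2, _, d2 = heapq.heappop(heap)
            merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}  # one level deeper
            heapq.heappush(heap, (f1 + f2, counter, merged))
            counter += 1
        return heap[0][2]

    freqs = {"the": 5000, "of": 4000, "football": 300, "aardvark": 2}
    print(huffman_code_lengths(freqs))   # frequent words get shorter paths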

59 Neural Networks in NLP The whole model

60 Neural Networks in NLP Learning procedure: SGD (details omitted). Quite efficient: the computation cost from the hidden layer to the output layer is reduced greatly. Open source: For Linux: http://word2vec.googlecode.com/svn/trunk/ (works on Mac with minor modifications). For Windows: https://github.com/jdeng/word2vec (requires C++11 support). See http://blog.csdn.net/heyongluoyao8/article/details/43488765 for reference.

61 Neural Networks in NLP Something more appealing: Analogy property Man – Woman = King – Queen = …
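
With a trained model the analogy is just vector arithmetic; the gensim call below is one way to query it (the pretrained "word2vec-google-news-300" vectors are an assumption of this sketch, and meaningful analogies require such a large corpus, not the toy examples above).

    import gensim.downloader as api

    # Downloads pretrained Google News vectors (a large file) as a KeyedVectors object;
    # any trained word2vec model can be queried the same way.
    vectors = api.load("word2vec-google-news-300")

    # man - woman ≈ king - queen, so king - man + woman should land near "queen".
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))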

62 Something more crazy Embedding words and images into the same space Socher et al. 2013

63 Something more crazy Embedding sentences/paragraphs/documents… Cho et al. 2014

64 Something more crazy Embedding sentences/paragraphs/documents… Cho et al. 2014

65 The End Thanks for listening!

