Word Embedding: Word2Vec

How to represent a word? The simplest representation is the one-hot representation: a vector with a single 1 and zeroes everywhere else, e.g. motel = [0 0 0 0 0 1 0 0 0 0 0 0 0 0]

Problems of the one-hot representation: High dimensionality (e.g. the Google News vocabulary has 13M words). Sparsity (only 1 non-zero value per vector). Shallow representation with no notion of similarity, e.g. motel = [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0] and hotel = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0] have dot product 0.

Word Embedding: a low-dimensional, dense vector representation of every word, e.g. motel = [1.3, -1.4] and hotel = [1.2, -1.5]
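A minimal sketch (using NumPy, with the illustrative vectors from these slides rather than actual trained embeddings) of why dense embeddings capture similarity while one-hot vectors do not:

```python
import numpy as np

# One-hot vectors from the previous slide: distinct words are always orthogonal,
# so their dot product (and cosine similarity) is 0.
motel_onehot = np.zeros(15); motel_onehot[6] = 1.0
hotel_onehot = np.zeros(15); hotel_onehot[4] = 1.0
print(motel_onehot @ hotel_onehot)          # 0.0 -> no notion of similarity

# Dense embeddings (illustrative values from the slide): similar words get similar vectors.
motel = np.array([1.3, -1.4])
hotel = np.array([1.2, -1.5])
cosine = motel @ hotel / (np.linalg.norm(motel) * np.linalg.norm(hotel))
print(cosine)                               # ~0.997 -> "motel" and "hotel" are similar
```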

How to learn such an embedding? Use context information!

A Naive Approach: build a co-occurrence matrix for the words and apply SVD (Singular Value Decomposition). Example corpus: "He is not lazy. He is intelligent. He is smart." The co-occurrence matrix has one row and one column for each vocabulary word (He, is, not, lazy, intelligent, smart); for example, the row for "He" contains counts such as 4, 2 and 1. Word co-occurrence statistics describe how words occur together, which in turn captures the relationships between words; they are computed simply by counting how often two (or more) words occur together in a given corpus. The result is a co-occurrence matrix of size V x V, and for even a modest corpus V gets very large and difficult to handle, so this architecture is rarely used in practice. Note that the co-occurrence matrix itself is not the word vector representation that is generally used: the matrix is decomposed using techniques like PCA or SVD, and combinations of the resulting factors form the word vectors.
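A minimal sketch of this naive approach, counting co-occurrences within a symmetric window and then applying truncated SVD (the window size of 1 and the embedding dimension k = 2 are illustrative choices, not values given on the slide):

```python
import numpy as np

corpus = "he is not lazy he is intelligent he is smart".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Count co-occurrences within a symmetric window around each word.
window = 1
C = np.zeros((V, V))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

# Decompose the V x V matrix with SVD and keep the top-k directions
# as low-dimensional word vectors.
U, S, Vt = np.linalg.svd(C)
k = 2
word_vectors = U[:, :k] * S[:k]          # one k-dimensional vector per word
print(dict(zip(vocab, word_vectors.round(2))))
```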

Problems of the co-occurrence matrix approach: For a vocabulary of size V, the matrix is V x V, which becomes very large and difficult to handle. The co-occurrence matrix itself is not the word vector representation; it must first be decomposed using techniques like PCA or SVD, and combinations of the resulting factors form the word vectors. This decomposition is a computationally expensive task.

Word2Vec using the Skip-gram architecture: the skip-gram model is a neural network with a single hidden layer. Networks of this shape are also often used as auto-encoders, compressing the input vector in the hidden layer.

Main idea of Word2Vec: consider a local window around a target word and predict the target word's neighbors using the skip-gram model.

Collection of Training samples with a window of size 2
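A minimal sketch of how the (target, context) training samples are collected; the window size of 2 matches the slide, while the sentence itself is just an illustrative example:

```python
# Generate (target word, context word) training pairs with a window of size 2.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```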

Skip-gram Architecture

Word Embedding: Build a vocabulary of words from the training documents, e.g. a vocabulary of 10,000 unique words. Represent an input word like “ants” as a one-hot vector: the vector has 10,000 components (one for every word in the vocabulary), with a “1” in the position corresponding to the word “ants” and 0s in all other positions.
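A minimal sketch of this step (with a toy vocabulary standing in for the 10,000-word one):

```python
import numpy as np

# Build a vocabulary from the training documents (toy example instead of 10,000 words).
documents = ["ants like sugar", "bees like flowers", "ants and bees are insects"]
vocab = sorted({w for doc in documents for w in doc.split()})
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with one component per vocabulary word: 1 at the word's position, 0 elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

print(one_hot("ants"))   # a single 1 in the position corresponding to "ants"
```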

Word Embedding: There is no activation function for the hidden-layer neurons, but the output neurons use softmax. When training the network with word pairs, the input is a one-hot vector representing the input word and the target output is also a one-hot vector representing the output word. When evaluating the trained network on an input word, however, the output vector is a probability distribution (i.e., a vector of floating-point values, not a one-hot vector).

Word Embedding: The hidden layer is represented by a weight matrix with 10,000 rows (one for every word in the vocabulary) and 300 columns (one for every hidden neuron; 300 features are used for the Google News dataset). The rows of this weight matrix are what become our word vectors!

Word Embedding: If we multiply a 1 x 10,000 one-hot vector by the 10,000 x 300 matrix, it effectively just selects the matrix row corresponding to the “1”. The hidden layer is really just operating as a lookup table: its output is the “word vector” for the input word.
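A minimal sketch (random weights, with the 10,000 x 300 shape from the slides) showing that the one-hot multiplication is exactly a row lookup:

```python
import numpy as np

V, N = 10_000, 300                 # vocabulary size, hidden-layer size
W = np.random.rand(V, N)           # hidden-layer weight matrix (rows = word vectors)

word_index = 1234                  # position of the "1" in the one-hot input
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

hidden = one_hot @ W               # (1 x 10,000) times (10,000 x 300) -> 300-dim hidden output
assert np.allclose(hidden, W[word_index])   # identical to simply looking up row 1234
```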

Word Embedding: The output layer is a softmax regression classifier. Each output is a value between 0 and 1, and all the output values sum to 1. Each output neuron has a weight vector that it multiplies (dot product) against the word vector from the hidden layer, then applies exp(x) to the result; finally, to make the outputs sum to 1, each result is divided by the sum of the results from all 10,000 output nodes.
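A minimal sketch of this output layer (random weights; the name W_out for the 300 x 10,000 output weight matrix is an assumption for illustration):

```python
import numpy as np

V, N = 10_000, 300
W_out = np.random.rand(N, V)       # one 300-dim weight vector per output neuron
word_vector = np.random.rand(N)    # hidden-layer output ("word vector") for the input word

scores = word_vector @ W_out                 # dot product with every output neuron's weights
exp_scores = np.exp(scores - scores.max())   # apply exp(x); subtract max for numerical stability
probs = exp_scores / exp_scores.sum()        # divide by the sum over all 10,000 output nodes

print(probs.sum())                 # 1.0: a probability distribution over the vocabulary
```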

Negative sampling for Skip-gram: The softmax function requires too much computation for its denominator (a summation over, say, 10,000 terms). With negative sampling, training considers only a small number of negative words (say 5) plus the positive word for this summation. A “negative” word is one for which the network should output “0”, and the “positive” word is one for which it should output “1”.
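A minimal sketch of the negative-sampling idea (5 randomly sampled negative words plus the positive word, with sigmoid outputs instead of a full softmax; the names W_in and W_out are assumptions for illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10_000, 300
W_in = rng.normal(scale=0.1, size=(V, N))    # input-side (hidden-layer) word vectors
W_out = rng.normal(scale=0.1, size=(V, N))   # output-side word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target, positive = 42, 1337                  # (input word, true context word) training pair
negatives = rng.integers(0, V, size=5)       # 5 sampled "negative" words

v = W_in[target]
# Instead of normalizing over all 10,000 outputs, only 1 positive + 5 negative scores
# are computed: push the positive output toward 1 and the negative outputs toward 0.
loss = -np.log(sigmoid(W_out[positive] @ v)) - np.sum(np.log(sigmoid(-W_out[negatives] @ v)))
print(loss)
```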

A Potential Application: Relation detection