Efficient Estimation of Word Representations in Vector Space

Presentation transcript:

Efficient Estimation of Word Representations in Vector Space Paper presentation by: Aradhya Chouhan

Authors: Tomas Mikolov (Google Inc.), Kai Chen (Google Inc.), Greg Corrado (Google Inc.), Jeffrey Dean (Google Inc.)

Introduction: Many NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, since they are represented as indices in a vocabulary. Simple models trained on huge amounts of data outperform complex systems trained on less data. However, simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited, and performance is usually dominated by the size of the high-quality transcribed speech data (often just millions of words).

With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they typically outperform the simple models. Probably the most successful concept is the use of distributed representations of words: for example, neural network based language models significantly outperform N-gram models.

Objectives: The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. A further goal is to maximize the accuracy of vector operations on the resulting word vectors by developing new model architectures that preserve the linear regularities among words.

Language Models: A language model assigns a probability to a sequence of words: P(w1, w2, …, wT). This is useful in machine translation for word ordering, e.g. P(the cat is small) > P(small the is cat), and for word choice, e.g. P(walking home after school) > P(walking house after school).
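To make the word-ordering point concrete, here is a minimal sketch (not from the paper) of a count-based bigram language model with add-alpha smoothing; the tiny corpus and the smoothing constant are assumptions for illustration only.

```python
# A toy bigram language model that scores word sequences with the chain rule,
# illustrating why P(the cat is small) > P(small the is cat).
from collections import Counter

corpus = "the cat is small . the dog is small . the cat is black .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_prob(words, alpha=0.1):
    """P(w1..wT) ~= P(w1) * prod P(w_t | w_{t-1}), with add-alpha smoothing."""
    vocab_size = len(unigrams)
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return p

print(sentence_prob("the cat is small".split()))   # noticeably larger
print(sentence_prob("small the is cat".split()))   # much smaller
```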

One-hot encoding:
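As a minimal sketch of one-hot encoding (using the vocabulary from the co-occurrence slide later in this presentation): each word becomes a sparse vector of length |V| with a single 1 at its index, and any two different words have zero similarity.

```python
# One-hot vectors are orthogonal, so they carry no notion of word similarity.
import numpy as np

vocab = ["silence", "is", "the", "language", "of", "gods",
         "all", "else", "poor", "translation"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("language"))                     # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(one_hot("gods") @ one_hot("language"))   # 0.0 -> no similarity signal
```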


Dimensionality reduction (Idea 1), singular value decomposition: X = UΣVᵀ. The columns of U contain the eigenvectors of XXᵀ, the columns of V contain the eigenvectors of XᵀX, and Σ is a diagonal matrix whose diagonal values are the singular values (the square roots of the eigenvalues of XᵀX). The computational cost of SVD is quadratic, O(dn²) for a d × n matrix, which is impractical for large datasets.
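A minimal sketch of SVD-based dimensionality reduction, assuming a tiny made-up word-context count matrix; keeping only the top-k singular vectors yields dense k-dimensional word vectors, as in LSA.

```python
# Truncated SVD of a word-context count matrix X (rows = words, cols = contexts).
import numpy as np

# Toy 4-word x 5-context count matrix (illustrative values, not real data).
X = np.array([[2., 0., 1., 0., 0.],
              [1., 1., 0., 0., 0.],
              [0., 0., 0., 2., 1.],
              [0., 0., 1., 1., 2.]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(S) @ Vt
k = 2
word_vectors = U[:, :k] * S[:k]                    # dense k-dimensional word vectors
print(word_vectors)
```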

Dimensionality reduction (Idea 2): predict the surrounding words of every word (word2vec), or capture co-occurrence counts directly (GloVe). These approaches are fast and can incorporate a new sentence or add a word to the vocabulary.

Distributional Hypothesis: The distributional hypothesis in linguistics is derived from the semantic theory of language usage. “Words that are used and occur in the same contexts tend to purport similar meanings.” → Harris, Z. (1954), Distributional Structure. “A word is characterized by the company it keeps.” → Firth, J. R. (1957), A Synopsis of Linguistic Theory.

Co-occurrence matrix: All the models we will see use the idea of co-occurrence implicitly or explicitly. Co-occurrence can be seen as an indicator of semantic proximity of words. “Silence is the language of Gods, all else is poor translation.” → Rumi (1207–1273). Vocabulary → silence, is, the, language, of, gods, all, else, poor, translation

[Co-occurrence matrix slide: a 10 × 10 table whose rows and columns are the vocabulary words above, with each cell counting how often the two words co-occur.]
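A minimal sketch of how the co-occurrence matrix for the quote above could be built; the context window size (one word on each side) is an assumption, since the slide does not specify it.

```python
# Build a symmetric co-occurrence matrix from the Rumi quote.
import numpy as np

tokens = "silence is the language of gods all else is poor translation".split()
vocab = list(dict.fromkeys(tokens))          # vocabulary in order of first appearance
idx = {w: i for i, w in enumerate(vocab)}

window = 1                                   # assumed context window
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            M[idx[w], idx[tokens[j]]] += 1

print(vocab)
print(M)
```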

Neural Network Language Model Architecture:

Feedforward NNLM:

NNLM Architectures: Feedforward Neural Net Language Model (NNLM). Complexity: Q = N × D + N × D × H + H × V, where V is the size of the vocabulary, N is the number of previous words (each encoded using 1-of-V coding), D is the dimensionality of the projection layer, and H is the hidden layer size. Many different types of models have been proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). This paper focuses on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words; moreover, LDA becomes computationally very expensive on large data sets. The NNLM architecture is expensive for the computation between the projection and the hidden layer, since the values in the projection layer are dense.
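A minimal sketch (assumed toy dimensions, not the paper's implementation) of one forward pass through a feedforward NNLM, with comments mapping each step to the terms of Q = N × D + N × D × H + H × V.

```python
# One forward pass of a feedforward NNLM on random weights and a random context.
import numpy as np

V, N, D, H = 1000, 4, 50, 100        # vocab size, context words, projection dim, hidden dim
rng = np.random.default_rng(0)

C = rng.normal(size=(V, D))          # shared projection (embedding) matrix
W_h = rng.normal(size=(N * D, H))    # projection -> hidden weights
W_out = rng.normal(size=(H, V))      # hidden -> output weights

context = rng.integers(0, V, size=N)             # indices of the N previous words

proj = C[context].reshape(-1)                    # table lookups        -> N x D term
hidden = np.tanh(proj @ W_h)                     # dense matmul         -> N x D x H term
logits = hidden @ W_out                          # full output layer    -> H x V term
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over the vocabulary
print(probs.shape)                               # (1000,) distribution over the next word
```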

Recurrent NNLM:

NNLM Architectures: Recurrent Neural Net Language Model (RNNLM). Complexity: Q = H × H + H × V, where H is the dimensionality of the hidden layer and V is the size of the vocabulary.
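A minimal sketch (again with assumed toy dimensions) of one time step of a recurrent NNLM, with comments mapping the operations to the terms of Q = H × H + H × V.

```python
# A single-layer recurrent language model stepped over a short random sequence.
import numpy as np

V, H = 1000, 100
rng = np.random.default_rng(0)

W_in = rng.normal(size=(V, H))     # input embedding (acts as the 1-of-V lookup)
W_rec = rng.normal(size=(H, H))    # recurrent weights          -> H x H term
W_out = rng.normal(size=(H, V))    # hidden -> output weights   -> H x V term

h = np.zeros(H)                    # hidden state carries the short-term history
for word in rng.integers(0, V, size=5):     # a short random word sequence
    h = np.tanh(W_in[word] + h @ W_rec)     # recurrent update
    logits = h @ W_out                      # scores for the next word
print(logits.shape)                # (1000,)
```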

New Log-linear Models: Continuous Bag-of-Words (CBOW). The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged).

New Log-linear Models: Continuous Bag-of-Words (CBOW). Training complexity: Q = N × D + D × log₂(V), where the log₂(V) term comes from the hierarchical softmax output layer.
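A minimal sketch of the CBOW idea with assumed toy dimensions: the context word vectors are averaged and used to score the current word. A full softmax is used here for simplicity; the D × log₂(V) term in the complexity above comes from replacing it with a hierarchical softmax.

```python
# CBOW forward pass: average the context vectors, then score the current word.
import numpy as np

V, D, N = 1000, 50, 4
rng = np.random.default_rng(0)

W_in = rng.normal(size=(V, D))     # shared projection layer (word vectors)
W_out = rng.normal(size=(D, V))    # output weights

context = rng.integers(0, V, size=N)        # N surrounding words
h = W_in[context].mean(axis=0)              # average of context vectors (N x D cost)
logits = h @ W_out                          # score every vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.argmax())                       # predicted current (middle) word
```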

New Log-linear Models: Continuous skip-gram The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word.

New Log-linear Models: Continuous Skip-gram Model. Training complexity: Q = C × (D + D × log₂(V)), where C is the maximum distance of the predicted words from the current word.
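A small sketch that plugs assumed, illustrative hyperparameter values into the four complexity formulas above to show why the log-linear models are so much cheaper per training example.

```python
# Compare per-example training complexity Q of the four architectures.
import math

V = 1_000_000   # vocabulary size (assumed)
N = 8           # context words (assumed)
D = 300         # projection / embedding dimensionality (assumed)
H = 500         # hidden layer size (assumed)
C = 10          # maximum skip-gram window (assumed)

Q_nnlm  = N * D + N * D * H + H * V
Q_rnnlm = H * H + H * V
Q_cbow  = N * D + D * math.log2(V)
Q_skip  = C * (D + D * math.log2(V))

for name, q in [("NNLM", Q_nnlm), ("RNNLM", Q_rnnlm),
                ("CBOW", Q_cbow), ("Skip-gram", Q_skip)]:
    print(f"{name:9s} Q = {q:,.0f}")
```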

Task Description: The authors define a comprehensive test set that contains five types of semantic questions and nine types of syntactic questions; overall, there are 8869 semantic and 10675 syntactic questions. A question is assumed to be correctly answered only if the closest word to the vector computed by the analogy arithmetic (e.g., vector("biggest") − vector("big") + vector("small")) is exactly the same as the correct word in the question; synonyms are thus counted as mistakes.
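A minimal sketch of the analogy evaluation with hypothetical (random) word vectors: the answer to “a is to b as c is to ?” is the nearest word, by cosine similarity, to vec(b) − vec(a) + vec(c), excluding the question words; only an exact match with the expected answer counts as correct.

```python
# Analogy evaluation by vector arithmetic and nearest-neighbour search.
import numpy as np

# Hypothetical pre-trained word vectors (random here, for illustration only).
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "paris", "france"]
vectors = {w: rng.normal(size=50) for w in vocab}

def answer_analogy(a, b, c):
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in (a, b, c):                       # exclude the question words
            continue
        sim = (v / np.linalg.norm(v)) @ target   # cosine similarity
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(answer_analogy("man", "king", "woman"))    # ideally "queen" with real vectors
```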


Results:


Conclusion: In this paper we studied the quality of vector representations of words derived by various models on a collection of syntactic and semantic language tasks. We observed that it is possible to train high quality word vectors using very simple model architectures, compared to the popular neural network models (both feedforward and recurrent). Because of the much lower computational complexity, it is possible to compute very accurate high dimensional word vectors from a much larger data set.