RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS
Heng Ji, Oct 13, 2015
Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan

Word Similarity & Relatedness
How similar is pizza to pasta?
How related is pizza to Italy?
Representing words as vectors allows easy computation of similarity.
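As a quick illustration of the vector view (my own sketch, not code from the slides), here is a minimal Python example that scores pizza/pasta versus pizza/Italy with cosine similarity over invented toy count vectors:

```python
import numpy as np

# Toy co-occurrence vectors over the contexts ("cheese", "tomato", "rome").
# The counts are invented purely for illustration.
vectors = {
    "pizza": np.array([12.0, 9.0, 3.0]),
    "pasta": np.array([10.0, 8.0, 2.0]),
    "italy": np.array([2.0, 3.0, 15.0]),
}

def cosine(u, v):
    # Cosine similarity: dot product of the length-normalized vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["pizza"], vectors["pasta"]))  # high: similar words
print(cosine(vectors["pizza"], vectors["italy"]))  # lower: related, not similar
```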

Approaches for Representing Words
Distributional Semantics (Count)
  Used since the 90's
  Sparse word-context PMI/PPMI matrix
  Decomposed with SVD
Word Embeddings (Predict)
  Inspired by deep learning
  word2vec (Mikolov et al., 2013)
  GloVe (Pennington et al., 2014)
Underlying Theory: The Distributional Hypothesis (Harris, '54; Firth, '57)
  "Similar words occur in similar contexts"
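To make the count-based side concrete, here is a rough sketch (an illustration of the classic pipeline, not code from the slides): count word-context pairs, convert counts to PPMI, and optionally densify with truncated SVD. The toy corpus and window handling are deliberately simplistic.

```python
import numpy as np
from collections import Counter

corpus = [["pizza", "with", "cheese"], ["pasta", "with", "cheese"],
          ["pizza", "in", "italy"]]  # toy corpus, for illustration only

# Count (word, context) pairs within a symmetric window of size 1.
pairs = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                pairs[(w, sent[j])] += 1

words = sorted({w for w, _ in pairs})
ctxs = sorted({c for _, c in pairs})
M = np.zeros((len(words), len(ctxs)))
for (w, c), n in pairs.items():
    M[words.index(w), ctxs.index(c)] = n

# PPMI: positive pointwise mutual information over the count matrix.
total = M.sum()
pw = M.sum(axis=1, keepdims=True) / total
pc = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((M / total) / (pw * pc))
ppmi = np.maximum(pmi, 0)  # clip: zero counts (log 0 = -inf) become 0

# Optional: dense vectors via truncated SVD of the PPMI matrix.
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
dense = U[:, :2] * S[:2]   # 2-dimensional word vectors
```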

Approaches for Representing Words
Both approaches:
  Rely on the same linguistic theory
  Use the same data
  Are mathematically related: "Neural Word Embedding as Implicit Matrix Factorization" (NIPS 2014)
How come word embeddings are so much better?
  "Don't Count, Predict!" (Baroni et al., ACL 2014)
More than meets the eye...

What's really improving performance?
The Contributions of Word Embeddings:
  Novel Algorithms (objective + training method)
    Skip-Grams + Negative Sampling
    CBOW + Hierarchical Softmax
    Noise Contrastive Estimation
    GloVe
    ...
  New Hyperparameters (preprocessing, smoothing, etc.)
    Subsampling
    Dynamic Context Windows
    Context Distribution Smoothing
    Adding Context Vectors
    ...

But embeddings are still better, right?
Plenty of evidence that embeddings outperform traditional methods:
  "Don't Count, Predict!" (Baroni et al., ACL 2014)
  GloVe (Pennington et al., EMNLP 2014)
How does this fit with our story?

The Big Impact of "Small" Hyperparameters

The Big Impact of "Small" Hyperparameters
word2vec & GloVe are more than just algorithms...
They introduce new hyperparameters
May seem minor, but make a big difference in practice

Identifying New Hyperparameters

New Hyperparameters
Preprocessing (word2vec):
  Dynamic Context Windows
  Subsampling
  Deleting Rare Words
Postprocessing (GloVe):
  Adding Context Vectors
Association Metric (SGNS):
  Shifted PMI
  Context Distribution Smoothing
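Two of these knobs can be written down compactly. The sketch below is an illustration, with k = 5 and t = 1e-5 as commonly cited defaults rather than values taken from the slides: shifted positive PMI subtracts log k before clipping at zero (mirroring SGNS with k negative samples), and subsampling randomly drops frequent tokens.

```python
import numpy as np

def shifted_ppmi(pmi, k=5):
    # Shifted positive PMI: SPPMI(w, c) = max(PMI(w, c) - log k, 0),
    # where k plays the role of SGNS's number of negative samples.
    return np.maximum(pmi - np.log(k), 0)

def keep_probability(word_freq, t=1e-5):
    # One common form of word2vec-style subsampling: a token of relative
    # frequency f(w) is kept with probability sqrt(t / f(w)), clipped to 1.
    return min(1.0, float(np.sqrt(t / word_freq)))
```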

Dynamic Context Windows
Marco saw a furry little wampimuk hiding in the tree.
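The example sentence is there to show how far-away contexts get down-weighted. As a hedged sketch of the two usual schemes (word2vec samples the effective window size uniformly, which in expectation weights a context at distance d by (L - d + 1) / L; GloVe uses harmonic 1/d weighting), the snippet below just prints those weights for a nominal window of L = 4:

```python
L = 4  # nominal window size

for d in range(1, L + 1):
    w2v_weight = (L - d + 1) / L   # word2vec's expected weight at distance d
    glove_weight = 1 / d           # GloVe's harmonic weighting
    print(f"distance {d}: word2vec {w2v_weight:.2f}, GloVe {glove_weight:.2f}")
```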

Adding Context Vectors
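The figure that accompanied this slide did not survive the transcript. The technique itself is to represent each word by the sum of its word ("target") vector and its context vector instead of the word vector alone. A minimal sketch, with W and C standing in for the two matrices a trained model produces (random here, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
W = rng.normal(size=(vocab_size, dim))  # word ("target") vectors
C = rng.normal(size=(vocab_size, dim))  # context vectors (same shape)

# "Adding context vectors": use w + c as the final word representation.
embeddings = W + C
```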

Adapting Hyperparameters across Algorithms

Context Distribution Smoothing
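This slide's figures are also missing from the transcript. Context distribution smoothing raises the context distribution to a power alpha (0.75, the exponent used in word2vec's negative sampling) before computing PMI, which softens the penalty for rare contexts. A minimal numpy sketch, assuming a co-occurrence matrix M with words as rows and contexts as columns:

```python
import numpy as np

def cds_pmi(M, alpha=0.75):
    # PMI with context distribution smoothing:
    # PMI_alpha(w, c) = log( P(w, c) / (P(w) * P_alpha(c)) ),
    # where P_alpha(c) = #(c)**alpha / sum_c' #(c')**alpha.
    total = M.sum()
    pw = M.sum(axis=1, keepdims=True) / total
    counts_c = M.sum(axis=0, keepdims=True)
    pc_alpha = counts_c ** alpha / (counts_c ** alpha).sum()
    with np.errstate(divide="ignore"):
        return np.log((M / total) / (pw * pc_alpha))
    # Callers can clip at 0 afterwards to obtain the smoothed PPMI variant.
```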

Comparing Algorithms

Controlled Experiments
Prior art was unaware of these hyperparameters
Essentially, comparing "apples to oranges"
We allow every algorithm to use every hyperparameter*
* If transferable

Systematic Experiments
9 Hyperparameters (6 new)
4 Word Representation Algorithms: PPMI (Sparse & Explicit), SVD(PPMI), SGNS, GloVe
8 Benchmarks: 6 word similarity tasks, 2 analogy tasks
5,632 experiments

Hyperparameter Settings
Classic Vanilla Setting (commonly used for distributional baselines)
  Preprocessing: (none)
  Postprocessing: (none)
  Association Metric: vanilla PMI/PPMI
Recommended word2vec Setting (tuned for SGNS)
  Preprocessing: dynamic context window, subsampling
  Postprocessing: (none)
  Association Metric: shifted PMI/PPMI, context distribution smoothing
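To make the contrast concrete, the two settings could be jotted down roughly as below; the specific numbers (subsampling threshold, shift k, smoothing exponent) are typical illustrative choices, not values read off the slides.

```python
# "Vanilla" setting: the classic distributional baseline, with no
# word2vec-style preprocessing and plain PMI/PPMI as the association metric.
vanilla_setting = {
    "preprocessing": [],                      # no dynamic windows, no subsampling
    "postprocessing": [],                     # no added context vectors
    "association_metric": "PPMI",             # unshifted, unsmoothed
}

# Recommended word2vec-style setting (tuned for SGNS). Numeric values
# are illustrative defaults.
word2vec_setting = {
    "preprocessing": ["dynamic_context_window", "subsampling"],
    "subsampling_threshold": 1e-5,
    "postprocessing": [],
    "association_metric": "shifted_smoothed_PPMI",
    "pmi_shift_log_k": 5,                     # k, as in SGNS negative samples
    "cds_alpha": 0.75,                        # context distribution smoothing
}
```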

Experiments

Experiments: Prior Art
Experiments: "Apples to Apples"
Experiments: "Oranges to Oranges"

Experiments: Hyperparameter Tuning [different settings]

Overall Results
Hyperparameters often have stronger effects than algorithms
Hyperparameters often have stronger effects than more data
Prior superiority claims were not accurate

Re-evaluating Prior Claims

Don't Count, Predict! (Baroni et al., 2014)
"word2vec is better than count-based methods"
Hyperparameter settings account for most of the reported gaps
Embeddings do not really outperform count-based methods*
* Except for one task...

GloVe (Pennington et al., 2014)
"GloVe is better than word2vec"
Hyperparameter settings account for most of the reported gaps:
  Adding context vectors applied only to GloVe
  Different preprocessing
We observed the opposite: SGNS outperformed GloVe on every task
Our largest corpus: 10 billion tokens
Perhaps larger corpora behave differently?

Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)
"PPMI vectors perform on par with SGNS on analogy tasks"
  Holds for semantic analogies
  Does not hold for syntactic analogies (MSR dataset)
Hyperparameter settings account for most of the reported gaps:
  Different context type for PPMI vectors
Syntactic analogies: there is a real gap in favor of SGNS
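For readers unfamiliar with the analogy tasks referenced here, the standard recovery method (3CosAdd) returns the vocabulary word closest to b - a + a*. A minimal sketch, assuming a unit-normalized embedding matrix E and a parallel word list (both placeholders, not artifacts from the slides):

```python
import numpy as np

def analogy(a, a_star, b, words, E):
    # 3CosAdd: argmax_x cos(x, b - a + a*), excluding the three query words.
    # E: (|V|, d) matrix of unit-normalized embeddings; words: list of strings.
    idx = {w: i for i, w in enumerate(words)}
    target = E[idx[b]] - E[idx[a]] + E[idx[a_star]]
    scores = E @ (target / np.linalg.norm(target))
    for w in (a, a_star, b):
        scores[idx[w]] = -np.inf
    return words[int(np.argmax(scores))]
```

For example, analogy("man", "woman", "king", words, E) would ideally return "queen" with well-trained embeddings.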

Conclusions

Conclusions: Distributional Similarity
The Contributions of Word Embeddings:
  Novel Algorithms
  New Hyperparameters
What's really improving performance?
  Hyperparameters (mostly)
  The algorithms are an improvement
  SGNS is robust & efficient

Conclusions: Methodology
Look for hyperparameters
Adapt hyperparameters across different algorithms
For good results: tune hyperparameters
For good science: tune baselines' hyperparameters
Thank you :)