DISTRIBUTIONAL SEMANTICS

Slides:

Advertisements

Similar presentations

Request Dispatching for Cheap Energy Prices in Cloud Data Centers

Advertisements

SpringerLink Training Kit

Luminosity measurements at Hadron Colliders

From Word Embeddings To Document Distances

Choosing a Dental Plan Student Name

Virtual Environments and Computer Graphics

Chương 1: CÁC PHƯƠNG THỨC GIAO DỊCH TRÊN THỊ TRƯỜNG THẾ GIỚI

THỰC TIỄN KINH DOANH TRONG CỘNG ĐỒNG KINH TẾ ASEAN –

D. Phát triển thương hiệu

NHỮNG VẤN ĐỀ NỔI BẬT CỦA NỀN KINH TẾ VIỆT NAM GIAI ĐOẠN

Điều trị chống huyết khối trong tai biến mạch máu não

BÖnh Parkinson PGS.TS.BS NGUYỄN TRỌNG HƯNG BỆNH VIỆN LÃO KHOA TRUNG ƯƠNG TRƯỜNG ĐẠI HỌC Y HÀ NỘI Bác Ninh 2013.

Nasal Cannula X particulate mask

Evolving Architecture for Beyond the Standard Model

HF NOISE FILTERS PERFORMANCE

Electronics for Pedestrians – Passive Components –

Parameterization of Tabulated BRDFs Ian Mallett (me), Cem Yuksel

L-Systems and Affine Transformations

CMSC423: Bioinformatic Algorithms, Databases and Tools

Some aspect concerning the LMDZ dynamical core and its use

Bayesian Confidence Limits and Intervals

实习总结（Internship Summary)

Current State of Japanese Economy under Negative Interest Rate and Proposed Remedies Naoyuki Yoshino Dean Asian Development Bank Institute Professor Emeritus,

Front End Electronics for SOI Monolithic Pixel Sensor

Face Recognition Monday, February 1, 2016.

Solving Rubik's Cube By: Etai Nativ.

CS284 Paper Presentation Arpad Kovacs

انتقال حرارت 2 خانم خسرویار.

Summer Student Program First results

Theoretical Results on Neutrinos

HERMESでのHard Exclusive生成過程による核子内クォーク全角運動量についての研究

Wavelet Coherence & Cross-Wavelet Transform

yaSpMV: Yet Another SpMV Framework on GPUs

Creating Synthetic Microdata for Higher Educational Use in Japan: Reproduction of Distribution Type based on the Descriptive Statistics Kiyomi Shirakawa.

MOCLA02 Design of a Compact L-band Transverse Deflecting Cavity with Arbitrary Polarizations for the SACLA Injector Sep. 14th, 2015 H. Maesaka, T. Asaka,

Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,

Fuel cell development program for electric vehicle

Overview of TST-2 Experiment

Optomechanics with atoms

داده کاوی سئوالات نمونه

Inter-system biases estimation in multi-GNSS relative positioning with GPS and Galileo Cecile Deprez and Rene Warnant University of Liege, Belgium

ლექცია 4 - ფული და ინფლაცია

10. predavanje Novac i financijski sustav

Wissenschaftliche Aussprache zur Dissertation

FLUORECENCE MICROSCOPY SUPERRESOLUTION BLINK MICROSCOPY ON THE BASIS OF ENGINEERED DARK STATES* *Christian Steinhauer, Carsten Forthmann, Jan Vogelsang,

Particle acceleration during the gamma-ray flares of the Crab Nebular

Interpretations of the Derivative Gottfried Wilhelm Leibniz

Advisor: Chiuyuan Chen Student: Shao-Chun Lin

Widow Rockfish Assessment

SiW-ECAL Beam Test 2015 Kick-Off meeting

On Robust Neighbor Discovery in Mobile Wireless Networks

Chapter 6 并发：死锁和饥饿 Operating Systems: Internals and Design Principles

You NEED your book!!! Frequency Distribution

Y V =0 a V =V0 x b b V =0 z

Fairness-oriented Scheduling Support for Multicore Systems

Climate-Energy-Policy Interaction

Hui Wang†*, Canturk Isci‡, Lavanya Subramanian*,

Ch48 Statistics by Chtan FYHSKulai

The ABCD matrix for parabolic reflectors and its application to astigmatism free four-mirror cavities.

Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs

Online Learning: An Introduction

Factor Based Index of Systemic Stress (FISS)

What is Chemistry? Chemistry is: the study of matter & the changes it undergoes Composition Structure Properties Energy changes.

THE BERRY PHASE OF A BOGOLIUBOV QUASIPARTICLE IN AN ABRIKOSOV VORTEX*

Quantum-classical transition in optical twin beams and experimental applications to quantum metrology Ivano Ruo-Berchera Frascati.

The Toroidal Sporadic Source: Understanding Temporal Variations

FW 3.4: More Circle Practice

ارائه یک روش حل مبتنی بر استراتژی های تکاملی گروه بندی برای حل مسئله بسته بندی اقلام در ظروف

Decision Procedures Christoph M. Wintersteiger 9/11/2017 3:14 PM

Limits on Anomalous WWγ and WWZ Couplings from DØ

Presentation transcript:

DISTRIBUTIONAL SEMANTICS Heng Ji jih@rpi.edu September 29, 2016

Word Similarity & Relatedness How similar is pizza to pasta? How related is pizza to Italy? Representing words as vectors allows easy computation of similarity If you’ve come to this talk, you’re probably all interested in measuring word similarity, or relatedness. For example, how similar is pizza to pasta? Or how related is pizza to Italy? Representing words as vectors has become a very convenient way to compute these similarities, For example, using their cosine.

Approaches for Representing Words Distributional Semantics (Count) Used since the 90’s Sparse word-context PMI/PPMI matrix Decomposed with SVD Word Embeddings (Predict) Inspired by deep learning word2vec (Mikolov et al., 2013) GloVe (Pennington et al., 2014) There are two approaches for creating these vectors: First, we’ve got the traditional count-based approach, where the most common variant is the word-context PMI matrix. The other, cutting-edge, approach, is word embeddings, Which has become extremely popular with word2vec. Now, the interesting thing… …is that both approaches rely on the same linguistic theory: the distributional hypothesis. <pause> Now, in previous work, that I’ll talk about in a minute, we showed that the two approaches are even more related. Underlying Theory: The Distributional Hypothesis (Harris, ’54; Firth, ‘57) “Similar words occur in similar contexts”

Approaches for Representing Words Both approaches: Rely on the same linguistic theory Use the same data Are mathematically related “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014) How come word embeddings are so much better? “Don’t Count, Predict!” (Baroni et al., ACL 2014) More than meets the eye… Not only are they based on the same linguistic theory, They also use the same data, And even have a strong mathematical connection. If that’s the case, then how come word embeddings are so much better? There’s even an ACL paper that shows it! So today I’m going to try and convince you, that there’s more to this story, than meets the eye.

The Contributions of Word Embeddings What’s really improving performance? Novel Algorithms (objective + training method) Skip Grams + Negative Sampling CBOW + Hierarchical Softmax Noise Contrastive Estimation GloVe … New Hyperparameters (preprocessing, smoothing, etc.) Subsampling Dynamic Context Windows Context Distribution Smoothing Adding Context Vectors … When you take a deeper look at the latest word embedding methods, you’ll notice that they actually have 2 main contributions: First, you’ve got the algorithms. This is what seems to be the only contribution, or at least what gets papers into ACL :) But a closer look reveals that these papers also introduce… …new hyperparameters. Now, some of these may seem like they’re part of the algorithms themselves, But what I’m going to show you today, is that they actually can be separated from their original methods, and applied across many other algorithms. So what’s really improving performance?

The Contributions of Word Embeddings What’s really improving performance? Novel Algorithms (objective + training method) Skip Grams + Negative Sampling CBOW + Hierarchical Softmax Noise Contrastive Estimation GloVe … New Hyperparameters (preprocessing, smoothing, etc.) Subsampling Dynamic Context Windows Context Distribution Smoothing Adding Context Vectors … Is it the algorithms?

The Contributions of Word Embeddings What’s really improving performance? Novel Algorithms (objective + training method) Skip Grams + Negative Sampling CBOW + Hierarchical Softmax Noise Contrastive Estimation GloVe … New Hyperparameters (preprocessing, smoothing, etc.) Subsampling Dynamic Context Windows Context Distribution Smoothing Adding Context Vectors … The hyperparameters?

The Contributions of Word Embeddings What’s really improving performance? Novel Algorithms (objective + training method) Skip Grams + Negative Sampling CBOW + Hierarchical Softmax Noise Contrastive Estimation GloVe … New Hyperparameters (preprocessing, smoothing, etc.) Subsampling Dynamic Context Windows Context Distribution Smoothing Adding Context Vectors … Or maybe a bit of both? <pause>

What is word2vec? So what is word2vec?...

What is word2vec? How is it related to PMI? …And how is it related to PMI?

What is word2vec? word2vec is not a single algorithm It is a software package for representing words as vectors, containing: Two distinct models CBoW Skip-Gram Various training methods Negative Sampling Hierarchical Softmax A rich preprocessing pipeline Dynamic Context Windows Subsampling Deleting Rare Words Word2vec is *not* an algorithm. It’s actually a software package With two distinct models Various ways to train these models And a rich preprocessing pipeline

What is word2vec? word2vec is not a single algorithm It is a software package for representing words as vectors, containing: Two distinct models CBoW Skip-Gram (SG) Various training methods Negative Sampling (NS) Hierarchical Softmax A rich preprocessing pipeline Dynamic Context Windows Subsampling Deleting Rare Words Now, we’re going to focus on Skip-Grams with Negative Sampling, Which is considered the state-of-the-art.

Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. So, how does SGNS work? Let’s say we have this sentence, (This example was shamelessly taken from Marco Baroni) “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. And we want to understand what a wampimuk is. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) Marco saw a furry little wampimuk hiding in the tree. words contexts wampimuk furry wampimuk little wampimuk hiding wampimuk in … … 𝐷 (data) What SGNS does is take the context words around “wampimuk”, And in this way, Constructs a dataset of word-context pairs, D. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) SGNS finds a vector 𝑤 for each word 𝑤 in our vocabulary 𝑉 𝑊 Each such vector has 𝑑 latent dimensions (e.g. 𝑑=100) Effectively, it learns a matrix 𝑊 whose rows represent 𝑉 𝑊 Key point: it also learns a similar auxiliary matrix 𝐶 of context vectors In fact, each word has two embeddings 𝑊 𝑑 𝑉 𝑊 𝐶 𝑉 𝐶 𝑑 Now, using D, SGNS is going to learn a vector for each word in the vocabulary A vector of say… 100 latent dimensions. Effectively, it’s learning a matrix W, Where each row is a vector that represents a specific word. Now, a key point in understanding SGNS, Is that it also learns an auxiliary matrix C of context vectors So in fact, each word has two different embeddings Which are not necessarily similar. 𝑤:wampimuk =(−3.1, 4.15, 9.2, −6.5,…) ≠ 𝑐:wampimuk =(−5.6, 2.95, 1.4, −1.3,…) “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) So to train these vectors… “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) Maximize: 𝜎 𝑤 ⋅ 𝑐 𝑐 was observed with 𝑤 words contexts wampimuk furry wampimuk little wampimuk hiding wampimuk in …SGNS maximizes the similarity between word and context vectors that were observed together. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) Maximize: 𝜎 𝑤 ⋅ 𝑐 𝑐 was observed with 𝑤 words contexts wampimuk furry wampimuk little wampimuk hiding wampimuk in Minimize: 𝜎 𝑤 ⋅ 𝑐 ′ 𝑐′ was hallucinated with 𝑤 words contexts wampimuk Australia wampimuk cyber wampimuk the wampimuk 1985 …and minimizes the similarity between word and context vectors hallucinated together. “word2vec Explained…” Goldberg & Levy, arXiv 2014

Skip-Grams with Negative Sampling (SGNS) SGNS samples 𝑘 contexts 𝑐 ′ at random as negative examples “Random” = unigram distribution 𝑃 𝑐 = #𝑐 𝐷 Spoiler: Changing this distribution has a significant effect Now, the way SGNS hallucinates is really important. It’s actually where it gets its name from. For each observed word-context pair, SGNS samples k contexts at random, As negative examples Now, when we say random, we mean the unigram distribution And Spoiler Alert: changing this distribution has quite a significant effect.

Demo http://rare-technologies.com/word2vec-tutorial/#app So, how does SGNS work? Let’s say we have this sentence, (This example was shamelessly taken from Marco Baroni)