Slide 1: RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS
Heng Ji (jih@rpi.edu), Oct 13, 2015
Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg, and Ido Dagan.

Slide 2: Word Similarity & Relatedness
- How similar is pizza to pasta? How related is pizza to Italy?
- Representing words as vectors allows easy computation of similarity.
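Cosine similarity over word vectors is the standard way to quantify this; the following is a minimal sketch, where the tiny 4-dimensional vectors are invented purely for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors, invented for illustration only.
vectors = {
    "pizza": np.array([0.9, 0.8, 0.1, 0.0]),
    "pasta": np.array([0.8, 0.9, 0.2, 0.1]),
    "italy": np.array([0.4, 0.3, 0.9, 0.7]),
}

print(cosine(vectors["pizza"], vectors["pasta"]))  # high: similar foods
print(cosine(vectors["pizza"], vectors["italy"]))  # lower: related, but not similar
```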

Slide 3: Approaches for Representing Words
- Distributional semantics (count): used since the 90s; a sparse word-context PMI/PPMI matrix, optionally decomposed with SVD.
- Word embeddings (predict): inspired by deep learning; word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014).
- Underlying theory: the Distributional Hypothesis (Harris, 1954; Firth, 1957): "Similar words occur in similar contexts."
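As a sketch of the count-based pipeline (co-occurrence counts, then PPMI, then truncated SVD), assuming a toy two-sentence corpus; the window size and dimensionality here are illustrative, not the settings used in the cited papers:

```python
import numpy as np
from collections import Counter

corpus = [["marco", "saw", "a", "furry", "little", "wampimuk"],
          ["the", "wampimuk", "hid", "in", "the", "tree"]]
window = 2

# Count word-context co-occurrences within the window.
pairs = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs[(w, sent[j])] += 1

vocab = sorted({w for s in corpus for w in s})
idx = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pairs.items():
    M[idx[w], idx[c]] = n

# PPMI: max(log P(w,c) / (P(w) P(c)), 0).
total = M.sum()
Pw = M.sum(axis=1, keepdims=True) / total
Pc = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log((M / total) / (Pw * Pc)), 0)

# Dense low-dimensional vectors via truncated SVD of the PPMI matrix.
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :2] * S[:2]  # 2 dimensions, purely for illustration
print(word_vectors[idx["wampimuk"]])
```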

Slide 4: Approaches for Representing Words
- Both approaches rely on the same linguistic theory, use the same data, and are mathematically related ("Neural Word Embedding as Implicit Matrix Factorization", NIPS 2014).
- How come word embeddings are so much better? ("Don't Count, Predict!", Baroni et al., ACL 2014)
- More than meets the eye...

Slide 5: What's Really Improving Performance? The Contributions of Word Embeddings
- Novel algorithms (objective + training method): skip-grams with negative sampling (SGNS), CBOW with hierarchical softmax, noise contrastive estimation, GloVe, ...
- New hyperparameters (preprocessing, smoothing, etc.): subsampling, dynamic context windows, context distribution smoothing, adding context vectors, ...
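For concreteness, here is a minimal sketch of the SGNS objective for a single (word, context) pair with k negative samples; the vectors and ids are toy values and the full training loop is omitted. In word2vec the negatives are drawn from the unigram distribution raised to the 0.75 power (the context distribution smoothing discussed later); here they are drawn uniformly just to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 50, 5                   # illustrative sizes
W = rng.normal(scale=0.1, size=(vocab_size, dim))  # word ("input") vectors
C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context ("output") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(word_id, context_id, negative_ids):
    """-log sigma(w.c)  -  sum over negatives of log sigma(-w.c_neg)."""
    w = W[word_id]
    positive = np.log(sigmoid(w @ C[context_id]))
    negative = np.sum(np.log(sigmoid(-(C[negative_ids] @ w))))
    return -(positive + negative)

# One observed (word, context) pair plus k uniformly drawn negatives.
print(sgns_loss(3, 17, rng.integers(0, vocab_size, size=k)))
```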

Slide 9: But Embeddings Are Still Better, Right?
- Plenty of evidence that embeddings outperform traditional methods: "Don't Count, Predict!" (Baroni et al., ACL 2014); GloVe (Pennington et al., EMNLP 2014).
- How does this fit with our story?

Slide 11: The Big Impact of "Small" Hyperparameters
- word2vec and GloVe are more than just algorithms: they introduce new hyperparameters.
- These may seem minor, but they make a big difference in practice.

Slide 12: Identifying New Hyperparameters

Slide 13: New Hyperparameters
- Preprocessing (word2vec): dynamic context windows, subsampling, deleting rare words.
- Postprocessing (GloVe): adding context vectors.
- Association metric (SGNS): shifted PMI, context distribution smoothing.
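Several of these knobs correspond directly to options in common word2vec implementations. The sketch below assumes gensim 4.x; the parameter names (sample, min_count, negative, ns_exponent) are gensim's, and the tiny corpus and values are illustrative only.

```python
from gensim.models import Word2Vec

sentences = [["marco", "saw", "a", "furry", "little", "wampimuk"],
             ["the", "wampimuk", "hid", "in", "the", "tree"]]  # toy corpus

model = Word2Vec(
    sentences,
    sg=1,              # skip-gram with negative sampling (SGNS)
    vector_size=100,
    window=5,          # maximum window; the effective size is sampled per token
                       #   (the "dynamic context window")
    sample=1e-5,       # subsampling of frequent words
    min_count=1,       # deleting rare words (1 here only so the toy corpus survives)
    negative=5,        # k negative samples, i.e. the shift in "shifted PMI"
    ns_exponent=0.75,  # context distribution smoothing (unigram ^ 0.75)
)
print(model.wv.most_similar("wampimuk", topn=3))
```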

Slide 17: Dynamic Context Windows
Example sentence: "Marco saw a furry little wampimuk hiding in the tree."
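The idea is that context words closer to the target should count for more. word2vec implements this by sampling the effective window size uniformly between 1 and the maximum, which in expectation weights a context at distance d by (window - d + 1) / window; GloVe instead weights it by 1/d. A sketch of both weightings over the example sentence (treat the formulas as a plausible reading of the released toolkits, not a spec):

```python
sentence = "Marco saw a furry little wampimuk hiding in the tree".lower().split()
target = sentence.index("wampimuk")
max_window = 4

for j, context in enumerate(sentence):
    d = abs(j - target)
    if 0 < d <= max_window:
        w2v_weight = (max_window - d + 1) / max_window  # expected weight under sampled window sizes
        glove_weight = 1.0 / d                          # harmonic weighting
        print(f"{context:>8}  distance={d}  word2vec~{w2v_weight:.2f}  glove={glove_weight:.2f}")
```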

Slide 20: Adding Context Vectors
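Both SGNS and GloVe actually learn two matrices, word vectors W and context vectors C; "adding context vectors" means representing each word by w + c rather than by w alone. A minimal sketch with random toy matrices standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["pizza", "pasta", "italy"]
W = rng.normal(size=(len(vocab), 50))  # word vectors (toy, random)
C = rng.normal(size=(len(vocab), 50))  # context vectors (toy, random)

# Represent each word by the sum of its word and context vectors.
combined = W + C

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(combined[0], combined[1]))  # pizza vs pasta under w + c
```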

Slide 22: Adapting Hyperparameters across Algorithms

Slide 23: Context Distribution Smoothing
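Context distribution smoothing raises the context (unigram) distribution to a power alpha, 0.75 in word2vec, before computing the association measure; this dampens PMI's bias toward rare contexts and can be transferred to the count-based side, together with the log(k) shift that mirrors SGNS's k negative samples. A sketch over a toy co-occurrence matrix:

```python
import numpy as np

def shifted_smoothed_ppmi(M, alpha=0.75, k=1):
    """PPMI with context distribution smoothing and a shift of log(k).
    M[i, j] = co-occurrence count of word i with context j."""
    total = M.sum()
    Pw = M.sum(axis=1, keepdims=True) / total
    Pc_smoothed = M.sum(axis=0) ** alpha
    Pc_smoothed = (Pc_smoothed / Pc_smoothed.sum())[None, :]
    with np.errstate(divide="ignore"):
        pmi = np.log((M / total) / (Pw * Pc_smoothed))
    return np.maximum(pmi - np.log(k), 0)

M = np.array([[0, 3, 1],
              [3, 0, 2],
              [1, 2, 0]], dtype=float)  # toy counts
print(shifted_smoothed_ppmi(M, alpha=0.75, k=5))
```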

Slide 26: Comparing Algorithms

Slide 28: Controlled Experiments
- Prior art was unaware of these hyperparameters; essentially, it compared "apples to oranges".
- We allow every algorithm to use every hyperparameter (where transferable).

Slide 29: Systematic Experiments
- 9 hyperparameters (6 of them new).
- 4 word representation algorithms: PPMI (sparse and explicit), SVD(PPMI), SGNS, GloVe.
- 8 benchmarks: 6 word similarity tasks, 2 analogy tasks.
- 5,632 experiments in total.
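Word similarity benchmarks score a model by the Spearman correlation between its cosine similarities and human relatedness ratings over a fixed list of word pairs. A minimal evaluation sketch; the vectors are random and the rated pairs are invented stand-ins for a real benchmark file such as WordSim-353:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(vectors, rated_pairs):
    """vectors: dict word -> np.ndarray; rated_pairs: list of (w1, w2, human_score)."""
    model_scores, human_scores = [], []
    for w1, w2, score in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

rng = np.random.default_rng(2)
vectors = {w: rng.normal(size=50) for w in ["pizza", "pasta", "italy", "tree"]}
pairs = [("pizza", "pasta", 9.0), ("pizza", "italy", 6.5), ("pizza", "tree", 1.0)]  # invented ratings
print(evaluate_similarity(vectors, pairs))
```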

Slide 32: Hyperparameter Settings
- Classic vanilla setting (commonly used for distributional baselines): no special preprocessing or postprocessing; association metric: vanilla PMI/PPMI.
- Recommended word2vec setting (tuned for SGNS): preprocessing with dynamic context windows and subsampling; association metric: shifted PMI/PPMI with context distribution smoothing.

Slide 33: Experiments

Slide 34: Experiments: Prior Art, "Apples to Apples", and "Oranges to Oranges"

Slide 35: Experiments: Hyperparameter Tuning [different settings]

Slide 36: Overall Results
- Hyperparameters often have stronger effects than algorithms.
- Hyperparameters often have stronger effects than more data.
- Prior superiority claims were not accurate.

Slide 37: Re-evaluating Prior Claims

Slide 39: Don't Count, Predict! (Baroni et al., 2014)
- Claim: "word2vec is better than count-based methods."
- Hyperparameter settings account for most of the reported gaps.
- Embeddings do not really outperform count-based methods (except for one task...).

Slide 40: GloVe (Pennington et al., 2014)
- Claim: "GloVe is better than word2vec."
- Hyperparameter settings account for most of the reported gaps: adding context vectors was applied only to GloVe, and different preprocessing was used.
- We observed the opposite: SGNS outperformed GloVe on every task.
- Our largest corpus was 10 billion tokens; perhaps larger corpora behave differently?

Slide 42: Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)
- Claim: "PPMI vectors perform on par with SGNS on analogy tasks."
- Holds for semantic analogies; does not hold for syntactic analogies (MSR dataset).
- Hyperparameter settings account for most of the reported gaps (a different context type was used for the PPMI vectors).
- For syntactic analogies there is a real gap in favor of SGNS.
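Analogy questions of the form "a is to a* as b is to ?" are typically answered with vector arithmetic: 3CosAdd returns the vocabulary word (excluding the three query words) closest to a* - a + b, and Levy and Goldberg (2014) also study a multiplicative variant, 3CosMul. A minimal 3CosAdd sketch over random toy vectors (so the printed answer here is arbitrary):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def solve_analogy(vectors, a, a_star, b):
    """3CosAdd: return the word whose vector is most similar to a_star - a + b."""
    target = normalize(vectors[a_star]) - normalize(vectors[a]) + normalize(vectors[b])
    best, best_score = None, -np.inf
    for w, v in vectors.items():
        if w in (a, a_star, b):
            continue  # exclude the query words themselves
        score = normalize(v) @ normalize(target)
        if score > best_score:
            best, best_score = w, score
    return best

rng = np.random.default_rng(3)
vectors = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen", "tree"]}
print(solve_analogy(vectors, a="man", a_star="king", b="woman"))
```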

Slide 43: Conclusions

Slide 45: Conclusions: Distributional Similarity
- The contributions of word embeddings: novel algorithms and new hyperparameters.
- What's really improving performance? Hyperparameters (mostly).
- The algorithms are still an improvement: SGNS is robust and efficient.

Slide 46: Conclusions: Methodology
- Look for hyperparameters, and adapt them across different algorithms.
- For good results: tune hyperparameters.
- For good science: tune the baselines' hyperparameters too.
Thank you :)

