
1 RELATION EXTRACTION, SYMBOLIC SEMANTICS, DISTRIBUTIONAL SEMANTICS
Heng Ji (jih@rpi.edu), Oct 13, 2015
Acknowledgement: distributional semantics slides from Omer Levy, Yoav Goldberg and Ido Dagan

2 Outline
Task Definition
Symbolic Semantics: Basic Features, World Knowledge, Learning Models
Distributional Semantics

3 Relation Extraction: Task
A relation is a semantic relationship between two entities.
ACE relation type | Example
Agent-Artifact | Rubin Military Design, the makers of the Kursk
Discourse | each of whom
Employment/Membership | Mr. Smith, a senior programmer at Microsoft
Place-Affiliation | Salzburg Red Cross officials
Person-Social | relatives of the dead
Physical | a town some 50 miles south of Salzburg
Other-Affiliation | Republican senators

4 A Simple Baseline with K-Nearest-Neighbor (KNN)
[Figure: a test sample compared against training samples, with K = 3]

5 Relation Extraction with KNN
The distance between a test sample and a training sample accumulates mismatch penalties:
1. If the heads of the mentions don't match: +8
2. If the entity types of the heads of the mentions don't match: +20
3. If the intervening words don't match: +10
[Figure: test sample "the president of the United States" compared against training samples labeled Employment or Physical, such as "the previous president of the United States", "the secretary of NIST", "US forces in Bahrain", "Connecticut's governor", "his ranch in Texas", with the resulting distances (e.g. 46, 26, 36, 0)]
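
To make the distance concrete, here is a minimal sketch of the mismatch-penalty distance plus a KNN vote; the dict-based sample representation and its field names are my own, chosen only for illustration, not the lecture's actual data format.

```python
# Minimal sketch of the mismatch-penalty distance from the slide plus a KNN vote.
# The dict-based sample representation is an assumption for illustration.
from collections import Counter

def distance(a, b):
    d = 0
    if a["heads"] != b["heads"]:            # heads of the mentions don't match
        d += 8
    if a["head_types"] != b["head_types"]:  # entity types of the heads don't match
        d += 20
    if a["between"] != b["between"]:        # intervening words don't match
        d += 10
    return d

def knn_label(test, train, k=3):
    """train is a list of (sample, relation_label) pairs."""
    nearest = sorted(train, key=lambda item: distance(test, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

test = {"heads": ("president", "United States"),
        "head_types": ("PER", "GPE"), "between": ("of", "the")}
train = [({"heads": ("president", "United States"),
           "head_types": ("PER", "GPE"), "between": ("of", "the")}, "Employment"),
         ({"heads": ("ranch", "Texas"),
           "head_types": ("FAC", "GPE"), "between": ("in",)}, "Physical")]
print(knn_label(test, train, k=1))  # -> Employment (distance 0 to the first sample)
```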

6 Typical Relation Extraction Features
Lexical: heads of the mentions and their context words, POS tags
Entity: entity and mention type of the heads of the mentions; entity positional structure; entity context
Syntactic: chunking; premodifier, possessive, preposition, formulaic; the sequence of the heads of the constituents/chunks between the two mentions; the syntactic relation path between the two mentions; dependent words of the mentions
Semantic gazetteers: synonyms in WordNet; name gazetteers; personal relative trigger word list
Wikipedia: whether the head extent of a mention is found (via simple string matching) in the predicted Wikipedia article of another mention
References: Kambhatla, 2004; Zhou et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010, 2011

7 Using Background Knowledge (Chan and Roth, 2010)
Features employed are usually restricted to being defined on the various representations of the target sentences; humans, however, rely on background knowledge to recognize relations.
Overall aim of this work: propose methods of using knowledge or resources that exist beyond the sentence (Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference), either as additional features or under the Constraint Conditional Model (CCM) framework with Integer Linear Programming (ILP).

8-11 Using Background Knowledge
Running example: David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.


12 Using Background Knowledge
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.
Wikipedia evidence (excerpt from the David Cone article): David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as "Cone-Heads." Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network. [...] Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues).

13 Using Background Knowledge
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.
Fine-grained relation probabilities:
Employment:Staff 0.20
Employment:Executive 0.15
Personal:Family 0.10
Personal:Business 0.10
Affiliation:Citizen 0.20
Affiliation:Based-in 0.25

14-16 Using Background Knowledge
The same example, now with coarse-grained predictions alongside the fine-grained ones:
fine-grained                    coarse-grained
Employment:Staff      0.20      Employment   0.35
Employment:Executive  0.15
Personal:Family       0.10      Personal     0.40
Personal:Business     0.10
Affiliation:Citizen   0.20      Affiliation  0.25
Affiliation:Based-in  0.25
Highlighted combined score: 0.55 (Employment 0.35 + Employment:Staff 0.20)

17 Knowledge 1: Wikipedia, feature 1 (as additional feature)
We use a Wikifier system (Ratinov et al., 2010) which performs context-sensitive mapping of mentions to Wikipedia pages.
Introduce a new feature based on this mapping, and a further feature by combining it with the coarse-grained entity types of m_i, m_j.
[Figure: mention pair m_i, m_j with unknown relation r]

18 Knowledge 1: Wikipedia, feature 2 (as additional feature)
Given m_i, m_j, we use a Parent-Child system (Do and Roth, 2010) to predict whether they have a parent-child relation.
Introduce a new feature based on this prediction, and combine it with the coarse-grained entity types of m_i, m_j.
[Figure: mention pair m_i, m_j: parent-child?]

19-21 Knowledge 2: Word Class Information (as additional feature)
Supervised systems face an issue of data sparseness (of lexical features).
Use class information of words to support better generalization; instantiated as word clusters in our work.
Clusters are automatically generated from unlabeled texts using the algorithm of (Brown et al., 1992).
[Figure: binary Brown-cluster tree over words such as apple, pear, Apple, IBM, bought, run, of, in; each word receives a bit string from the 0/1 branch labels, e.g. IBM → 011]

22 Knowledge 2: Word Class Information
All lexical features consisting of single words are duplicated with their corresponding bit-string representations (e.g. the prefixes 00, 01, 10, 11 in the figure).
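
As a small illustration of the feature duplication, here is a sketch using toy cluster codes consistent with the figure (e.g. IBM → 011); the exact codes and the feature-naming scheme are my assumptions.

```python
# Toy Brown-cluster bit strings (consistent with the figure) and feature duplication:
# every single-word lexical feature also fires with a cluster bit-string prefix.
clusters = {"apple": "000", "pear": "001", "Apple": "010", "IBM": "011",
            "bought": "100", "run": "101", "of": "110", "in": "111"}

def lexical_features(word, prefix_lengths=(2,)):
    feats = ["word=" + word]
    code = clusters.get(word)
    if code is not None:
        for n in prefix_lengths:               # e.g. 2-bit prefixes 00, 01, 10, 11
            feats.append("cluster%d=%s" % (n, code[:n]))
    return feats

print(lexical_features("IBM"))    # ['word=IBM', 'cluster2=01']
print(lexical_features("Apple"))  # ['word=Apple', 'cluster2=01'] -> shared prefix generalizes
```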

23-24 Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
Score an assignment y for input x as
    w · φ(x, y) − Σ_k ρ_k · d(y, 1_{C_k(x)})
where w is the weight vector for the "local" models (a collection of classifiers), ρ_k is the penalty for violating constraint C_k, and d(y, 1_{C_k(x)}) measures how far y is from a "legal" assignment.

25 Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
Knowledge sources plugged in as features or constraints: Wikipedia, word clusters, hierarchy of relations, entity type constraints, coreference.

26 Constraint Conditional Models (CCMs)
David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team.
fine-grained                    coarse-grained
Employment:Staff      0.20      Employment   0.35
Employment:Executive  0.15
Personal:Family       0.10      Personal     0.40
Personal:Business     0.10
Affiliation:Citizen   0.20      Affiliation  0.25
Affiliation:Based-in  0.25

27 Constraint Conditional Models (CCMs) (Roth and Yih, 2007; Chang et al., 2008)
Key steps:
1. Write down a linear objective function
2. Write down constraints as linear inequalities
3. Solve using integer linear programming (ILP) packages

28 Knowledge 3: Relations between our target relations
[Figure: relation hierarchy, e.g. coarse-grained personal and employment with fine-grained children family, biz and executive, staff]

29 Knowledge 3: Hierarchy of Relations
[Figure: the same hierarchy; a coarse-grained classifier predicts the top level (personal, employment, ...) and a fine-grained classifier predicts the leaves (family, biz, executive, staff)]

30 Knowledge 3: Hierarchy of Relations
[Figure: for a mention pair m_i, m_j, predict both a coarse-grained and a fine-grained relation over the hierarchy]


36-37 Knowledge 3: Hierarchy of Relations
Write down a linear objective function:
    maximize  Σ_rc p_coarse(rc) · x_rc  +  Σ_rf p_fine(rf) · x_rf
where p_coarse and p_fine are the coarse-grained and fine-grained prediction probabilities, and x_rc, x_rf are binary indicator variables; setting an indicator variable to 1 corresponds to assigning that relation label.

38 Knowledge 3: Hierarchy of Relations
Write down constraints:
- If a relation R is assigned a coarse-grained label rc, then we must also assign to R a fine-grained label rf which is a child of rc.
- Capturing the inverse relationship: if we assign rf to R, then we must also assign to R the parent of rf, which is the corresponding coarse-grained label.
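
A minimal sketch of this ILP, assuming the PuLP package; the probabilities are the illustrative numbers from the earlier slides, and the variable and label names are mine.

```python
# Hierarchy ILP sketch: pick one coarse- and one fine-grained label consistently.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

coarse = {"Employment": 0.35, "Personal": 0.40, "Affiliation": 0.25}
fine = {("Employment", "Staff"): 0.20, ("Employment", "Executive"): 0.15,
        ("Personal", "Family"): 0.10, ("Personal", "Business"): 0.10,
        ("Affiliation", "Citizen"): 0.20, ("Affiliation", "Based_in"): 0.25}

prob = LpProblem("relation_hierarchy", LpMaximize)
xc = {rc: LpVariable("c_" + rc, cat=LpBinary) for rc in coarse}
xf = {rf: LpVariable("f_%s_%s" % rf, cat=LpBinary) for rf in fine}

# Linear objective: probabilities of the chosen coarse- and fine-grained labels.
prob += lpSum(coarse[rc] * xc[rc] for rc in coarse) + \
        lpSum(fine[rf] * xf[rf] for rf in fine)

# Exactly one label at each granularity.
prob += lpSum(xc.values()) == 1
prob += lpSum(xf.values()) == 1

# Hierarchy constraints: a fine-grained label implies its coarse-grained parent,
# and a chosen coarse-grained label must have one of its children chosen.
for (rc, _), var in xf.items():
    prob += var <= xc[rc]
for rc in coarse:
    prob += xc[rc] <= lpSum(v for (p, _), v in xf.items() if p == rc)

prob.solve()
print([rc for rc in coarse if xc[rc].varValue == 1],
      [rf for rf in fine if xf[rf].varValue == 1])
# -> ['Employment'] [('Employment', 'Staff')]  (combined score 0.35 + 0.20 = 0.55)
```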

39 Knowledge 4: Entity Type Constraints (Roth and Yih, 2004, 2007)
Entity types are useful for constraining the possible labels that a relation R can assume.
[Figure: mention pair m_i, m_j with candidate labels Employment:Staff, Employment:Executive, Personal:Family, Personal:Business, Affiliation:Citizen, Affiliation:Based-in]

40 Knowledge 4: Entity Type Constraints (Roth and Yih, 2004, 2007)
Entity types are useful for constraining the possible labels that a relation R can assume.
[Figure: the candidate labels annotated with their allowed entity-type pairs for m_i, m_j (per, org, gpe)]

41 Knowledge 4: Entity Type Constraints (Roth and Yih, 2004, 2007)
We gather information on entity type constraints from the ACE-2004 documentation and impose them on the coarse-grained relations.
By improving the coarse-grained predictions and combining them with the hierarchical constraints defined earlier, the improvements propagate to the fine-grained predictions.
[Figure: the candidate labels with their allowed entity-type pairs for m_i, m_j]

42 Knowledge 5: Coreference
[Figure: mention pair m_i, m_j with the candidate relation labels listed above]

43 Knowledge 5: Coreference
In this work, we assume that we are given the coreference information, which is available from the ACE annotation.
[Figure: the candidate labels plus a "null" label, used when the two mentions corefer]

44 Experiment Results
F1 improvement from using each knowledge source:
          All nwire   10% of nwire
BasicRE   50.5%       31.0%

45 Most Successful Learning Methods: Kernel-based
Consider different levels of syntactic information:
- Deep processing of text produces structural but less reliable results
- Simple surface information is less structural, but more reliable
Generalization of feature-based solutions:
- A kernel (kernel function) defines a similarity metric Ψ(x, y) on objects
- No need for enumeration of features
- Efficient extension of normal features into high-order spaces
- Possible to solve linearly non-separable problems in a higher-order space
Nice combination properties:
- Closed under linear combination
- Closed under polynomial extension
- Closed under direct sum/product on different domains
References: Zelenko et al., 2002, 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhao and Grishman, 2005; Che et al., 2005; Zhang et al., 2006; Qian et al., 2007; Zhou et al., 2007; Khayyamian et al., 2009; Reichartz et al., 2009

46 Kernel Examples for Relation Extraction (Zhao and Grishman, 2005)
1) Argument kernel
2) Local dependency kernel
3) Path kernel
Each is defined in terms of a token kernel K_T that compares individual tokens, and the individual kernels are combined into composite kernels (e.g. by linear combination and polynomial extension).
[Slide shows the formal definitions of K_T and of the composite kernels]
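
Since the kernel definitions themselves did not survive in the transcript, the following is only a generic sketch of the idea: a token kernel comparing word, POS, and entity-type attributes, an argument kernel summing it over the two relation arguments, and a composite kernel built as a linear combination. These are illustrative definitions, not Zhao and Grishman's exact ones.

```python
# Generic sketch of token / argument / composite kernels over relation instances.
# Each token is a dict of attributes; a relation instance carries two argument tokens.
def token_kernel(t1, t2):
    # Count how many attributes (word, POS, entity type) the two tokens share.
    return sum(1.0 for key in ("word", "pos", "etype") if t1.get(key) == t2.get(key))

def argument_kernel(r1, r2):
    # Compare the two relation arguments token by token.
    return token_kernel(r1["arg1"], r2["arg1"]) + token_kernel(r1["arg2"], r2["arg2"])

def composite_kernel(r1, r2, kernels, weights):
    # Kernels are closed under linear combination (and polynomial extension),
    # so weighted sums of individual kernels are again valid kernels.
    return sum(w * k(r1, r2) for k, w in zip(kernels, weights))

r1 = {"arg1": {"word": "Cone", "pos": "NNP", "etype": "PER"},
      "arg2": {"word": "team", "pos": "NN", "etype": "ORG"}}
r2 = {"arg1": {"word": "Smith", "pos": "NNP", "etype": "PER"},
      "arg2": {"word": "Microsoft", "pos": "NNP", "etype": "ORG"}}
print(composite_kernel(r1, r2, [argument_kernel], [1.0]))  # 3.0
```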

47 Bootstrapping for Relation Extraction
Pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table
Occurrences of seed tuples:
- Computer servers at Microsoft's headquarters in Redmond...
- In mid-afternoon trading, shares of Redmond-based Microsoft fell...
- The Armonk-based IBM introduced a new line...
- The combined company will operate from Boeing's headquarters in Seattle.
- Intel, Santa Clara, cut prices of its Pentium processor.

48 Bootstrapping for Relation Extraction (cont'd)
Pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table
Learned patterns (slots filled by an organization and a location):
- <ORG>'s headquarters in <LOC>
- <LOC>-based <ORG>
- <ORG>, <LOC>

49 Bootstrapping for Relation Extraction (cont'd)
Pipeline: Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table
Generate new seed tuples; start a new iteration.
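
A compact sketch of the bootstrap loop over (organization, location) tuples; representing a pattern as the literal string between the two arguments, and the helper names, are simplifications of mine rather than the slide's exact pipeline.

```python
# Bootstrap sketch: seeds -> occurrences -> patterns -> new seeds.
import re

corpus = [
    "Computer servers at Microsoft's headquarters in Redmond were overloaded.",
    "In mid-afternoon trading, shares of Redmond-based Microsoft fell.",
    "The Armonk-based IBM introduced a new line.",
    "The combined company will operate from Boeing's headquarters in Seattle.",
]
seeds = {("Microsoft", "Redmond")}

def find_patterns(seeds, corpus):
    # A "pattern" here is just the literal text between the two seed arguments.
    patterns = set()
    for org, loc in seeds:
        for sent in corpus:
            for a, b in ((org, loc), (loc, org)):
                m = re.search(re.escape(a) + r"(.{1,20}?)" + re.escape(b), sent)
                if m:
                    patterns.add((m.group(1), a == org))  # flag: org comes first
    return patterns

def apply_patterns(patterns, corpus):
    new = set()
    for middle, org_first in patterns:
        regex = r"(\w+)" + re.escape(middle) + r"(\w+)"
        for sent in corpus:
            for x, y in re.findall(regex, sent):
                new.add((x, y) if org_first else (y, x))
    return new

patterns = find_patterns(seeds, corpus)    # e.g. "'s headquarters in ", "-based "
seeds |= apply_patterns(patterns, corpus)  # picks up (Boeing, Seattle), (IBM, Armonk)
print(sorted(seeds))
```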

50 Outline
Task Definition
Symbolic Semantics: Basic Features, World Knowledge, Learning Models
Distributional Semantics

51 Word Similarity & Relatedness
How similar is pizza to pasta? How related is pizza to Italy?
Representing words as vectors allows easy computation of similarity.
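
As a concrete starting point, a minimal cosine-similarity sketch over toy vectors; the numbers are made up purely for illustration.

```python
# Cosine similarity between word vectors; toy 3-dimensional vectors for illustration.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vec = {"pizza": [0.9, 0.8, 0.1],   # made-up vectors
       "pasta": [0.8, 0.9, 0.2],
       "Italy": [0.3, 0.4, 0.9]}

print(cosine(vec["pizza"], vec["pasta"]))  # high: similar words (~0.99 here)
print(cosine(vec["pizza"], vec["Italy"]))  # lower: related but less similar (~0.55 here)
```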

52 Approaches for Representing Words
Underlying theory: the Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts."
Distributional Semantics (Count): used since the 90's; sparse word-context PMI/PPMI matrix; decomposed with SVD.
Word Embeddings (Predict): inspired by deep learning; word2vec (Mikolov et al., 2013); GloVe (Pennington et al., 2014).
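
A minimal sketch of the count-based pipeline: build a word-context co-occurrence matrix, convert it to PPMI, and reduce it with SVD. Plain numpy; the two-sentence corpus is purely illustrative.

```python
# Count-based pipeline sketch: co-occurrence counts -> PPMI -> truncated SVD.
import numpy as np

corpus = [["marco", "saw", "a", "furry", "little", "wampimuk"],
          ["the", "furry", "little", "cat", "saw", "a", "wampimuk"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total      # P(w)
pc = counts.sum(axis=0, keepdims=True) / total      # P(c)
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.maximum(pmi, 0)                            # positive PMI; zero counts -> 0

U, S, Vt = np.linalg.svd(ppmi)                       # decompose with SVD
d = 2
word_vectors = U[:, :d] * S[:d]                      # low-dimensional word vectors
print(word_vectors.shape)
```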

53 Approaches for Representing Words
Both approaches:
- Rely on the same linguistic theory
- Use the same data
- Are mathematically related ("Neural Word Embedding as Implicit Matrix Factorization", NIPS 2014)
How come word embeddings are so much better? ("Don't Count, Predict!", Baroni et al., ACL 2014)
More than meets the eye…

54 What's really improving performance? The Contributions of Word Embeddings
Novel algorithms (objective + training method): Skip-Grams + Negative Sampling; CBOW + Hierarchical Softmax; Noise Contrastive Estimation; GloVe; …
New hyperparameters (preprocessing, smoothing, etc.): Subsampling; Dynamic Context Windows; Context Distribution Smoothing; Adding Context Vectors; …


58-59 Our Contributions
1) Identifying the existence of new hyperparameters (not always mentioned in papers)
2) Adapting the hyperparameters across algorithms (must understand the mathematical relation between algorithms)
3) Comparing algorithms across all hyperparameter settings (over 5,000 experiments)

60 Background

61 What is word2vec?

62 What is word2vec? How is it related to PMI?

63-64 What is word2vec?
word2vec is not a single algorithm.
It is a software package for representing words as vectors, containing:
- Two distinct models: CBoW, Skip-Gram (SG)
- Various training methods: Negative Sampling (NS), Hierarchical Softmax
- A rich preprocessing pipeline: Dynamic Context Windows, Subsampling, Deleting Rare Words

65 Demo: http://rare-technologies.com/word2vec-tutorial/#app
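
For readers following the demo, a small usage sketch of the gensim package the linked tutorial is based on; parameter names differ across gensim versions (e.g. vector_size was called size in older releases), so treat the exact signature as version-dependent.

```python
# Training a Skip-Gram model with negative sampling via gensim (gensim 4.x API;
# older versions use different parameter names).
from gensim.models import Word2Vec

sentences = [["marco", "saw", "a", "furry", "little", "wampimuk"],
             ["the", "wampimuk", "was", "hiding", "in", "the", "tree"]]

model = Word2Vec(sentences,
                 vector_size=50,   # embedding dimensionality
                 window=5,         # maximum (dynamic) context window
                 sg=1,             # 1 = Skip-Gram, 0 = CBoW
                 negative=5,       # negative samples per positive pair
                 min_count=1)      # keep every word in this toy corpus

print(model.wv.most_similar("wampimuk", topn=3))
```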

66 Skip-Grams with Negative Sampling (SGNS) ("word2vec Explained…", Goldberg & Levy, arXiv 2014)
Marco saw a furry little wampimuk hiding in the tree.

68 Skip-Grams with Negative Sampling (SGNS)
Marco saw a furry little wampimuk hiding in the tree.
Extracted (word, context) pairs:
wampimuk → furry
wampimuk → little
wampimuk → hiding
wampimuk → in
…
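
A small sketch of how the (word, context) training pairs above can be extracted with a fixed window of 2; the function and variable names are mine.

```python
# Extract (word, context) pairs with a symmetric window, as in the table above.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "Marco saw a furry little wampimuk hiding in the tree .".split()
print([c for w, c in skipgram_pairs(sentence) if w == "wampimuk"])
# -> ['furry', 'little', 'hiding', 'in']
```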

69-73 Skip-Grams with Negative Sampling (SGNS) ("word2vec Explained…", Goldberg & Levy, arXiv 2014)
SGNS learns a vector w for each word and a vector c for each context so that observed (word, context) pairs score high and k randomly sampled ("negative") contexts score low; for each observed pair the objective term is
    log σ(w · c) + k · E_{cN ~ P(D)} [ log σ(−w · cN) ]
where σ is the sigmoid function and k is the number of negative samples.

74-80 What is SGNS learning? ("Neural Word Embeddings as Implicit Matrix Factorization", Levy & Goldberg, NIPS 2014)
At the optimum, SGNS implicitly factorizes a shifted version of the traditional word-context PMI matrix:
    w · c = PMI(w, c) − log k

81 What is SGNS learning?
- SGNS is doing something very similar to the older approaches
- SGNS is factorizing the traditional word-context PMI matrix
- So does SVD!
- GloVe factorizes a similar word-context matrix

82 But embeddings are still better, right?
Plenty of evidence that embeddings outperform traditional methods:
- "Don't Count, Predict!" (Baroni et al., ACL 2014)
- GloVe (Pennington et al., EMNLP 2014)
How does this fit with our story?

83 The Big Impact of "Small" Hyperparameters

84 The Big Impact of "Small" Hyperparameters
word2vec & GloVe are more than just algorithms…
They introduce new hyperparameters, which may seem minor but make a big difference in practice.

85 Identifying New Hyperparameters

86 New Hyperparameters
Preprocessing (word2vec): Dynamic Context Windows; Subsampling; Deleting Rare Words
Postprocessing (GloVe): Adding Context Vectors
Association Metric (SGNS): Shifted PMI; Context Distribution Smoothing


90-92 Dynamic Context Windows
Marco saw a furry little wampimuk hiding in the tree.
word2vec samples the effective window size for each target word uniformly between 1 and the maximum L, so a context word at distance d is kept with probability (L − d + 1) / L and nearer contexts count more; GloVe instead weights a context at distance d by 1/d.
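
A sketch of the word2vec-style dynamic window, assuming the description above: the effective window size is sampled per target word, which in expectation weights a context at distance d by (L − d + 1) / L.

```python
# Dynamic context window: sample the effective window size per target word,
# so nearby contexts are used more often than distant ones.
import random

def dynamic_skipgram_pairs(tokens, max_window=5, rng=random):
    pairs = []
    for i, word in enumerate(tokens):
        window = rng.randint(1, max_window)   # sampled per target word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

sentence = "Marco saw a furry little wampimuk hiding in the tree .".split()
random.seed(0)
print(dynamic_skipgram_pairs(sentence, max_window=5)[:6])
```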

93-94 Adding Context Vectors
Instead of representing a word by its word vector w alone, represent it by w + c, the sum of its word and context vectors; this postprocessing step was introduced for GloVe and can be applied to the other methods as well.

95 Adapting Hyperparameters across Algorithms

96-98 Context Distribution Smoothing
word2vec samples negative contexts from a smoothed unigram distribution, P(c)^0.75 (normalized), rather than the raw unigram distribution.
The same smoothing can be transferred to the count-based methods by raising the context probability to the 0.75 power inside PMI:
    PMI_0.75(w, c) = log [ P(w, c) / (P(w) · P_0.75(c)) ]
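
A short sketch of the smoothing itself, assuming the 0.75 exponent described above; `counts` stands for context (unigram) counts from a corpus and the toy numbers are illustrative.

```python
# Context distribution smoothing: raise context counts to the 0.75 power before
# normalizing, which lifts the probability mass of rare contexts relative to frequent ones.
def smoothed_context_probs(counts, alpha=0.75):
    weights = {c: n ** alpha for c, n in counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

counts = {"the": 1000, "furry": 10, "wampimuk": 1}    # toy unigram counts
print(smoothed_context_probs(counts))
# rare contexts get relatively more mass than under raw normalized counts
```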

99 Comparing Algorithms

100-101 Controlled Experiments
Prior art was unaware of these hyperparameters, essentially comparing "apples to oranges".
We allow every algorithm to use every hyperparameter (where transferable).

102 Systematic Experiments
9 hyperparameters (6 new)
4 word representation algorithms: PPMI (sparse & explicit), SVD(PPMI), SGNS, GloVe
8 benchmarks: 6 word similarity tasks, 2 analogy tasks
5,632 experiments

104-105 Hyperparameter Settings
Classic vanilla setting (commonly used for distributional baselines): vanilla preprocessing, postprocessing, and association metric (PMI/PPMI).
Recommended word2vec setting (tuned for SGNS): preprocessing with dynamic context windows and subsampling; association metric with shifted PMI/PPMI and context distribution smoothing.

106 Experiments

107 Experiments: Prior Art / "Apples to Apples" / "Oranges to Oranges"
[Results tables shown on the slide]

108 Experiments: Hyperparameter Tuning
[Results under different hyperparameter settings]

109 Overall Results
- Hyperparameters often have stronger effects than algorithms
- Hyperparameters often have stronger effects than more data
- Prior superiority claims were not accurate

110 Re-evaluating Prior Claims

111-112 Don't Count, Predict! (Baroni et al., 2014)
Claim: "word2vec is better than count-based methods."
Hyperparameter settings account for most of the reported gaps; embeddings do not really outperform count-based methods (except for one task…).

113 GloVe (Pennington et al., 2014)
Claim: "GloVe is better than word2vec."
Hyperparameter settings account for most of the reported gaps: adding context vectors was applied only to GloVe, and the preprocessing differed.
We observed the opposite: SGNS outperformed GloVe on every task.
Our largest corpus: 10 billion tokens; perhaps larger corpora behave differently?

115 Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)
Claim: "PPMI vectors perform on par with SGNS on analogy tasks."
- Holds for semantic analogies; does not hold for syntactic analogies (MSR dataset)
- Hyperparameter settings account for most of the reported gaps (different context type for PPMI vectors)
- Syntactic analogies: there is a real gap in favor of SGNS

116 Conclusions

117-118 Conclusions: Distributional Similarity
The contributions of word embeddings: novel algorithms and new hyperparameters.
What's really improving performance? Hyperparameters (mostly); the algorithms are an improvement; SGNS is robust & efficient.

119 Conclusions: Methodology
- Look for hyperparameters
- Adapt hyperparameters across different algorithms
- For good results: tune hyperparameters
- For good science: tune baselines' hyperparameters
Thank you :)

