Are Distributional Dimensions Semantic Features?


1 Are Distributional Dimensions Semantic Features?
Katrin Erk, University of Texas at Austin
Meaning in Context Symposium, München, September 2015
Joint work with Gemma Boleda

2 Semantic features by example: Katz & Fodor
The different meanings of a word are characterized by lists of semantic features

3 Semantic features
In linguistics: Katz & Fodor, Wierzbicka, Jackendoff, Bierwisch, Pustejovsky, Asher, …
In computational linguistics/AI: Schank, Wilks, Masterman, Sowa, …
Schank: Conceptual Dependencies
"drink" in preference semantics (Wilks): ((*ANI SUBJ) (((FLOW STUFF) OBJE) (MOVE CAUSE)))

4 Semantic features: Characteristics
Primitive (not themselves defined), unanalyzable
Small set
Lexicalized in all languages
Combined, they characterize the semantics of all lexical expressions in all languages
Precise, fixed meaning, which is not part of language (Wilks: not so)
Individually enable inferences
Feature lists or complex graphs
(Compiled from: Wierzbicka, Geeraerts, Schank)

5 Uses of semantic features
Event structure in the lexical semantics of verbs (Levin): change-of-state verbs: [[x ACT] CAUSE [BECOME [y <result-state>]]]
Handle polysemy (Pustejovsky, Asher)
Characterize selectional constraints (e.g. in VerbNet)
Characterize synonyms, also cross-linguistically (application: translation)
Enable inferences: John is a bachelor => John is unmarried, John is a man
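To make the inference example concrete, here is a minimal sketch of feature-list inference in Python; the feature inventory is hypothetical, not taken from any published lexicon:

```python
# A minimal sketch of Katz & Fodor-style feature-list inference.
# The feature inventory is illustrative, not from a published lexicon.
LEXICON = {
    "bachelor":  {"HUMAN", "MALE", "ADULT", "UNMARRIED"},
    "man":       {"HUMAN", "MALE", "ADULT"},
    "unmarried": {"UNMARRIED"},
}

def entails(word: str, hypothesis: str) -> bool:
    """A word entails a hypothesis word if the hypothesis's features
    are a subset of the word's features."""
    return LEXICON[hypothesis] <= LEXICON[word]

assert entails("bachelor", "man")        # John is a bachelor => John is a man
assert entails("bachelor", "unmarried")  # => John is unmarried
```

Each individual feature (UNMARRIED, MALE, …) licenses an inference on its own; this is the "individual inference" mode discussed below.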

6 Are distributional dimensions semantic features?
alligator: believe-v, american-a, kill-v, consider-v, seem-v, turn-v, side-n, serve-v, involve-v, report-v, little-a, big-a, water-n, attack-n, much-a, …
Computed from ukWaC + Wikipedia + BNC + Gigaword, 2-word window, PPMI transform
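A hedged sketch of the pipeline named on the slide: counting co-occurrences in a 2-word window and applying the PPMI transform. The toy corpus stands in for ukWaC + Wikipedia + BNC + Gigaword, and the function names are mine:

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the real billion-word collection.
corpus = "the alligator attacked near the water the big alligator".split()

def cooccurrence_counts(tokens, window=2):
    """Count context words within +/- window positions of each target."""
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

def ppmi_vectors(counts):
    """PPMI(w, c) = max(0, log2 P(w,c) / (P(w) P(c)))."""
    total = sum(sum(ctxs.values()) for ctxs in counts.values())
    target_marginal = {w: sum(c.values()) for w, c in counts.items()}
    context_marginal = Counter()
    for ctxs in counts.values():
        context_marginal.update(ctxs)
    return {
        w: {c: max(0.0, math.log2(n * total /
                                  (target_marginal[w] * context_marginal[c])))
            for c, n in ctxs.items()}
        for w, ctxs in counts.items()
    }

vectors = ppmi_vectors(cooccurrence_counts(corpus))
print(vectors["alligator"])  # sparse PPMI vector: context word -> weight
```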

7 Are distributional dimensions semantic features?
[The] differences between vector space encoding and more familiar accounts of meaning is easy to exaggerate. For example, a vector space encoding is entirely compatible with the traditional doctrine that concepts are 'bundles' of semantic features. Indeed, the latter is a special case of the former, the difference being that […] semantic dimensions are allowed to be continuous.
(Fodor and Lepore 1999, All at Sea in Semantic Space; about connectionism, particularly Churchland, not distributional models)

8 Are distributional dimensions semantic features?
If so, they either address or inherit methodological problems:
Coverage of a realistic vocabulary
Empirically determining semantic features
Meaning creep: predicates used in Cyc did not stay stable in their meaning over the years (Wilks 2008)

9 Are distributional dimensions semantic features?
If so, they inherit theoretical problems:
Lewis 1970: "Markerese"
Fodor et al. 1980, Against Definitions; Fodor and Lepore 1999, All at Sea in Semantic Space
Asymmetry between words and primitives: what makes the primitives more basic?
Also, how can people communicate if their semantic spaces differ?

10 Outline
Differences between distributional dimensions and semantic features
Redefining the dichotomy
No dichotomy after all
Integrated inference

11 Semantic features: Characteristics
Primitive (not themselves defined), unanalyzable
Small set
Lexicalized in all languages
Combined, they characterize the semantics of all lexical expressions in all languages
Precise, fixed meaning, not part of language
Individually enable inferences
Feature lists or complex graphs

12 Neither primitive nor with a fixed meaning
Not unanalyzable: any distributional feature can in principle be a distributional target
Compare: target and dimensions as a graph, with similarity determined on the basis of random walks
(Figure: a graph linking the target to dimensions d1, d2, d3, with d1 linked in turn to dd1)
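A sketch of this graph view, assuming a small hypothetical graph matching the figure and similarity from random walks with restart (the walk parameters are illustrative):

```python
import numpy as np

# Targets and dimensions are all nodes of one graph; a node's
# representation is the distribution reached by short random walks
# with restart. The adjacency structure below is hypothetical.
nodes = ["target", "d1", "d2", "d3", "dd1"]
adjacency = np.array([
    [0, 1, 1, 1, 0],   # target co-occurs with dimensions d1, d2, d3
    [1, 0, 0, 0, 1],   # d1 in turn co-occurs with dd1:
    [1, 0, 0, 0, 0],   #   dimensions are themselves analyzable targets
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
P = adjacency / adjacency.sum(axis=1, keepdims=True)  # transition matrix

def walk_representation(start, steps=5, restart=0.15):
    """Random walk with restart; the resulting distribution over nodes
    serves as the start node's representation."""
    v = np.zeros(len(nodes)); v[nodes.index(start)] = 1.0
    p = v.copy()
    for _ in range(steps):
        p = (1 - restart) * P.T @ p + restart * v
    return p

print(dict(zip(nodes, walk_representation("target").round(3))))
```

Similarity of two nodes is then closeness of their walk profiles, and nothing distinguishes "target" nodes from "dimension" nodes in kind.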

13 Neither primitive nor with a fixed meaning
But are they treated as unanalyzed in practice?
Features in a vector are usually not analyzed further
SVD, topic modeling, and prediction-based models induce latent features by exploiting distributional properties of the original features
Are latent features unanalyzable? No: they are linked to the original dimensions
No fixed meaning: distributional features can be ambiguous
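A minimal sketch of latent-feature induction via truncated SVD over a toy PPMI matrix; the data is invented, but it shows that each latent dimension remains a readable mix of the original context words:

```python
import numpy as np

# Rows are targets, columns are context words (toy PPMI weights).
contexts = ["ride", "bark", "road"]
M = np.array([[2.1, 0.0, 1.3],    # motorcycle
              [1.9, 0.1, 1.1],    # bicycle
              [0.0, 2.4, 0.2]])   # dog
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
latent = U[:, :k] * S[:k]          # k latent features per target

# Each latent dimension is a weighted mix of the original context
# words, readable off the rows of Vt: latent features are not
# unanalyzable, they are linked to the original dimensions.
for row in Vt[:k]:
    print(dict(zip(contexts, row.round(2))))
```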

14 Then is it "Markerese"?
Inference = deriving something non-distributional from distributional representations
Inference from relations to other words:
"X cause Y" and "Y trigger X" occur with similar X, Y; hence the two constructions are probably close in meaning
"alligator" appears in a subset of the contexts of "animal"; hence alligators are probably animals
Inference from co-occurrence with extralinguistic information:
Distributional vectors linked to images for the same target
Alligators are similar to crocodiles, crocodiles are listed in the ontology as animals, hence alligators are probably animals

15 No individual inferences
The distributional representation as a whole, in the aggregate, allows for inferences using aggregate techniques (sketched below):
Distributional similarity
Distributional inclusion
Whole-vector mappings to visual vectors
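A sketch of the first two aggregate techniques on toy vectors; the inclusion score is a simple illustrative variant, not a specific published measure:

```python
import numpy as np

def cosine(u, v):
    """Distributional similarity: angle between whole vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def inclusion(u, v):
    """Toy distributional-inclusion score: the share of u's weight
    that falls on contexts v also has."""
    return np.minimum(u, v).sum() / u.sum()

# Hypothetical PPMI weights over four shared context dimensions.
alligator = np.array([3.0, 1.0, 0.0, 2.0])
animal    = np.array([4.0, 2.0, 5.0, 3.0])
print(cosine(alligator, animal))     # aggregate similarity
print(inclusion(alligator, animal))  # high inclusion hints at hypernymy
```

Both scores only make sense over the vector as a whole; no single dimension carries the inference.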

16 No individual inferences
Feature-based inference is possible even with "John Doe" features:
Take a text representation
Take it apart into features that are individually almost meaningless
The aggregate of such features allows for inferences

17 Outline
Differences between distributional dimensions and semantic features
Redefining the dichotomy
No dichotomy after all
Integrated inference

18 Redefining the dichotomy
Not semantic features versus distributional dimensions, but individual features versus aggregate features
Individual features:
Individually allow for inferences
May be relevant to grammar
Are introspectively salient
Not necessarily primitive; also hypernyms and synonyms
Aggregate features:
May be individually almost meaningless
Allow for aggregate inference
Two modes of inference: individual and aggregate

19 Individual features in distributional representations
Some distributional dimensions can be cognitively relevant features
Thill et al. 2014: because distributional models focus on how words are frequently used, they point to how humans experience concepts
freedom (features from Baroni & Lenci 2010):
positive events: guarantee, secure, grant, defend, respect
negative events: undermine, deny, infringe on, violate

20 Individual features in distributional representations
Approaches that find cognitively plausible features distributionally:
Almuhareb & Poesio 2004
Cimiano & Wenderoth 2007
Schulte im Walde et al. 2008: German association norms
Baroni et al. 2010: Strudel
Baroni & Lenci 2010: Distributional Memory
Devereux et al. 2010: dependency paths extracted from Wikipedia

21 Individual features in distributional representations
Difficult: only a small fraction of human-elicited features can be retrieved
Baroni et al. 2010: distributional features tend to be different from human-elicited features, with a preference for "'actional' and 'situated' descriptions"
motorcycle, elicited: wheels, dangerous, engine, fast
motorcycle, distributional: ride, sidecar, park, road

22 Outline
Differences between distributional dimensions and semantic features
Redefining the dichotomy
No dichotomy after all
Integrated inference

23 Not a competition: use both kinds of features!
Computational perspective:
Distributional features are great: learned automatically, enable many inferences
Human-defined semantic features are great: less noisy, enable inferences with more certainty, enable inferences that distributional models do not provide
How can we integrate the two?

24 Speculation: Learning both individual and aggregate features
The learner makes use of features from the textual environment
Some features are almost meaningless, others more meaningful
Some of them are relevant to grammar (CAUSE, BECOME)
Both meaningful and near-meaningless features enter aggregate inference
Only certain features allow individual inference
(Unclear: this should not be feature lists, there is structure! But where does that fit in this picture?)

25 Outline
Differences between distributional dimensions and semantic features
Redefining the dichotomy
No dichotomy after all
Integrated inference

26 Inferring individual features from aggregates
Johns and Jones 2012: compute the weight of the feature bird for nightingale as the summed similarity of nightingale to known birds
Fagarasan/Vecchi/Clark 2015: learn a mapping from distributional vectors to vectors of individual features (see the sketch below)
Herbelot/Vecchi 2015: learn a mapping from distributional space to a "set-theoretic space" of quantified individual features (ALL apes are muscular, SOME apes live on coasts)
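A sketch of the mapping idea with a ridge-regularized linear map; the data here is random, whereas Fagarasan/Vecchi/Clark fit real distributional vectors to feature norms:

```python
import numpy as np

# Learn a linear map W from distributional space to a space of
# individual, interpretable features. Toy random data as stand-in.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # 100 words x 50 distributional dims
Y = rng.normal(size=(100, 10))    # 100 words x 10 interpretable features

lam = 1.0                         # ridge penalty
W = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T @ Y)

x_unseen = rng.normal(size=50)    # distributional vector of a new word
predicted_features = x_unseen @ W # its inferred individual features
```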

27 Inferring individual features from aggregates
Gupta et al. 2015: regression to learn properties of unknown cities/countries from those of known cities/countries
Snow/Jurafsky/Ng 2006: infer the location of a word in the WordNet hierarchy using a distributional co-hyponymy classifier

28 Individual features influencing aggregate representations
Andrews/Vigliocco/Vinson 2009, Roller/Schulte im Walde 2013: topic modeling that includes known individual features of words in the text
Faruqui et al. 2015: update vector representations to better match known synonymy, hypernymy, and hyponymy information (see the sketch below)
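A sketch of the retrofitting update in the spirit of Faruqui et al. 2015, with uniform weights assumed; the toy lexicon below is hypothetical:

```python
import numpy as np

def retrofit(vectors, neighbors, iters=10, alpha=1.0, beta=1.0):
    """Pull each vector toward its lexicon neighbors while keeping
    it close to its original (distributional) position."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for w, ns in neighbors.items():
            ns = [n for n in ns if n in new]
            if not ns:
                continue
            new[w] = (alpha * vectors[w] + beta * sum(new[n] for n in ns)) \
                     / (alpha + beta * len(ns))
    return new

vecs = {"happy": np.array([1.0, 0.0]), "glad": np.array([0.0, 1.0])}
synonyms = {"happy": ["glad"], "glad": ["happy"]}
print(retrofit(vecs, synonyms))  # the two synonyms move toward each other
```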

29 Individual features influencing aggregate representations
Boyd-Graber/Blei/Zhu 2007: the WordNet hierarchy as part of a topic model
To generate a word: choose a topic, then walk down the WordNet hierarchy based on the topic
Aim: the best WordNet sense for each word in context
Riedel et al. 2013, Rocktäschel et al. 2015: Universal Schema
A relation is characterized by a vector of named-entity pairs (the entity pairs that fill the relation)
Both human-defined and corpus-extracted relations
Matrix factorization over the union of human-defined and corpus-extracted relations
Predict whether a relation holds of an entity pair
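A sketch of universal-schema-style factorization on toy data; the published model uses a ranking objective, and the plain logistic loss and sampled negatives here are simplifications:

```python
import numpy as np

# score(relation, entity pair) = dot product of their embeddings,
# trained so that observed cells of the relation matrix score high.
rng = np.random.default_rng(0)
n_rel, n_pair, k = 5, 8, 3
R = rng.normal(0, 0.1, (n_rel, k))    # relation embeddings
P = rng.normal(0, 0.1, (n_pair, k))   # entity-pair embeddings
observed = [(0, 1), (0, 2), (3, 1)]   # (relation, pair) cells that hold

for _ in range(200):
    for r, p in observed:
        neg = int(rng.integers(n_pair))          # sampled negative pair
        for pair, label in ((p, 1.0), (neg, 0.0)):
            g = 1 / (1 + np.exp(-(R[r] @ P[pair]))) - label
            gR, gP = g * P[pair], g * R[r]       # logistic-loss gradients
            R[r] -= 0.1 * gR
            P[pair] -= 0.1 * gP

print(R[0] @ P[1])  # predicted score: does relation 0 hold of pair 1?
```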

30 Conclusion
Distributional features are not semantic features:
Not primitive; inference comes from relations between word representations and from co-occurrence with extra-linguistic information
Not (necessarily) individually meaningful; inference comes from the aggregate of features
Two modes of inference: individual and aggregate
Use both individual and aggregate features
How to integrate the two, and infer one from the other?

31 References
Almuhareb, A., & Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. Proceedings of EMNLP, pp. 1–8.
Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
Asher, N. (2011). Lexical meaning in context: A web of words. Cambridge University Press.
Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2), 222–254.
Baroni, M., & Lenci, A. (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.
Bierwisch, M. (1969). On certain problems of semantic representation. Foundations of Language, 5, 153–184.
Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A topic model for word sense disambiguation. Proceedings of EMNLP.

32 References
Cimiano, P., & Wenderoth, J. (2007). Automatic acquisition of ranked qualia structures from the Web. Proceedings of ACL, pp. 888–895.
Devereux, B., Pilkington, N., Poibeau, T., & Korhonen, A. (2010). Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Research on Language and Computation, 7(2-4), 137–170.
Fagarasan, L., Vecchi, E., & Clark, S. (2015). From distributional semantics to feature norms: Grounding semantic models in human perceptual data. Proceedings of IWCS.
Faruqui, M., Dodge, J., Jauhar, S., Dyer, C., Hovy, E., & Smith, N. (2015). Retrofitting word vectors to semantic lexicons. Proceedings of NAACL.
Fodor, J., Garrett, M. F., Walker, E. C. T., & Parkes, C. H. (1980). Against definitions. Cognition, 8(3), 263–367.
Fodor, J., & Lepore, E. (1999). All at sea in semantic space: Churchland on meaning similarity. The Journal of Philosophy, 96(8), 381–403.
Geeraerts, D. (2009). Theories of Lexical Semantics. Oxford University Press.

33 References
Gupta, A., Boleda, G., Baroni, M., & Padó, S. (2015). Distributional vectors encode referential attributes. Proceedings of EMNLP.
Herbelot, A., & Vecchi, E. M. (2015). Building a shared world: Mapping distributional to model-theoretic semantic spaces. Proceedings of EMNLP.
Jackendoff, R. (1990). Semantic Structures. MIT Press.
Johns, B. T., & Jones, M. N. (2012). Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1), 103–120.
Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), 170–210.
Lewis, D. (1970). General semantics. Synthese, 22(1), 18–67.
Pustejovsky, J. (1991). The Generative Lexicon. Computational Linguistics, 17(4).

34 References
Rappaport Hovav, M., & Levin, B. (2001). An event structure account of English resultatives. Language, 77(4).
Riedel, S., Yao, L., McCallum, A., & Marlin, B. (2013). Relation extraction with matrix factorization and universal schemas. Proceedings of NAACL.
Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. Proceedings of NAACL.
Roller, S., & Schulte im Walde, S. (2013). A multimodal LDA model integrating textual, cognitive and visual modalities. Proceedings of EMNLP.
Schank, R. (1969). A conceptual dependency parser for natural language. Proceedings of COLING.
Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An empirical characterisation of response types in German association norms. Research on Language and Computation, 6(2).

35 References
Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence. Proceedings of ACL-COLING, pp. 801–808.
Sowa, J. (1992). Logical structures in the lexicon. In J. Pustejovsky & S. Bergler (Eds.), Lexical Semantics and Knowledge Representation (LNCS, Vol. 627, pp. 39–60).
Thill, S., Padó, S., & Ziemke, T. (2014). On the importance of a rich embodiment in the grounding of concepts: Perspectives from embodied cognitive science and computational linguistics. Topics in Cognitive Science, 6(3), 545–558.
Wierzbicka, A. (1996). Semantics: Primes and Universals. Oxford University Press.
Wilks, Y. (2008). What would a Wittgensteinian computational linguistics be like? Presented at the AISB workshop on computers and philosophy, Aberdeen.

