Linear Algebra and Geometric Approaches to Meaning. 3b. Distributed Semantics. Reinhard Blutner, Universiteit van Amsterdam. ESSLLI Summer School 2011, Ljubljana, August 1 – August 7, 2011.

Acknowledgement: We thank Stefan Evert for allowing us to use some of his slides, presented at the "Tandem Workshop on Optimality in Language and Geometric Approaches to Cognition" (Berlin, December 11–13, 2010), for parts of this course.

Outline: 1. Meaning and distribution, 2. Distributional semantic models, 3. Word vectors and search engines, 4. Latent semantic analysis.

Meaning & distribution. "Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache." ("The meaning of a word is its use in the language.") (Ludwig Wittgenstein). "You shall know a word by the company it keeps!" (J. R. Firth 1957). The distributional hypothesis (Zellig Harris 1954). Stefan Evert

What is the meaning of "bardiwac"? He handed her her glass of bardiwac. Beef dishes are made to complement the bardiwacs. Nigel staggered to his feet, face flushed from too much bardiwac. Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine. I dined off bread and cheese and this excellent bardiwac. The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish. => bardiwac is a heavy red alcoholic beverage made from grapes. Stefan Evert

The Distributional Hypothesis (DH) (Lenci 2008): At least certain aspects of the meaning of lexical expressions depend on their distributional properties in the linguistic contexts. The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear. Weak and strong DH: the weak view takes the DH as a quantitative method for semantic analysis and lexical resource induction; the strong view takes it as a cognitive hypothesis about the form and origin of semantic representations, assuming that word distributions in context play a specific causal role in forming meaning representations.

Geometric interpretation: the row vector x_dog describes the usage of the word dog in the corpus; it can be seen as the coordinates of a point in n-dimensional Euclidean space R^n. (Stefan Evert 2010)

The family of Minkowski p-norms (adapted from Stefan Evert): visualisation of norms in R^2 by plotting the unit circle for each norm, i.e. the points u with |u| = 1; here p-norms |·|_p for different values of p. p = 1: Manhattan distance; p = 2: Euclidean distance; p -> infinity: maximum distance. |u|_p := (|u_1|^p + ... + |u_n|^p)^(1/p)
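
A minimal numpy sketch (not from the original slides) of the p-norms defined above, evaluated for p = 1, p = 2 and the maximum norm; the example vector is made up for illustration.

import numpy as np

u = np.array([3.0, -4.0])          # example vector in R^2 (values chosen for illustration)

def p_norm(u, p):
    """|u|_p = (|u_1|^p + ... + |u_n|^p)^(1/p)"""
    return np.sum(np.abs(u) ** p) ** (1.0 / p)

print(p_norm(u, 1))                # Manhattan norm: 7.0
print(p_norm(u, 2))                # Euclidean norm: 5.0
print(np.max(np.abs(u)))           # limit p -> infinity: maximum norm, 4.0
print(np.linalg.norm(u, 1), np.linalg.norm(u, 2), np.linalg.norm(u, np.inf))  # same via numpy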

Distance and similarity, illustrated for two dimensions (co-occurrence counts with get and use): x_dog = (115, 10). Similarity = spatial proximity (Euclidean distance); the location depends on the frequency of the noun (f_dog ≈ 2.7 · f_cat). (Stefan Evert 2010)

Angle and similarity: direction is more important than location; normalise the "length" ||x_dog|| of the vector, or use the angle α as a distance measure. (Stefan Evert 2010)
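
A short numpy sketch (not part of the original slides) contrasting Euclidean distance with the angle/cosine measure. x_dog = (115, 10) is taken from the slide above; the cat vector is a hypothetical value chosen only to illustrate that direction matters more than length.

import numpy as np

x_dog = np.array([115.0, 10.0])    # co-occurrence counts with "get" and "use" (from the slide)
x_cat = np.array([42.0, 4.0])      # hypothetical vector, similar direction but shorter

euclid = np.linalg.norm(x_dog - x_cat)                  # large, because dog is more frequent
cos = x_dog @ x_cat / (np.linalg.norm(x_dog) * np.linalg.norm(x_cat))
angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # small angle -> similar usage
print(euclid, cos, angle)

# normalising both vectors to unit length removes the frequency effect:
d_norm = np.linalg.norm(x_dog / np.linalg.norm(x_dog) - x_cat / np.linalg.norm(x_cat))
print(d_norm)                      # small, consistent with the small angle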

Outline: 1. Meaning and distribution, 2. Distributional semantic models, 3. Word vectors and search engines, 4. Latent semantic analysis.

A very brief history: Distributional methods were introduced to computational linguistics in the early 1990s, following the probabilistic revolution (Schütze 1992, 1998). Other early work in psychology (Landauer and Dumais 1997; Lund and Burgess 1996) was influenced by Latent Semantic Indexing (Dumais et al. 1988) and efficient software implementations (Berry 1992). Renewed interest in recent years. (Adapted from Stefan Evert)

Some applications in computational linguistics Unsupervised part-of-speech induction (Schütze 1995) Word sense disambiguation (Schütze 1998) Synonym tasks & other language tests (Landauer and Dumais 1997; Turney et al. 2003) Ontology & wordnet expansion (Pantel et al. 2009) Probabilistic language models (Bengio et al. 2003) Subsymbolic input representation for neural networks Many other tasks in computational semantics: entailment detection, noun compound interpretation,… Adapted from Stefan Evert

Example: Word Space (Schütze). Corpus: 60 million words of news messages (New York Times News Service). Word-word co-occurrence matrix (20,000 target words & 2,000 context words as features). Each row vector records how often each context word occurs close to the target word (co-occurrence). Co-occurrence window: 50 words left/right (Schütze 1998) or 1000 characters (Schütze 1992). Normalization determines the "meaning" of a context. Reduced to 100 singular-value (SVD) dimensions (mainly for efficiency). (Stefan Evert)
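
A toy sketch (my own illustration, not Schütze's actual code) of how a word-word co-occurrence matrix with a symmetric context window can be built; the corpus, target words and window size are miniature stand-ins for the 60-million-word setting described above.

import numpy as np

corpus = "the bass player tuned his bass guitar while the fisherman caught a bass".split()
targets = sorted(set(corpus))               # in the real model: 20,000 target words
contexts = targets                          # in the real model: 2,000 context words
window = 2                                  # slide: 50 words left/right

t_idx = {w: i for i, w in enumerate(targets)}
c_idx = {w: i for i, w in enumerate(contexts)}
M = np.zeros((len(targets), len(contexts)))

for i, w in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i and corpus[j] in c_idx:
            M[t_idx[w], c_idx[corpus[j]]] += 1   # count context word near target word

print(M[t_idx["bass"]])   # row vector = distributional profile of "bass"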

Clustering (figure; adapted from Stefan Evert 2010).

Semantic maps (figure; adapted from Stefan Evert 2010).

Outline: 1. Meaning and distribution, 2. Distributional semantic models, 3. Word vectors and search engines, 4. Latent semantic analysis.

Basic references: Dominic Widdows, Geometry and Meaning, CSLI Publications, 2004. Keith van Rijsbergen, The Geometry of Information Retrieval, Cambridge University Press, 2004. D. Widdows & S. Peters, Word vectors and quantum logic, in Proceedings of Mathematics of Language 8 (MoL8), 2003.

Term-document matrix
Consider the frequencies of words in certain documents. From this information we can construct a vector for each word reflecting the corresponding frequencies:

            Document 1   Document 2   Document 3
bank             0            0            4
bass             2            4            0
cream            2            0            0
guitar           1            0            0
fisherman        0            3            0
money            0            1            2

Document 1 is about music instruments, document 2 about fishermen, and document 3 about financial institutions.

Similarity matrix (scalar product / cos θ) between the word vectors for bank, bass, cream, guitar, fisherman and money. (The numerical entries of the slide's table are not preserved in this transcript; they can be recomputed from the term-document matrix above, as in the sketch below.)
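
A numpy sketch computing the word-by-word similarity matrix from the term-document matrix above, both as raw scalar products and as cosines; this reproduces the kind of table shown on the slide (the slide's exact numbers are not preserved here).

import numpy as np

words = ["bank", "bass", "cream", "guitar", "fisherman", "money"]
A = np.array([[0, 0, 4],      # bank
              [2, 4, 0],      # bass
              [2, 0, 0],      # cream
              [1, 0, 0],      # guitar
              [0, 3, 0],      # fisherman
              [0, 1, 2]],     # money
             dtype=float)     # rows = word vectors over the three documents

dot = A @ A.T                                  # scalar products
norms = np.linalg.norm(A, axis=1)
cos = dot / np.outer(norms, norms)             # cos(theta) between word vectors

np.set_printoptions(precision=2, suppress=True)
print(cos)   # e.g. guitar/cream = 1.0 (same direction), bank/guitar = 0.0 (orthogonal)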

Vector negation
Words such as bass are ambiguous (i. music instrument, ii. fish). If a user is only interested in one of these meanings, how are we to enable users to search for only the documents containing this meaning of the word?
a NOT b = a − (a·b / |b|²) b
a NOT b is a vector that is orthogonal to b, i.e. (a NOT b) · b = 0.
Example from the slide: bass NOT fisherman = (…) NOT (0 1 0) = (1 0 0), where the bass vector shown on the slide is not preserved in this transcript.
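
A minimal numpy sketch of vector negation as defined above, applied to the bass and fisherman vectors from the term-document matrix (these values differ from the toy numbers on the slide). The result is orthogonal to the fisherman vector, i.e. the "fish" component of bass is removed.

import numpy as np

def vec_not(a, b):
    """a NOT b = a - (a.b / |b|^2) b, orthogonal to b by construction."""
    return a - (a @ b) / (b @ b) * b

bass      = np.array([2.0, 4.0, 0.0])   # from the term-document matrix above
fisherman = np.array([0.0, 3.0, 0.0])

result = vec_not(bass, fisherman)
print(result)                # [2. 0. 0.]: the document-2 ("fish") component is gone
print(result @ fisherman)    # 0.0, as required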

Reducing dimensions using SVD: based on singular value decomposition. Considering only the highest singular values reduces redundancies of the original matrix. The approach can be thought of as a version of decomposing words into semantic primitives.
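
A small numpy sketch (not from the slides) of the dimension reduction idea: keep only the k largest singular values and reconstruct a rank-k approximation of the matrix.

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 3)) @ rng.random((3, 5))     # a 6x5 matrix of rank (at most) 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                        # singular values, the last two are ~0

k = 2                                           # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation of A

# the error equals the norm of the discarded singular values (Eckart-Young theorem)
print(np.linalg.norm(A - A_k), np.linalg.norm(s[k:]))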

Outline: 1. Meaning and distribution, 2. Distributional semantic models, 3. Word vectors and search engines, 4. Latent semantic analysis.

Principal component analysis: We want to project the data points to a lower-dimensional subspace, but preserve their mutual distances as well as possible (variance = average squared distance). If we reduced the data set to just a single dimension, which dimension would preserve the most variance? Mathematically, we project the points onto a line through the origin and calculate the one-dimensional variance on this line. (Adapted from Stefan Evert 2010)

Example (figure; adapted from Stefan Evert 2010).

The covariance matrix: Assume the distributional analysis gives an n × m data matrix M (data points as rows). With the help of the covariance matrix C of the mean-centered data, it is possible to calculate the variance σ_v² of the projections onto a unit vector v by the following formula: σ_v² = vᵀ C v (without proof). Orthogonal dimensions v₁, v₂, … partition the variance: use the eigenvectors of the covariance matrix C. (Adapted from Stefan Evert 2010)
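
A numerical sanity check (my own sketch, with random data) of the formula σ_v² = vᵀ C v: the variance of the data projected onto a unit vector v equals vᵀ C v, with C the covariance matrix of the mean-centered data.

import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(200, 3))            # 200 data points (rows) in 3 dimensions
Mc = M - M.mean(axis=0)                  # center the data
C = Mc.T @ Mc / len(Mc)                  # covariance matrix (population convention, 1/n)

v = np.array([1.0, 2.0, -1.0])
v = v / np.linalg.norm(v)                # unit vector defining the projection direction

proj = Mc @ v                            # one-dimensional projections of all points
print(proj.var())                        # variance computed directly
print(v @ C @ v)                         # sigma_v^2 = v^T C v, the same number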

Principal components: The eigenvectors v_i of the covariance matrix C are called the principal components of the data set. The amount of variance preserved (or "explained") by the i-th principal component is given by the eigenvalue corresponding to the eigenvector v_i: σ_{v_i}² = v_iᵀ C v_i = λ_i. Since λ₁ ≥ λ₂ ≥ … ≥ λ_n, the first principal component accounts for the largest amount of variance, etc. For the purpose of "noise reduction", only the first k ≪ n principal components (with the highest variance) are retained, and the other dimensions are dropped. (Adapted from Stefan Evert 2010)
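
A compact PCA sketch (assuming the data matrix has data points as rows) that follows the recipe above: eigendecompose the covariance matrix, sort by eigenvalue, keep the first k components.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2-D data

Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)

eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric; eigenvalues come ascending
order = np.argsort(eigvals)[::-1]        # sort so that lambda_1 >= lambda_2 >= ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1                                    # keep only the first principal component
Z = Xc @ eigvecs[:, :k]                  # low-dimensional representation of the data

print(eigvals)                           # variance "explained" by each component
print(eigvals[0] / eigvals.sum())        # share of the total variance preserved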

Singular Value Decomposition (SVD): The SVD can be seen as a generalization of the spectral theorem, which says that normal matrices can be unitarily diagonalized using a basis of eigenvectors, to arbitrary, not necessarily square, matrices. SVD: Any m × n matrix A can be factorized as the product U Σ V* of three matrices, where U is an m × m unitary matrix over K (i.e., the columns of U are orthonormal), Σ is an m × n matrix with nonnegative numbers on the diagonal (called the singular values) and zeros off the diagonal, and V* denotes the conjugate transpose of V, an n × n unitary matrix over K. Such a factorization is called a singular value decomposition of A.
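
A numpy sketch (my own check, not from the slides) verifying the decomposition A = U Σ V* and the stated properties (U and V unitary, Σ non-negative diagonal) on a random complex matrix.

import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 3
A = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))   # a complex m x n matrix

U, s, Vh = np.linalg.svd(A)              # full SVD: U is m x m, Vh = V* is n x n
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)               # m x n, singular values on the diagonal

print(np.allclose(A, U @ Sigma @ Vh))                    # A = U Sigma V*
print(np.allclose(U.conj().T @ U, np.eye(m)))            # U unitary
print(np.allclose(Vh @ Vh.conj().T, np.eye(n)))          # V unitary
print(np.all(s >= 0), np.all(np.diff(s) <= 0))           # non-negative, decreasing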

General scheme of the SVD. (This illustration assumes m > n, i.e. A has more rows than columns. For m < n, Σ is a horizontal rectangle with diagonal elements σ₁, …, σ_m.) (Adapted from Stefan Evert 2010)

SVD: some links
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://mathworld.wolfram.com/SingularValueDecomposition.html
(The slide also linked an online tool for performing SVD computations and a distributed semantic model tutorial with other materials; those URLs are not preserved in this transcript.)

Conclusions: At least certain aspects of the meaning of lexical expressions depend on their distributional properties in the linguistic contexts. The weak distributional hypothesis as a quantitative method for semantic analysis and lexical resource induction. PCA and SVD as methods to reduce the dimension of the primary semantic space. SVD as an empirical method for calculating relevant meaning components. When does it work and when not?

Appendix: Linear algebraic proof of SVD
Let M be a rectangular (m × n) matrix with complex entries. M*M is positive semidefinite, therefore Hermitian. By the spectral theorem, there exists a unitary U such that
U* (M*M) U = [ Σ  0 ; 0  0 ],
where Σ is diagonal and positive definite. Partition U = [U₁ U₂] accordingly, so that the columns of U₁ span the eigenspaces with non-zero eigenvalues. Therefore U₁* M*M U₁ = Σ, and M U₂ = 0. Define W₁ = M U₁ Σ^(−1/2).

Then W₁ Σ^(1/2) U₁* = M U₁ Σ^(−1/2) Σ^(1/2) U₁* = M U₁ U₁* = M, using U₁ U₁* + U₂ U₂* = I and M U₂ = 0. We see that this is almost the desired result, except that W₁ and U₁ are not unitary in general, since they need not be square. Both have orthonormal columns, i.e. they are isometries (W₁* W₁ = I and U₁* U₁ = I). To finish the argument, one simply has to "fill out" these matrices to obtain unitaries. U₂ already does this for U₁. Similarly, one can choose W₂ such that [W₁ W₂] is unitary. Direct calculation shows
M = [W₁ W₂] [ Σ^(1/2)  0 ; 0  0 ] [U₁ U₂]*,
which is the desired result. Notice the argument could begin with diagonalizing MM* rather than M*M (this shows directly that MM* and M*M have the same non-zero eigenvalues).
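
A numerical illustration (my own sketch) of the construction in the proof: diagonalize M*M, split the eigenvector matrix into U₁ (non-zero eigenvalues) and U₂, set W₁ = M U₁ Σ^(−1/2), and check that W₁ Σ^(1/2) U₁* reproduces M.

import numpy as np

rng = np.random.default_rng(4)
m, n, r = 5, 4, 2
A1 = rng.normal(size=(m, r)) + 1j * rng.normal(size=(m, r))
A2 = rng.normal(size=(r, n)) + 1j * rng.normal(size=(r, n))
M = A1 @ A2                                      # a complex m x n matrix of rank r

eigvals, U = np.linalg.eigh(M.conj().T @ M)      # spectral theorem applied to M*M
order = np.argsort(eigvals)[::-1]                # sort eigenvalues in decreasing order
eigvals, U = eigvals[order], U[:, order]

nonzero = eigvals > 1e-10 * eigvals[0]
U1, U2 = U[:, nonzero], U[:, ~nonzero]           # partition U as in the proof
W1 = M @ U1 @ np.diag(eigvals[nonzero] ** -0.5)  # W1 = M U1 Sigma^(-1/2)

print(np.linalg.norm(M @ U2))                              # M U2 = 0 (up to rounding)
print(np.allclose(W1.conj().T @ W1, np.eye(r)))            # W1 has orthonormal columns
print(np.allclose(W1 @ np.diag(eigvals[nonzero] ** 0.5) @ U1.conj().T, M))  # M recovered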