An Example
d1: Romeo and Juliet.
d2: Juliet: O happy dagger!
d3: Romeo died by dagger.
d4: “Live free or die”, that’s the New-Hampshire’s motto.
d5: Did you know, New-Hampshire is in New-England.
q: dies, dagger
Which documents should be returned, and how should they be ranked?

Eigenvectors and Eigenvalues
Let $A$ be an $n \times n$ matrix.
If $x$ is an $n$-dimensional vector, then the matrix-vector product $Ax$ is well-defined, and the result is again an $n$-dimensional vector.
In general, multiplication by a matrix changes the direction of a non-zero vector $x$, unless the vector is special and we have
$Ax = \lambda x$
for some scalar $\lambda$.
Such a vector $x$ is called an eigenvector of $A$, and $\lambda$ is the corresponding eigenvalue.

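The relation $Ax = \lambda x$ can be checked numerically. The following is a minimal sketch using NumPy; the 2 × 2 matrix and the test vector `v` are arbitrary illustrative choices, not part of the original slides.

```python
import numpy as np

# Illustrative symmetric matrix (an assumption, chosen for readability).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `eigenvectors` are the eigenvectors of A.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Each eigenvector keeps its direction: A x equals lambda * x.
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)

# A generic vector does change direction under multiplication by A.
v = np.array([1.0, 0.0])
Av = A @ v
# The 2-D cross product is zero only when v and Av are parallel.
cross = v[0] * Av[1] - v[1] * Av[0]
print("eigenvalues:", np.sort(eigenvalues))
print("v and Av parallel?", bool(np.isclose(cross, 0.0)))
```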
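Matrix Decomposition
Let $S$ be the matrix with the eigenvectors of $A$ as columns.
Let $\Lambda$ be the diagonal matrix with the eigenvalues of $A$ on the diagonal.
Then, provided the eigenvectors are linearly independent (so that $S$ is invertible),
$A = S \Lambda S^{-1}$
If $A$ is symmetric, the eigenvectors can be chosen orthonormal, so that $S^{-1} = S^T$ and
$A = S \Lambda S^T$
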
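As a quick check of the symmetric case, the sketch below rebuilds a matrix from its eigenvectors and eigenvalues. The 3 × 3 matrix is an arbitrary example (an assumption for illustration); `np.linalg.eigh` is used because it is intended for symmetric matrices and returns orthonormal eigenvectors.

```python
import numpy as np

# Illustrative symmetric matrix (not from the slides).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors
# as the columns of S.
eigenvalues, S = np.linalg.eigh(A)
Lambda = np.diag(eigenvalues)

# For a symmetric matrix, S is orthogonal: S^T S = I, hence S^{-1} = S^T.
assert np.allclose(S.T @ S, np.eye(3))

# Rebuild A from its decomposition A = S Lambda S^T.
A_rebuilt = S @ Lambda @ S.T
print(np.allclose(A, A_rebuilt))
```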
Singular Value Decomposition
Let $A$ be an $m \times n$ matrix with real entries and $m > n$.
Consider the $n \times n$ square matrix $B = A^T A$.
$B$ is symmetric.
It has been shown that the eigenvalues of such ($A^T A$) matrices are non-negative.
Since they are non-negative, we can write them in decreasing order as squares of non-negative real numbers: $\sigma_1^2 \geq \dots \geq \sigma_n^2$
For some index $r$ (possibly $n$) the first $r$ numbers are positive whereas the rest are zero.
Let $x_1, \dots, x_r$ be orthonormal eigenvectors of $B$ corresponding to $\sigma_1^2, \dots, \sigma_r^2$, and define
$S_1 = [x_1, \dots, x_r]$
$y_1 = (1/\sigma_1) A x_1, \quad \dots, \quad y_r = (1/\sigma_r) A x_r$
$S_2 = [y_1, \dots, y_r]$
We can show that
$A = S_2 \Sigma S_1^T$
where $\Sigma$ is diagonal and the values along the diagonal are $\sigma_1, \dots, \sigma_r$, which are called singular values.
If we denote $S_2$ by $S$ and $S_1$ by $U$ we have
$A = S \Sigma U^T$

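The construction above can be followed step by step in NumPy. This is a sketch under stated assumptions: the 4 × 3 matrix is an illustrative choice whose third column is the sum of the first two (so that $r < n$), and the tolerance `tol` used to decide which eigenvalues count as zero is a hypothetical value.

```python
import numpy as np

# Illustrative 4 x 3 matrix with m > n; column 3 = column 1 + column 2,
# so the rank is 2 and one eigenvalue of A^T A is zero.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])

# B = A^T A is symmetric with non-negative eigenvalues.
B = A.T @ A
eigvals, eigvecs = np.linalg.eigh(B)      # ascending order

# Reorder so that the eigenvalues are decreasing.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# r = number of (numerically) positive eigenvalues; tol is an assumption.
tol = 1e-10
r = int(np.sum(eigvals > tol))

sigma = np.sqrt(eigvals[:r])              # singular values sigma_1..sigma_r
S1 = eigvecs[:, :r]                       # columns x_1..x_r
S2 = (A @ S1) / sigma                     # column i is (1/sigma_i) A x_i
Sigma = np.diag(sigma)

# Check the factorisation A = S2 Sigma S1^T.
A_rebuilt = S2 @ Sigma @ S1.T
print("r =", r)
print("reconstruction ok:", np.allclose(A, A_rebuilt))
```

The singular values obtained this way can be compared against `np.linalg.svd(A, compute_uv=False)`, which computes them directly.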
Example
d1: Romeo and Juliet.
d2: Juliet: O happy dagger!
d3: Romeo died by dagger.
d4: “Live free or die”, that’s the New-Hampshire’s motto.
d5: Did you know, New-Hampshire is in New-England.
q: dies, dagger

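To work with the example numerically, the documents first have to be turned into a term-document matrix. The sketch below is one possible way to do this and rests on several assumptions that are not stated in the slides: the vocabulary list, the dropping of all other words as stop words, and the mapping of "died" and "dies" to "die" are all hypothetical preprocessing choices.

```python
import re
import numpy as np

docs = {
    "d1": "Romeo and Juliet.",
    "d2": "Juliet: O happy dagger!",
    "d3": "Romeo died by dagger.",
    "d4": "\"Live free or die\", that's the New-Hampshire's motto.",
    "d5": "Did you know, New-Hampshire is in New-England.",
}
query = "dies, dagger"

# Assumed vocabulary; every other word is treated as a stop word.
vocabulary = ["romeo", "juliet", "happy", "dagger",
              "live", "die", "free", "new-hampshire"]
# Assumed normalisation standing in for a real stemmer.
normalize = {"died": "die", "dies": "die"}
index = {term: i for i, term in enumerate(vocabulary)}


def to_vector(text):
    """Count vocabulary terms in `text` and return a term-count vector."""
    vec = np.zeros(len(vocabulary))
    for token in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()):
        token = normalize.get(token, token)
        if token in index:
            vec[index[token]] += 1
    return vec


# Rows are terms, columns are documents d1..d5.
A = np.column_stack([to_vector(text) for text in docs.values()])
q = to_vector(query)

# Plain term matching: number of query terms shared with each document.
scores = A.T @ q
for name, score in zip(docs, scores):
    print(name, int(score))
```

With these assumptions, plain term matching gives d3 a score of 2, d2 and d4 a score of 1, and d1 and d5 a score of 0. In particular, d1 is not retrieved at all, even though it is related to d2 and d3 through the shared terms "romeo" and "juliet".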
Latent Concepts
Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data.
Each document and term (word) is then expressed as a vector with elements corresponding to these concepts.
Each element in a vector gives the degree of participation of the document or term in the corresponding concept.
The goal is not to describe the concepts verbally, but to represent the documents and terms in a unified way that exposes
document-document,
document-term, and
term-term similarities
which are otherwise hidden.

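Matrix
Matrix $A$ can be written: $A = S \Sigma U^T$
Let's "neglect" the last three singular values of $\Sigma$ as being too "small", keeping the $2 \times 2$ diagonal matrix $\Sigma_2$.
Also, just keep
two columns from $S$, obtaining $S_2$, and
two rows from $U^T$, obtaining $U_2^T$.
(Here the subscript denotes the number of singular values retained.)
Matrix $A$ is approximated as: $A_2 = S_2 \Sigma_2 U_2^T$
In general: $A_k = S_k \Sigma_k U_k^T$, where a good value for $k$ is determined empirically.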
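
The truncation can be sketched with NumPy as follows. The 8 × 5 term-document matrix used here is an assumption: it is one plausible encoding of the five example documents (rows: romeo, juliet, happy, dagger, live, die, free, new-hampshire), and the slides may use a different one. Note that `np.linalg.svd` returns the factors in the order called $S$, the diagonal of $\Sigma$, and $U^T$ in the notation above.

```python
import numpy as np

# Assumed term-document matrix: rows are terms, columns are d1..d5.
A = np.array([
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 0, 0, 0],   # happy
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 0, 1, 0],   # live
    [0, 0, 1, 1, 0],   # die
    [0, 0, 0, 1, 0],   # free
    [0, 0, 0, 1, 1],   # new-hampshire
], dtype=float)

# Thin SVD: S is 8 x 5, sigma holds 5 singular values, Ut is 5 x 5.
S, sigma, Ut = np.linalg.svd(A, full_matrices=False)

k = 2                        # number of singular values to keep
S_k = S[:, :k]               # first k columns of S
Sigma_k = np.diag(sigma[:k]) # top-left k x k block of Sigma
Ut_k = Ut[:k, :]             # first k rows of U^T

# Rank-k approximation A_k = S_k Sigma_k U_k^T.
A_k = S_k @ Sigma_k @ Ut_k

# The Frobenius error equals the root of the sum of the squared
# singular values that were dropped.
error = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(sigma[k:] ** 2))
print("singular values:", np.round(sigma, 3))
print("approximation error:", round(float(error), 3))
```

Trying several values of `k` and watching how the error and the retrieval quality change is one simple way to choose `k` empirically.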