
Information Retrieval in Text, Part III. Reference: Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, 1999. Reading Assignment: Chapter 4.

Outline
- Matrix Decompositions
  - QR Factorization
  - Singular Value Decomposition
  - Updating Techniques

Matrix Decomposition
To produce a reduced-rank approximation of the m × n term-by-document matrix A, one must identify the dependence between the columns or rows of the matrix A. For a rank-k matrix, the k basis vectors of its column space serve in place of its n column vectors to represent its column space.

QR Factorization
The QR factorization of a matrix A is defined as A = QR, where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix.
- A square matrix is orthogonal if its columns are orthonormal, i.e., if q_j denotes a column of the orthogonal matrix Q, then q_j has unit Euclidean norm (||q_j||_2 = 1 for j = 1, 2, …, m) and it is orthogonal to all other columns of Q (q_j^T q_i = 0 for all i ≠ j).
- The rows of Q are also orthonormal, i.e., Q^T Q = QQ^T = I.
- Such a factorization exists for any matrix A.
- There are many ways to compute the factorization.

QR Factorization
Given A = QR, the columns of the matrix A are all linear combinations of the columns of Q.
- Thus, a subset of k of the columns of Q forms a basis for the column space of A, where k = rank(A).
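The properties above can be checked numerically. A minimal sketch using NumPy's `qr` routine and a small made-up term-by-document matrix (not the book's example):

```python
import numpy as np

# A small made-up 4x3 term-by-document matrix (not the book's example).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Full QR factorization: Q is 4x4 orthogonal, R is 4x3 upper triangular.
Q, R = np.linalg.qr(A, mode="complete")

# Q's columns are orthonormal: Q^T Q = I.
assert np.allclose(Q.T @ Q, np.eye(4))

# A = QR exactly, so A's columns are linear combinations of Q's columns.
assert np.allclose(Q @ R, A)

# rank(A) columns of Q suffice as a basis for the column space of A.
k = np.linalg.matrix_rank(A)
print(k)
```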

QR Factorization: Example

QR Factorization
The QR factorization of the previous example can be represented as A = QR = [Q_1 Q_2] [R_1 ; 0] = Q_1 R_1.
- Note that the first 7 columns of Q, denoted Q_1, are orthonormal and hence constitute a basis for the column space of A.
- The bottom zero submatrix of R is not always guaranteed to be generated automatically by the QR factorization, so column pivoting may need to be applied in order to guarantee the zero submatrix.
- Q_2 does not contribute to producing any nonzero value in A.

QR Factorization
One motivation for using the QR factorization is that the basis vectors can be used to describe the semantic content of the corresponding text collection.
The cosines of the angles θ_j between a query vector q and the document vectors a_j are given by
cos θ_j = a_j^T q / (||a_j||_2 ||q||_2) = (Q r_j)^T q / (||Q r_j||_2 ||q||_2) = r_j^T (Q^T q) / (||r_j||_2 ||q||_2),
where r_j is the j-th column of R and ||Q r_j||_2 = ||r_j||_2 because Q is orthogonal.
Note that for the query "Child Proofing" this gives exactly the same cosines. Why?
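The equivalence of computing the cosines directly from A or from R and the transformed query Q^T q can be sketched as follows (made-up matrix and query, not the book's data):

```python
import numpy as np

# Made-up 4x3 term-by-document matrix and query vector (not the book's data).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 1.0, 0.0])

Q, R = np.linalg.qr(A, mode="complete")

# Direct cosines: cos(theta_j) = a_j^T q / (||a_j||_2 ||q||_2).
direct = A.T @ q / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Via the factorization: a_j = Q r_j and ||Q r_j||_2 = ||r_j||_2,
# so cos(theta_j) = r_j^T (Q^T q) / (||r_j||_2 ||q||_2).
via_qr = R.T @ (Q.T @ q) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

assert np.allclose(direct, via_qr)
```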

Frobenius Matrix Norm
Definition: The Frobenius matrix norm ||·||_F of an m × n matrix B = [b_ij] is defined by
||B||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} b_ij^2 )^{1/2}.
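A quick numerical check of the definition against NumPy's built-in Frobenius norm:

```python
import numpy as np

B = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Frobenius norm: square root of the sum of squared entries.
by_hand = np.sqrt((B ** 2).sum())

assert np.isclose(by_hand, np.linalg.norm(B, "fro"))
assert np.isclose(by_hand, np.sqrt(30.0))  # 1 + 4 + 9 + 16 = 30
```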

Low Rank Approximation for QR Factorization
Initially, the rank of A is not known. However, after performing the QR factorization, its rank is obviously the rank of _______
With column pivoting, we know that there exists a permutation matrix P such that AP = QR, where the larger entries of R are moved toward the upper left corner. Such an arrangement, if possible, partitions R so that the smallest entries are isolated in the bottom submatrix.

Low Rank Approximation for QR Factorization

Computing the Low-Rank Approximation
Redefining R_22 to be the 4 × 2 zero matrix, the modified upper triangular matrix R has rank 5 rather than 7.
- Hence, the matrix has rank ____
- Show that ||E||_F = ||R_22||_F.
- Show that ||E||_F / ||A||_F = ||R_22||_F / ||R||_F = 0.3237.
Therefore, the relative change in R, 32.37%, yields the same relative change in A.
- With r = 4, the relative change is 76%.
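The identity ||E||_F = ||R_22||_F follows because multiplying by an orthogonal Q preserves the Frobenius norm. A sketch with a made-up matrix (not the book's 7-document example):

```python
import numpy as np

# Made-up 5x4 matrix (not the book's example).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))

Q, R = np.linalg.qr(A, mode="complete")

# Zero out a trailing block of R to lower its rank.
R_mod = R.copy()
R_mod[2:, :] = 0.0          # keep only the first 2 rows of R
R22 = R[2:, :]              # the discarded block

A_approx = Q @ R_mod
E = A - A_approx

# Orthogonal Q preserves the Frobenius norm, so ||E||_F = ||R22||_F.
assert np.isclose(np.linalg.norm(E, "fro"), np.linalg.norm(R22, "fro"))

# The relative change is the same in R and in A, since ||A||_F = ||R||_F.
assert np.isclose(np.linalg.norm(E, "fro") / np.linalg.norm(A, "fro"),
                  np.linalg.norm(R22, "fro") / np.linalg.norm(R, "fro"))
```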

Low Rank Approximation for QR Factorization: Example

Comparing Cosine Similarities for the Query: "Child Proofing"

Doc    A        r=5      r=4
2      0.408    0.408    0.408
3      0.5      0.5      0.309
4      0        0        0.184
5      0.5      0.5      0.5
6

Comparing Cosine Similarities for the Query: "Child Home Safety"

Doc    A        r=5      r=4
2      0.667    0.667    0.667
3      1        0.816    0.756
4      0.258    0        0.1
5      0        0        0
6      0        0        0
7      0        0        0.356

Singular Value Decomposition
While the QR factorization provides a reduced-rank basis for the column space, no information is provided about the row space of A. The SVD can provide
- a reduced-rank approximation for both spaces
- the rank-k approximation to A of minimal change, for any value of k.

Singular Value Decomposition
A = UΣV^T, where
- U: m × m orthogonal matrix whose columns define the left singular vectors of A
- V: n × n orthogonal matrix whose columns define the right singular vectors of A
- Σ: m × n diagonal matrix containing the singular values σ_1 ≥ σ_2 ≥ … ≥ σ_min{m,n} ≥ 0
Such a factorization exists for any matrix A.
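A minimal sketch of the decomposition using NumPy's `svd` routine (same made-up matrix as before, not the book's example):

```python
import numpy as np

# Made-up 4x3 term-by-document matrix (not the book's example).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Full SVD: U is 4x4, Vt is 3x3, s holds the singular values.
U, s, Vt = np.linalg.svd(A)

# Singular values come back in non-increasing order.
assert np.all(s[:-1] >= s[1:])

# Rebuild the m x n diagonal matrix Sigma and check A = U Sigma V^T.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(U @ Sigma @ Vt, A)

# U and V are orthogonal.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
```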

Component Matrices of the SVD

SVD vs. QR
What is the relationship between the rank of A and the ranks of the matrices in both factorizations? In QR, the first r_A columns of Q form a basis for the column space; so do the first r_A columns of U in the SVD. The first r_A rows of V^T form a basis for the row space of A. The low rank-k approximation in the SVD can be obtained by setting all but the k largest singular values in Σ to zero.

SVD
Theorem: The rank-k approximation from the SVD is the closest rank-k approximation to A.
- Proven by Eckart and Young, who showed that the error in approximating A by A_k is given by
  ||A − A_k||_F = min_{rank(B) ≤ k} ||A − B||_F = (σ_{k+1}^2 + σ_{k+2}^2 + … + σ_{rank(A)}^2)^{1/2},
  where A_k = U_k Σ_k V_k^T.
- Hence, the error in approximating the original matrix is determined by the discarded singular values (σ_{k+1}, σ_{k+2}, …, σ_{rank(A)}).
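The Eckart–Young error formula can be verified numerically by truncating the SVD and comparing the Frobenius error against the discarded singular values (made-up matrix, not the book's data):

```python
import numpy as np

# Made-up 6x4 matrix (not the book's example).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A)

k = 2
# Rank-k approximation: keep only the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: ||A - A_k||_F = sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```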

SVD: Example

||A − A_6||_F = …… Hence, the relative change in the matrix A is …
Therefore, a rank-5 approximation may be appropriate in our case. Determining the best rank approximation for any database depends on empirical testing.
- For very large databases, the number could be between 100 and 300.
- Computational feasibility, rather than accuracy, determines the rank reduction.

Rank-k approximation    % Change
Rank-6                  7.4%
Rank-5                  22.67%
Rank-4                  32.49%
Rank-3                  56.45%

Low Rank Approximations
A visual comparison of rank-reduced approximations to A can be misleading.
- Compare the rank-4 QR approximation with the more accurate rank-4 SVD approximation.
The rank-4 SVD approximation shows associations made with terms not originally in the document title, e.g., Term 4 (Health) and Term 8 (Safety) in Document 1 (Infant & Toddler First Aid).

Query Matching
Given a query vector q, to be compared with the columns of the reduced-rank matrix A_k:
- Let e_j denote the j-th canonical vector (the j-th column of the identity matrix I_n). Then A_k e_j represents _______________
- It is easy to show that
  cos θ_j = (A_k e_j)^T q / (||A_k e_j||_2 ||q||_2) = s_j^T (U_k^T q) / (||s_j||_2 ||q||_2),
  where s_j = Σ_k V_k^T e_j.

Query Matching
An alternate formula for the cosine computation uses ||U_k^T q||_2 in place of ||q||_2:
cos θ'_j = s_j^T (U_k^T q) / (||s_j||_2 ||U_k^T q||_2).
Note that ||U_k^T q||_2 ≤ ||q||_2, so |cos θ'_j| ≥ |cos θ_j|, which means that the number of documents retrieved at a given cosine threshold using this query matching technique is larger.
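Both query-matching formulas can be sketched with NumPy (made-up matrix, query, and rank k, not the book's data); note that A_k e_j = U_k s_j and ||A_k e_j||_2 = ||s_j||_2 because U_k has orthonormal columns:

```python
import numpy as np

# Made-up 6x5 term-by-document matrix and query (not the book's data).
rng = np.random.default_rng(2)
A = rng.random((6, 5))
q = rng.random(6)

U, s, Vt = np.linalg.svd(A)
k = 3
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = Uk @ Sk @ Vtk

# s_j = Sigma_k V_k^T e_j, so column j of S below is s_j.
S = Sk @ Vtk
Uq = Uk.T @ q            # U_k^T q, computed once per query

# cos(theta_j) = s_j^T (U_k^T q) / (||s_j|| ||q||) matches the direct formula.
cos1 = S.T @ Uq / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))
direct = A_k.T @ q / (np.linalg.norm(A_k, axis=0) * np.linalg.norm(q))
assert np.allclose(cos1, direct)

# Alternate formula: ||U_k^T q|| <= ||q|| in the denominator,
# so these cosines are at least as large in magnitude.
cos2 = S.T @ Uq / (np.linalg.norm(S, axis=0) * np.linalg.norm(Uq))
assert np.all(np.abs(cos2) >= np.abs(cos1) - 1e-12)
```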
