
1 Information Retrieval in Text Part II Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999. Reading Assignment: Chapter 3.

2 Traditional Information Retrieval vs. Vector Space Models Traditional lexical (or Boolean) IR techniques, e.g. keyword matching, often return too much irrelevant information to the user because of Synonymy (multiple ways to express a concept) and Polysemy (multiple meanings of a word). –If the terms used in the query differ from the terms used in the documents, valuable information may never be found. Vector space models can be used to encode/represent the terms and documents of a text collection –Each component of a document vector represents a particular term/keyword/phrase/concept –The value assigned reflects the semantic importance of that term/keyword/phrase/concept

3 Vector Space Construction A document collection of n documents indexed using m terms is represented as an m × n term-by-document matrix A. –a_ij is the weighted frequency with which term i occurs in document j. –Columns are document vectors and rows are term vectors
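
As an illustration (not taken from the slides), the following Python sketch builds such a matrix from raw term frequencies; the tiny document list and the helper name term_index are invented for the example.

# Sketch: build an m x n term-by-document matrix of raw term frequencies.
# The documents below are made up for illustration only.
import numpy as np

docs = ["infant and toddler first aid",
        "baby proofing basics",
        "child safety at home"]

terms = sorted({w for d in docs for w in d.split()})   # index terms (rows)
term_index = {t: i for i, t in enumerate(terms)}

A = np.zeros((len(terms), len(docs)))                  # m x n matrix
for j, d in enumerate(docs):
    for w in d.split():
        A[term_index[w], j] += 1                       # a_ij = frequency of term i in document j

print(A.shape)   # (m, n): columns are document vectors, rows are term vectors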

4 Vector Space Properties The document vectors span the content of the collection, yet not every vector in the column space of A has a specific interpretation. –For example, a linear combination of two document vectors does not necessarily represent a viable document of the collection. However, the vector space model can exploit geometric relationships between document (and term) vectors in order to explain both similarities and differences in concepts.

5 Term-by-Document Matrices What is the relationship between the number of terms, m, and the number of documents, n, in –Heterogeneous text collections such as newspapers and encyclopedias? –The WWW? 300,000 × 300,000,000 (as of the late 1990s)

6 Term-by-Document Matrices: Example [Figure: the running example, a small collection of seven book titles (including "Baby Proofing Basics" and "Your Guide to Easy Rust Proofing") indexed on terms such as Child, Home, Proofing, and Safety.]

7 [Figure: the corresponding term-by-document matrix; in this example every nonzero entry is a frequency of 1.]

8 In this example every term frequency is 1, but frequencies can in general be larger –Matrix entries can be scaled so that the Euclidean norm of each document vector is equal to 1: ||a_j||_2 = (Σ_i a_ij^2)^(1/2) = 1 Determining which words to index and which words to discard defines both the art and the science of automated indexing. Terms are usually identified by their word stems. –Stemming reduces the number of rows in the term-by-document matrix, which may lead to storage savings
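
A minimal sketch of the column scaling just described, assuming a small made-up matrix A:

# Scale each column of A so that every document vector has unit Euclidean norm.
import numpy as np

A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])                  # toy term-by-document matrix

norms = np.linalg.norm(A, axis=0)             # Euclidean norm of each column
A_normalized = A / norms                      # each column now has norm 1
print(np.linalg.norm(A_normalized, axis=0))   # all ones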

9 Simple Query Matching Queries are represented as m × 1 vectors –So, in the previous example, a query of “Child Proofing” is represented as …? Query matching in the vector space model can be viewed as a search in the column space of the matrix A (i.e. the subspace spanned by the document vectors) for the documents most similar to the query.

10 Simple Query Matching The most commonly used similarity measure is to –find the cosine of the angle between the query vector q and each document vector (where a_j is the j-th document vector): cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2) –Given a threshold value T, documents that satisfy the condition |cos θ_j| ≥ T are judged relevant to the user’s query and returned.

11 Simple Query Matching Properties –Since A is sparse, the dot-product computations are inexpensive. –Document vector norms can be pre-computed and stored before any cosine computation; there is no need to do so if the columns of A are already normalized. –If both the query and the document vectors are normalized, each cosine computation reduces to a single inner product.
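
A minimal sketch of this query-matching procedure (the matrix, query, and threshold below are illustrative, not the book-title example):

# Rank documents by the cosine of the angle between the query vector q and
# each document vector (column of A); return those with |cos theta_j| >= T.
import numpy as np

def cosine_match(A, q, T=0.5):
    doc_norms = np.linalg.norm(A, axis=0)     # can be precomputed and stored
    cosines = (A.T @ q) / (doc_norms * np.linalg.norm(q))
    return [(j, c) for j, c in enumerate(cosines) if abs(c) >= T]

A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])                  # toy 4-term x 3-document matrix
q = np.array([1., 0., 1., 0.])                # toy query on terms 0 and 2
print(cosine_match(A, q, T=0.5))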

12 Simple Query Matching: Example For our previous query “Child Proofing”, and assuming that T = 0.5 –The nonzero cosines are cos θ_2 = cos θ_3 = 0.4082 and cos θ_5 = cos θ_6 = 0.5 –Therefore, the only “relevant” documents returned are Baby Proofing Basics and Your Guide to Easy Rust Proofing –Documents 1 to 4 have been incorrectly ignored, whereas Document 7 has been correctly ignored.

13 Simple Query Matching: Example What about a query on Child Home Safety? –What is the query? –What are the nonzero cosines? –What is retrieved with T = 0.5?

14 Simple Query Matching Hence, the current vector space representation and query-matching technique do not accurately represent and/or retrieve the semantic content of the book titles. The following approaches have been developed to address the errors in this model –Term weighting –Low-rank approximations of the original term-by-document matrix A

15 Term Weighting The main objective of term weighting is to improve retrieval performance. The entries a_ij of the term-by-document matrix A are redefined as follows: a_ij = l_ij g_i d_j, where l_ij is the local weight for term i in document j, g_i is the global weight for term i in the collection, and d_j is a document normalization factor that specifies whether or not the columns of A are normalized. –Define f_ij as the frequency with which term i appears in document j; the weighting formulas below are expressed in terms of f_ij.

16 Term Weighting [Table: symbols and formulas for the local weights l_ij (e.g. b = binary, t = raw term frequency, l = logarithmic, n = augmented normalized term frequency), the global weights g_i (e.g. x = none, f = inverse document frequency), and the normalization factor d_j (x = none, c = cosine), all defined in terms of f_ij.]

17 A simple notation for specifying a term weighting scheme is the three-letter string formed from the symbols of the particular local, global, and normalization factors. –For example, the lfc weighting scheme uses the logarithmic local weight l_ij = log(1 + f_ij), the inverse-document-frequency global weight g_i = log(n / n_i), where n_i is the number of documents containing term i, and the cosine normalization factor d_j = (Σ_i (g_i l_ij)^2)^(-1/2).
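
A minimal sketch of one common reading of the lfc scheme (the logarithm base and the handling of terms that occur in every document are free choices here, so the numbers may differ slightly from the book's tables):

# lfc-style weighting: l_ij = log(1 + f_ij), g_i = log(n / n_i), cosine normalization d_j.
import numpy as np

def lfc_weight(F):
    """F is the m x n matrix of raw term frequencies f_ij."""
    n = F.shape[1]
    L = np.log1p(F)                               # local weights l_ij
    n_i = np.count_nonzero(F, axis=1)             # number of documents containing term i
    g = np.log(n / n_i)                           # global weights g_i (IDF)
    W = L * g[:, np.newaxis]                      # g_i * l_ij
    d = 1.0 / np.linalg.norm(W, axis=0)           # cosine normalization factors d_j
    return W * d

F = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.]])                      # made-up raw frequencies
A = lfc_weight(F)
print(np.linalg.norm(A, axis=0))                  # columns have unit norm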

18 Term Weighting Defining an appropriate weighting scheme depends on certain characteristics of the document collection –The choice of the local weight (l_ij) may depend on the vocabulary or word-usage patterns of the collection For technical/scientific vocabularies (technical reports, journal articles), schemes of the form nxx are recommended For more general/varied vocabularies (e.g. popular magazines, encyclopedias), simple term frequencies (t**) may be sufficient Binary term frequencies (b**) are useful when the term list is relatively short (e.g. a controlled vocabulary)

19 Term Weighting Defining an appropriate weighting scheme depends on certain characteristics of the document collection –The choice of the global weight (g_i) should take into account how often the collection is likely to change, i.e. the state of the document collection For dynamic collections, one may disregard the global factor altogether (*x*) For static collections, the inverse document frequency (IDF) global weight (*f*) is a common choice among automatic indexing schemes –The probability of a document being judged relevant by a user increases significantly with the document length, i.e. the longer the document, the more likely it is that all keywords will be found. Traditional cosine normalization (**c) has not been effective for large full-text documents (e.g. TREC-4). Instead, a pivoted-cosine normalization scheme has been proposed for indexing the TREC collections.

20 Term Weighting: Pivoted [Cosine] Normalization The normalization factor for documents for which P_retrieval > P_relevance is increased, whereas the normalization factor for documents for which P_retrieval < P_relevance is decreased: pivoted normalization = (1 – slope) × pivot + slope × old normalization If the deviation of the retrieval pattern from the relevance pattern is systematic across collections for a normalization function, the pivot and slope values learned from one collection can be used effectively on another collection
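
A small sketch of this formula; the slope and pivot values below are made up for illustration (in practice they are learned from a training collection, with the average old normalization often serving as the pivot):

# pivoted normalization = (1 - slope) * pivot + slope * old normalization
def pivoted_normalization(old_norm, pivot, slope):
    return (1.0 - slope) * pivot + slope * old_norm

old_norms = [2.0, 6.0, 14.0]                      # e.g. cosine norms of three documents
pivot = sum(old_norms) / len(old_norms)           # illustrative pivot choice
slope = 0.75                                      # illustrative slope
print([pivoted_normalization(x, pivot, slope) for x in old_norms])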

21 Sparse Matrix Storage Although A is sparse, it does not generally exhibit a structure or pattern such as that of banded matrices. –This implies that it is difficult to identify clusters of documents sharing similar terms. Some progress in reordering hypertext-based matrices has been reported

22 Sparse Matrix Storage Two formats suitable for term-by-document matrices are Compressed Row Storage (CRS) and Compressed Column Storage (CCS) –No assumptions on the existence of a pattern or structure in the sparse matrix –Each requires three arrays of storage

23 Compressed Row Storage One floating-point array (val) stores the nonzero values, i.e. the [un]weighted term frequencies, of A –Row-wise Two integer arrays store indices (col_index, row_ptr) –col_index: the column indices of the elements in the val array (What is the size of this array?) If val(k) = a_ij, then col_index(k) = j –row_ptr: the locations in the val array that begin a row

24 Compressed Row Storage: Example [Figure: the val, col_index, and row_ptr arrays for the example matrix.]
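
Since the example arrays are not reproduced here, the following sketch builds the three CRS arrays for a small made-up matrix:

# Build the three CRS arrays for a small sparse matrix.  Nonzeros are stored
# row by row; row_ptr[i] marks where row i begins in val, with a final entry
# equal to the total number of nonzeros.
import numpy as np

A = np.array([[1., 0., 2., 0.],
              [0., 3., 0., 0.],
              [4., 0., 0., 5.]])

val, col_index, row_ptr = [], [], [0]
for row in A:
    for j, a in enumerate(row):
        if a != 0:
            val.append(a)
            col_index.append(j)       # column index of each stored value
    row_ptr.append(len(val))          # start of the next row within val

print(val)        # [1.0, 2.0, 3.0, 4.0, 5.0]
print(col_index)  # [0, 2, 1, 0, 3]
print(row_ptr)    # [0, 2, 3, 5]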

25 Compressed Column Storage Also known as the Harwell-Boeing sparse matrix format Almost identical to CRS –Columns are stored in contiguous array locations

26 Compressed Column Storage: Example [Figure: the val, row_index, and col_ptr arrays for the example matrix.]
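
For a quick check, scipy.sparse (assuming it is available) provides exactly these two formats as CSR and CSC matrices:

# The same made-up matrix stored in scipy's CSR and CSC formats, which
# correspond to Compressed Row Storage and Compressed Column Storage.
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

A = np.array([[1., 0., 2., 0.],
              [0., 3., 0., 0.],
              [4., 0., 0., 5.]])

crs = csr_matrix(A)
ccs = csc_matrix(A)
print(crs.data, crs.indices, crs.indptr)   # val, col_index, row_ptr
print(ccs.data, ccs.indices, ccs.indptr)   # val, row_index, col_ptr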

27 Low-Rank Approximations The uncertainties associated with term-by-document matrices can be attributed to differences in language and word-usage culture. –For example, the author may use different words than the reader/searcher. Will we ever have a perfect term-by-document matrix representing all possible term-document associations? –Errors in measurement can accumulate and lead to these uncertainties.

28 Low-Rank Approximations Hence, the term-by-document matrix may be represented by the matrix sum A + E, where E reflects the error or uncertainty in assigning (or generating) the elements of matrix A. Current approaches to information retrieval that do not require literal word matches have focused on the use of rank-k approximations to term-by-document matrices. –Latent Semantic Indexing (k << min(m, n)) The coordinates produced by low-rank approximations do not explicitly reflect term frequencies within documents; instead they model global usage patterns of terms, so that related documents are represented by nearby vectors in the k-dimensional space. –Semi-Discrete Decomposition
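
A minimal sketch of a rank-k approximation via the truncated SVD, in the spirit of Latent Semantic Indexing (the matrix, the query, and k are made up, and the query projection follows the standard LSI recipe rather than anything specific from the slides):

# Rank-k approximation A_k = U_k S_k V_k^T of a term-by-document matrix,
# plus cosine matching of a query in the k-dimensional space.
import numpy as np

A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])              # made-up 4-term x 4-document matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k                        # best rank-k approximation of A

q = np.array([1., 0., 0., 1.])                # made-up query vector
q_k = np.linalg.inv(S_k) @ U_k.T @ q          # query coordinates in k dimensions
doc_k = S_k @ Vt_k                            # document coordinates (k x n)
cos = (doc_k.T @ q_k) / (np.linalg.norm(doc_k, axis=0) * np.linalg.norm(q_k))
print(np.round(cos, 3))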

29 Low-Rank Approximations

