
Information Retrieval in Text, Part II
Reference: Michael W. Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM.
Reading Assignment: Chapter 3.

Traditional Information Retrieval vs. Vector Space Models
Traditional lexical (or Boolean) IR techniques, e.g. keyword matching, often return too much irrelevant information to the user due to
– Synonymy (multiple ways to express a concept), and
– Polysemy (multiple meanings of the same word).
– If the terms used in the query differ from the terms used in the document, valuable information can never be found.
Vector space models can be used to encode/represent terms and documents in a text collection.
– Each component of a document vector represents a particular term/keyword/phrase/concept.
– The value assigned reflects the semantic importance of that term/keyword/phrase/concept.

Vector Space Construction
A document collection of n documents indexed using m terms is represented as an m × n term-by-document matrix A.
– a_ij is the weighted frequency with which term i occurs in document j.
– Columns are document vectors and rows are term vectors.
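To make the construction concrete, here is a minimal sketch in Python/NumPy that builds a raw term-by-document matrix for a small invented collection; the titles and term list below are illustrative and are not the example used on the following slides.

```python
import numpy as np

# Hypothetical toy collection: each "document" is a short title.
docs = [
    "infant and toddler first aid",
    "baby proofing basics",
    "your guide to easy rust proofing",
]
# Index terms (rows of A); in practice these come from the indexing step.
terms = ["baby", "guide", "infant", "proofing", "rust", "toddler"]

m, n = len(terms), len(docs)
A = np.zeros((m, n))
for j, doc in enumerate(docs):
    words = doc.split()
    for i, term in enumerate(terms):
        A[i, j] = words.count(term)   # a_ij = raw frequency of term i in document j

print(A)   # columns are document vectors, rows are term vectors
```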

Vector Space Properties
The document vectors span the content of the collection, yet not every vector in the column space of A has a specific interpretation.
– For example, a linear combination of any two document vectors does not necessarily represent a viable document of the collection.
However, the vector space model can exploit geometric relationships between document (and term) vectors in order to explain both similarities and differences in concepts.

Term-by-Document Matrices
What is the relationship between the number of terms, m, and the number of documents, n, in
– Heterogeneous text collections such as newspapers and encyclopedias?
– The WWW? Roughly 300,000 terms × 300,000,000 documents (as of the late 1990s).

Term-by-Document Matrices: Example

Although each term frequency in this example is 1, frequencies can in general be larger.
– Matrix entries can be scaled so that the Euclidean norm of each document vector is equal to 1.
Determining which words to index and which words to discard defines both the art and the science of automated indexing.
Terms are usually identified by their word stems.
– Stemming reduces the number of rows in the term-by-document matrix, which may lead to storage savings.
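A sketch of the column scaling mentioned above, assuming the matrix A from the previous sketch: each document vector is divided by its Euclidean norm.

```python
import numpy as np

def normalize_columns(A):
    """Scale each column of A to unit Euclidean norm (all-zero columns are left as zeros)."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for empty documents
    return A / norms

A_normalized = normalize_columns(A)    # assumes A from the previous sketch
print(np.linalg.norm(A_normalized, axis=0))   # each nonzero column now has norm 1
```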

Simple Query Matching
Queries are represented as m × 1 vectors.
– So, in the previous example, a query of “Child Proofing” is represented as …?
Query matching in the vector space model can be viewed as a search in the column space of the matrix A (i.e. the subspace spanned by the document vectors) for the documents most similar to the query.

Simple Query Matching
The most commonly used similarity measure is the cosine of the angle between the query vector q and each document vector (let a_j denote the j-th document vector):
  cos θ_j = (a_j^T q) / (||a_j||_2 ||q||_2), for j = 1, …, n.
– Given a threshold value, T, documents that satisfy the condition |cos θ_j| ≥ T are judged relevant to the user’s query and returned.
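A minimal sketch of cosine-based query matching under the same assumptions as the earlier sketches (the matrix A and term list terms); the query term and threshold below are illustrative.

```python
import numpy as np

def cosine_scores(A, q):
    """Return cos(theta_j) between query q and each column (document vector) of A."""
    doc_norms = np.linalg.norm(A, axis=0)
    q_norm = np.linalg.norm(q)
    # Guard against zero norms so empty documents/queries score 0 instead of NaN.
    denom = np.where(doc_norms * q_norm == 0, 1.0, doc_norms * q_norm)
    return (A.T @ q) / denom

# Illustrative query over the hypothetical term list from the earlier sketch.
q = np.zeros(A.shape[0])
q[terms.index("proofing")] = 1.0       # query on the single term "proofing"

T = 0.5
scores = cosine_scores(A, q)
relevant = np.where(np.abs(scores) >= T)[0]
print(scores, relevant)                # documents judged relevant to the query
```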

Simple Query Matching: Properties
– Since A is sparse, the dot product computation is inexpensive.
– Document vector norms can be pre-computed and stored before any cosine computation; this is unnecessary if the matrix A is already normalized.
– If both the query and document vectors are normalized, the cosine computation reduces to a single inner product.

Simple Query Matching: Example
For our previous query “Child Proofing”, and assuming that T = 0.5:
– The nonzero cosines are cos θ_2 = cos θ_3 (a value below the threshold) and cos θ_5 = cos θ_6 = 0.5.
– Therefore, the only “relevant” documents returned are
  Baby Proofing Basics
  Your Guide to Easy Rust Proofing
– Documents 1 to 4 have been incorrectly ignored, whereas Document 7 has been correctly ignored.

Simple Query Matching: Example
What about a query on “Child Home Safety”?
– What is the query vector?
– What are the nonzero cosines?
– What is retrieved with T = 0.5?

Simple Query Matching
Hence, the current vector space representation and query-matching technique do not accurately capture or retrieve the semantic content of the book titles.
The following approaches have been developed to address errors in this model:
– Term weighting
– Low-rank approximations to the original term-by-document matrix A

Term Weighting
The main objective of term weighting is to improve retrieval performance.
The entries a_ij of the term-by-document matrix A are redefined as a_ij = l_ij · g_i · d_j, where
– l_ij is the local weight for term i occurring in document j,
– g_i is the global weight for term i in the collection, and
– d_j is a document normalization factor that specifies whether or not the columns of A are normalized.
– Define f_ij as the frequency with which term i appears in document j, and let p_ij = f_ij / Σ_j f_ij.

Term Weighting

A simple notation for specifying a term weighting approach is to use the three-letter string associated with the particular local, global, and normalization factor symbols.
– For example, the lfc weighting scheme combines a logarithmic local weight, the inverse document frequency (IDF) global weight, and cosine normalization of the columns (see the sketch below).
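The exact formulas attached to each symbol vary between sources; the sketch below assumes the common choices l_ij = log(1 + f_ij), g_i = log(n / n_i) with n_i the number of documents containing term i, and cosine normalization of the weighted columns.

```python
import numpy as np

def lfc_weighting(F):
    """Apply an lfc-style weighting to a raw frequency matrix F (terms x documents).

    Assumed formulas: l_ij = log(1 + f_ij), g_i = log(n / n_i), and cosine
    normalization of each weighted column.
    """
    m, n = F.shape
    L = np.log1p(F)                          # local weights l_ij
    n_i = np.count_nonzero(F, axis=1)        # document frequency of each term
    n_i = np.maximum(n_i, 1)                 # guard terms that occur in no document
    g = np.log(n / n_i)                      # global weights g_i
    A = L * g[:, None]                       # a_ij = l_ij * g_i
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0                  # avoid division by zero
    return A / norms                         # cosine normalization d_j

A_weighted = lfc_weighting(A)                # A: raw term-by-document matrix from earlier
```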

Term Weighting
Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
– The choice of local weight (l_ij) may depend on the vocabulary or word-usage patterns of the collection:
  For technical/scientific vocabularies (technical reports, journal articles), schemes of the form nxx are recommended.
  For more general/varied vocabularies (e.g. popular magazines, encyclopedias), simple term frequencies (t**) may be sufficient.
  Binary term frequencies (b**) are useful when the term list is relatively short (e.g. controlled vocabularies).

Term Weighting
Defining an appropriate weighting scheme depends on certain characteristics of the document collection.
– The choice of global weight (g_i) should take into account how often the collection is likely to change, called the state of the document collection:
  For dynamic collections, one may disregard the global factor altogether (*x*).
  For static collections, the inverse document frequency (IDF) global weight (*f*) is a common choice among automatic indexing schemes.
– The probability of a document being judged relevant by a user increases significantly with document length, i.e. the longer the document, the more likely it is that all keywords will be found.
  Traditional cosine normalization (**c) has not been effective for large full-text documents (e.g. TREC-4). Instead, a pivoted cosine normalization scheme has been proposed for indexing the TREC collections.

Term Weighting: Pivoted [Cosine] Normalization
The normalization factor for documents for which P_retrieval > P_relevance is increased, whereas the normalization factor for documents for which P_retrieval < P_relevance is decreased:
  pivoted normalization = (1 − slope) × pivot + slope × old normalization
If the deviation of the retrieval pattern from the relevance pattern is systematic across collections for a normalization function, the pivot and slope values learned from one collection can be used effectively on another collection.
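A sketch of the formula above. The slide does not fix the pivot or slope; a common assumption is to take the average old normalization factor as the pivot and to treat the slope as an empirically tuned constant, which is what the illustrative defaults below do.

```python
import numpy as np

def pivoted_normalization(old_norms, slope=0.75, pivot=None):
    """pivoted norm = (1 - slope) * pivot + slope * old norm.

    old_norms: array of old normalization factors (e.g., document vector norms).
    The default slope and the average-as-pivot choice are illustrative assumptions.
    """
    if pivot is None:
        pivot = old_norms.mean()
    return (1.0 - slope) * pivot + slope * old_norms

old_norms = np.linalg.norm(A, axis=0)        # assumes A from the earlier sketches
new_norms = pivoted_normalization(old_norms)
```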

Sparse Matrix Storage
Although A is sparse, it does not generally exhibit a regular structure or pattern, such as that of banded matrices.
– This implies that it is difficult to identify clusters of documents sharing similar terms.
Some progress in reordering hypertext-based matrices has been reported.

Sparse Matrix Storage
Two formats suitable for term-by-document matrices are Compressed Row Storage (CRS) and Compressed Column Storage (CCS).
– They make no assumptions about the existence of a pattern or structure in the sparse matrix.
– Each requires three arrays of storage.

Compressed Row Storage
One floating-point array (val) stores the nonzero values of A, i.e. the [un]weighted term frequencies, row-wise.
Two integer arrays store the indices (col_index, row_ptr):
– col_index: the column indices of the corresponding elements in the val array (what is the size of this array?); if val(k) holds a_ij, then col_index(k) = j.
– row_ptr: the locations in the val array that begin each row.

Compressed Row Storage: Example (the val, col_index, and row_ptr arrays for the sample matrix; a worked sketch follows below)
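Because the example arrays did not survive in the transcript, here is a small hypothetical CRS illustration using SciPy's csr_matrix, whose data, indices, and indptr arrays correspond to val, col_index, and row_ptr (with 0-based indexing).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 sparse matrix (not the slides' example).
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [4.0, 0.0, 0.0, 5.0]])

S = csr_matrix(M)
print(S.data)      # val:       [1. 2. 3. 4. 5.]   nonzero values stored row by row
print(S.indices)   # col_index: [0 2 1 0 3]        column index of each stored value
print(S.indptr)    # row_ptr:   [0 2 3 5]          where each row begins in val
```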

Compressed Column Storage
Also known as the Harwell-Boeing sparse matrix format.
Almost identical to CRS, except that
– columns (rather than rows) are stored in contiguous array locations.

Compressed Column Storage: Example (the val, row_index, and col_ptr arrays for the sample matrix; a worked sketch follows below)
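The matching hypothetical CCS illustration, using SciPy's csc_matrix, whose data, indices, and indptr arrays play the roles of val, row_index, and col_ptr.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Same hypothetical 3 x 4 matrix as in the CRS sketch.
M = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [4.0, 0.0, 0.0, 5.0]])

S = csc_matrix(M)
print(S.data)      # val:       [1. 4. 3. 2. 5.]   nonzero values stored column by column
print(S.indices)   # row_index: [0 2 1 0 2]        row index of each stored value
print(S.indptr)    # col_ptr:   [0 2 3 4 5]        where each column begins in val
```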

Low-Rank Approximations
The uncertainties associated with term-by-document matrices can be attributed to differences in language (word-usage) culture.
– For example, the author may use different words than the reader/searcher.
Will we ever have a perfect term-by-document matrix representing all possible term-document associations?
– Errors in measurement can accumulate and lead to these uncertainties.

Low-Rank Approximations
Hence, the term-by-document matrix may be represented by the matrix sum A + E, where E reflects the error or uncertainty in assigning (or generating) the elements of A.
Current approaches to information retrieval that do not require literal word matches have focused on rank-k approximations to term-by-document matrices:
– Latent Semantic Indexing (LSI), with k << min(m, n). Coordinates produced by low-rank approximations do not explicitly reflect term frequencies within documents; instead, they model global usage patterns of terms so that related documents are represented by nearby vectors in the k-dimensional space.
– Semi-discrete Decomposition (SDD)
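A minimal sketch of a rank-k approximation via the truncated SVD, the decomposition underlying LSI; the choice k = 2 and the reuse of A, q, and cosine_scores from the earlier sketches are illustrative assumptions.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return the best rank-k approximation A_k of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

k = 2                                  # illustrative choice, k << min(m, n)
A_k = rank_k_approximation(A, k)       # A: term-by-document matrix from earlier sketches

# Query matching then proceeds exactly as before, but against A_k instead of A,
# so documents sharing global usage patterns (not just literal terms) can score highly.
scores_k = cosine_scores(A_k, q)       # assumes cosine_scores and q from earlier sketches
```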

Low-Rank Approximations