# Lecture 11 Search, Corpora Characteristics, & Lucene Introduction


Relevance & metrics of success: an ad hoc query is run against a document collection (aka corpus), producing ranked results.

Relevance (Venn diagram): the set of relevant docs and the set of retrieved docs overlap in the region of docs that are both relevant & retrieved.

Relevance: precision and recall are both defined in terms of the relevant & retrieved overlap:

precision = |relevant ∩ retrieved| / |retrieved|

recall = |relevant ∩ retrieved| / |relevant|
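The two formulas above can be computed directly from sets of document IDs. A minimal sketch (the function name and sample IDs are illustrative, not from the lecture):

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall from two sets of document IDs."""
    hits = relevant & retrieved  # docs that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d9"}
p, r = precision_recall(relevant, retrieved)
# 2 of the 3 retrieved docs are relevant (precision 2/3);
# 2 of the 4 relevant docs were retrieved (recall 1/2).
```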

Vector Space Model: the query and each document are represented as vectors in the term space, where the term space is the set of all unique terms in the document set (corpus). Retrieval then looks at the similarity between the query vector and the vector representing each document.

Vector Space Model: for our purposes, a vector is a series of numbers. Every document has a vector associated with it, and so does each query. The idea is that a document and a query are similar to each other if their vectors point in the same general direction.
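The slides do not name a specific similarity measure, but the standard choice for "pointing in the same general direction" is cosine similarity, the cosine of the angle between the two vectors. A sketch:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d:
    dot(q, d) / (|q| * |d|).  1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_q * norm_d)
```

Note that cosine similarity ignores vector length, so a long document is not favored over a short one merely for containing more words.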

Vector Space Model: for every unique term in the document collection (corpus), there is a corresponding component of the vector. If there are 731 unique terms in the document collection, then each vector has 731 components.

query: (q_0, q_1, q_2, …, q_n)

document 0: (d_0,0, d_0,1, d_0,2, …, d_0,n)
document 1: (d_1,0, d_1,1, d_1,2, …, d_1,n)
document 2: (d_2,0, d_2,1, d_2,2, …, d_2,n)
document 3: (d_3,0, d_3,1, d_3,2, …, d_3,n)

Here the first three components correspond to the terms “gnarly”, “rocket”, and “pencil”.

Vector Space Model: we can populate the vector with binary data. E.g., if the term “gnarly” appears in document 2 while “rocket” and “pencil” do not, we assign a 1 to the first component and a 0 to the next two:

document 2: (1, 0, 0, …, d_2,n)

Vector Space Model: this binary model does not take into account how often a term appeared in a document. The model can be modified to account for how often the term appears. For example, if “gnarly” appears 10 times, “rocket” appears 4 times, and “pencil” appears twice, the vector for document 2 becomes:

document 2: (10, 4, 2, …, d_2,n)
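Both the binary and count-based vectors can be built the same way: count each term in the document and read the counts off in term-space order. A sketch (function and variable names are illustrative):

```python
def term_vector(doc_tokens, term_space):
    """Map a tokenized document onto the corpus term space.
    Each component is the count of that term in the document;
    replacing each count with min(count, 1) gives the binary model."""
    counts = {}
    for tok in doc_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return [counts.get(term, 0) for term in term_space]

# The slide's example: "gnarly" x10, "rocket" x4, "pencil" x2.
term_space = ["gnarly", "rocket", "pencil"]
doc2 = ["gnarly"] * 10 + ["rocket"] * 4 + ["pencil"] * 2
# term_vector(doc2, term_space) gives the first three components [10, 4, 2]
```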

Vector Space Model: the model can also be modified by weighting the different components. The user can manually increase a weight (boost), or weights can be assigned automatically, usually based on the frequency of a term as it occurs across the entire collection (corpus). The idea is that an infrequently occurring term should be weighted higher than a frequently occurring one:

document 2: (5 × d_2,0, 2 × d_2,1, 1 × d_2,2, …, d_2,n)

Automatic weighting based on collection (corpus) frequency. Some definitions:

- t = number of distinct terms in the corpus
- tf_ij = number of occurrences of term t_j in document D_i (called the term frequency)
- df_j = number of documents that contain t_j (called the document frequency)
- idf_j = log(d / df_j), where d is the total number of documents (called the inverse document frequency)

Automatic weighting based on collection (corpus) frequency: the weighting factor is a combination of term frequency & inverse document frequency, so the value of the jth entry in the vector associated with document i is:

d_ij = tf_ij × idf_j

(Information Retrieval; Grossman, Frieder)
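The definitions above translate directly into code. A sketch of tf-idf weighting over tokenized documents (names and the sample corpus are illustrative, not from the lecture):

```python
import math

def tfidf_vectors(docs, term_space):
    """Compute d_ij = tf_ij * idf_j for every document i and term j,
    where idf_j = log(d / df_j) and d is the total number of documents."""
    d = len(docs)
    # df_j: number of documents containing term t_j
    df = {t: sum(1 for doc in docs if t in doc) for t in term_space}
    vectors = []
    for doc in docs:
        row = []
        for t in term_space:
            tf = doc.count(t)                           # tf_ij
            idf = math.log(d / df[t]) if df[t] else 0.0  # idf_j
            row.append(tf * idf)                         # d_ij
        vectors.append(row)
    return vectors

docs = [["gnarly", "gnarly", "rocket"], ["pencil"], ["rocket", "pencil"]]
term_space = ["gnarly", "rocket", "pencil"]
vectors = tfidf_vectors(docs, term_space)
# "gnarly" appears in only 1 of 3 docs, so its idf (log 3) is higher than
# that of "rocket" or "pencil" (log 1.5), which appear in 2 of 3 docs.
```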

Inverted Index Vector space models often use inverted indices that are constructed prior to query time (NIST) (wiki 022709)
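An inverted index maps each term to the list of documents that contain it, so a query term can be looked up directly instead of scanning every document. A minimal build-time sketch (names are illustrative; real engines like Lucene also store positions and frequencies):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> sorted list of doc IDs, once, prior to query time."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {0: ["life", "is", "a", "box"], 1: ["box", "of", "chocolates"]}
index = build_inverted_index(docs)
# index["box"] lists every document containing "box": both 0 and 1
```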

Tokenization: breaking the text up into chunks (tokens) for indexing & analysis. Examples:

- “Life is a box of chocolates”
- “Mr. O’Neill said that there aren’t any.”

The second sentence shows that tokenization choices are ambiguous:

- O’Neill → O’Neill | ONeill | Neill | O’ Neill | O Neill
- aren’t → aren’t | arent | are n’t | aren t
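One naive policy is to lowercase and split on every run of non-alphanumeric characters, which resolves both ambiguities by splitting at the apostrophe. A sketch (this is just one of the candidate policies above, not the lecture's choice):

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase, then split on runs of non-alphanumerics.
    Apostrophes therefore split tokens: O'Neill -> 'o', 'neill'."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

tokens = tokenize("Mr. O'Neill said that there aren't any.")
# -> ['mr', 'o', 'neill', 'said', 'that', 'there', 'aren', 't', 'any']
```

A production analyzer would make more careful choices (keeping "aren't" whole, for instance), which is why search engines make the tokenizer pluggable.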

Stop Words: dropping the words from the text to be indexed that don’t add much value to assisting with search – the, and, a, of, at, as, he, she, it, … [etc]
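Stop-word removal is a simple filter over the token stream. A sketch using the slide's word list (the set here is illustrative; real stop lists are longer and language-specific):

```python
# Stop words from the slide, plus "is" for the example sentence.
STOP_WORDS = {"the", "and", "a", "of", "at", "as", "he", "she", "it", "is"}

def remove_stop_words(tokens):
    """Drop tokens that add little value to search."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["life", "is", "a", "box", "of", "chocolates"])
# -> ['life', 'box', 'chocolates']
```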

Stemming: we don’t want our word matching to be so exact that we miss relevant variant forms, e.g. democracy, democratization, democrat, democratic. Porter stemming (1980) is the classic algorithm – it is involved, but here is the idea – (http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
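The idea can be illustrated with a toy suffix-stripper. This is NOT the Porter algorithm – the suffix list below is contrived just so the slide's four "democra-" forms collapse to one stem:

```python
def crude_stem(word):
    """Toy suffix-stripping sketch (not Porter): remove the first matching
    suffix, but only if a reasonably long stem remains."""
    for suffix in ("atization", "atic", "acy", "at"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# All four variant forms from the slide map to the same stem "democr",
# so a search for any one of them can match documents containing the others.
stems = {crude_stem(w) for w in
         ("democracy", "democratization", "democrat", "democratic")}
```

The real Porter stemmer applies ordered rewrite rules in several passes, with conditions on the measure of the remaining stem; see the Stanford IR-book link above.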