Presentation on theme: "Basic IR: Modeling". Presentation transcript:
1 Basic IR: Modeling. Basic IR task: match a subset of documents to the user’s query. Slightly more complex: also rank the resulting documents by predicted relevance. The derivation of relevance leads to different IR models.
2 Concepts: Term-Document Incidence. Imagine a matrix of terms × documents, with 1 when the term appears in the document and 0 otherwise. Queries satisfied how? Problems? [Example matrix: term rows search, segment, select, semantic, …; labels MIR, AI; cell values not recoverable]
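A minimal sketch of a term-document incidence matrix, assuming a tiny made-up corpus (the document names `d1` to `d3` and their texts are illustrative, not from the slides):

```python
# Toy corpus; the document names and texts are hypothetical.
docs = {
    "d1": "search and select documents",
    "d2": "semantic search",
    "d3": "segment the text",
}

# Vocabulary: every distinct term, sorted for a stable row order.
terms = sorted({t for text in docs.values() for t in text.split()})

# incidence[term][doc] is 1 if the term occurs in the document, else 0.
incidence = {
    term: {name: int(term in text.split()) for name, text in docs.items()}
    for term in terms
}

# Print one row per term, one column per document.
for term in terms:
    print(f"{term:10s}", [incidence[term][d] for d in docs])
```

Each row answers "which documents contain this term?", which is all a Boolean query needs, and also shows why incidence alone cannot rank: every cell is 0 or 1.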
3 Concepts: Term Frequency. To support document ranking, we need more than just term incidence. Term frequency records the number of times a given term appears in each document. Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
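Counting term frequency can be sketched in a few lines, assuming whitespace tokenization and lowercasing (both simplifications; the sample sentence is made up):

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each whitespace-separated token occurs."""
    return Counter(text.lower().split())

# Hypothetical sample document.
tf = term_frequencies("query terms match document terms and terms repeat")
print(tf.most_common(2))
```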
4 Concept: Term Weight. Weights represent the importance of a given term for characterizing a document. wij is a weight for term i in document j.
5 Mapping Task and Document Type to Model. Document types: Index Terms, Full Text, Full Text + Structure. Searching (Retrieval): Classic, Structured. Surfing (Browsing): Flat, Hypertext, Structure Guided.
6 IR Models (from MIR text). User task: Retrieval (Adhoc, Filtering) and Browsing. Classic Models: boolean, vector, probabilistic. Set Theoretic: Fuzzy, Extended Boolean. Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks. Probabilistic: Inference Network, Belief Network. Structured Models: Non-Overlapping Lists, Proximal Nodes. Browsing: Flat, Structure Guided, Hypertext.
7 Classic Models: Basic Concepts. ki is an index term; dj is a document; t is the total number of index terms; K = {k1, k2, …, kt} is the set of all index terms; wij >= 0 is a weight associated with (ki, dj); wij = 0 indicates that the term does not belong to the doc; vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj; gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj).
8 Classic: Boolean Model. Based on set theory: map queries with Boolean operations to set operations. Select documents from the term-document incidence matrix. Pros: Cons:
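The mapping from Boolean operators to set operations can be sketched as follows, assuming a hypothetical postings table that maps each term to the set of documents containing it (the terms and document ids are made up):

```python
# Hypothetical postings: for each term, the set of documents containing it.
postings = {
    "search": {"d1", "d2"},
    "select": {"d1"},
    "semantic": {"d2", "d4"},
}
all_docs = {"d1", "d2", "d3", "d4"}

def docs_with(term):
    """Documents containing the term; empty set for unknown terms."""
    return postings.get(term, set())

# search AND semantic -> set intersection
and_result = docs_with("search") & docs_with("semantic")
# search OR select -> set union
or_result = docs_with("search") | docs_with("select")
# search AND NOT select -> set difference
and_not_result = docs_with("search") - docs_with("select")
# NOT semantic -> complement with respect to the whole collection
not_result = all_docs - docs_with("semantic")

print(sorted(and_result), sorted(or_result), sorted(and_not_result), sorted(not_result))
```

The result of every query is an unordered set: a document either matches or it does not, so no ranking falls out of this model.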
9 Exact Matching Ignores… term frequency in document; term scarcity in corpus; size of document; ranking.
10 Vector Model. Vector of term weights based on term frequency. Compute similarity between query and document, where both are vectors: vec(dj) = (w1j, w2j, ..., wtj), vec(q) = (w1q, w2q, ..., wtq). Similarity is the cosine of the angle between the vectors.
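11 Cosine Measure. [Figure: vectors dj and q with the angle between them] sim(q, dj) = (vec(dj) · vec(q)) / (|vec(dj)| * |vec(q)|). Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1. (from MIR notes)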
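A minimal sketch of the cosine measure for two dense weight vectors of equal length. Returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify:

```python
import math

def cosine(q, d):
    """Cosine of the angle between weight vectors q and d (same length)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_q * norm_d)

print(cosine([1.0, 0.0, 2.0], [2.0, 0.0, 4.0]))  # parallel vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because the vectors are divided by their lengths, a long document does not score higher merely for containing more words.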
12 How to Set Wij Weights? TF-IDF. Within document: Term Frequency. tf measures term density within a document. Across documents: Inverse Document Frequency. idf measures informativeness or rarity of a term across the corpus.
13 TF * IDF Computation. What happens as the number of occurrences in a document increases? What happens as a term becomes more rare?
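The formula on this slide did not survive extraction. A common formulation, consistent with the log(7/5)-style arithmetic in the worked example slides, is sketched below. The symbols N (number of documents in the corpus) and n_i (number of documents containing term k_i) are names assumed here:

```latex
w_{ij} = tf_{ij} \cdot idf_i, \qquad idf_i = \log \frac{N}{n_i}
```

Under this form, more occurrences of a term in a document raise tf_ij and therefore the weight, and a rarer term has a smaller n_i, which raises idf_i and therefore the weight.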
14 TF * IDF. TF may be normalized: tf(i,d) = freq(i,d) / max over l of freq(l,d). IDF is computed normalized to the size of the corpus, and as a log to make TF and IDF values comparable. IDF requires a static corpus.
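A sketch of the computation under stated assumptions: TF is normalized by the most frequent term in the document, IDF is taken as the natural log of N / n_i, and documents arrive already tokenized. The function name `tfidf_vectors` and the toy corpus are hypothetical:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # df[t]: number of documents that contain term t.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values()) if counts else 1
        # Normalized tf times idf; absent terms get weight 0.
        vectors.append(
            [(counts[t] / max_freq) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

# Hypothetical three-document corpus.
vocab, vectors = tfidf_vectors([["a", "a", "b"], ["a", "c"], ["b", "c", "c"]])
print(vocab)
print(vectors[0])
```

Note that `df` and `n_docs` are computed over the whole corpus up front, which is the sense in which IDF needs a static corpus: adding a document changes every weight.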
15 How to Set Wi,q Weights? Create vector directly from query; use modified tf-idf.
16 The Vector Model: Example. [Figure: documents and query over index terms k1, k2, k3] Which document seems to best match the query? What would we expect the ranking to be? (from MIR notes)
17 The Vector Model: Example (cont.) [Figure: index terms k1, k2, k3] Compute the TF-IDF vector for each document (natural log, values truncated to two decimals). For the first document: k1: (2/2) * log(7/5) = .33; k2: 0 * log(7/4) = 0; k3: (1/2) * log(7/3) = .42; for the rest: [ ], [ ], [ ], [ ], [ ], [ ]. Unnormalized TF-IDF for the first document: k1 is 2 * log(7/5) = .67, k2 is 0 * log(7/4) = 0, k3 is 1 * log(7/3) = .84 [ ]; normalized by the maximum term frequency (2) it is k1 = (2/2) * log(7/5) = .33, k2 = 0, k3 = (1/2) * log(7/3) = .42. To match query, (from MIR notes)
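The arithmetic for the first document can be checked with a short script. The corpus size (7), the document frequencies (5, 4, 3), and the raw counts (2, 0, 1) are read off the expressions on the slide; the natural log is an assumption that matches the slide's values:

```python
import math

N = 7                                    # documents in the example corpus
doc_freq = {"k1": 5, "k2": 4, "k3": 3}   # documents containing each term
freq_d1 = {"k1": 2, "k2": 0, "k3": 1}    # raw counts in the first document

max_freq = max(freq_d1.values())

# Unnormalized weight: raw count times idf.
raw = {k: freq_d1[k] * math.log(N / doc_freq[k]) for k in freq_d1}

# Normalized weight: count divided by the document's maximum count, times idf.
normalized = {
    k: (freq_d1[k] / max_freq) * math.log(N / doc_freq[k]) for k in freq_d1
}

for k in freq_d1:
    print(k, round(raw[k], 4), round(normalized[k], 4))
```

Since the maximum count in this document is 2, each normalized weight is exactly half of the corresponding unnormalized one.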
18 The Vector Model: Example (cont.) [Figure: index terms k1, k2, k3] 2. Compute the TF-IDF for the query [1 2 3]: k1: (.5 + (.5 * 1)/3) * log(7/5); k2: (.5 + (.5 * 2)/3) * log(7/4); k3: (.5 + (.5 * 3)/3) * log(7/3); which is: [ ]
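The query weighting on this slide, (0.5 + 0.5 * freq / max freq) * idf, can be evaluated as a sketch. The natural log is assumed, as in the document weights; the variable names are illustrative:

```python
import math

N = 7                    # documents in the example corpus
doc_freq = [5, 4, 3]     # documents containing k1, k2, k3
query_freq = [1, 2, 3]   # term counts in the query, from the slide

max_q = max(query_freq)

# Modified tf for the query: 0.5 + 0.5 * (freq / max freq), then times idf.
query_weights = [
    (0.5 + 0.5 * f / max_q) * math.log(N / n)
    for f, n in zip(query_freq, doc_freq)
]

print([round(w, 4) for w in query_weights])
```

The 0.5 offset keeps every query term that occurs at least once at half weight or more, so a term mentioned once is not swamped by one mentioned three times.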
20 Vector Model Implementation Issues. Sparse term × document matrix. Store term count, term weight, or weighted by idfi? What if the corpus is not fixed (e.g., the Web)? What happens to IDF? How to efficiently compute cosine for a large index?
21 Heuristics for Computing Cosine for Large Index. Select from only non-zero cosines. Focus on non-zero cosines for rare (high idf) words. Pre-compute document adjacency: for each term, pre-compute k nearest docs; for a t-term query, compute cosines from the query to the union of t pre-computed lists, choose top k.
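The first heuristic (score only documents with a non-zero cosine) can be sketched with an inverted index that is walked one query term at a time. This is an illustration under assumptions, not the slides' implementation: `build_index` and `top_k` are hypothetical names, weights are given as sparse dicts, and the pre-computed nearest-docs lists are not covered.

```python
import heapq
import math
from collections import defaultdict

def build_index(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}}. Returns (postings, norms)."""
    postings = defaultdict(list)
    norms = {}
    for doc_id, weights in doc_vectors.items():
        norms[doc_id] = math.sqrt(sum(w * w for w in weights.values()))
        for term, w in weights.items():
            if w > 0:
                postings[term].append((doc_id, w))
    return postings, norms

def top_k(query, postings, norms, k):
    """Rank only documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term, wq in query.items():
        if wq == 0:
            continue
        # Accumulate the dot product over this term's postings list.
        for doc_id, wd in postings.get(term, []):
            scores[doc_id] += wq * wd
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    ranked = ((s / (norms[d] * q_norm), d) for d, s in scores.items())
    return heapq.nlargest(k, ranked)

# Hypothetical weights for three documents.
postings, norms = build_index(
    {"d1": {"a": 1.0, "b": 1.0}, "d2": {"b": 2.0}, "d3": {"c": 1.0}}
)
result = top_k({"b": 1.0}, postings, norms, k=2)
print(result)
```

Document `d3` shares no term with the query, so it is never touched; the cost depends on the lengths of the postings lists for the query terms rather than on the size of the collection.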
22 The TFIDF Vector Model: Pros/Cons. Pros: term-weighting improves quality; cosine ranking formula sorts documents according to degree of similarity to the query. Cons: assumes independence of index terms.