Under The Hood [Part I] Web-Based Information Architectures MSEC 20-760 – Mini II 28-October-2003 Jaime Carbonell.

1 Under The Hood [Part I] Web-Based Information Architectures MSEC 20-760 – Mini II 28-October-2003 Jaime Carbonell

2 Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems

3 The Vector Space Model
Each document d_i and the query q are represented as vectors of term counts over the vocabulary w_1, ..., w_n:
d_i = <c(w_1, d_i), c(w_2, d_i), ..., c(w_n, d_i)>
q = <c(w_1, q), c(w_2, q), ..., c(w_n, q)>
where w_j = the j-th word, and c(w_j, d_i) = the number of occurrences of w_j in document d_i

4 Computing the Similarity
Dot-product similarity: sim(q, d_i) = q . d_i
Cosine similarity: sim(q, d_i) = (q . d_i) / (||q|| ||d_i||)

5 Computing Norms and Products
Dot product: q . d = sum_j c(w_j, q) * c(w_j, d)
Euclidean vector norm (aka "2-norm"): ||v|| = sqrt(sum_j v_j^2)
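The dot product, norm, and cosine similarity above can be sketched in a few lines of Python, representing each document or query as a sparse dictionary mapping terms to counts (the function names here are illustrative, not from the slides):

```python
import math

def dot(q, d):
    # Dot product over the shared vocabulary; q and d map term -> count.
    # Only terms present in both vectors contribute (sparseness).
    return sum(q[t] * d[t] for t in q if t in d)

def norm(v):
    # Euclidean (2-) norm of a term-count vector.
    return math.sqrt(sum(c * c for c in v.values()))

def cosine(q, d):
    # Cosine similarity: dot product scaled by both vector norms.
    nq, nd = norm(q), norm(d)
    return dot(q, d) / (nq * nd) if nq and nd else 0.0
```

Because cosine normalizes by document length, a long document does not outrank a short one merely by repeating terms.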

6 Similarity in Retrieval
Similarity ranking: if sim(q, d_i) > sim(q, d_j), then d_i ranks higher
Retrieving top k documents: return the k documents with the highest sim(q, d_i)
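Top-k retrieval is then a sort of the collection by similarity score. A minimal sketch, taking the similarity function as a parameter (names are illustrative):

```python
def retrieve_top_k(query, docs, k, sim):
    # Rank every document by sim(query, doc), highest first,
    # and keep the k best. Python's sort is stable, so ties
    # preserve the original collection order.
    ranked = sorted(docs, key=lambda d: sim(query, d), reverse=True)
    return ranked[:k]
```

A production system would avoid scoring every document at all, which is exactly what the inverted-index material later in the deck addresses.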

7 Refinements to VSM (1)
Word normalization
- Words in morphological root form: countries => country, interesting => interest
- Stemming as a fast approximation: countries, country => countr; moped => mop
- Reduces vocabulary (always good)
- Generalizes matching (usually good)
- More useful for non-English IR (Arabic has > 100 variants per verb)
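A toy suffix-stripping stemmer illustrates both the speed of the approximation and its failure mode (the "moped => mop" over-stemming above). This is only a sketch; a real system would use something like Porter's algorithm:

```python
def crude_stem(word):
    # Strip one common suffix, longest candidates first, but only
    # if at least a 3-letter stem remains. Purely illustrative.
    for suffix in ("ies", "ing", "ed", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word
```

It reproduces the slide's examples, including the erroneous one: "countries" and "country" both collapse to "countr", and "moped" wrongly becomes "mop".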

8 Refinements to VSM (2)
Stop-Word Elimination
- Discard articles, auxiliaries, prepositions, ... typically the 100-300 most frequent small words
- Reduces document "length" by 30-40%
- Retrieval accuracy improves slightly (5-10%)
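Stop-word elimination is a simple set-membership filter over the token stream. A minimal sketch with a tiny illustrative stop list (a real one would hold the 100-300 most frequent function words):

```python
# Tiny illustrative stop list; real systems use 100-300 words.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are"}

def remove_stopwords(tokens):
    # Keep only content-bearing tokens; comparison is case-insensitive.
    return [t for t in tokens if t.lower() not in STOPWORDS]
```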

9 Refinements to VSM (3)
Proximity Phrases
- E.g.: "air force" => airforce
- Found by high mutual information:
  p(w1 w2) >> p(w1) p(w2)
  p(w1 & w2 in k-window) >> p(w1 in k-window) p(w2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases => inefficiency
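The "high mutual information" test can be made concrete as pointwise mutual information over corpus counts: a large positive value means the pair co-occurs far more than chance predicts, flagging it as a candidate phrase. A sketch (function and counts are illustrative):

```python
import math

def pointwise_mi(count_xy, count_x, count_y, n):
    # PMI = log2( p(x,y) / (p(x) p(y)) ), estimated from counts
    # out of n observed windows. PMI >> 0 suggests a phrase.
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log2(p_xy / (p_x * p_y))
```

For instance, if "air" and "force" each occur in 100 of 10,000 windows but co-occur in 100, the co-occurrence rate is 100x chance and PMI is strongly positive; if they co-occur only once, PMI is 0 (exactly chance level).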

10 Refinements to VSM (4) Words => Terms term = word | stemmed word | phrase Use exactly the same VSM method on terms (vs words)

11 Evaluating Information Retrieval (1)
Contingency table:
                 relevant   not-relevant
retrieved           a            b
not retrieved       c            d
Recall = a/(a+c) = fraction of relevant documents retrieved
Precision = a/(a+b) = fraction of retrieved documents that are relevant

12 Evaluating Information Retrieval (2)
P = a/(a+b), R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R (false negatives)
F/A = b/(a+b+c+d) (false positives)
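The metrics follow directly from the four contingency-table cells. A small sketch computing them together (names are illustrative):

```python
def ir_metrics(a, b, c, d):
    # a: relevant & retrieved      b: not-relevant & retrieved
    # c: relevant & not retrieved  d: not-relevant & not retrieved
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f1, accuracy
```

E.g., retrieving 10 documents of which 8 are relevant, while missing 2 relevant ones in a 100-document collection, gives P = R = F1 = 0.8 but accuracy 0.96, which shows why accuracy alone is misleading when relevant documents are rare.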

13 Evaluating Information Retrieval (3)
11-point precision curves
- The IR system generates a total ranking of the collection
- Plot precision at recall levels 0%, 10%, 20%, ..., 100%
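A sketch of the 11-point curve, using the standard interpolation (precision at recall level r is the maximum precision at any achieved recall >= r; this interpolation convention is an assumption, as the slide does not spell it out):

```python
def eleven_point_precision(ranking, relevant):
    # ranking: doc ids in ranked order; relevant: set of relevant ids.
    total_rel = len(relevant)
    points = []  # (recall, precision) measured at each relevant hit
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_rel, hits / i))
    curve = []
    for level in [x / 10 for x in range(11)]:
        # Interpolated precision: best precision at recall >= level.
        ps = [p for r, p in points if r >= level]
        curve.append(max(ps) if ps else 0.0)
    return curve
```

For a ranking d1, d2, d3, d4 with d1 and d3 relevant, precision is 1.0 up to 50% recall and 2/3 from 60% recall on.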

14 Query Expansion (1)
Observations:
- Longer queries often yield better results
- The user's vocabulary may differ from the document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
- Longer queries give more chances for query and document vocabularies to overlap, which helps recall

15 Query Expansion (2)
Bridging the Gap
- Human query expansion (user or expert)
- Thesaurus-based expansion: seldom works in practice (unfocused)
- Relevance feedback: widens a thin bridge over the vocabulary gap by adding words from the document space to the query
- Pseudo-relevance feedback
- Local context analysis

16 Relevance Feedback: Rocchio’s Method
Idea: update the query via user feedback
Exact method (vector sums):
Q' = α·Q + β·Σ(relevant documents) - γ·Σ(irrelevant documents)

17 Relevance Feedback (2)
For example, if:
Q = (heart attack medicine)
W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
D_rel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
W(nitroglycerine,D_rel) = 2, W(medicine,D_rel) = 1
D_irr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
W(attack,D_irr) = 1, W(nitroglycerine,D_irr) = 2, W(explosive,D_irr) = 1
and α = 1, β = 2, γ = 0.5

18 Relevance Feedback (3)
Then:
W(attack,Q’) = 1*1 - 0.5*1 = 0.5
W(nitroglycerine,Q’) = 2*2 - 0.5*2 = 3
W(medicine,Q’) = 1*1 + 2*1 = 3
W(explosive,Q’) = -0.5*1 = -0.5
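The Rocchio update over term-weight vectors can be sketched directly, applying α·Q + β·Σ(relevant) − γ·Σ(irrelevant) term by term (the function name and dict representation are illustrative):

```python
def rocchio(query, rel_docs, irr_docs, alpha=1.0, beta=2.0, gamma=0.5):
    # query and each doc are dicts mapping term -> weight.
    # Q' = alpha*Q + beta*sum(rel docs) - gamma*sum(irr docs)
    terms = set(query)
    for d in rel_docs + irr_docs:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0)
        w += beta * sum(d.get(t, 0) for d in rel_docs)
        w -= gamma * sum(d.get(t, 0) for d in irr_docs)
        new_q[t] = w
    return new_q
```

Feeding in the example from the previous slide reproduces its arithmetic: "attack" drops to 0.5, "nitroglycerine" and "medicine" rise to 3, and "explosive" goes negative.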

19 Term Weighting Methods (1)
Salton’s Tf*IDf
- Tf = term frequency in a document
- Df = document frequency of the term = number of documents in the collection containing this term
- IDf = Df^-1 (inverse document frequency)

20 Term Weighting Methods (2)
Salton’s Tf*IDf
TfIDf = f1(Tf) * f2(IDf)
E.g. f1(Tf) = Tf * avg(|D_j|) / |D|
E.g. f2(IDf) = log2(IDf)
f1 and f2 can differ for Q and D
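A minimal Tf*IDf sketch. Note an assumption: rather than log2 of the slide's raw IDf = 1/Df (which is negative for any Df > 1), this uses the common variant log2(N/Df) with N the collection size, so weights stay non-negative:

```python
import math

def tf_idf(tf, df, n_docs):
    # tf: term frequency in the document; df: number of documents
    # containing the term; n_docs: collection size.
    # Common variant: weight = tf * log2(N / df).
    return tf * math.log2(n_docs / df)
```

A term appearing 3 times but in only 1 of 8 documents gets weight 3 * log2(8) = 9, while a term appearing in every document gets weight 0, i.e., it carries no discriminating power.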

21 Efficient Implementations of VSM (1)
Exploit sparseness
- Only compute the non-zero multiplies in dot-products
- Do not even look at zero elements (how?)
- => Use non-stop terms to index documents

22 Efficient Implementations of VSM (2)
Inverted Indexing
- Find all unique [stemmed] terms in the document collection
- Remove stopwords from the word list
- If the collection is large (over 100,000 documents), [optionally] remove singletons (usually spelling errors or obscure names)
- Alphabetize or use a hash table to store the list
- For each term, create a data structure like:

23 Efficient Implementations of VSM (3)
[term_i, IDF(term_i), <doc_i, freq(term_i, doc_i); doc_j, freq(term_i, doc_j); ...>]
or, with positions:
[term_i, IDF(term_i), <doc_i, freq(term_i, doc_i), [pos_1,i, pos_2,i, ...]; doc_j, freq(term_i, doc_j), [pos_1,j, pos_2,j, ...]; ...>]
where pos_1,j indicates the first position of the term in document j, and so on.
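The positional variant of that structure can be built in one pass over tokenized documents. A sketch mapping each term to a postings list of (doc_id, frequency, positions) tuples; IDF would be computed afterward from the postings-list lengths:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: list of terms in order}.
    # Returns {term: [(doc_id, freq, [positions]), ...]}.
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            entry = index[term].setdefault(doc_id, [0, []])
            entry[0] += 1          # bump frequency
            entry[1].append(pos)   # record where the term occurred
    return {t: [(d, f, p) for d, (f, p) in postings.items()]
            for t, postings in index.items()}
```

At query time, only documents appearing in some query term's postings list are touched, which is exactly how the zero elements of the dot product are never looked at.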

24 Open Research Problems in IR (1)
Beyond VSM
- Vectors in different spaces: Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (language modeling): P(D|Q) = P(Q|D)P(D)/P(Q)

25 Open Research Problems in IR (2)
Beyond Relevance
- Appropriateness of the document to the user (comprehension level, etc.)
- Novelty of the information in the document to the user (anti-redundancy as an approximation to novelty)

26 Open Research Problems in IR (3)
Beyond One Language
- Translingual IR
- Transmedia IR

27 Open Research Problems in IR (4)
Beyond Content Queries
"What’s new today?"
"What sort of things do you know about?"
"Build me a Yahoo-style index for X"
"Track the event in this news-story"

