3: Search & retrieval: Structures

1 3: Search & retrieval: Structures

2 The dog stopped attacking the cat, that lived in U.S.A.
  collection / corpus / database / web: docs d1…..dn
  docs are processed into a term-doc matrix plus meta-data:
    term    d1  dn
    dog     1   1
    stop    1   0
    attack  1   1
    cat     1   0
    live    1   0
    USA     1   0

3 Term-document matrix
  Entry is 1 if the play contains the word, 0 otherwise.
  Example query: Brutus AND Caesar but NOT Calpurnia
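
A minimal sketch of how such a Boolean query could be answered from a term-document incidence matrix. The play names and 0/1 values below are illustrative assumptions and are not taken from the slide; `matching_plays` is a hypothetical helper name.

```python
# Illustrative incidence vectors (assumed data): one entry per play,
# 1 if the play contains the word, 0 otherwise.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
INCIDENCE = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def matching_plays(required, excluded):
    """Return plays that contain every required term and no excluded term."""
    hits = []
    for i, play in enumerate(PLAYS):
        has_required = all(INCIDENCE[t][i] == 1 for t in required)
        has_excluded = any(INCIDENCE[t][i] == 1 for t in excluded)
        if has_required and not has_excluded:
            hits.append(play)
    return hits

# Brutus AND Caesar but NOT Calpurnia
result = matching_plays(["brutus", "caesar"], ["calpurnia"])
print(result)
```

Scanning a column per play keeps the example readable; a real system would more likely AND the rows as bit vectors or intersect postings lists.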

4 Inverted index construction
  Documents to be indexed: Friends, Romans, countrymen.
  -> Tokenizer -> token stream: Friends Romans Countrymen
  -> Linguistic modules -> modified tokens: friend roman countryman
  -> Indexer -> inverted index:
       friend      -> 2, 4
       roman       -> 1, 2
       countryman  -> 13, 16
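
The pipeline on this slide can be sketched roughly as follows. This is only an illustration: the `LEMMAS` lookup and the strip-a-trailing-"s" rule are crude stand-ins for the linguistic modules, and the two sample documents are assumptions, not the collection behind the slide.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules": irregular forms only.
LEMMAS = {"countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split raw text into a stream of word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic module: lowercase, then apply a very crude singular form."""
    token = token.lower()
    if token in LEMMAS:
        return LEMMAS[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def build_index(docs):
    """Indexer: map each modified token to the sorted ids of docs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {1: "Friends, Romans, countrymen.", 2: "A Roman friend."}
index = build_index(docs)
for term, ids in index.items():
    print(term, "->", ids)
```

Keeping postings sorted by document id is what makes intersecting two lists for an AND query a single linear merge.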

5 TF-IDF (ranking)
  binary match (Boolean) vs. probabilistic ranking (similarity)
  term frequency: occurrences of the term in the doc
    tf_{t,d} = frequency of t in doc d
  document frequency: docs with the term
    df_t = number of documents containing term t
  inverse document frequency (n = total docs):
    idf_t = log(n / df_t)
  tf.idf weight for term t in document d: tf_{t,d} × idf_t. It is
    (1) highest when t occurs many times in few documents
    (2) lower when t occurs few times in a document, or in many documents
    (3) lowest when t is frequent in many documents
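
A small sketch of the weighting defined above, assuming the natural logarithm (the slide does not give a base) and three made-up token lists. The name `tf_idf` is only illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc: {term: tf * log(n / df)}} for already-tokenized docs."""
    n = len(docs)
    # tf: raw count of each term in each document
    tf = {d: Counter(tokens) for d, tokens in docs.items()}
    # df: number of documents in which each term appears at least once
    df = Counter(t for counts in tf.values() for t in counts)
    return {
        d: {t: c * math.log(n / df[t]) for t, c in counts.items()}
        for d, counts in tf.items()
    }

# Assumed toy collection, not from the slides.
docs = {
    "d1": ["dog", "dog", "attack", "cat"],
    "d2": ["dog", "attack"],
    "d3": ["dog", "live"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

In this toy collection "dog" appears in every document, so its weight is zero even though it is the most frequent term in d1, while "cat", which appears in one document only, gets the largest weight.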

6 Documents as vectors
  Each doc d can now be viewed as a vector of wf × idf values, one component for each term.
  So we have a vector space:
  - terms are axes
  - docs live in this space
  - even with stemming, may have 50,000+ dimensions (axes)

7 High-dimensional vector space
  Postulate: Documents that are “close together” in the vector space talk about the same things.
  [Figure: documents d1 to d5 drawn as vectors on term axes t1, t2, t3, with angles θ and φ between them; the example documents are built from the terms dog, cat and attack.]
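
One common way to make "close together" concrete is the cosine of the angle between two document vectors. The sketch below assumes raw term counts on three axes (dog, cat, attack) rather than full tf.idf weights, and the sample documents are invented for illustration.

```python
import math

AXES = ["dog", "cat", "attack"]  # one axis per term

def to_vector(tokens):
    """Place a tokenized document in term space using raw counts."""
    return [tokens.count(term) for term in AXES]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = to_vector(["dog", "attack", "dog", "attack"])
d2 = to_vector(["dog", "attack"])
d3 = to_vector(["cat", "cat"])

print(cosine(d1, d2))  # same direction, different length
print(cosine(d1, d3))  # no shared terms
```

Because cosine ignores vector length, d1 and d2 count as maximally similar even though d1 is twice as long, which is usually the desired behaviour when documents differ in size.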

8 Classic IR: match query to indexed docs
  Re-articulating the need as a query
  Faceted search: “chunking” and “aliasing”

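9 Precision = relevant returned / returned
  Recall = relevant returned / relevant
  Polysemy: one term, several concepts (Term1 -> Concept1, Concept2, Concept3)
  Synonymy: one concept, several terms (Term1, Term2, Term3 -> Concept1)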
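
A short worked example of the two measures, taking precision as the share of returned documents that are relevant and recall as the share of relevant documents that were returned. The document ids are made up, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from collections of document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents returned, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3: returning more documents could raise recall by picking up d7, but only at the risk of lowering precision.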

10 “Text” processing
  200 factors
  Document similarity: like tf.idf
  Web page: update, au, anchor
  Link structure: PageRank
  Google: commercial, ad populum fallacy
  GoogleScholar: indexing, 10-50% accessible
