Published by Faith Horn. Modified over 3 years ago.

1
**Basic IR: Modeling**

Basic IR task: match a subset of documents to the user's query.

Slightly more complex: and rank the resulting documents by predicted relevance.

The derivation of relevance leads to different IR models.

2
**Concepts: Term-Document Incidence**

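Imagine a matrix of terms × documents, with 1 when the term appears in the document and 0 otherwise. How are queries satisfied? What are the problems?

[Figure: example incidence matrix with term rows search, segment, select, semantic, … and document columns including MIR and AI]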
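A minimal sketch of a term-document incidence matrix. The toy corpus, the document ids, and the helper name `incidence_matrix` are hypothetical and only illustrate the idea; whitespace tokenization is an assumption.

```python
# Hypothetical toy corpus; ids and texts are illustrative only.
docs = {
    "d1": "search engines select documents",
    "d2": "semantic search",
    "d3": "segment the text",
}

def incidence_matrix(docs):
    """Return (terms, doc_ids, matrix); matrix[i][j] is 1 if term i occurs in doc j."""
    doc_ids = sorted(docs)
    # Naive whitespace tokenization (an assumption, not a real tokenizer).
    tokens = {d: set(docs[d].lower().split()) for d in doc_ids}
    terms = sorted(set().union(*tokens.values()))
    matrix = [[1 if t in tokens[d] else 0 for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = incidence_matrix(docs)
row = dict(zip(terms, matrix))
print(doc_ids)
print("search  ", row["search"])
print("semantic", row["semantic"])
```

Each row records only presence or absence, so a document mentioning a term once looks the same as one mentioning it fifty times.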

3
**Concepts: Term Frequency**

To support document ranking, we need more than just term incidence. Term frequency records the number of times a given term appears in each document. Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
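Counting term frequency can be sketched with the standard library. The sample sentence and the function name `term_frequencies` are made up for illustration, and splitting on whitespace is an assumption.

```python
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in one document."""
    return Counter(text.lower().split())

tf = term_frequencies("search engines rank search results by search relevance")
print(tf["search"], tf["rank"], tf["missing"])
```

A `Counter` returns 0 for terms that never occur, which matches the convention that an absent term carries no weight.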

4
**Concept: Term Weight**

Weights represent the importance of a given term for characterizing a document. wij is a weight for term i in document j.

5
**Mapping Task and Document Type to Model**

- Logical view of documents: Index Terms, Full Text, Full Text + Structure
- Searching (Retrieval): Classic, Structured
- Surfing (Browsing): Flat, Hypertext, Structure Guided

6
**IR Models**

- User task
  - Retrieval: Adhoc, Filtering
  - Browsing
- Classic Models: boolean, vector, probabilistic
  - Set Theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
  - Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext

(from MIR text)

7
**Classic Models: Basic Concepts**

- ki is an index term
- dj is a document
- t is the total number of index terms
- K = (k1, k2, …, kt) is the set of all index terms
- wij >= 0 is a weight associated with (ki, dj)
- wij = 0 indicates that the term does not belong to the doc
- vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
- gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)

8
**Classic: Boolean Model**

- Based on set theory: map queries with Boolean operations to set operations
- Select documents from the term-document incidence matrix
- Pros:
- Cons:
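The mapping from Boolean operators to set operations can be sketched as follows. The corpus, the helper `build_postings`, and the example queries are hypothetical; this is an illustration under those assumptions, not a full query parser.

```python
def build_postings(docs):
    """Map each term to the set of ids of documents that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, set()).add(doc_id)
    return postings

# Hypothetical toy corpus.
docs = {
    "d1": "search engines select documents",
    "d2": "semantic search",
    "d3": "segment the text",
}
postings = build_postings(docs)
all_docs = set(docs)

and_result = postings["search"] & postings["semantic"]  # AND -> intersection
or_result = postings["search"] | postings["segment"]    # OR  -> union
not_result = all_docs - postings["search"]              # NOT -> complement

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

The result of every query is an unordered set: a document either matches or it does not, so there is no ranking.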

9
**Exact Matching Ignores…**

- term frequency in document
- term scarcity in corpus
- size of document
- ranking

10
**Vector Model**

- Vector of term weights based on term frequency
- Compute similarity between query and document, where both are vectors:
  - vec(dj) = (w1j, w2j, ..., wtj)
  - vec(q) = (w1q, w2q, ..., wtq)
- Similarity is the cosine of the angle between the vectors.

11
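**Cosine Measure**

[Figure: document vector dj and query vector q with the angle between them]

$$\mathrm{sim}(q, d_j) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}$$

Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1.

(from MIR notes)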
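A minimal sketch of the cosine measure for two weight vectors of equal length. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_d * norm_q)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # no shared terms
print(cosine_similarity([1.0, 1.0], [2.0, 2.0]))  # same direction
```

Because the norms divide out vector length, a long document and a short one pointing in the same direction score the same.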

12
**How to Set Wij Weights? TF-IDF**

- Within a document: term frequency (tf) measures term density within a document.
- Across documents: inverse document frequency (idf) measures the informativeness or rarity of a term across the corpus.

13
**TF * IDF Computation**

What happens as the number of occurrences in a document increases? What happens as a term becomes more rare?

14
**TF * IDF**

- TF may be normalized: tf(i,d) = freq(i,d) / max(freq(l,d)), where the maximum is taken over all terms l in document d.
- IDF is computed normalized to the size of the corpus, and as a log to make TF and IDF values comparable.
- IDF requires a static corpus.
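A sketch of the weight computation, assuming idf(i) = log(N / n_i) with the natural log, where N is the number of documents and n_i the number of documents containing term i. That form is an assumption that agrees with the log(7/5)-style terms in the worked example later in the deck. The corpus and the name `tf_idf_vectors` are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: tf * idf}."""
    n_docs = len(docs)
    counts = {d: Counter(tokens) for d, tokens in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    weights = {}
    for d, c in counts.items():
        max_freq = max(c.values()) if c else 1
        weights[d] = {
            # Normalized tf times idf (natural log assumed).
            t: (f / max_freq) * math.log(n_docs / df[t])
            for t, f in c.items()
        }
    return weights

# Hypothetical toy corpus.
docs = {
    "d1": ["search", "search", "engine"],
    "d2": ["search", "semantic"],
    "d3": ["segment", "text"],
}
w = tf_idf_vectors(docs)
print(w["d1"])
```

In this toy corpus "engine" ends up with a higher weight in d1 than "search", even though it occurs less often there, because it appears in fewer documents.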

15
**How to Set Wi,q Weights?**

- Create the vector directly from the query
- Use modified tf-idf

16
**The Vector Model: Example**

[Figure: documents and query plotted over index terms k1, k2, k3]

Which document seems to best match the query? What would we expect the ranking to be?

(from MIR notes)

17
**The Vector Model: Example (cont.)**

1. Compute the TF-IDF vector for each document.

For the first document:

- K1: (2/2) * log(7/5) = .33
- K2: 0 * log(7/4) = 0
- K3: (1/2) * log(7/3) = .42

For the rest: [ ], [ ], [ ], [ ], [ ], [ ]

Without normalization, the TF-IDF for the first document is k1 = 2 * log(7/5) = .67, k2 = 0 * log(7/4) = 0, k3 = 1 * log(7/3) = .84. Normalized by the maximum term frequency in the document (2), it is k1 = (2/2) * log(7/5) = .33, k2 = 0, k3 = (1/2) * log(7/3) = .42. To match query,

(from MIR notes)

18
**The Vector Model: Example (cont.)**

2. Compute the TF-IDF for the query [1 2 3]:

- K1: (.5 + (.5 * 1)/3) * log(7/5)
- K2: (.5 + (.5 * 2)/3) * log(7/4)
- K3: (.5 + (.5 * 3)/3) * log(7/3)

which is: [.22 .47 .85]

19
**The Vector Model: Example (cont.)**

3. Compute the Sim for each document:

D1:

- D1*q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
- |D1| = sqrt((.33^2) + (.42^2)) = .53
- |q| = sqrt((.22^2) + (.47^2) + (.85^2)) = 1.0
- sim = .43 / (.53 * 1.0) = .81

D2: D3: D4: .23 D5: D6: D7: .47
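The D1 computation can be checked with a short script. This is a sketch under stated assumptions: the term counts [2, 0, 1] for D1, the document frequencies 5, 4, 3, and N = 7 are read off the arithmetic in the example; the natural log is assumed because it reproduces the slide values; the function names are hypothetical.

```python
import math

N = 7                 # corpus size, from log(7/...)
DOC_FREQ = [5, 4, 3]  # docs containing k1, k2, k3, from the log denominators

def idf(i):
    return math.log(N / DOC_FREQ[i])  # natural log (assumed)

def doc_weights(freqs):
    """Normalized tf times idf, as in step 1."""
    max_f = max(freqs)
    return [(f / max_f) * idf(i) for i, f in enumerate(freqs)]

def query_weights(freqs):
    """Modified tf-idf for queries, as in step 2."""
    max_f = max(freqs)
    return [(0.5 + 0.5 * f / max_f) * idf(i) for i, f in enumerate(freqs)]

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q)))

d1 = doc_weights([2, 0, 1])
q = query_weights([1, 2, 3])
sim = cosine(d1, q)
print([round(x, 2) for x in d1], [round(x, 2) for x in q], round(sim, 2))
```

Carrying full precision through the calculation still gives a similarity of about .81 for D1. Small differences in intermediate digits come from the slides truncating values such as .336 to .33.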

20
**Vector Model Implementation Issues**

- Sparse term × document matrix
- Store term count, term weight, or weighted by idfi?
- What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
- How to efficiently compute cosine for a large index?

21
**Heuristics for Computing Cosine for Large Index**

- Select from only non-zero cosines
- Focus on non-zero cosines for rare (high idf) words
- Pre-compute document adjacency:
  - for each term, pre-compute the k nearest docs
  - for a t-term query, compute cosines from the query to the union of the t pre-computed lists, and choose the top k
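The first heuristic (score only documents with a non-zero cosine) can be sketched with an inverted index and per-document accumulators. This is one common way to do it, not necessarily what the slides intend; the names `build_index` and `top_k` and the toy vectors are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def build_index(doc_vectors):
    """doc_vectors: doc_id -> {term: weight}. Returns (postings, norms)."""
    postings = defaultdict(list)
    norms = {}
    for doc_id, vec in doc_vectors.items():
        norms[doc_id] = math.sqrt(sum(w * w for w in vec.values()))
        for term, w in vec.items():
            if w != 0:
                postings[term].append((doc_id, w))  # store non-zero weights only
    return postings, norms

def top_k(query_vec, postings, norms, k):
    """Score only documents sharing at least one term with the query."""
    scores = defaultdict(float)
    for term, wq in query_vec.items():
        for doc_id, wd in postings.get(term, []):
            scores[doc_id] += wd * wq  # accumulate the dot product
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    ranked = (
        (s / (norms[d] * q_norm), d)
        for d, s in scores.items()
        if norms[d] > 0 and q_norm > 0
    )
    return heapq.nlargest(k, ranked)

# Hypothetical weighted document vectors.
doc_vectors = {
    "d1": {"a": 1.0},
    "d2": {"a": 1.0, "b": 1.0},
    "d3": {"c": 1.0},
}
postings, norms = build_index(doc_vectors)
result = top_k({"a": 1.0}, postings, norms, k=2)
print(result)
```

Document d3 shares no term with the query, so it never gets an accumulator and its cosine is never computed.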

22
**The TFIDF Vector Model: Pros/Cons**

Pros:

- term weighting improves retrieval quality
- the cosine ranking formula sorts documents according to their degree of similarity to the query

Cons:

- assumes independence of index terms
