1 Indexing

2 Efficient Retrieval
Documents x terms matrix:

        t_1    t_2    ...  t_j    ...  t_m    nf
d_1     w_11   w_12   ...  w_1j   ...  w_1m   1/|d_1|
d_2     w_21   w_22   ...  w_2j   ...  w_2m   1/|d_2|
...
d_i     w_i1   w_i2   ...  w_ij   ...  w_im   1/|d_i|
...
d_n     w_n1   w_n2   ...  w_nj   ...  w_nm   1/|d_n|

w_ij is the weight of the term t_j in document d_i.
Most of the w_ij's are zero.

3 Naïve Retrieval
Given the query q = (q_1, q_2, ..., q_j, ..., q_m), with nf = 1/|q|, how do we evaluate q (i.e., compute the similarity between q and each document)?
Method 1: Compare q with each document.
Data structure for documents:
d_i : ((t_1, w_i1), (t_2, w_i2), ..., (t_j, w_ij), ..., (t_m, w_im), 1/|d_i|)
–Only terms with positive weights are stored.
–Terms are in alphabetical order.
Data structure for the query:
q : ((t_1, q_1), (t_2, q_2), ..., (t_j, q_j), ..., (t_m, q_m), 1/|q|)

4 Naïve Retrieval
Method 1: Compare q with all the documents (cont.)
Algorithm:
  initialize all sim(q, d_i) = 0;
  for each document d_i (i = 1, ..., n) {
    for each term t_j (j = 1, ..., m)
      if t_j appears in both q and d_i
        sim(q, d_i) += q_j × w_ij;
    sim(q, d_i) = sim(q, d_i) × (1/|q|) × (1/|d_i|);
  }
  rank the documents in descending order of similarity and show the top k to the user.
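A minimal Python sketch of Method 1, for illustration only. The function name naive_rank and the dict-based document representation are assumptions, not part of the slides; the sketch also looks up query terms in each document instead of scanning all m terms, but it still visits every document, which is the cost the next slide criticises. It assumes every vector has at least one non-zero weight.

```python
import math

def naive_rank(query, docs, k):
    """Score every document against the query (Method 1).

    query: dict mapping term -> weight q_j
    docs:  dict mapping doc id -> dict of term -> weight w_ij (non-zero only)
    Returns the k best (doc id, similarity) pairs, best first.
    """
    q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))  # 1/|q|
    scores = {}
    for doc_id, doc in docs.items():           # every document is visited
        d_nf = 1.0 / math.sqrt(sum(w * w for w in doc.values()))  # 1/|d_i|
        dot = sum(q_w * doc[t] for t, q_w in query.items() if t in doc)
        scores[doc_id] = dot * q_nf * d_nf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

Calling naive_rank({"t1": 1, "t3": 1}, docs, 2) returns the two highest-scoring documents for that query.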

5 Indexing
Method 1 is not efficient: all the zero entries in the documents x terms matrix are accessed.
Method 2: Use an inverted index file.
Several data structures:
For each term t_j, an inverted list of all the documents that contain t_j is created:
I(t_j) = { (d_1, w_1j), (d_2, w_2j), ..., (d_i, w_ij), ..., (d_n, w_nj) }
–d_i is the identifier of the i-th document;
–Only non-zero entries are stored.

6 Indexing
Method 2: Use of inverted index file (cont.)
Several data structures:
Normalization factors of the documents are pre-computed and stored in a vector: nf[i] stores 1/|d_i|.
A hash table over all the terms in the collection is created: the entry for t_j points to I(t_j).
Inverted lists are typically stored on disk.
The number of distinct terms is typically very large.
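The three structures of Method 2 (inverted lists, the nf vector, and the term hash table) can be sketched in memory as follows. This is an assumption-laden illustration: build_index is a hypothetical name, a Python dict stands in for the hash table, the lists live in memory and not on disk, and every document is assumed to have at least one non-zero weight.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build inverted lists and normalization factors.

    docs: dict mapping doc id -> dict of term -> weight w_ij
    Returns (inverted, nf) where
      inverted[t_j] is I(t_j), a list of (doc id, w_ij) pairs, and
      nf[doc id] is 1/|d_i|.
    """
    inverted = defaultdict(list)   # the dict plays the role of the hash table
    nf = {}
    for doc_id, vector in docs.items():
        nf[doc_id] = 1.0 / math.sqrt(sum(w * w for w in vector.values()))
        for term, weight in vector.items():
            if weight != 0:        # only non-zero entries are stored
                inverted[term].append((doc_id, weight))
    return dict(inverted), nf
```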

7 Inverted file creation
[Figure: Dictionary and Pointers]

8 Retrieval with inverted lists
Algorithm:
  initialize all sim(q, d_i) = 0;
  for each term t_j in q {
    find I(t_j) using the hash table;
    for each (d_i, w_ij) in I(t_j)
      sim(q, d_i) += q_j × w_ij;
  }
  for each document d_i
    sim(q, d_i) = sim(q, d_i) × (1/|q|) × nf[i];
  rank the documents in descending order of similarity and show the top k to the user.
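A hedged Python sketch of this algorithm. The name rank_with_inverted_lists and the dict-based index layout are assumptions made for the example; unlike the pseudocode, the sketch only creates accumulators for documents that actually appear in some inverted list of the query.

```python
import math
from collections import defaultdict

def rank_with_inverted_lists(query, inverted, nf, k):
    """Evaluate a query term by term over inverted lists.

    query:    dict mapping term -> weight q_j
    inverted: dict mapping term -> list of (doc id, w_ij) pairs
    nf:       dict mapping doc id -> 1/|d_i|
    """
    q_nf = 1.0 / math.sqrt(sum(w * w for w in query.values()))  # 1/|q|
    sim = defaultdict(float)
    for term, q_weight in query.items():
        # A term absent from the collection has no inverted list.
        for doc_id, weight in inverted.get(term, []):
            sim[doc_id] += q_weight * weight
    ranked = [(doc_id, s * q_nf * nf[doc_id]) for doc_id, s in sim.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```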

9 Retrieval with inverted lists
Observations about Method 2:
–If a document d does not contain any term of the query q, then d is not involved in the evaluation of q;
–Only non-zero entries of the documents x terms matrix are used for query evaluation;
–The similarities of several documents are computed simultaneously.

10 Retrieval with inverted lists
Example (Method 2):
q = { (t1, 1), (t3, 1) }, 1/|q| = 0.7071
d1 = { (t1, 2), (t2, 1), (t3, 1) }, nf[1] = 0.4082
d2 = { (t2, 2), (t3, 1), (t4, 1) }, nf[2] = 0.4082
d3 = { (t1, 1), (t3, 1), (t4, 1) }, nf[3] = 0.5774
d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) }, nf[4] = 0.2774
d5 = { (t2, 2), (t4, 1), (t5, 2) }, nf[5] = 0.3333
I(t1) = { (d1, 2), (d3, 1), (d4, 2) }
I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }
I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }
I(t4) = { (d2, 1), (d3, 1), (d4, 2), (d5, 1) }
I(t5) = { (d5, 2) }

11 Retrieval with inverted lists
After processing t1:
sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1, sim(q, d4) = 2, sim(q, d5) = 0
After processing t3:
sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2, sim(q, d4) = 4, sim(q, d5) = 0
After normalization:
sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82, sim(q, d4) = .78, sim(q, d5) = 0
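The numbers in this example can be checked with a short Python sketch. The data is copied from the example slides (only I(t1) and I(t3) are needed for this query); the function name evaluate and the snapshot list are illustrative assumptions.

```python
Q_NF = 0.7071  # 1/|q| from the example
NF = {"d1": 0.4082, "d2": 0.4082, "d3": 0.5774, "d4": 0.2774, "d5": 0.3333}
INVERTED = {
    "t1": [("d1", 2), ("d3", 1), ("d4", 2)],
    "t3": [("d1", 1), ("d2", 1), ("d3", 1), ("d4", 2)],
}
QUERY = {"t1": 1, "t3": 1}

def evaluate(query, inverted, nf, q_nf):
    """Return the accumulators after each query term and the final scores."""
    sim = {doc_id: 0 for doc_id in nf}
    snapshots = []
    for term, q_weight in query.items():
        for doc_id, weight in inverted[term]:
            sim[doc_id] += q_weight * weight
        snapshots.append(dict(sim))  # copy of the accumulators after this term
    final = {d: round(s * q_nf * nf[d], 2) for d, s in sim.items()}
    return snapshots, final

snapshots, final = evaluate(QUERY, INVERTED, NF, Q_NF)
print(snapshots[0])  # accumulators after t1
print(snapshots[1])  # accumulators after t3
print(final)         # normalized similarities, rounded to two decimals
```

The ranking that results is d1, d3, d4, d2, with d5 never touched because it shares no term with q.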

12 Efficiency x flexibility
Storing the weights is good for efficiency but bad for flexibility:
–The weights must be recomputed if the tf and idf formulas change.
Flexibility can be improved by storing the tf and df information instead, but efficiency is worse.
A compromise exists:
–Store the tf weights;
–Apply the idf weights to the tf weights of the query terms instead of to the tf weights of the document terms.
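One way to read this compromise is sketched below in Python. Everything beyond the idea itself is an assumption: the function name, the use of log(N/df) as the idf formula (the slides do not fix one), deriving df from the length of each list, and leaving out length normalization.

```python
import math

def score_with_query_side_idf(query_tf, postings, n_docs):
    """Score documents when the index stores only tf.

    query_tf: dict mapping term -> tf of the term in the query
    postings: dict mapping term -> list of (doc id, tf in document)
    n_docs:   number of documents in the collection
    idf is folded into the query weight, so changing the idf formula
    does not require rebuilding the index.
    """
    scores = {}
    for term, q_tf in query_tf.items():
        plist = postings.get(term, [])
        if not plist:
            continue                        # term not in the collection
        idf = math.log(n_docs / len(plist))  # assumed formula; df = list length
        for doc_id, tf in plist:
            scores[doc_id] = scores.get(doc_id, 0.0) + (q_tf * idf) * tf
    return scores
```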

13 Inverted indexes
The inverted index is the main index structure.
Main idea:
–Invert the documents into one big index file.
Basic steps:
–Create a “dictionary” with all the tokens in the collection;
–For each token, list all the documents in which the token occurs;
–Process the structure to remove redundancy.

14 Inverted indexes
An inverted file is composed of vectors, with each row corresponding to a term's list of documents; it is the transpose of the documents x terms matrix.

15 Inverted index files creation
Documents are analysed to extract tokens; each token is saved with its doc-ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

16 Inverted index files creation
After all the documents have been analysed, the index file is sorted alphabetically.

17 Inverted index files creation
Multiple entries of a term for a single document can be merged.
Information about term frequencies is compiled.

18 Inverted index files creation
The file can then be split into:
–a “Dictionary” file;
–a “Pointer” file.
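The creation steps (extract tokens with doc-IDs, sort, merge duplicates while counting frequencies, split into dictionary and pointers) can be sketched in Python as below. The tokenizer (lowercased runs of letters), the function name, and the in-memory layout of the two "files" are assumptions made for illustration.

```python
import re
from collections import Counter

def build_inverted_file(docs):
    """docs: dict mapping doc id -> text.

    Returns (dictionary, pointers) where
      dictionary[term] = [number of docs containing term, offset into pointers]
      pointers         = flat list of (doc id, term frequency) entries
    """
    # Step 1: extract (token, doc id) pairs.
    pairs = [(tok, doc_id)
             for doc_id, text in docs.items()
             for tok in re.findall(r"[a-z]+", text.lower())]
    # Step 2: sort alphabetically by token, then by doc id.
    pairs.sort()
    # Step 3: merge repeated (token, doc id) entries, counting term frequency.
    # Counter keeps first-insertion order, so the sorted order is preserved.
    counts = Counter(pairs)
    # Step 4: split into a dictionary and a pointer list.
    dictionary, pointers = {}, []
    for (term, doc_id), tf in counts.items():
        if term not in dictionary:
            dictionary[term] = [0, len(pointers)]
        dictionary[term][0] += 1
        pointers.append((doc_id, tf))
    return dictionary, pointers
```

For a term, the dictionary entry gives the count n and the offset, and pointers[offset:offset + n] is that term's list.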

19 Inverted index files
They allow faster access to individual terms.
For each term, a list is obtained with:
–The identity of the document: doc-ID;
–The term frequency in the document;
–The position of the term in the document.
These lists can be used to answer Boolean queries:
country -> d1, d2
manor -> d2
country AND manor -> d2
They can also be used in ranking algorithms.

20 Use of inverted files
[Figure: Dictionary and Pointers]
Query: “time” AND “dark”
2 docs with “time” in the dictionary -> IDs 1 and 2 in the pointer file.
1 doc with “dark” in the dictionary -> ID 2 in the pointer file.
So only doc 2 satisfies the query.
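A Boolean AND over two lists of doc-IDs can be computed with a linear merge, sketched here in Python. The slides do not prescribe an intersection method, so the merge, the function name intersect, and the assumption that each list is sorted by doc-ID are all illustrative choices.

```python
def intersect(a, b):
    """Return the doc-IDs present in both sorted lists a and b."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:       # doc contains both terms
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:      # advance the list with the smaller doc-ID
            i += 1
        else:
            j += 1
    return result

# Doc-ID lists for the two example documents.
postings = {"time": [1, 2], "dark": [2], "country": [1, 2], "manor": [2]}
print(intersect(postings["time"], postings["dark"]))      # "time" AND "dark"
print(intersect(postings["country"], postings["manor"]))  # "country" AND "manor"
```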

