1
Space-Efficient Algorithms for Document Retrieval Veli Mäkinen University of Helsinki Joint work with Niko Välimäki

2
10.7.2007 Space-Efficient Document Retrieval
Introduction
Field | Problem | Solution
Information Retrieval | Document Retrieval | Inverted Index
Combinatorial Pattern Matching | Text Indexing | Suffix tree
Annotations on the diagram: [PST06]; [Mut02]; practice: space limits; theory: time limits; [Sad07 & this paper]

3
Text Indexing
Let T = t_1 t_2 ... t_n be a text string over an ordered alphabet Σ. The text indexing problem is to build an index structure for T that supports the following operations on a given pattern P = p_1 p_2 ... p_m:
- Count(P): how many times does P occur in T?
- List(P): list the occurrence positions of P in T.
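The two operations can be illustrated with a plain suffix array. This is only a minimal sketch, not the index discussed in the talk: the construction is a naive sort, the search is a binary search over truncated suffixes, and the names `build_suffix_array`, `sa_interval`, `count` and `list_occurrences` are made up for the example.

```python
def build_suffix_array(text):
    # Naive construction (assumption: fine for small inputs only):
    # sort the suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of sa whose suffixes start with pattern."""
    m = len(pattern)
    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return hi - lo


def list_occurrences(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    # Report 1-based positions, matching the t_1 ... t_n convention.
    return sorted(p + 1 for p in sa[lo:hi])
```

For example, with `text = "to be or not to be"` and `sa = build_suffix_array(text)`, `count(text, sa, "to")` counts the occurrences of "to" and `list_occurrences(text, sa, "to be")` lists where the phrase starts.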

4
Document Retrieval
Let D = {T_1, T_2, ..., T_k} be a set of text documents of total length n. The document retrieval problem is to build an index for D that supports the following operation on a given pattern P = p_1 p_2 ... p_m:
- Find(P): list the documents that contain P (in the order of relevance, ...)

5
Inverted Index & Document Retrieval
Creating an inverted file over Shakespeare's plays:
...
be: (d1,4) (d1,18) ... (d2,74) (d2,139) ...
...
to: (d1,1) (d1,15) ... (d2,136) ...
...
Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
d1 (Hamlet): To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;
d2 (Merchant of Venice): PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.
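The phrase query on this slide can be sketched as follows. This is an illustration under assumptions: words are tokenized with a simple regular expression, positions are 1-based character offsets as in the posting lists shown, and the names `build_inverted_index` and `find_phrase` are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased word to a list of (doc_id, 1-based character position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].append((doc_id, match.start() + 1))
    return index


def find_phrase(index, first, second, offset):
    """Documents in which `second` starts `offset` characters after `first`."""
    # Shift the postings of the first word, then intersect with the second.
    shifted = {(d, pos + offset) for d, pos in index.get(first, [])}
    hits = shifted & set(index.get(second, []))
    # Remove duplicates: keep each document once.
    return sorted({d for d, _ in hits})
```

With `offset=3` (the length of "to" plus one space), `find_phrase(index, "to", "be", 3)` mirrors Remove duplicates((Find("to")+3) ∩ Find("be")).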

6
Suffix Array & Document Retrieval (1/2)
Build the generalized suffix array of D:
[Figure: suffix array positions 1, 2, ..., 6853491, 6853492, 6853493, 6853494, ... drawn against the Hamlet and Merchant of Venice excerpts.]

7
Suffix Array & Document Retrieval
- Build the generalized suffix array of D.
- Locate the interval containing all occurrences of pattern P.
- Remove duplicates.
[Figure: suffix array positions 1, 2, ..., 6853491, 6853492, 6853493, 6853494, ...; the interval for "to be" yields d1 (Hamlet), d2 (Merchant of Venice), ...]
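The three steps can be sketched as below. The details are assumptions made for the example: documents are joined with a `"\x00"` separator that is assumed not to occur in any pattern, the suffix array is built by naive sorting, the interval is found by a linear scan rather than a search, and `build_index`, `interval` and `find` are hypothetical names.

```python
def build_index(docs):
    """Return (text, sa, doc) where doc[r] is the document owning suffix sa[r]."""
    sep = "\x00"  # assumed separator, must not occur in documents or patterns
    text = sep.join(docs) + sep
    owner = []
    for d, t in enumerate(docs, start=1):
        owner.extend([d] * (len(t) + 1))  # +1 covers the separator
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc = [owner[i] for i in sa]
    return text, sa, doc


def interval(text, sa, pattern):
    """Inclusive (sp, ep) range of suffixes starting with pattern, or None."""
    m = len(pattern)
    # Linear scan for clarity; the matching ranks are contiguous in sa.
    hits = [r for r, i in enumerate(sa) if text[i:i + m] == pattern]
    return (hits[0], hits[-1]) if hits else None


def find(text, sa, doc, pattern):
    rng = interval(text, sa, pattern)
    if rng is None:
        return []
    sp, ep = rng
    # Remove duplicates: this step costs time proportional to the number of
    # occurrences, not to the number of distinct documents.
    return sorted(set(doc[sp:ep + 1]))
```

The duplicate removal scans every occurrence in the interval, which is the cost that the next slide's technique avoids.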

8
Muthukrishnan's improvement
[Figure: the suffix array interval for "to be" with the doc and prev arrays; the annotations read "min" and "min>6853490".]
position: 1 2 ... 6853491 6853492 6853493 6853494 ...
doc: 6 4 ... 2 1 1 3
prev: -1 -1 ... 6853434 6853372 6853492 6853420 ...
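As we read the figure, prev[i] holds the previous position with the same document number as position i (or -1), and repeated minimum queries on prev inside the pattern's interval report each document once. A minimal sketch of that idea follows, assuming 0-based positions and a naive range minimum in place of a constant-time RMQ structure; `build_prev` and `list_documents` are hypothetical names.

```python
def build_prev(doc):
    """prev[i] = previous position holding the same document as i, or -1."""
    last = {}
    prev = []
    for i, d in enumerate(doc):
        prev.append(last.get(d, -1))
        last[d] = i
    return prev


def list_documents(doc, prev, sp, ep):
    """Report each distinct document in doc[sp..ep] (inclusive, 0-based) once."""
    found = []
    stack = [(sp, ep)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Naive range minimum (assumption): a real index would answer this
        # with a constant-time RMQ structure instead of scanning.
        i = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[i] >= sp:
            # Every position here repeats a document seen earlier in [sp, ep].
            continue
        found.append(doc[i])
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return found
```

Each range either reports a new document or is abandoned after a single minimum query, so the number of queries is proportional to the number of distinct documents rather than to the number of occurrences.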

9
Time-Optimal Document Retrieval
Theorem [Mut02]: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.

10
Space-Optimal Document Retrieval
Theorem [Sad02]: The document retrieval problem can be solved in O(f(m,n) + ndoc·g(n)) time using an index structure of size |CSA| + 4n + o(n) + O(k log(n/k)) bits, where |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used, f(m,n) = O(m log n) is the pattern search time, and g(n) = Ω(log^ε n) is the time to decode a suffix array value.

11
Our Result: Space- and Time-Efficient Document Retrieval
Theorem: The document retrieval problem can be solved in the optimal O(m + ndoc) time using an index structure of size |CSA| + 2n + o(n) + n log k (1+o(1)) bits, when |Σ|, k = polylog(n); for unbounded |Σ|, k the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.

12
Details of Our Result (1/3)
- We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
- We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.

13
Details of Our Result (2/3)
Observation: prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1), where rank_{k'}(A, i) gives the number of times value k' appears in A[1,i], and select_{k'}(A, j) gives the position of the j-th occurrence of value k' in A.
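The observation can be checked on a small array. In this sketch rank and select are naive linear scans over a Python list (an assumption for illustration; the talk uses a wavelet tree), positions are 1-based as in the definitions, and select with j = 0 is assumed to return -1 so that first occurrences get the same sentinel as in the prev array of the earlier figure.

```python
def rank(A, k, i):
    """Number of times value k appears in A[1..i] (1-based, inclusive)."""
    return A[:i].count(k)


def select(A, k, j):
    """1-based position of the j-th occurrence of k in A.

    Assumed convention: j == 0 returns -1 (no previous occurrence).
    """
    if j == 0:
        return -1
    seen = 0
    for pos, value in enumerate(A, start=1):
        if value == k:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of k in A")


def prev_at(doc, i):
    """prev[i] computed from rank and select, without storing prev."""
    k = doc[i - 1]
    return select(doc, k, rank(doc, k, i) - 1)
```

Because prev[i] can be recomputed on demand this way, the prev array itself does not need to be stored.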

14
Details of Our Result (3/3)
- The generalized wavelet tree representation of the doc array provides constant-time rank and select when k = polylog(n).
- Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using 2n + o(n) bits [FH07].

15
A simpler way to obtain the O(ndoc log k) result...
[Figure: the array doc = 2 3 4 2 1 2 3 1 4 at positions 1 to 9, with smaller sub-arrays of its values drawn beneath it.]
|CSA| + 2n + o(n) + n log k (1+o(1)) bits

16
Extensions
The approach can easily be extended to
- report the documents in relevance order under standard scoring schemes like TF*IDF; and
- show context around the first/several/all occurrences in selected documents.
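One hypothetical way to picture the relevance-order extension: the term frequency of a document is the number of positions in the pattern's suffix array interval that carry its document number, and the document frequency is the number of distinct documents in the interval. The sketch below assumes one common TF*IDF variant (tf · log(k/df)), counts with a plain scan instead of rank queries, and uses the made-up name `rank_by_tfidf`; the talk does not spell out its scoring details.

```python
import math
from collections import Counter


def rank_by_tfidf(doc, sp, ep, k):
    """Return (doc_id, score) pairs for doc[sp..ep] (inclusive, 0-based),
    sorted by descending score, ties broken by document number.

    k is the total number of documents in the collection.
    """
    tf = Counter(doc[sp:ep + 1])  # term frequency per document
    if not tf:
        return []
    idf = math.log(k / len(tf))  # len(tf) is the document frequency
    scored = [(d, occurrences * idf) for d, occurrences in tf.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

For a single pattern the idf factor is the same for every document, so the order is decided by term frequency alone.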

17
Small experiment
50 MB English text, k = 200
index | size | query time (m=3) | query time (m=4)
inverted index | 98 MB | 17.46 s | 4.29 s
our index | 169 MB | 3.7 s | 2.7 s
