# Space-Efficient Algorithms for Document Retrieval

Veli Mäkinen, University of Helsinki. Joint work with Niko Välimäki.


Slide 2 (10.7.2007, Space-Efficient Document Retrieval): Introduction

| Field | Problem | Solution |
|---|---|---|
| Information Retrieval | Document Retrieval | Inverted Index |
| Combinatorial Pattern Matching | Text Indexing | Suffix tree |

Annotations on the diagram: [PST06]; [Mut02]; "practice: space limits"; "theory: time limits"; [Sad07 & this paper].

Slide 3: Text Indexing

- Let $T = t_1 t_2 \dots t_n$ be a text string over an ordered alphabet $\Sigma$.
- The text indexing problem is to build an index structure for $T$ that supports the following operations on a given pattern $P = p_1 p_2 \dots p_m$:
  - Count(P): how many times does P occur in T?
  - List(P): list the occurrence positions of P in T.
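To make the two operations concrete, here is a minimal sketch that answers Count(P) and List(P) with a plain (uncompressed) suffix array and binary search. The function names are illustrative, the construction is the naive sorting one, and positions are 0-based rather than 1-based as on the slide; this is not the compressed index discussed later in the talk.

```python
from typing import List, Tuple


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text: str, sa: List[int], pattern: str) -> Tuple[int, int]:
    """Half-open range [sp, ep) of suffix array entries whose suffix starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sp, lo


def count(text: str, sa: List[int], pattern: str) -> int:
    sp, ep = sa_interval(text, sa, pattern)
    return ep - sp


def list_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    sp, ep = sa_interval(text, sa, pattern)
    return sorted(sa[sp:ep])  # 0-based start positions in text order


text = "to be or not to be"
sa = build_suffix_array(text)
print(count(text, sa, "to be"), list_occurrences(text, sa, "to be"))
```

Count only needs the interval boundaries, while List has to read every suffix array entry in the interval, which is why the two operations have different costs.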

Slide 4: Document Retrieval

- Let $D = \{T_1, T_2, \dots, T_k\}$ be a set of text documents of total length $n$.
- The document retrieval problem is to build an index for $D$ that supports the following operation on a given pattern $P = p_1 p_2 \dots p_m$:
  - Find(P): list the documents that contain P (in the order of relevance, ...).

Slide 5: Inverted Index & Document Retrieval

Creating an inverted file over Shakespeare's plays:

- d1 (Hamlet): To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;
- d2 (Merchant of Venice): PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Posting lists (document, position):

- be: (d1,4) (d1,18) ... (d2,74) (d2,139) ...
- to: (d1,1) (d1,15) ... (d2,136) ...

Find("to be") = RemoveDuplicates((Find("to") + 3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...

The shift by 3 moves each occurrence of "to" to the position where "be" would have to start (two letters plus one space), so the intersection keeps exactly the adjacent pairs.
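The phrase query on this slide can be sketched as follows. The tokenizer, the function names, and the choice of 1-based character positions are assumptions made for illustration. The offset of 3 follows the slide and presumes the two words are separated by exactly one space.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Posting = Tuple[str, int]  # (document id, 1-based character position)


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[Posting]]:
    """Map each lower-cased word to the set of its (document, position) postings."""
    index: Dict[str, Set[Posting]] = defaultdict(set)
    for doc_id, text in docs.items():
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].add((doc_id, match.start() + 1))
    return index


def find_phrase(index: Dict[str, Set[Posting]], first: str, second: str, offset: int) -> List[str]:
    """RemoveDuplicates((Find(first) + offset) ∩ Find(second)), as on the slide."""
    shifted = {(d, p + offset) for d, p in index.get(first, set())}
    hits = shifted & index.get(second, set())
    return sorted({d for d, _ in hits})  # drop positions, keep each document once


docs = {
    "d1": "To be, or not to be: that is the question",
    "d2": "what were good to be done",
    "d3": "to sleep",
}
index = build_inverted_index(docs)
print(find_phrase(index, "to", "be", offset=len("to") + 1))
```

On the first excerpt this reproduces the postings shown on the slide: "to" at positions 1 and 15, "be" at positions 4 and 18.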

Slide 6: Suffix Array & Document Retrieval (1/2)

- Build the generalized suffix array of D.

(Figure: suffix array positions 1, 2, ..., 6853491, 6853492, 6853493, 6853494, ... pointing into the concatenated plays, illustrated with the same Hamlet and Merchant of Venice excerpts as on slide 5.)

Slide 7: Suffix Array & Document Retrieval

- Build the generalized suffix array of D.
- Locate the interval containing all occurrences of pattern P.
- Remove duplicates.

(Figure: the interval for "to be" covers suffix array positions 6853491, 6853492, 6853493, 6853494, ...; after removing duplicates the answer is d1 (Hamlet), d2 (Merchant of Venice), ...)
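The three steps can be sketched in a few lines. This is a naive illustration: the separator character, the function names, and the sorting-based construction are assumptions, and the duplicate removal here touches every occurrence in the interval, which is exactly the cost the later slides work to avoid.

```python
from typing import List, Tuple

SEP = "\x00"  # assumed separator that occurs in no document and in no pattern


def build_index(docs: List[str]) -> Tuple[str, List[int], List[int]]:
    """Return (concatenated text, generalized suffix array, doc array in suffix order)."""
    text = "".join(d + SEP for d in docs)
    owner: List[int] = []  # owner[p] = id of the document covering text position p
    for doc_id, d in enumerate(docs):
        owner.extend([doc_id] * (len(d) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc = [owner[p] for p in sa]
    return text, sa, doc


def find(text: str, sa: List[int], doc: List[int], pattern: str) -> List[int]:
    """List the distinct documents containing pattern (time grows with all occurrences)."""
    m = len(pattern)

    def boundary(strict: bool) -> int:
        # First index whose m-prefix is >= pattern (strict=False) or > pattern (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    sp, ep = boundary(False), boundary(True)
    return sorted(set(doc[sp:ep]))  # remove duplicates


docs = ["to be or not to be", "good to be done", "to sleep"]
text, sa, doc = build_index(docs)
print(find(text, sa, doc, "to be"))
```

Because each document ends with the separator, a pattern that does not contain the separator cannot match across a document boundary.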

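Slide 8: Muthukrishnan's improvement

| position | 1 | 2 | ... | 6853491 | 6853492 | 6853493 | 6853494 | ... |
|---|---|---|---|---|---|---|---|---|
| doc | 6 | 4 | ... | 2 | 1 | 1 | 3 | ... |
| prev | -1 | -1 | ... | 6853434 | 6853372 | 6853492 | 6853420 | ... |

Here doc[i] is the document that the i-th suffix belongs to, and prev[i] is the previous position holding the same document number (-1 if there is none). The figure marks repeated range-minimum ("min") queries over prev inside the interval for "to be", which appears to start at position 6853491, together with the stopping condition min > 6853490. A position whose prev value lies before the interval is the first occurrence of its document there, so it is reported; once the minimum of a subrange exceeds 6853490, that subrange holds only duplicates.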
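A minimal sketch of this recursion, assuming the doc array and the query interval are already available. The range minimum is found here by a linear scan purely for readability; the time bound in [Mut02] relies on a constant-time RMQ structure instead. Indices are 0-based and the function names are illustrative.

```python
from typing import Dict, List


def build_prev(doc: List[int]) -> List[int]:
    """prev[i] = largest j < i with doc[j] == doc[i], or -1 if there is none."""
    last_seen: Dict[int, int] = {}
    prev: List[int] = []
    for i, d in enumerate(doc):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return prev


def list_documents(doc: List[int], prev: List[int], sp: int, ep: int) -> List[int]:
    """Report each distinct document in doc[sp..ep] (inclusive) exactly once."""
    reported: List[int] = []
    stack = [(sp, ep)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Stand-in for a constant-time range minimum query on prev[lo..hi].
        i = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[i] >= sp:
            continue  # every position in this subrange repeats a document seen in [sp, ep]
        reported.append(doc[i])  # first occurrence of doc[i] inside the interval
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return reported


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
prev = build_prev(doc)
print(sorted(list_documents(doc, prev, 3, 7)))
```

Each successful range-minimum query reports a new document and each unsuccessful one ends a branch, so the number of queries is proportional to the number of reported documents rather than to the interval length.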

Slide 9: Time-Optimal Document Retrieval

- Theorem [Mut02]: The document retrieval problem can be solved in the optimal $O(m + \mathit{ndoc})$ time using an index structure of size $O(n \log n)$ bits, where $\mathit{ndoc}$ is the number of documents matching the query.
- Observation: The solution is not space-optimal, as the document collection can be represented in $n \log |\Sigma|$ bits.

Slide 10: Space-Optimal Document Retrieval

- Theorem [Sad02]: The document retrieval problem can be solved in $O(f(m,n) + \mathit{ndoc} \cdot g(n))$ time using an index structure of size $|CSA| + 4n + o(n) + O(k \log (n/k))$ bits, where
  - $|CSA| \le n \log |\Sigma| \, (1 + o(1))$ is the size of the compressed suffix array used;
  - $f(m,n) = O(m \log n)$ is the pattern search time; and
  - $g(n) = \Omega(\log^{\varepsilon} n)$ is the time to decode a suffix array value.

Slide 11: Our Result: Space- and Time-Efficient Document Retrieval

- Theorem: The document retrieval problem can be solved in the optimal $O(m + \mathit{ndoc})$ time using an index structure of size $|CSA| + 2n + o(n) + n \log k \, (1 + o(1))$ bits, when $|\Sigma|$ and $k$ are $\mathrm{polylog}(n)$; for unbounded $|\Sigma|$ and $k$ the time bound components become $O(m \log |\Sigma|)$ and $O(\mathit{ndoc} \log k)$, respectively.

Slide 12: Details of Our Result (1/3)

- We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
- We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.

Slide 13: Details of Our Result (2/3)

- Observation: $\mathit{prev}[i] = \mathit{select}_{\mathit{doc}[i]}(\mathit{doc}, \mathit{rank}_{\mathit{doc}[i]}(\mathit{doc}, i) - 1)$, where $\mathit{rank}_{k'}(A, i)$ gives the number of times value $k'$ appears in $A[1,i]$, and $\mathit{select}_{k'}(A, j)$ gives the position of the $j$-th occurrence of value $k'$ in $A$.
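The observation can be checked with naive rank and select functions. The sketch below uses 1-based positions to match the definitions on the slide and adopts, as an assumption, the convention that selecting the 0-th occurrence yields -1 (the "no previous occurrence" value shown on the earlier slide). A wavelet tree would answer the same queries without scanning the array.

```python
from typing import List


def rank(A: List[int], c: int, i: int) -> int:
    """Number of times c appears in A[1..i] (1-based, inclusive)."""
    return A[:i].count(c)


def select(A: List[int], c: int, j: int) -> int:
    """1-based position of the j-th occurrence of c in A; -1 when j == 0 (assumed convention)."""
    if j == 0:
        return -1
    seen = 0
    for pos, value in enumerate(A, start=1):
        if value == c:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of c in A")


def prev_at(doc: List[int], i: int) -> int:
    """prev[i] computed on the fly, so the prev array never has to be stored."""
    c = doc[i - 1]
    return select(doc, c, rank(doc, c, i) - 1)


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
print([prev_at(doc, i) for i in range(1, len(doc) + 1)])
```

Since prev can be recomputed from doc on demand, only the doc array and a range-minimum structure over the implicit prev array need to be kept.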

Slide 14: Details of Our Result (3/3)

- The generalized wavelet tree representation of the doc array provides constant-time rank and select when $k$ is $\mathrm{polylog}(n)$.
- Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using $2n + o(n)$ bits [FH07].

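Slide 15: A simpler way to obtain the O(ndoc log k) result...

| position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| doc | 2 | 3 | 4 | 2 | 1 | 2 | 3 | 1 | 4 |

(Figure: beneath the doc array, its decomposition into subsequences by document number; the lower levels did not survive extraction intact.)

Space: $|CSA| + 2n + o(n) + n \log k \, (1 + o(1))$ bits.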
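One reading of this slide, given that the doc array is stored in a wavelet tree, is that the distinct documents in an interval can be listed by walking the tree and descending only into children whose mapped range is non-empty. The sketch below illustrates that idea on the slide's doc array with a plain pointer-based binary wavelet tree. The class and function names are made up for the example, prefix-count lists stand in for succinct rank structures, and ranges are 0-based and half-open.

```python
from typing import List, Optional


class WaveletNode:
    """Binary wavelet tree node over the value range [lo, hi]."""

    def __init__(self, seq: List[int], lo: int, hi: int) -> None:
        self.lo, self.hi = lo, hi
        self.left: Optional["WaveletNode"] = None
        self.right: Optional["WaveletNode"] = None
        if lo == hi or not seq:
            return  # leaf, or empty subtree that is never entered
        mid = (lo + hi) // 2
        # zeros[i] = number of symbols <= mid among seq[:i] (rank on the node's bit vector)
        self.zeros = [0]
        for x in seq:
            self.zeros.append(self.zeros[-1] + (x <= mid))
        self.left = WaveletNode([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletNode([x for x in seq if x > mid], mid + 1, hi)


def distinct_in_range(node: WaveletNode, sp: int, ep: int) -> List[int]:
    """Distinct values in the half-open range [sp, ep) of the node's sequence."""
    if sp >= ep:
        return []  # empty range: prune this subtree
    if node.lo == node.hi:
        return [node.lo]  # reached a leaf: one document, reported once
    l_sp, l_ep = node.zeros[sp], node.zeros[ep]
    return (distinct_in_range(node.left, l_sp, l_ep)
            + distinct_in_range(node.right, sp - l_sp, ep - l_ep))


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
root = WaveletNode(doc, 1, 4)
print(distinct_in_range(root, 3, 8))  # documents among doc[3:8]
```

Every reported document costs one root-to-leaf path of about log k nodes, which matches the O(ndoc log k) bound in the slide title. The width ep - sp of the range that reaches a leaf is the number of occurrences in that document.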

Slide 16: Extensions

- The approach can easily be extended to
  - report the documents in relevance order under standard scoring schemes like TF*IDF; and
  - show context around the first, several, or all occurrences in selected documents.
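As a rough illustration of relevance ordering, the term frequency of the pattern in a document equals the number of times that document's number appears in the doc array inside the query interval. The sketch below counts these directly and scores with one common TF*IDF variant, tf * log(k / df); the slide does not say which variant is meant, so both the scoring formula and the function name are assumptions. A space-efficient index would obtain the counts from rank queries instead of scanning the interval.

```python
import math
from collections import Counter
from typing import List, Tuple


def rank_by_tfidf(doc: List[int], sp: int, ep: int, k: int) -> List[Tuple[int, float]]:
    """Score the documents occurring in doc[sp..ep] (0-based, inclusive) by tf * log(k / df)."""
    tf = Counter(doc[sp:ep + 1])  # occurrences of the pattern per document
    if not tf:
        return []
    idf = math.log(k / len(tf))  # df = number of matching documents
    scored = [(d, freq * idf) for d, freq in tf.items()]
    # Highest score first; ties broken by document number for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


doc = [2, 3, 4, 2, 1, 2, 3, 1, 4]
for doc_id, score in rank_by_tfidf(doc, 3, 7, k=4):
    print(doc_id, round(score, 3))
```

For a single pattern the IDF factor is the same for every matching document, so the order is determined by term frequency alone; the factor matters once scores for several patterns are combined.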

Slide 17: Small experiment

- 50 MB English text
- k = 200

| | size | query time, m=3 | query time, m=4 |
|---|---|---|---|
| inverted index | 98 MB | 17.46 s | 4.29 s |
| our index | 169 MB | 3.7 s | 2.7 s |
