Space-Efficient Algorithms for Document Retrieval Veli Mäkinen University of Helsinki Joint work with Niko Välimäki.



Introduction

Field                            Problem              Solution
Information Retrieval            Document Retrieval   Inverted Index
Combinatorial Pattern Matching   Text Indexing        Suffix tree

Diagram annotations: [PST06]; [Mut02]; practice: space limits; theory: time limits; [Sad07 & this paper].

Text Indexing
• Let $T = t_1 t_2 \ldots t_n$ be a text string over an ordered alphabet $\Sigma$.
• The Text Indexing problem is to build an index structure for $T$ that supports the following operations on a given pattern $P = p_1 p_2 \ldots p_m$:
  - Count(P): how many times does P occur in T?
  - List(P): list the occurrence positions of P in T.
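The two operations can be illustrated with a plain (uncompressed) suffix array. This is a minimal sketch, not the index used in the talk: the function names are hypothetical, construction is the naive sort of all suffixes, and positions are reported 1-based to match the slide's notation.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_interval(text, sa, pattern):
    # Half-open interval [start, end) of suffixes that begin with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

def count(text, sa, pattern):
    start, end = sa_interval(text, sa, pattern)
    return end - start

def list_occurrences(text, sa, pattern):
    start, end = sa_interval(text, sa, pattern)
    return sorted(i + 1 for i in sa[start:end])  # 1-based positions

text = "to be, or not to be"
sa = build_suffix_array(text)
print(count(text, sa, "to be"), list_occurrences(text, sa, "to be"))
```

Count needs only the interval boundaries, while List has to touch every entry in the interval, which is why the two are usually stated as separate operations.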

Document Retrieval
• Let $D = \{T_1, T_2, \ldots, T_k\}$ be a set of text documents of total length $n$.
• The Document Retrieval problem is to build an index for $D$ that supports the following operation on a given pattern $P = p_1 p_2 \ldots p_m$:
  - Find(P): list the documents that contain P (in the order of relevance, ...)

Inverted Index & Document Retrieval
Creating an inverted file over Shakespeare's plays:

d1 (Hamlet): To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;

d2 (Merchant of Venice): PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Postings (document, position):
...
be: (d1,4) (d1,18) ... (d2,74) (d2,139)
to: (d1,1) (d1,15) ... (d2,136)

Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
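The phrase query above can be sketched with a small positional inverted index. This is an illustration under assumptions, not the implementation behind the slides: the tokenizer (lowercased runs of letters and apostrophes), the function names, and the shortened example documents are all hypothetical. Positions are 1-based character offsets, so shifting the postings of "to" by 3 lines them up with the postings of "be".

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    # Map each word to its postings: (document id, 1-based character offset).
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].append((doc_id, match.start() + 1))
    return index

def find_phrase(index, first, second):
    # Assumes the two words are separated by exactly one character,
    # so the shift is len(first) + 1 (3 for "to", as on the slide).
    offset = len(first) + 1
    shifted = {(d, p + offset) for d, p in index.get(first, [])}
    hits = shifted & set(index.get(second, []))
    # Remove duplicates: keep each document once.
    return sorted({d for d, _ in hits})

docs = {
    "d1": "To be, or not to be: that is the question",
    "d2": "what were good to be done, than be one of the twenty",
}
index = build_inverted_index(docs)
print(find_phrase(index, "to", "be"))
```

The intersection works on (document, position) pairs, so a document is reported only when the two words are actually adjacent, not merely both present.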

Suffix Array & Document Retrieval (1/2)
• Build the generalized suffix array of D:

To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;

PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching.

Suffix Array & Document Retrieval
• Build the generalized suffix array of D.
• Locate the interval containing all occurrences of pattern P, here "to be".
• Remove duplicates: d1 (Hamlet), d2 (Merchant of Venice), ...
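The three steps can be sketched as follows. This is a naive illustration with hypothetical function names and toy documents: the generalized suffix array is built by sorting all suffixes of all documents, and the `doc` array records which document each sorted suffix came from.

```python
from bisect import bisect_left, bisect_right

def build_generalized_suffix_array(docs):
    # Sort every suffix of every document; remember the source document (1-based).
    suffixes = sorted(
        (text[i:], d)
        for d, text in enumerate(docs, start=1)
        for i in range(len(text))
    )
    keys = [s for s, _ in suffixes]
    doc = [d for _, d in suffixes]
    return keys, doc

def find_documents(keys, doc, pattern):
    # Locate the interval of suffixes that start with pattern.
    m = len(pattern)
    prefixes = [s[:m] for s in keys]  # truncated suffixes stay sorted
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Remove duplicates by scanning the whole interval.
    return sorted(set(doc[lo:hi])), hi - lo

docs = ["to be or not to be", "good to be done", "to die to sleep"]
keys, doc = build_generalized_suffix_array(docs)
print(find_documents(keys, doc, "to be"))
```

Removing duplicates this way costs time proportional to the number of occurrences, not the number of matching documents, which can be far smaller when a pattern repeats often inside a few documents.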

Muthukrishnan's improvement
[Figure: the doc and prev arrays, with the suffix array interval for "to be" marked and range minima (min) taken over prev.]

Time-Optimal Document Retrieval
• Theorem [Mut02]: The document retrieval problem can be solved in the optimal $O(m + \mathrm{ndoc})$ time using an index structure of size $O(n \log n)$ bits, where $\mathrm{ndoc}$ is the number of documents matching the query.
• Observation: The solution is not space-optimal, as the document collection can be represented in $n \log |\Sigma|$ bits.
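The idea behind the $O(m + \mathrm{ndoc})$ bound can be sketched as follows, as commonly described for [Mut02]: `prev[i]` points to the previous position holding the same document number, and a position in the interval is the first occurrence of its document exactly when its `prev` value falls before the interval. Repeated range-minimum queries on `prev` find those positions without visiting the rest. In this sketch the function names are hypothetical, indices are 0-based with -1 meaning "no previous occurrence", and the range minimum is found by a linear scan, whereas the actual structure answers it in constant time.

```python
def build_prev(doc):
    # prev[i] = largest j < i with doc[j] == doc[i], or -1 if none.
    last = {}
    prev = []
    for i, d in enumerate(doc):
        prev.append(last.get(d, -1))
        last[d] = i
    return prev

def list_documents(doc, prev, lo, hi):
    # Report each distinct document in doc[lo..hi] (inclusive) exactly once.
    result = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        # Naive range-minimum query; a real index would answer this in O(1).
        i = min(range(a, b + 1), key=prev.__getitem__)
        if prev[i] >= lo:
            continue  # nothing in [a, b] is a first occurrence within [lo, hi]
        result.append(doc[i])
        stack.append((a, i - 1))
        stack.append((i + 1, b))
    return result

doc = [1, 2, 1, 3, 2, 2, 1, 3]
prev = build_prev(doc)
print(sorted(list_documents(doc, prev, 2, 6)))
```

Each range-minimum query either reports a new document or terminates a branch, so the number of queries is proportional to the number of reported documents.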

Space-Optimal Document Retrieval
• Theorem [Sad02]: The document retrieval problem can be solved in $O(f(m,n) + \mathrm{ndoc} \cdot g(n))$ time using an index structure of size $|CSA| + 4n + o(n) + O(k \log(n/k))$ bits, where
  - $|CSA| \le n \log |\Sigma| \, (1 + o(1))$ is the size of the compressed suffix array used;
  - $f(m,n) = O(m \log n)$ is the pattern search time; and
  - $g(n) = \Omega(\log^{\varepsilon} n)$ is the time to decode a suffix array value.

Our Result: Space- and Time-Efficient Document Retrieval
• Theorem: The document retrieval problem can be solved in the optimal $O(m + \mathrm{ndoc})$ time using an index structure of size $|CSA| + 2n + o(n) + n \log k \, (1 + o(1))$ bits, when $|\Sigma|, k = \mathrm{polylog}(n)$; for unbounded $|\Sigma|, k$ the time bound components become $O(m \log |\Sigma|)$ and $O(\mathrm{ndoc} \log k)$, respectively.

Details of Our Result (1/3)
• We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
• We use the generalized wavelet tree [GGV03,FMMN07] to store document numbers according to the suffix array order.

Details of Our Result (2/3)
• Observation: $prev[i] = select_{doc[i]}(doc,\ rank_{doc[i]}(doc, i) - 1)$, where $rank_{k'}(A, i)$ gives the number of times value $k'$ appears in $A[1, i]$, and $select_{k'}(A, j)$ gives the position of the $j$-th occurrence of value $k'$ in $A$.
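The observation means that prev never has to be stored: any entry can be recomputed from rank and select on doc. A minimal sketch with linear-time rank and select (a wavelet tree would provide the same operations much faster) is given below. Positions are 1-based as in the formula, and returning 0 from select when j is 0 is an assumed convention for "no previous occurrence", not something stated on the slide.

```python
def rank(A, value, i):
    # Number of times value appears in A[1..i] (1-based, inclusive).
    return sum(1 for x in A[:i] if x == value)

def select(A, value, j):
    # 1-based position of the j-th occurrence of value in A.
    # Assumed convention: return 0 when j is 0 or no such occurrence exists.
    if j <= 0:
        return 0
    seen = 0
    for pos, x in enumerate(A, start=1):
        if x == value:
            seen += 1
            if seen == j:
                return pos
    return 0

def prev_at(doc, i):
    # prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1)
    d = doc[i - 1]
    return select(doc, d, rank(doc, d, i) - 1)

doc = [1, 2, 1, 3, 2, 2, 1, 3]
print([prev_at(doc, i) for i in range(1, len(doc) + 1)])
# prints [0, 0, 1, 0, 2, 5, 3, 4]
```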

Details of Our Result (3/3)
• The generalized wavelet tree representation of the doc array provides constant-time rank and select when $k = \mathrm{polylog}(n)$.
• Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using $2n + o(n)$ bits [FH07].

A simpler way to obtain the O(ndoc log k) result
[Figure: the doc array.]
• Index size: $|CSA| + 2n + o(n) + n \log k \, (1 + o(1))$ bits
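The figure for this slide did not survive in the transcript, so the following is an assumption about what it depicts rather than a reconstruction. One well-known way to list the distinct values in a range of the doc array in O(ndoc log k) time is to walk a binary wavelet tree built over doc, descending only into children whose mapped range is non-empty. The class and function names below are hypothetical, and the tree stores plain prefix-count lists where a real index would use compressed bit vectors.

```python
class WaveletNode:
    def __init__(self, values, lo, hi):
        # Node covering document numbers in [lo, hi].
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i values go to the left child.
        self.zeros = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletNode([v for v in values if v <= mid], lo, mid)
        self.right = WaveletNode([v for v in values if v > mid], mid + 1, hi)

def distinct_in_range(node, a, b, out):
    # Append each distinct value in positions [a, b) (0-based, half-open).
    if a >= b:
        return  # empty range: no document below this node occurs
    if node.lo == node.hi:
        out.append(node.lo)
        return
    distinct_in_range(node.left, node.zeros[a], node.zeros[b], out)
    distinct_in_range(node.right, a - node.zeros[a], b - node.zeros[b], out)

doc = [1, 2, 1, 3, 2, 2, 1, 3]
root = WaveletNode(doc, 0, 3)
found = []
distinct_in_range(root, 2, 7, found)
print(found)  # prints [1, 2, 3]
```

Every reported document corresponds to one root-to-leaf path of length about log k, which gives the O(ndoc log k) bound without any prev array or range-minimum structure.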

Extensions
• The approach can easily be extended to
  - report the documents in relevance order under standard scoring schemes like TF*IDF; and
  - show context around the first/several/all occurrences in selected documents.

Small experiment
• 50 MB English text
• k = 200

Index            Size     Query time (m=3)   Query time (m=4)
inverted index   98 MB    17.46 s            4.29 s
our index        169 MB   3.7 s              2.7 s