Document Indexing: SPIMI

Presentation on theme: "Document Indexing: SPIMI"— Presentation transcript:

1 Document Indexing: SPIMI
Contents:
1. Single-pass in-memory indexing (SPIMI)
2. Distributed indexing
3. Some simple examples

2 Problems With Earlier Approaches

3 SPIMI: Single-pass in-memory indexing
Sec. 4.3

4 SPIMI-Invert
Sec. 4.3 Merging of blocks is analogous to BSBI.
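
The slide's pseudocode did not survive in this transcript, so the following is only a minimal Python sketch of the SPIMI idea, not the original figure. It assumes the token stream is a sequence of (term, doc_id) pairs with non-decreasing doc IDs, and it ends a block after a fixed number of tokens (the hypothetical `tokens_per_block` parameter) where a real indexer would stop when memory runs out. The function names and the tab-separated block file format are illustrative choices.

```python
import heapq
import itertools
import os
import tempfile
from collections import defaultdict


def spimi_invert(token_stream, block_path):
    """Build one block: postings go straight into per-term lists; terms are sorted once at the end."""
    dictionary = defaultdict(list)  # term -> postings list of doc IDs
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    with open(block_path, "w", encoding="utf-8") as f:
        for term in sorted(dictionary):  # sorted terms make the later merge a linear scan
            f.write(term + "\t" + " ".join(map(str, dictionary[term])) + "\n")


def read_block(path):
    """Yield (term, postings) pairs from a block file in term order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, ids = line.rstrip("\n").split("\t")
            yield term, [int(i) for i in ids.split()]


def merge_blocks(block_paths):
    """Merge sorted blocks, as in BSBI. The result is kept in memory only to keep the sketch short."""
    merged = {}
    for term, postings in heapq.merge(*(read_block(p) for p in block_paths)):
        merged.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in merged.items()}


def build_index(token_stream, tokens_per_block=4):
    """Cut the stream into blocks, invert each block, then merge all blocks."""
    tokens = iter(token_stream)
    paths = []
    with tempfile.TemporaryDirectory() as tmp:
        while True:
            block = list(itertools.islice(tokens, tokens_per_block))
            if not block:
                break
            path = os.path.join(tmp, f"block{len(paths)}.txt")
            spimi_invert(block, path)
            paths.append(path)
        return merge_blocks(paths)
```

Calling `build_index` on a short list of (term, doc_id) pairs with a small `tokens_per_block` forces several blocks, which makes the merge step visible even on toy input.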

5 SPIMI: Compression
Sec. 4.3 Compression makes SPIMI even more efficient: compression of terms and compression of postings.
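
The slide does not say which compression scheme is meant. As one common possibility for postings, the sketch below stores gaps between consecutive doc IDs and encodes each gap with variable-byte codes, where the high bit marks the last byte of a number. Compression of terms is not shown, and the function names are illustrative.

```python
def vb_encode_number(n):
    """Variable-byte encode one non-negative integer (7 payload bits per byte)."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # set the high bit on the final byte to mark the end of the number
    return bytes(chunks)


def encode_postings(doc_ids):
    """Encode a sorted postings list as variable-byte coded gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild doc IDs by summing the decoded gaps."""
    doc_ids, n, prev = [], 0, 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            n = 128 * n + (byte - 128)
            prev += n
            doc_ids.append(prev)
            n = 0
    return doc_ids
```

For the postings list [824, 829, 215406] the gaps are 824, 5 and 214577, which take 2, 1 and 3 bytes respectively, so the list fits in 6 bytes.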

6 Distributed indexing
Sec. 4.4 Web-scale indexing must use a distributed computing cluster. Individual machines are fault-prone: they can unpredictably slow down or fail. How do we exploit such a pool of machines?

7 Distributed indexing
Sec. 4.4 Use a large number of inexpensive servers instead of a single expensive machine. Maintain a master machine that directs the indexing job; the master is considered "safe", while the worker nodes remain fault-prone. Break the indexing job into sets of (parallel) tasks and pass them to different machines (nodes). The master assigns each task to an idle machine from a pool. MapReduce is a distributed programming framework well suited to indexing and analysis tasks.

8 Example “Collection”
Ref: Information Retrieval in Practice, Addison Wesley, 2008

9 Simple Inverted Index
Ref: Information Retrieval in Practice, Addison Wesley, 2008
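
The slide's figure is not part of this transcript. As a stand-in, here is a small sketch of a simple inverted index that maps each term to the sorted list of documents containing it. The three-document collection in the usage note is hypothetical, not the one from the slide, and tokenization is assumed to be lowercase whitespace splitting.

```python
def build_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(index, term_a, term_b):
    """Answer a two-term AND query from the postings lists."""
    return sorted(set(index.get(term_a, [])) & set(index.get(term_b, [])))
```

With the hypothetical collection `{1: "tropical fish tank", 2: "fish fish food", 3: "tropical plants"}`, the term "fish" maps to documents 1 and 2, and the query "tropical AND fish" matches only document 1.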

10 Inverted Index with counts
Supports better ranking algorithms. Ref: Information Retrieval in Practice, Addison Wesley, 2008
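
To illustrate why counts help ranking, the sketch below stores (doc_id, count) pairs in each postings list and scores a document by summing the counts of the query terms it contains. That scoring rule is a deliberately naive assumption chosen for brevity, not a ranking function taken from the slides, and the collection in the usage note is hypothetical.

```python
from collections import Counter


def build_count_index(docs):
    """Map each term to a list of (doc_id, term frequency) pairs, in doc ID order."""
    index = {}
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            index.setdefault(term, []).append((doc_id, tf))
    return index


def rank(index, query):
    """Rank documents by summed term frequency (naive scoring); ties break by doc ID."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With `{1: "tropical fish tank", 2: "fish fish food", 3: "tropical plants"}`, the query "fish" ranks document 2 ahead of document 1 because the term occurs twice there, a distinction a plain docs-only index cannot make.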

11 Inverted Index with positions
Supports proximity matches. Ref: Information Retrieval in Practice, Addison Wesley, 2008
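
A positional index can be sketched as a mapping from term to document to the list of word positions. The proximity check below simply compares all position pairs, which is fine for a toy example but not how a production system would do it; the function names, the parameter `k`, and the sample collection are illustrative assumptions.

```python
def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, with positions counted from 0."""
    index = {}
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def within(index, term_a, term_b, k):
    """Return doc IDs where term_a and term_b occur at most k positions apart."""
    docs_a, docs_b = index.get(term_a, {}), index.get(term_b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(p - q) <= k for p in docs_a[doc_id] for q in docs_b[doc_id]):
            hits.append(doc_id)
    return hits
```

With `{1: "tropical fish tank", 2: "fish from tropical waters", 3: "tropical plants and fresh water fish"}`, a window of `k=1` matches only document 1, while `k=2` also admits document 2.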

12 Data flow
Sec. 4.4 Fig: A simple Map-Reduce system, ref: Information Retrieval, Cambridge. Map phase: the master assigns splits to parsers, and each parser writes segment files partitioned by term range (a-f, g-p, q-z). Reduce phase: the master assigns each term range to an inverter (a-f, g-p, q-z), which collects the matching segment files and produces the postings.
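
The data flow in the figure can be imitated in a single process. In the sketch below, `parse` plays the role of a parser (map phase) and `invert` the role of an inverter (reduce phase), with the three term ranges from the figure as partitions. Everything runs sequentially here; in a real cluster the master would hand each split and each partition to an idle machine. The function names are illustrative, and terms are assumed to start with a lowercase letter.

```python
from collections import defaultdict

PARTITIONS = ("a-f", "g-p", "q-z")


def partition_of(term):
    """Pick the term-range partition from the term's first letter."""
    first = term[0]
    if first <= "f":
        return "a-f"
    if first <= "p":
        return "g-p"
    return "q-z"


def parse(split):
    """Map phase: turn one split of (doc_id, text) pairs into per-partition segment files."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(pairs):
    """Reduce phase: collect (term, doc_id) pairs for one partition into sorted postings lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(postings[term]) for term in sorted(postings)}


def map_reduce_index(splits):
    """Run all parsers, then one inverter per term partition."""
    segment_files = [parse(split) for split in splits]
    index = {}
    for part in PARTITIONS:
        pairs = [pair for seg in segment_files for pair in seg.get(part, [])]
        index[part] = invert(pairs)
    return index
```

Each inverter only ever sees the pairs for its own term range, which is what lets the reduce phase run in parallel across machines.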

13 References
Information Retrieval, Cambridge-2009.
Information Retrieval in Practice, Addison Wesley, 2008.
Original publication on SPIMI: Heinz and Zobel (2003).
Original publication on MapReduce: Dean and Ghemawat (2004).

