Indexing and Searching


1 Indexing and Searching
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8

2 Outline
Inverted Files; Other Indices for Text; Sequential Searching; Pattern Matching; Compression

3 Inverted Files
An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up searching. Structure: vocabulary and occurrences. Block addressing: the text is divided into blocks, and the occurrences point to the blocks. Full inverted indices: the occurrences point to exact positions.
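
The two addressing schemes can be illustrated with a minimal Python sketch. The function name `build_inverted_index`, the `block_size` parameter, and the tokenization by `\w+` are assumptions made for this illustration and are not taken from the slides.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word (the vocabulary) to a sorted list of occurrences.

    With block_size=None the occurrences are exact character offsets
    (a full inverted index). Otherwise they are block numbers, where
    block k covers characters [k * block_size, (k + 1) * block_size).
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        pos = match.start()
        entry = pos if block_size is None else pos // block_size
        postings = index[match.group()]
        # With block addressing, several occurrences may fall in one block.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    return dict(index)

text = "to be or not to be that is the question"
full_index = build_inverted_index(text)                  # exact occurrences
block_index = build_inverted_index(text, block_size=16)  # block addressing
```

Block addressing gives a smaller index, but a match inside a block still has to be located by scanning that block.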

4

5

6 Inverted Files
The search algorithm on an inverted index: vocabulary search; retrieval of occurrences; manipulation of occurrences. Construction (the index is split into two files). Posting file: the lists of occurrences are stored contiguously. The vocabulary is stored in lexicographical order, and each word points to its list.
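
As a rough sketch of the three search steps over this two-part layout, the Python below keeps a sorted vocabulary whose entries point into one contiguous posting list. The sample data, the `offsets` array, and the helper names `lookup` and `phrase_search` are assumptions for illustration; the phrase step also assumes words are separated by exactly one space.

```python
from bisect import bisect_left

# Index for the text "to be or not to be" (character offsets).
vocabulary = ["be", "not", "or", "to"]   # lexicographical order
offsets = [0, 2, 3, 4, 6]                # word i owns postings[offsets[i]:offsets[i + 1]]
postings = [3, 16, 9, 6, 0, 13]          # all occurrence lists, stored contiguously

def lookup(word):
    """Vocabulary search (binary search) followed by retrieval of occurrences."""
    i = bisect_left(vocabulary, word)
    if i == len(vocabulary) or vocabulary[i] != word:
        return []
    return postings[offsets[i]:offsets[i + 1]]

def phrase_search(words):
    """Manipulation of occurrences: keep the start positions at which the
    words follow each other (single-space separators assumed)."""
    starts = lookup(words[0])
    shift = 0
    for prev, word in zip(words, words[1:]):
        shift += len(prev) + 1
        positions = set(lookup(word))
        starts = [s for s in starts if s + shift in positions]
    return starts
```

For example, `phrase_search(["to", "be"])` keeps both occurrences of "to", because each is followed three characters later by an occurrence of "be".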

7

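8 Inverted Files for Large Texts
Partial indices: when the text does not fit in main memory, partial indices are built and then merged. Merging two indices consists of merging the sorted vocabularies and, whenever a word appears in both, merging its two lists of occurrences.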
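
The merge of two partial indices can be sketched as a two-pointer merge over the sorted vocabularies. The representation as lists of `(word, occurrences)` pairs and the name `merge_indices` are assumptions for this illustration; the sketch also assumes that the second index covers a later part of the text, so its occurrences can simply be appended.

```python
def merge_indices(first, second):
    """Merge two partial indices given as lists of (word, occurrences)
    pairs sorted by word. Occurrences in `second` are assumed to come
    after those in `first`, so shared words just concatenate lists."""
    merged = []
    i = j = 0
    while i < len(first) and j < len(second):
        word1, occ1 = first[i]
        word2, occ2 = second[j]
        if word1 < word2:
            merged.append((word1, occ1))
            i += 1
        elif word2 < word1:
            merged.append((word2, occ2))
            j += 1
        else:
            merged.append((word1, occ1 + occ2))
            i += 1
            j += 1
    merged.extend(first[i:])
    merged.extend(second[j:])
    return merged

first = [("be", [3]), ("or", [6]), ("to", [0])]
second = [("be", [16]), ("not", [9]), ("to", [13])]
merged = merge_indices(first, second)
```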

9

10 Other Indices for Text
Suffix Trees; Suffix Arrays; Signature Files

11 Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix. Index points are selected from the text; they point to the beginning of the text positions that will be retrievable.

12

13 Suffix Arrays
The main drawback of suffix arrays is their costly construction process. They allow binary searches, done by comparing the contents of the text at each pointer. Supra-indices can be used for large suffix arrays.
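
A minimal sketch of the binary search over a suffix array follows. The construction shown simply sorts all suffixes, which is a naive stand-in chosen for brevity and not the construction method from the chapter; the names `build_suffix_array` and `find_occurrences` are likewise assumptions.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for the range of suffixes that start with `pattern`,
    comparing the text found at each pointer."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
```

Every suffix in the returned range begins with the pattern, so the sorted start positions are exactly the occurrences of the pattern in the text.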

14

15

16 Construction of Suffix Arrays for Large Texts

17 Signature Files
Word-oriented index structures based on hashing. A hash function maps words to bit masks of B bits. The text is divided into blocks of b words each. The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. To search, hash the query word to a bit mask W; if W & Bi = W, text block i may contain the word.
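
The scheme can be sketched in Python as follows. The values chosen for B and b, the use of three bits per word, and the MD5-based hash are all assumptions made for this illustration; the slides do not specify a hash function. Because the test W & Bi = W only says a block may contain the word, the sketch verifies each candidate block by scanning it.

```python
import hashlib

B = 64               # signature width in bits (assumed value)
BITS_PER_WORD = 3    # bits set per word signature (assumed value)
WORDS_PER_BLOCK = 4  # b, words per text block (assumed value)

def word_signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask

def build_signature_file(words):
    """Split the words into blocks and OR the word signatures per block."""
    blocks = [words[i:i + WORDS_PER_BLOCK]
              for i in range(0, len(words), WORDS_PER_BLOCK)]
    masks = []
    for block in blocks:
        mask = 0
        for word in block:
            mask |= word_signature(word)
        masks.append(mask)
    return blocks, masks

def candidate_blocks(word, masks):
    """Blocks whose mask Bi satisfies W & Bi == W (may include false drops)."""
    w = word_signature(word)
    return [i for i, block_mask in enumerate(masks) if w & block_mask == w]

def search(word, blocks, masks):
    """Filter by signature, then scan candidates to discard false drops."""
    return [i for i in candidate_blocks(word, masks) if word in blocks[i]]

words = "to be or not to be that is the question".split()
blocks, masks = build_signature_file(words)
```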

18

19 Sequential Searching
Brute Force; Knuth-Morris-Pratt; Boyer-Moore Family; Shift-Or; Suffix Automaton; Backward DAWG Matching (BDM); BNDM

20 Knuth-Morris-Pratt

21 Boyer-Moore Family

22 Shift-Or
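
The slide gives only the title, so the following is a sketch of the usual bit-parallel formulation of Shift-Or, written here as an assumption about what the slide covered. Bit i of the state is 0 when the first i + 1 pattern characters match the text ending at the current position; the function name `shift_or_search` is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of `pattern`."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for j, ch in enumerate(text):
        # Shift in a 0 (empty prefix always matches), then OR the mask.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(j - m + 1)
    return matches
```

Python integers are unbounded, so the state is masked to m bits on every step; in a language with fixed-width words the pattern length would be limited by the word size.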

23 Suffix Automaton

24

25 Pattern Matching
Searching allowing errors: Dynamic Programming; Automaton. Regular Expressions and Extended Patterns. Pattern Matching Using Indices: Inverted Files; Suffix Trees and Suffix Arrays.

26 Dynamic Programming
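
The slide carries only the title, so this is a sketch of the classical column-wise dynamic programming approach to searching allowing errors, offered as an assumption about the intended content. Entry i of the column holds the minimum number of errors needed to match the first i pattern characters against some substring ending at the current text position; the function name `approximate_search` is hypothetical.

```python
def approximate_search(text, pattern, k):
    """Return the text positions where an occurrence of `pattern`
    with at most k errors (edit distance) ends."""
    m = len(pattern)
    column = list(range(m + 1))  # matching i characters against nothing costs i
    ends = []
    for j, ch in enumerate(text):
        diagonal = column[0]     # value of entry i - 1 in the previous column
        column[0] = 0            # an occurrence may start at any text position
        for i in range(1, m + 1):
            previous = column[i]
            if pattern[i - 1] == ch:
                column[i] = diagonal
            else:
                column[i] = 1 + min(diagonal, previous, column[i - 1])
            diagonal = previous
        if column[m] <= k:
            ends.append(j)
    return ends
```

Only one column is kept at a time, so the space used is proportional to the pattern length while the time is proportional to the product of the pattern and text lengths.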

27 Automaton

28 Regular Expressions

29 Pattern Matching Using Indices
Inverted Files: queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary. The restriction is that approximate matches or regular expressions that span many words cannot be found.

30 Pattern Matching Using Indices
Suffix Trees: suffix trees are able to perform complex searches. Word, prefix, suffix, substring, and range queries; regular expressions; unrestricted approximate string matching. Useful in specific areas: find the longest substring; find the most common substring of a fixed size.

31 Pattern Matching Using Indices
Suffix Arrays: some patterns can be searched directly in the suffix array without simulating the suffix tree. Word, prefix, suffix, and subword search, and range search.

32 Compression
Compressed text: Huffman coding, taking words as symbols and using an alphabet of bytes instead of bits. Compressed indices: Inverted Files; Suffix Trees and Suffix Arrays; Signature Files.

