Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 4 (book chapter 8) : Indexing and Searching Alexander Gelbukh

1 Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 4 (book chapter 8) : Indexing and Searching Alexander Gelbukh

2 Previous Chapter: Conclusions
Main measures: Precision & Recall.
  o For sets
  o Rankings are evaluated through initial subsets
There are measures that combine them into one
  o Involve user-defined preferences
Many (other) characteristics
  o An algorithm can be good at some and bad at others
  o Averages are used, but are not always meaningful
A reference collection exists with known answers to evaluate new algorithms

3 Previous Chapter: Research topics
Different types of interfaces
Interactive systems:
  o What measures to use?
  o Such as informativeness

4 Types of searching
Indexed
  o Semi-static
  o Space overhead
Sequential
  o Small texts
  o Volatile, or space limited
Combined
  o Index into large portions, then sequential inside portion
  o Best combination of speed / overhead

5 Inverted files
Vocabulary: sqrt(n), by Heaps' law. 1 GB of text → about 5 MB of vocabulary
Occurrences: n * 40% (stopwords)
  o Positions (word, char), files, sections...
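
The two parts named on this slide, a vocabulary and per-word occurrence lists, can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `build_inverted_index`, the use of character positions, and the regular-expression tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_index(text, stopwords=frozenset()):
    """Map each non-stopword to the list of character positions where it starts.

    The dictionary keys play the role of the vocabulary; the lists are the
    occurrences. Positions come out in increasing order because the text is
    scanned left to right.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        word = match.group()
        if word not in stopwords:  # stopwords are simply not indexed
            index[word].append(match.start())
    return dict(index)


text = "to be or not to be"
index = build_inverted_index(text, stopwords={"or"})
print(sorted(index))  # the vocabulary
print(index["to"])    # occurrences of one word
```

Storing word numbers, file identifiers, or section identifiers instead of character positions only changes what is appended to each list.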

6 Compression: Block addressing
Block addressing: 5% overhead
  o 256, 64K, ... blocks (1, 2, ... bytes)
  o Equal size (faster search) or logical sections (retrieval units)
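
A rough sketch of block addressing with equal-size blocks: the index records only block numbers, and exact positions are recovered by scanning the candidate blocks sequentially. The names `build_block_index` and `find_positions` are hypothetical, and the code assumes plain ASCII text so that lowercasing does not change string length.

```python
import re
import string

# Characters that count as part of a word in this sketch (ASCII only).
WORD_CHARS = set(string.ascii_lowercase + string.digits)


def build_block_index(text, block_size):
    """Map each word to the sorted list of block numbers in which it starts."""
    index = {}
    for match in re.finditer(r"[a-z0-9]+", text.lower()):
        block = match.start() // block_size
        blocks = index.setdefault(match.group(), [])
        if not blocks or blocks[-1] != block:  # store each block once per word
            blocks.append(block)
    return index


def find_positions(text, index, word, block_size):
    """Recover exact positions by scanning only the blocks the index points to."""
    lowered, word = text.lower(), word.lower()
    positions = []
    for block in index.get(word, []):
        start, end = block * block_size, (block + 1) * block_size
        pos = lowered.find(word, start)
        while pos != -1 and pos < end:
            after = pos + len(word)
            before_ok = pos == 0 or lowered[pos - 1] not in WORD_CHARS
            after_ok = after == len(lowered) or lowered[after] not in WORD_CHARS
            if before_ok and after_ok:  # whole-word match only
                positions.append(pos)
            pos = lowered.find(word, pos + 1)
    return positions


text = "to be or not to be"
block_index = build_block_index(text, block_size=8)
print(block_index["be"], find_positions(text, block_index, "be", 8))
```

The index shrinks because a block number needs fewer bytes than a position and repeated occurrences inside one block collapse into a single entry; the price is the sequential scan at query time.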

7 Searching in inverted files
Vocabulary search
  o Separate file
  o Many searching techniques
  o Lexicographic: log V (voc. size) = ½ log n (Heaps)
  o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation with occurrences: ~sqrt(n) (Heaps, Zipf)
  o Boolean operations. Context search
Merging occurrences
For AND: one list is usually shorter (Zipf's law), so the merge is sublinear!
Only inverted files allow both space and time to be sublinear
  o Suffix trees and signature files don't
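
To illustrate why a lexicographically sorted vocabulary supports both exact and prefix search while hashing does not, here is a small sketch using Python's standard `bisect` module. The function names `lookup` and `prefix_matches` and the sample vocabulary are made up for the example.

```python
from bisect import bisect_left


def lookup(vocabulary, word):
    """Binary search in a sorted vocabulary: O(log V) comparisons.

    Returns the index of the word, or -1 if it is absent.
    """
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1


def prefix_matches(vocabulary, prefix):
    """All words starting with `prefix`.

    In sorted order they form one contiguous run, so a single binary search
    finds where the run starts. A hash table keeps no such order.
    """
    lo = bisect_left(vocabulary, prefix)
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1
    return vocabulary[lo:hi]


vocabulary = sorted(["index", "inverted", "search", "suffix", "signature"])
print(lookup(vocabulary, "suffix"))
print(prefix_matches(vocabulary, "s"))
```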

8 Building inverted file: 1
Infinite memory? Use a trie to store the vocabulary. O(n)
  o Append positions
Finite memory? Build in chunks, merge. Almost O(n)
Insertion: index + merge. Deleting: O(n). Very fast.
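
The "build in chunks, merge" idea can be sketched as below. This is only an illustration under stated assumptions: a plain dictionary stands in for the trie mentioned on the slide, the partial indexes are kept in memory rather than written to disk, and chunks are assumed to be split at word boundaries. The names `index_chunk`, `merge_partial_indexes`, and `build_index_in_chunks` are hypothetical.

```python
import heapq
import re
from collections import defaultdict


def index_chunk(chunk, offset):
    """Index one chunk that fits in memory; return (word, positions) pairs sorted by word."""
    partial = defaultdict(list)
    for match in re.finditer(r"\w+", chunk.lower()):
        partial[match.group()].append(offset + match.start())
    return sorted(partial.items())


def merge_partial_indexes(partials):
    """Merge sorted partial indexes in one pass, concatenating lists of equal words."""
    merged = {}
    # heapq.merge keeps ties in the order of its inputs, so positions from
    # earlier chunks come first and each merged list stays sorted.
    for word, positions in heapq.merge(*partials, key=lambda item: item[0]):
        merged.setdefault(word, []).extend(positions)
    return merged


def build_index_in_chunks(chunks):
    """Index each chunk separately, then merge the partial indexes."""
    partials, offset = [], 0
    for chunk in chunks:
        partials.append(index_chunk(chunk, offset))
        offset += len(chunk)
    return merge_partial_indexes(partials)


chunked_index = build_index_in_chunks(["to be or ", "not to be"])
print(chunked_index)
```

Insertion of new text follows the same pattern: index the new portion on its own and merge it into the existing index.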

9 Suffix trees
Text as one long string. No words.
  o Genetic databases
  o Complex queries
  o Compacted trie structure
  o Problem: space
For text retrieval, inverted files are better

11 Info for the tree comes from the text itself

12 Suffix array
All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
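
A small sketch of a suffix array and the binary search it allows. The construction here simply sorts all suffixes, which is fine for a toy example but is not the large-text construction discussed in the lecture; the names `build_suffix_array` and `find_occurrences` are assumptions for illustration.

```python
def build_suffix_array(text):
    """Positions of all suffixes, ordered lexicographically by the suffix they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Find all positions of `pattern` with two binary searches over the suffix array."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th suffix in sorted order.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_occurrences(text, suffix_array, "abra"))
```

All suffixes that begin with the pattern are adjacent in the array, which is why the answer is a single slice between the two bounds.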

13 Suffix tree and suffix array: Searching. Construction
Searching
Patterns, prefixes, phrases. Not only words
Suffix tree: O(m), but: space (m = query size)
Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
  o Large text: n^2 log(M)/M, more than for inverted files
  o Skip details
Addition: n n' log(M)/M (n' is the size of the new portion)
Deletion: n

14 Signature files
Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of its pattern are set
Sequential search for blocks
False drops!
  o Design of the hash function
  o Have to traverse the block
Good to search ANDs or proximity queries
  o Bit patterns are ORed
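
The mechanism on this slide can be sketched as follows. The signature width, the number of bits per word, and the use of SHA-256 as the hash are arbitrary choices made for the example, not values from the lecture; the function names are likewise hypothetical.

```python
import hashlib

SIGNATURE_BITS = 64  # width of a signature (assumed for this sketch)
BITS_PER_WORD = 3    # at most this many bits are set per word (assumed)


def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature


def block_signature(words):
    """A block's signature is the OR of the signatures of its words."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def search(blocks, signatures, word):
    """Scan the signatures sequentially, then verify candidates inside the block."""
    target = word_signature(word)
    hits = []
    for number, signature in enumerate(signatures):
        if signature & target == target:  # every bit of the word is set: candidate
            # Traverse the block to rule out a false drop.
            if any(w.lower() == word.lower() for w in blocks[number]):
                hits.append(number)
    return hits


blocks = [
    ["this", "is", "a", "text"],
    ["a", "text", "has", "many"],
    ["words", "are", "made", "from", "letters"],
]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "text"))
```

A block that contains the word always passes the bit test, so there are no misses; a block that passes without containing the word is a false drop, which only the verification step removes.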

15 False drop: letters in 2nd block

16 Boolean operations
Merging file (occurrences) lists
  o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
  o Can be slow if they are huge
There are optimization techniques
  o E.g.: merge small list with a big one by searching
  o This is a usual case (Zipf)
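
The optimization "merge a small list with a big one by searching" might look like this for an AND query. It is a sketch assuming both occurrence lists are sorted and free of duplicates; `intersect` is a name chosen for the example.

```python
from bisect import bisect_left


def intersect(list_a, list_b):
    """AND of two sorted occurrence lists.

    Each element of the shorter list is looked up in the longer one by binary
    search, so the cost is roughly len(short) * log(len(long)) instead of
    len(short) + len(long) for a plain linear merge.
    """
    short, long_ = (list_a, list_b) if len(list_a) <= len(list_b) else (list_b, list_a)
    result, lo = [], 0
    for item in short:
        # Restart from the previous hit: both lists are sorted.
        lo = bisect_left(long_, item, lo)
        if lo == len(long_):
            break  # nothing larger is left in the long list
        if long_[lo] == item:
            result.append(item)
    return result


rare_word_docs = [3, 50, 97]
common_word_docs = list(range(0, 100, 2))  # every even document number
print(intersect(rare_word_docs, common_word_docs))
```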

17 Sequential search
Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
MANY faster algorithms, but more complicated
  o See the book
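
For reference, the brute-force method mentioned here can be written in a few lines. This is a plain sketch; the function name is chosen for the example.

```python
def brute_force_search(text, pattern):
    """Try every alignment of the pattern against the text.

    Worst case O(n*m) character comparisons; on typical text most alignments
    fail after the first character or two, which gives O(n) on average.
    """
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # all m characters matched at alignment i
            positions.append(i)
    return positions


print(brute_force_search("abracadabra", "abra"))
```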

18 Approximate string matching
Match with k errors, select the one with min k
Levenshtein distance between strings s1 and s2
  o The minimum number of editing operations to make one from another
  o Symmetric for standard sets of operations
  o Operations: deletion, addition, change
  o Sometimes weighted
Solution: dynamic programming. O(mn), O(kn)
  o m, n are lengths of the two strings
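
A compact sketch of the O(mn) dynamic-programming solution with unit costs for deletion, addition, and change. It keeps only two rows of the table; the O(kn) variant mentioned on the slide is not shown.

```python
def levenshtein(s1, s2):
    """Minimum number of deletions, additions, and changes turning s1 into s2."""
    # previous[j] is the distance between the first i-1 characters of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from the first i characters of s1 to the empty string
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # add c2
                previous[j - 1] + (c1 != c2),  # change c1 into c2, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("survey", "surgery"))
```

With weighted operations, the three `+ 1` style terms would be replaced by per-operation costs.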

19 Regular expressions
  o Automaton: O(m 2^m) + O(n); bad for long patterns
  o There are better methods, see book
Using indices to search for words with errors
  o Inverted files: search in vocabulary
  o Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path

20 Search over compression
Improves both space AND time (fewer disk operations)
Compress query and search
  o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
  o Search each word in the vocabulary → its code
  o More sophisticated algorithms
Compressed inverted files: less disk → less time
Text and index compression can be combined

21 ...compression
Suffix trees can be compressed almost to the size of suffix arrays
Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
  o Instead of Huffman, use a code that respects alphabetic order
  o Almost the same compression
Signature files are sparse, so can be compressed
  o Ratios up to 70%

23 Research topics
Perhaps, new details in integration of compression and search
Linguistic indexing: allowing linguistic variations
  o Search in plural or only singular
  o Search with or without synonyms

24 Conclusions
Inverted files seem to be the best option
Other structures are good for specific cases
  o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
  o Many methods to improve sequential searching
Compression can be integrated with search

25 Thank you! Till April 26, 6 pm
