Alexander Gelbukh www.Gelbukh.com Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 4 (book chapter 8): Indexing and Searching
Previous Chapter: Conclusions. Main measures: Precision & Recall, defined for sets. Rankings are evaluated through initial subsets. There are measures that combine them into one; these involve user-defined preferences. Many (other) characteristics exist; an algorithm can be good at some and bad at others. Averages are used, but are not always meaningful. A reference collection with known answers exists to evaluate new algorithms.
Previous Chapter: Research topics. Different types of interfaces. Interactive systems: what measures to use? Such as informativeness.
Types of searching: indexed, sequential, combined. Indexed: for semi-static texts; costs space overhead. Sequential: for small texts, or texts that are volatile or space limited. Combined: index into large portions, then search sequentially inside a portion; best combination of speed and overhead.
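Inverted files. Vocabulary: grows roughly as sqrt(n) (Heaps' law); e.g., 1 GB of text gives about 5 MB of vocabulary. Occurrences: about 40% of n (with stopwords excluded). Occurrences can record positions (word, char), files, sections...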
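The vocabulary/occurrences split can be sketched in a few lines. This is only an illustration, not the structure used in the book: the stopword list, the tokenizer (`\w+` on lowercased text), and the function name `build_inverted_index` are all assumptions, and occurrences are stored as word positions.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption, not taken from the lecture).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}

def build_inverted_index(text):
    """Map each non-stopword to the list of word positions where it occurs."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        word = match.group()
        if word not in STOPWORDS:
            index[word].append(position)
    return dict(index)

index = build_inverted_index("The cat sat in the hat and the cat ran")
vocabulary = sorted(index)   # the vocabulary, kept in lexicographic order
print(vocabulary)            # ['cat', 'hat', 'ran', 'sat']
print(index["cat"])          # [1, 8]
```

Stopwords still advance the position counter, so the stored positions refer to the original text.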
Compression: block addressing. Occurrences point to blocks instead of exact positions: about 5% overhead. With 256, 64K, ... blocks, a pointer takes 1, 2, ... bytes. Blocks can be of equal size (faster search) or logical sections (retrieval units).
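A minimal sketch of block addressing, under simplifying assumptions: blocks are of equal size and measured in words rather than bytes, and the names `build_block_index` and `search_blocks` are hypothetical. The index stores only block numbers; exact positions are recovered by a sequential scan inside each candidate block.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_block_index(words, block_size):
    """Map each word to the increasing list of block numbers that contain it."""
    index = defaultdict(list)
    for position, word in enumerate(words):
        block = position // block_size
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(index)

def search_blocks(words, index, word, block_size):
    """Find exact word positions by scanning only the blocks listed in the index."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        for offset, candidate in enumerate(words[start:start + block_size]):
            if candidate == word:
                positions.append(start + offset)
    return positions

words = tokenize("to be or not to be that is the question")
block_index = build_block_index(words, block_size=4)
print(block_index["to"])                                     # [0, 1]
print(search_blocks(words, block_index, "be", block_size=4)) # [1, 5]
```

Larger blocks shrink the index but make the sequential scan inside each block longer.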
Searching in inverted files. Vocabulary search: the vocabulary is kept in a separate file; many searching techniques apply. Lexicographic order: log V = ½ log n, where V is the vocabulary size (Heaps). Hashing is not good for prefix search. Retrieval of occurrences. Manipulation of occurrences: ~sqrt(n) (Heaps, Zipf). Boolean operations, context search: merging occurrence lists. For AND, one list is usually shorter (Zipf's law), so the cost is sublinear! Only inverted files allow both space and time to be sublinear; suffix trees and signature files don't.
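To illustrate why a lexicographically sorted vocabulary supports prefix search while hashing does not, here is a small sketch using Python's standard `bisect` module. The vocabulary and the function names `find_word` and `find_prefix` are made up for the example.

```python
from bisect import bisect_left

def find_word(vocabulary, word):
    """Binary search in a sorted vocabulary; returns the index or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def find_prefix(vocabulary, prefix):
    """All words with the given prefix; in sorted order they are contiguous."""
    low = bisect_left(vocabulary, prefix)
    high = low
    while high < len(vocabulary) and vocabulary[high].startswith(prefix):
        high += 1
    return vocabulary[low:high]

vocabulary = ["car", "cart", "cat", "dog", "door"]
print(find_word(vocabulary, "cat"))    # 2
print(find_prefix(vocabulary, "ca"))   # ['car', 'cart', 'cat']
```

A hash table would find "cat" in constant time, but it scatters "car", "cart" and "cat" across unrelated slots, so the prefix query would need a full scan.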
Building an inverted file. Infinite memory? Use a trie to store the vocabulary and append positions: O(n). Finite memory? Build in chunks, then merge: almost O(n). Insertion: index the new text and merge. Deletion: O(n), very fast.
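The "build in chunks, merge" idea can be sketched as follows. This is a simplified in-memory illustration: a plain dictionary stands in for the trie, chunks are measured in words, and partial indexes are kept as sorted lists rather than written to disk. The names `index_chunk`, `merge_indexes`, and `build_in_chunks` are hypothetical.

```python
import heapq
import re
from collections import defaultdict

def index_chunk(words, first_position):
    """Index one chunk that fits in memory; return (word, positions) pairs sorted by word."""
    partial = defaultdict(list)
    for offset, word in enumerate(words):
        partial[word].append(first_position + offset)
    return sorted(partial.items())

def merge_indexes(partials):
    """Merge sorted partial indexes, concatenating the position lists of equal words."""
    merged = {}
    for word, positions in heapq.merge(*partials):
        merged.setdefault(word, []).extend(positions)
    return merged

def build_in_chunks(text, chunk_size):
    words = re.findall(r"\w+", text.lower())
    partials = [index_chunk(words[i:i + chunk_size], i)
                for i in range(0, len(words), chunk_size)]
    return merge_indexes(partials)

print(build_in_chunks("a b a c b a", chunk_size=2))
# {'a': [0, 2, 5], 'b': [1, 4], 'c': [3]}
```

Because earlier chunks hold smaller positions, the merge yields equal words in chunk order and the concatenated position lists stay sorted.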
Suffix trees. Text is treated as one long string, with no words; e.g., genetic databases. Support complex queries. Compacted trie structure. Problem: space. For text retrieval, inverted files are better.
Info for tree comes from the text itself
Suffix array. All suffixes (identified by position) in lexicographic order. Allows binary search. Much less space: 40% n. Supra-index: sampling, for better disk access.
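A small sketch of a suffix array with binary search. The construction below simply sorts suffix start positions by comparing whole suffixes, which is fine for illustration but far too slow and memory-hungry for large texts; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted by the suffixes themselves (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern, found with two binary searches."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    low, high = 0, len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) < pattern:
            low = mid + 1
        else:
            high = mid
    first = low
    # First suffix whose m-character prefix is > pattern.
    high = len(suffix_array)
    while low < high:
        mid = (low + high) // 2
        if prefix(mid) <= pattern:
            low = mid + 1
        else:
            high = mid
    return sorted(suffix_array[first:low])

text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, suffix_array, "ana"))   # [1, 3]
```

All suffixes that start with the pattern are adjacent in the array, which is why two boundary searches are enough.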
Suffix tree and suffix array: searching and construction. Can search patterns, prefixes, phrases, not only words. Suffix tree: O(m), but costly in space (m = query size). Suffix array: O(log n) (n = database size). Construction of arrays: sorting. Large text: n^2 log(M)/M, more than for inverted files (details skipped). Addition: n n' log(M)/M (n' is the size of the new portion). Deletion: n.
Signature files. Usually worse than inverted files. Words are mapped to bit patterns; blocks are mapped to the OR of their words' patterns. If a block contains a word, all bits of the word's pattern are set in the block's signature. Search is sequential over the blocks. False drops are possible, so the design of the hash function matters, and a candidate block has to be traversed to confirm the match. Good for AND or proximity queries: the query words' bit patterns are ORed.
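The signature scheme can be sketched as below. The parameters (16-bit signatures, 3 bits per word) and the hash construction based on `zlib.crc32` are arbitrary choices for the example, not the design discussed in the book; the function names are hypothetical.

```python
import zlib

SIGNATURE_BITS = 16   # width of a signature (assumed for the example)
BITS_PER_WORD = 3     # how many bits each word sets (assumed for the example)

def word_pattern(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    pattern = 0
    for seed in range(BITS_PER_WORD):
        bit = zlib.crc32(f"{seed}:{word}".encode("utf-8")) % SIGNATURE_BITS
        pattern |= 1 << bit
    return pattern

def block_signature(words):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_pattern(word)
    return signature

def search_signatures(blocks, signatures, word):
    """Return (candidate blocks, confirmed blocks) for a word."""
    pattern = word_pattern(word)
    # A block is a candidate if every bit of the word's pattern is set.
    candidates = [i for i, s in enumerate(signatures) if (s & pattern) == pattern]
    # Candidates may be false drops, so each one is traversed to confirm.
    confirmed = [i for i in candidates if word in blocks[i]]
    return candidates, confirmed

blocks = [["this", "is", "a", "text"], ["a", "text", "has", "many"],
          ["words", "words", "are", "made"], ["from", "letters"]]
signatures = [block_signature(block) for block in blocks]
candidates, confirmed = search_signatures(blocks, signatures, "text")
print(confirmed)   # [0, 1]
```

Any block in `candidates` that is missing from `confirmed` is a false drop; a block that really contains the word can never be missed.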
False drop: letters in 2nd block
Boolean operations. Merging lists of files (occurrences). AND: find the repetitions. Evaluated according to the query syntax tree. Complexity is linear in the size of intermediate results, so it can be slow if they are huge. There are optimization techniques, e.g., merge a small list with a big one by searching; this is the usual case (Zipf).
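The optimization of merging a small list with a big one by searching can be sketched like this, assuming both occurrence lists are sorted and duplicate-free. The function name `intersect` is hypothetical.

```python
from bisect import bisect_left

def intersect(list_a, list_b):
    """AND of two sorted lists: binary-search each item of the shorter list in the longer one."""
    short, long_ = (list_a, list_b) if len(list_a) <= len(list_b) else (list_b, list_a)
    result = []
    low = 0
    for item in short:
        # Continue from the previous hit; earlier entries cannot match.
        low = bisect_left(long_, item, low)
        if low == len(long_):
            break
        if long_[low] == item:
            result.append(item)
    return result

rare = [3, 40, 77]
frequent = list(range(0, 100, 2))   # 50 even numbers
print(intersect(rare, frequent))    # [40]
```

The cost is about (length of the short list) x log(length of the long list), instead of the sum of both lengths for a plain linear merge.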
Sequential search. A necessary part of many algorithms (e.g., block addressing). Brute force: O(nm) worst case, O(n) on average. MANY faster algorithms exist, but they are more complicated; see the book.
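For reference, the brute-force algorithm mentioned above looks roughly like this (the function name is hypothetical). It tries every start position and compares character by character.

```python
def brute_force_search(text, pattern):
    """Return all start positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            matches.append(start)
    return matches

print(brute_force_search("abracadabra", "abra"))   # [0, 7]
```

On typical text the inner loop stops after one or two characters, which gives the O(n) average; inputs such as "aaaa...a" with pattern "aa...ab" trigger the O(nm) worst case.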
Approximate string matching. Match with k errors, select the match with minimum k. Levenshtein distance between strings s1 and s2: the minimum number of editing operations needed to turn one into the other. Symmetric for standard sets of operations. Operations: deletion, addition (insertion), change (substitution); sometimes weighted. Solution: dynamic programming, O(mn) or O(kn), where m and n are the lengths of the two strings.
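The dynamic programming solution can be sketched as follows, with unit cost for each operation. The first function computes the distance between two whole strings; the second is the common text-search variant in which a match may start anywhere in the text. Both are the plain O(mn) version, not the O(kn) refinement, and the function names are hypothetical.

```python
def levenshtein(s1, s2):
    """Edit distance with unit-cost deletion, insertion, and substitution."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # delete c1
                               current[j - 1] + 1,             # insert c2
                               previous[j - 1] + (c1 != c2)))  # substitute or keep
        previous = current
    return previous[-1]

def search_with_errors(text, pattern, k):
    """End positions in text where pattern matches with at most k errors."""
    m = len(pattern)
    column = list(range(m + 1))
    ends = []
    for position, char in enumerate(text):
        new_column = [0]   # zero cost: a match may start at any text position
        for i in range(1, m + 1):
            new_column.append(min(column[i] + 1,
                                  new_column[i - 1] + 1,
                                  column[i - 1] + (pattern[i - 1] != char)))
        column = new_column
        if column[m] <= k:
            ends.append(position)
    return ends

print(levenshtein("survey", "surgery"))        # 2
print(search_with_errors("abcdef", "cde", 0))  # [4]
```

Only one previous row (or column) is kept, so memory is proportional to the length of one string even though the time is O(mn).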
Regular expressions. Automaton: O(m 2^m) + O(n), bad for long patterns. There are better methods, see the book. Using indices to search for words with errors. Inverted files: search in the vocabulary. Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path.
Search over compression. Improves both space AND time (fewer disk operations). Compress the query and search for it in the compressed text. Huffman compression with words as symbols and bytes as code units (by frequency: the most frequent words get shorter codes). Look up each query word in the vocabulary to get its code. There are more sophisticated algorithms. Compressed inverted files: less disk, less time. Text and index compression can be combined.
...compression. Suffix trees can be compressed almost to the size of suffix arrays. Suffix arrays can't be compressed (they are almost random), but can be constructed over compressed text: instead of Huffman, use a code that respects alphabetic order, with almost the same compression. Signature files are sparse, so they can be compressed, with ratios up to 70%.
Research topics. Perhaps new details in the integration of compression and search. “Linguistic” indexing: allowing linguistic variations, e.g., search in plural or only singular, search with or without synonyms.
Conclusions. Inverted files seem to be the best option. Other structures are good for specific cases, e.g., genetic databases. Sequential searching is an integral part of many indexing-based search techniques; there are many methods to improve it. Compression can be integrated with search.
Thank you! Till April 26, 6 pm