Special Topics in Computer Science: Advanced Topics in Information Retrieval. Lecture 4 (book chapter 8): Indexing and Searching. Alexander Gelbukh, www.Gelbukh.com



Previous Chapter: Conclusions
- Main measures: precision and recall, defined for sets
- Rankings are evaluated through their initial subsets
- There are measures that combine the two into one
  - They involve user-defined preferences
- Many other characteristics exist
  - An algorithm can be good at some and bad at others
  - Averages are used, but they are not always meaningful
- Reference collections with known answers exist for evaluating new algorithms

Previous Chapter: Research topics
- Different types of interfaces
- Interactive systems: what measures to use?
  - Such as informativeness

Types of searching
- Indexed, sequential, combined
- Indexed
  - Semi-static texts
  - Space overhead
- Sequential
  - Small texts
  - Volatile or space-limited texts
- Combined
  - Index into large portions, then sequential search inside a portion
  - Best combination of speed / overhead

Inverted files
- Vocabulary: sqrt(n) (Heaps' law). 1 GB → 5M
- Occurrences: n * 40% (stopwords)
  - Positions (word, char), files, sections...
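
The two parts of an inverted file, the vocabulary and the occurrence lists, can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name `build_inverted_index` is hypothetical, words are whatever the regular expression `\w+` matches, positions are word positions, and no stopwords are removed.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each word (the vocabulary) to the list of its word positions (the occurrences)."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        index[match.group()].append(position)
    return dict(index)

index = build_inverted_index("to be or not to be")
vocabulary = sorted(index)   # kept sorted so it can be searched lexicographically
print(vocabulary)
print(index["to"], index["be"])
```

Because positions are appended in text order, every occurrence list comes out sorted, which is what the later merging operations rely on.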

Compression: Block addressing
- Block addressing: 5% overhead
- 256, 64K, ... blocks (1, 2, ... bytes per pointer)
- Blocks of equal size (faster search) or logical sections (retrieval units)
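
A rough sketch of block addressing follows, assuming equal-size blocks measured in words (the slide does not fix the unit) and hypothetical helper names. The index stores only block numbers, so exact positions have to be recovered by a sequential scan inside each candidate block.

```python
def build_block_index(words, block_size):
    """Map each word to the sorted list of block numbers that contain it."""
    index = {}
    for position, word in enumerate(words):
        block = position // block_size
        blocks = index.setdefault(word, [])
        if not blocks or blocks[-1] != block:   # store each block only once
            blocks.append(block)
    return index

def find_positions(words, index, block_size, word):
    """Use the index to pick candidate blocks, then scan each block sequentially."""
    positions = []
    for block in index.get(word, []):
        start = block * block_size
        for position in range(start, min(start + block_size, len(words))):
            if words[position] == word:
                positions.append(position)
    return positions

words = "a rose is a rose is a rose".split()
block_index = build_block_index(words, block_size=3)
print(block_index["is"])
print(find_positions(words, block_index, 3, "is"))
```

Larger blocks mean fewer and shorter pointers but more sequential scanning, which is the speed / overhead trade-off of the combined approach.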

Searching in inverted files
- Vocabulary search
  - Separate file
  - Many searching techniques
  - Lexicographic: log V (V = vocabulary size) = ½ log n (Heaps)
  - Hashing is not good for prefix search
- Retrieval of occurrences
- Manipulation of occurrences: ~sqrt(n) (Heaps, Zipf)
  - Boolean operations, context search
  - Merging occurrences
  - For AND: one list is usually shorter (Zipf's law) → sublinear!
- Only inverted files allow both space and time to be sublinear
  - Suffix trees and signature files don't
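
To illustrate why a lexicographically sorted vocabulary supports prefix search while a hash table does not, here is a minimal sketch using Python's standard `bisect` module. The function name `prefix_range` and the sample vocabulary are made up for the example.

```python
from bisect import bisect_left

def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with the given prefix."""
    lo = bisect_left(vocabulary, prefix)        # O(log V) binary search
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1                                 # matching words are contiguous
    return vocabulary[lo:hi]

vocabulary = sorted(["index", "info", "inverted", "search", "signature", "suffix"])
print(prefix_range(vocabulary, "in"))
print(prefix_range(vocabulary, "s"))
```

In sorted order all words sharing a prefix are adjacent, so one binary search finds the start of the run. A hash function scatters them, so a hashed vocabulary would have to be scanned entirely.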

Building inverted file: 1
- Infinite memory? Use a trie to store the vocabulary. O(n): append positions
- Finite memory? Build in chunks, then merge. Almost O(n)
- Insertion: index + merge. Deletion: O(n). Very fast.
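
The chunk-and-merge idea for finite memory can be sketched as below. This is a simplification with hypothetical names: a Python dict stands in for the trie, everything stays in memory, and the partial indexes are merged one after another, whereas a real system would typically write them to disk and merge them there.

```python
def index_chunk(words, offset):
    """Index one chunk that fits in memory; positions are global (offset + local)."""
    index = {}
    for local, word in enumerate(words):
        index.setdefault(word, []).append(offset + local)
    return index

def merge_indexes(left, right):
    """Merge two partial indexes; every position in right follows every one in left."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        merged.setdefault(word, []).extend(positions)   # lists stay sorted
    return merged

def build_in_chunks(words, chunk_size):
    """Build the full index chunk by chunk, merging each partial index in."""
    index = {}
    for offset in range(0, len(words), chunk_size):
        partial = index_chunk(words[offset:offset + chunk_size], offset)
        index = merge_indexes(index, partial)
    return index

words = "to be or not to be".split()
chunked = build_in_chunks(words, chunk_size=2)
print(chunked)
```

Merging is cheap because the occurrence lists of a later chunk can simply be appended to those of an earlier one. Inserting new text works the same way: index the new portion, then merge.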

Suffix trees
- Text as one long string. No words.
  - Genetic databases
  - Complex queries
- Compacted trie structure
- Problem: space
- For text retrieval, inverted files are better

Info for tree comes from the text itself

Suffix array
- All suffixes (by position) in lexicographic order
- Allows binary search
- Much less space: 40% n
- Supra-index: sampling, for better disk access
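
A small sketch of a suffix array and its binary search is given below. It makes assumptions for brevity: every character position is an index point, the array is built by naively sorting the suffixes (fine for a toy text, far too slow for a large one), and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the suffix start positions sorted by the suffixes they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the run of suffixes that start with pattern; return their positions."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "abra"))
```

The array holds only one pointer per index point and no tree nodes, which is where the space saving over a suffix tree comes from.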

Suffix tree and suffix array: Searching. Construction
- Patterns, prefixes, phrases. Not only words
- Suffix tree: O(m), but: space (m = query size)
- Suffix array: O(log n) (n = database size)
- Construction of arrays: sorting
  - Large text: n^2 log(M)/M, more than for inverted files
  - Skip details
- Addition: n n' log(M)/M (n' is the size of the new portion)
- Deletion: n

Signature files
- Usually worse than inverted files
- Words are mapped to bit patterns
- Blocks are mapped to ORs of their word patterns
  - If a block contains a word, all bits of its pattern are set
- Sequential search for blocks
- False drops!
  - Design of the hash function
  - Have to traverse the block
- Good for searching ANDs or proximity queries
  - The bit patterns are ORed
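
The signature mechanism can be sketched as follows. The parameters here are arbitrary assumptions, not values from the lecture: 64-bit signatures, 3 bits set per word, and SHA-256 (from the standard `hashlib` module) standing in for the hash function. Function names are hypothetical.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word (may collide)

def word_signature(word):
    """Map a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(block):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in block:
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Scan the signatures sequentially; verify candidates to rule out false drops."""
    pattern = word_signature(word)
    hits = []
    for number, (block, signature) in enumerate(zip(blocks, signatures)):
        if (signature & pattern) == pattern and word in block:
            hits.append(number)
    return hits

blocks = [["text", "retrieval"], ["signature", "files"], ["inverted", "files"]]
signatures = [block_signature(block) for block in blocks]
print(search(blocks, signatures, "files"))
```

The signature test can never miss a block that contains the word, but it can accept one that does not (a false drop), which is why the `word in block` traversal is still needed.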

False drop: letters in 2nd block

Boolean operations
- Merging file (occurrence) lists
  - AND: to find repetitions
- According to the query syntax tree
- Complexity linear in the intermediate results
  - Can be slow if they are huge
- There are optimization techniques
  - E.g.: merge a small list with a big one by searching
  - This is a usual case (Zipf)
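
Both ways of computing an AND can be sketched like this, assuming each occurrence list is a sorted list of distinct file numbers. The function names are hypothetical, and `bisect_left` is from Python's standard library.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Linear merge of two sorted lists: O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_by_search(short, long):
    """Look up each item of the short list in the long one: O(len(short) * log len(long))."""
    result = []
    for item in short:
        k = bisect_left(long, item)
        if k < len(long) and long[k] == item:
            result.append(item)
    return result

rare = [3, 40, 95]                  # occurrence list of an infrequent word
common = list(range(0, 100, 5))     # occurrence list of a frequent word
print(intersect_merge(rare, common))
print(intersect_by_search(rare, common))
```

When one list is much shorter than the other, the second variant touches only a logarithmic part of the long list per lookup, which is the sublinear behaviour the Zipf argument refers to.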

Sequential search
- Necessary part of many algorithms (e.g., block addressing)
- Brute force: O(nm) worst case, O(n) on average
- MANY faster algorithms, but more complicated
  - See the book
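
The brute-force algorithm is short enough to write out in full. The sketch below (with a hypothetical function name) tries every text position and compares character by character.

```python
def brute_force_search(text, pattern):
    """Return every start position at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        offset = 0
        # Compare until the first mismatch; on typical text this stops early,
        # which gives O(n) on average, but it can take m steps at every start.
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            matches.append(start)
    return matches

print(brute_force_search("abracadabra", "abra"))
```

A text such as "aaaa...a" searched for "aa...ab" forces nearly m comparisons at every position, which is the O(nm) worst case.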

Approximate string matching
- Match with k errors, select the one with min k
- Levenshtein distance between strings s1 and s2
  - The minimum number of editing operations needed to turn one into the other
  - Symmetric for standard sets of operations
  - Operations: deletion, addition, change
  - Sometimes weighted
- Solution: dynamic programming. O(mn), O(kn)
  - m, n are the lengths of the two strings
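
The dynamic programming solution can be sketched as below, assuming unit cost for each of the three operations. The first function computes the Levenshtein distance in O(mn). The second is the common variant for searching a pattern inside a longer text, in which a match may start anywhere; it reports end positions only. Both function names are hypothetical, and the O(kn) refinement is not shown.

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost deletion, addition and change."""
    previous = list(range(len(s2) + 1))       # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, 1):
        current = [i]                         # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                # delete c1
                               current[j - 1] + 1,             # add c2
                               previous[j - 1] + (c1 != c2)))  # change, or keep if equal
        previous = current
    return previous[-1]

def approximate_search(text, pattern, k):
    """Return text positions where a match of pattern with at most k errors ends."""
    m = len(pattern)
    column = list(range(m + 1))
    ends = []
    for j, t in enumerate(text):
        new = [0]                             # a match may start at any text position
        for i in range(1, m + 1):
            new.append(min(column[i] + 1,
                           new[i - 1] + 1,
                           column[i - 1] + (pattern[i - 1] != t)))
        column = new
        if column[m] <= k:
            ends.append(j)
    return ends

print(edit_distance("kitten", "sitting"))
print(approximate_search("abracadabra", "abra", 0))
```

Only one row (or column) of the table is kept at a time, so the memory use is proportional to the length of one string rather than to the product of both.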

Regular expressions
- Regular expressions
  - Automaton: O(m 2^m) + O(n); bad for long patterns
  - There are better methods, see the book
- Using indices to search for words with errors
  - Inverted files: search in the vocabulary
  - Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path

Search over compression
- Improves both space AND time (fewer disk operations)
- Compress the query and search
  - Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
  - Search each word in the vocabulary → its code
  - More sophisticated algorithms
- Compressed inverted files: less disk → less time
- Text and index compression can be combined

...compression
- Suffix trees can be compressed almost to the size of suffix arrays
- Suffix arrays can't be compressed (almost random), but can be constructed over compressed text
  - Instead of Huffman, use a code that respects alphabetic order
  - Almost the same compression
- Signature files are sparse, so they can be compressed
  - Ratios up to 70%

Research topics
- Perhaps new details in the integration of compression and search
- "Linguistic" indexing: allowing linguistic variations
  - Search in plural or only singular
  - Search with or without synonyms

Conclusions
- Inverted files seem to be the best option
- Other structures are good for specific cases
  - Genetic databases
- Sequential searching is an integral part of many indexing-based search techniques
  - Many methods to improve sequential searching
- Compression can be integrated with search

Thank you! Till April 26, 6 pm