Information Retrieval in Practice

Presentation transcript:

Information Retrieval in Practice: Query Processing

Query Processing
Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.
Term-at-a-time: accumulates scores for documents by processing term lists one at a time.
Both approaches have optimization techniques that significantly reduce the time required to generate scores.

Document-At-A-Time
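
A minimal sketch of document-at-a-time evaluation, offered as an illustration rather than as the deck's own pseudocode. The index layout (a dict from term to postings of `(doc_id, weight)` sorted by `doc_id`), the function name, and the sum-of-weights scoring are all assumptions.

```python
import heapq

def document_at_a_time(query_terms, index, k):
    """Score one document completely before moving on to the next.

    index: term -> list of (doc_id, weight), sorted by doc_id (assumed layout).
    The score of a document is the sum of its matching weights (assumed).
    """
    lists = [index.get(t, []) for t in query_terms]
    positions = [0] * len(lists)           # one cursor per inverted list
    top_k = []                             # min-heap of (score, doc_id)
    while True:
        # The next document is the smallest doc_id under any cursor.
        current = [pl[p][0] for pl, p in zip(lists, positions) if p < len(pl)]
        if not current:
            break                          # every list is exhausted
        d = min(current)
        score = 0.0
        for i, pl in enumerate(lists):
            p = positions[i]
            if p < len(pl) and pl[p][0] == d:
                score += pl[p][1]
                positions[i] += 1          # this posting is consumed
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)           # keep only the k best so far
    return sorted(top_k, key=lambda sd: (-sd[0], sd[1]))
```

Only the k best scores are held in memory at any time, which is why this strategy needs no accumulator table.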

Pseudocode Function Descriptions
getCurrentDocument(): returns the document number of the current posting of the inverted list.
skipForwardToDocument(d): moves forward in the inverted list until getCurrentDocument() >= d. This function may read to the end of the list.
movePastDocument(d): moves forward in the inverted list until getCurrentDocument() > d.
moveToNextDocument(): moves to the next document in the list. Equivalent to movePastDocument(getCurrentDocument()).
getNextAccumulator(d): returns the first document number d' >= d that already has an accumulator.
removeAccumulatorsBetween(a, b): removes all accumulators for document numbers between a and b. A_d will be removed iff a < d < b.
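
One possible reading of these functions in Python, as a sketch only. The class name `InvertedListCursor`, the `END` sentinel, and the use of a plain dict for the accumulator table are assumptions; binary search stands in for whatever skipping structure a real index would use.

```python
from bisect import bisect_left

class InvertedListCursor:
    """Cursor over a sorted list of document numbers (hypothetical class)."""

    END = float("inf")  # assumed sentinel: returned once the list is exhausted

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = 0

    def get_current_document(self):
        if self.pos < len(self.doc_ids):
            return self.doc_ids[self.pos]
        return self.END

    def skip_forward_to_document(self, d):
        # Advance until get_current_document() >= d; may reach the end.
        self.pos = bisect_left(self.doc_ids, d, self.pos)

    def move_past_document(self, d):
        # Advance until get_current_document() > d.
        while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] <= d:
            self.pos += 1

    def move_to_next_document(self):
        self.move_past_document(self.get_current_document())


def get_next_accumulator(accumulators, d):
    """Smallest document number d' >= d that has an accumulator, else None."""
    candidates = [doc for doc in accumulators if doc >= d]
    return min(candidates) if candidates else None


def remove_accumulators_between(accumulators, a, b):
    """Delete A_d for every d with a < d < b (both bounds exclusive)."""
    for doc in [doc for doc in accumulators if a < doc < b]:
        del accumulators[doc]
```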

Term-At-A-Time
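
A minimal sketch of term-at-a-time evaluation, under the same illustrative assumptions as before: the index is a dict from term to `(doc_id, weight)` postings, and a document's score is the sum of its matching weights. The function name is hypothetical.

```python
import heapq
from collections import defaultdict

def term_at_a_time(query_terms, index, k):
    """Read each inverted list in full, adding partial scores to accumulators."""
    accumulators = defaultdict(float)      # doc_id -> partial score A_d
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Scores are complete only after the last list has been processed.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
```

Each list is read sequentially from start to end, but the accumulator table can grow to one entry per matching document.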

Optimization Techniques
Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently.
Two classes of optimization:
  Read less data from inverted lists, e.g., skip lists. Better for simple feature functions.
  Calculate scores for fewer documents, e.g., conjunctive processing. Better for complex feature functions.

Conjunctive Term-at-a-Time
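
A sketch of conjunctive term-at-a-time processing, in which a document must contain every query term. The first list creates the accumulators; each later list can only update or remove them. The index layout, the function name, and the use of binary search to emulate skipping are assumptions.

```python
from bisect import bisect_left

def conjunctive_term_at_a_time(query_terms, index, k):
    """index: term -> list of (doc_id, weight), sorted by doc_id (assumed)."""
    if not query_terms:
        return []
    # First term: every posting gets an accumulator.
    accumulators = dict(index.get(query_terms[0], []))
    for term in query_terms[1:]:
        postings = index.get(term, [])
        doc_ids = [doc_id for doc_id, _ in postings]
        survivors = {}
        pos = 0
        for d in sorted(accumulators):
            # Skip forward to the first posting with doc_id >= d.
            pos = bisect_left(doc_ids, d, pos)
            if pos == len(doc_ids):
                break                      # list exhausted: no further matches
            if doc_ids[pos] == d:
                survivors[d] = accumulators[d] + postings[pos][1]
        accumulators = survivors           # documents missing the term are dropped
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

Because postings for documents without an accumulator are never scored, the accumulator table can only shrink after the first term.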

Conjunctive Document-at-a-Time
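
A sketch of conjunctive document-at-a-time processing. All cursors are repeatedly skipped forward to the largest current document number, and a document is scored only when every list agrees on it. As before, the index layout, the function name, and the sum-of-weights score are assumptions.

```python
from bisect import bisect_left

def conjunctive_document_at_a_time(query_terms, index, k):
    """index: term -> list of (doc_id, weight), sorted by doc_id (assumed)."""
    lists = [index.get(t, []) for t in query_terms]
    if not lists:
        return []
    ids = [[doc_id for doc_id, _ in pl] for pl in lists]
    positions = [0] * len(lists)
    results = []
    # Stop as soon as any list is exhausted: no further document can match.
    while all(p < len(l) for l, p in zip(ids, positions)):
        # No document below the largest current doc_id can be in every list.
        target = max(l[p] for l, p in zip(ids, positions))
        for i, l in enumerate(ids):
            positions[i] = bisect_left(l, target, positions[i])
        if all(p < len(l) and l[p] == target for l, p in zip(ids, positions)):
            score = sum(pl[p][1] for pl, p in zip(lists, positions))
            results.append((target, score))
            positions = [p + 1 for p in positions]
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:k]
```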

Threshold Methods
Threshold methods use the number of top-ranked documents needed (k) to optimize query processing. For most applications, k is small.
For any query, there is a minimum score that each document needs to reach before it can be shown to the user: the score of the kth-highest scoring document gives the threshold τ.
Optimization methods estimate τ′ to ignore documents.

Threshold Methods
For document-at-a-time processing, use the score of the lowest-ranked document among the top k seen so far for τ′. For term-at-a-time, have to use the kth-largest score in the accumulator table.
The MaxScore method compares the maximum score that remaining documents could have to τ′. It is a safe optimization, in that the ranking will be the same as without the optimization.

MaxScore Example
The indexer computes μ_tree, the maximum score for any document containing just “tree”.
Assume k = 3; τ′ is the lowest score after the first three docs.
It is likely that τ′ > μ_tree, because τ′ is the score of a document that contains both query terms.
Can safely skip over all gray postings.
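
A simplified sketch of the MaxScore idea, not a production implementation. Lists are ordered by their maximum weight μ, and once the running threshold τ′ reaches the combined maximum of the weakest lists, documents that appear only in those lists are skipped. The toy index, the weights, and the function name are assumptions; dicts stand in for real cursors with skipping, and ties in score are ignored.

```python
import heapq
from itertools import accumulate

def max_score_top_k(query_terms, index, k):
    """Return (ranking, number_of_documents_scored)."""
    # Order terms by mu, the largest weight in each list (ascending).
    terms = sorted((t for t in query_terms if index.get(t)),
                   key=lambda t: max(w for _, w in index[t]))
    mu = [max(w for _, w in index[t]) for t in terms]
    bound = list(accumulate(mu))       # bound[i]: best score using only lists 0..i
    postings = [dict(index[t]) for t in terms]
    top_k, scored, first_essential = [], 0, 0
    for d in sorted(set().union(*postings)):
        # tau' is the k-th best score so far (nothing can be skipped before that).
        threshold = top_k[0][0] if len(top_k) == k else float("-inf")
        while first_essential < len(terms) and bound[first_essential] <= threshold:
            first_essential += 1       # this list alone can no longer beat tau'
        if not any(d in p for p in postings[first_essential:]):
            continue                   # d cannot exceed tau': safe to skip
        score = sum(p.get(d, 0.0) for p in postings)
        scored += 1
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)
    ranking = sorted(((d, s) for s, d in top_k), key=lambda ds: (-ds[1], ds[0]))
    return ranking, scored

# Toy data (assumed): "tree" is common with low weights, "eucalyptus" is rare.
index = {
    "eucalyptus": [(1, 3.0), (2, 2.5), (3, 3.5)],
    "tree": [(1, 1.0), (2, 0.75), (3, 0.25), (4, 1.25),
             (5, 0.5), (6, 1.0), (7, 0.75), (8, 1.0)],
}
ranking, scored = max_score_top_k(["eucalyptus", "tree"], index, k=3)
# Only documents 1-3 are scored; the "tree"-only postings 4-8 are skipped.
```

In this toy run τ′ is 3.25 after the first three documents, which already exceeds μ_tree = 1.25, so the remaining postings for "tree" cannot change the top three.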

Other Approaches
Early termination of query processing:
  ignore high-frequency word lists in term-at-a-time
  ignore documents at end of lists in doc-at-a-time
  unsafe optimization
List ordering:
  order inverted lists by quality metric (e.g., PageRank) or by partial score
  makes unsafe (and fast) optimizations more likely to produce good documents

Distributed Evaluation
Basic process:
  All queries are sent to a director machine.
  The director then sends messages to many index servers.
  Each index server does some portion of the query processing.
  The director organizes the results and returns them to the user.
Two main approaches:
  Document distribution (by far the most popular)
  Term distribution

Distributed Evaluation
Document distribution:
  Each index server acts as a search engine for a small fraction of the total collection.
  The director sends a copy of the query to each of the index servers, each of which returns its top-k results.
  Results are merged into a single ranked list by the director.
Collection statistics should be shared for effective ranking.
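
The document-distribution flow can be sketched in a single process, with plain function calls standing in for network messages. The function names `search_shard` and `director`, the shard layout, and the sum-of-weights score are assumptions; the weights are taken to be already computed from shared collection statistics.

```python
import heapq
from collections import defaultdict
from itertools import chain

def search_shard(shard_index, query_terms, k):
    """Index server: rank only the documents held by this shard."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in shard_index.get(term, []):
            scores[doc_id] += weight
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def director(shards, query_terms, k):
    """Director: send the query to every shard, then merge the top-k lists."""
    partial_results = [search_shard(s, query_terms, k) for s in shards]
    return heapq.nlargest(k, chain.from_iterable(partial_results))
```

Each shard needs to return only k results, since no document outside a shard's own top k can appear in the global top k.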

Distributed Evaluation
Term distribution:
  A single index is built for the whole cluster of machines.
  Each inverted list in that index is then assigned to one index server. In most cases, the data needed to process a query is not stored on a single machine.
  One of the index servers is chosen to process the query, usually the one holding the longest inverted list.
  Other index servers send information to that server.
  Final results are sent to the director.

Caching
Query distributions are similar to Zipf: about half of the queries each day are unique, but some are very popular.
Caching can significantly improve efficiency:
  Cache popular query results.
  Cache common inverted lists.
Inverted list caching can help with unique queries.
The cache must be refreshed to prevent stale data.
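
A small sketch of a query-result cache that evicts the least recently used entry and treats old entries as stale. The class name, the capacity and time-to-live parameters, and the injectable clock are assumptions chosen for illustration; the slides do not prescribe a refresh policy.

```python
import time
from collections import OrderedDict

class QueryResultCache:
    """LRU cache of query results with a time-to-live (hypothetical class)."""

    def __init__(self, capacity, ttl_seconds, clock=time.monotonic):
        self._capacity = capacity
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()      # query -> (stored_at, results)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None                    # miss: caller runs the query
        stored_at, results = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[query]       # stale: force a refresh
            return None
        self._entries.move_to_end(query)   # mark as most recently used
        return results

    def put(self, query, results):
        self._entries[query] = (self._clock(), results)
        self._entries.move_to_end(query)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # evict least recently used
```

The same structure could hold inverted lists keyed by term instead of results keyed by query, which is what lets a cache help even with queries it has never seen.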