
EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
Polytechnic University, Brooklyn
*NEC Laboratories America, Inc., New Jersey
For more information please send to or

Comparison to Traditional Search
Retrospective Search:
- On a previously crawled file collection
- Searching the past
- Collection of files is static
- Queries are dynamic
Prospective Search:
- On newly added or updated files
- Searching the future
- Files are dynamic
- Collection of queries is static

What is RSS?
- Rich Site Summary (version 0.91)
- RDF Site Summary (versions 0.9 and 1.0)
- Really Simple Syndication (version 2.0)
Provides:
- Web content (or summaries)
- Meta-data (TITLE, URL and DESCRIPTION)
Goals:
- Web syndication
- Allow readers to keep track of updates

Query (Subscription) Types
- AND only: All terms have to appear
- k-out-of-n: At least k (out of all n) terms have to appear
- Boolean: Boolean expression with AND, OR and NOT

Internal Representations for Efficient Matching
Use of Inverted Index:
- Queries are indexed by their terms
- Reduces the number of queries examined
- Queries, terms and documents are represented by unique identifiers (QIDs, TIDs, DIDs)
Example queries:
Q1: New York Yankees
Q2: Yankees Red Sox
Q3: Boston Red Sox
Resulting inverted index:
New: Q1
York: Q1
Yankees: Q1 Q2
Red: Q2 Q3
Sox: Q2 Q3
Boston: Q3

A Motivating Application
Notify the subscriber if an interesting document appears on the web.

Problem Definition
Given a large number of subscriptions (on the order of millions), how can we efficiently match a large number of incoming documents (thousands per second) against all subscriptions?

Challenges
- Scalability and load balancing
- Support for enhanced subscription capabilities
- Automatic resource (RSS) discovery and efficient crawling
- Improved service (a longer history of matches, ranking)
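
The inverted index over queries described under "Internal Representations for Efficient Matching" can be sketched as follows. This is only an illustration, not the authors' implementation: the function name build_query_index is hypothetical, terms are kept as plain strings instead of TIDs, and splitting on whitespace is an assumed tokenization.

```python
from collections import defaultdict


def build_query_index(queries):
    """Map each term to the sorted list of query IDs (QIDs) containing it.

    `queries` maps a QID to the query text. Whitespace tokenization is an
    assumption made for this sketch.
    """
    index = defaultdict(set)
    for qid, text in queries.items():
        for term in text.split():
            index[term].add(qid)
    # Sorted lists make the output deterministic and easy to read.
    return {term: sorted(qids) for term, qids in index.items()}


# The three example queries from the poster.
queries = {
    "Q1": "New York Yankees",
    "Q2": "Yankees Red Sox",
    "Q3": "Boston Red Sox",
}
index = build_query_index(queries)

for term, qids in index.items():
    print(f"{term}: {' '.join(qids)}")
```

Because each term points only to the queries that contain it, an incoming document needs to touch just the queries that share at least one term with it, which is the reduction in examined queries that the poster refers to.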

Datasets and Experimental Evaluations
- Subscriptions: Query logs from excite.com
- Documents: Crawled and parsed web pages
- Evaluation: Throughput with various numbers of subscriptions

A Primitive Matching Algorithm (AND only)
For each TID in the document:
- Find queries that contain TID (using the inverted index)
- Maintain a counter (for each query returned)
There is a match if (counter == query size)

Opt 1: Exploiting Term Frequencies and Position Information
- Assign TIDs based on frequencies
- Sort terms in the queries by TIDs
- Sort terms in the incoming document by TIDs
- For each TID in the document
  - If (TID pos == 0) counter = 1
  - Else if (TID pos == counter) counter++
Advantage:
- Fewer counters maintained in the accumulators
- Smaller hash table

Opt 2: Use of Bloom Filters
Bloom Filter: A probabilistic, space-efficient method for membership queries
- For each new item, set the corresponding bit to 1
- False negatives are guaranteed not to occur
Advantage: Reduced cost of maintaining the accumulators

Opt 3: Partitioning the Queries
- Create multiple smaller inverted indexes
- Repeat the matching algorithm
Advantage: Better locality (in the processor cache)

A Clustering Approach
- Queries often share terms, and some queries are contained in others
- Once a query has been evaluated on a document, the queries it contains can be answered very efficiently

A Greedy Clustering Algorithm
- Create (artificial) super queries
- Create an inverted index only for super queries
- Maintain bit vectors instead of counters in the accumulators
- Evaluate the corresponding cluster of contained queries for any matched super query
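
The primitive AND-only matching algorithm and the position-based counter rule of Opt 1 can be sketched as below. This is a hedged reading of the poster, not the authors' code: the function names, the integer TIDs in the example, and the interpretation of "TID pos" as the term's position within the query's sorted TID list are all assumptions. The frequency-based TID assignment itself is not modeled here.

```python
from collections import defaultdict


def build_index(queries):
    """Return (index, sizes): index maps TID -> [(QID, pos)], where pos is
    the rank of the TID inside the query's sorted TID list (assumed reading
    of "TID pos"); sizes maps QID -> number of distinct terms."""
    index, sizes = defaultdict(list), {}
    for qid, tids in queries.items():
        ordered = sorted(set(tids))
        sizes[qid] = len(ordered)
        for pos, tid in enumerate(ordered):
            index[tid].append((qid, pos))
    return index, sizes


def match_primitive(index, sizes, doc_tids):
    """One counter per query that shares any term with the document."""
    counters = defaultdict(int)
    for tid in set(doc_tids):
        for qid, _pos in index.get(tid, ()):
            counters[qid] += 1
    matches = sorted(q for q, c in counters.items() if c == sizes[q])
    return matches, len(counters)


def match_positional(index, sizes, doc_tids):
    """Opt 1 sketch: a counter is created only when the query's first term
    (pos 0) occurs, and advances only if terms arrive in position order."""
    counters = {}
    for tid in sorted(set(doc_tids)):  # document terms in TID order
        for qid, pos in index.get(tid, ()):
            if pos == 0:
                counters[qid] = 1
            elif counters.get(qid) == pos:
                counters[qid] += 1
    matches = sorted(q for q, c in counters.items() if c == sizes[q])
    return matches, len(counters)


# Hypothetical TIDs: New=1, York=2, Yankees=3, Red=4, Sox=5, Boston=6.
queries = {"Q1": [1, 2, 3], "Q2": [3, 4, 5], "Q3": [6, 4, 5]}
index, sizes = build_index(queries)
doc = [3, 4, 5, 9]  # "Yankees Red Sox" plus one unrelated term

prim_matches, prim_counters = match_primitive(index, sizes, doc)
pos_matches, pos_counters = match_positional(index, sizes, doc)
print(prim_matches, prim_counters)
print(pos_matches, pos_counters)
```

Both functions report the same matches, but the positional variant never allocates a counter for a query whose first term is absent from the document, which is one way to read the "fewer counters" advantage listed under Opt 1.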