Presentation on theme: "EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES" - Presentation transcript:
For more information please send email to email@example.com or firstname.lastname@example.org

EFFICIENT QUERY SUBSCRIPTION PROCESSING FOR PROSPECTIVE SEARCH ENGINES
Utku Irmak, Svilen Mihaylov, Torsten Suel, Samrat Ganguly*, and Rauf Izmailov*
Polytechnic University, Brooklyn
*NEC Laboratories America, Inc., New Jersey

Comparison to Traditional Search
Retrospective Search:
- On a previously crawled file collection
- Searching the past
- Collection of files is static
- Queries are dynamic
Prospective Search:
- On newly added or updated files
- Searching the future
- Files are dynamic
- Collection of queries is static

What is RSS?
- Rich Site Summary (version 0.91)
- RDF Site Summary (versions 0.9 and 1.0)
- Really Simple Syndication (version 2.0)
Provides:
- Web content (or summaries)
- Meta-data (TITLE, URL and DESCRIPTION)
Goals:
- Web Syndication
- Allow readers to keep track of updates

Query (Subscription) Types
- AND only: All terms have to appear
- k-out-of-n: At least k (out of all n) terms have to appear
- Boolean: Boolean expression with AND, OR and NOT

Internal Representations for Efficient Matching
Use of Inverted Index:
- Queries are indexed by their terms
- Reduces the number of queries examined
- Queries, Terms and Documents are represented by unique identifiers (QIDs, TIDs, DIDs)

Q1: New York Yankees
Q2: Yankees Red Sox
Q3: Boston Red Sox

New: Q1
York: Q1
Yankees: Q1 Q2
Red: Q2 Q3
Sox: Q2 Q3
Boston: Q3

A Motivating Application
Notify the subscriber if an interesting document appears on the web

Problem Definition
Given a large number of subscriptions (on the order of millions), how can we efficiently match a large number of incoming documents (thousands per second) against all subscriptions?

Challenges
- Scalability and load balancing
- Support for enhanced subscription capabilities
- Automatic resource (RSS) discovery and efficient crawling
- Improved service (a longer history of matches, ranking)
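
The inverted index from the Internal Representations slide (queries Q1 to Q3 indexed by their terms) can be sketched roughly as follows. This is only an illustration: the function names `build_query_index` and `candidate_queries` are hypothetical, terms are lowercased and split on whitespace as an assumption, and plain strings stand in for the numeric QIDs and TIDs mentioned on the slide.

```python
from collections import defaultdict

# Example subscriptions taken from the slide.
QUERIES = {
    "Q1": "New York Yankees",
    "Q2": "Yankees Red Sox",
    "Q3": "Boston Red Sox",
}


def build_query_index(queries):
    """Map each term to the sorted list of query IDs that contain it."""
    index = defaultdict(set)
    for qid, text in queries.items():
        # Assumption: lowercase, whitespace-separated tokens.
        for term in text.lower().split():
            index[term].add(qid)
    return {term: sorted(qids) for term, qids in index.items()}


def candidate_queries(index, document):
    """Return IDs of queries that share at least one term with the document."""
    candidates = set()
    for term in set(document.lower().split()):
        candidates.update(index.get(term, ()))
    return sorted(candidates)


index = build_query_index(QUERIES)
print(index["yankees"])
print(candidate_queries(index, "Boston wins again"))
```

Only queries that share a term with the document are ever touched, which is the sense in which the index "reduces the number of queries examined".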
Datasets and Experimental Evaluations
- Subscriptions: Query logs from excite.com
- Documents: Crawled & parsed web pages
- Evaluation: Throughput with various numbers of subscriptions

A Primitive Matching Algorithm (AND only)
For each TID in the document
- Find queries that contain TID (using inverted index)
- Maintain a counter (for each query returned)
There is a match if (counter == query size)

A Clustering Approach
- Queries usually have common terms and some are contained by others
- If a query is already evaluated on a document, contained queries can be answered very efficiently

Opt 1: Exploiting Term Frequencies and Position Information
- Assign TIDs based on frequencies
- Sort terms in the queries by TIDs
- Sort terms in incoming document by TIDs
- For each TID in the document
    - If (TID pos==0) counter=1
    - Else if (TID pos==counter) counter++
Advantage:
- Fewer counters maintained in the accumulators
- Smaller hash table

Opt 3: Partitioning the Queries
- Create multiple smaller inverted indexes
- Repeat the matching algorithm
Advantage: Better locality (in the processor cache)

Opt 2: Use of Bloom Filters
- Bloom Filter: A probabilistic, space-efficient method for membership queries
- For each new item, set the corresponding bit to 1
- False negatives are guaranteed not to occur
Advantage: Reduced cost of maintaining the accumulators

A Greedy Clustering Algorithm
- Create (artificial) super queries
- Create inverted index only for super queries
- Now maintain bit vectors instead of counters in the accumulators
- Evaluate the corresponding cluster of contained queries for any matched super query
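
The counter-based steps of the Primitive Matching Algorithm (AND only) slide can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_index` and `match_and_only` are hypothetical names, terms are lowercased whitespace tokens rather than numeric TIDs, and a dictionary plays the role of the accumulator.

```python
from collections import defaultdict


def build_index(queries):
    """Return (term -> query IDs, query ID -> number of distinct terms)."""
    index = defaultdict(list)
    sizes = {}
    for qid, text in queries.items():
        terms = set(text.lower().split())  # assumption: simple tokenization
        sizes[qid] = len(terms)
        for term in terms:
            index[term].append(qid)
    return index, sizes


def match_and_only(index, sizes, document):
    """Return IDs of queries whose terms all appear in the document."""
    counters = defaultdict(int)  # accumulator: one counter per touched query
    # Each distinct document term is counted once per query that contains it.
    for term in set(document.lower().split()):
        for qid in index.get(term, ()):
            counters[qid] += 1
    # A query matches when its counter equals its size.
    return sorted(qid for qid, count in counters.items() if count == sizes[qid])


queries = {
    "Q1": "New York Yankees",
    "Q2": "Yankees Red Sox",
    "Q3": "Boston Red Sox",
}
index, sizes = build_index(queries)
print(match_and_only(index, sizes, "Red Sox win in Boston"))
```

In this sketch a counter is created for every query that shares any term with the document, including queries that end up not matching (Q2 above). That accumulator cost is what the optimizations on the later slides aim to reduce.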