EFFICIENT COMPUTATION OF DIVERSE QUERY RESULTS Presenting: Karina Koifman Course : DB Seminar.

Presentation transcript:

Example

Yahoo! Autos

Maybe a better retrieval

Introduction
• The article addresses the problem of efficiently computing diverse query results in online shopping applications.

The Goal
• The goal of diverse query answering is to return a representative set of top-k answers from all the tuples that satisfy the user's selection condition.

The Problem
• Users issue a query for a product.
• Only the most relevant answers are shown.
• Many duplicates appear.

Agenda
• Existing solutions
• Definition of diversity
• Impossibility results for diversity
• Query processing technique

Existing Solutions
Existing solutions are inefficient or do not work in all situations. Example:
• Obtain all the query results and then pick a diverse subset from them. This does not scale to large data sets.

Existing Solutions
• Web search engines: first retrieve c × k results and then pick a diverse subset from these.
• This is more efficient than the previous method.
• When there are many duplicates, as in product sales, it is still inefficient and does not guarantee diversity.

Existing Solutions
• Issuing multiple queries to obtain diverse results.

Pros/Cons
• The good:
  • Diversity
• The bad:
  • Hurts performance
  • Empty results (for example, there are no Honda Accord convertibles)

Agenda
• Existing solutions
• Definition of diversity
• Impossibility results for diversity
• Query processing technique

Diversity Ordering
• A diversity ordering of a relation R with attributes A, denoted by ≺, is a total ordering of the attributes in A.
• Example: Make ≺ Model ≺ Color ≺ Year ≺ Description ≺ Id

The DB example
Id  Make    Model    Color   Year  Description
1   Honda   Civic    Green   2007  Low miles
2   Honda   Civic    Blue    2007  Low miles
3   Honda   Civic    Red     2007  Low miles
4   Honda   Civic    Black   2007  Low miles
5   Honda   Civic    Black   2006  Low miles
6   Honda   Accord   Blue    2007  Best Price
7   Honda   Accord   Red     2006  Good miles
8   Honda   Odyssey  Green   2007  Rare
9   Honda   Odyssey  Green   2006  Good miles
10  Honda   CRV      Red     2007  Fun Car
11  Honda   CRV      Orange  2006  Good miles
12  Toyota  Prius    Tan     2007  Low miles
13  Toyota  Corolla  Black   2007  Low miles
14  Toyota  Tercel   Blue    2007  Low miles
15  Toyota  Camry    Blue    2007  Low miles

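Similarity – SIM(X,Y)
Id  Make    Model  Color  Year  Description
1   Honda   Civic  Green  2007  Low miles
2   Honda   Civic  Blue   2007  Low miles

Id  Make    Model  Color  Year  Description
12  Toyota  Prius  Tan    2007  Low miles
1   Honda   Civic  Green  2007  Low miles

Find a result set that minimizes the total similarity SIM(X,Y), summed over all pairs X, Y in the set.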

Example - Similarity
Id  Make    Model    Color  Year  Description
1   Honda   Civic    Green  2007  Low miles
6   Honda   Accord   Blue   2007  Best Price
8   Honda   Odyssey  Green  2007  Rare

Id  Make    Model    Color  Year  Description
1   Honda   Civic    Green  2007  Low miles
2   Honda   Civic    Blue   2007  Low miles
12  Toyota  Prius    Tan    2007  Low miles
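The comparison between the two result sets above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: SIM(X,Y) is taken to be 1 when two tuples share the same Make (the first attribute in the diversity ordering) and 0 otherwise, and the set-level score counts each unordered pair once. The function names `sim` and `total_similarity` are hypothetical, not taken from the article.

```python
from itertools import combinations

def sim(x, y, attribute="make"):
    """Assumed similarity: 1 if both tuples agree on the attribute, else 0."""
    return 1 if x[attribute] == y[attribute] else 0

def total_similarity(result_set, attribute="make"):
    """Sum the assumed similarity over every unordered pair in the set."""
    return sum(sim(x, y, attribute) for x, y in combinations(result_set, 2))

# First result set from the slide: ids 1, 6, 8 (all Hondas).
all_honda = [
    {"id": 1, "make": "Honda", "model": "Civic"},
    {"id": 6, "make": "Honda", "model": "Accord"},
    {"id": 8, "make": "Honda", "model": "Odyssey"},
]

# Second result set from the slide: ids 1, 2, 12 (two Hondas, one Toyota).
mixed_makes = [
    {"id": 1, "make": "Honda", "model": "Civic"},
    {"id": 2, "make": "Honda", "model": "Civic"},
    {"id": 12, "make": "Toyota", "model": "Prius"},
]

print(total_similarity(all_honda))    # 3: every pair shares the make
print(total_similarity(mixed_makes))  # 1: only the two Hondas share the make
```

Under this assumption the second set scores lower at the Make level, which is why it counts as the more diverse of the two.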

Prefix
Id  Make   Model    Color  Year  Description
1   Honda  Civic    Green  2007  Low miles

Id  Make   Model    Color  Year  Description
2   Honda  Civic    Blue   2007  Low miles

Id  Make   Model    Color  Year  Description
8   Honda  Odyssey  Green  2007  Rare
9   Honda  Odyssey  Green  2006  Good miles

Few more definitions
• RES(R,Q) of size k
• Given relation R and query Q, let maxval =

Agenda
• Existing solutions
• Definition of diversity
• Impossibility results for diversity
• Query processing technique

Impossibility Results
• Intuition: the IR score of an item depends only on the item and possibly on statistics from the entire corpus, but diversity depends on the other items in the query result set.

Inverted Lists
Query: Honda cars
Honda: d1, d4, d8, d10, d17
Car:   d4, d10, d11, d17, d20
Merged inverted list: d4, d10, d17
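The merge on this slide is an intersection of two sorted posting lists. A minimal sketch, assuming document ids are integers kept in ascending order (the function name `intersect` is illustrative):

```python
def intersect(list_a, list_b):
    """Intersect two ascending lists of document ids with a two-pointer scan."""
    i = j = 0
    merged = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            merged.append(list_a[i])  # id appears in both lists
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return merged

honda = [1, 4, 8, 10, 17]   # d1, d4, d8, d10, d17
car = [4, 10, 11, 17, 20]   # d4, d10, d11, d17, d20
print(intersect(honda, car))  # [4, 10, 17]
```

Each list is scanned once, so the cost is linear in the combined length of the two lists.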

Impossibility Results
• Each item in an inverted list has a score, which can be either a global score (e.g., PageRank) or a value/keyword-dependent score (e.g., TF-IDF).
• The items in each list are usually ordered by score, so that top-k queries can be handled.
• If we assume a monotonic scoring function f(), which is a normal assumption for a traditional IR system, then the article proves that the result is either not diverse or too inefficient or infeasible to compute.

Agenda
• Existing solutions
• Definition of diversity
• Impossibility results for diversity
• Query processing technique

The car indexing example

One-pass Algorithm
Let's say Q looks for descriptions containing ‘Low’, with k = 3.
Example entry: Honda.Civic.Green.2007.‘Low miles’
• Initialization: pick the first k results.
• While we can improve diversity: go to the next option and check whether it is better; if so, prune.

One-pass Algorithm
We start with two Civics; since we then need only one more result, we pick the next Civic.

One-pass Algorithm
Then we look for another candidate at the next level (Accord). There is none, because no Accord has ‘Low’ in its description, and no other entry at that level matches either.

One-pass Algorithm
Then we look for another candidate at the next level (Make) and prune. The result set is now maximally diverse, so we stop here.
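As a cross-check on the walk-through, here is a naive reference in Python. It is not the one-pass algorithm: it materialises every matching tuple first (the approach the talk says does not scale) and then spreads the k picks as evenly as possible over the distinct values of each attribute, following the diversity ordering. The round-robin quota rule and the name `diverse_subset` are assumptions made for this sketch.

```python
# Rows copied from the talk's example table: id, make, model, color, year, description.
CARS = [
    (1, "Honda", "Civic", "Green", 2007, "Low miles"),
    (2, "Honda", "Civic", "Blue", 2007, "Low miles"),
    (3, "Honda", "Civic", "Red", 2007, "Low miles"),
    (4, "Honda", "Civic", "Black", 2007, "Low miles"),
    (5, "Honda", "Civic", "Black", 2006, "Low miles"),
    (6, "Honda", "Accord", "Blue", 2007, "Best Price"),
    (7, "Honda", "Accord", "Red", 2006, "Good miles"),
    (8, "Honda", "Odyssey", "Green", 2007, "Rare"),
    (9, "Honda", "Odyssey", "Green", 2006, "Good miles"),
    (10, "Honda", "CRV", "Red", 2007, "Fun Car"),
    (11, "Honda", "CRV", "Orange", 2006, "Good miles"),
    (12, "Toyota", "Prius", "Tan", 2007, "Low miles"),
    (13, "Toyota", "Corolla", "Black", 2007, "Low miles"),
    (14, "Toyota", "Tercel", "Blue", 2007, "Low miles"),
    (15, "Toyota", "Camry", "Blue", 2007, "Low miles"),
]
FIELDS = ("id", "make", "model", "color", "year", "description")
ROWS = [dict(zip(FIELDS, car)) for car in CARS]

# Diversity ordering from the talk (Id is last and unique, so it is omitted).
ORDERING = ("make", "model", "color", "year", "description")

def diverse_subset(rows, k, level=0):
    """Pick k rows, balancing the picks across values of each attribute in turn."""
    if k >= len(rows) or level == len(ORDERING):
        return rows[:k]
    groups = {}
    for row in rows:
        groups.setdefault(row[ORDERING[level]], []).append(row)
    # Round-robin: hand out one slot per group until k slots are assigned.
    quota = {value: 0 for value in groups}
    remaining = k
    while remaining > 0:
        for value, members in groups.items():
            if remaining > 0 and quota[value] < len(members):
                quota[value] += 1
                remaining -= 1
    picked = []
    for value, members in groups.items():
        if quota[value] > 0:
            picked.extend(diverse_subset(members, quota[value], level + 1))
    return picked

matches = [row for row in ROWS if "Low" in row["description"]]
result = diverse_subset(matches, 3)
print([row["id"] for row in result])  # [1, 2, 12]
```

For the query on ‘Low’ with k = 3 this yields two Civics of different colors plus one Toyota, the same set shown as the second table on the similarity example slide. Ties are broken by table order, so other equally diverse sets exist.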

One-pass Algorithm
If we had a Ford, we would continue.
Ford Focus 0 Black Low miles 0

Scored One-pass Algorithm
Give each car a score. The query then takes a parameter, minScore, which is the smallest score in the result set. Choose the next ID as the smallest ID such that score(id) >= root.minScore. The algorithm then proceeds as before.
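The selection rule on this slide can be sketched as a small helper. This is a simplified illustration, not the article's implementation: it assumes ids are scanned in ascending order and that the search resumes after the current id. The scores are made-up numbers, and `next_candidate` is a hypothetical name.

```python
def next_candidate(scores, current_id, min_score):
    """Return the smallest id after current_id whose score is at least min_score."""
    for item_id in sorted(scores):
        if item_id > current_id and scores[item_id] >= min_score:
            return item_id
    return None  # no remaining item can enter the result set

# Hypothetical scores keyed by id (not from the talk).
scores = {1: 0.9, 2: 0.2, 3: 0.4, 4: 0.7, 5: 0.1}

print(next_candidate(scores, 1, 0.5))  # 4: ids 2 and 3 score below 0.5
print(next_candidate(scores, 4, 0.5))  # None: id 5 scores below 0.5
```

Skipping ids whose score is below minScore is what lets the scored variant avoid visiting items that could never displace anything in the current result set.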

Probing Algorithm
Main idea: go over all the cars as if they were laid out on an axis.
(Figure panels: K=1, K=2, K=3)

Advantage of bidirectional exploring
• “Honda” has only one child (Civic), so we find it quickly without exploring every option.
• Each time we add a node to the diverse solution, we do not have to prune it later, unlike in the OnePass algorithm.

WAND algorithm
• WAND is an efficient method of obtaining top-k lists of scored results without explicitly merging the full inverted lists.
• AND(X1,X2,...,Xk) ≡ WAND(X1,1,X2,1,...,Xk,1,k)
• OR(X1,X2,...,Xk) ≡ WAND(X1,1,X2,1,...,Xk,1,1)
• To obtain the k best results, the operator uses upper bounds on each term's maximum contribution and a temporary threshold θ: WAND(X1,UB1,X2,UB2,...,Xk,UBk,θ)
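The two equivalences above follow from the definition of the WAND (weak AND) predicate: it is true when the weights of the variables that are true sum to at least the threshold. A minimal sketch of the predicate alone, leaving out the iterator machinery a real WAND implementation needs (the helper names are illustrative):

```python
from itertools import product

def wand(pairs, threshold):
    """True when the weights of the true variables sum to at least threshold."""
    return sum(weight for present, weight in pairs if present) >= threshold

def as_and(xs):
    """AND(X1..Xk) written as WAND with unit weights and threshold k."""
    return wand([(x, 1) for x in xs], len(xs))

def as_or(xs):
    """OR(X1..Xk) written as WAND with unit weights and threshold 1."""
    return wand([(x, 1) for x in xs], 1)

# Check both equivalences over every assignment of three Boolean variables.
for xs in product([False, True], repeat=3):
    assert as_and(xs) == all(xs)
    assert as_or(xs) == any(xs)

# With upper bounds as weights, a document passes only if the terms it
# contains could together reach the current threshold.
print(wand([(True, 0.5), (False, 2.0), (True, 1.0)], 1.2))  # True: 0.5 + 1.0 >= 1.2
print(wand([(True, 0.5), (False, 2.0), (False, 1.0)], 1.2))  # False: 0.5 < 1.2
```

Raising the threshold as better results are found makes the predicate stricter, which is how documents that cannot reach the top k get skipped.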

Scored Probing Algorithm
We use the WAND algorithm to obtain the top-k list.
The next step is to mark all nodes that could possibly be added as MIDDLE.
We also maintain a heap to find the node with the minimum child.
At each step we move nodes from tentative to useful.

Experiments
• MultQ: rewriting the query as multiple queries and merging their results.
• Naïve: retrieving all the results of a query.
• Basic: just the first k answers, without diversity.
• OnePass, Probe: the article's algorithms.
• U = unscored, S = scored.

Experiments

Conclusions
• Formalized diversity in structured search and proposed inverted-list algorithms.
• The experiments showed that the algorithms are scalable and efficient.
• In particular, diversity can be implemented with little additional overhead compared to traditional approaches.

Extension of the algorithm
• Assign higher weights to Hondas and Toyotas than to Teslas, so that the diverse results contain more Hondas and Toyotas.

Questions? Thank You !