Efficient Computation of Diverse Query Results
Erik Vee, joint work with Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer Yahia

1 Efficient Computation of Diverse Query Results Erik Vee joint work with Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer Yahia

2 Motivation Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks

3 Motivation
Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
… or looking for cars on Yahoo! Autos, and seeing only Hondas

4 Motivation
Imagine looking for shoes on Yahoo! Shopping, and seeing only Reeboks
… or looking for cars on Yahoo! Autos, and seeing only Hondas
… or looking for jobs on Yahoo! Hotjobs, and seeing only jobs from Yahoo!
It is not enough to simply give the best response
– Need diversity of answers

5 Diversity Search
If we display 30 results in 5 categories, then we should show 6 items from each category
– NB: Our goal is to show a range of choices, not a representative sample
– Recurse on each subgroup of items
Diversity is crucial for users looking for a range of results
– e.g. shopping, information gathering/research
Useful for aiding navigation
– Users tend to favor search-and-click over hierarchies
Likely to give at least one good answer on the first page
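The even split described above (30 results over 5 categories gives 6 each) can be sketched in a few lines; `allocate` is a hypothetical helper, not code from the paper:

```python
def allocate(k, categories):
    """Split k result slots as evenly as possible among sibling
    categories; any remainder goes to the first few categories
    (any tie-breaking rule would do here)."""
    n = len(categories)
    base, extra = divmod(k, n)
    return {c: base + (1 if i < extra else 0)
            for i, c in enumerate(categories)}
```

The same split is then applied recursively inside each category, e.g. `allocate(6, models_of_honda)` within the Honda subtree.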

6 Contributions
Formally define diversity search
– Other diversity-like approaches use extensive post-processing or are not query-dependent
Proved that traditional IR engines cannot produce guaranteed diverse results
Gave novel algorithms to produce diverse results
– Both one-pass (data-streaming) and probing algorithms
Experimentally verified that these results are nearly as fast as normal top-k processing
– Much faster than post-processing techniques

7 What about other approaches?
If not diverse enough, query again
– E.g. if all results are from one company, issue another query
– Bad for latency
Issue multiple queries (one for Honda, one for Toyota, ...)
– Can be prohibitively expensive (kills throughput), though latency is fine
– Some applications may have dozens of top-level categories
Fetch extra results, then find the most diverse set from this
– Not guaranteed to get good results
– Requires fetching additional results unnecessarily
Fetch all results, then find a diverse set
– Many times slower
Random sample of results
– Misses important results this way

8 What about clever scoring?
Can we give each item a global “diversity” score, then find the top-k using this?
– Proved in the paper: there is no global score that gives guaranteed diversity
Can we give each item a local “diversity” score, so that it has a different score in each list of the inverted index?
– Proved in the paper: there is no list-based scoring of the item that gives guaranteed diversity

9 Outline
Definition of diversity
Overview of our algorithms
Our experimental results

10 Diversity search
Over all possible sets of top-k results that match the query, return the set with the most diversity
The paper defines diversity more precisely
– Focus on the hierarchy view of diversity (in the next slides)
For scored diversity (in which each item has a score)
– Over all possible sets of top-k results with maximum score, return the set with the highest diversity
– Note: diversity is only useful when the score is not too fine-grained

11 Diversity definition (by picture)
Determine a category ordering: Make, Model, Color, Year, Text
This implicitly defines a hierarchy

12 Hierarchy after a query
Diversity search always returns valid results
E.g. query text contains `Low`

13 Hierarchy after a query
Diversity search always returns valid results
E.g. query text contains `Low`
All siblings return the same number of results (or as close as possible)

14 Returning top-k diverse results
Diversity search always returns valid results
E.g. query text contains `Low`
Suppose we return k=4 results
Must return 2 Hondas and 2 Toyotas
Will not return 2 green Civics
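The slide's example (k=4 gives 2 Hondas and 2 Toyotas, never 2 green Civics) can be mimicked by a naive in-memory recursion. This is only an illustrative stand-in for the paper's index-based algorithms, with the tree representation (nested dicts with item lists at the leaves) assumed; it also re-picks subtrees each round, which is fine for a sketch but not efficient:

```python
def pick_diverse(k, node):
    """Return up to k items, balancing counts across sibling
    subtrees at every level of the hierarchy."""
    if isinstance(node, list):        # leaf: a bucket of items
        return node[:k]
    children = list(node.values())
    taken = [0] * len(children)
    exhausted = [False] * len(children)
    total = 0
    # Hand out one slot per child per round, so siblings stay
    # balanced even when some subtrees run out of items early.
    while total < k and not all(exhausted):
        for i, child in enumerate(children):
            if total >= k or exhausted[i]:
                continue
            if len(pick_diverse(taken[i] + 1, child)) > taken[i]:
                taken[i] += 1
                total += 1
            else:
                exhausted[i] = True
    out = []
    for i, child in enumerate(children):
        out.extend(pick_diverse(taken[i], child))
    return out
```

With two makes, each having two models, asking for k=4 yields one item per model; it would return a second green Civic only if nothing else were left in the data.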

15 Outline
Definition of diversity
Overview of our algorithms
Our experimental results

16 Algorithms
One Pass
– Never goes backward (just one pass over the dataset)
– Maintains a top-k diverse set based on what has been seen
– Jumps ahead if more results will not help diversity
– Optimal one-pass algorithm
Probe
– May jump forward or backward (i.e. probes)
– Proved: at most 2k probes for the top-k diverse result set
Both also work for scored diversity

17 Dewey IDs
Every branch gets a number
Every item is then labeled with the concatenation of branch numbers along its path, e.g. the ID for Honda Odyssey Green ’06 `Good miles’
Create an inverted index
low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
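A toy reconstruction of building such an index; the `(dewey_id, keywords)` item layout is assumed for illustration, not taken from the paper:

```python
from collections import defaultdict

def build_index(items):
    """items: iterable of (dewey_id, keywords) pairs. Returns a
    keyword -> posting-list map, with each posting list holding
    Dewey IDs in sorted (i.e. hierarchy) order."""
    index = defaultdict(list)
    for dewey, keywords in items:
        for kw in keywords:
            index[kw].append(dewey)
    for postings in index.values():
        postings.sort()
    return dict(index)
```

Because Dewey IDs sort in hierarchy order, all items under one subtree (e.g. all Honda Civics, prefix `00`) form a contiguous run in every posting list.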

18 Next and Prev
Supports two basic operations: Next and Prev
E.g. query text contains `Low`
Next(…) = …  Prev(…) = …
The inverted index for `Low` lists all items in Dewey ID order
In general, we must find the intersection of lists (still easy)
low → 00000, 00010, 00100, 00200, 00300, 00310, 10000, 11000, 12000, 13000
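One plausible implementation of Next and Prev over a sorted posting list uses binary search. Here Next returns the smallest ID at or after its argument and Prev the largest ID at or before it; the paper's exact conventions may differ:

```python
from bisect import bisect_left, bisect_right

def next_id(postings, dewey):
    """Smallest ID in the sorted posting list >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None

def prev_id(postings, dewey):
    """Largest ID in the sorted posting list <= dewey, or None."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None
```

Jumping past an entire subtree (as One Pass does below) then amounts to calling `next_id` with a key just beyond the subtree's prefix range.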

19 One pass (for k = 2)
First finds 00000, …
Now knows Civic Green no longer helps
Jumps by calling next(…)

20 One pass (for k = 2)
First finds 00000, …
Now knows Civic Green no longer helps! Jumps by calling next(…)
Finds …, removes …
Now knows Civic no longer helps! Jumps by calling next(…)

21 One pass (for k = 2)
First finds 00000, …
Now knows Civic Green no longer helps! Jumps by calling next(…)
Finds …, removes …
Now knows Civic no longer helps! Jumps by calling next(…)
Finds …, removes …
Knows to stop
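The walkthrough above can be imitated by a greedy heuristic: keep at most k IDs, and swap a newcomer in whenever it is less redundant, measured by longest common prefix of Dewey IDs, than the most crowded pair already kept. This is only a stand-in, not the paper's optimal one-pass algorithm (in particular it scans rather than jumps), and it assumes k >= 2:

```python
def lcp(a, b):
    """Length of the longest common prefix of two Dewey IDs."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def one_pass_diverse(stream, k):
    """Greedy sketch of one-pass diverse top-k (assumes k >= 2)."""
    kept = []
    for new in stream:
        if len(kept) < k:
            kept.append(new)
            continue
        # Find the least diverse (most similar) pair currently kept.
        i, j = max(((a, b) for a in range(k) for b in range(a + 1, k)),
                   key=lambda p: lcp(kept[p[0]], kept[p[1]]))
        worst = lcp(kept[i], kept[j])
        # Swap the newcomer in only if it is strictly less redundant.
        if max(lcp(new, x) for x in kept) < worst:
            kept.pop(j)
            kept.append(new)
    return sorted(kept)
```

On the posting list from the earlier slide, with k = 2, the two Civic-prefixed IDs are successively evicted and one Honda plus one Toyota survive, matching the slide's outcome.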

22 Probe (for k = 4)
Calls next(…) and prev(…) to find the first and last items
Wants another Honda
Calls prev(0…)
Discovers there are only 2 top-level categories

23 Probe (for k = 4)
Calls next(…) and prev(…) to find the first and last items
Wants another Honda
Calls prev(0…)
Why not next(…)? If Honda has only one child, then it will return a Toyota!

24 Probe (for k = 4)
Calls next(…) and prev(…) to find the first and last items
Wants another Honda
Calls prev(0…)
Finds …
Wants another Toyota
Calls next(…)

25 Probe (for k = 4)
Calls next(…) and prev(…) to find the first and last items
Wants another Honda
Calls prev(0…)
Finds …
Wants another Toyota
Calls next(…)
Finds 10000

26 Outline
Definition of diversity
Overview of our algorithms
Our experimental results

27 Results
Dataset consisted of listings from Yahoo! Autos
Queries were synthetic, to test various parameters
– Selectivity, # predicates, # results
Preprocessing time for 100K listings < 5 min
– Times shown are for 5K queries
4 algorithms
– Basic: no diversity
– Naïve: fetch everything, post-process
– OnePass: our algorithm; takes just one pass over the data
– Probe: our algorithm; may make multiple probes into the data

28 Comparable time for diversity search
(Slide shows two running-time charts: unscored and scored)
Basic: no diversity
Naïve: many times slower
OnePass: close to Probe
Probe: within a factor of 2 of no diversity
MultiQuery (not shown): latency close to Basic, but throughput many times worse

29 Results summary
Getting diverse results is not too much slower than getting non-diverse results
– Many times faster than naïve approaches
The multi-query approach has even worse throughput than the naïve one
– But keeps latency low
How does this compare to getting extra results, then finding a diverse subset?
– Getting 2k results instead of k is about twice as slow
– Plus, it does not guarantee diverse results

30 Conclusions
Can get guaranteed diversity, taking time close to a normal top-k query
– Almost as fast as, or faster than, non-guaranteed results
– Diversity at every level
Works even when items have scores
Needs a different algorithm than traditional IR engines
– Proved this in the paper (under standard notions)
Are there approximate notions that can use existing IR machinery?
