Evaluating Top-k Queries over Web-Accessible Databases
Nicolas Bruno, Luis Gravano, Amélie Marian
Columbia University



2/27/2002

"Top-k" Queries Natural in Many Scenarios
Example: NYC restaurant recommendation service.
Goal: find the best restaurants for a user:
  Close to the address "2290 Broadway".
  Price around $25.
  Good rating.
Query: specification of flexible preferences.
Answer: best k objects for a distance function.

Attributes Often Handled by External Sources
MapQuest returns the distance between two addresses.
NYTimes Review gives the price range of a restaurant.
Zagat gives a food rating to the restaurant.

"Top-k" Query Processing Challenges
Attributes are handled by external sources (e.g., MapQuest distance).
External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
Existing algorithms do not handle all types of interfaces.

Processing Top-k Queries over Web-Accessible Data Sources
Data and query model.
Algorithms for sources with different interfaces.
Our new algorithm: Upper.
Experimental results.

Data Model
Top-k query: an assignment of weights and target values to attributes.
Target values: preferred price, close to address, preferred rating.
Weights: for example, price as the most important attribute.
Weights and target values are combined in a scoring function.
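The slide does not spell out the scoring function. A minimal sketch, assuming it is a weighted average of per-attribute scores (the form used in the deck's example, with weights 3, 2, 1 normalized by 6). The function name and the attribute keys are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted average of per-attribute scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[a] * scores[a] for a in weights) / total_weight


# Weights taken from the deck's example formula; the attribute scores are made up.
weights = {"zagat": 3, "mq": 2, "nyt": 1}
example = combined_score({"zagat": 0.8, "mq": 1.0, "nyt": 1.0}, weights)
print(round(example, 2))
```

With every score in [0, 1] and positive weights, the combined score also stays in [0, 1], which is what makes threshold-style bounds possible later on.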

Sorted-Access Source S
Returns objects sorted by score for a given query.
Example: Zagat.
Interface: GetNext_S.
S-Source access time: tS(S).

Random-Access Source R
Returns the score of a given object for a given query.
Example: MapQuest.
Interface: GetScore_R.
R-Source access time: tR(R).
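The two interfaces can be pictured with a small in-memory stand-in. This is only a sketch: the class names, the `elapsed` bookkeeping, and the sample scores are assumptions for illustration, and real sources would be remote web services.

```python
class SortedSource:
    """Stand-in for an S-Source: hands out objects in descending score order."""

    def __init__(self, scores, access_time=1.0):
        self._ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        self._pos = 0
        self.access_time = access_time  # plays the role of tS(S)
        self.elapsed = 0.0

    def get_next(self):
        """GetNext: return the next (object, score) pair, or None when exhausted."""
        self.elapsed += self.access_time  # every call is charged, even the last one
        if self._pos >= len(self._ranked):
            return None
        item = self._ranked[self._pos]
        self._pos += 1
        return item


class RandomSource:
    """Stand-in for an R-Source: returns the score of a given object."""

    def __init__(self, scores, access_time=1.0):
        self._scores = dict(scores)
        self.access_time = access_time  # plays the role of tR(R)
        self.elapsed = 0.0

    def get_score(self, obj):
        """GetScore: look up one object's score and charge one access."""
        self.elapsed += self.access_time
        return self._scores[obj]


zagat = SortedSource({"o1": 0.9, "o2": 0.8})
mapquest = RandomSource({"o1": 0.1, "o2": 0.7}, access_time=2.0)
first = zagat.get_next()
distance_score = mapquest.get_score(first[0])
```

An R-Source cannot enumerate objects on its own, which is why the query model needs at least one source with sorted access to discover candidates.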

Query Model
Attribute scores lie between 0 and 1.
Sources are accessed sequentially.
Score ties are broken arbitrarily.
No wild guesses.
One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)

Query Processing Goals
Process top-k queries over R-Sources.
Return the exact answer to a top-k query q.
Minimize query response time.
The naïve solution (access all sources for all objects) is too expensive.

Example: NYC Restaurants
S-Source: Zagat, with restaurants sorted by food rating.
R-Sources:
  MapQuest: distance between two input addresses. User address: "2290 Broadway".
  NYTimes Review: price range of the input restaurant. Target value: $25.

TA Algorithm for SR-Sources
Perform sorted access sequentially to all SR-Sources.
Completely probe every object found for all attributes using random access.
Keep the best k objects.
Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold).
Source: Fagin, Lotem, and Naor (PODS 2001).
Does NOT handle R-Sources.

Our Adaptation of TA Algorithm for R-Sources: TA-Adapt
Perform sorted access to S-Source S.
Probe every R-Source Ri for each newly found object.
Keep the best k objects.
Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold).
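The four steps of TA-Adapt can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes a weighted-average scoring function, models the sources as in-memory Python structures, and uses invented scores. The name `ta_adapt` and its parameters are hypothetical.

```python
import heapq


def ta_adapt(sorted_scores, random_sources, weights, k, t_s=1, t_r=None):
    """sorted_scores: (object, score) pairs in descending order (the S-Source).
    random_sources: list of dicts object -> score (the R-Sources).
    weights: [w_S, w_R1, ..., w_Rn] for an assumed weighted-average score."""
    t_r = t_r or [1] * len(random_sources)
    total_w = sum(weights)
    top = []  # min-heap of (final_score, object), never larger than k
    elapsed = 0
    for obj, s in sorted_scores:
        elapsed += t_s  # one sorted access
        final = weights[0] * s
        for w, src, t in zip(weights[1:], random_sources, t_r):
            final += w * src[obj]  # probe every R-Source for the new object
            elapsed += t
        final /= total_w
        heapq.heappush(top, (final, obj))
        if len(top) > k:
            heapq.heappop(top)  # drop the weakest candidate
        # Unseen objects score at most s at S and at most 1 at each R-Source.
        threshold = (weights[0] * s + sum(weights[1:])) / total_w
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), elapsed


# Invented data for illustration only.
zagat = [("o1", 0.9), ("o2", 0.8), ("o3", 0.5), ("o4", 0.4)]
mq = {"o1": 0.1, "o2": 0.7, "o3": 0.3, "o4": 0.2}
nyt = {"o1": 0.1, "o2": 0.9, "o3": 0.2, "o4": 0.9}
result, cost = ta_adapt(zagat, [mq, nyt], weights=[3, 2, 1], k=1)
```

On this data the loop stops after the third object, so o4 is never retrieved, but every object that was retrieved has been fully probed.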

An Example Execution of TA-Adapt
Setting: tS(S) = tR(R1) = tR(R2) = 1, k = 1, weights w as in the final-score formula.
Final Score = (3 · score_Zagat + 2 · score_MQ + 1 · score_NYT) / 6
Table columns: Object, S (Zagat), R1 (MQ), R2 (NYT), Final Score (most cell and threshold values not preserved).
Initial threshold = 1. Sequence of accesses:
  GetNext_S(q) returns o1, then GetScore_R1(q, o1), then GetScore_R2(q, o1).
  GetNext_S(q) returns o2 (Zagat score 0.8; threshold = 0.9), then GetScore_R1(q, o2), then GetScore_R2(q, o2).
  GetNext_S(q) returns o3, then GetScore_R1(q, o3), then GetScore_R2(q, o3).
Total execution time = 9.
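The surviving numbers in this example are consistent with a threshold computed by giving unseen objects the last score seen at the S-Source and a perfect score of 1 at each R-Source. A small check, assuming the weights 3, 2, 1 from the final-score formula (the function name is hypothetical):

```python
def threshold(last_sorted_score, w_s=3, r_weights=(2, 1)):
    """Highest score an unseen object could still reach (assumed weighted average)."""
    return (w_s * last_sorted_score + sum(r_weights)) / (w_s + sum(r_weights))


# A Zagat score of 0.8 gives (3 * 0.8 + 2 + 1) / 6, roughly 0.9.
after_o2 = threshold(0.8)
# A threshold of 0.95, as in the Upper example, corresponds to a Zagat score of 0.9.
after_o1 = threshold(0.9)
print(round(after_o2, 2), round(after_o1, 2))
```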

Improvements over TA-Adapt
Add a shortcut test after each random-access probe (TA-Opt).
Exploit techniques for processing selections with expensive predicates (TA-EP):
  Reorder accesses to R-Sources.
  Best weight/time ratio.
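Both improvements can be sketched together. This is one possible reading, not the authors' implementation: the shortcut is taken to mean "stop probing an object once its upper bound falls below the current k-th best score", and the reordering sorts R-Sources by decreasing weight/time ratio. All names and the sample values are hypothetical.

```python
def probe_order(r_weights, r_times):
    """Indices of the R-Sources, sorted by decreasing weight/time ratio."""
    return sorted(range(len(r_weights)),
                  key=lambda i: r_weights[i] / r_times[i], reverse=True)


def probe_with_shortcut(s_score, w_s, r_weights, r_times, get_score, kth_best):
    """Probe one object; return (final score or None if discarded, time spent)."""
    total_w = w_s + sum(r_weights)
    known = w_s * s_score      # weighted score mass already known
    unknown = sum(r_weights)   # weight of attributes not probed yet
    elapsed = 0
    for i in probe_order(r_weights, r_times):
        known += r_weights[i] * get_score(i)
        unknown -= r_weights[i]
        elapsed += r_times[i]
        # Shortcut test: unprobed attributes are optimistically assumed to score 1.
        if (known + unknown) / total_w < kth_best:
            return None, elapsed
    return known / total_w, elapsed


# Invented scores: the object does badly on the first probe and is dropped early.
made_up_scores = [0.0, 1.0]
score, cost = probe_with_shortcut(0.5, 3, [2, 1], [1, 1],
                                  lambda i: made_up_scores[i], kth_best=0.8)
```

Probing high-weight, cheap sources first tends to lower an object's upper bound quickly, so the shortcut fires earlier and fewer expensive probes are paid for.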

The Upper Algorithm
Selects a pair (object, source) to probe next.
Based on the property: the object with the highest upper bound will be probed before the top-k solution is reached.
Two cases: the object is one of the top-k objects, or the object is not one of the top-k objects.

An Example Execution of Upper
Setting: tS(S) = tR(R1) = tR(R2) = 1, k = 1, weights w as in the final-score formula.
Final Score = (3 · score_Zagat + 2 · score_MQ + 1 · score_NYT) / 6
Table columns: Object, Upper Bound, S (Zagat), R1 (MQ), R2 (NYT), Final Score (cell values not preserved).
Initial threshold = 1. Sequence of accesses:
  GetNext_S(q) returns o1; threshold = 0.95.
  GetScore_R1(q, o1); threshold = 0.95.
  GetNext_S(q) returns o2.
  GetScore_R1(q, o2).
  GetNext_S(q) returns o3.
  GetScore_R2(q, o2).
Total execution time: value not preserved.

The Upper Algorithm
Choose the object with the highest upper bound.
If some unseen object can have a higher upper bound:
  Access S-Source S.
Else:
  Access the best R-Source Ri for the chosen object.
Keep the best k objects.
If the top-k objects have final values higher than the maximum possible value of any other object, return the top-k objects.
Interleaves accesses on objects.

Selecting the Best Source
Upper relies on expected values to make its choices.
Upper computes the "best subset" of sources that is expected to:
  1. Compute the final score for the top k objects.
  2. Discard other objects as fast as possible.
Upper chooses the best source in the "best subset": best weight/time ratio.

Experimental Setting: Synthetic Data
Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
tR(Ri): integer between 1 and 10.
tS(S) ∈ {0.1, 0.2, …, 1.0}.
Query execution time: t_total.
Default: k = 50, [number missing] objects, uniform data.
Results: average t_total of 100 queries.
Optimal assumes complete knowledge (unrealistic, but a useful performance bound).
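A setup along these lines could be generated as below. This is only a sketch: the slide does not give the Gaussian parameters or the correlation scheme, so the mean 0.5 and standard deviation 0.15 (clipped to [0, 1]) are assumptions, the correlated data set is omitted, and the function name is hypothetical.

```python
import random


def make_synthetic(n_objects, n_r_sources, dist="uniform", seed=0):
    """Return (scores, t_s, t_r): per-object score lists and source access times."""
    rng = random.Random(seed)

    def draw():
        if dist == "uniform":
            return rng.random()
        # Assumed Gaussian parameters, clipped so scores stay in [0, 1].
        return min(1.0, max(0.0, rng.gauss(0.5, 0.15)))

    # One score for the S-Source followed by one per R-Source.
    scores = {f"o{i}": [draw() for _ in range(n_r_sources + 1)]
              for i in range(n_objects)}
    t_r = [rng.randint(1, 10) for _ in range(n_r_sources)]  # integer in 1..10
    t_s = rng.randint(1, 10) / 10                           # one of 0.1, 0.2, ..., 1.0
    return scores, t_s, t_r


scores, t_s, t_r = make_synthetic(n_objects=100, n_r_sources=3, dist="gaussian", seed=1)
```

Fixing the seed makes a run repeatable, which helps when averaging query times over many generated queries.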

Experiments: Varying Number of Objects Requested k

Experiments: Varying Number of Database Objects N

Experimental Setting: Real Web Data
S-Source: Verizon Yellow Pages (sorted by distance).
R-Sources:
  Subway Navigator: subway time.
  AltaVista: popularity.
  MapQuest: driving time.
  NYTimes Review: food and price ratings.
  Zagat: food, service, décor, and price ratings.

Experiments: Real-Web Data (number of random accesses)

Evaluation Conclusions
TA-EP and TA-Opt are much faster than TA-Adapt.
Upper is significantly better than all versions of TA.
Upper is close to optimal.
Real-data experiments: Upper is faster than the TA adaptations.

Conclusion
Introduced the first algorithm for top-k processing over R-Sources.
Adapted TA to this scenario.
Presented new algorithms: Upper and Pick (see paper).
Evaluated our new algorithms with both real and synthetic data.
Upper is close to optimal.

Current and Future Work
Relaxation of the source model:
  The current source model is limited.
  Any number of R-Sources and SR-Sources.
  Upper has good results even with only SR-Sources.
Parallelism:
  Define a query model for parallel access to sources.
  Adapt our algorithms to this model.
Approximate queries.

References
Top-k queries:
  Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999.
TA algorithm:
  Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001.
Variations of TA:
  Query Processing Issues on Image (Multimedia) Databases, S. Nepal and V. Ramakrishna. ICDE 1999.
  Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000.
Expensive predicates:
  Predicate Migration: Optimizing Queries with Expensive Predicates, J. M. Hellerstein and M. Stonebraker. SIGMOD 1993.

Real-web Experiments

Real-web Experiments with Adaptive Time

Relaxing the Source Model (figure series: Upper, TA-EP)

Upcoming Journal Paper
Variations of Upper.
Select best source.
Data structures.
Complexity analysis.
Relaxing the source model.
Adaptation of our algorithms.
New algorithms.
Variations of the data and query model to handle real web data.

Optimality
TA is instance optimal over:
  Algorithms that do not make wild guesses.
  Databases that satisfy the distinctness property.
TA_Z is instance optimal over:
  Algorithms that do not make wild guesses.
No complexity analysis of our algorithms, but an experimental evaluation instead.