Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.


Outline
• Problem Statement
• Skyline Computation on Vertically Partitioned Datasets using Balke's Algorithm
• Algorithms for Top-k Query Processing
• FM Sketches
• Putting Everything Together


A Motivating Example
• Consider a database containing information about hotels. The y-dimension represents the price of a room, whereas the x-dimension captures the distance of the room from the beach.
[Figure: hotel rooms plotted by distance (x-axis) and price (y-axis). Legend: skyline objects, hotel rooms, point p, dominance region of p, borders of p's dominance region.]

Skyline Preliminaries [ICDE, 2001]
• Skylines are a useful tool in numerous disciplines, such as multidimensional decision making and data mining.
• Given a set of d-dimensional objects p_1, …, p_N, the skyline operator retrieves all objects that are not dominated by any other object in the set.
• An object p_i dominates another object p_j if it is not worse than p_j in all dimensions and better than it in at least one dimension.
• Properties: the top-1 tuple according to any monotone preference function that assigns scores to tuples is in the skyline. Conversely, for any skyline tuple, there exists a monotone preference function according to which it is the top-1.
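The dominance test and the skyline operator defined above can be sketched in a few lines. This is a minimal quadratic-time illustration, not the method of the cited paper; it assumes that smaller values are better in every dimension, and the hotel tuples (distance, price) are made-up sample data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is not worse than q in all dimensions and better in at least one.

    Assumption: smaller is better in every dimension.
    """
    not_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return not_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (naive O(N^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (distance to beach, price).
hotels = [(1.0, 9.0), (2.0, 6.0), (3.0, 7.0), (4.0, 3.0), (5.0, 5.0), (6.0, 2.0)]
print(skyline(hotels))
```

Here (3, 7) is dominated by (2, 6) and (5, 5) is dominated by (4, 3), so the remaining four hotels form the skyline.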

4 Common Data Distributions

Problem Definition
• Compute the skyline when the dataset is vertically decomposed among a set of N servers.
• Goal: minimize the data that must be retrieved from each server.
• We assume wireless environments, where communication overhead is the dominant factor in battery consumption.
• Mobile phone applications are real-world examples.

First Observations
[Figure panels: (a) subspace D_1 at server N_1; (b) subspace D_2 at server N_2.]
• The global skyline may contain points that do not appear in the local skylines.
• Instead of transmitting all records over the network, the servers can withhold points that are guaranteed to be globally dominated by an anchor point.


Balke's Algorithm [EDBT, 2004]
• Assume that the d-dimensional database is vertically partitioned into d lists, one for each dimension, assigned to different servers. The lists contain values in ascending order.
• Idea: perform sorted accesses on the d lists in a round-robin manner until some point p (the anchor) has been reached in every list.
• Points that have not appeared in any list by that moment can be safely pruned, since they are dominated by the anchor.

Example
• Consider a 2-dimensional database with the following two lists (the values were not preserved in this transcript):
  L1 points, in order: a, b, d, m, g, c, …
  L2 points, in order: c, d, e, k, a, b, …
• Round-robin sorted access stops at the first point retrieved from both lists, which is d for the orders shown.
• Points that have not appeared in either list by then cannot be part of the skyline.
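The anchor search in the example can be sketched as follows. This is an illustration under stated assumptions, not Balke's published procedure: the numeric values and the tail of each list beyond the six points shown are hypothetical, values are assumed distinct within a list (so ties with the anchor need no special handling), and `find_anchor` is a name introduced here.

```python
from typing import Dict, List, Set, Tuple

SortedList = List[Tuple[str, int]]  # (point id, value), ascending by value


def find_anchor(lists: List[SortedList]) -> Tuple[str, Set[str]]:
    """Round-robin sorted access until one point has been seen in every list.

    Returns the anchor and the set of all points seen so far.
    """
    seen_in: Dict[str, Set[int]] = {}
    max_len = max(len(lst) for lst in lists)
    for depth in range(max_len):
        for i, lst in enumerate(lists):
            if depth >= len(lst):
                continue
            pid, _value = lst[depth]          # one sorted access on list i
            seen_in.setdefault(pid, set()).add(i)
            if len(seen_in[pid]) == len(lists):
                return pid, set(seen_in)
    raise ValueError("no point appears in every list")


# Point orders follow the example; values and the last two entries are made up.
L1 = [("a", 1), ("b", 2), ("d", 3), ("m", 4), ("g", 5), ("c", 6), ("e", 7), ("k", 8)]
L2 = [("c", 1), ("d", 2), ("e", 3), ("k", 4), ("a", 5), ("b", 6), ("m", 7), ("g", 8)]

anchor, seen = find_anchor([L1, L2])
all_points = {pid for pid, _ in L1}
pruned = all_points - seen                    # never seen, hence dominated by the anchor
print(anchor, sorted(seen), sorted(pruned))
```

Every pruned point lies after the anchor in both lists, so under the distinct-values assumption the anchor is strictly better in every dimension.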

Further Improvement
• Efficiency can be improved if, instead of visiting the lists in a round-robin manner, we access the most promising list and use random accesses.
• As a result, each list is expanded only as far as necessary.
[Figure: lists L1 and L2, each with point P marked; annotation: avoid visiting these points.]


Setting
• Let N_1, …, N_m be m servers storing the same dataset DB.
• For each record P ∈ DB, every server N_i maintains a local score s_i(P) and sorts all records in decreasing order of their local scores.
• A client wishes to obtain the k records of DB with the maximum global score s.
• The score is computed using a monotonic function f on the local scores, i.e., s(P) = f(s_1(P), …, s_m(P)).
• Goal: minimize the required number of accesses.

Fagin's Algorithm (FA) [PODS, 2001]
• The servers are accessed in a round-robin manner: on each sorted access, server N_i sends the client its next record and the record's local score.
• When the first record P_anc has been encountered at all servers, the client terminates the sorted accesses.
• It then obtains the missing local scores of the other encountered records through random accesses.
• The candidate with the highest global score is the top-1 result.
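The steps above can be sketched for the top-1 case. This is a simplified in-memory illustration, assuming each server is a dictionary from record ID to local score, that all servers hold the same records, and that f is a sum; the function name and the sample scores are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Server = Dict[str, int]  # record id -> local score


def fagin_top1(servers: List[Server],
               f: Callable[[Iterable[int]], int] = sum) -> Tuple[str, int]:
    """Top-1 by FA: sorted accesses until one record is seen at every server."""
    ranked = [sorted(s, key=s.get, reverse=True) for s in servers]
    seen_at: Dict[str, Set[int]] = {}
    depth = 0
    while not any(len(v) == len(servers) for v in seen_at.values()):
        for i, order in enumerate(ranked):
            seen_at.setdefault(order[depth], set()).add(i)   # sorted access
        depth += 1
    # Random accesses: look up the missing local scores of every seen record.
    scores = {pid: f(s[pid] for s in servers) for pid in seen_at}
    best = max(scores, key=scores.get)
    return best, scores[best]


servers = [
    {"a": 9, "b": 8, "c": 3, "d": 1},
    {"c": 9, "b": 7, "a": 2, "d": 1},
]
print(fagin_top1(servers))
```

In this sample, record b is the first to appear at both servers (at depth 2), and the candidates a, b and c are then scored with random accesses.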

Threshold Algorithm (TA) [PODS, 2001]
• It uses an upper bound τ_TA on the global score of any unseen record in order to terminate earlier than FA.
• The client retrieves the local scores of newly encountered records with random accesses at the remaining servers, computes their global scores, and keeps the best score s_best.
• The threshold τ_TA is obtained by applying f to the local thresholds, i.e., the local scores last seen under sorted access at each server; when f is a sum, τ_TA is the sum of the local thresholds.
• As long as τ_TA > s_best, TA continues the sorted accesses and keeps updating τ_TA.
• Eventually, the top-1 record is returned.
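A minimal top-1 sketch of the threshold test follows. It makes the same assumptions as a simple in-memory model (dictionary servers holding the same records, f as a sum); the function name, the returned depth, and the sample scores are illustrative only.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Server = Dict[str, int]  # record id -> local score


def threshold_top1(servers: List[Server],
                   f: Callable[[Iterable[int]], int] = sum
                   ) -> Tuple[Optional[str], float, int]:
    """Top-1 by TA. Returns (record, global score, sorted-access depth used)."""
    ranked = [sorted(s, key=s.get, reverse=True) for s in servers]
    seen: Set[str] = set()
    best_id, best_score = None, float("-inf")
    n = len(ranked[0])
    for depth in range(n):
        last_seen = []                        # local thresholds of this round
        for order, server in zip(ranked, servers):
            pid = order[depth]                # sorted access
            last_seen.append(server[pid])
            if pid not in seen:
                seen.add(pid)
                score = f(s[pid] for s in servers)   # random accesses
                if score > best_score:
                    best_id, best_score = pid, score
        tau = f(last_seen)                    # upper bound for unseen records
        if tau <= best_score:
            return best_id, best_score, depth + 1
    return best_id, best_score, n


servers = [
    {"a": 10, "b": 6, "c": 5, "d": 1},
    {"c": 8, "d": 7, "a": 6, "b": 1},
]
print(threshold_top1(servers))
```

On this sample the threshold drops to 6 + 7 = 13 after the second round, below the best score 16, so TA stops at depth 2; FA would need a third round before any record had been seen at both servers.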

Example for FA and TA

Best Position Algorithm (BPA) [VLDB, 2007]
• It further improves TA by using a tighter threshold.
• Let bp_i be the greatest position at server N_i such that all records up to bp_i have been encountered through sorted or random accesses.
• The global threshold τ_BP is computed from the local scores at the positions bp_i; when f is a sum, it is the sum of those local scores.
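The best-position threshold can be sketched as below. This is a simplified top-1 illustration and not the published algorithm in full: it assumes dictionary servers holding the same records, f as a sum, and that a random access also reveals the record's position in that server's list (modelled by the `pos` lookup). Positions are 0-indexed here, and all names and scores are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Server = Dict[str, int]  # record id -> local score


def best_position_top1(servers: List[Server],
                       f: Callable[[Iterable[int]], int] = sum
                       ) -> Tuple[Optional[str], float, int]:
    """Top-1 with a best-position threshold. Returns (record, score, depth)."""
    ranked = [sorted(s, key=s.get, reverse=True) for s in servers]
    pos = [{pid: r for r, pid in enumerate(order)} for order in ranked]
    seen_pos: List[Set[int]] = [set() for _ in servers]
    scored: Set[str] = set()
    best_id, best_score = None, float("-inf")
    n = len(ranked[0])
    for depth in range(n):
        for order in ranked:
            pid = order[depth]                       # sorted access
            if pid in scored:
                continue
            scored.add(pid)
            for j in range(len(servers)):            # random accesses mark positions
                seen_pos[j].add(pos[j][pid])
            score = f(s[pid] for s in servers)
            if score > best_score:
                best_id, best_score = pid, score
        bounds = []
        for j, order in enumerate(ranked):
            bp = 0                                   # position 0 is seen after one round
            while bp + 1 < n and (bp + 1) in seen_pos[j]:
                bp += 1
            bounds.append(servers[j][order[bp]])     # local score at best position
        if f(bounds) <= best_score:
            return best_id, best_score, depth + 1
    return best_id, best_score, n


servers = [
    {"a": 10, "b": 8, "c": 2},
    {"b": 10, "a": 9, "c": 3},
]
print(best_position_top1(servers))
```

After one round the random accesses have already revealed the second position of each list, so the bound is 8 + 9 = 17, which is below the best score 19. The TA threshold after the same round would still be 10 + 10 = 20, forcing another round.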


Flajolet/Martin sketches [JCSS '85]
• Goal: estimate the number of distinct objects in a set from a small-space representation.
• The sketch of a union of sets is the bitwise OR of their sketches.
• Insertion order and duplicates don't matter!
Prerequisite: let h be a random binary hash function.
Sketch of an item: for each unique item with ID x, for each integer 1 ≤ i ≤ k in turn, compute h(x, i); stop when h(x, i) = 1 and set bit i.

Flajolet/Martin sketches (cont.)
Estimating COUNT: take the sketch of a set of N items and let j be the position of the leftmost zero in the sketch. Then j is an estimator of log_2(0.77 N).
Fixable drawbacks:
• The estimate has a slight bias.
• The variance of the estimate is large.
[Figure: example sketch S_1 with j = 3; best guess: COUNT ≈ 11.]

Flajolet/Martin sketches (cont.)
• Standard variance reduction methods apply.
• Compute m independent sketches in parallel.
• Compute m independent estimates of N.
• Take the mean of the estimates.
• There are provable tradeoffs between m and the variance of the estimator.
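The sketch construction, the OR-based union, and the averaged estimate can be put together as follows. This is a sketch under assumptions: the binary hash h is simulated with SHA-256 (any well-mixing hash would do), bits are indexed from 0 so that the "leftmost zero" is the lowest unset bit, the sketch width K = 32 is arbitrary, and all function names are introduced here for illustration.

```python
import hashlib
from typing import List

K = 32  # bits per sketch (assumed width)


def h(x: str, i: int, seed: int) -> int:
    """Simulated random binary hash h(x, i); the seed selects an independent sketch."""
    digest = hashlib.sha256(f"{seed}:{x}:{i}".encode()).digest()
    return digest[0] & 1


def item_sketch(x: str, seed: int = 0) -> int:
    """Set the bit at the first index i where h(x, i) = 1."""
    for i in range(K):
        if h(x, i, seed) == 1:
            return 1 << i
    return 0  # no bit set; happens with probability 2**-K


def set_sketch(items: List[str], seed: int = 0) -> int:
    """Sketch of a set: bitwise OR of the item sketches."""
    sk = 0
    for x in items:
        sk |= item_sketch(x, seed)
    return sk


def leftmost_zero(sk: int) -> int:
    """Index j of the lowest unset bit."""
    j = 0
    while sk & (1 << j):
        j += 1
    return j


def estimate_count(items: List[str], m: int = 16) -> float:
    """Mean of m independent estimates 2**j / 0.77."""
    estimates = [2 ** leftmost_zero(set_sketch(items, seed)) / 0.77
                 for seed in range(m)]
    return sum(estimates) / m


A = [f"sensor-{n}" for n in range(200)]
B = [f"sensor-{n}" for n in range(150, 400)]
print(estimate_count(A + B))  # rough estimate of the 400 distinct IDs
```

Because the union is a bitwise OR, `set_sketch(A + B)` equals `set_sketch(A) | set_sketch(B)`, and the 50 IDs that occur in both lists do not change the result.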

Application to COUNT in Sensor Databases
• Each sensor computes k independent sketches of itself (using its unique ID x); that is, each sensor computes a sketch of its value.
• Use a robust routing algorithm to route the sketches up to the sink.
• Aggregate the k sketches via union (OR) en route.
• The sink then estimates the count.
[Figure: sensors S_1 to S_4 forwarding sketches towards the sink; partial unions such as S_1 ∪ S_2 ∪ S_3 are formed along the way.]


Problem Characteristics
• Each vertical decomposition has arbitrary dimensionality, contrary to Balke's setting.
• Anchor selection largely determines the total amount of transmitted data.
• VPS adopts sorting on the local dominance. In particular, the local dominance count dom_i(P) of a point P with respect to subspace D_i is the number of points dominated by P in D_i.
• Balke selects as the anchor the data point P with the maximal dom_SUM(P).
• We use a tighter upper bound for dom(P): dom_MIN(P), the minimum among all local dominance counts.
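The relationship between the local counts and the global dominance count can be checked on a small example. This is an illustrative sketch: the five 4-dimensional points, the split into two 2-dimensional subspaces, and the function names are all hypothetical, smaller values are assumed better, and values are distinct in every dimension so that global dominance implies local dominance in each subspace.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point, dims: Sequence[int]) -> bool:
    """p dominates q when restricted to the dimensions in dims (smaller is better)."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def dom_count(p: Point, points: List[Point], dims: Sequence[int]) -> int:
    """Number of points that p dominates in the given dimensions."""
    return sum(dominates(p, q, dims) for q in points)


def dominance_stats(points: Dict[str, Point],
                    subspaces: List[Sequence[int]]) -> Dict[str, Dict[str, int]]:
    """Per point: local counts summed (dom_sum), their minimum (dom_min), global dom."""
    all_dims = [d for dims in subspaces for d in dims]
    values = list(points.values())
    stats = {}
    for name, p in points.items():
        local = [dom_count(p, values, dims) for dims in subspaces]
        stats[name] = {
            "dom_sum": sum(local),
            "dom_min": min(local),
            "dom": dom_count(p, values, all_dims),
        }
    return stats


points = {
    "A": (1, 2, 5, 6),
    "B": (2, 1, 1, 2),
    "C": (3, 3, 3, 3),
    "D": (4, 5, 4, 4),
    "E": (5, 4, 6, 5),
}
subspaces = [(0, 1), (2, 3)]  # D_1 at server N_1, D_2 at server N_2

for name, s in dominance_stats(points, subspaces).items():
    print(name, s)
```

Point A has dom_sum = 3 yet dominates nothing globally, while its dom_min = 0 already reveals that; in general dom(P) ≤ dom_MIN(P) ≤ dom_SUM(P) under the distinct-values assumption.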

Anchor Selection
[Figure panels: (a) subspace D_1 at server N_1; (b) subspace D_2 at server N_2.]
• C: optimal anchor point
• A: has maximal dom_MIN
• B: has maximal dom_SUM

Our algorithm on the previous example

1st Optimization: Multiple Anchor Points
• The previous algorithm performs pruning with a single anchor P_anc. Specifically, a point P that is locally dominated by P_anc in all subspaces is not sent to the client.
• On the other hand, if P is incomparable with P_anc in even a single subspace D_i, it will be transmitted by the corresponding server N_i.
• We suggest that multiple anchor points can often achieve more effective pruning.
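The pruning rule can be sketched as a predicate over a set of anchors. This is an illustration of the rule as stated on the slide, not the authors' implementation: a point is withheld if some single anchor locally dominates it in every subspace. The points, the subspace split, and the function names are hypothetical, and smaller values are assumed better.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point, dims: Sequence[int]) -> bool:
    """p dominates q when restricted to the dimensions in dims (smaller is better)."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def is_pruned(p: Point, anchors: List[Point],
              subspaces: List[Sequence[int]]) -> bool:
    """True if one anchor locally dominates p in all subspaces."""
    return any(all(dominates(a, p, dims) for dims in subspaces)
               for a in anchors)


subspaces = [(0, 1), (2, 3)]
X = (1, 5, 1, 5)          # first anchor
Y = (5, 1, 5, 1)          # second anchor
Q1 = (2, 6, 2, 6)         # dominated by X in both subspaces
Q2 = (6, 2, 6, 2)         # dominated by Y in both subspaces
Q3 = (2, 6, 6, 2)         # dominated by X in D_1 and by Y in D_2 only

for q in (Q1, Q2, Q3):
    print(q, is_pruned(q, [X], subspaces), is_pruned(q, [X, Y], subspaces))
```

With X alone only Q1 is pruned; adding Y also prunes Q2. Q3 is still transmitted, because no single anchor dominates it in every subspace, and indeed neither anchor dominates it globally.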

Pruning with 2 points

2nd Optimization: Integration of Sketches
• So far, we have estimated the (expected) global dominance count dom(P) of a point P using dom_MIN(P).
• This approach is biased towards points that have high local dominance counts in all subspaces but dominate few records globally (such as A).
• Thus, we propose an unbiased approach that directly estimates the global dominance counts using sketches, which approximately count the number of distinct objects.
• We assume that each server N_i has a local dominance sketch sk_i(P) for every point P, which aggregates all points that P dominates locally in D_i.

Experiments

Thank you!