The Skyline Query in Databases
Which Objects are the Most Important?
Donghui Zhang
College of Computer and Information Science
Northeastern University
The Skyline of Boston
Buildings in the skyline are not “dominated”, i.e. not both shorter and farther away than another building.
2019/7/1 Tian Xia - Northeastern University
The Skyline of NBA Players
NBA statistics data*: 19,112 records, 1946-2004, 17 attributes. A piece of data in 2004:

Name               Points  Rebounds  Assists  Steals  ……
Tracy McGrady        2003       484      448     135
Kobe Bryant          1819       392      398      86
Shaquille O'Neal     1669       760      200      36
Yao Ming             1465       669       61      34
Dwyane Wade          1854       397      520     121
Steve Nash           1165       249      861      74

Who are the best players? Best = not dominated by any other player.
* www.databaseBasketball.com
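As a minimal sketch of "best = not dominated", the snippet below checks the six rows shown on the slide, restricted to the four visible attributes and assuming that larger is better on each of them. The helper names `dominates` and `dominated_by` are illustrative, not taken from the talk.

```python
# Rows copied from the slide: (points, rebounds, assists, steals); larger is better.
players = {
    "Tracy McGrady": (2003, 484, 448, 135),
    "Kobe Bryant": (1819, 392, 398, 86),
    "Shaquille O'Neal": (1669, 760, 200, 36),
    "Yao Ming": (1465, 669, 61, 34),
    "Dwyane Wade": (1854, 397, 520, 121),
    "Steve Nash": (1165, 249, 861, 74),
}


def dominates(a, b):
    """a dominates b: a is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominated_by(name):
    """Players that dominate `name`; an empty list means `name` is in the skyline."""
    return [other for other, stats in players.items()
            if other != name and dominates(stats, players[name])]


best = [name for name in players if not dominated_by(name)]
print(best)
print(dominated_by("Kobe Bryant"))
```

On these four attributes only, McGrady, O'Neal, Wade and Nash remain; with all 17 attributes of the real data set the answer could differ.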
The Skyline of Hotels
Example: suppose a student wants to find a hotel near the ICFP’07 conference hotel. Which are the best choices?

Hotels in Germany (the smaller, the better!):

Hotel  Price  Distance
t1         3         2
t2         4         7
t3         9         5
t4         4         6
t5         2         3
t6         6         1
t7         1         4

[Figure: the hotels plotted with Price on the x axis and Distance on the y axis.]
Skyline objects: t1, t5, t6, t7. t2, t3 and t4 are dominated.
Skyline Query Applications
Find best NBA players: (#points, #rebounds), or any other subset of the 17 dimensions.
Find best hotels: (price, distance to conference hotel).
Find best researchers: (#pubs in POPL, PLDI, ICFP, SIGCOMM, SIGMOD).
Any table in an RDBMS has a list of records with multiple attributes, so ……
How to Find a Skyline?
For every object o: compare o with all other objects, and report o if it is not dominated by any of them.
Complexity: O(n^2).
Problem: in a large database, O(n^2) is inefficient!
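The nested-loop procedure can be sketched as follows. This is an in-memory illustration only, run on the seven two-attribute hotel rows from the hotel example, where smaller is better; the function names are illustrative.

```python
def dominates(a, b):
    """a dominates b when a is no worse in every attribute and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_skyline(objects):
    """Compare every object with all others: O(n^2) dominance checks."""
    skyline = []
    for name, o in objects.items():
        if not any(dominates(p, o) for other, p in objects.items() if other != name):
            skyline.append(name)
    return skyline


# Hotel rows t1..t7 from the hotel example (two attributes, smaller is better).
hotels = {
    "t1": (3, 2), "t2": (4, 7), "t3": (9, 5), "t4": (4, 6),
    "t5": (2, 3), "t6": (6, 1), "t7": (1, 4),
}
print(naive_skyline(hotels))
```

Every object is compared with every other object, which is exactly the quadratic behavior the slide warns about.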
Why is an O(n^2) algorithm inefficient?
The data set is large and stored on disk.
Each disk access: seek time, rotational delay, transfer time.
[Figure: disk components: spindle, platters, tracks, sectors, disk head, arm assembly, arm movement.]
Why is an O(n^2) algorithm inefficient?
The data set is large and stored on disk.
Sample scenario: disk page size 8KB; database size 1GB = 131,072 disk pages; let each disk I/O take 1 ms.
O(n): 131 seconds ≈ 2 minutes. (Not efficient!)
O(n^2): about 200 days! (Out of the question!)
Find the nearest hospital…
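The numbers in the sample scenario can be rechecked with a few lines of arithmetic. The constants are the ones stated on the slide; n is taken to be the number of disk pages and each page access is assumed to cost exactly one I/O.

```python
PAGE_SIZE = 8 * 1024          # 8KB disk page, in bytes
DB_SIZE = 1024 ** 3           # 1GB database, in bytes
IO_MS = 1                     # assumed cost of one disk I/O, in milliseconds

pages = DB_SIZE // PAGE_SIZE                          # number of disk pages
linear_seconds = pages * IO_MS / 1000                 # one pass over the data
quadratic_days = pages ** 2 * IO_MS / 1000 / 86400    # one I/O per pair of pages

print(pages, linear_seconds, quadratic_days)
```

The quadratic figure comes out just under 199 days, which the slide rounds to 200.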
Content
Motivation and definition of skyline
Branch-and-Bound Skyline (BBS)
Compressed Skycube (CSC)
BBS: [PTFS03], SIGMOD. CSC: [XZ06], SIGMOD.
Branch-and-Bound Skyline (BBS)
Use an R-tree to index the objects.
Find the nearest neighbor (NN) to the origin. This is a skyline object.
Prune the part of the search space that this object dominates.
Repeat: find the NN in the unpruned space.
R-Tree Motivation
Range query: find the objects in a given range. E.g. find all hotels in Boston.
No index: scan through all objects. NOT EFFICIENT!
[Figure: objects a through m plotted on x and y axes with ticks from 2 to 10.]
R-Tree: Clustering by Proximity
Range Query
[Figure: objects a through m on the x-y plane, grouped into R-tree nodes E1 through E7 under the root; the leaf level holds a b c d e f g h i j k l m.]
Branch-and-Bound Skyline (BBS)
Assume all points are indexed in an R-tree.
mindist(MBR) = the L1 distance between the lower-left corner of the minimum bounding rectangle (MBR) and the origin.
Branch-and-Bound Skyline (BBS)
Each heap entry keeps the mindist of its MBR.
Example of BBS
Process entries in ascending order of their mindists.
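A rough sketch of the BBS loop follows. It does not use a real R-tree: the hand-built two-level tree, the `Entry` class and the grouping of the hotel points into three leaves are all assumptions made for illustration. Only the ideas named on the slides are kept: a heap ordered by mindist, where mindist is the L1 distance of the lower-left corner to the origin, and pruning of entries whose lower-left corner is dominated by a skyline point already found.

```python
import heapq
from itertools import count


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


class Entry:
    """A data point (children is None) or a node; `corner` is the MBR lower-left corner."""

    def __init__(self, corner, name=None, children=None):
        self.corner = corner
        self.name = name
        self.children = children


def point(name, coords):
    return Entry(tuple(coords), name=name)


def node(children):
    dims = len(children[0].corner)
    corner = tuple(min(c.corner[d] for c in children) for d in range(dims))
    return Entry(corner, children=children)


def mindist(entry):
    """L1 distance between the lower-left corner and the origin."""
    return sum(entry.corner)


def bbs(root):
    skyline = []
    tie = count()  # tie-breaker so the heap never compares Entry objects
    heap = [(mindist(root), next(tie), root)]
    while heap:
        _, _, e = heapq.heappop(heap)
        if any(dominates(s.corner, e.corner) for s in skyline):
            continue  # everything inside e is dominated: prune
        if e.children is None:
            skyline.append(e)  # an undominated point popped in mindist order
        else:
            for child in e.children:
                if not any(dominates(s.corner, child.corner) for s in skyline):
                    heapq.heappush(heap, (mindist(child), next(tie), child))
    return [s.name for s in skyline]


# Hypothetical grouping of the hotel points into three leaf nodes.
root = node([
    node([point("t7", (1, 4)), point("t5", (2, 3)), point("t1", (3, 2))]),
    node([point("t2", (4, 7)), point("t4", (4, 6))]),
    node([point("t6", (6, 1)), point("t3", (9, 5))]),
])
print(sorted(bbs(root)))
```

In this toy tree the node holding t2 and t4 is discarded without being opened, because its lower-left corner (4, 6) is dominated by t1.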
Subspace Skyline Queries
Hotels have many attributes, e.g. price, distance, star rating, food quality, facility, location, transportation, …
Users may ask skyline queries on any subset of the attributes, depending on their interests.
Subspace skylines can be very different!

Objects of 4 dimensions:

      u1  u2  u3  u4
t1     3   4   2   5
t2     4   6   7   2
t3     9   7   5   6
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1

[Figure: two plots of t1 through t7, one showing the skyline in (u1, u3) and one showing the skyline in (u2, u3).]
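To see how different two subspace skylines can be, the sketch below projects the seven 4-dimensional objects onto a chosen subset of dimensions and runs a plain nested-loop skyline on the projection, assuming smaller is better in every dimension. The function name `subspace_skyline` is illustrative.

```python
DIMS = ("u1", "u2", "u3", "u4")

objects = {
    "t1": (3, 4, 2, 5), "t2": (4, 6, 7, 2), "t3": (9, 7, 5, 6), "t4": (4, 3, 6, 1),
    "t5": (2, 2, 3, 1), "t6": (6, 1, 1, 3), "t7": (1, 3, 4, 1),
}


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def subspace_skyline(objects, subspace):
    """Skyline of the objects projected onto the named dimensions."""
    idx = [DIMS.index(d) for d in subspace]
    proj = {name: tuple(row[i] for i in idx) for name, row in objects.items()}
    return [n for n, p in proj.items()
            if not any(dominates(q, p) for m, q in proj.items() if m != n)]


print(subspace_skyline(objects, ("u1", "u3")))
print(subspace_skyline(objects, ("u2", "u3")))
```

In (u1, u3) four objects survive, while in (u2, u3) the single object t6 dominates everything else.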
Our Problem
How can we support arbitrary subspace skyline queries in dynamic, frequently updated databases?
Problem settings:
Online systems: the database server receives multiple concurrent skyline queries on arbitrary, unpredictable subspaces.
Frequently updated databases: the data are also constantly changing. E.g., in an online hotel-booking system, room prices change with availability.
Straightforward Solutions
On-the-fly computation: slow in query processing.
  Compute the results from scratch.
  Process the whole dataset for each query.
Pre-compute and store all subspace skylines: high update costs.
  Expensive to correctly maintain all results.
  Waste of storage.
The Complete Pre-computation: the Skycube

      u1  u2  u3  u4
t1     3   4   2   5
t2     4   6   7   2
t3     9   7   5   6
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1
t8     6   5   3   8
t9     2   2   3   7

Skycube:

Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7

Contains many duplicates, e.g. t6 appears 12 times.
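A brute-force way to build the complete pre-computation is to run a skyline over every non-empty subset of the dimensions. The sketch below does this for the nine objects on the slide, assuming smaller is better, and counts how often t6 is stored; it is meant to show the redundancy, not to be an efficient skycube algorithm.

```python
from itertools import combinations

DIMS = ("u1", "u2", "u3", "u4")

objects = {
    "t1": (3, 4, 2, 5), "t2": (4, 6, 7, 2), "t3": (9, 7, 5, 6),
    "t4": (4, 3, 6, 1), "t5": (2, 2, 3, 1), "t6": (6, 1, 1, 3),
    "t7": (1, 3, 4, 1), "t8": (6, 5, 3, 8), "t9": (2, 2, 3, 7),
}


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(objects, subspace):
    idx = [DIMS.index(d) for d in subspace]
    proj = {n: tuple(row[i] for i in idx) for n, row in objects.items()}
    return [n for n, p in proj.items()
            if not any(dominates(q, p) for m, q in proj.items() if m != n)]


def skycube(objects):
    """Skyline of every non-empty subspace: 2^d - 1 skylines for d dimensions."""
    return {sub: skyline(objects, sub)
            for k in range(1, len(DIMS) + 1)
            for sub in combinations(DIMS, k)}


cube = skycube(objects)
t6_count = sum("t6" in sky for sky in cube.values())
stored = sum(len(sky) for sky in cube.values())
print(len(cube), t6_count, stored)
```

For four dimensions there are 15 subspaces; t6 is stored in 12 of them, and the cube holds 39 object references in total for only 9 objects.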
Our Solution: the Compressed Skycube
The Compressed Skycube achieves both fast query response and efficient update support.
New storage model: represent all skylines in a very concise way, by preserving only the essential information of subspace skylines.
New query processing algorithm: efficiently answer arbitrary subspace skyline queries without accessing the whole dataset.
New object-aware update scheme: avoid unnecessary data access and subspace skyline computation upon updates.
Minimum Subspace (mss)
Object t6 appears in the skylines of 12 subspaces. The number of minimum subspaces of t6 is only 2.

Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7

Minimum subspaces:

Object  Minimum subspaces
t4      {u4}
t9      {u1, u2}, {u1, u3}
t7      {u1}, {u4}
t1      {u1, u3}
t5      {u4}, {u1, u2}, {u1, u3}
t6      {u2}, {u3}
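The step from the skycube to minimum subspaces can be sketched directly from the table: a subspace U is a minimum subspace of t when t is in the skyline of U but in the skyline of no proper subset of U. The snippet takes the skycube from the slide as a literal; the function names `minimum_subspaces` and `compress` are illustrative, and this is not the construction algorithm of the CSC paper.

```python
# Skycube copied from the slide: subspace -> skyline objects.
skycube = {
    ("u1",): ["t7"],
    ("u2",): ["t6"],
    ("u3",): ["t6"],
    ("u4",): ["t4", "t5", "t7"],
    ("u1", "u2"): ["t5", "t6", "t7", "t9"],
    ("u1", "u3"): ["t1", "t5", "t6", "t7", "t9"],
    ("u1", "u4"): ["t7"],
    ("u2", "u3"): ["t6"],
    ("u2", "u4"): ["t5", "t6"],
    ("u3", "u4"): ["t5", "t6"],
    ("u1", "u2", "u3"): ["t1", "t5", "t6", "t7", "t9"],
    ("u1", "u2", "u4"): ["t5", "t6", "t7"],
    ("u1", "u3", "u4"): ["t1", "t5", "t6", "t7"],
    ("u2", "u3", "u4"): ["t5", "t6"],
    ("u1", "u2", "u3", "u4"): ["t1", "t5", "t6", "t7"],
}


def minimum_subspaces(cube):
    """For each object, keep only the subspaces that have no proper subset in its list."""
    member_of = {}
    for subspace, sky in cube.items():
        for t in sky:
            member_of.setdefault(t, []).append(frozenset(subspace))
    return {t: [u for u in subs if not any(v < u for v in subs)]
            for t, subs in member_of.items()}


def compress(cube):
    """Store each object only in its minimum subspaces."""
    csc = {}
    for t, subs in minimum_subspaces(cube).items():
        for u in subs:
            csc.setdefault(u, []).append(t)
    return csc


mss = minimum_subspaces(skycube)
csc = compress(skycube)
print(len(mss["t6"]), sum(len(objs) for objs in csc.values()))
```

On this example the compressed structure keeps 11 object references in 6 subspaces, against 39 references in the 15 subspaces of the full skycube.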
The Compressed Skycube (CSC)
Definition: The Compressed Skycube (CSC) consists of non-empty subspaces U, such that an object t is stored in a subspace U if and only if U is a minimum subspace of t, i.e. U ∈ mss(t).

CSC:

Subspace  Skyline
u1        t7
u2        t6
u3        t6
u4        t4, t5, t7
u1, u2    t5, t9
u1, u3    t1, t5, t9

Minimum subspaces: t4: {u4}; t9: {u1, u2}, {u1, u3}; t7: {u1}, {u4}; t1: {u1, u3}; t5: {u4}, {u1, u2}, {u1, u3}; t6: {u2}, {u3}.
Querying CSC
Only visit the CSC, not the whole dataset. Output is non-blocking!
Example: find the skyline in subspace (u2, u3, u4). Output: t6, t5.

CSC:

Subspace  Skyline
u1        t7
u2        t6
u3        t6
u4        t4, t5, t7
u1, u2    t5, t9
u1, u3    t1, t5, t9

Objects stored in the CSC:

      u1  u2  u3  u4
t1     3   4   2   5
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1
t9     2   2   3   7

Theorem 1: Given a query space Uq and an object t, if Ui ⊈ Uq for every subspace Ui in mss(t), then t is not in the skyline of Uq.
Consequence: search only the subspaces which are subsets of the query space.
Theorem 2 (Local Comparison): To check a candidate t in a subspace V ⊆ Uq, we only need to compare t with the objects within the same subspace.
Consequence: compare candidates within their own subspaces.
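A minimal sketch of a CSC lookup, using the CSC and the six stored objects from the slide. Candidates are collected from the stored subspaces that are subsets of the query space, as Theorem 1 allows. For the filtering step this sketch compares each candidate with all other candidates in the query space, which is a conservative stand-in for the local comparison of Theorem 2 rather than the talk's own procedure; the function name `query_csc` is illustrative.

```python
DIMS = ("u1", "u2", "u3", "u4")

# Objects stored in the CSC (smaller is better in every dimension).
rows = {
    "t1": (3, 4, 2, 5), "t4": (4, 3, 6, 1), "t5": (2, 2, 3, 1),
    "t6": (6, 1, 1, 3), "t7": (1, 3, 4, 1), "t9": (2, 2, 3, 7),
}

# CSC from the slide: each object is stored only in its minimum subspaces.
csc = {
    frozenset({"u1"}): ["t7"],
    frozenset({"u2"}): ["t6"],
    frozenset({"u3"}): ["t6"],
    frozenset({"u4"}): ["t4", "t5", "t7"],
    frozenset({"u1", "u2"}): ["t5", "t9"],
    frozenset({"u1", "u3"}): ["t1", "t5", "t9"],
}


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def query_csc(csc, rows, query_space):
    uq = frozenset(query_space)
    idx = [i for i, d in enumerate(DIMS) if d in uq]
    # Theorem 1: only subspaces contained in the query space can contribute.
    candidates = []
    for subspace, objs in csc.items():
        if subspace <= uq:
            candidates.extend(t for t in objs if t not in candidates)
    # Conservative filter: compare each candidate with every other candidate.
    proj = {t: tuple(rows[t][i] for i in idx) for t in candidates}
    return [t for t in candidates
            if not any(dominates(proj[s], proj[t]) for s in candidates if s != t)]


print(sorted(query_csc(csc, rows, ("u2", "u3", "u4"))))
```

For the query (u2, u3, u4) only the subspaces u2, u3 and u4 are visited, giving candidates t6, t4, t5 and t7; t4 and t7 are then dominated by t5, which leaves t5 and t6.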
Updating CSC
Crucial questions: When do we access the whole dataset to retrieve new skyline objects? When do we re-compute the skylines of certain subspaces?
Full-space: the subspace containing all dimensions, represented as D. Skyline objects in full-space: sky(D).
t: object before update; tnew: object after update.
Case 1: t ∉ sky(D). No need to access data.
  tnew ∉ sky(D): no skyline computation. Existing CSC objects are not changed.
  tnew ∈ sky(D): may update subspace skylines.
Case 2: t ∈ sky(D). May access dataset. Retrieve new skyline objects.
The number of full-space skyline objects is small compared to the whole dataset!
Performance (Full-space)
Dimensionality: 6. Object cardinality: [100K, 500K]. Distribution: uniform.
[Figures: storage efficiency, query efficiency, update efficiency.]
Thank you! Questions?