The Skyline Query in Databases Which Objects are the Most Important?

1 The Skyline Query in Databases Which Objects are the Most Important?
Donghui Zhang College of Computer and Information Science Northeastern University

2 The Skyline of Boston
Buildings that are not “dominated”. A building is dominated if it is both shorter and farther away than another building.
2019/7/1 Tian Xia - Northeastern University

3 The Skyline of NBA Players
NBA statistics data*: 19,112 records, 17 attributes. A sample of the data from 2004:
Name              Points  Rebounds  Assists  Steals  ……
Tracy McGrady     2003    484       448      135
Kobe Bryant       1819    392       398      86
Shaquille O'Neal  1669    760       200      36
Yao Ming          1465    669       61       34
Dwyane Wade       1854    397       520      121
Steve Nash        1165    249       861      74
Who are the best players? Best = not dominated by any other player.
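The "not dominated" test can be tried on the six rows shown on this slide. The following is a minimal Python sketch, not part of the original deck. It assumes that larger is better in every column and that one player dominates another when he is at least as good in every attribute and strictly better in at least one; the slide does not say how ties are treated, so that rule is an assumption. The names `PLAYERS` and `dominates` are illustrative.

```python
# Season totals copied from the slide: (points, rebounds, assists, steals).
PLAYERS = {
    "Tracy McGrady": (2003, 484, 448, 135),
    "Kobe Bryant": (1819, 392, 398, 86),
    "Shaquille O'Neal": (1669, 760, 200, 36),
    "Yao Ming": (1465, 669, 61, 34),
    "Dwyane Wade": (1854, 397, 520, 121),
    "Steve Nash": (1165, 249, 861, 74),
}


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Assumption: larger values are better in every column.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# A player is "best" if no player in the sample dominates him.
best = [
    name
    for name, stats in PLAYERS.items()
    if not any(dominates(other, stats) for other in PLAYERS.values())
]
print(best)
```

On these six rows and four attributes only, Kobe Bryant is dominated by Tracy McGrady and Yao Ming by Shaquille O'Neal, so the other four players remain. The full dataset with 17 attributes may give a different picture.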

9 The Skyline of Hotels
Example: suppose a student wants to find a hotel near the ICFP’07 conference hotel. Which are the best choices?
[Figure: hotels in Germany, t1 to t7, plotted by price and distance. t2, t3 and t4 are dominated; the remaining hotels are the skyline objects.]
The smaller, the better!

10 Skyline Query Applications
Find best NBA players: (#points, #rebounds), or any other subset of the 17 dimensions.
Find best hotels: (price, distance to conference hotel).
Find best researchers: (#pubs in POPL, PLDI, ICFP, SIGCOMM, SIGMOD).
Any table in an RDBMS has a list of records with multiple attributes, so ……

11 How to Find a Skyline?
For every object o:
Compare o with all other objects.
Report o if it is not dominated by any of them.
Complexity: O(n^2).
Problem: in a large database, O(n^2) is inefficient!
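The pairwise procedure on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the presenter's code: smaller values are taken to be better (as in the hotel example), ties count as "no worse", and the hotel tuples are made up for the example.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in at least one.

    Assumption: smaller values are better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_skyline(objects):
    """Compare every object with every other object: O(n^2) dominance tests."""
    skyline = []
    for i, o in enumerate(objects):
        if not any(dominates(p, o) for j, p in enumerate(objects) if j != i):
            skyline.append(o)
    return skyline


# Hypothetical hotels as (price, distance) pairs; not taken from the slides.
hotels = [(50, 8), (60, 9), (70, 6), (90, 7), (100, 2), (120, 1), (130, 3)]
print(naive_skyline(hotels))
```

Every object is tested against up to n - 1 others, which is where the quadratic cost comes from.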

12 Why is an O(n^2) Algorithm Inefficient?
Data size is large → stored on disk.

13 Each disk access: seek time + rotational delay + transfer time
[Figure: disk anatomy showing platters, spindle, disk head, arm assembly, arm movement, tracks and a sector.]

14 Why is an O(n^2) Algorithm Inefficient?
Data size is large → stored on disk.
Sample scenario:
Disk page size: 8KB.
Database size: 1GB = 131,072 disk pages.
Let each disk I/O be 1 ms.
O(n): 131 seconds ≈ 2 minutes. (Not efficient!)
O(n^2): ≈ 200 days! (Out of the question!)
Find the nearest hospital…
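The figures in this scenario can be checked with a few lines of arithmetic. The sketch below assumes, as the slide does, that n is the number of disk pages and that each page access costs exactly one I/O of 1 ms; the variable names are illustrative.

```python
PAGE_SIZE = 8 * 1024   # 8 KB per disk page
DB_SIZE = 1024 ** 3    # 1 GB database
IO_MS = 1              # assumed cost of one disk I/O, in milliseconds

pages = DB_SIZE // PAGE_SIZE                        # n, counted in pages
linear_seconds = pages * IO_MS / 1000               # one pass over the data
quadratic_days = pages ** 2 * IO_MS / 1000 / 86400  # n^2 page accesses

print(pages)
print(round(linear_seconds), "seconds")
print(round(quadratic_days), "days")
```

The quadratic cost comes out just under 200 days, which matches the rounded figure on the slide.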

15 Content
Motivation and definition of skyline
Branch-and-Bound Skyline (BBS)
Compressed Skycube (CSC)
BBS: [PTFS03], SIGMOD
CSC: [XZ06], SIGMOD

16 Branch-and-Bound Skyline (BBS)
Use an R-tree to index the objects.
Find the nearest neighbor (NN) to the origin. This is a skyline object.
Prune the search space.
Repeat finding the NN in the unpruned space.

17 R-Tree Motivation
[Figure: objects a to m plotted in the plane, with x and y axes marked from 2 to 10.]
Range query: find the objects in a given range. E.g. find all hotels in Boston.
No index: scan through all objects. NOT EFFICIENT!

18 R-Tree: Clustering by Proximity

21 Range Query
[Figure: the objects a to m on the same axes, grouped into R-tree rectangles. Tree: the root holds E1 and E2; below them are E3 to E7, whose entries are the objects a to m.]
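How a range query benefits from the tree can be sketched as follows. This is a simplified illustration, not a full R-tree: the tree is built by hand, the coordinates are hypothetical (the slide's point positions did not survive in the transcript), and the node names only echo the E1, E2, ... labels of the figure. A subtree is visited only when its minimum bounding rectangle (MBR) overlaps the query rectangle.

```python
def intersects(mbr, query):
    """Axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap, borders included."""
    return (mbr[0] <= query[2] and query[0] <= mbr[2]
            and mbr[1] <= query[3] and query[1] <= mbr[3])


def range_query(node, query, visited):
    """Collect points inside query, descending only into children whose MBR overlaps it."""
    visited.append(node["name"])
    if "points" in node:  # leaf node: test the stored points one by one
        return [p for p in node["points"]
                if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]]
    hits = []
    for child in node["children"]:
        if intersects(child["mbr"], query):
            hits.extend(range_query(child, query, visited))
    return hits


# Hand-built, hypothetical tree: two internal nodes over three leaves.
E3 = {"name": "E3", "mbr": (1, 2, 2, 3), "points": [(1, 2), (2, 3)]}
E4 = {"name": "E4", "mbr": (4, 1, 5, 4), "points": [(4, 1), (5, 4)]}
E5 = {"name": "E5", "mbr": (7, 8, 9, 9), "points": [(7, 8), (9, 9)]}
E1 = {"name": "E1", "mbr": (1, 1, 5, 4), "children": [E3, E4]}
E2 = {"name": "E2", "mbr": (7, 8, 9, 9), "children": [E5]}
ROOT = {"name": "Root", "mbr": (1, 1, 9, 9), "children": [E1, E2]}

visited = []
hits = range_query(ROOT, (3, 0, 6, 5), visited)
print(hits)
print(visited)
```

In this run E2 and E3 are never opened, because their rectangles do not overlap the query; a scan without an index would have touched every point.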

23 Branch-and-Bound Skyline (BBS)
Assume all points are indexed in an R-tree.
mindist(MBR) = the L1 distance between its lower-left corner and the origin.

24 Branch-and-Bound Skyline (BBS)
Each heap entry keeps the mindist of the MBR.

25 Example of BBS
Process entries in ascending order of their mindists.
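The heap-driven processing order can be illustrated with a simplified sketch. It is written in the spirit of BBS and is not the published algorithm verbatim: the tree is built by hand instead of using a real R-tree, the points are hypothetical, smaller values are assumed to be better, and every entry (point or MBR) is represented only by its lower-left `corner`. The helpers `point`, `node` and `bbs` are illustrative names.

```python
import heapq
from itertools import count


def dominates(a, b):
    """a dominates b: no larger anywhere, strictly smaller somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def point(x, y):
    return {"corner": (x, y)}


def node(*children):
    """Internal entry whose corner is the lower-left corner of its children's MBR."""
    corner = tuple(min(vals) for vals in zip(*(c["corner"] for c in children)))
    return {"corner": corner, "children": list(children)}


def bbs(root):
    skyline, heap, tiebreak = [], [], count()

    def push(entry):
        corner = entry["corner"]
        # Skip entries whose whole region is dominated by a known skyline point.
        if not any(dominates(s, corner) for s in skyline):
            # mindist = L1 distance from the lower-left corner to the origin.
            heapq.heappush(heap, (sum(corner), next(tiebreak), entry))

    for child in root["children"]:
        push(child)
    while heap:
        _, _, entry = heapq.heappop(heap)  # smallest mindist first
        if any(dominates(s, entry["corner"]) for s in skyline):
            continue                       # pruned since it was pushed
        if "children" in entry:
            for child in entry["children"]:
                push(child)
        else:
            skyline.append(entry["corner"])  # a point that nothing dominates
    return skyline


# Hypothetical data: seven points grouped into three leaves under two nodes.
root = node(
    node(node(point(1, 8), point(2, 9)), node(point(3, 6), point(5, 7))),
    node(node(point(6, 2), point(8, 1), point(9, 3))),
)
print(sorted(bbs(root)))
```

Because entries leave the heap in ascending mindist order, any point that could dominate a popped point has already been reported, so a popped point that survives the dominance check can be output immediately.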

31 Content
Motivation and definition of skyline
Branch-and-Bound Skyline (BBS)
Compressed Skycube (CSC)
BBS: [PTFS03], SIGMOD
CSC: [XZ06], SIGMOD

32 Subspace Skyline Queries
Hotels have many attributes, e.g. price, distance, star rating, food quality, facility, location, transportation, …
Users may ask skyline queries on any subset of the attributes, depending on their interests.
Subspace skylines can be very different!
[Figure: objects t1 to t7 with four dimensions u1 to u4, plotted once in subspace (u1, u3) and once in subspace (u2, u3), with the skyline of each.]

33 Our Problem
Our problem: how to support arbitrary subspace skyline queries in dynamic, frequently updated databases?
Problem settings:
Online systems: the database server receives multiple concurrent skyline queries on arbitrary, unpredictable subspaces.
Frequently updated databases: the data are also constantly changing. E.g., in an online hotel-booking system, room prices change with availability.

34 Straightforward Solutions
On-the-fly computation: slow in query processing
  Compute the results from scratch
  Process the whole dataset for each query
Pre-compute and store all subspace skylines: high update costs
  Expensive to correctly maintain all results
  Waste of storage

35 The complete pre-computation
Skycube: the skyline of every subspace of u1, u2, u3, u4.
Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7
Contains many duplicates, e.g. t6 appears 12 times.
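Complete pre-computation can be sketched as a brute-force loop over all non-empty subspaces. The object values below are hypothetical, because the slide's 4-dimensional object table did not survive in the transcript; indices 0 to 3 stand in for u1 to u4, smaller values are assumed to be better, and `skycube` is an illustrative name.

```python
from itertools import combinations

# Hypothetical 4-dimensional objects; not the values used on the slides.
OBJECTS = {
    "t1": (1, 4, 3, 2),
    "t2": (2, 1, 4, 3),
    "t3": (3, 3, 1, 4),
    "t4": (4, 4, 4, 4),
    "t5": (2, 2, 2, 5),
    "t6": (1, 5, 5, 5),
}


def dominates(a, b, dims):
    """a dominates b when only the dimensions in dims are considered."""
    return (all(a[d] <= b[d] for d in dims)
            and any(a[d] < b[d] for d in dims))


def skyline(objects, dims):
    return sorted(
        name for name, t in objects.items()
        if not any(dominates(s, t, dims) for s in objects.values())
    )


def skycube(objects, n_dims):
    """Skyline of every non-empty subspace: 2^n_dims - 1 entries."""
    return {
        dims: skyline(objects, dims)
        for k in range(1, n_dims + 1)
        for dims in combinations(range(n_dims), k)
    }


cube = skycube(OBJECTS, 4)
stored = sum(len(names) for names in cube.values())
copies_of_t1 = sum("t1" in names for names in cube.values())
print(len(cube), "subspaces,", stored, "stored entries")
print("t1 is stored", copies_of_t1, "times")
```

With this made-up data, one object is repeated in most of the 15 subspace skylines, which is the same kind of duplication the slide points out for t6.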

36 Our Solution: the Compressed Skycube
The Compressed Skycube achieves both fast query response and efficient update support.
New storage model: represent all skylines in a very concise way, by preserving only the essential information of subspace skylines.
New query processing algorithm: efficiently answer arbitrary subspace skyline queries without accessing the whole dataset.
New object-aware update scheme: avoid unnecessary data access and subspace skyline computation upon updates.

37 Minimum Subspace (mss)
Object t6 appears in the skylines of 12 subspaces. The number of minimum subspaces of t6 is only 2.
Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7
Minimum Subspaces
Object  Minimum subspaces
t1      {u1, u3}
t4      {u4}
t5      {u4}, {u1, u2}, {u1, u3}
t6      {u2}, {u3}
t7      {u1}, {u4}
t9      {u1, u2}, {u1, u3}

38 The Compressed Skycube (CSC)
Definition: The Compressed Skycube (CSC) consists of non-empty subspaces U, such that an object t is stored in a subspace U if and only if U is a minimum subspace of t, i.e. U ∈ mss(t).
CSC
Subspace  Objects
u1        t7
u2        t6
u3        t6
u4        t4, t5, t7
u1, u2    t5, t9
u1, u3    t1, t5, t9
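The definition can be turned into a small construction sketch: compute every subspace skyline by brute force, keep for each object only the minimal subspaces in which it appears, and store the object under those subspaces. This is an illustration of the definition, not the authors' construction algorithm. The data is hypothetical, indices 0 to 3 stand in for u1 to u4, smaller is assumed better, and `minimum_subspaces` and `compressed_skycube` are illustrative names.

```python
from itertools import combinations

# Hypothetical 4-dimensional objects; not the values used on the slides.
OBJECTS = {
    "t1": (1, 4, 3, 2),
    "t2": (2, 1, 4, 3),
    "t3": (3, 3, 1, 4),
    "t4": (4, 4, 4, 4),
    "t5": (2, 2, 2, 5),
    "t6": (1, 5, 5, 5),
}


def dominates(a, b, dims):
    return (all(a[d] <= b[d] for d in dims)
            and any(a[d] < b[d] for d in dims))


def skycube(objects, n_dims):
    """Brute-force skyline of every non-empty subspace."""
    return {
        dims: [n for n, t in objects.items()
               if not any(dominates(s, t, dims) for s in objects.values())]
        for k in range(1, n_dims + 1)
        for dims in combinations(range(n_dims), k)
    }


def minimum_subspaces(cube, name):
    """Subspaces whose skyline contains name and that have no smaller such subspace."""
    member = [set(dims) for dims, names in cube.items() if name in names]
    return [u for u in member if not any(v < u for v in member)]


def compressed_skycube(objects, n_dims):
    cube = skycube(objects, n_dims)
    csc = {}
    for name in objects:
        for u in minimum_subspaces(cube, name):
            csc.setdefault(tuple(sorted(u)), []).append(name)
    return csc


csc = compressed_skycube(OBJECTS, 4)
for subspace in sorted(csc, key=lambda s: (len(s), s)):
    print(subspace, csc[subspace])
```

Each object is stored once per minimum subspace, so an object that sits in a dozen subspace skylines may need only one or two entries, and an object that is in no skyline at all is not stored.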

39 Querying CSC
Only visit the CSC, not the whole dataset. Output is non-blocking!
Example: find the skyline in subspace u2, u3, u4.
CSC
Subspace  Objects
u1        t7
u2        t6
u3        t6
u4        t4, t5, t7
u1, u2    t5, t9
u1, u3    t1, t5, t9
Theorem 1: Given a query space Uq and an object t, if Ui ⊈ Uq for every subspace Ui in mss(t), then t is not in the skyline of Uq.
Search the subspaces which are subsets of the query space.
Theorem 2 (Local Comparison): To check a candidate t in a subspace V ⊆ Uq, we only need to compare t with the objects within the same subspace.
Compare candidates within their own subspaces.
In the example, the CSC subspaces that are subsets of the query space are u2, u3 and u4, and the answer is t5, t6.
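One way to read the two theorems as a query procedure is sketched below. It is an interpretation for illustration, not the authors' implementation. The CSC contents and object values are hypothetical, written out by hand for a made-up dataset in which indices 0 to 3 stand in for u1 to u4 and smaller is better; `query_csc` is an illustrative name. Only objects stored in the CSC are needed, so the dataset's remaining object (one that is in no skyline) does not appear at all.

```python
# Hypothetical CSC: subspace (tuple of dimension indices) -> stored objects.
CSC = {
    (0,): ["t1", "t6"],
    (1,): ["t2"],
    (2,): ["t3"],
    (3,): ["t1"],
    (0, 2): ["t5"],
    (1, 2): ["t5"],
}

# Values of the objects that occur in the CSC (hypothetical).
VALUES = {
    "t1": (1, 4, 3, 2),
    "t2": (2, 1, 4, 3),
    "t3": (3, 3, 1, 4),
    "t5": (2, 2, 2, 5),
    "t6": (1, 5, 5, 5),
}


def dominates(a, b, dims):
    return (all(a[d] <= b[d] for d in dims)
            and any(a[d] < b[d] for d in dims))


def query_csc(csc, values, query_dims):
    """Skyline of the subspace query_dims, computed from the CSC alone."""
    query = set(query_dims)
    result = set()
    for subspace, names in csc.items():
        # Theorem 1: only subspaces contained in the query space can contribute.
        if not set(subspace) <= query:
            continue
        # Theorem 2: compare each candidate only with objects of the same subspace,
        # using the dimensions of the query space.
        for name in names:
            if not any(dominates(values[o], values[name], query_dims) for o in names):
                result.add(name)
    return sorted(result)


print(query_csc(CSC, VALUES, (0, 3)))
print(query_csc(CSC, VALUES, (1, 2)))
```

In the first query both t1 and t6 are candidates from subspace (0,), and the local comparison removes t6 because t1 is better on dimension 3. Candidates from one subspace can be reported without waiting for the others, which is one way to understand the "non-blocking" remark.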

40 Updating CSC
Crucial questions:
When do we access the whole dataset to retrieve new skyline objects?
When do we re-compute the skylines of certain subspaces?
Full-space: a subspace containing all dimensions, represented as D.
Skyline objects in full-space: sky(D).
t: object before update; tnew: object after update.
Case 1: t ∉ sky(D). No need to access data.
  tnew ∉ sky(D): no skyline computation. Existing CSC objects are not changed.
  tnew ∈ sky(D): may update subspace skylines.
Case 2: t ∈ sky(D). May access dataset. Retrieve new skyline objects.
The number of full-space skyline objects is small compared to the whole dataset!

41 Performance (Full-space)
Dimensionality: 6. Object cardinality: [100K, 500K]. Distribution: Uniform.
[Figures: storage efficiency, query efficiency, update efficiency.]

42 Thank you! Questions?

