The Skyline Query in Databases: Which Objects are the Most Important?


The Skyline Query in Databases: Which Objects are the Most Important?
Donghui Zhang
College of Computer and Information Science, Northeastern University

The Skyline of Boston
Buildings that are not “dominated”, i.e. not both shorter and farther away than some other building.
2019/7/1 Tian Xia - Northeastern University

The Skyline of NBA Players
NBA statistics data*: 19,112 records, 1946-2004, 17 attributes. A sample of the 2004 data:

Name               Points  Rebounds  Assists  Steals  ……
Tracy McGrady        2003       484      448     135
Kobe Bryant          1819       392      398      86
Shaquille O'Neal     1669       760      200      36
Yao Ming             1465       669       61      34
Dwyane Wade          1854       397      520     121
Steve Nash           1165       249      861      74

Who are the best players? Best = not dominated by any other player.
* www.databaseBasketball.com


The Skyline of Hotels
Example: suppose a student wants to find a hotel near the ICFP’07 conference hotel. Which are the best choices?

Hotels in Germany:

      Price  Distance
t1      3       2
t2      4       7
t3      9       5
t4      4       6
t5      2       3
t6      6       1
t7      1       4

The smaller, the better! t2, t3 and t4 are dominated. Skyline objects: t1, t5, t6, t7.
[Figure: the hotels plotted with Price on the x-axis and Distance on the y-axis.]

Skyline Query Applications
- Find best NBA players: (#points, #rebounds), or any other subset of the 17 dimensions.
- Find best hotels: (price, distance to conference hotel).
- Find best researchers: (#pubs in POPL, PLDI, ICFP, SIGCOMM, SIGMOD).
- Any table in an RDBMS has a list of records with multiple attributes, so ……

How to Find a Skyline?
For every object o:
- Compare o with all other objects.
- Report o if it is not dominated by any of them.
Complexity: O(n²).
Problem: in a large database, O(n²) is inefficient!
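The nested-loop procedure above can be sketched in a few lines of Python. This is only an illustration: the function names `dominates` and `naive_skyline` are made up here, the data are the seven hotels from the hotel slide, and smaller values are assumed to be better in every attribute.

```python
def dominates(a, b):
    """a dominates b: no worse in every attribute, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def naive_skyline(objects):
    """Compare every object with every other object: O(n^2) dominance checks."""
    return [
        name
        for name, values in objects.items()
        if not any(
            dominates(other, values)
            for other_name, other in objects.items()
            if other_name != name
        )
    ]

# The two attribute values of each hotel, in the order the hotel slide lists them.
hotels = {
    "t1": (3, 2), "t2": (4, 7), "t3": (9, 5), "t4": (4, 6),
    "t5": (2, 3), "t6": (6, 1), "t7": (1, 4),
}

skyline = naive_skyline(hotels)
print(skyline)  # ['t1', 't5', 't6', 't7']
```

t2, t3 and t4 drop out because t1 is at least as good as each of them in both attributes.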

Why is an O(n²) Algorithm Inefficient?
Data size is large → stored on disk.

[Figure: anatomy of a disk: spindle, platters, tracks, sectors, disk head, arm assembly, arm movement.]
Each disk access = seek time + rotational delay + transfer time.

Why is an O(n²) Algorithm Inefficient?
Data size is large → stored on disk.
Sample scenario:
- Disk page size: 8KB.
- Database size: 1GB = 131,072 disk pages.
- Let each disk I/O take 1 ms.
- O(n): 131 seconds ≈ 2 minutes. (Not efficient!)
- O(n²): ≈ 200 days! (Out of the question!)
Find the nearest hospital…
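The figures in this scenario can be checked with a few lines of arithmetic. The sketch below assumes, as the slide does, that n is the number of disk pages and that every page access costs exactly one I/O; the variable names are illustrative.

```python
PAGE_SIZE = 8 * 1024        # 8KB per disk page
DB_SIZE = 1024 ** 3         # 1GB database
IO_MS = 1                   # assumed cost of one disk I/O, in milliseconds

pages = DB_SIZE // PAGE_SIZE                            # n = 131,072 pages
linear_seconds = pages * IO_MS / 1000                   # one pass over the data
quadratic_days = pages ** 2 * IO_MS / 1000 / 86_400     # n^2 page accesses

print(pages)                     # 131072
print(round(linear_seconds))     # 131 (seconds, roughly 2 minutes)
print(round(quadratic_days))     # 199 (days, roughly 200)
```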

Content
- Motivation and definition of skyline
- Branch-and-Bound Skyline (BBS)
- Compressed Skycube (CSC)
BBS: [PTFS03], SIGMOD. CSC: [XZ06], SIGMOD.

Branch-and-Bound Skyline (BBS)
1. Use an R-tree to index the objects.
2. Find the nearest neighbor (NN) to the origin. This is a skyline object.
3. Prune the search space that this object dominates.
4. Repeat: find the NN in the unpruned space.

R-Tree Motivation
[Figure: the objects a to m plotted on x and y axes, each ranging from 0 to 10.]
Range query: find the objects in a given range, e.g. find all hotels in Boston.
No index: scan through all objects. NOT EFFICIENT!

R-Tree: Clustering by Proximity

Range Query
[Figure: the objects a to m on the x-y plane together with the R-tree that indexes them: root entries E1 and E2, intermediate entries E3 to E7, and the objects a to m at the leaf level.]
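As a rough illustration of why the index helps, the sketch below answers a range query on a tiny hand-built hierarchy of bounding rectangles and only descends into nodes whose rectangle intersects the query. It is not a real R-tree (no insertion or node splitting), the `Node` class and function names are invented for this example, and the coordinates are hypothetical because the slides do not give them.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    mbr: tuple                                      # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)    # child nodes (internal node)
    points: dict = field(default_factory=dict)      # name -> (x, y) (leaf node)

def intersects(a, b):
    """True if rectangles a and b overlap (touching counts as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(node, query, visited):
    """Collect the points inside `query`, visiting only intersecting subtrees."""
    visited.append(node.name)
    xmin, ymin, xmax, ymax = query
    hits = [n for n, (x, y) in node.points.items()
            if xmin <= x <= xmax and ymin <= y <= ymax]
    for child in node.children:
        if intersects(child.mbr, query):
            hits += range_query(child, query, visited)
    return hits

# Hypothetical coordinates: two leaves under one root.
leaf1 = Node("E1", (1, 1, 3, 3), points={"a": (1, 2), "b": (2, 3), "c": (3, 1)})
leaf2 = Node("E2", (6, 5, 8, 7), points={"d": (6, 5), "e": (7, 7), "f": (8, 6)})
root = Node("Root", (1, 1, 8, 7), children=[leaf1, leaf2])

visited = []
result = range_query(root, (0, 0, 2.5, 3.5), visited)
print(result)    # ['a', 'b']
print(visited)   # ['Root', 'E1']
```

The leaf E2 is never read because its rectangle does not intersect the query range, which is the saving over a full scan.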

Branch-and-Bound Skyline (BBS)
Assume all points are indexed in an R-tree.
mindist(MBR) = the L1 distance between the lower-left corner of the MBR (minimum bounding rectangle) and the origin.

Branch-and-Bound Skyline (BBS)
Each heap entry keeps the mindist of the MBR.

Example of BBS
Process entries in ascending order of their mindists.
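The processing order described above can be sketched as follows. This is a simplified illustration, not the algorithm as published in [PTFS03]: the tree is a hand-built two-level hierarchy rather than a real R-tree, the grouping of the hotels into leaves is hypothetical, and the names `Node`, `lower_corner`, `mindist` and `bbs` are chosen for this example. Smaller values are assumed to be better.

```python
import heapq
from itertools import count

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

class Node:
    """An index node; its lower-left MBR corner is the per-dimension minimum of its children."""
    def __init__(self, children):
        self.children = children                      # Nodes or (name, point) pairs
        corners = [lower_corner(c) for c in children]
        self.lower = tuple(min(vals) for vals in zip(*corners))

def lower_corner(entry):
    return entry.lower if isinstance(entry, Node) else entry[1]

def mindist(entry):
    """L1 distance between the entry's lower-left corner and the origin."""
    return sum(lower_corner(entry))

def bbs(root):
    skyline = []                                      # (name, point) pairs found so far
    tie = count()                                     # tie-breaker so entries are never compared
    heap = [(mindist(root), next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)             # smallest mindist first
        if any(dominates(p, lower_corner(entry)) for _, p in skyline):
            continue                                  # pruned: everything inside is dominated
        if isinstance(entry, Node):
            for child in entry.children:
                if not any(dominates(p, lower_corner(child)) for _, p in skyline):
                    heapq.heappush(heap, (mindist(child), next(tie), child))
        else:
            skyline.append(entry)                     # an undominated point is a skyline object
    return skyline

# Hypothetical grouping of the seven hotels into three leaves.
root = Node([
    Node([("t1", (3, 2)), ("t5", (2, 3)), ("t7", (1, 4))]),
    Node([("t2", (4, 7)), ("t4", (4, 6))]),
    Node([("t3", (9, 5)), ("t6", (6, 1))]),
])

names = sorted(name for name, _ in bbs(root))
print(names)  # ['t1', 't5', 't6', 't7']
```

In this run the leaf holding t2 and t4 is discarded without being opened, because its lower-left corner (4, 6) is already dominated by t1 when it reaches the top of the heap.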


Content
- Motivation and definition of skyline
- Branch-and-Bound Skyline (BBS)
- Compressed Skycube (CSC)
BBS: [PTFS03], SIGMOD. CSC: [XZ06], SIGMOD.

Subspace Skyline Queries
- Hotels have many attributes, e.g. price, distance, star rating, food quality, facility, location, transportation, …
- Users may ask skyline queries on any subset of the attributes, depending on their interests.
- Subspace skylines can be very different!

Objects with 4 dimensions:

      u1  u2  u3  u4
t1     3   4   2   5
t2     4   6   7   2
t3     9   7   5   6
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1

[Figures: the objects plotted in subspace (u1, u3) and in subspace (u2, u3), each with its skyline marked.]

Our Problem
Our problem: how to support arbitrary subspace skyline queries in dynamic, frequently updated databases?
Problem settings:
- Online systems: the database server receives multiple concurrent skyline queries on arbitrary, unpredictable subspaces.
- Frequently updated databases: the data are also constantly changing. E.g., in an online hotel-booking system, room prices change with availability.

Straightforward Solutions
- On-the-fly computation: slow in query processing.
  - Compute the results from scratch.
  - Process the whole dataset for each query.
- Pre-compute and store all subspace skylines: high update costs.
  - Expensive to correctly maintain all results.
  - Waste of storage.

The Complete Pre-computation

Objects:

      u1  u2  u3  u4
t1     3   4   2   5
t2     4   6   7   2
t3     9   7   5   6
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1
t8     6   5   3   8
t9     2   2   3   7

Skycube:

Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7

Contains many duplicates, e.g. t6 appears 12 times.
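The skycube on this slide can be reproduced by brute force: enumerate every non-empty subset of the four dimensions and compute a skyline for each. The sketch below does exactly that with a naive dominance test (smaller is better). It only illustrates what full pre-computation stores, not how a real system would build it, and the helper names are made up for this example.

```python
from itertools import combinations

DIMS = ("u1", "u2", "u3", "u4")
DATA = {
    "t1": (3, 4, 2, 5), "t2": (4, 6, 7, 2), "t3": (9, 7, 5, 6),
    "t4": (4, 3, 6, 1), "t5": (2, 2, 3, 1), "t6": (6, 1, 1, 3),
    "t7": (1, 3, 4, 1), "t8": (6, 5, 3, 8), "t9": (2, 2, 3, 7),
}

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(subspace):
    """Naive skyline of DATA projected onto the given tuple of dimensions."""
    idx = [DIMS.index(d) for d in subspace]
    proj = {name: tuple(v[i] for i in idx) for name, v in DATA.items()}
    return [n for n, p in proj.items()
            if not any(dominates(q, p) for m, q in proj.items() if m != n)]

# One skyline per non-empty subspace: 2^4 - 1 = 15 entries.
skycube = {sub: skyline(sub)
           for k in range(1, len(DIMS) + 1)
           for sub in combinations(DIMS, k)}

for sub, sky in skycube.items():
    print(", ".join(sub), "->", ", ".join(sky))

t6_count = sum("t6" in sky for sky in skycube.values())
total_entries = sum(len(sky) for sky in skycube.values())
print(t6_count)       # 12
print(total_entries)  # 39
```

Nine objects produce 39 stored entries across 15 subspaces, which is the redundancy the slide points out.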

Our Solution: the Compressed Skycube
The Compressed Skycube achieves both fast query response and efficient update support.
- New storage model: represent all skylines in a very concise way, by preserving only the essential information of subspace skylines.
- New query processing algorithm: efficiently answer arbitrary subspace skyline queries without accessing the whole dataset.
- New object-aware update scheme: avoid unnecessary data access and subspace skyline computation upon updates.

Minimum Subspace (mss)

Subspace          Skyline
u1                t7
u2                t6
u3                t6
u4                t4, t5, t7
u1, u2            t5, t6, t7, t9
u1, u3            t1, t5, t6, t7, t9
u1, u4            t7
u2, u3            t6
u2, u4            t5, t6
u3, u4            t5, t6
u1, u2, u3        t1, t5, t6, t7, t9
u1, u2, u4        t5, t6, t7
u1, u3, u4        t1, t5, t6, t7
u2, u3, u4        t5, t6
u1, u2, u3, u4    t1, t5, t6, t7

Object t6 appears in the skylines of 12 subspaces. The number of minimum subspaces of t6 is only 2.

Object    Minimum subspaces
t1        {u1, u3}
t4        {u4}
t5        {u4}, {u1, u2}, {u1, u3}
t6        {u2}, {u3}
t7        {u1}, {u4}
t9        {u1, u2}, {u1, u3}

The Compressed Skycube (CSC)
Definition: the Compressed Skycube (CSC) consists of non-empty subspaces U, such that an object t is stored in a subspace U if and only if U is a minimum subspace of t, i.e. U ∈ mss(t).

CSC:

Subspace    Stored objects
u1          t7
u2          t6
u3          t6
u4          t4, t5, t7
u1, u2      t5, t9
u1, u3      t1, t5, t9

Minimum subspaces:

Object    Minimum subspaces
t1        {u1, u3}
t4        {u4}
t5        {u4}, {u1, u2}, {u1, u3}
t6        {u2}, {u3}
t7        {u1}, {u4}
t9        {u1, u2}, {u1, u3}
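To make the definition concrete, the sketch below derives mss(t) and the CSC for the nine example objects by brute force: it first builds every subspace skyline, then keeps, for each object, only those subspaces that have no proper subset in which the object is already a skyline member. This is an illustration of the definition under a naive dominance test (smaller is better), not the construction procedure of [XZ06], and all helper names are invented here.

```python
from itertools import combinations

DIMS = ("u1", "u2", "u3", "u4")
DATA = {
    "t1": (3, 4, 2, 5), "t2": (4, 6, 7, 2), "t3": (9, 7, 5, 6),
    "t4": (4, 3, 6, 1), "t5": (2, 2, 3, 1), "t6": (6, 1, 1, 3),
    "t7": (1, 3, 4, 1), "t8": (6, 5, 3, 8), "t9": (2, 2, 3, 7),
}

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(subspace):
    idx = [DIMS.index(d) for d in subspace]
    proj = {name: tuple(v[i] for i in idx) for name, v in DATA.items()}
    return {n for n, p in proj.items()
            if not any(dominates(q, p) for m, q in proj.items() if m != n)}

subspaces = [s for k in range(1, len(DIMS) + 1) for s in combinations(DIMS, k)]
skycube = {s: skyline(s) for s in subspaces}

def mss(t):
    """Subspaces where t is a skyline object and no proper subset has that property."""
    member = [s for s in subspaces if t in skycube[s]]
    return [s for s in member if not any(set(v) < set(s) for v in member)]

# The CSC stores t in subspace U exactly when U is in mss(t).
csc = {}
for t in DATA:
    for s in mss(t):
        csc.setdefault(s, []).append(t)

for s in sorted(csc, key=lambda s: (len(s), s)):
    print(", ".join(s), "->", ", ".join(csc[s]))

print(sum(len(v) for v in csc.values()))  # 11 stored entries
```

For this data the CSC keeps 11 entries in 6 subspaces, and t6 is stored only twice, under u2 and u3.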

Querying CSC
- Only visit the CSC, not the whole dataset.
- Output is non-blocking!

Example: find the skyline in subspace u2, u3, u4. Result: t5, t6.

Objects stored in the CSC:

      u1  u2  u3  u4
t1     3   4   2   5
t4     4   3   6   1
t5     2   2   3   1
t6     6   1   1   3
t7     1   3   4   1
t9     2   2   3   7

CSC:

Subspace    Stored objects
u1          t7
u2          t6
u3          t6
u4          t4, t5, t7
u1, u2      t5, t9
u1, u3      t1, t5, t9

Theorem 1: Given a query space Uq and an object t, if Ui ⊈ Uq for every subspace Ui in mss(t), then t is not in the skyline of Uq. Consequence: search only the subspaces which are subsets of the query space.
Theorem 2 (Local Comparison): To check a candidate t in a subspace V ⊆ Uq, we only need to compare t with the objects within the same subspace. Consequence: compare candidates within their own subspaces.
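One way to read the two theorems as a query procedure is sketched below. The CSC contents and object values are copied from the slide. The function `query` is an illustrative interpretation, not the authors' code: it visits only CSC subspaces contained in the query space (Theorem 1) and checks each candidate against the other objects stored in the same subspace, using the query dimensions for the dominance test (Theorem 2).

```python
DIMS = ("u1", "u2", "u3", "u4")

# Values of the objects that appear in the CSC (smaller is better).
VALUES = {
    "t1": (3, 4, 2, 5), "t4": (4, 3, 6, 1), "t5": (2, 2, 3, 1),
    "t6": (6, 1, 1, 3), "t7": (1, 3, 4, 1), "t9": (2, 2, 3, 7),
}

CSC = {
    ("u1",): ["t7"],
    ("u2",): ["t6"],
    ("u3",): ["t6"],
    ("u4",): ["t4", "t5", "t7"],
    ("u1", "u2"): ["t5", "t9"],
    ("u1", "u3"): ["t1", "t5", "t9"],
}

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def query(uq):
    """Skyline of query space `uq`, computed from the CSC only."""
    idx = [DIMS.index(d) for d in uq]

    def proj(t):
        return tuple(VALUES[t][i] for i in idx)

    result = []
    for subspace, objects in CSC.items():
        if not set(subspace) <= set(uq):
            continue                      # Theorem 1: only subsets of uq can contribute
        for t in objects:
            if t in result:
                continue                  # already reported from another subspace
            # Theorem 2: compare t only with objects stored in the same subspace.
            if not any(dominates(proj(o), proj(t)) for o in objects if o != t):
                result.append(t)
    return result

answer = query(("u2", "u3", "u4"))
print(answer)  # ['t6', 't5']
```

For the query space u2, u3, u4 only the subspaces u2, u3 and u4 are visited. t4 and t7 tie with t5 on u4 but lose to it once u2 and u3 are taken into account, so only t6 and t5 are reported.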

Updating CSC
Crucial questions:
- When do we access the whole dataset to retrieve new skyline objects?
- When do we re-compute the skylines of certain subspaces?
Full-space: a subspace containing all dimensions, represented as D.
Skyline objects in full-space: sky(D).
t: object before update; tnew: object after update.
Case 1: t ∉ sky(D). No need to access data.
- tnew ∉ sky(D): no skyline computation. Existing CSC objects are not changed.
- tnew ∈ sky(D): may update subspace skylines.
Case 2: t ∈ sky(D). May access the dataset to retrieve new skyline objects.
The number of full-space skyline objects is small compared to the whole dataset!

Performance (Full-space)
Dimensionality: 6. Object cardinality: [100K, 500K]. Distribution: uniform.
[Figures: storage efficiency, query efficiency, update efficiency.]

Thank you! Questions?