# Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu 1 Georgia Tech VLDB 2009.


In one sentence: we develop a streaming algorithm for the skyline problem with a near-optimal worst-case guarantee.

I want a cheap hotel nearby

| Hotel | Price | Distance |
|---|---|---|
| Athena | \$97 | 2.9 km |
| Park & Suites | \$124 | 3.6 km |
| Hotel du Helder | \$76 | 3.8 km |
| de la Cité Concorde | \$220 | 0.67 km |
| Mercure Carlton Lyon | \$163 | 3.0 km |

Athena dominates Mercure Carlton Lyon: it is both cheaper and closer. It would still dominate if Mercure Carlton Lyon were also 2.9 km away, because the distances would tie and Athena's price is strictly lower.

[Figure: the hotels plotted by price and distance: de la Cité, Park & Suites, du Helder, Athena, Mercure]

Problem definition

Given distinct d-dimensional points, (a_1, …, a_d) dominates (b_1, …, b_d) if a_i ≤ b_i for all i and a_i < b_i for some i.

Skyline = set of undominated points

Example: (1, 3), (5, 2), (3, 2). Here (3, 2) dominates (5, 2), so Skyline = { (1, 3), (3, 2) }.
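The definition translates directly into code. The following is a minimal in-memory sketch, not the streaming algorithm of the talk; the names `dominates` and `naive_skyline` are illustrative and do not come from the slides.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: a_i <= b_i for all i and a_i < b_i for some i."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic baseline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


example = [(1, 3), (5, 2), (3, 2)]
print(naive_skyline(example))  # [(1, 3), (3, 2)]
```

This baseline compares every pair of points and keeps the whole input in memory, which is exactly what the external-memory setting rules out.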

Skyline algorithms

- RAM
  - DD&C (Kung et al., FOCS 75)
  - LD&C (Bentley et al., JACM 78)
  - FLET (Bentley et al., SODA 90)
- Disk (External), preprocessing
  - BBS (Papadias et al., SIGMOD 03)
  - NN (Kossmann et al., VLDB 02)
- Disk (External), non-preprocessing
  - SD&C (Börzsönyi et al., ICDE 01)
  - BNL (Börzsönyi et al., ICDE 01)
  - SFS (Chomicki et al., ICDE 03)
  - LESS (Godfrey et al., VLDB 05)

Our Goal

Non-preprocessing external algorithm with worst-case guarantee

What is the model of external algorithms?

Multi-pass Streaming Model

[Figure: CPU processing and I/O, with I/O split into sequential I/O and random I/O]

Number of random I/Os = number of passes

The streaming model naturally forces us to minimize the number of random I/Os.
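As a rough illustration of the model, the sketch below treats the data as something that can only be scanned front to back and counts one pass per scan. The class `MultiPassStream` and the two toy passes are hypothetical; they only show how passes are counted and are not part of the talk.

```python
from typing import Iterator, List, Tuple

Point = Tuple[float, ...]


class MultiPassStream:
    """Hypothetical stand-in for data on disk that is only read sequentially."""

    def __init__(self, points: List[Point]) -> None:
        self._points = list(points)
        self.passes = 0  # one seek back to the start per pass

    def scan(self) -> Iterator[Point]:
        """Run one sequential pass over the whole data set."""
        self.passes += 1
        yield from self._points


stream = MultiPassStream([(1, 2), (3, 7), (5, 3), (2, 5), (4, 1), (9, 9)])

# Pass 1: remember only the point with the smallest first coordinate.
best = min(stream.scan(), key=lambda p: p[0])
# Pass 2: count the points that `best` matches or beats in every coordinate.
covered = sum(1 for p in stream.scan() if all(b <= x for b, x in zip(best, p)))

print(stream.passes, best, covered)  # 2 (1, 2) 5
```

Only `best` and a counter are kept in memory between passes, which is the kind of small-RAM behaviour the model is meant to capture.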

[Figure: a small RAM reads the points (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9) sequentially from a huge hard disk; the scan is then repeated for the 2nd pass and the 3rd pass]

Our Goal

Non-preprocessing external streaming algorithm with worst-case guarantee

Theory

RAND: Almost optimal multi-pass streaming algorithm for skyline

n = # of points and m = skyline size

- RAND uses O(log n) passes & O(m) space
- Every algorithm that uses 1 pass needs Ω(n) space

Next: RAND algorithm

Later: Experimental result

RAND algorithm

Algorithms: Main Idea

Suppose m is known.

Theorem: In 3 passes and m space, we can find skyline points that dominate at least n/2 points, with high probability.

Eliminate-Points algorithm

1. Sample x = 2m ln(mn log n) points p_1, p_2, …, p_x
2. Go through the stream; replace each p_i by a point dominating it
3. For each p_i, delete p_i and all points it dominates

Output p_1, p_2, …, p_x and repeat.

Example stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4). The sampled point (4, 4) is replaced by (3, 4) and then by (3, 3).
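The replace-and-delete steps can be sketched as follows. This is an illustrative version that keeps the stream in a Python list and takes the pass-1 sample as an argument so the run is reproducible; the function name `eliminate_points` and its signature are assumptions, not the authors' code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eliminate_points(stream: List[Point],
                     sample: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Steps 2 and 3 of Eliminate-Points, given the points sampled in step 1."""
    current = list(sample)
    # Step 2 (one pass): replace p_i whenever the scanned point dominates it.
    for q in stream:
        for i, p in enumerate(current):
            if dominates(q, p):
                current[i] = q
    found = list(dict.fromkeys(current))  # drop repeats, keep order
    # Step 3 (one pass): delete every p_i and every point some p_i dominates.
    survivors = [q for q in stream
                 if q not in found and not any(dominates(p, q) for p in found)]
    return found, survivors


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
found, rest = eliminate_points(stream, sample=[(4, 4)])
print(found, rest)  # [(3, 3)] [(1, 5)]
```

With the sample (4, 4) the run reproduces the trace on the slide: (4, 4) becomes (3, 4), then (3, 3), and deleting (3, 3) together with everything it dominates leaves only (1, 5) for the next round.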

Analysis

Theorem: Eliminate-Points algorithm deletes at least n/2 points with high probability.

Draw trees: each point points to its first dominating point.

Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

[Figure: forest over the nodes (1, 5), (3, 3), (3, 4), (4, 3), (4, 4), (4, 5); the points (4, 4) and (3, 3) are highlighted in turn]

Note: There will be m trees, each rooted by a skyline point.
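The forest used in the analysis can be built explicitly for the example stream. This is a small sketch for checking the picture by hand; `first_dominator_forest` is an illustrative name, and the forest is only an analysis device, not something the streaming algorithm stores.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_dominator_forest(stream: List[Point]) -> Dict[Point, Optional[Point]]:
    """Map each point to the first point in stream order that dominates it.

    Points that nothing dominates (the skyline) map to None and are the roots.
    """
    return {p: next((q for q in stream if dominates(q, p)), None) for p in stream}


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
parent = first_dominator_forest(stream)
roots = [p for p, q in parent.items() if q is None]
print(roots)  # [(1, 5), (3, 3)]
```

For this stream the parent pointers are (4, 5) → (1, 5), (3, 4) → (3, 3), (4, 3) → (3, 3) and (4, 4) → (3, 4), giving m = 2 trees.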

Analysis

Claim: If some element of a tree is sampled, the whole tree will be deleted.

[Figure: in the stream (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4), the sample (4, 4) ends at the root (3, 3), and the tree rooted at (3, 3) is deleted]

Analysis

There are m trees, each rooted by a skyline point. [Figure: trees numbered 1, 2, …, m-1, m]

A big tree has a bigger chance of being sampled … and deleted.

If enough points are sampled, every tree that is big enough will be deleted.

Lemma: With high probability, all trees of size at least n/(2m) are deleted.

The remaining trees hold fewer than m · n/(2m) = n/2 points in total, so we delete at least n/2 points.
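The sample size x = 2m ln(mn log n) is what a standard union-bound argument calls for. The calculation below is a sketch under an assumption the slides do not state, namely that the x samples are drawn independently and uniformly from the n points:

```latex
\begin{align*}
\Pr\bigl[\text{a fixed tree } T \text{ with } |T| \ge \tfrac{n}{2m} \text{ is never sampled}\bigr]
  &\le \Bigl(1 - \frac{1}{2m}\Bigr)^{x}
   \le e^{-x/(2m)}
   = \frac{1}{mn\log n}, \\
\Pr\bigl[\text{some such tree is never sampled}\bigr]
  &\le m \cdot \frac{1}{mn\log n}
   = \frac{1}{n\log n}.
\end{align*}
```

The second line uses the fact that there are at most m trees, one per skyline point.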

Extending to RAND

Recall: If we know m then we can delete n/2 points in 3 passes.

- If m is known, we can find the skyline in O(log n) passes with high probability
  - We delete n/2 points every 3 passes
- m is not known
  - Guess m by doubling trick
  - Additional O(log m) passes
- Fixed-window case
  - Memory space is limited
- Random I/Os, sequential I/Os and number of comparisons have to be analyzed separately
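For the case where m is known, the repetition can be sketched as a loop that runs Eliminate-Points until no point is left. This is an illustrative in-memory version, not the authors' implementation: the name `rand_skyline`, the `seed` parameter, sampling with replacement, reading "log" as log base 2, and capping x at n are all assumptions, and neither the doubling trick for unknown m nor the fixed-window case is covered.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def rand_skyline(points: List[Point], m: int, seed: int = 0) -> List[Point]:
    """Repeat Eliminate-Points until the stream is empty; m >= 1 is assumed known."""
    rng = random.Random(seed)
    remaining = list(points)
    skyline: List[Point] = []
    while remaining:
        n = len(remaining)
        # Pass 1: sample x = 2m ln(mn log n) points (log base 2 assumed, capped at n).
        x = n if n < 2 else min(n, math.ceil(2 * m * math.log(m * n * math.log2(n))))
        current = rng.choices(remaining, k=x)
        # Pass 2: replace each sample whenever the scanned point dominates it.
        for q in remaining:
            for i, p in enumerate(current):
                if dominates(q, p):
                    current[i] = q
        found = set(current)
        # Pass 3: delete the found skyline points and everything they dominate.
        remaining = [q for q in remaining
                     if q not in found and not any(dominates(p, q) for p in found)]
        skyline.extend(sorted(found))
    return skyline


gen = random.Random(1)
pts = list({(gen.randint(0, 50), gen.randint(0, 50)) for _ in range(200)})
brute = [p for p in pts if not any(dominates(q, p) for q in pts)]
print(sorted(rand_skyline(pts, m=len(brute))) == sorted(brute))  # True
```

The output is always the exact skyline, because every round removes at least one skyline point together with the points it dominates; the random sampling affects only how many rounds are needed.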


Algorithms comparison (w = window (memory) size)

| Algorithm | Random I/Os | Sequential I/Os | Comparisons |
|---|---|---|---|
| BNL(w) | (min{w, n/w}) | (min{w, n^2/w}) | (d min{wmn, n^2}) |
| LESS(w) | (n log_w (n/w)) | (mn/w) | (dmn + n log n) |
| RAND(w) | O(m log (n/w)) | O(mn/w) | O(dmn) |

Experiment

[Figure: RAND vs. BNL & LESS, average case and worst case]

We try several datasets in the literature: Correlated, Anti-correlated, Independent, Island, House, NBA, Color

Experimental Results (RAND vs. BNL & LESS)

Average case
- No clear winner between BNL and LESS
- RAND is always close to the winner

Worse: after sorting by decreasing first coordinate
- RAND is the most robust and usually fastest

Even worse: after sorting by entropy

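Summary

[Figure: recap of the talk: from disk to stream on the points (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9); random sampling over the trees 1, 2, …, m-1, m; RAND; experiments comparing RAND with BNL & LESS in the average case and the worst case]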

Summary

Extensions
- Distributed skyline algorithm
- Derandomize the algorithm for 2D case
- Skyline for partially ordered sets (posets)

Open problems
- Develop algorithm on Parallel Disk Model (PDM) and Cache Oblivious model
- Extend the techniques to pre-processing algorithm
- Is O(log n) passes the best possible?

Thank you

Appendix

Charts for average case

The lower bound

Theorem: Any randomized one-pass algorithm with space at most n/2 succeeds with probability at most 1/2.

Proof
- Random unique survivor
- 2 points come at the end
- If space <= n/2, then the algorithm will fail if it didn't store the survivor in memory

Proof of Claim

Claim: If some element of a tree is sampled, the whole tree will be deleted.

Draw trees: each point points to its first dominating point.

Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)

[Figure: as the pass proceeds, the sampled point (4, 4) is replaced by (3, 4) and then by (3, 3), the root of its tree]
