Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu. Georgia Tech. VLDB 2009.


1 Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Jun Xu. Georgia Tech. VLDB 2009

2 In one sentence: we develop a streaming algorithm for the skyline problem with a near-optimal worst-case guarantee.


7 "I want a cheap hotel nearby"

Hotel                  Price   Distance
Athena                 $97     2.9 km
Park & Suites          $124    3.6 km
Hotel du Helder        $76     3.8 km
de la Cité Concorde    $220    0.67 km
Mercure Carlton Lyon   $163    3.0 km

8 Same table, with the "dominates" relation marked.

9 Same table, with Mercure Carlton Lyon now at 2.9 km; the "dominates" relation is still marked.

10 [Figure: the five hotels (de la Cité, Park & Suites, du Helder, Athena, Mercure) plotted on Price and Distance axes]

12 Problem definition
Given distinct d-dimensional points, (a_1, …, a_d) dominates (b_1, …, b_d) if a_i ≤ b_i for all i and a_i < b_i for some i.
Skyline = set of undominated points.
Example: for the points (1, 3), (5, 2), (3, 2), the point (3, 2) dominates (5, 2), so Skyline = { (1, 3), (3, 2) }.
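The definition above can be checked directly on the slide's example. The following is a minimal in-memory sketch, not the streaming algorithm of the talk; the names `dominates` and `naive_skyline` are chosen here for illustration and are not from the presentation.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic baseline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 3), (5, 2), (3, 2)]
print(naive_skyline(points))  # [(1, 3), (3, 2)]
```

This baseline needs all n points in memory, which is exactly what the streaming setting rules out.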

13 Skyline algorithms
RAM
- DD&C (Kung et al., FOCS 75)
- LD&C (Bentley et al., JACM 78)
- FLET (Bentley et al., SODA 90)
Disk (external), preprocessing
- BBS (Papadias et al., SIGMOD 03)
- NN (Kossmann et al., VLDB 02)
Disk (external), non-preprocessing
- SD&C (Börzsönyi et al., ICDE 01)
- BNL (Börzsönyi et al., ICDE 01)
- SFS (Chomicki et al., ICDE 03)
- LESS (Godfrey et al., VLDB 05)

14 Our Goal: a non-preprocessing external algorithm with a worst-case guarantee. What is the model of external algorithms?

15 Multi-pass Streaming Model
- CPU processing
- I/O: sequential I/O and random I/O
# of random I/Os = # of passes. The streaming model naturally forces us to minimize the number of random I/Os.


17 Points (1, 2) (3, 7) (5, 3) (2, 5) (4, 1) (9, 9) stored on a huge hard disk and read through a small RAM.

21 Same disk and small RAM, 2nd pass.

22 Same disk and small RAM, 3rd pass.

23 Our Goal: a non-preprocessing streaming algorithm with a worst-case guarantee

24 Theory
RAND: almost optimal multi-pass streaming algorithm for skyline (n = # of points, m = skyline size)
- RAND uses O(log n) passes and O(m) space
- Every algorithm that uses 1 pass needs Ω(n) space
Next: RAND algorithm. Later: experimental results.

25 RAND algorithm 25

26 Algorithms: Main Idea Suppose m is known. Theorem: In 3 passes and m space, we can find skyline points that dominate at least n/2 points, with high probability 26

27 Eliminate-Points algorithm
1. Sample x = 2m ln(mn log n) points p_1, p_2, …, p_x
2. Go through the stream; replace each p_i by a point dominating it
3. For each p_i, delete p_i and all points it dominates
Output p_1, p_2, …, p_x and repeat.
Example stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4). The sampled point (4, 4) is replaced by (3, 4) and then by (3, 3).
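One round of Eliminate-Points can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the stream is a Python list (each loop over it stands in for one sequential pass), sampling is uniform with replacement, both logarithms are taken as natural logs, and the sample size is clamped so it stays positive for tiny inputs. The function name `eliminate_points` is made up for this example.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def eliminate_points(stream: List[Point], m: int, rng: random.Random):
    """One round: returns (skyline points found, points left for later rounds)."""
    n = len(stream)
    # x = 2m ln(mn log n) from the slide; the clamp to e is an assumption for small n.
    inner = max(math.e, m * n * math.log(max(n, 2)))
    x = math.ceil(2 * m * math.log(inner))

    # Pass 1: draw x sample points (uniform, with replacement: an assumption).
    samples = [rng.choice(stream) for _ in range(x)]

    # Pass 2: whenever a stream point dominates a sample, it takes the sample's place.
    for q in stream:
        samples = [q if dominates(q, p) else p for p in samples]

    # Pass 3: drop the final samples and every point they dominate.
    found = set(samples)
    survivors = [q for q in stream
                 if q not in found and not any(dominates(p, q) for p in found)]
    return sorted(found), survivors


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
found, survivors = eliminate_points(stream, m=2, rng=random.Random(0))
print(found, survivors)
```

Because domination is transitive, every sample ends pass 2 as a skyline point, so `found` never contains a dominated point regardless of the random choices.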

41 Analysis Theorem: Eliminate-Points algorithm deletes at least n/2 points with high probability 41

42 Analysis
Draw trees: each point points to its first dominating point.
Stream: (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)
Note: there will be m trees, each rooted at a skyline point.
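The trees can be built explicitly for the example stream. This is a small sketch for illustration only; "first" is read here as first in stream order, and the names `first_dominator_forest` and `root_of` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_dominator_forest(stream: List[Point]) -> Dict[Point, Optional[Point]]:
    """Map each point to the first stream point that dominates it (None for roots)."""
    return {p: next((q for q in stream if dominates(q, p)), None) for p in stream}


def root_of(p: Point, parent: Dict[Point, Optional[Point]]) -> Point:
    """Follow parent links until an undominated (skyline) point is reached."""
    while parent[p] is not None:
        p = parent[p]
    return p


stream = [(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]
parent = first_dominator_forest(stream)
roots = [p for p in stream if parent[p] is None]
print(roots)                     # [(1, 5), (3, 3)]
print(parent[(4, 4)])            # (3, 4)
print(root_of((4, 4), parent))   # (3, 3)
```

The two roots are exactly the two skyline points of the example, so m = 2 here.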

47 Analysis
Claim: every tree from which some element is sampled will be deleted.
Example: in the stream (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4), the sample (4, 4) ends at (3, 3).

48 Analysis: there are m trees, numbered 1, 2, …, m-1, m, each rooted at a skyline point.

50 Analysis: a big tree has a bigger chance of being sampled … and deleted.

51 Analysis: if enough points are sampled, every tree that is big enough will be deleted.

52 Analysis
Lemma: with high probability, all trees of size at least n/(2m) are deleted.
The remaining trees hold fewer than m · n/(2m) = n/2 points, so we delete at least n/2 points in total.
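The lemma can be checked with a short calculation. This is a sketch assuming the x = 2m ln(mn log n) samples are drawn independently and uniformly with replacement from the n points (the slides do not state the sampling scheme), and it uses the earlier claim that a tree containing a sample is deleted. For a tree T with |T| ≥ n/(2m):

```latex
\begin{align*}
\Pr[\text{no sample lands in } T]
  &= \Bigl(1 - \tfrac{|T|}{n}\Bigr)^{x}
   \le \Bigl(1 - \tfrac{1}{2m}\Bigr)^{x}
   \le e^{-x/(2m)}
   = \frac{1}{mn \log n}, \\
\Pr[\text{some tree of size} \ge n/(2m) \text{ is missed}]
  &\le m \cdot \frac{1}{mn \log n}
   = \frac{1}{n \log n}.
\end{align*}
```

The second line is a union bound over the at most m trees, so with probability at least 1 - 1/(n log n) only trees smaller than n/(2m) survive the round.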

53 Extending to RAND
Recall: if we know m, then we can delete n/2 points in 3 passes.
If m is known, we can find the skyline in O(log n) passes with high probability
- We delete at least half of the remaining points every 3 passes
If m is not known
- Guess m by the doubling trick
- Additional O(log m) passes
Fixed-window case
- Memory space is limited
- Random I/Os, sequential I/Os and number of comparisons have to be analyzed separately
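The doubling trick can be sketched as a loop around the three-pass round. The slide does not say how a wrong guess is detected; the rule used here (double the guess whenever a round fails to remove half of the remaining points) is an assumption made for illustration, as are the names `one_round` and `rand_skyline`, the in-memory list standing in for the stream, and sampling with replacement.

```python
import math
import random
from typing import List, Set, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a <= b in every coordinate and a < b in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def one_round(points: List[Point], m_guess: int, rng: random.Random):
    """Sample, replace by dominating points, then delete (three passes)."""
    n = len(points)
    inner = max(math.e, m_guess * n * math.log(max(n, 2)))  # clamp: assumption
    x = math.ceil(2 * m_guess * math.log(inner))
    samples = [rng.choice(points) for _ in range(x)]
    for q in points:
        samples = [q if dominates(q, p) else p for p in samples]
    found = set(samples)
    rest = [q for q in points
            if q not in found and not any(dominates(p, q) for p in found)]
    return found, rest


def rand_skyline(points: List[Point], seed: int = 0) -> List[Point]:
    """Repeat rounds until no points remain, doubling the guess for m when needed."""
    rng = random.Random(seed)
    remaining = list(points)
    skyline: Set[Point] = set()
    m_guess = 1
    while remaining:
        found, rest = one_round(remaining, m_guess, rng)
        skyline |= found
        if len(rest) > len(remaining) / 2:
            m_guess *= 2  # too few points removed: the guess was probably too small
        remaining = rest
    return sorted(skyline)


print(rand_skyline([(1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4)]))  # [(1, 5), (3, 3)]
```

The output is always the exact skyline; the randomness only affects how many rounds are needed, since every round removes at least one skyline point and never removes an undiscovered one.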


55 Theory
RAND: almost optimal multi-pass streaming algorithm for skyline (n = # of points, m = skyline size)
- RAND uses O(log n) passes and O(m) space
- 1 pass needs Ω(n) space
Algorithms comparison (w = window (memory) size)

Algorithm   Random I/Os        Sequential I/Os    Comparisons
BNL(w)      (min{w, n/w})      (min{w, n^2/w})    (d min{wmn, n^2})
LESS(w)     (n log_w (n/w))    (mn/w)             (dmn + n log n)
RAND(w)     O(m log (n/w))     O(mn/w)            O(dmn)

56 Experiment: RAND vs. BNL & LESS, in the average case and the worst case. We try several datasets from the literature: Correlated, Anti-correlated, Independent, Island, House, NBA, Color

57 Experimental Results (RAND vs. BNL & LESS)
Average case
- No clear winner between BNL and LESS
- RAND is always close to the winner

58 Experimental Results (RAND vs. BNL & LESS)
Worse: after sorting by decreasing first coordinate
- RAND is the most robust and usually the fastest

59 Experimental Results (RAND vs. BNL & LESS)
Even worse: after sorting by entropy

60 Summary: the disk read as a stream, random sampling over the m trees, the RAND algorithm, and experiments against BNL & LESS in the average and worst case.

61 Summary
Extensions
- Distributed skyline algorithm
- Derandomize the algorithm for the 2D case
- Skyline for partially ordered sets (posets)
Open problems
- Develop an algorithm on the Parallel Disk Model (PDM) and the cache-oblivious model
- Extend the techniques to pre-processing algorithms
- Is O(log n) passes the best possible?

62 Thank you 62

63 Appendix 63

64 Charts for average case 64


66 The lower bound
Theorem: any randomized one-pass algorithm with space at most n/2 succeeds with probability at most 1/2.
Proof
- Random unique survivor
- 2 points come at the end
- If space <= n/2, the algorithm fails if it did not store the survivor in memory

67 Proof of Claim 67

68 Proof of Claim
Claim: every tree from which some element is sampled will be deleted.
Draw trees: each point points to its first dominating point. In the stream (1, 5), (3, 4), (4, 5), (4, 3), (3, 3), (4, 4), the sample (4, 4) is replaced by (3, 4) and then by (3, 3), moving up its tree.

