Published by Jordan McKinney. Modified about 1 year ago.

1
Optimizing and Parallelizing Ranked Enumeration
Konstantin Golenberg (The Hebrew University of Jerusalem)
Benny Kimelfeld (IBM Research – Almaden)
Yehoshua Sagiv (The Hebrew University of Jerusalem)
VLDB 2011, Seattle, WA

2
Background: DB Search at HebrewU
[Screenshot: search box with the query “eu brussels”]
Initial implementation was too slow…
Purchased a multi-core server
Didn’t help: cores were usually idle
–Due to the inherent flow of the enumeration technique we used
Needed a deeper understanding of ranked enumeration to benefit from parallelization
–This paper; demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06

3
Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

4
Ranked Enumeration
[Figure: a user poses a problem and receives ranked answers: best answer, 2nd best answer, 3rd best answer, ...; callout label “Here”]
Huge number (e.g., 2^|Problem|) of ranked answers (can’t afford to instantiate all answers)
Examples:
Various graph optimizations
–Shortest paths
–Smallest spanning trees
–Best perfect matchings
Top results of keyword search on DBs (graph search)
Most probable answers in probabilistic DBs
Best recommendations for schema integration
“Complexity”:
What is the delay between successive answers?
How much time to get top-k?

5
Abstract Problem Formulation
O = a collection of objects (input)
A = answers a ⊆ O; huge, described by a condition on A’s subsets
score(): score(a) is high ⇔ a is of high quality
Goal: find top-k answers a1, a2, a3, …, ak

6
Graph Search in the Abstraction
Input: data graph G, set Q of keywords
O = edges of G
A = subtrees (edge sets) a containing all keywords in Q (w/o redundancy, see [GKS 2008])
score(a): 1/weight(a), IR measures, etc.
Goal: find top-k answers

7
What is the Challenge?
[Figure: answers over O, enumerated from “start”]
1st (top) answer: optimization problem
2nd answer: ?
jth answer: best remaining answer, ≠ previous (j-1) answers
–Conceivably, much more complicated than top-1!
–How to handle these constraints? (j may be large!)

8
Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]
Lawler-Murty’s gives a general reduction:
Finding top-k answers → finding top-1 answer under simple constraints
–If the top-1 problem is in PTIME, then so is the top-k problem
We understand optimization much better!
–Often, amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])
Other general top-k procedure: [Hamacher & Queyranne 84], very similar!

9
Among the Uses of Lawler-Murty’s
Graph/Combinatorial Algorithms:
–Shortest simple paths [Yen 1972]
–Minimum spanning trees [Gabow 1977, Katoh et al. 1981]
–Best solutions in resource allocation [Katoh et al. 1981]
–Best perfect matchings, best cuts [Hamacher & Queyranne 1985]
–Minimum Steiner trees [KS 2006]
Bioinformatics:
–Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]
Data Management:
–ORDER-BY queries [KS 2006, 2007]
–Graph/XML search [GKS 2008]
–Generation of forms over integrated data [Talukdar et al. 2008]
–Course recommendation [Parameswaran & Garcia-Molina 2009]
–Querying Markov sequences [K & Ré 2010]

10
Lawler-Murty’s Method: Conceptual
[Figure: the space of answers, from “start”]

11
1. Find & Print the Top Answer
[Figure: the top answer is moved to the output]
In principle, at this point we should find the second-best answer
But instead…

12
2. Partition the Remaining Answers
Partition defined by a set of simple constraints
–Inclusion constraint: “must contain” a given object
–Exclusion constraint: “must not contain” a given object
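One common way to realize such a partition is sketched below. It rests on assumptions that the slide does not state: answers are sets of objects, no answer is a proper superset of another, and the hypothetical helper `split_partition` receives the answer just printed together with the constraints of the partition it came from. Each new partition forces a prefix of the answer's objects and excludes the next one, so the partitions are pairwise disjoint and jointly cover every remaining answer.

```python
from typing import FrozenSet, List, Tuple

# A partition is described by (must contain, must not contain).
Constraint = Tuple[FrozenSet[str], FrozenSet[str]]


def split_partition(answer: FrozenSet[str],
                    include: FrozenSet[str],
                    exclude: FrozenSet[str]) -> List[Constraint]:
    """Split the partition (include, exclude) after its best answer was printed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts: List[Constraint] = []
    forced = set(include)
    # Objects already forced by `include` cannot be excluded, so skip them.
    for obj in sorted(answer - include):
        # New partition: keep everything forced so far, forbid `obj`.
        parts.append((frozenset(forced), frozenset(exclude | {obj})))
        forced.add(obj)
    return parts
```

For an answer {a, b, c} taken from the unconstrained partition, this yields "without a", "with a, without b", and "with a and b, without c".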

13
3. Find the Top of Each Set
[Figure: the top answer of each partition is marked]

14
4. Find & Print the Second Answer
Next answer: best among all the top answers in the partitions

15
5. Further Divide the Chosen Partition
… and so on … (until k answers are printed)
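The five steps can be put together in a short sketch. The toy problem is an assumption chosen for brevity, not the graph search of the talk: answers are subsets of exactly `m` objects, the score is the sum of the object weights, and the hypothetical `best_under` solves the top-1 problem under inclusion and exclusion constraints by a greedy choice.

```python
import heapq
from itertools import count


def best_under(weights, m, include, exclude):
    """Top-1 under constraints: best m-subset containing `include`, avoiding `exclude`."""
    if len(include) > m or include & exclude:
        return None
    pool = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = m - len(include)
    if len(pool) < need:
        return None  # the partition is empty
    answer = frozenset(include) | frozenset(pool[:need])
    return sum(weights[o] for o in answer), answer


def top_k(weights, m, k):
    """Enumerate the k best m-subsets in decreasing score order."""
    tie = count()   # tie-breaker so the heap never compares sets
    queue = []      # max-heap (via negated score) of partition representatives

    def push(include, exclude):
        best = best_under(weights, m, include, exclude)
        if best is not None:
            score, answer = best
            heapq.heappush(queue, (-score, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())                 # step 1: top answer overall
    out = []
    while queue and len(out) < k:
        neg, _, answer, include, exclude = heapq.heappop(queue)
        out.append((-neg, answer))                 # steps 1 and 4: print the best
        forced = set(include)
        for obj in sorted(answer - include):       # steps 2 and 5: split partition
            push(frozenset(forced), exclude | {obj})   # step 3: top of each part
            forced.add(obj)
    return out


weights = {"a": 5, "b": 4, "c": 3, "d": 1}
print([score for score, _ in top_k(weights, 2, 4)])  # [9, 8, 7, 6]
```

Every call to `push` runs one constrained top-1 task, which is where the running time goes in the slides that follow.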

16
Lawler-Murty’s: Actual Execution
[Figure: output list (printed already) next to a queue of partition representatives with the best answer of each partition; the best entry is marked]

17
Lawler-Murty’s: Actual Execution
[Figure: output list and queue of partition representatives with the best of each]
For each new partition, a task to find the best answer

18
Lawler-Murty’s: Actual Execution
[Figure: output list and queue of partition representatives with the best of each; the best entry is marked]

20
Typical Bottleneck
[Figure: output list and queue of partition representatives with the best of each]

21
Typical Bottleneck
[Figure: output list and queue of partition representatives with the best of each]
In top k?

22
Progressive Upper Bound
Throughout the execution, an optimization algorithm often upper-bounds its final solution’s score
Progressive: the bound gets smaller over time
Often, nontrivial bounds, e.g.,
–Dijkstra's algorithm: distance at the top of the queue (similarly: some Steiner-tree algorithms [DreyfusWagner72])
–Viterbi algorithms: max intermediate probability
–Primal-dual methods: value of dual LP solution
[Figure: bound over time: ≤24, ≤22, ≤18, ≤14]
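The Dijkstra case can be illustrated with a minimal sketch. It assumes nonnegative edge weights and takes the score of a path to be its negated length (an assumption made here so that "higher is better"); the generator name `shortest_path_task` and the event tuples are hypothetical. Whenever a node is settled at distance d, every path to the target costs at least d, so -d is an upper bound on the final score, and it can only decrease.

```python
import heapq


def shortest_path_task(graph, source, target):
    """Dijkstra as a task that reports progressive bounds.

    Yields ("bound", b) while running and ("done", s) at the end,
    where score = -distance (an illustrative convention).
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale queue entry
        settled.add(u)
        if u == target:
            yield ("done", -d)
            return
        # The distance at the top of the queue never decreases, so the
        # score of the final answer is at most -d.
        yield ("bound", -d)
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    yield ("done", float("-inf"))  # target unreachable


graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
print(list(shortest_path_task(graph, "s", "t")))
# [('bound', 0), ('bound', -1), ('bound', -2), ('done', -3)]
```

Writing the task as a generator lets a caller suspend it after any bound and resume it later.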

23
Freezing Tasks (Simplified)
[Figure: output list and queue of partition representatives with the best of each]

24
Freezing Tasks (Simplified)
[Figure: tasks in the queue carry upper bounds: ≤24, ≤23]

25
Freezing Tasks (Simplified)
[Figure: bounds ≤24, ≤23, ≤20; a computed answer with score 22 is compared (“22 >”) against a bound]

26
Freezing Tasks (Simplified)
[Figure: bounds ≤24, ≤23, ≤20, ≤18, ≤16, ≤15; labels “best” and 15]
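A simplified reading of these frames, as a sketch: each task is a generator that yields progressively smaller upper bounds and finally its answer's score; the scheduler always resumes the task with the highest current bound and stops as soon as a finished task is on top, leaving the other tasks frozen. The names `toy_task` and `best_with_freezing` and the bound sequences are hypothetical, and the scheduling details in the paper may differ.

```python
import heapq
from itertools import count


def toy_task(bounds, final):
    """Stand-in for an optimization task: reports bounds, then its final score."""
    for b in bounds:
        yield ("bound", b)
    yield ("done", final)


def best_with_freezing(tasks):
    """Return (best final score, number of task steps executed)."""
    tie = count()  # tie-breaker so the heap never compares generators
    # Max-heap on the current bound; unknown bounds start at +infinity.
    heap = [(-float("inf"), next(tie), False, t) for t in tasks]
    heapq.heapify(heap)
    steps = 0
    while heap:
        neg_key, _, finished, task = heapq.heappop(heap)
        if finished:
            # Every other task is bounded by at most this score,
            # so it stays frozen and is not run to completion.
            return -neg_key, steps
        kind, value = next(task)  # thaw: advance the task by one step
        steps += 1
        heapq.heappush(heap, (-value, next(tie), kind == "done", task))
    return None, steps


tasks = [toy_task([23, 20, 18], 17),
         toy_task([24, 22], 22),
         toy_task([20, 15, 12, 10], 9)]
print(best_with_freezing(tasks))  # (22, 6)
```

Running all three toy tasks to completion would take 12 steps; with freezing, 6 steps suffice to certify that 22 is the best score.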

27
Improvement of Freezing
Experiments: Graph Search
2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Simple Lawler-Murty vs. w/ Freezing; panels: Mondial (k = 10, 100), DBLP (part) (k = 10, 100), DBLP (full) (k = 10, 100)]
On average, freezing saved 56% of the running time

29
Straightforward Parallelization
[Figure, animated over three slides: awaiting tasks and the output list]

32
Not so fast…
Typical: running time reduced by 30%
Same for 2, 3, …, 8 threads!

33
Idle Cores while Waiting
[Figure, animated over two slides: awaiting tasks and the output list; cores without a task are marked “idle”]
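The waiting shown on these slides can be seen in a minimal sketch of the straightforward scheme, written with Python's standard `concurrent.futures`. The function name `solve_new_partitions` and the `solve` callback are hypothetical. After an answer is printed, one top-1 task per new partition is submitted, and the main thread blocks until all of them are done before it can choose the next answer.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_new_partitions(solve, partitions, workers=4):
    """Run one top-1 task per new partition and wait for all of them.

    `solve` maps a partition description to its best answer (illustrative).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(solve, p) for p in partitions]
        # Barrier: the next answer cannot be chosen before every task has
        # finished, so workers that finish early (or never received a task,
        # when there are fewer partitions than workers) sit idle.
        return [f.result() for f in futures]


print(solve_new_partitions(lambda p: p * p, [1, 2, 3]))  # [1, 4, 9]
```

With only a handful of new partitions per printed answer and one long task among them, adding workers beyond the number of partitions cannot help, which matches the flat curve reported for 2 to 8 threads. In standard CPython builds, threads also share the interpreter lock, so this sketch shows the control flow rather than a real speedup for pure-Python tasks.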

35
Early Popping
[Figure: awaiting tasks with bounds ≤24, ≤23, ≤22, ≤20, ≤19 and the output list; comparison 22 > 20]
Skipped issues:
Thread synchronization
–semaphores, locking, etc.
Correctness

36
Improvement of Early Popping
Experiments: Graph Search
[Charts: Mondial (short, medium-size & long queries); DBLP (part) (short, medium-size & long queries)]

37
Early Popping vs. (Serial) Freezing
Experiments: Graph Search
[Charts: Mondial (short, medium-size & long queries); DBLP (part) (short, medium-size & long queries)]
Need 4 threads to start gaining
And even then, fairly poor…

38
Combining Freezing & Early Popping
We discuss additional ideas and techniques to further utilize the cores
–Not here, see the paper
Main speedup by combining early popping with freezing
–Cores kept busy… on high-potential tasks
–Thread synchronization is quite involved
At the high level, the final algorithm has the following flow:

39
Combining: General Idea
[Figure, animated over two slides: a queue of partition representatives kept as frozen tasks, a list of computed answers (to-print), and the output]
Threads work on frozen tasks
–They produce frozen + new tasks and computed answers

41
Combining: General Idea
[Figure: frozen tasks, computed answers (to-print), and the output]
Threads work on frozen tasks
Main task just pops computed results to print
–… but validates: no better results by frozen tasks

42
Combined vs. (Serial) Freezing
Experiments: Graph Search
[Charts: Mondial; DBLP]
Now, significant gain (≈50%) already w/ 2 threads

43
Improvement of Combined
Experiments: Graph Search
[Charts: Mondial; DBLP; callouts: 4%-5%, 3%-10%]
On average, with 8 threads we got 5.7% of the original running time
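As a reading aid, a fraction of the original running time converts to a speedup factor by taking its reciprocal. The arithmetic below uses only the percentages quoted on this slide; the symbols T_original and T_combined are introduced here for illustration.

```latex
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{combined}}}
  = \frac{1}{0.057} \approx 17.5,
\qquad
\frac{1}{0.05} = 20, \quad \frac{1}{0.04} = 25,
\qquad
\frac{1}{0.10} = 10, \quad \frac{1}{0.03} \approx 33.3 .
```

So the 5.7% average corresponds to roughly 17.5x, the 4%-5% range to 20x-25x, and the 3%-10% range to about 10x-33x.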

45
Conclusions
Considered Lawler-Murty’s ranked enumeration
–Theoretical complexity guarantees
–…but a direct implementation is very slow
–Straightforward parallelization poorly utilizes cores
Ideas: progressive bounds, freezing, early popping
–In the paper: additional ideas, combination of ideas
Most significant speedup by combining these ideas
–Flow substantially differs from the original procedure
–20x faster on 8 cores
Test case: graph search; focus: general apps
–Future: additional test cases
Questions?
