Optimizing and Parallelizing Ranked Enumeration. Konstantin Golenberg, Benny Kimelfeld, Yehoshua Sagiv. The Hebrew University of Jerusalem.

Optimizing and Parallelizing Ranked Enumeration
Konstantin Golenberg (The Hebrew University of Jerusalem), Benny Kimelfeld (IBM Research – Almaden), Yehoshua Sagiv (The Hebrew University of Jerusalem)
VLDB 2011, Seattle, WA

2 Background: DB Search at HebrewU
[Screenshot: keyword search for "eu brussels"; demo in SIGMOD’10, implementation in SIGMOD’08, algorithms in PODS’06]
Initial implementation was too slow…
Purchased a multi-core server
Didn’t help: cores were usually idle
–Due to the inherent flow of the enumeration technique we used
Needed a deeper understanding of ranked enumeration to benefit from parallelization
–This paper

Outline
Lawler-Murty’s Ranked Enumeration
Optimizing by Progressive Bounds
Parallelization / Core Utilization
Conclusions

4 Ranked Enumeration
[Figure: a user's problem has a huge number (e.g., 2^|Problem|) of ranked answers: best answer, 2nd best answer, 3rd best answer, ...]
(Can’t afford to instantiate all answers)
Examples:
Various graph optimizations
–Shortest paths
–Smallest spanning trees
–Best perfect matchings
Top results of keyword search on DBs (graph search)
Most probable answers in probabilistic DBs
Best recommendations for schema integration
“Complexity”:
What is the delay between successive answers?
How much time to get top-k? [Here]

5 Abstract Problem Formulation
Input:
O = a collection of objects
A = answers a ⊆ O (huge, described by a condition on A’s subsets)
score(): score(a) is high ⇔ a is of high quality
Goal: find top-k answers a_1, a_2, a_3, …, a_k
[Figure: answers ordered by score, e.g., 32, 31, 28, …, 17]

6 Graph Search in The Abstraction
Input: data graph G, set Q of keywords
O = edges of G
A = subtrees (edge sets) a containing all keywords in Q (w/o redundancy, see [GKS 2008])
score(a): 1/weight(a), IR measures, etc.
Answers a ⊆ O
Goal: find top-k answers

7 What is the Challenge?
1st (top) answer: optimization problem
2nd answer: ?
...
j-th answer: best remaining answer, ≠ previous (j-1) answers
Conceivably, much more complicated than top-1!
How to handle these constraints? (j may be large!)
[Figure: answers over O listed from the start; example scores 32, 31, ..., 17]

8 Lawler-Murty’s Procedure [Murty, 1968] [Lawler, 1972]
Lawler-Murty’s gives a general reduction:
Finding top-k answers ⇐ finding the top-1 answer under simple constraints (if the latter is in PTIME, then so is the former)
We understand optimization much better!
Often, amounts to classical optimization, e.g., shortest path (but sometimes it may get involved, e.g., [KS 2006])
Other general top-k procedure: [Hamacher & Queyranne 84], very similar!

9 Among the Uses of Lawler-Murty’s
Graph/Combinatorial Algorithms:
–Shortest simple paths [Yen 1972]
–Minimum spanning trees [Gabow 1977, Katoh et al., 1981]
–Best solutions in resource allocation [Katoh et al. 1981]
–Best perfect matchings, best cuts [Hamacher & Queyranne 1985]
–Minimum Steiner trees [KS 2006]
Bioinformatics:
–Yen’s algorithm to find sets of metabolites connected by chemical reactions [Takigawa & Mamitsuka 2008]
Data Management:
–ORDER-BY queries [KS 2006, 2007]
–Graph/XML search [GKS 2008]
–Generation of forms over integrated data [Talukdar et al. 2008]
–Course recommendation [Parameswaran & Garcia-Molina 2009]
–Querying Markov sequences [K & Ré 2010]

10 Lawler-Murty’s Method: Conceptual

11 1. Find & Print the Top Answer
In principle, at this point we should find the second-best answer
But instead…

12 2. Partition the Remaining Answers
Partition defined by a set of simple constraints
Inclusion constraint: “must contain [a given object]”
Exclusion constraint: “must not contain [a given object]”

13 3. Find the Top of Each Set

14 4. Find & Print the Second Answer
Next answer: best among all the top answers in the partitions

15 5. Further Divide the Chosen Partition
… and so on … (until k answers are printed)
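The five conceptual steps above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a toy problem in which answers are the m-element subsets of O and the score of an answer is the sum of its object weights, so the constrained top-1 problem is solved greedily. The names `best_under_constraints` and `lawler_murty` are hypothetical.

```python
import heapq
from itertools import count


def best_under_constraints(weights, m, include, exclude):
    """Toy top-1 solver (an assumption of this sketch): the best m-subset
    that contains every object in `include` and none in `exclude`."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # the partition is empty
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer


def lawler_murty(weights, m, k):
    """Return the top-k (score, answer) pairs in non-increasing score order."""
    tie = count()  # tie-breaker so the heap never compares answers
    queue = []     # partition representatives with the best answer of each

    def push(include, exclude):
        best = best_under_constraints(weights, m, include, exclude)
        if best is not None:
            score, answer = best
            heapq.heappush(queue, (-score, next(tie), answer, include, exclude))

    push(frozenset(), frozenset())  # step 1: the unconstrained top answer
    out = []
    while queue and len(out) < k:
        neg, _, answer, include, exclude = heapq.heappop(queue)
        out.append((-neg, answer))  # best among the tops of all partitions
        # Partition the rest of this answer's set with simple constraints:
        # the i-th part must contain the first i-1 free objects of the answer
        # and must not contain the i-th one.
        fixed = include
        for o in sorted(answer - include):
            push(fixed, exclude | {o})
            fixed = fixed | {o}
    return out


weights = {"a": 5, "b": 4, "c": 3, "d": 1}
print([score for score, _ in lawler_murty(weights, 2, 4)])  # [9, 8, 7, 6]
```

Each printed answer costs one constrained top-1 computation per new partition, which is the work that the later slides freeze and parallelize.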

16 Lawler-Murty’s: Actual Execution
[Figure: Output (printed already) and Partition Reps. + Best of Each; the best of these is printed next]

17 Lawler-Murty’s: Actual Execution
For each new partition, a task to find the best answer
[Figure: Output; Partition Reps. + Best of Each]

18 Lawler-Murty’s: Actual Execution
[Figure: Output; Partition Reps. + Best of Each, with the best one marked]

20 Typical Bottleneck
[Figure: Output; Partition Reps. + Best of Each]

21 Typical Bottleneck
[Figure: Output; Partition Reps. + Best of Each, with newly computed answers]
In top k?

22 Progressive Upper Bound
Throughout the execution, an optimization alg. often upper bounds its final solution’s score
Progressive: bound gets smaller in time
Often, nontrivial bounds, e.g.,
–Dijkstra's algorithm: distance at the top of the queue
Similarly: some Steiner-tree algorithms [DreyfusWagner72]
–Viterbi algorithms: max intermediate probability
–Primal-dual methods: value of dual LP solution
[Figure: upper bounds shrinking over time: ≤24, ≤22, ≤18, ≤14]
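The Dijkstra example can be made concrete with a small sketch. It assumes non-negative edge weights and a score of the form 1/weight, so that the distance at the top of the queue, a lower bound on the final distance, translates into an upper bound on the final score. The generator `dijkstra_with_bounds` and its event names are hypothetical.

```python
import heapq


def dijkstra_with_bounds(graph, source, target):
    """Yield ("bound", d) after each queue pop and finally ("done", d).

    Every node still in the queue is at distance >= d, so d is a lower bound
    on the final source-target distance; the bound only tightens over time.
    """
    dist = {source: 0}
    queue = [(0, source)]
    settled = set()
    while queue:
        d, u = heapq.heappop(queue)
        if u in settled:
            continue  # stale queue entry
        settled.add(u)
        if u == target:
            yield ("done", d)
            return
        yield ("bound", d)  # the caller may pause (freeze) the task here
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    yield ("done", float("inf"))  # target unreachable


graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
print(list(dijkstra_with_bounds(graph, "s", "t")))
# [('bound', 0), ('bound', 1), ('bound', 2), ('done', 3)]
```

Because the task is a generator, a scheduler can stop resuming it as soon as its bound shows that it cannot produce the next answer.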

23 Freezing Tasks (Simplified)
[Figure: Output; Partition Reps. + Best of Each]

24 Freezing Tasks (Simplified)
[Figure: tasks annotated with progressive upper bounds: ≤24, ≤23 and ≤24, ≤23, ≤22]

25 Freezing Tasks (Simplified)
[Figure: a task with bounds ≤24, ≤23, ≤20 next to a computed answer of score 22; 22 > 20]

26 Freezing Tasks (Simplified)
[Figure: the best answer is selected; a task's bounds continue ≤24, ≤23, ≤20, ≤18, ≤16, ≤15]
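The freezing idea of these slides can be sketched as a scheduler over resumable tasks. This is a simplified illustration under stated assumptions, not the paper's algorithm: tasks are hypothetical scripted generators that yield shrinking upper bounds and then an exact score, and the re-partitioning step of Lawler-Murty's procedure is left out. The names `scripted_task` and `top_k_with_freezing` are invented for the example.

```python
import heapq
from itertools import count


def scripted_task(bounds, final):
    """Hypothetical task: yields shrinking upper bounds, then its exact score."""
    for b in bounds:
        yield ("bound", b)
    yield ("done", final)


def top_k_with_freezing(tasks, k):
    """tasks: list of (initial_upper_bound, generator) pairs.

    Always resume the task with the highest upper bound, one step at a time.
    A task whose bound drops below another entry stays frozen in the queue.
    Returns (scores printed, number of task steps executed).
    """
    tie = count()
    queue = [(-bound, next(tie), False, task) for bound, task in tasks]
    heapq.heapify(queue)
    out, steps = [], 0
    while queue and len(out) < k:
        neg, _, finished, task = heapq.heappop(queue)
        if finished:
            # Exact score at the top: no frozen task can beat it.
            out.append(-neg)
            continue
        kind, value = next(task)  # thaw the task for a single step
        steps += 1
        heapq.heappush(queue, (-value, next(tie), kind == "done", task))
    return out, steps


slow = scripted_task([23, 20, 18, 16, 15], 15)  # would need 6 steps to finish
fast = scripted_task([23, 22], 22)              # needs 3 steps to finish
print(top_k_with_freezing([(24, slow), (24, fast)], 1))  # ([22], 5)
```

In the usage example the slow task is frozen after two steps, once its bound (≤20) falls below the other task's bound (≤22), so 5 steps are executed instead of the 9 needed to run both tasks to completion.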

27 Improvement of Freezing
Experiments: Graph Search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Simple Lawler-Murty vs. w/ Freezing on Mondial (k = 10, 100), DBLP (part) (k = 10, 100), DBLP (full) (k = 10, 100)]
On average, freezing saved 56% of the running time

29 Straightforward Parallelization
[Figure: Awaiting Tasks; Output]

30 Straightforward Parallelization
[Figure: Awaiting Tasks; Output, with newly computed answers]

31 Straightforward Parallelization
[Figure: Awaiting Tasks; Output, with the computed answers merged into the queue]

Not so fast…
Typical: reduced 30% of the running time
Same for 2, 3, …, 8 threads!

33 Idle Cores while Waiting
[Figure: Awaiting Tasks; Output]

34 Idle Cores while Waiting
[Figure: Awaiting Tasks; Output; cores marked idle while the remaining tasks complete]
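The straightforward scheme of the preceding slides can be sketched as follows. This is a guess at its control flow, not the authors' code: the tasks of the new partitions run concurrently, and the main loop waits for all of them before popping the next answer. It reuses a toy problem (answers are m-element subsets, score is the sum of weights), and all names are hypothetical. In CPython, threads would not speed up CPU-bound solvers, so the sketch only shows where the waiting happens.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import count


def best_under_constraints(weights, m, include, exclude):
    """Toy top-1 solver (an assumption): best m-subset under the constraints."""
    free = sorted((o for o in weights if o not in include and o not in exclude),
                  key=lambda o: (-weights[o], o))
    need = m - len(include)
    if need < 0 or need > len(free):
        return None
    answer = frozenset(include) | frozenset(free[:need])
    return sum(weights[o] for o in answer), answer


def parallel_lawler_murty(weights, m, k, threads=4):
    tie = count()
    queue, out = [], []

    def solve(partition):
        include, exclude = partition
        return best_under_constraints(weights, m, include, exclude), include, exclude

    with ThreadPoolExecutor(max_workers=threads) as pool:
        new_partitions = [(frozenset(), frozenset())]
        while len(out) < k:
            # One task per new partition. Consuming pool.map blocks until
            # every task has finished; threads that finish early sit idle
            # until the slowest task is done.
            for best, include, exclude in pool.map(solve, new_partitions):
                if best is not None:
                    score, answer = best
                    heapq.heappush(queue, (-score, next(tie), answer, include, exclude))
            if not queue:
                break
            neg, _, answer, include, exclude = heapq.heappop(queue)
            out.append((-neg, answer))
            new_partitions, fixed = [], include
            for o in sorted(answer - include):
                new_partitions.append((fixed, exclude | {o}))
                fixed = fixed | {o}
    return out


weights = {"a": 5, "b": 4, "c": 3, "d": 1}
print([s for s, _ in parallel_lawler_murty(weights, 2, 4)])  # [9, 8, 7, 6]
```

The number of concurrent tasks per round is bounded by the size of one answer, and the round lasts as long as its slowest task, which fits the observation that adding threads beyond two did not help.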

35 Early Popping
[Figure: Awaiting Tasks with progressive bounds (≤24, ≤23, ≤22, ≤20, ≤19) and Output; a computed answer of score 22 is compared with a bound: 22 > 20]
Skipped issues:
Thread synchronization
–semaphores, locking, etc.
Correctness

36 Improvement of Early Popping
Experiments: Graph Search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial and DBLP (part), each for short, medium-size & long queries]

37 Early Popping vs. (Serial) Freezing
Experiments: Graph Search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial and DBLP (part), each for short, medium-size & long queries]
Need 4 threads to start gaining
And even then, fairly poor…

38 Combining Freezing & Early Popping
We discuss additional ideas and techniques to further utilize the cores
–Not here, see the paper
Main speedup by combining early popping with freezing
–Cores kept busy… on high-potential tasks
–Thread synchronization is quite involved
At the high level, the final algorithm has the following flow:

39 Combining: General Idea
Threads work on frozen tasks
[Figure: Partition Reps. as Frozen Tasks (frozen + new tasks); Computed Answers (to-print); Output]

40 Combining: General Idea
Threads work on frozen tasks
[Figure: Partition Reps. as Frozen Tasks (frozen + new tasks); Computed Answers (to-print); Output]

41 Combining: General Idea
Threads work on frozen tasks
Main task just pops computed results to print
… but validates: no better results by frozen tasks
[Figure: Partition Reps. as Frozen Tasks (frozen + new tasks); Computed Answers (to-print); Output]

42 Combined vs. (Serial) Freezing
Experiments: Graph Search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial, DBLP]
Now, significant gain (≈50%) already w/ 2 threads

43 Improvement of Combined
Experiments: Graph Search; 2 Intel Xeon processors (2.67GHz), 4 cores each (8 total); 48GB memory
[Charts: Mondial, DBLP; annotated ranges: 4%-5%, 3%-10%]
On average, with 8 threads we got 5.7% of the original running time

45 Conclusions
Considered Lawler-Murty’s ranked enumeration
–Theoretical complexity guarantees
–…but a direct implementation is very slow
–Straightforward parallelization poorly utilizes cores
Ideas: progressive bounds, freezing, early popping
–In the paper: additional ideas, combination of ideas
Most significant speedup by combining these ideas
–Flow substantially differs from the original procedure
–20x faster on 8 cores
Test case: graph search; focus: general apps
–Future: additional test cases
Questions?
