Presentation transcript: "A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches"

Slide 1: A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
Martin Burtscher (1) and Hassan Rabeti (2)
(1) Department of Computer Science, Texas State University-San Marcos
(2) Department of Mathematics, Texas State University-San Marcos

Slide 2: Problem: HPC is Hard to Exploit
- HPC application writers are domain experts
  - They are typically not computer scientists and have little or no formal education in parallel programming
  - Parallel programming is difficult and error prone
- Modern HPC systems are complex
  - They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  - They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Slide 3: Target Area: Iterative Local Searches
- Important application domain
  - Widely used in engineering & real-time environments
- Examples
  - All sorts of random-restart greedy algorithms
  - Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
- ILS properties
  - Iteratively produce better solutions
  - Can exploit large amounts of parallelism
  - Often have an exponential search space
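For readers unfamiliar with the pattern, the following is a minimal, self-contained C sketch of a random-restart iterative local search. The toy objective (finding the square root of 2 by minimizing f(x) = (x*x - 2)^2) and all names are illustrative and are not taken from the paper or the ILCS code.

/* Toy random-restart hill climber: restart from many seeds, descend to a local
   optimum each time, keep the overall champion. Illustration of ILS only. */
#include <stdio.h>
#include <stdlib.h>

static double f(double x) { return (x * x - 2.0) * (x * x - 2.0); }

int main(void) {
  double best_x = 0.0, best_f = f(0.0);
  for (unsigned long seed = 0; seed < 1000; seed++) {      /* independent restarts */
    srand((unsigned)seed);
    double x = (rand() / (double)RAND_MAX) * 10.0 - 5.0;   /* random start in [-5, 5] */
    double step = 0.5;
    while (step > 1e-9) {                                  /* greedy local descent */
      if (f(x + step) < f(x))       x += step;
      else if (f(x - step) < f(x))  x -= step;
      else                          step *= 0.5;           /* no improvement: shrink step */
    }
    if (f(x) < best_f) { best_f = f(x); best_x = x; }      /* keep the champion */
  }
  printf("champion: x = %.9f, f(x) = %.3g\n", best_x, best_f);
  return 0;
}

Because each restart is independent, the seeds can be evaluated in parallel with essentially no coordination, which is what makes this class of codes such a good fit for large heterogeneous machines.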

Slide 4: Our Solution: ILCS Framework
- Iterative Local Champion Search (ILCS) framework
  - Supports non-random restart heuristics (genetic algorithms, tabu search, particle swarm optimization, etc.)
  - Simplifies the implementation of ILS on parallel systems
- Design goals
  - Ease of use and scalability
- Framework benefits
  - Handles threading, communication, locking, resource allocation, heterogeneity, load balancing, the termination decision, and result recording (checkpointing)

Slide 5: User Interface
- The user writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:
  size_t CPU_Init(int argc, char *argv[]);
  void CPU_Exec(long seed, void const *champion, void *result);
  void CPU_Output(void const *champion);
- See the paper for the GPU interface and sample code
- The framework runs the Exec (map) functions in parallel
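To make the interface concrete, below is a sketch of the three CPU functions for the same toy problem used above. The slide gives only the signatures, so the semantics assumed here are guesses: that CPU_Init returns the size in bytes of one solution record, that CPU_Exec evaluates one seed and writes its best local optimum into result, and that champion points to the best solution found so far. How the framework compares results is not shown on this slide.

/* Hypothetical user code for ILCS; the buffer semantics are assumptions, and the
   toy problem (approximating sqrt(2)) is illustrative, not from the paper. */
#include <stdio.h>
#include <string.h>

typedef struct { double x; double err; } Sol;   /* hypothetical solution record */

static double f(double x) { return (x * x - 2.0) * (x * x - 2.0); }

size_t CPU_Init(int argc, char *argv[]) {
  (void)argc; (void)argv;                       /* no command-line input needed here */
  return sizeof(Sol);                           /* assumed: size of one solution record */
}

void CPU_Exec(long seed, void const *champion, void *result) {
  (void)champion;                               /* this toy search ignores the champion */
  double x = (double)((unsigned long)seed % 10000) / 1000.0 - 5.0;  /* start point from the seed */
  double step = 0.5;
  while (step > 1e-9) {                         /* greedy local descent */
    if (f(x + step) < f(x))      x += step;
    else if (f(x - step) < f(x)) x -= step;
    else                         step *= 0.5;
  }
  Sol s = { x, f(x) };
  memcpy(result, &s, sizeof(Sol));              /* hand the local optimum back to the framework */
}

void CPU_Output(void const *champion) {
  Sol const *s = (Sol const *)champion;
  printf("best x = %.9f (error %.3g)\n", s->x, s->err);
}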

Slide 6: Internal Operation: Threading
[diagram] On each node:
- The ILCS master thread starts, then forks one worker thread per CPU core and one handler thread per GPU
- CPU workers evaluate seeds and record their local optima
- GPU workers evaluate seeds and record their local optima; the handlers launch the GPU code, sleep, and record the results
- The master sporadically determines the global optimum via MPI and sleeps in between
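The rough shape of this structure can be sketched with POSIX threads and MPI, as below. Everything beyond the slide's description is an illustrative assumption: 4 workers, a scalar best cost, a 2-second reduction interval, a toy evaluate_seed, no GPU handler threads, and no checkpointing. This is not the ILCS implementation.

/* Sketch of a per-node master/worker structure with periodic MPI reduction. */
#include <mpi.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>

#define WORKERS 4                        /* assumed: one worker per CPU core */

static double local_best = 1e300;        /* best cost found on this node so far */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int done = 0;            /* termination flag set by the master */

static double evaluate_seed(long seed) { /* toy stand-in for the user's CPU_Exec */
  double x = (double)(seed % 10000) / 1000.0 - 5.0;
  return (x * x - 2.0) * (x * x - 2.0);
}

static void *worker(void *arg) {
  long seed = (long)(size_t)arg;         /* workers stride through the seed range
                                            (per-node seed offsets from slide 7 omitted) */
  while (!done) {
    double cost = evaluate_seed(seed);
    seed += WORKERS;
    pthread_mutex_lock(&lock);
    if (cost < local_best) local_best = cost;   /* record the local optimum */
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(int argc, char *argv[]) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  pthread_t tid[WORKERS];
  for (long i = 0; i < WORKERS; i++)     /* master forks a worker per core */
    pthread_create(&tid[i], NULL, worker, (void *)(size_t)i);
  for (int round = 0; round < 10; round++) {
    sleep(2);                            /* master sleeps between reductions */
    pthread_mutex_lock(&lock);
    double mine = local_best;
    pthread_mutex_unlock(&lock);
    double global_best;
    MPI_Allreduce(&mine, &global_best, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    if (round == 9) printf("best cost found: %g\n", global_best);
  }
  done = 1;                              /* tell the workers to stop */
  for (int i = 0; i < WORKERS; i++) pthread_join(tid[i], NULL);
  MPI_Finalize();
  return 0;
}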

Slide 7: Internal Operation: Seed Distribution
- Example: 4 nodes, each with 4 CPU cores (a, b, c, d) and 2 GPUs (1, 2)
- [diagram] Each node gets a chunk of the 64-bit seed range; the CPUs process their chunk bottom up while the GPUs process it top down
- Benefits
  - Balanced workload irrespective of the number of CPU cores or GPUs (or their relative performance)
  - Users can generate other distributions from the seeds
  - Any injective mapping results in no redundant evaluations
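The chunking arithmetic below is one illustrative reading of this slide, not the framework's actual code: the 64-bit seed range is split into equal per-node chunks, and within a chunk the CPUs count up from the low end while the GPUs count down from the high end.

/* Illustrative partitioning of the 64-bit seed range across nodes. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
  const uint64_t nodes = 4;
  const uint64_t chunk = UINT64_MAX / nodes;          /* equal chunk per node */
  for (uint64_t n = 0; n < nodes; n++) {
    uint64_t lo = n * chunk;                          /* CPUs start here and count up   */
    uint64_t hi = (n == nodes - 1) ? UINT64_MAX       /* GPUs start here and count down */
                                   : (n + 1) * chunk - 1;
    printf("node %llu: seeds [%llu, %llu]\n",
           (unsigned long long)n, (unsigned long long)lo, (unsigned long long)hi);
    /* The two fronts meet wherever the relative CPU and GPU speeds dictate,
       so the workload balances itself without any explicit scheduling. */
  }
  return 0;
}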

Slide 8: Related Work
- MapReduce/Hadoop/MARS and PADO
  - Their generality and features that ILS does not need incur overhead and steepen the learning curve
  - Some do not support accelerators; some require Java
- The ILCS framework is optimized for ILS applications
  - The reduction is built in, multiple keys are not required, no secondary storage is needed to buffer data, non-random restart heuristics are directly supported, early termination is allowed, GPUs and MICs are supported, and the targets range from single-node workstations to HPC clusters

Slide 9: Evaluation Methodology
- Three HPC systems (at TACC and NICS)
- Largest tested configuration
[The system table and photo (credit: datacenterknowledge.com) shown on the slide are not reproduced in this transcript.]

Slide 10: Sample ILS Codes
- Traveling Salesman Problem (TSP)
  - Find the shortest tour
  - 4 inputs from TSPLIB
  - 2-opt hill climbing
- Finite State Machine (FSM)
  - Find the best FSM configuration to predict hit/miss events
  - 4 sizes (n = 3, 4, 5, 6)
  - Monte Carlo method
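As background on the TSP code: a 2-opt move removes two edges of the tour and reconnects the two resulting paths, which amounts to reversing a tour segment. The sketch below is the generic textbook formulation of evaluating and applying such a move, not the paper's CPU or GPU implementation.

/* Generic 2-opt move on a TSP tour t[0..n-1] over cities with coordinates x[], y[]. */
#include <math.h>

static double dist(const double x[], const double y[], int a, int b) {
  return sqrt((x[a]-x[b])*(x[a]-x[b]) + (y[a]-y[b])*(y[a]-y[b]));
}

/* Change in tour length if edges (t[i],t[i+1]) and (t[j],t[j+1]) are replaced by
   (t[i],t[j]) and (t[i+1],t[j+1]); a negative value means the move shortens the tour. */
double two_opt_delta(const int t[], const double x[], const double y[],
                     int i, int j, int n) {
  int a = t[i], b = t[(i + 1) % n], c = t[j], d = t[(j + 1) % n];
  return dist(x, y, a, c) + dist(x, y, b, d)
       - dist(x, y, a, b) - dist(x, y, c, d);
}

/* Apply the move by reversing tour positions i+1 .. j. */
void two_opt_apply(int t[], int i, int j) {
  for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
    int tmp = t[lo]; t[lo] = t[hi]; t[hi] = tmp;
  }
}

A 2-opt hill climber simply keeps applying improving moves (negative delta) until none remain, which yields the local optimum recorded for each seed.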

Slide 11: FSM Transitions/Second Evaluated
[chart] Peak throughput: 21,532,197,798,304 transitions per second. Annotations: GPU shared-memory limit; Ranger uses twice as many cores as Stampede.

Slide 12: TSP Tour-Changes/Second Evaluated
[chart] Peak throughput: 12,239,050,704,370 tour changes per second (counted relative to the serial CPU code). Annotations: the CPU pre-computes distances, using O(n^2) memory, whereas the GPU re-computes them, using O(n) memory; each core evaluates a tour change every 3.6 cycles.
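The O(n^2) versus O(n) memory note refers to storing a full pairwise distance matrix versus keeping only the 2n coordinates and recomputing each distance on demand. The sketch below illustrates the two options; it is not the paper's code, and the function names are made up for illustration.

/* Two ways to obtain the distance between cities a and b. */
#include <math.h>
#include <stdlib.h>

/* Option 1 (CPU on the slide): pre-compute an n x n matrix -> O(n^2) memory,
   one memory load per lookup afterwards. */
double *build_matrix(const double x[], const double y[], int n) {
  double *d = malloc((size_t)n * n * sizeof(double));
  for (int a = 0; a < n; a++)
    for (int b = 0; b < n; b++)
      d[(size_t)a * n + b] = sqrt((x[a]-x[b])*(x[a]-x[b]) + (y[a]-y[b])*(y[a]-y[b]));
  return d;
}

/* Option 2 (GPU on the slide): keep only the coordinates -> O(n) memory,
   recompute the distance every time it is needed. */
double dist_on_the_fly(const double x[], const double y[], int a, int b) {
  return sqrt((x[a]-x[b])*(x[a]-x[b]) + (y[a]-y[b])*(y[a]-y[b]));
}

Recomputation trades a few arithmetic operations for a much smaller footprint, which suits GPUs with limited fast memory per thread block.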

Slide 13: TSP Moves/Second/Node Evaluated
[chart] Annotation: the GPUs provide >90% of the per-node performance on Keeneland.

Slide 14: ILCS Scaling on Ranger (FSM)
[chart] Annotations: >99% parallel efficiency on 2048 nodes; the other two systems behave similarly.

Slide 15: ILCS Scaling on Ranger (TSP)
[chart] Annotations: >95% parallel efficiency on 2048 nodes; longer runs scale even better.

Slide 16: Intra-Node Scaling on Stampede (TSP)
[chart] Annotations: >98.9% parallel efficiency on 16 threads; the framework overhead is very small.

Slide 17: Tour Quality Evolution (Keeneland)
[chart] Annotation: quality depends on chance; ILS provides a good solution quickly and then progressively improves it.

Slide 18: Tour Quality after 6 Steps (Stampede)
[chart] Annotation: larger node counts typically yield better results faster.

Slide 19: Summary and Conclusions
- ILCS framework
  - Automatic parallelization of iterative local searches
  - Provides MPI, OpenMP, and multi-GPU support
  - Checkpoints the currently best solution every few seconds
  - Scales very well (decentralized)
- Evaluation
  - 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  - AMD and Intel CPUs, NVIDIA GPUs, and Intel MICs
- The ILCS source code is freely available at http://cs.txstate.edu/~burtscher/research/ILCS/
- Acknowledgments: work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS

