
1 Advisor: Dr. Gagan Agrawal
Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors Linchuan Chen Advisor: Dr. Gagan Agrawal

2 Motivation - Platforms
Accelerators are Evolving General Purpose Graphics Processing Units (GPGPU) Extreme-scale, cost-effective, power efficient Suitable for data-intensive high performance computing Heterogeneous CPU-GPU Architectures World’s top supercomputers compose heterogeneous CPU+NVIDIA GPU nodes Emergence of integrated CPU-GPU architecture AMD Fusion, Intel Ivy Bridge Shared memory CPU-GPU computation Coprocessors are Emerging Intel Xeon Phi Large-scale shared memory parallelism Wide SIMD lanes 11/8/2018

3 Motivation - Programming
Scientific Applications are Highly Diverse It is necessary to classify them and investigate each Different Communication Patterns Require different programming techniques Programmability Data partitioning - computation-thread mapping Data race avoidance – computation reorganization/proper use of locking Load balancing – task scheduling Communication Among thread blocks Among devices Among nodes Performance Efficiency Efficient use of memory hierarchy Reducing locking overhead Data access locality Efficient use of SIMD 11/8/2018

4 Motivation - A Variety of Applications
MapReduce Generalized Reductions Irregular Reductions Moldyn Euler Others Stencil Computations Jacobi Sobel Graph Algorithms SSSP PageRank 11/8/2018

5 Thesis Work Optimize MapReduce for GPUs
Intensively utilize shared memory on GPU (HPDC’12) MapReduce for a Coupled CPU-GPU Schemes for scheduling MapReduce on both CPU and GPU on an integrated architecture (SC’12) Scheduling Methods for Applications on Heterogeneous Architectures General scheduling methods for applications across heterogeneous cores (HCW’14) Pattern-specific Programming and Optimizations for Heterogeneous Clusters Pattern specific APIs and runtimes for a heterogeneous cluster (IPDPS'15) Graph Processing System over CPU and MIC Accelerating graph processing by utilizing MIMD and SIMD over CPU and MIC (IPDPS'15) Optimization Methods for Irregular Applications on MIC General optimization flow for irregular applications, including irregular reductions, graph algorithms, and SpMM (PACT'15 submission) 11/8/2018

6 Outline Work after Candidacy Work before Candidacy Conclusion
Optimizing Irregular Applications on MIC Data locality Access pattern Conflict avoidance Work before Candidacy MapReduce MapReduce system for an NVIDIA GPU, with effective shared memory usage MapReduce system for an Integrated CPU-GPU Pattern Specification System for Heterogeneous Clusters Classify applications based on communication patterns API for each pattern Automatic runtime optimization for each pattern at different levels of the parallel hierarchy Graph Processing over CPU and MIC Maintain moderate memory consumption Support SIMD message processing Reduce locking overhead Graph partitioning between CPU and MIC Conclusion

7 Intel Xeon Phi Large-scale Parallelism Wide SIMD Lanes
61 cores Each supports 4 hyper-threads Wide SIMD Lanes 512 bit lane width = 16 floats More Flexible SIMD Programming with Gather/Scatter Enables the programming for irregular memory access 11/8/2018

8 Intel Xeon Phi Efficient Utilization is Critical Data Access Locality
MIMD level SIMD level Irregular access with Gather/Scatter SIMD Utilization Ratio Need to keep as many SIMD lanes as possible busy Synchronization Overhead Need to avoid conflicts among SIMD lanes Use of locking among threads should also be minimized

9 Intel Xeon Phi - Gather/Scatter Performance
Accessing the same number of floats, the performance of gather/scatter falls significantly below that of load/store at access ranges larger than 256

10 Irregular Applications
Applications involving Irregular, Data-dependent, Indirection-based data accesses Typical Subsets Irregular Reductions Graph Algorithms SpMV SpMM 11/8/2018

11 Contributions A General Optimization Methodology The Steps
Based on a sparse matrix view Easier to discover computation and data access patterns The Steps Data locality enhancement through matrix tiling Data access pattern identification Write conflict removal at both SIMD and MIMD levels Subsets Studied Irregular reductions Graph algorithms SpMM 11/8/2018

12 Sparse Matrix View of Irregular Applications
Irregular Reductions Computation steps: load node elements, do the computation, write to the reduction array (a scalar sketch follows below)
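For concreteness, a minimal scalar sketch of an edge-based irregular reduction in the style of Moldyn; the names edges, node_data, and reduction, and the compute_force signature, are illustrative assumptions rather than the actual code.

double compute_force(double a, double b);   // hypothetical user kernel

// Scalar irregular reduction over an edge list: each edge reads two node
// elements through indirection and accumulates into the reduction array
// at both endpoints (indirect, potentially conflicting writes).
void irregular_reduction(const int (*edges)[2], int num_edges,
                         const double *node_data, double *reduction) {
    for (int e = 0; e < num_edges; ++e) {
        int u = edges[e][0], v = edges[e][1];                   // indirection
        double f = compute_force(node_data[u], node_data[v]);
        reduction[u] += f;                                       // scattered writes in
        reduction[v] -= f;                                       // both directions
    }
}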

13 Sparse Matrix View of Irregular Applications
Graph Algorithms Load values of source vertices Load values of edges Compute Update destination vertices 11/8/2018

14 Sparse Matrix View of Irregular Applications
SpMM C = A X B Access Two Sparse Matrices A, B, C are in CSR Straightforward Implementation Load one element from A Load a row of elements from B Conduct a SIMD multiplication Store the results to hash table using column IDs of the elements in B as keys Very irregular memory access 11/8/2018

15 Optimization – irregular reductions
Step 1: Matrix Tiling The sparse matrix is tiled into squares (tiles) A tile is stored as COO (rows, cols, vals) Hierarchical tiling Extract small dense tiles first – benefits locality Extract remaining sparse tiles (larger tiles) later – benefits SIMD utilization ratio Each tile is exclusively processed by one thread in SIMD (a possible tile layout is sketched below) Step 2: Access Pattern Identification Writes happen in both row and column directions
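One possible in-memory layout for such a COO tile, as a sketch; the field names are illustrative, not taken from the implementation.

#include <vector>

// A square tile of the sparse matrix, stored in COO form.
struct COOTile {
    int tile_row, tile_col;       // position of the tile in the tiled matrix
    int nnz;                      // number of non-zeros in this tile
    // Parallel arrays: non-zero i is (rows[i], cols[i], vals[i]), with
    // row/column IDs local to the tile.
    std::vector<int>   rows;
    std::vector<int>   cols;
    std::vector<float> vals;
    // Conflict-free group boundaries (filled in a later step): group g
    // spans non-zeros [group_start[g], group_start[g + 1]).
    std::vector<int>   group_start;
};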

16 Optimization – irregular reductions
Step 3: Conflict Removal Conflicts SIMD level: different lanes might write to the same locations MIMD level: different threads might write to the same locations Conflict Removal at SIMD Level – Grouping Divide the non-zeros in each tile into conflict-free groups In each conflict-free group, any two elements have distinct row IDs and distinct column IDs (a greedy grouping sketch follows below) Conflict Removal at MIMD Level Group tiles in the same way Each parallel step: all threads process the same conflict-free tile group
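A greedy grouping sketch, assuming the COOTile layout above: each non-zero goes into the first existing group whose row and column IDs it does not collide with. This only illustrates the idea and is not necessarily the policy the runtime uses.

#include <set>
#include <vector>

// Partition a tile's non-zeros into conflict-free groups: within a group, no
// two non-zeros share a row ID or a column ID, so a SIMD scatter over the
// group can never write the same location twice.
std::vector<std::vector<int>> build_conflict_free_groups(const COOTile &t) {
    std::vector<std::vector<int>> groups;        // indices of non-zeros per group
    std::vector<std::set<int>> used_rows, used_cols;
    for (int i = 0; i < t.nnz; ++i) {
        size_t g = 0;
        for (; g < groups.size(); ++g)
            if (!used_rows[g].count(t.rows[i]) && !used_cols[g].count(t.cols[i]))
                break;                            // non-zero i fits into group g
        if (g == groups.size()) {                 // otherwise open a new group
            groups.emplace_back();
            used_rows.emplace_back();
            used_cols.emplace_back();
        }
        groups[g].push_back(i);
        used_rows[g].insert(t.rows[i]);
        used_cols[g].insert(t.cols[i]);
    }
    return groups;  // the tile can then be reordered so each group is contiguous
}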

17 Optimization – irregular reductions
Execution (SIMD processing of a conflict-free group, over the arrays rid, cid, val): Load row IDs Load column IDs Load non-zero values Gather node data according to row IDs Gather node data according to column IDs Compute Update the reduction array using row IDs with scatter Update the reduction array using column IDs with scatter (see the intrinsics sketch below)
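A hedged sketch of one such step written with AVX-512-style gather/scatter intrinsics; the Xeon Phi generation targeted here exposes comparable 512-bit operations through its own intrinsic set, the compute step is a placeholder, and the array names are illustrative.

#include <immintrin.h>

// Process one conflict-free group of 16 non-zeros from a tile.
// rid, cid: row/column IDs; val: non-zero values; node: node data; red: reduction array.
void simd_group_step(const int *rid, const int *cid, const float *val,
                     const float *node, float *red) {
    __m512i vr = _mm512_loadu_si512(rid);             // load row IDs
    __m512i vc = _mm512_loadu_si512(cid);             // load column IDs
    __m512  vv = _mm512_loadu_ps(val);                // load non-zero values
    __m512  nr = _mm512_i32gather_ps(vr, node, 4);    // gather node data by row ID
    __m512  nc = _mm512_i32gather_ps(vc, node, 4);    // gather node data by column ID
    __m512  f  = _mm512_mul_ps(vv, _mm512_sub_ps(nr, nc));   // placeholder compute
    // Gather-add-scatter on the reduction array is safe only because the group
    // is conflict-free: no row or column ID repeats within these 16 lanes.
    __m512 rr = _mm512_i32gather_ps(vr, red, 4);
    _mm512_i32scatter_ps(red, vr, _mm512_add_ps(rr, f), 4);
    __m512 rc = _mm512_i32gather_ps(vc, red, 4);
    _mm512_i32scatter_ps(red, vc, _mm512_sub_ps(rc, f), 4);
}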

18 Optimization – graph algorithms
Locality enhancement: same approach as is used in irregular reductions Access pattern: different from irregular reductions Only write in one direction Will influence the grouping policy Conflict removal Conflict-free grouping: only need to consider column IDs

19 Optimization – SpMM C = A X B Sequential kernel
For each non-zero e in A, scale the non-zeros in row e.col of B. Accumulate the scaling results into the hash table of row e.row in C (a sequential sketch follows below).
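A minimal sequential sketch of this kernel for CSR inputs, accumulating each output row in a std::unordered_map rather than the system's custom hash table; the CSR struct and names are illustrative.

#include <unordered_map>
#include <vector>

struct CSR {
    std::vector<int>   row_ptr;   // size = num_rows + 1
    std::vector<int>   col_idx;   // column ID of each non-zero
    std::vector<float> vals;      // value of each non-zero
};

// C = A * B, with one hash table per output row of C.
std::vector<std::unordered_map<int, float>> spmm(const CSR &A, const CSR &B) {
    int n = (int)A.row_ptr.size() - 1;
    std::vector<std::unordered_map<int, float>> C(n);
    for (int i = 0; i < n; ++i) {                            // each row of A
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            int   k = A.col_idx[p];                          // e.col
            float a = A.vals[p];
            // Scale row k of B by a and accumulate into row i's hash table,
            // keyed by the column IDs of B's non-zeros.
            for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q)
                C[i][B.col_idx[q]] += a * B.vals[q];
        }
    }
    return C;
}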

20 Optimization – SpMM Storage format: Tile the sparse matrices
Consider matrices with clustered non-zeros For a floating point matrix, store non-zero blocks as 4x4 dense blocks, each stored contiguously Index only the tiles SIMD Multiplication between Tiles

21 Conflict-free SIMD Processing
Load of a tile from A: contiguous Load of a tile from B uses gather, with different index maps for different iterations Conduct the SIMD multiplication In each iteration, conduct a horizontal permutation of the SIMD multiplication result to align elements with the corresponding elements (shown in the same color in the original figure) of the partial reduction result Locate the result tile in the hash table for the corresponding row A SIMD add reduces the partial reduction result into the result tile

22 MIMD Level Parallelism
C = A X B Treat each tile as an element Divide the workload based on the rows of A For each tile in A, scale (in SIMD) the elements in the corresponding row of B Store the scaling results to hash table (in SIMD) 11/8/2018

23 Results Platform Applications Intel Xeon Phi SE10P coprocessor
61 cores, 1.1 GHz, 4 hyper-threads per core 8 GB GDDR5 Intel ICC, compiled with -O3 Applications Irregular reductions Moldyn, Euler Graph algorithms Bellman-Ford, PageRank SpMM

24 Irregular Reductions Compare different execution approaches Serial
Single thread scalar, row by row processing Serial Tiling (optimal tile size) Scalar, on tiled data SIMD Naive Row by row processing, SIMD SIMD Tiling (Our) (optimal tile size) SIMD processing, on tiled data 11/8/2018

25 Irregular Reductions Overall Speedup (Over Single Thread Scalar)
11/8/2018

26 Graph Algorithms Compare different execution approaches Serial
Single thread scalar, row by row processing Serial Tiling (optimal tile size) Scalar, on tiled data SIMD Naive Row by row processing, SIMD SIMD Tiling (Our) (optimal tile size) SIMD processing, on tiled data 11/8/2018

27 Graph Algorithms Overall Performance 11/8/2018

28 SpMM Single Thread Performance (clustered datasets) 11/8/2018

29 SpMM Overall Performance 11/8/2018

30 Outline Work after Candidacy Work before Candidacy Conclusion
Optimizing Irregular Applications on MIC Data locality Access pattern Conflict avoidance Work before Candidacy MapReduce on Accelerators MapReduce system for an NVIDIA GPU, with effective shared memory usage MapReduce system for an Integrated CPU-GPU Pattern Specification System for Heterogeneous Clusters Classify applications based on communication patterns API for each pattern Automatic runtime optimization for each pattern at different levels of the parallel hierarchy Graph Processing over CPU and MIC Maintain moderate memory consumption Support SIMD message processing Reduce locking overhead Graph partitioning between CPU and MIC Conclusion

31 Contributions Re-implement MapReduce in a reduction-based manner
Reduce the intermediate key-value pairs directly to reduction objects Make use of the fast but small shared memory Schedule MapReduce on a coupled CPU-GPU Map-dividing scheme Pipelining scheme 11/8/2018

32 MapReduce
Traditional MapReduce:
map(input) {
    (key, value) = process(input);
    emit(key, value);
}
// grouping of the key-value pairs (by the runtime system)
reduce(key, iterator) {
    for each value in iterator
        result = operation(result, value);
    emit(key, result);
}
Reduction Based:
map(input, reduction_object) {
    (key, value) = process(input);
    reduction_object->insert(key, value);
}
reduce(value1, value2) {
    value1 = operation(value1, value2);
}

33 Reduction Object
[Figure: reduction object layout — an array of locks, an array of buckets whose entries are (KeyIdx, ValIdx) pairs, and a memory pool managed by a memory allocator, where each entry stores key size, value size, key data, and value data]
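A struct-level sketch of that layout; the field names are illustrative, not taken from the implementation.

// Hash-table-based reduction object, kept in shared or device memory.
struct ReductionObject {
    int *locks;         // one lock per bucket, guarding concurrent insert()
    int  num_buckets;
    // Each bucket holds offsets into the memory pool for its key and value.
    int *key_idx;       // key_idx[b]: offset of bucket b's key data
    int *val_idx;       // val_idx[b]: offset of bucket b's value data
    // Memory pool: entries packed as [key size][val size][key data][val data],
    // carved out by a simple bump allocator.
    char *pool;
    int   pool_offset;  // next free byte, advanced atomically on allocation
    int   pool_size;
};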

34 Memory Hierarchy
[Figure: each GPU thread block keeps reduction objects (RO 0, RO 1, …) in its shared memory; they are merged into a device memory reduction object and result array, which are finally copied to CPU host memory]

35 MapReduce on Coupled CPU-GPUs
[Figure: two scheduling schemes — in the map-dividing scheme, map tasks are divided between the CPU and the GPU and each device runs both the map and the reduce stage; in the pipelining scheme, one device runs the map stage and the other runs the reduce stage, connected by a key-value buffer]

36 Performance on an Nvidia GPU
11/8/2018

37 Performance on a Coupled CPU-GPU
Compare single-core execution and CPU-GPU execution with the handwritten sequential version.
Execution Time (ms)      | KM          | WC          | NBC         | MM             | kNN
Sequential (1 CPU Core)  | 7042        | 2017        | 2655        | 98810          | 1004
MapReduce                | 7804        | 2057        | 2712        | 93647          | 1154
Best CPU-GPU (Speedup)   | 959 (7.34x) | 516 (3.91x) | 818 (3.25x) | 3445 (28.68x)  | 112 (8.69x)

38 Outline Work after Candidacy Work before Candidacy Conclusion
Optimizing Irregular Applications on MIC Data locality Access pattern Conflict avoidance Work before Candidacy MapReduce on Accelerators MapReduce system for an NVIDIA GPU, with effective shared memory usage MapReduce system for an Integrated CPU-GPU Pattern Specification System for Heterogeneous Clusters Classify applications based on communication patterns API for each pattern Automatic runtime optimization for each pattern at different levels of the parallel hierarchy Graph Processing over CPU and MIC Maintain moderate memory consumption Support SIMD message processing Reduce locking overhead Graph partitioning between CPU and MIC Conclusion

39 Main Issues Programming Heterogeneous Clusters is Extremely Complicated Need to explore multi-level parallelism Need to handle partitioning and communication Need to conduct optimizations General Programming Frameworks E.g. MapReduce Improves programmability General but not efficient 11/8/2018

40 Contributions A Programming Framework for Developing Scientific Applications on Heterogeneous Clusters Classify Communication Patterns in Scientific Applications Provides Pattern-specific APIs for Different Patterns Automatically Conduct Pattern-Specific Optimizations 11/8/2018

41 API
Application driver code:
IReduction_runtime *ir = env.get_IR();
ir->set_edge_comp_func(force_cmpt);
ir->set_node_reduc_func(force_reduce);
for (int i = 0; i < n_tsteps; i++) {
    ir->start();
    result = ir->get_output();
}

User-defined functions:
DEVICE void force_cmpt(Object *obj, EDGE edge, void *edge_data, void *node_data, void *parameter) {
    if (dist < cutoff) {
        double f = compute_force();
        obj->insert(&edge[0], &f);
        f = -f;
        obj->insert(&edge[1], &f);
    }
}
DEVICE void force_reduce(VALUE *value1, VALUE *value2) {
    *value1 += *value2;
}

Irregular reduction kernel in Moldyn:
force_cmpt: edge processing function; each invocation processes one edge
force_reduce: accumulates a value into the reduction object

42 Overall Structure 11/8/2018

43 Performance
Kmeans: 1800x, MiniMD: 600x, Heat3D: 750x

44 Outline Work after Candidacy Work before Candidacy Conclusion
Optimizing Irregular Applications on MIC Data locality Access pattern Conflict avoidance Work before Candidacy MapReduce on Accelerators MapReduce system for an NVIDIA GPU, with effective shared memory usage MapReduce system for an Integrated CPU-GPU Pattern Specification System for Heterogeneous Clusters Classify applications based on communication patterns API for each pattern Automatic runtime optimization for each pattern at different levels of the parallel hierarchy Graph Processing over CPU and MIC Maintain moderate memory consumption Support SIMD message processing Reduce locking overhead Graph partitioning between CPU and MIC Conclusion

45 Main Issues Xeon Phi Specific Utilizing CPU at the Same Time
Memory Access and Contention Overhead among Threads Irregular memory access Locking leads to contention among threads Difficult to Automatically Utilize SIMD Complex SSE programming Load Imbalance The processing associated with different vertices varies Need to keep as many cores as possible busy Utilizing CPU at the Same Time Graph partitioning between devices

46 Contributions A Pregel-like Parallel Graph Programming API
Simple to Use User-defined functions Express sequential logic: message generation, message processing, and vertex update Message processing transparently utilizes SIMD General Enough to Specify Common Graph Algorithms A Runtime Enabling the Use of both CPU and MIC Message Buffer Pipelined Execution

47 Interface
User-defined functions:

// 1. Message generation
void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) {
  float my_dist = g->vertex_value[vertex_id];
  // Graph is in CSR format.
  for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) {
    send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]);
  }
}

// 2. SIMD message processing
void process_messages(vmsg_array<MessageValue> &vmsgs) {
  // Reduce the vector messages to vmsgs[0].
  vfloat res = vmsgs[0];
  for (int i = 1; i < vmsgs.size(); ++i) {
    res = min(res, vmsgs[i]);
  }
  vmsgs[0] = res;
}

// 3. Vertex update
void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id) {
  // Relaxation.
  if (msg < g->vertex_value[vertex_id]) {
    g->vertex_value[vertex_id] = msg;
    // Distance reduced. Will send messages in the next step.
    g->active[vertex_id] = 1;
  } else {
    // Distance not changed. No msgs will be sent.
    g->active[vertex_id] = 0;
  }
}

48 Runtime Condensed Static Buffer Pipelined Execution
Message Buffer: pre-allocated space, columns allocated contiguously, avoids bubbles Pipelined Execution: uses worker/mover threads, avoids locking in message insertion, beneficial to message-intensive applications

49 Hybrid Graph Partitioning
Partition the graph into min-connectivity blocks (using Metis) Assign blocks to CPU and MIC in a round-robin fashion in an (a : b) ratio Good load balance and fewer messages between devices

50 Performance 11/8/2018

51 Outline Work after Candidacy Work before Candidacy
Optimizing Irregular Applications on MIC Data locality Access pattern Conflict avoidance Work before Candidacy MapReduce on Accelerators MapReduce system for an NVIDIA GPU, with effective shared memory usage MapReduce system for an Integrated CPU-GPU Pattern Specification System for Heterogeneous Clusters Classify applications based on communication patterns API for each pattern Automatic runtime optimization for each pattern at different levels of the parallel hierarchy Graph Processing over CPU and MIC Maintain moderate memory consumption Support SIMD message processing Reduce locking overhead Graph partitioning between CPU and MIC Future Work and Conclusion

52 Future Work Handling Irregular Applications with Adaptively Changing Data Structures E.g., Moldyn, Euler Investigating the Features of New Generations of Many-core Architectures E.g., Knights Landing Adapt irregular applications according to the new features 11/8/2018

53 Conclusions Optimize MapReduce for GPUs
Intensively utilize shared memory on GPU (HPDC’12) MapReduce for a Coupled CPU-GPU Schemes for scheduling MapReduce on both CPU and GPU on an integrated architecture (SC’12) Scheduling Methods for Applications on Heterogeneous Architectures General scheduling methods for applications across heterogeneous cores (HCW’14) Pattern-specific Programming and Optimizations for Heterogeneous Clusters Pattern specific APIs and runtimes for a heterogeneous cluster (IPDPS'15) Graph Processing System over CPU and MIC Accelerating graph processing by utilizing MIMD and SIMD over CPU and MIC (IPDPS'15) Optimization Methods for Irregular Applications on MIC General optimization flow for irregular applications, including irregular reductions, graph algorithms, and SpMM (PACT'15 submission) 11/8/2018

54 Thank you! Questions? 11/8/2018

55 Backup Slides 11/8/2018

56 Optimizing MapReduce for GPUs with Effective Shared Memory Usage
Linchuan Chen and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

57 Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work

58 Introduction Motivations GPUs MapReduce Programming Model
Suitable for extreme-scale computing Cost-effective and power-efficient MapReduce Programming Model Emerged with the development of Data-Intensive Computing GPUs have proven suitable for implementing MapReduce Utilizing the fast but small shared memory for MapReduce is challenging Storing (key, value) pairs leads to high memory overhead, prohibiting the use of shared memory

59 Introduction Reduction-based method
Our approach Reduction-based method Reduce the (key, value) pair into the reduction object immediately after it is generated by the map function Very suitable for reduction-intensive applications A general and efficient MapReduce framework Dynamic memory allocation within a reduction object Maintaining a memory hierarchy Multi-group mechanism Overflow handling Before stepping into deeper detail, let me first cover the background on MapReduce and the GPU architecture

60 Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work

61 MapReduce
[Figure: MapReduce data flow — map tasks (M) emit (key, value) pairs such as K1:v, k2:v; the runtime groups values by key (e.g., K1: v, v, v, v); reduce tasks (R) then merge the values for each key]

62 MapReduce Programming Model Efficient Runtime System Map()
Generates a large number of (key, value) pairs Reduce() Merges the values associated with the same key Efficient Runtime System Parallelization Concurrency Control Resource Management Fault Tolerance … …

63 GPUs
Processing Component, Memory Component
[Figure: a host launches kernels as grids of thread blocks on the device; each block contains threads with registers and local memory and shares a per-block shared memory, while device, constant, and texture memory are shared across the whole device]

64 Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work

65 System Design
Traditional MapReduce:
map(input) {
    (key, value) = process(input);
    emit(key, value);
}
// grouping of the key-value pairs (by the runtime system)
reduce(key, iterator) {
    for each value in iterator
        result = operation(result, value);
    emit(key, result);
}

66 System Design
Reduction-based approach:
map(input, reduction_object) {
    (key, value) = process(input);
    reduction_object->insert(key, value);
}
reduce(value1, value2) {
    value1 = operation(value1, value2);
}
Reduces the memory overhead of storing key-value pairs
Makes it possible to effectively utilize shared memory on a GPU
Eliminates the need for grouping
Especially suitable for reduction-intensive applications
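As a concrete illustration, word count written against this reduction-based interface might look like the following sketch, kept in the same pseudocode notation as the slides; only insert() comes from the slides, the rest is assumed.

// map: called once per input word; the (word, 1) pair is reduced into the
// reduction object immediately, so no intermediate pairs are stored.
map(word, reduction_object) {
    reduction_object->insert(word, 1);
}
// reduce: combines two partial counts that share the same key.
reduce(value1, value2) {
    value1 = value1 + value2;
}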

67 Challenges Result collection and overflow handling
Maintain a memory hierarchy Trade off space requirement and locking overhead A multi-group scheme To keep the framework general and efficient A well defined data structure for the reduction object

68 Memory Hierarchy
[Figure: each GPU thread block keeps reduction objects (Reduction Object 0, Reduction Object 1, …) in its shared memory; they are merged into a device memory reduction object and result array, which are finally copied to CPU host memory]

69 Reduction Object Updating the reduction object
Use locks to synchronize Memory allocation in reduction object Dynamic memory allocation Multiple offsets in device memory reduction object

70 Reduction Object
[Figure: reduction object layout — buckets of (KeyIdx, ValIdx) pairs and a memory pool managed by a memory allocator, where each entry stores key size, value size, key data, and value data]

71 Multi-group Scheme Locks are used for synchronization
Large number of threads in each thread block Leads to severe contention on the shared memory RO One solution: full replication every thread owns a shared memory RO leads to memory overhead and combination overhead Trade-off: multi-group scheme divide the threads in each thread block into multiple sub-groups each sub-group owns a shared memory RO Choice of the number of groups Contention overhead Combination overhead

72 Overflow Handling Swapping In-object sorting
Merge the full shared memory ROs to the device memory RO Empty the full shared memory ROs In-object sorting Sort the buckets in the reduction object and delete the unuseful data Users define the way of comparing two buckets

73 Discussion Reduction-intensive applications
Our framework has a big advantage Applications with few or no reduction No need to use shared memory Users need to setup system parameters Develop auto-tuning techniques in future work

74 Extension for Multi-GPU
Shared memory usage can speed up single node execution Potentially benefits the overall performance Reduction objects can avoid global shuffling overhead Can also reduce communication overhead

75 Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work

76 Experiment Results Evaluating the swapping mechanism Applications used
5 reduction-intensive 2 map computation-intensive Tested with small, medium and large datasets Evaluation of the multi-group scheme 1, 2, 4 groups Comparison with other implementations Sequential implementations MapCG Ji et al.'s work Evaluating the swapping mechanism Test with large number of distinct keys

77 Evaluation of the Multi-group Scheme

78 Comparison with Sequential Implementations

79 Comparison with MapCG With reduction-intensive applications

80 Comparison with MapCG With other applications

81 Comparison with Ji et al.'s work

82 Evaluation of the Swapping Mechanism
VS MapCG and Ji et al.’s work

83 Evaluation of the Swapping Mechanism
VS MapCG

84 Evaluation of the Swapping Mechanism
swap_frequency = num_swaps / num_tasks

85 Outline Introduction Background System Design Experiment Results
Related Work Conclusions and Future Work

86 Related Work MapReduce for multi-core systems MapReduce on GPUs
Phoenix, Phoenix Rebirth MapReduce on GPUs Mars, MapCG MapReduce-like framework on GPUs for SVM Catanzaro et al. MapReduce in heterogeneous environments MITHRA, IDAV Utilizing shared memory of GPUs for specific applications Nyland et al., Gutierrez et al. Compiler optimizations for utilizing shared memory Baskaran et al. (PPoPP '08), Moazeni et al. (SASP '09)

87 Conclusions and Future Work
Reduction-based MapReduce Storing the reduction object on the memory hierarchy of the GPU A multi-group scheme Improved performance compared with previous implementations Future work: extend our framework to support new architectures

88 Accelerating MapReduce on a Coupled CPU-GPU Architecture
Linchuan Chen, Xin Huo and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

89 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work 11/8/2018

90 Introduction Motivations Evolution of Heterogeneous Architectures
Decoupled CPU-GPU architectures CPU + NVIDIA GPU Coupled CPU-GPU architectures AMD Fusion Intel Ivy Bridge MapReduce Programming Model Emerged with the development of Data-Intensive Computing GPUs are used to speed up MapReduce It is yet unclear how to accelerate MapReduce on a coupled CPU-GPU

91 Introduction Our Work A MapReduce Framework Task Scheduling Schemes
On a coupled CPU-GPU Using both CPU and GPU cores Based on continuous reduction Task Scheduling Schemes Map-Dividing Scheme Divides map tasks between CPU and GPU Pipelining Scheme Pipelines map and reduce stages on different devices Optimizing Load Balance Runtime Tuning Significant Speedup 1.21 – 2.1x speedups over single device versions 11/8/2018

92 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work

93 Heterogeneous Architecture (AMD Fusion Chip)
Processing component of the GPU: a device grid (CUDA) / NDRange (OpenCL), made up of streaming multiprocessors (SMs) running blocks (CUDA) / workgroups (OpenCL), whose processing cores run threads (CUDA) / work items (OpenCL).
[Figure: grid, block, and thread layout on the device]

94 Heterogeneous Architecture (AMD Fusion Chip)
Memory component: the GPU shares the same physical memory with the CPU — no device memory, no PCIe bus, zero-copy memory buffers. Shared memory is small (32 KB) but provides faster I/O and faster locking operations.
[Figure: per-thread private memory and per-SM shared memory on the GPU; both CPU and GPU access host RAM through zero copy]

95 MapReduce Programming Model Efficient Runtime System Map()
Generates (key, value) pair(s) Reduce() Merges the values associated with the same key Efficient Runtime System Parallelization Concurrency Control Resource Management … … 11/8/2018

96 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work 11/8/2018

97 MapReduce – Memory Overhead Challenges
Memory Requirements of MapReduce Intermediate data – key-value pairs Take large memory space Frequent I/O – low compute-to-memory-access ratio Hard to Utilize the Small but Fast Shared Memory Impossible to store key-value pairs in shared memory 11/8/2018

98 Traditional MapReduce Procedure
shuffle  reduce Reduction result Memory overhead and shuffling overhead 11/8/2018

99 Overcoming Memory Overhead of MapReduce
MapReduce Based on Continuous Reduction Reduction result 11/8/2018

100 MapReduce Based on Continuous Reduction
Key-value Pairs Reduced Immediately Low memory overhead No shuffling overhead A General Data Structure to Store the Result Reduction object – hash table based Small reduction object – can use shared memory Non-associative-and-commutative in-object sort 11/8/2018

101 Task Scheduling
[Figure: the map-dividing scheme divides map tasks between the CPU and the GPU, with both the map and the reduce stage on each device; the pipelining scheme runs the map stage on the map device and the reduce stage on the reduce device, connected by a key-value buffer]

102 Map-dividing Scheme – Scheduling Challenges
Static Scheduling? Relative speeds of CPU and GPU vary Partitioning ratio cannot be determined Dynamic Scheduling Kernel re-launch based High kernel launch overhead Locking based Put global offset in zero-copy memory, and use atomic operations to retrieve tasks However, locking to this memory for CPU and GPU is not correctly supported 11/8/2018

103 Map-dividing Scheme (master-worker model)
[Figure: a scheduler on CPU core 0 keeps per-worker status (busy/idle) and sends schedule messages (has_task, task_idx, task_size) through zero-copy memory to CPU cores and GPU blocks, which run map tasks and combine their results into the output]

104 Map-dividing Scheme (master-worker model)
Locking-free Worker cores do not retrieve tasks actively No competition on global task offset Dedicating a CPU core to Scheduling A waste of resource Especially for applications where CPU is much faster than GPU 11/8/2018

105 Pipelining Scheme GPUs are Good at Highly Parallel Operations
Potentially good at doing map stage, which tends to be compute intensive and parallel CPUs are Good at Control Flow and Data Retrieval Potentially good at doing reduce stage, which involves branch operations and data retrieval Different Load Balancing Dynamic load balancing Static Load Balancing 11/8/2018

106 Pipelining Scheme with Dynamic Load Balancing
[Figure: a scheduler on core 0 tracks per-block worker status on both devices; blocks on the map device generate key-value pairs into zero-copy buffers, which blocks on the reduce device consume to produce the output]

107 Pipelining Scheme with Static Load Balancing
[Figure: each map-device block writes into its own zero-copy key-value buffer, which is consumed by the corresponding reduce-device block; no dynamic scheduler is involved]

108 Runtime Tuning for Map-dividing Scheme
Fixed-size scheduling:
Large task block size: low scheduling overhead, but high load imbalance
Small task block size: low load imbalance, but high scheduling overhead
Runtime tuning: profile using small blocks, adjust block sizes according to the observed speeds, and reduce at the end.
Worker ID | Completed Task Number at Probe Stage | Tuned Size
0         | N0                                   | N0 / Nave * Size_large
1         | N1                                   | N1 / Nave * Size_large
2         | N2                                   | N2 / Nave * Size_large
...       | ...                                  | ...
n         | Nn                                   | Nn / Nave * Size_large
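In code, that tuning step could look like the following sketch; the names probe_counts and size_large are illustrative.

#include <numeric>
#include <vector>

// After the probe stage, each worker's task block size is scaled in proportion
// to the number of tasks it completed relative to the average (Nw / Nave * Size_large).
std::vector<size_t> tune_block_sizes(const std::vector<size_t> &probe_counts,
                                     size_t size_large) {
    double avg = std::accumulate(probe_counts.begin(), probe_counts.end(), 0.0)
                 / probe_counts.size();
    std::vector<size_t> tuned(probe_counts.size());
    for (size_t w = 0; w < probe_counts.size(); ++w)
        tuned[w] = static_cast<size_t>(probe_counts[w] / avg * size_large);
    return tuned;
}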

109 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work 11/8/2018

110 Experimental Setup Platform Applications A Coupled CPU-GPU
AMD Fusion APU A3850 Quad Core AMD CPU + HD6550 GPU (5 x 80 = 400 cores) Applications Kmeans (KM), Word Count (WC), Naive Bayes Classifier (NBC), Matrix Multiplication (MM), K-nearest Neighbor (kNN) 11/8/2018

111 Computation Time under Different Task Block Sizes (Map-dividing Scheme)
Kmeans: CPU and GPU have near-identical processing speed 11/8/2018

112 Computation Time under Different Task Block Sizes (Map-dividing Scheme)
Word Count: CPU is 2.7 times faster than GPU 11/8/2018

113 Comparison of Different Approaches
Single Device Versions (baseline) CPU: CPU-only version. GPU: GPU-only version. Map-dividing Scheme MDO: map-dividing scheme with a manually chosen optimal task block size. TUNED: map-dividing scheme with runtime tuning Pipelining Scheme GMCR: pipelining scheme, GPU map, CPU reduce GMCRD: GMCR with dynamic load balancing GMCRS: GMCR with static load balancing CMGR: pipelining scheme, CPU map, GPU reduce CMGRD: CMGR with dynamic load balancing CMGRS: CMGR with static load balancing 11/8/2018

114 Comparison of Different Approaches
Kmeans: Comparison between single device versions (CPU, GPU), Map-dividing Scheme (MDO, TUNED) and Pipelining Scheme (GMCR, CMGR)

115 Comparison of Different Approaches
Word Count: Comparison between single device versions (CPU, GPU), Map-dividing Scheme (MDO, TUNED) and Pipelining Scheme (GMCR, CMGR)

116 Comparison of Different Approaches
Naive Bayes: Comparison between single device versions (CPU, GPU), Map-dividing Scheme (MDO, TUNED) and Pipelining Scheme (GMCR, CMGR)

117 Comparison of Different Approaches
Matrix Multiplication: Comparison between single device versions (CPU, GPU) and Map-dividing Scheme (MDO, TUNED)

118 Comparison of Different Approaches
K-nearest Neighbor: Comparison between single device versions (CPU, GPU), Map-dividing Scheme (MDO, TUNED), and Pipelining Scheme (GMCR, CMGR)

119 Overall Speedups from Our Framework
Compare single-core execution and CPU-GPU execution with the handwritten sequential version.
Execution Time (ms)      | KM          | WC          | NBC         | MM             | kNN
Sequential (1 CPU Core)  | 7042        | 2017        | 2655        | 98810          | 1004
MapReduce                | 7804        | 2057        | 2712        | 93647          | 1154
Best CPU-GPU (Speedup)   | 959 (7.34x) | 516 (3.91x) | 818 (3.25x) | 3445 (28.68x)  | 112 (8.69x)

120 Outline Introduction Background System Design Experiment Results
Conclusions and Future Work 11/8/2018

121 Conclusions and Future Work
Scheduling MapReduce on a Coupled CPU-GPU Two Different Scheduling Schemes Runtime Tuning to Lower Load Imbalance MapReduce is Based on Continuous Reduction Achieves Significant Speedup Over Single Device Versions for Most Applications Future Work Extend to clusters with coupled CPU-GPU nodes Apply the design ideas to other applications with different communication patterns 11/8/2018

122 A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters Linchuan Chen Xin Huo and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

123 Modern Parallel Computing Landscape
Super Computers Supercomputers are built in the form of heterogeneous clusters.

124 Heterogeneous Clusters
Massive Computation Power Multiple levels of parallelism Large number of heterogeneous nodes High-end CPUs Many cores, e.g., GPU, Xeon Phi Play an Important Role in Scientific Computations 4 out of top 10 supercomputers involve CPU-accelerator nodes 11/8/2018

125 Programming Heterogeneous Clusters
Direct Programming Pros: performance Conduct application-specific optimizations Cons: complexity, low productivity Workload partitioning Programming different devices Communications at different levels General Programming Models Pros: high programmability Runtime system handles underlying issues Cons: general but not efficient Too general to apply application-specific optimizations A tradeoff is worth making

126 Thus we have the following question:
“Can high-level APIs be developed for several classes of popular scientific applications, to ease application development, while achieving high performance on clusters with accelerators?”

127 Previous Work 11/8/2018

128 Parallel Domain Specific Languages
Support a particular application domain Examples Liszt (Zachary et al.), for mesh-based PDE solvers, on clusters, GPUs, multi-cores OptiML (Sujeeth et al.), for machine learning, on GPUs CUSP, distributed with CUDA, for sparse linear algebra and graph computations on GPUs Not flexible for applications involving complex logics 11/8/2018

129 Our Solution 11/8/2018

130 The Approach Consider a Reasonable Variety of, but not all, Scientific Applications Summarize Scientific Computation Kernels by Patterns Provides Pattern-specific APIs for Each Pattern Automatically Conduct Pattern-Specific Optimizations 11/8/2018

131 The Approach Commonly Appeared Communication Patterns from Scientific Applications Generalized Reductions Irregular Reductions Stencil Computations Cover a reasonable subset of Berkeley Dwarfs (cover 16 out of 23 applications in Rodinia Benchmark Suite) Individually for Each Pattern Summarize its characteristics Computation pattern Communication pattern Design a general API Conduct automatic pattern-specific optimizations Computation level Communication level 11/8/2018

132 Communication Patterns
Generalized Reductions Parallel accumulation using associative and commutative operations, e.g., sum, mul, max E.g., supported by OpenMP Reduction space is typically small Irregular Reductions Stencil Computations Structured grids Update elements using neighbor elements 11/8/2018

133 APIs Pattern-specific Flexible
One set of user-defined functions for each pattern User-defined functions process a smallest unit Flexible C++ based. Allow the use of other parallel libraries Support applications with mixed communication patterns 11/8/2018

134 Example, Moldyn (irregular & generalized reductions)

User-defined functions:

//user-defined functions for CF kernel
DEVICE void force_cmpt(Object *obj, EDGE edge, void *edge_data, void *node_data, void *parameter) {
    /*{compute the distance between nodes...}*/
    if (dist < cutoff) {
        double f = compute_force((double*)node_data[edge[0]], (double*)node_data[edge[1]]);
        obj->insert(&edge[0], &f);
        f = -f;
        obj->insert(&edge[1], &f);
    }
}
DEVICE void force_reduce(VALUE *dst, VALUE *src) {
    *dst += *src;
}
//user-defined functions for KE kernel
DEVICE void ke_emit(Object *object, void *input, size_t index, void *parameter) {...}
DEVICE void ke_reduce(VALUE *dst, VALUE *src) {...}
//user-defined functions for AV kernel
DEVICE void av_emit(...) {...}
DEVICE void av_reduce(...) {...}

Application driver code:

Runtime_env env;
env.init();
//runtime for irregular reduction CF
IReduction_runtime *ir = env.get_IR();
//runtime for generalized reductions KE & AV
GReduction_runtime *gr = env.get_GR();
//Compute Force (CF) kernel
ir->set_edge_comp_func(force_cmpt);   // use force_cmpt
ir->set_node_reduc_func(force_reduce); // use force_reduce
/*{set edge and node data filenames ...}*/
for (int i = 0; i < n_tsteps; i++) {
    ir->start();
    // get local reduction result
    result = ir->get_local_reduction();
    // update local node data
    ir->update_nodedata(result);
}
/*{set input filename}*/
//Kinetic Energy (KE) kernel
gr->set_emit_func(ke_emit);     // ke_emit as emit func
gr->set_reduc_func(ke_reduce);  // ke_reduce as reduce func
...
gr->start();
double ke_output = (gr->get_global_reduction());
// Average Velocity (AV) kernel
gr->set_emit_func(av_emit);     // av_emit as emit func
gr->set_reduc_func(av_reduce);  // av_reduce as reduce func
env.finalize();

Compute force: irregular reduction; kinetic energy: generalized reduction; average velocity: generalized reduction.

135 Example, Jacobi (stencil computation)

User-defined function:

DEVICE void jacobi(void *input, void *output, int *offset, int *size, void *param) {
    int k = offset[0], j = offset[1], i = offset[2];
    float total = GET_FLOAT3(input,i,j,k) + GET_FLOAT3(input,i,j,k+1) + ... + GET_FLOAT3(input,i-1,j,k);
    GET_FLOAT3(output,i,j,k) = total/7;
}

Application driver code:

Runtime_env env;
env.init();
Stencil_runtime *sr = env.get_SR();
/* {prepare input data & input data eles...} */
Sconfig<float> conf;
DIM3 grid_size(N, N, N), proc_size(2, 2, 2);
conf.grid_size = grid_size;
conf.proc_size = proc_size;
/* {configure stencil width, diagonal access, #iters...} */
sr->set_config(config);
sr->set_stencil_func(jacobi);   // jacobi as user-defined func
sr->set_grid(input_pointer);
sr->start();
sr->copy_out_grid(output_pointer);
env.finalize();

System-defined primitives, User-defined functions, Application driver code

136 Runtime Implementation
11/8/2018

137 Inter-process Generalized Reductions Irregular Reductions
Evenly partition input for all processes No data exchange during execution Conduct a final combination Irregular Reductions Workload Partitioning Based on reduction space (the nodes) Group edges according to the node partitioning Inter-process communication Node data exchanged for crossing edges Overlapped with local edges computation 11/8/2018

138 Inter-process – cont. Stencil Computations
Partition the grid according to the user-defined decomposition parameter Allocate sub-grids in each process with halo regions Exchange boundary data through halo region Boundary data exchange overlaps inner elements computation 11/8/2018

139 CPU-GPU Workload Partitioning
Goals Need to consider load balance Processing speeds of CPU and GPU are different Need to keep scheduling overhead low The relative amount of cycles spent on scheduling should be small compared with computation 11/8/2018

140 CPU-GPU Workload Partitioning – cont.
Generalized Reductions Dynamic scheduling between the CPU and GPUs The GPU launches a kernel after each task block fetch Use multiple streams for each GPU Overlap data copy and kernel execution among streams Irregular Reductions Adaptive partitioning Irregular reductions are iterative Evenly partition the reduction space for the first few iterations Profile the relative speeds of the devices during those iterations Re-partition the nodes according to the relative speeds (a sketch follows below) Stencil Computations Partition the grid along the highest dimension Also use adaptive partitioning
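A sketch of the adaptive re-partitioning idea for the two-device case; the function and variable names are illustrative, not from the runtime.

// After the profiling iterations, split the reduction space (or grid) between
// the CPU and the GPU in proportion to their measured throughputs.
void repartition(double cpu_items_per_sec, double gpu_items_per_sec,
                 size_t total_items, size_t *cpu_share, size_t *gpu_share) {
    double cpu_fraction = cpu_items_per_sec / (cpu_items_per_sec + gpu_items_per_sec);
    *cpu_share = static_cast<size_t>(total_items * cpu_fraction);
    *gpu_share = total_items - *cpu_share;   // GPU takes the remainder
}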

141 GPU and CPU Execution Optimizations
Reduction Localization (for Generalized Reductions and Irregular Reductions) GPU execution Reductions performed in GPU’s shared memory first, and combine into device memory CPU execution Each core has a private reduction object A combination is performed later Grid Tiling (for Stencil Computations) Increases neighbor access locality Overlapped execution Inner tiles are processed concurrently with the exchange of boundary tiles Boundary tiles are processed later 11/8/2018

142 Experiments 11/8/2018

143 Experimental Results Platform Applications Execution configurations
A GPU cluster 32 heterogeneous nodes Each node 12 core Intel Xeon 5650 CPU 2 Nvidia Tesla M2070 GPUs MVAPICH2 version 1.7 1 process per node, plus pthread multithreading CUDA 5.5 Applications Kmeans – generalized reduction Moldyn – irregular reduction and generalized reduction MiniMD - irregular reduction and generalized reduction Sobel – 2D stencil Heat3D – 3D stencil Execution configurations MPI (Hand) – from widely distributed benchmark suites CPU-ONLY – use only multi-core CPU execution on each node 1GPU-ONLY – use only 1 GPU execution on each node CPU+1GPU – use the CPU plus 1 GPU on each node CPU+2GPU – use the CPU plus 2 GPUs on each node 11/8/2018

144 Overall Performance Kmeans A GPU is 2.69x faster than a CPU
CPU+1GPU is 1.2x faster than GPU only CPU+2GPU is 1.92x faster than GPU only 32 nodes is 1760x faster than sequential version Framework (CPU-ONLY) faster than MPI (Hand), due to the difference in implementation Hand written code uses 1 MPI process per core Framework uses pthread, less communication 11/8/2018

145 Overall Performance Moldyn GPU is 1.5x faster than CPU
CPU+1GPU is 1.54x faster than GPU only CPU+2GPU is 2.31x faster than GPU only 589x speedup achieved using all 32 nodes 11/8/2018

146 Overall Performance Heat3D GPU is 2.4x faster than CPU
CPU+2GPU is 5.5x faster than CPU only 749x speedup achieved using 32 nodes Comparable with handwritten code 11/8/2018

147 Effect of Optimizations
Tiling optimization was used for stencil computations Overlapped (communication & computation) execution was used for both irregular reductions and stencil computations Tiling: improves Sobel by up to 20% Overlapped execution: 37% and 11% faster for Moldyn and Sobel 11/8/2018

148 Single GPU Performance -Comparison with GPU Benchmarks
Handwritten benchmarks Kmeans is from Rodinia benchmark suite Sobel is from NVIDIA SDK Use single-node single GPU execution Framework is 6% and 15% slower for Kmeans and Sobel, respectively 11/8/2018

149 Code Sizes Handwritten MPI codes (not able to use GPUs)
60% code size reduction, using the framework Framework is able to fully utilize CPU and GPUs on each node 11/8/2018

150 Conclusions A Programming Model Aiming to Trade off Programmability and Performance Pattern Based Optimizations Achieve Considerable Scalability, and Comparable Performance with Benchmarks Reduces Code Sizes Future Work Cover more communication patterns Support more architectures, e.g., Intel MIC 11/8/2018

151 Efficient and Simplified Parallel Graph Processing over CPU and MIC
Linchuan Chen, Xin Huo, Bin Ren, Surabhi Jain and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

152 The System A Vertex Oriented Parallel Graph Programming System over CPU and Xeon Phi 11/8/2018

153 Intel Xeon Phi Intel Xeon Phi Large-scale Parallelism Wide SIMD Lanes
61 X86 cores, each supporting 4 hardware threads Wide SIMD Lanes 512 bit SIMD lanes 16 floats/ints or 8 doubles Limited Shared Main Memory per Core Only a few GBs of main memory 11/8/2018

154 Graph Applications Popular Irregular Classical algorithms Graph mining
Route problems Graph mining Social network Irregular Hard to utilize thread level parallelism Loop dependency Load balance Hard to utilize SIMD Random access 11/8/2018

155 Vertex Oriented Programming Models
E.g., Google’s Pregel Model Follows the BSP Model Concurrent local processing Communication Synchronization Uses Message Passing to Abstract Graph Applications Simple to Use User-defined functions Express sequential logic: message generation, message processing, and vertex update General Enough to Specify Common Graph Algorithms

156 The Challenges Xeon Phi Specific Utilizing CPU at the Same Time
Memory Access and Contention Overhead among Threads Irregular memory access Locking leads to contention among threads Difficult to Automatically Utilize SIMD Complex SSE programming Load Imbalance The processing associated with different vertices varies Need to keep as many cores as possible busy Utilizing CPU at the Same Time Graph partitioning between devices

157 Programming Interface
11/8/2018

158 Programming Interface
Keeps the Simplicity of Pregel Users Express Application Logic through Message Passing message generation Active vertices send messages along neighbor links message processing Process the received messages for each vertex vertex update Update the vertex value using the processing result of the previous step 11/8/2018

159 Programming Interface: SSSP Example
[Figure: SSSP running example — a small weighted graph whose vertex distances are relaxed step by step; active vertices send messages along their out-edges, inactive vertices do not]

160 Programming Interface: SSSP Example
Vector types with overloaded vector operations
User-defined functions:

// 1. Message generation
void generate_messages(size_t vertex_id, graph<VertexValue, EdgeValue> *g) {
  float my_dist = g->vertex_value[vertex_id];
  // Graph is in CSR format.
  for (size_t i = g->vertices[vertex_id]; i < g->vertices[vertex_id + 1]; ++i) {
    send_messages<MessageValue>(g->edges[i], my_dist + g->edge_value[i]);
  }
}

// 2. SIMD message processing
void process_messages(vmsg_array<MessageValue> &vmsgs) {
  // Reduce the vector messages to vmsgs[0].
  vfloat res = vmsgs[0];
  for (int i = 1; i < vmsgs.size(); ++i) {
    res = min(res, vmsgs[i]);
  }
  vmsgs[0] = res;
}

// 3. Vertex update
void update_vertex(MessageValue &msg, graph<VertexValue, EdgeValue> *g, size_t vertex_id) {
  // Relaxation.
  if (msg < g->vertex_value[vertex_id]) {
    g->vertex_value[vertex_id] = msg;
    // Distance reduced. Will send messages in the next step.
    g->active[vertex_id] = 1;
  } else {
    // Distance not changed. No msgs will be sent.
    g->active[vertex_id] = 0;
  }
}

161 Runtime 11/8/2018

162 Workflow Message Generation Message Buffer Message Processing
Vertex Update 11/8/2018

163 Message Buffer Design Condensed Static Buffer Vertex Grouping
Pre-allocated space Avoids frequent memory allocation Vertices sorted according to in-degree Each vertex is granted at least as many message slots as its in-degree Vertex Grouping Multiple consecutive vertices are grouped together The same group uses the same message array The length of the array equals the max in-degree in the group The width equals the number of vertices in the group Messages can be processed in SIMD For associative and commutative operations Moderate Memory Consumption Takes care of power-law graphs (a layout sketch follows below)
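One possible layout for such a buffer, as a sketch; the struct and field names are illustrative, not taken from the implementation.

// Condensed static message buffer: vertices are sorted by in-degree and grouped;
// each group owns a dense 2D message array (max in-degree rows x group width
// columns), so a group's messages can be reduced with column-wise SIMD operations.
struct VertexGroup {
    int    first_vertex;      // first vertex (after in-degree sorting) in the group
    int    width;             // number of vertices in the group
    int    max_in_degree;     // length of each message column
    float *messages;          // max_in_degree * width pre-allocated slots
    int   *column_of_vertex;  // column assigned to each vertex, -1 if none yet
    int    next_free_column;  // columns handed out left to right, avoiding bubbles
    int   *column_count;      // number of messages currently stored in each column
};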

164 Using Message Buffer One-to-one mapping Dynamic column allocation
Not all vertices receive msgs Lots of bubbles waste SIMD lanes Dynamic column allocation All columns in a group are dynamically allocated Columns are consumed from left to right continuously 11/8/2018

165 Message Insertion Using Buffers
Locking Based Concurrent writes to the same buffer column: use locks Contention overhead, especially for dense/power-law graphs Pipelining: avoids locks Each worker thread generates messages into its private message queues num_msg_queue = num_mover_threads qid = dst_id % num_mover_threads Mover thread tid moves messages from the queues with qid = tid into the message buffer Each queue is accessed by one writer and one reader, so no locking is needed Message insertion is lock-free Workers and movers run concurrently Pipelining using Workers + Movers (a queue sketch follows below)
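A sketch of the single-producer/single-consumer queue this relies on, together with the routing rule; the class and names are illustrative, not the system's actual data structure.

#include <atomic>
#include <vector>

struct Message { int dst_vertex; float value; };

// Minimal single-producer/single-consumer ring buffer: exactly one worker pushes
// and exactly one mover pops, so no locks are required.
class SPSCQueue {
    std::vector<Message> buf;
    std::atomic<size_t> head{0}, tail{0};
public:
    explicit SPSCQueue(size_t cap = 1024) : buf(cap) {}
    bool push(const Message &m) {                        // called only by the owning worker
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % buf.size();
        if (next == head.load(std::memory_order_acquire)) return false;   // queue full
        buf[t] = m;
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool pop(Message &m) {                               // called only by the owning mover
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;      // queue empty
        m = buf[h];
        head.store((h + 1) % buf.size(), std::memory_order_release);
        return true;
    }
};

// A worker routes each message to the queue read by mover (dst_vertex % num_movers);
// since each buffer column is then written by a single mover, insertion needs no locks.
inline size_t queue_index(int worker_id, int dst_vertex, int num_movers) {
    return static_cast<size_t>(worker_id) * num_movers + dst_vertex % num_movers;
}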

166 Pipelining Pros Cons Suitable for Avoids locking
Each queue is accessed by one worker(writer) and one mover(reader) Each msg buffer column is written by one mover Cons Introduces extra message storage cost Choice of optimal workers/movers configuration is non-trivial Suitable for Dense graphs Message intensive applications 11/8/2018

167 Inter-core Load Balancing
Message Generation & Vertex Updating Basic task unit: a vertex Dynamically retrieved by working threads in chunks Message Processing Basic task unit: a message array in the message buffer Dynamically allocated to processing threads 11/8/2018

168 CPU-MIC Coprocessing Same runtime code is used on both CPU and MIC
MPI symmetric computing Key Issue: graph partitioning Load balance Communication overhead 11/8/2018

169 CPU-MIC Coprocessing Load Balance Communication Overhead
Dynamic load balancing not feasible High data movement cost Static load balancing more practical User indicates the workload ratio between CPU and MIC (user estimates the relative speeds of both devices) System does graph partitioning based on the ratio Communication Overhead The less cross edges, the better 11/8/2018

170 CPU-MIC Graph Partitioning
Problem: partitioning the graph according to a ratio of a : b Continuous Partitioning (an intuitive way) Directly divide the vertices according to the partitioning ratio (a : b) Load imbalance: many graphs are power-law graphs, so the numbers of edges assigned to the devices do not follow the ratio Round-robin For every a + b vertices, assign the first a vertices to the CPU and the remaining b vertices to the MIC Good load balance High communication overhead due to cross edges between devices Hybrid Approach Partition the graph into min-connectivity blocks (using Metis) Assign blocks to CPU and MIC in a round-robin fashion in an (a : b) ratio Good load balance and fewer messages between devices (a sketch follows below)
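A sketch of the hybrid block assignment, assuming Metis has already produced the ordered list of blocks; the helper name is illustrative.

#include <vector>

// Assign Metis blocks to CPU (0) and MIC (1) in an (a : b) round-robin pattern:
// out of every a + b consecutive blocks, the first a go to the CPU and the
// remaining b go to the MIC. Min-connectivity blocks keep cross-device edges low,
// while the round-robin pattern keeps the per-device workload near the a : b ratio.
std::vector<int> assign_blocks(int num_blocks, int a, int b) {
    std::vector<int> owner(num_blocks);
    for (int i = 0; i < num_blocks; ++i)
        owner[i] = (i % (a + b) < a) ? 0 : 1;
    return owner;
}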

171 Experiments 11/8/2018

172 Experimental Results Platform Applications
A heterogeneous CPU-MIC node CPU: Xeon E5-2680, 16 cores, 2.70 GHz, 63 GB RAM MIC: Xeon Phi SE10P, 61 cores, 1.1 GHz, 4 hyperthreads/core, 8 GB Mpic++ 4.1, built on top of icpc Symmetric mode Applications PageRank, BFS, Semi-clustering (SC), SSSP, Topological Sort

173 Execution Modes CPU OMP: Multi-core CPU using OpenMP
MIC OMP: MIC execution with OpenMP CPU Lock: Multi-core CPU using framework, locking-based msg generation MIC Lock: MIC execution using our framework with locking-based msg generation CPU Pipe: Multi-core CPU using framework with pipelined msg generation MIC Pipe: MIC execution using our framework with pipelined msg generation (use 180 worker threads + 660 mover threads) CPU-MIC: Heterogeneous execution using CPU and MIC, with the best graph partitioning ratio Could not benefit from SIMD 11/8/2018

174 Overall Performance - PageRank
CPU CPU Lock is 30% faster than CPU Pipe (CPU does not suffer from contention) CPU Lock is as fast as CPU OMP MIC MIC Pipe is 2.32x faster than MIC Lock (MIC has much larger number of threads) MIC Pipe is 1.84x faster than MIC OMP CPU-MIC 1.30x faster than MIC Pipe 11/8/2018

175 Overall Performance - BFS
CPU CPU Lock is 1.45x faster than CPU Pipe CPU OMP is 1.07x faster than CPU Lock MIC MIC Lock is 1.22x faster than MIC Pipe (BFS is not message intensive) MIC Lock is 1.5x faster than MIC OMP CPU-MIC 1.32x faster than CPU Lock 11/8/2018

176 Overall Performance - TopoSort
CPU CPU Lock is 1.58x faster than CPU Pipe CPU OMP is 1.04x faster than CPU Lock MIC MIC Pipe is 3.36x faster than MIC Lock (TopoSort uses a dense graph - message intensive, contention intensive) MIC Pipe is 4.15x faster than MIC OMP CPU-MIC 1.2x faster than MIC Pipe 11/8/2018

177 Overall Performance CPU prefers Locking-based msg generation
Fewer threads → less contention Lower bandwidth, higher message storage overhead for pipelining MIC prefers Pipelined msg generation For message-intensive applications (e.g., PageRank and TopoSort) More threads → higher contention Higher parallel bandwidth, less I/O overhead for pipelining. More threads hide memory latency

178 Benefit from SIMD Three of the five applications involve reductions (in msg processing sub-step) Sub-step speedup: CPU: 2.22x – 2.35x MIC: 5.16x – 7.85x Overall speedup (depends on relative amount of time of the msg processing step): CPU: 1.08x – 1.13x MIC: 1.18x – 1.23x 11/8/2018

179 Effect of Hybrid Graph Partitioning
Partitioning ratio was chosen according to relative performance of single device executions Hybrid partitioning Communication time As low as Continuous partitioning Due to less cross edges between devices Execution time As low as round-robin partitioning Due to more balanced workload 11/8/2018

180 Summary Graph Processing Framework over CPU and MIC
Condensed Static Buffer Moderate memory consumption Support efficient SIMD message processing Pipelining Execution Flow Overlaps message generation and message insertion Reduces locking for Xeon Phi in message insertion Hybrid Graph Partitioning Maintains load balance and low communication overhead 11/8/2018

