1 The Hybrid Model: Experiences at Extreme Scale
Benjamin Welton
University of Wisconsin Petascale Tools Workshop, Madison, WI, August 4-7, 2014

2 The Hybrid Model
o TBON + X: leveraging TBONs, GPUs, and CPUs in large-scale computation
o The combination creates a new computational model with new challenges
o Management of multiple devices, local-node load balancing, and node-level data management
o Traditional distributed-systems problems get worse: cluster-wide load balancing, I/O management, and debugging

3 MRNet and GPUs
o To gain more experience with GPUs at scale, we built a leadership-class application called Mr. Scan
o Mr. Scan is a density-based clustering algorithm utilizing GPUs
o The first application able to cluster multi-billion-point datasets
o Uses MRNet as its distribution framework
o However, we ran into challenges: load balancing, debugging, and I/O inhibited performance and increased development time

4 Density-Based Clustering
o Discovers the number of clusters
o Finds oddly-shaped clusters

5 Clustering Example (DBSCAN [1])
o Goal: find regions that meet minimum density and spatial distance characteristics
o Two parameters determine whether a point is in a cluster: Epsilon (Eps) and MinPts
o If the number of points within Eps of a point is greater than MinPts, that point is a core point
o For every discovered point, the same calculation is performed until the cluster is fully expanded
[Figure: a point with its Eps-radius neighborhood; MinPts = 3]
[1] M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise, 1996
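The core-point test above can be sketched as follows. This is a minimal CPU-only illustration, not Mr. Scan's implementation; the function and variable names are ours, and we follow the slide's strictly-greater-than rule (some DBSCAN formulations use greater-or-equal).

```python
import math

def region_query(points, p, eps):
    """Return the neighbors of point p within distance eps (including p itself)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_pts):
    """Core-point test from the slide: p is a core point if more than
    min_pts points lie within eps of it."""
    return len(region_query(points, p, eps)) > min_pts
```

For example, with `eps = 1.5` and `MinPts = 3`, a point in a tight square of four points is a core point, while an isolated point is not.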

6 MRNet: Multicast / Reduction Network
o General-purpose TBON API
o Network: user-defined topology
o Stream: logical data channel to a set of back-ends; multicast, gather, and custom reduction
o Packet: collection of data
o Filter: stream data operator for synchronization and transformation
o Widely adopted by HPC tools: CEPBA toolkit, Cray ATP & CCDB, Open|SpeedShop & CBTF, STAT, TAU
[Diagram: FE at the root, a CP layer, and BEs running the app; filter F(x1,…,xn) applied at internal nodes]

7 Computation in a Tree-Based Overlay Network
o Adjustable for load balance
o Output sizes MUST be constant or decreasing at each level for scalability
o MRNet provides this process structure
[Diagram: FE above CPs above BEs; each BE emits 10 MB, and the total size of packets at each higher level stays ≤ 10 MB]
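The constant-or-decreasing-output property above can be illustrated with a toy reduction filter. This is our own sketch, not MRNet's API: each filter merges its children's fixed-size summaries (here, histograms) into one summary of the same size, so packet volume never grows as data moves toward the front end.

```python
def merge_filter(packets):
    """TBON-style reduction filter: combine children's fixed-size histograms
    into one histogram of the same size, so output size per level never grows."""
    bins = len(packets[0])
    out = [0] * bins
    for p in packets:
        assert len(p) == bins, "all packets must have the same fixed size"
        for i, v in enumerate(p):
            out[i] += v
    return out

def reduce_tree(leaf_packets, fanout=2):
    """Apply the filter level by level from the leaves (BEs) up to the FE."""
    level = leaf_packets
    while len(level) > 1:
        level = [merge_filter(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]
```

Because every filter output has the same size as a single input packet, the tree scales with the number of back-ends, which is exactly the property the slide says MRNet requires.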

8 MRNet Hybrid Computation
o A hybrid computation includes GPU processing elements alongside traditional CPU elements
o In MRNet, GPUs were included as filters
o A combination of CPU and GPU filters was used in MRNet
[Diagram: FE, CPs, and BEs running the app; filter F(x1,…,xn)]

9 Intro to Mr. Scan
Mr. Scan phases:
o Partition: distributed
o DBSCAN: GPU (on the BEs)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)
[Diagram: the FS feeds the tree; DBSCAN runs on the BEs, Merge at the CPs and FE, Sweep at the FE]

10 Mr. Scan SC 2013 Performance
Clustering 6.5 billion points, total time 18.2 minutes:
o Partitioner: FS read 224 s, FS write 489 s
o MRNet startup: 130 s
o DBSCAN: FS read 24 s, DBSCAN 168 s
o Merge & Sweep: merge 6 s, sweep 4 s, write output 19 s

11 Load Balancing Issue
o In initial testing, load imbalance between nodes was a significant limiting factor in performance
o It increased the run time of Mr. Scan's computation phase by a factor of 10, adding 25 minutes to the computation
o Input point counts did not correlate with run times on a specific node
o Resolving the load-balance problem required numerous Mr. Scan-specific optimizations: algorithmic tricks like Dense Box and heuristics in data partitioning

12 Partition Phase
o Goal: partitions that are computationally equivalent for DBSCAN
o Algorithm:
o Form initial partitions
o Add shadow regions
o Rebalance

13 Distributed Partitioner

14 GPU DBSCAN Computation
The DBSCAN computation is performed in two distinct steps on the leaf nodes of the tree:
o Step 1: Detect core points
o Step 2: Expand core points and color
[Diagram: each step runs across thread blocks 1…900, each with threads T1…T512]
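The two-step structure can be sketched sequentially as below. This is our own serial illustration of the algorithm's shape, not the GPU kernel: on the device, step 1 would run one thread per point across the blocks shown in the diagram, and step 2 would expand clusters in parallel.

```python
def dbscan_two_step(points, eps, min_pts):
    """Serial sketch of the two GPU steps: (1) flag core points,
    (2) expand clusters outward from core points ("color")."""
    n = len(points)
    # Precompute neighbor lists by brute force (on the GPU, one thread per point).
    neighbors = [[j for j in range(n)
                  if (points[i][0] - points[j][0]) ** 2 +
                     (points[i][1] - points[j][1]) ** 2 <= eps * eps]
                 for i in range(n)]
    # Step 1: detect core points (slide's "more than MinPts within Eps" rule).
    core = [len(nb) > min_pts for nb in neighbors]
    # Step 2: expand each unlabeled core point into a cluster.
    labels = [-1] * n  # -1 = noise / unassigned
    cluster = 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            frontier = [i]
            labels[i] = cluster
            while frontier:
                p = frontier.pop()
                for q in neighbors[p]:
                    if labels[q] == -1:
                        labels[q] = cluster
                        if core[q]:          # only core points keep expanding
                            frontier.append(q)
            cluster += 1
    return labels
```

Border points (reachable but not core) get a label without expanding further, which is the order-dependent assignment the debugging slides return to later.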

15 The DBSCAN Density Problem
o Imbalances in point density can cause huge differences in run times between thread groups inside a GPU (10-15x variance in time)
o The issue is caused by the lookup operation for a point's neighbors in the DBSCAN point-expansion phase
o Higher density means a higher neighbor count, which increases the number of comparison operations

16 Dense Box
o Dense Box eliminates the need to perform neighbor lookups on points in dense regions by labeling points as members of a cluster before DBSCAN is run
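One common grid-based formulation of this idea is sketched below; Mr. Scan's exact dense-box criteria may differ, and the names here are ours. The key geometric fact: in a square cell of side Eps/√2, any two points are within Eps of each other, so a cell holding at least MinPts points consists entirely of core points of one cluster, and DBSCAN can skip their neighbor lookups.

```python
import math
from collections import defaultdict

def dense_box_prelabel(points, eps, min_pts):
    """Pre-label points in dense grid cells before DBSCAN runs.
    Cells have side eps/sqrt(2), so every pair of points sharing a cell
    is within eps; a cell with >= min_pts points is all core points."""
    side = eps / math.sqrt(2)
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // side), int(y // side))].append(idx)
    prelabeled = {}          # point index -> provisional cluster id
    cluster = 0
    for members in cells.values():
        if len(members) >= min_pts:
            for idx in members:
                prelabeled[idx] = cluster
            cluster += 1
    return prelabeled
```

Pre-labeled points never enter the expensive expansion-phase neighbor search, which is what removes the density-driven runtime variance described on the previous slide (adjacent dense cells still need a later merge of their provisional IDs).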

17 Challenges of the Hybrid Model
o Debugging: difficult to detect incorrect output without writing application-specific verification tools
o Load balancing: GPUs increased the difficulty of balancing load both cluster-wide and on a local node (due to large variance in run times with identically sized inputs); an application-specific solution was required
o Existing distributed-framework components stressed: the increased computational performance of GPUs stresses other, non-accelerated components of the system (such as I/O)

18 Debugging Mr. Scan
o Result verification was complicated because:
o CUDA warp scheduling is not deterministic
o Packet reception order is not deterministic in MRNet
o Both issues altered the output slightly
o DBSCAN's non-core-point cluster selection is order dependent
o Output cluster IDs would vary based on packet processing order in MRNet
o Easy verification of output, such as a bitwise comparison against a known correct output, was not possible
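The cluster-ID problem above means a verifier must compare clusterings up to a renaming of labels rather than byte for byte. A minimal sketch of such a check (our own, not one of the Mr. Scan tools) looks for a consistent bijection between the two label sets:

```python
def same_partition(labels_a, labels_b):
    """Return True if the two labelings induce the same partition of the
    points, i.e. they differ only by a renaming of cluster IDs."""
    if len(labels_a) != len(labels_b):
        return False
    fwd, bwd = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in A must map to exactly one label in B, and vice versa.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True
```

Note this only absorbs ID permutations; border points legitimately assigned to different clusters by different processing orders would still need application-specific tolerance, which is part of why generic verification is hard.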

19 Debugging Mr. Scan
o We had to write verification tools to run after each run to ensure the output was still correct
o Very costly in terms of both programmer time (to write the tools) and wall-clock run time
o Worst of all, the tools used for verification are DBSCAN specific
o Generic solutions are badly needed for increased productivity

20 Load Balancing
o Load balancing between nodes proved to be a significant and serious issue
o Identical input sizes would result in vastly differing run times (by up to 10x)
o Without the load-balancing work, Mr. Scan would not have scaled
o An application-specific GPU load-balancing system was implemented
o No existing frameworks could help with balancing GPU applications

21 Other Components
o GPU use revealed flaws that were hidden in the original non-GPU implementation of Mr. Scan
o I/O, start-up, and other components of the system impacted performance greatly, accounting for a majority of Mr. Scan's run time
o Solutions to these issues that scaled for a CPU-based application might not scale for a GPU-based application

22 Work in Progress
o We are currently looking at ways to perform load balancing/sharing in GPU applications in a generic way
o We are looking at methods that do not change the distributed models used by applications and require no direct vendor support
o Getting users or hardware vendors to make massive changes to their applications/hardware is hard

23 Questions?

24 Characteristics of an Ideal Load-Balancing Framework
o Requires as few changes to existing applications as possible: we cannot expect application developers to give up MPI, MapReduce, TBONs, or other computational frameworks to solve load imbalance
o Takes advantage of the fine-grained computation decomposition we see with GPUs/accelerators: coarse-grained solutions (such as moving entire kernel invocations or processes) limit the options for balancing load
o Plays by the hardware vendors' "rules": we cannot rely on support from hardware vendors for a distributed framework

25 An Idea: Automating Load Balancing
o Have a layer above the GPU but below the user application framework to manage and load-balance GPU computations across nodes
o The GPU manager would execute user application code on the device while attempting to share load with idle GPUs
[Diagram: User Application (MPI/MRNet/MapReduce/etc.) above GPU Manager above GPU Device]

26 An Idea: A Load-Balancing Service
o The application supplies CUDA functions (PTX, CUBIN); the program is sent to the device and a pointer to the function is saved
o Argument data for the functions is passed to the manager and forwarded to the device
o The application asks to run a function binary, supplying a data stride and a number of compute blocks; the compute blocks (function pointer + data offset) are created and added to a queue
o A persistent kernel on the GPU pulls from this queue and executes the user's function
o At the completion of all queued blocks, the results are returned
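The register/enqueue/drain flow above can be sketched as a host-side model. This is purely illustrative of the proposed design, not an implementation: the class and method names are ours, a plain Python callable stands in for the PTX/CUBIN function binary, and a loop stands in for the persistent SIMD kernel draining the queue on the device.

```python
from collections import deque

class GPUManagerSketch:
    """Host-side model of the proposed manager: register a 'function binary',
    enqueue compute blocks as (function, data offset) entries, then let a
    persistent worker drain the queue and collect results."""

    def __init__(self):
        self.functions = {}    # name -> callable (stands in for PTX/CUBIN)
        self.queue = deque()   # pending compute blocks
        self.results = []

    def load_function(self, name, fn):
        self.functions[name] = fn

    def launch(self, name, data, stride, num_blocks):
        # One queue entry per compute block: function + data offset + stride.
        for b in range(num_blocks):
            self.queue.append((name, data, b * stride, stride))

    def run_persistent_kernel(self):
        # Stands in for the persistent kernel pulling blocks off the queue.
        while self.queue:
            name, data, off, stride = self.queue.popleft()
            self.results.append(self.functions[name](data[off:off + stride]))
        return self.results
```

Because each queue entry is just a function pointer plus a data offset, an idle node can steal entries and refetch the corresponding data slice, which is the fine-grained sharing the next slide describes.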

27 An Idea: A Load-Balancing Service
o On detection of an idle GPU, load is shared between nodes:
o The user binary is transferred to the new host and sent to its GPU
o Data for the compute blocks is copied to the GPU
o The block is moved, updating its data offset
o The block is executed and the result is returned to the originating node

