General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea, Fuat Keceli, Alexandros Tzannes, Uzi Vishkin

Benchmarks
(Paired values are given as CUDA / XMT.)
- Bfs: Breadth-First Search. CUDA source: Harish and Narayanan, Rodinia. Lines of code: 290 / 86. Dataset: 1M nodes, 6M edges. Parallel sections: 25 / 12. Threads per section: 1M / 87.4K.
- Bprop: Back Propagation. CUDA source: Rodinia. Lines of code: 960 / 522. Dataset: 64K nodes. Parallel sections: 2 / 65. Threads per section: 1.04M / 19.4K.
- Conv: Image Convolution. CUDA source: NVIDIA CUDA SDK. Lines of code: 283 / 87. Dataset: 1024x512 image. Parallel sections: 2 / 2. Threads per section: 131K / 512K.
- Msort: Merge-Sort. CUDA source: Thrust library. Lines of code: 966 / 283. Dataset: 1M keys. Parallel sections: 82 / 140. Threads per section: 32K / 10.7K.
- NW: Needleman-Wunsch. CUDA source: Rodinia. Lines of code: 430 / 129. Dataset: 2x2048 sequences. Parallel sections: 255 / 4. Threads per section: 192 / 1.1K.
- Reduct: Parallel Reduction. CUDA source: NVIDIA CUDA SDK. Lines of code: 481 / 59. Dataset: 16M elements. Parallel sections: 3 / 3. Threads per section: 5.5K / 44K.
- Spmv: Sparse matrix-vector multiply. CUDA source: Bell and Garland. Lines of code: 91 / 34. Dataset: 36K x 36K, 4M non-zeros. Parallel sections: 1 / 1. Threads per section: 30.7K / 36K.

Performance Comparison
With the 1024-TCU XMT configuration:
- 6.05x average speedup on irregular applications
- 2.07x average slowdown on regular applications
With the 512-TCU XMT configuration:
- 4.57x average speedup on irregular applications
- 3.06x average slowdown on regular applications
Case study: BFS on a low-parallelism dataset:
- 73.4x speedup over the Rodinia implementation
- 6.89x speedup over the UIUC implementation
- 110.6x speedup when using only 64 TCUs (lower latencies for the smaller design)

Experimental Platform
XMTSim: the cycle-accurate XMT simulator
- Timing modeled after the 64-TCU FPGA prototype
- Highly configurable; can simulate any configuration
- Modular design enables architectural exploration
- Part of the XMT Software Release: http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

SPAA'09: 10x over an Intel Core Duo with the same silicon area.
Current work: XMT outperforms the GPU on all irregular workloads and does not fall behind significantly on regular workloads, so there is no need to pay a high performance penalty for ease of programming. A promising candidate for the pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU.
Future work: power/energy comparison of XMT and GPU.

TESLA vs. XMT
Memory Latency Hiding and Reduction
- TESLA: heavy multithreading (requires large register files and a state-aware scheduler); limited local shared scratchpad memory; no coherent private caches at the SM or SP level.
- XMT: large globally shared cache; no coherent private TCU or cluster caches; software prefetching.
Memory and Cache Bandwidth
- TESLA: memory access patterns must be coordinated by the user for efficiency (request coalescing; see the CUDA sketch after this comparison); scratchpad memories are prone to bank conflicts.
- XMT: relaxed need for user-coordinated DRAM access thanks to the caches; address hashing avoids memory-module hotspots; high-bandwidth mesh-of-trees interconnect between clusters and caches.
Functional Unit (FU) Allocation
- TESLA: dedicated FUs for SPs and SFUs; less arbitration logic required; higher theoretical peak performance.
- XMT: heavy FUs (FPU and MDU) are shared through arbitrators; lightweight FUs (ALU, branch) are allocated per TCU; ALUs do not include multiply-divide functionality.
Control Flow and Synchronization
- TESLA: a single instruction cache and issue unit per SM; warps execute in lockstep, which penalizes diverging branches; efficient local synchronization and communication within blocks, but global communication is expensive; switching between serial and parallel modes (i.e., passing control from CPU to GPU) requires off-chip communication.
- XMT: one instruction cache and program counter per TCU enables independent progress of threads; thread coordination is performed via constant-time prefix-sum; other communication goes through the shared cache; dynamic hardware support for fast switching between serial and parallel modes and for load balancing of virtual threads.
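The request-coalescing point in the TESLA column is concrete enough to illustrate with code. Below is a minimal CUDA sketch (not part of the original poster; kernel names, sizes, and the stride value are illustrative assumptions): the first kernel lets consecutive threads read consecutive addresses, so each warp's loads coalesce into a few memory transactions, while the second scatters each warp's reads across the array and forces many more transactions for the same amount of work.

    // Hedged illustration, not from the poster: coalesced vs. strided
    // global-memory access, i.e. the "request coalescing" burden placed
    // on the CUDA programmer.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads touch consecutive addresses
        if (i < n) out[i] = in[i];
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long long j = ((long long)i * stride) % n;      // addresses scattered across the array
        if (i < n) out[i] = in[j];
    }

    int main() {
        const int n = 1 << 22;                          // 4M floats; size is illustrative
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        int threads = 256, blocks = (n + threads - 1) / threads;
        copy_coalesced<<<blocks, threads>>>(in, out, n);
        copy_strided<<<blocks, threads>>>(in, out, n, 33);
        cudaDeviceSynchronize();
        printf("kernels finished\n");
        cudaFree(in); cudaFree(out);
        return 0;
    }

Timing the two kernels (for example with cudaEvent timers) exposes the bandwidth gap; the poster's point is that XMT's shared cache and address hashing relax this burden on the programmer.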
Tested Configurations: GTX280 vs. XMT-1024
Principal computational resources (GTX280 / XMT-1024):
- Cores: 240 SP + 60 SFU / 1024 TCU
- Integer units: 240 ALU+MDU / 1024 ALU, 64 MDU
- Floating-point units: 240 FPU + 60 SFU / 64 FPU
On-chip memory (GTX280 / XMT-1024):
- Registers: 1920KB / 128KB
- Prefetch buffers: -- / 32KB
- Regular caches: 480KB / 4104KB
- Constant cache: 240KB / 128KB
- Texture cache: 480KB / --
The two configurations need equivalent area constraints (576 mm² in 65nm); one cannot simply set the number of functional units and the memory sizes to the same values. The area estimate for the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90nm IBM technology). The more area-intensive side is emphasized in each category.

Paraleap: XMT PRAM-on-chip silicon
Built FPGA prototype, announced at SPAA'07, using 3 FPGA chips: 2 Virtex-4 LX200 and 1 Virtex-4 FX100.
- Clock rate: 75 MHz
- DRAM size: 1GB; DRAM channels: 1; memory data rate: 0.6GB/s
- Number of cores (TCUs): 64; clusters: 8; cache modules: 8; shared cache: 256KB

XMT: Motivation and Background
Many-cores are coming, but 40 years of parallel computing have never produced a successful general-purpose parallel computer (easy to program, good speedups, up and down scalable). IF you could program it -> great speedups. XMT: fix the IF.
- XMT is designed from the ground up to address this for on-chip parallelism; tested HW and SW prototypes.
- Builds on PRAM algorithmics, the only really successful parallel algorithmic theory; a latent, though not widespread, knowledge base.
- Ease of programming is a necessary condition for the success of a general-purpose platform; it was in von Neumann's 1947 specs.

XMT: An Easy-to-Program Many-Core
Indications that XMT is easy to program:
1. XMT is based on a rich algorithmic theory (PRAM).
2. Ease of teaching as a benchmark:
   a. Parallel programming successfully taught to middle-school, high-school and older students.
   b. Evaluated by education experts (SIGCSE 2010).
   c. XMT found superior to MPI, OpenMP and CUDA.
3. Programmer's workflow for deriving efficient programs from PRAM algorithms.
4. DARPA HPCS productivity study: XMT development time was half that of MPI.

XMT Programming Model
- At each step, provide all instructions that can execute concurrently (not dependent on each other).
- PRAM/XMT abstraction: all such instructions execute immediately ("uniform cost").
- PRAM-like programming using reduced synchrony.
- Main construct: the spawn-join block, which can start any number of virtual threads at once.
- Virtual threads advance at their own speed, not in lockstep.
- Prefix-sum (ps): similar to an atomic fetch-and-add.

XMTC Programming Language
C with simple SPMD extensions:
- spawn: start any number of virtual threads
- $: unique thread ID
- ps/psm: atomic prefix-sum, with an efficient hardware implementation

XMTC Example: Array Compaction
Non-zero elements of A are copied into B; order is not necessarily preserved. After atomically executing ps(inc, base), base = base + inc and inc receives the original value of base, so each element is copied into a unique location in B. (A hypothetical CUDA counterpart is sketched after this example.)

    int A[N], B[N];
    int base = 0;
    spawn(0, N-1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);
            B[inc] = A[$];
        }
    }
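To make the XMTC/CUDA comparison concrete, here is a hypothetical CUDA counterpart of the array-compaction example above (not part of the original poster; all names and sizes are illustrative). atomicAdd on a global counter plays roughly the role that ps plays in XMTC: each thread holding a non-zero element atomically reserves a unique slot in B, and, as in the XMTC version, the output order is not preserved.

    // Hedged sketch, not from the poster: CUDA analogue of the XMTC
    // array-compaction example, using atomicAdd where XMTC uses ps.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void compact(const int *A, int *B, int *base, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && A[i] != 0) {
            int slot = atomicAdd(base, 1);  // like ps(inc, base) with inc == 1
            B[slot] = A[i];                 // unique slot; order not preserved
        }
    }

    int main() {
        const int n = 1 << 20;              // 1M elements; size is illustrative
        int *A, *B, *base;
        cudaMallocManaged(&A, n * sizeof(int));
        cudaMallocManaged(&B, n * sizeof(int));
        cudaMallocManaged(&base, sizeof(int));
        for (int i = 0; i < n; ++i) A[i] = (i % 3 == 0) ? i + 1 : 0;  // mix of zero and non-zero
        *base = 0;
        compact<<<(n + 255) / 256, 256>>>(A, B, base, n);
        cudaDeviceSynchronize();
        printf("%d non-zero elements copied into B\n", *base);
        cudaFree(A); cudaFree(B); cudaFree(base);
        return 0;
    }

Unlike a hardware prefix-sum primitive, contending atomicAdd updates to the same counter are serialized by the GPU memory system, which relates to the constant-time prefix-sum point in the TESLA vs. XMT comparison above.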

