Warp Size Impact in GPUs: Large or Small?
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria

This Work
 Accelerators
  o Accelerators amortize control flow over groups of threads (warps)
  o Warp size impacts performance (branch/memory divergence and memory access coalescing)
  o Small warp: low branch/memory divergence (+), low memory coalescing (-)
  o Large warp: high branch/memory divergence (-), high memory coalescing (+)
 Question: possible solutions?
  o Enhance coalescing in a small-warp machine (SW+), OR
  o Enhance divergence handling in a large-warp machine (LW+)
 Winner: SW+

Outline
 Branch/Memory Divergence
 Memory Access Coalescing
 Warp Size Impact
 Warp Size: Large or Small?
  o Use machine models to find the answer:
  o Small-Warp Coalescing-Enhanced Machine (SW+)
  o Large-Warp Control-flow-Enhanced Machine (LW+)
 Experimental Results
 Conclusions & Future Work

Warping
 Opportunities
  o Reduce scheduling overhead
  o Improve utilization of execution units (SIMD efficiency)
  o Exploit inter-thread data locality
 Challenges
  o Memory divergence
  o Branch divergence
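As a concrete point of reference (a sketch not taken from the paper; kernel and variable names are illustrative), the CUDA kernel below shows the kind of code being warped: every thread runs the same instruction stream on its own data element, and the hardware groups consecutive threads of a block into a warp that issues each instruction in lock-step.

__global__ void saxpy(float a, const float *x, const float *y, float *out, int n)
{
    // Consecutive threads of a block are grouped into one warp by the hardware;
    // the warp issues each instruction below once for all of its lanes.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = a * x[tid] + y[tid];
}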

Memory Divergence
 Threads of a warp may hit or miss in the L1 cache

J = A[S];   // L1 cache access
L = K * J;

(Diagram: within one warp, some threads hit in L1 while another misses; the whole warp stalls until the missing thread's data arrives.)
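A minimal sketch of how such a divergent load can arise in CUDA (illustrative names, not the benchmark's code): the per-thread index S[tid] can send different lanes of the same warp to different cache lines, so some lanes hit in L1 while others miss, and the warp stalls until the slowest lane is serviced.

__global__ void gather(const int *S, const float *A,
                       const float *K, float *L, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float j = A[S[tid]];   // data-dependent address: lanes may hit or miss in L1
        L[tid]  = K[tid] * j;  // cannot issue until every lane's load has returned
    }
}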

Branch Divergence
 A branch instruction can diverge into two different paths, dividing the warp into two groups:
  1. Threads with the taken outcome
  2. Threads with the not-taken outcome

if (J == K) {
    C[tid] = A[tid] * B[tid];
} else if (J > K) {
    C[tid] = 0;
}

(Diagram: the warp first executes one path with the other group's lanes masked off, then the other path with the first group masked off.)
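Written out as a complete kernel (a sketch in which J and K are assumed to be per-thread arrays so that lanes of one warp can take different outcomes), the fragment above diverges whenever a warp contains both outcomes; the hardware then serializes the two paths with the inactive lanes masked off.

__global__ void divergent(const int *J, const int *K,
                          const float *A, const float *B, float *C, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (J[tid] == K[tid]) {            // per-lane outcome: the warp may split
        C[tid] = A[tid] * B[tid];      // executed with the "else" lanes masked off
    } else if (J[tid] > K[tid]) {
        C[tid] = 0.0f;                 // executed afterwards with the "if" lanes masked off
    }
}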

Memory Access Coalescing
 Common memory accesses of neighboring threads are coalesced into one transaction

(Diagram: twelve per-thread accesses from three 4-thread warps collapse into five memory requests, A through E: one warp's accesses coalesce into requests A and B, another's into a single request C, and the third's into D and E.)
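Two illustrative kernels (assumed names) contrast access patterns: with unit stride, the addresses of neighboring lanes fall in the same memory segment and coalesce into a few transactions; with a large stride, each lane touches its own segment and the warp issues roughly one transaction per lane.

__global__ void copy_unit_stride(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];                    // neighboring lanes, neighboring addresses: coalesces well
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n)
        out[tid * stride] = in[tid * stride];  // addresses far apart: one transaction per lane
}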

Warp Size
 Warp size: the number of threads in a warp
 Small-warp advantages:
  o Less branch/memory divergence
  o Less synchronization overhead at every instruction
 Large-warp advantage:
  o Greater opportunity for memory access coalescing

Warp Size and Branch Divergence
 The smaller the warp size, the lower the branch divergence

if (J > K) {
    C[tid] = A[tid] * B[tid];
} else {
    C[tid] = 0;
}

(Diagram: for the same 8 threads, 2-thread warps keep all-taken and all-not-taken threads in separate warps, so no warp diverges; 4-thread warps mix both outcomes inside one warp, which diverges.)
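The effect can be checked with a small host-side C++ sketch (illustrative; the 8-thread outcome pattern is an assumption chosen to match the slide): it counts how many warps contain both taken and not-taken threads for a given warp size.

#include <cstdio>
#include <vector>

// A warp diverges if its threads do not all agree on the branch outcome.
static int divergedWarps(const std::vector<bool> &taken, int warpSize)
{
    int count = 0;
    for (size_t base = 0; base < taken.size(); base += warpSize) {
        bool anyTaken = false, anyNotTaken = false;
        for (size_t i = base; i < base + warpSize && i < taken.size(); ++i) {
            if (taken[i]) anyTaken = true; else anyNotTaken = true;
        }
        if (anyTaken && anyNotTaken) ++count;  // this warp serializes both paths
    }
    return count;
}

int main()
{
    // Assumed outcomes for 8 threads: the first six take the branch, the last two do not.
    std::vector<bool> taken = {true, true, true, true, true, true, false, false};
    printf("2-thread warps diverged: %d\n", divergedWarps(taken, 2));  // prints 0
    printf("4-thread warps diverged: %d\n", divergedWarps(taken, 4));  // prints 1
    return 0;
}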

Warp Size and Memory Divergence

(Diagram: with small warps, only the warp whose thread misses in L1 stalls; the other warps, whose threads all hit, keep issuing and hide the miss latency. With a large warp, a single missing thread stalls the entire warp.)

Warp Size and Memory Access Coalescing

(Diagram: the same access pattern generates 5 memory requests when issued by small warps but only 2 requests when issued by one large warp; wider coalescing reduces the number of memory accesses.)
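A host-side C++ sketch of the same counting argument (illustrative; 4-byte elements and 64-byte coalescing segments are assumptions): accesses are coalesced only within a warp, so for a unit-stride pattern the total number of memory transactions drops as the warp grows, until the warp already fills whole segments.

#include <cstdio>
#include <set>

// Total transactions for numThreads unit-stride accesses, coalescing only within a warp:
// each distinct memory segment touched by a warp costs one transaction.
static int totalTransactions(int numThreads, int warpSize, int elemBytes, int segmentBytes)
{
    int total = 0;
    for (int base = 0; base < numThreads; base += warpSize) {
        std::set<long> segments;
        for (int lane = 0; lane < warpSize && base + lane < numThreads; ++lane)
            segments.insert((long)(base + lane) * elemBytes / segmentBytes);
        total += (int)segments.size();
    }
    return total;
}

int main()
{
    for (int ws : {8, 16, 32, 64})
        printf("warp size %2d -> %d transactions for 64 threads\n",
               ws, totalTransactions(64, ws, 4, 64));   // prints 8, 4, 4, 4
    return 0;
}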

Warp Size Impact on Coalescing
 Often, increasing the warp size increases the coalescing rate

Warp Size Impact on Idle Cycles
 MU: larger warps → more branch/memory divergence → more idle cycles
 BKP: larger warps → better coalescing → fewer idle cycles

Warp Size Impact on Performance
 MU: larger warps → more divergence → lower performance
 BKP: larger warps → better coalescing → higher performance

Approach
 Baseline machine
 SW+: an ideal MSHR compensates for the coalescing loss of small warps
 LW+: MIMD lanes compensate for the divergence of large warps

SW+
 Warps as wide as the SIMD width
  o Low branch/memory divergence; improved latency hiding
 Compensating the coalescing loss → ideal MSHR
  o Compensates for the small-warp deficiency (memory access coalescing loss)
  o The ideal MSHR prevents redundant memory transactions by merging the redundant requests of warps on the same SM
  o Outstanding MSHR entries are searched to perform the merge
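A hedged, simulator-level sketch of the idea (not the authors' implementation; type and field names are assumptions): before allocating a new miss entry, the outstanding entries are searched for one covering the same cache block, and a matching request is merged instead of generating another memory transaction.

#include <cstdio>
#include <unordered_map>
#include <vector>

struct IdealMSHR {
    int blockBytes;
    // cache-block address -> warps waiting on that block
    std::unordered_map<long, std::vector<int>> outstanding;

    explicit IdealMSHR(int blockBytes) : blockBytes(blockBytes) {}

    // Returns true if a new memory transaction must be issued for this access.
    bool access(long addr, int warpId)
    {
        long block = addr / blockBytes;
        auto it = outstanding.find(block);
        if (it != outstanding.end()) {          // same block already in flight: merge
            it->second.push_back(warpId);
            return false;
        }
        outstanding[block].push_back(warpId);   // new miss: allocate an entry
        return true;
    }
};

int main()
{
    IdealMSHR mshr(64);
    printf("%d\n", (int)mshr.access(0x1000, 0));  // 1: issues a memory transaction
    printf("%d\n", (int)mshr.access(0x1010, 1));  // 0: same 64-byte block, merged
    return 0;
}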

LW+
 Warps 8x larger than the SIMD width
  o Improved memory access coalescing
 Compensating the divergence → lock-step MIMD execution
  o Compensates for the large-warp deficiency (branch/memory divergence)

Methodology
 Cycle-accurate GPU simulation using GPGPU-sim
  o Six memory controllers (76 GB/s)
  o 16 8-wide SMs (332.8 GFLOPS)
  o 1024 threads per core
  o Warp sizes: 8, 16, 32, and 64
 Workloads
  o RODINIA
  o CUDA SDK
  o GPGPU-sim

Coalescing Rate
 SW+: 86%, 58%, and 34% higher coalescing rate than 16, 32, and 64 threads/warp, respectively
 LW+: 37% and 17% higher coalescing rate than 16 and 32 threads/warp, and 1% lower than 64 threads/warp

Idle Cycles
 SW+: 11%, 6%, and 8% fewer idle cycles than 8, 16, and 32 threads/warp, respectively
 LW+: 1% more idle cycles than 8 threads/warp, and 4% and 2% fewer than 16 and 32 threads/warp

Performance
 SW+: outperforms LW+ by 11%, and 8, 16, and 32 threads/warp by 16%, 13%, and 20%, respectively
 LW+: outperforms 8, 16, 32, and 64 threads/warp by 5%, 1%, 7%, and 15%, respectively

Conclusion & Future Work
 Warp size impacts coalescing, idle cycles, and performance
 Investing in enhancing the small-warp machine (SW+) returns a higher gain than investing in enhancing the large-warp machine (LW+)
 Future work: evaluating the warp size impact on energy efficiency

Thank you! Questions?

Backup Slides

Coalescing Width
 The range of threads in a warp that are considered together for memory access coalescing
  o NVIDIA G80 → over a sub-warp
  o NVIDIA GT200 → over a half-warp
  o NVIDIA GF100 → over the entire warp
 When the coalescing width covers the entire warp, the optimal warp size depends on the workload

Warp Size and Branch Divergence (continued)

(Diagram: under divergence, the large warp ends up issuing a group of lanes that are all masked off, wasting cycles, whereas small warps never schedule an entirely inactive warp, saving some idle cycles.)

Warping
 Thousands of threads are scheduled with zero overhead
  o All thread contexts are kept on-core
 Tens of threads are grouped into a warp
  o They execute the same instruction in lock-step

Key Question
 Which warp size should be chosen as the baseline?
  o Then invest in augmenting the processor to remove the associated deficiency
 Machine models are used to find the answer

GPGPU-sim Configuration

NoC
  Total number of SMs: 16
  Number of memory controllers: 6
  Number of SMs sharing a network interface: 2
SM
  Threads per SM: 1024
  Maximum allowed CTAs per SM: 8
  Shared memory / register file size: 16KB / 64KB
  SM SIMD width: 8
  Warp size: 8 / 16 / 32 / 64
  L1 data cache: 48KB, 8-way, LRU, 64B per block
  L1 texture cache: 16KB, 2-way, LRU, 64B per block
  L1 constant cache: 16KB, 2-way, LRU, 64B per block
Clocking
  Core clock: 1300 MHz
  Interconnect clock: 650 MHz
  DRAM memory clock: 800 MHz
Memory
  Banks per memory controller: 8
  DRAM scheduling policy: FCFS

Workloads

Name                                                             | Grid Size                         | Block Size       | #Insn
BFS: BFS Graph [3]                                               | 16x(8,1,1)                        | 16x(512,1)       | 1.4M
BKP: Back Propagation [3]                                        | 2x(1,64,1)                        | 2x(16,16)        | 2.9M
DYN: Dyn_Proc [3]                                                | 13x(35,1,1)                       | 13x(256)         | 64M
FWAL: Fast Walsh Transform [6]                                   | 6x(32,1,1), 3x(16,1,1), (128,1,1) | 7x(256), 3x(512) | 11.1M
GAS: Gaussian Elimination [3]                                    | 48x(3,3,1)                        | 48x(16,16)       | 8.8M
HSPT: Hotspot [3]                                                | (43,43,1)                         | (16,16,1)        | 76.2M
MP: MUMmer-GPU++ [8]                                             | (1,1,1)                           | (256,1,1)        | 0.3M
MTM: Matrix Multiply [14]                                        | (5,8,1)                           | (16,16,1)        | 2.4M
MU: MUMmer-GPU [1]                                               | (1,1,1)                           | (100,1,1)        | 0.15M
NNC: Nearest Neighbor on CUDA [2]                                | 4x(938,1,1)                       | 4x(16,1,1)       | 5.9M
NQU: N-Queen [1]                                                 | (256,1,1)                         | (96,1,1)         | 1.2M
NW: Needleman-Wunsch [3]                                         | 2x(1,1,1) … 2x(31,1,1), (32,1,1)  | 63x(16)          | 12.9M
SC: Scan [14]                                                    | (64,1,1)                          | (256,1,1)        | 3.6M
SR1: Speckle Reducing Anisotropic Diffusion [3] (large dataset)  | 3x(8,8,1)                         | 3x(16,16)        | 9.1M
SR2: Speckle Reducing Anisotropic Diffusion [3] (small dataset)  | 4x(4,4,1)                         | 4x(16,16)        | 2.4M