Download presentation

Presentation is loading. Please wait.

Published byAllyssa Worden Modified over 2 years ago

1
1 M Clark

2
2 Contents GPU Computing GPUs for Radio Astronomy The problem is power Astronomy at the Exascale

3
3 The March of GPUs

4
4 What is a GPU? Kepler K20X (2012) 2688 processing cores 3995 SP Gflops peak Effective SIMD width of 32 threads (warp) Deep memory hierarchy As we move away from registers Bandwidth decreases Latency increases Limited on-chip memory 65,536 32-bit registers per SM 48 KiB shared memory per SM 1.5 MiB L2 Programmed using a thread model

5
5 Minimum Port, Big Speed-up Application Code + GPUCPU Only Critical Functions Rest of Sequential CPU Code

6
6

7
7 Strong CUDA GPU Roadmap SGEMM / W Normalized 20122014200820102016 TeslaCUDA FermiFP64 Kepler Dynamic Parallelism MaxwellDX12 Pascal Unified Memory 3D Memory NVLink 20 16 12 8 6 2 0 4 10 14 18

8
8 Introducing NVLINK and Stacked Memory NVLINK GPU high speed interconnect 80-200 GB/s Planned support for POWER CPUs Stacked Memory 4x Higher Bandwidth (~1 TB/s) 3x Larger Capacity 4x More Energy Efficient per bit

9
9 NVLink Enables Data Transfer At Speed of CPU Memory TESLAGPUCPU DDR Memory Stacked Memory NVLink 80 GB/s DDR4 50-75 GB/s HBM 1 Terabyte/s

10
10 Correlator Calibration & Imaging RF Samplers Real-TimeR-T, post R-T O(N) O(N 2 ) O(N log N) O(N) O(N 2 ) N digital Radio Telescope Data Flow

11
11 Where can GPUs be Applied? Cross correlation – GPU are ideal Performance similar to CGEMM High performance open-source library https://github.com/GPU-correlators/xGPU https://github.com/GPU-correlators/xGPU Calibration and Imaging Gridding - Coordinate mapping of input data to a regular grid Arithmetic intensity scales with kernel convolution size Compute-bound problem maps well to GPUs Dominant time sink in compute pipeline Other image processing steps CUFFT – Highly optimized Fast Fourier Transform library PFB – Computational intensity increases with number of taps Coordinate transformations and resampling

12
12 GPUs in Radio Astronomy Already an essential tool in radio astronomy ASKAP – Western Australia LEDA – United States of America LOFAR – Netherlands (+ Europe) MWA – Western Australia NCRA - India PAPER – South Africa LOFAR LEDA MWA ASKAP PAPER

13
13 Cross correlation is essentially GEMM Hierarchical locality Cross Correlation on GPUs

14
14 Correlator Efficiency 2012201420082010 X-engine GFLOPS per Watt Kepler Tesla Fermi Maxwell Pascal 64 32 16 8 4 2 1 >1 TFLOPS sustained >2.5 TFLOPS sustained 0.35 TFLOPS sustained 2016

15
15 Software Correlation Flexibility Why do software correlation? Software correlators inherently have a great degree of flexibility Software correlation can do on-the-fly reconfiguration Subset correlation at increased bandwidth Subset correlation at decreased integration time Pulsar binning Easy classification of data (RFI threshold) Software is portable, correlator unchanged since 2010 Already running on 2016 architecture

16
16 Power of 300 Petaflop CPU-only Supercomputer = Power for the city of San Francisco HPC’s Biggest Challenge: Power

17
17 GPUs Power World’s 10 Greenest Supercomputers Green500 Rank MFLOPS/WSite 14,503.17GSIC Center, Tokyo Tech 23,631.86Cambridge University 33,517.84University of Tsukuba 43,185.91Swiss National Supercomputing (CSCS) 53,130.95ROMEO HPC Center 63,068.71GSIC Center, Tokyo Tech 72,702.16University of Arizona 82,629.10Max-Planck 92,629.10(Financial Institution) 102,358.69CSIRO 371959.90Intel Endeavor (top Xeon Phi cluster) 491247.57Météo France (top CPU cluster)

18
18 The End of Historic Scaling C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

19
19 The End of Voltage Scaling In the Good Old Days Leakage was not important, and voltage scaled with feature size L’ = L/2 V’ = V/2 E’ = CV 2 = E/8 f’ = 2f D’ = 1/L 2 = 4D P’ = P Halve L and get 4x the transistors and 8x the capability for the same power The New Reality Leakage has limited threshold voltage, largely ending voltage scaling L’ = L/2 V’ = ~V E’ = CV 2 = E/2 f’ = 2f D’ = 1/L2 = 4D P’ = 4P Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area.

20
20 Major Software Implications Need to expose massive concurrency Exaflop at O(GHz) clocks O(billion-way) parallelism! Need to expose and exploit locality Data motion more expensive than computation > 100:1 global v. local energy Need to start addressing resiliency in the applications

21
21 How Parallel is Astronomy? SKA1-LOW specifications 1024 dual-pol stations => 2,096,128 visibilities 262,144 frequency channels 300 MHz bandwidth Correlator 5 Pflops of computation Data-parallel across visibilities Task-parallel across frequency channels O(trillion-way) parallelism

22
22 How Parallel is Astronomy? SKA1-LOW specifications 1024 dual-pol stations => 2,096,128 visibilities 262,144 frequency channels 300 MHz bandwidth Gridding (W-projection) Kernel size 100x100 Parallel across kernel size and visibilities (J. Romein) O(10 billion-way) parallelism

23
23 Energy Efficiency Drives Locality 20mm 64-bit DP 1000 pJ 28nm IC 256-bit access 8 kB SRAM 50 pJ 16000 pJ DRAM Rd/Wr 500 pJ Efficient off-chip link 20 pJ 26 pJ256 pJ 256 bits

24
24 Energy Efficiency Drives Locality

25
25 Energy Efficiency Drives Locality This is observable today We have lots of tunable parameters: Register tile size: how many much work should each thread do? Thread block size: how many threads should work together? Input precision: size of the input words Quick and dirty cross correlation example 4x4 => 8x8 register tiling 8.5% faster, 5.5% lower power => 14% improvement in GFLOPS / watt

26
26 Correlator Calibration & Imaging RF Samplers 8-bit digitizationO(100) PFLOPSO(10) PFLOPS N = 1024 digital SKA1 LOW Sketch 10 Tb/s 50 Tb/s

27
27 Energy Efficiency Drives Locality

28
28 Do we need Moore’s Law? Moore’s law come from shrinking process Moore’s law is slowing down Denard scaling is dead Increasing wafer costs means that it takes longer to move to the next process

29
29 Improving Energy Efficiency @ Iso-Process We don’t know how to build the perfect processor Huge focus on improved architecture efficiency Better understanding of a given process Compare Fermi vs. Kepler vs. Maxwell architectures @ 28 nm GF117: 96 cores, peak 192 Gflops GK107: 384 cores, peak 770 Gflops GM107: 640 cores, peak 1330 Gflops Use cross-correlation benchmark Only measure GPU power

30
30 Improving Energy Efficiency @ 28 nm 80% 55% 80%

31
31 How Hot is Your Supercomputer? 1. TSUBAME-KFC Tokyo Tech, oil cooled 4503 GFLOPS / watt 2. Wilkes Cluster U. Cambridge, air cooled 3631 GFLOPS / watt Number 1 is 24% more efficient than number 2

32
32 Temperature is Power Power is dynamic and static Dynamic power is work Static power is leakage Dominant term from sub-threshold leakage Voltage terms: V s : Gate to source voltage V th : Switching threshold voltage n: transistor sub-threshold swing coeff Device specifics: A : Technology specific constant L, W : device channel length & width Thermal Voltage: 8.62×10 −5 eV/K 26 mV at room temperature

33
33 Temperature is Power Tesla K20m GK110, 28nm Geforce GTX 580 GF110, 40nm

34
34 Tuning for Power Efficiency A given processor does not have a fixed power efficiency Dependent on Clock frequency Voltage Temperature Algorithm Tune in this multi-dimensional space for power efficiency E.g., cross-correlation on Kepler K20 12.96 -> 18.34 GFLOPS / watt Bad news: no two processors are exactly alike

35
35 Precision is Power Power scales with the square of the multiplier (approximately) Most computation done in FP32 / FP64 Should use the minimum precision required by science needs Maxwell GPUs have 16-bit integer multiply-add at FP32 rate Algorithms should increasingly use hierarchical precision Only invoke in high precision when necessary Signal processing folks known this for a long time Lesson feeding back into the HPC community...

36
36 Conclusions Astronomy has insatiable amount of compute Many-core processors are a perfect match to the processing pipeline Power is a problem but Astronomy has oodles of parallelism Key algorithms possess locality Precision requirements are well understood Scientists and Engineers wedded to the problem Astronomy is perhaps the ideal application for the exascale

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google