
1 Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures Ardavan Pedram Robert van de Geijn Andreas Gerstlauer

2 Outline Motivation and Vision Related Work and Background Linear Algebra Processor Power/Performance Analysis Conclusion and Future Work 9/30/2014 ©Ardavan Pedram 2012

3 Outline Motivation and Vision Related Work and Background Linear Algebra Processor Power/Performance Analysis Conclusion and Future Work

4 Processor Trends Technology scaling has reached physical limits – Power now limits performance We may have dark silicon on the chip – Only a fraction of the chip can be active at a time

5 Heterogeneous Solution – Increase power efficiency (GFLOPS/W) – More cores at lower frequency and power – Specialized cores: orders of magnitude better power efficiency (GFLOPS/W), but expensive and with a long time to market (example: Nvidia Tegra System on Chip)

6 Linear Algebra Processor Design Goals Efficiency of full-custom hardware – Orders of magnitude improvement – Approaching the upper limits of the power/performance ratio Flexibility to execute a whole class of coarse-grain operations Co-optimized and co-designed across all layers Targeting linear algebra applications (Source: Andreas Olofsson)

7 Linear Algebra Routines Linear Algebra Package (LAPACK) level – Cholesky and QR factorization Basic Linear Algebra Subroutines (BLAS) – General matrix-matrix multiplication (GEMM) Inner kernels – Hand-optimized GEMM is often what delivers high performance to many crucial applications

8 Outline Motivation and Vision Related Work and Background Linear Algebra Processor Power/Performance Analysis Conclusion and Future Work

9 GEMM Implementations CPUs: 95% peak – [Goto et al. 2008][Intel MKL] – Intel quad core: 40 GFLOPS @ 2.6 GHz GPUs: 70% peak – [Nath et al. 2010] Nvidia Fermi – [Volkov et al. 2008] Nvidia Tesla – Nvidia Fermi: 350 GFLOPS @ 1.15 GHz FPGAs: 99% peak – [Zikari et al. 2007] – [Zhuo et al. 2008] – Altera Stratix IV: 100 GFLOPS @ 0.4 GHz Specialized architectures – ClearSpeed CSX: 78% peak (CSX 700: 75 GFLOPS @ 0.25 GHz) – Systolic arrays: [Lippert et al. 2001]

10 Common Sources of Inefficiency in Conventional Architectures CPUs & GPUs – Instruction handling – Multi-ported register file – Cache overheads: tags and coherency – Thread scheduling FPGAs – Low area efficiency Specialized architectures – Data communication overheads

11 Outline Motivation and Vision Related Work and Background Linear Algebra Processor Power/Performance Modeling Generalization Conclusion and Future Work

12 Matrix Multiplication Hierarchy Fastest general-purpose implementation of GEMM [GotoBLAS] (figure: blocking of C, A, and B across the memory hierarchy)
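The blocking idea behind this hierarchy can be sketched in Python. This is an illustrative simplification of GotoBLAS-style blocking, not the tuned library code; the block sizes mc and kc are placeholders:

```python
import numpy as np

def blocked_gemm(C, A, B, mc=64, kc=64):
    """Goto-style blocking sketch: compute C += A @ B block by block,
    so that each mc x kc block of A and each kc x n panel of B can
    stay in fast memory while it is reused."""
    m, k = A.shape
    for p in range(0, k, kc):            # loop over k: panels of B
        Bp = B[p:p + kc, :]
        for i in range(0, m, mc):        # loop over m: blocks of A
            C[i:i + mc, :] += A[i:i + mc, p:p + kc] @ Bp
    return C
```

Real implementations add further levels of blocking (for registers and each cache level) and a hand-optimized inner kernel, which is exactly what the hierarchy on this slide depicts.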

13 Rank-1 Update A rank-1 update adds the outer product of two vectors to a matrix. Matrix multiplication as a series of rank-1 updates: let C, A, and B be 4x4, 4xkc, and kcx4 matrices. Then C += AB can be computed as:
for i = 0 to kc-1
    C += (column i of A) x (row i of B)
end for
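The scheme on this slide can be written out in plain Python (a NumPy sketch of the algorithm, not the LAC hardware):

```python
import numpy as np

def gemm_by_rank1(C, A, B):
    """Compute C += A @ B as a sum of rank-1 updates: each step adds
    the outer product of column i of A and row i of B."""
    kc = A.shape[1]
    for i in range(kc):
        C += np.outer(A[:, i], B[i, :])   # one rank-1 update
    return C
```

Each iteration touches every element of C exactly once, which is what makes the update easy to distribute over a 2D grid of processing elements.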

14 Linear Algebra Core (LAC) Design Customized for rank-1 update – 2D arrangement of PEs – Broadcast buses Integrates into the memory hierarchy

15 Memory Hierarchy Main Memory -> On-Chip Memory -> Core Local Stores. Full computation: C += A_0 B_0 + … + A_(K-1) B_(K-1)

16 Memory Hierarchy On-chip memory holds blocks: C_i += A_(i,p) B_p

17 Memory Hierarchy Core local stores hold sub-blocks: C_(i,j) += A_(i,p) B_(p,j)

18 Memory Hierarchy (final staging step of the same figure)

19 Design of Linear Algebra Core (LAC) Distributed memory architecture Broadcast buses

20 Data Mapping on LAC
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Mapping of a 16x16 matrix A onto the 4x4 2D arrangement of PEs
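Assuming the mapping shown is the usual 2D round-robin (cyclic) distribution, it can be expressed in one line (a hypothetical helper, not from the paper):

```python
def pe_of(i, j, nr=4):
    """2D round-robin mapping sketch: element (i, j) of the matrix is
    stored in the local store of PE(i mod nr, j mod nr), so each PE
    holds an evenly strided sub-matrix."""
    return (i % nr, j % nr)
```

With this mapping, every row of PEs collectively owns every matrix row, which is what lets the broadcast buses feed a full rank-1 update in one step.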

22 Rank-1 Update
c11 += a1i x bi1   c12 += a1i x bi2   c13 += a1i x bi3   c14 += a1i x bi4
c21 += a2i x bi1   c22 += a2i x bi2   c23 += a2i x bi3   c24 += a2i x bi4
c31 += a3i x bi1   c32 += a3i x bi2   c33 += a3i x bi3   c34 += a3i x bi4
c41 += a4i x bi1   c42 += a4i x bi2   c43 += a4i x bi3   c44 += a4i x bi4
Orange: elements of A; Green: elements of B; Blue: elements of C

23 GEMM on LAP

24 Multi-LAC on Chip Same panel of B for all cores On-chip memory stores a complete n x n block of C Each core computes a different panel of C (figure: LAC 0/1/2, each with its own memory)
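The work split described above can be sketched as a simple row-panel partition (an illustrative helper; the panel boundaries are assumptions, not the paper's exact scheme):

```python
def core_panels(m, num_lacs):
    """Multi-LAC work split sketch: divide the m rows of C into
    contiguous panels, one per LAC. All LACs stream the same panel
    of B; each updates only its own panel of C."""
    base, extra = divmod(m, num_lacs)
    panels, start = [], 0
    for c in range(num_lacs):
        rows = base + (1 if c < extra else 0)  # spread the remainder
        panels.append((start, start + rows))
        start += rows
    return panels
```

Because B is shared, adding cores raises on-chip bandwidth demand for B broadcasts, which is the trade-off quantified on the following slides.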

25 Outline Motivation and Vision Related Work and Background Linear Algebra Core Power/Performance Analysis Conclusion and Future Work

26 Performance and Power Analysis Analytical formulae – Utilization – Bandwidth – Size of local stores Cycle-accurate simulator – Matrix multiplication – Cholesky factorization Component selections – MAC units (45nm) [Galal et al. 2010] – Storage modeled with CACTI 6.0 (pure SRAM model) – Interconnect: AMBA AHB [Lahiri 2004][Wolkotte 2009] – Component activity based on GEMM – Leakage modeled as 25-30% of dynamic power
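To give a flavor of the analytical formulae, the classic GEMM reuse argument yields a lower bound on off-chip bandwidth. This is a simplified textbook-style model, not the paper's exact formula:

```python
def min_offchip_bw_gb_per_s(peak_gflops, n, word_bytes=8):
    """If an n x n block of C stays on chip while panels of A and B
    are streamed in, roughly 1/n words of traffic are needed per flop,
    so sustaining peak_gflops requires about
    peak_gflops / n * word_bytes GB/s of off-chip bandwidth."""
    return peak_gflops / n * word_bytes
```

The model shows the lever directly: doubling the on-chip block dimension n halves the required external bandwidth, which is why the utilization plots sweep local-store size against bandwidth.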

27 Core Utilization Trade-off Bandwidth vs. local memory size trade-off at 100% utilization; core dimension trade-off

28 Multi-LAC Solution Trade-off On-chip memory size limits performance On-chip bandwidth requirement grows exponentially to maintain peak performance

29 Performance vs. External Bandwidth With 33 GB/s off-chip bandwidth: over 600 DP-GFLOPS at over 90% utilization (curves for 256x256 / 512x512 / 768x768 / 1024x1024)

30 PE Efficiency for Different Frequencies Area is mostly occupied by SRAM; power is mostly consumed by the MAC units 120 GFLOPS/W – upper limit for an SP PE 60 GFLOPS/W – upper limit for a DP PE 1 GHz is the sweet spot of performance vs. efficiency At low voltages, SRAM power consumption limits efficiency

31 LAP vs. Intel Core 2 Duo (Penryn) Power breakdown [V. George et al. 2007] – Out-of-order and front-end logic: 40% of core power (over 5 W) – Execution logic – Register file

32 LAP vs. GTX280 (Nvidia Tesla), Single-Precision GEMM

33 LAP vs. GTX480 (Nvidia Fermi)

34 Summary of LAP – 600/1200 DP/SP GFLOPS – One/two orders of magnitude improvement vs. GPUs/CPUs

35 GEMM Performance and Efficiency on Different Platforms

Platform                  GFLOPS  W/mm2  GFLOPS/mm2  GFLOPS/W  Utilization
Cell BE (SP)                 200   0.3      1.5          5        88%
NVidia GTX480 SM (SP)        780   0.2      0.9          5.2      70%
NVidia GTX480 SM (DP)        390   0.2      0.5          2.6      70%
Intel Core-i7 960 (SP)        96   0.4      0.5          1.2      95%
Intel Core-i7 960 (DP)        48   0.4      0.25         0.6      95%
Altera Stratix IV (DP)       100   0.02     0.05         3.5      90+%
ClearSpeed CSX700 (DP)        75   0.02     0.2         12.5      78%
LAP (SP)                    1200   0.2      6-11        55        90+%
LAP (DP)                     600   0.2      3-5         25        90+%

36 Outline Motivation and Vision Related Work and Background Linear Algebra Core Power/Performance Analysis Conclusion and Future Work

37 Conclusion Linear Algebra Processor – Algorithm/architecture co-design – Power and efficiency estimation – Generalized to more complex algorithms (Cholesky) – Results @ 1 GHz, DP: 32 GFLOPS at 47 GFLOPS/W, 0.6 W, 2.8 mm2 in 45 nm, 4 GB/s external BW – Orders of magnitude improvement

38 Conclusion Studied architectures and their sources of power consumption

39 Future Work Implementation – Hardware synthesis Generalization – Level-3 BLAS – LU and QR factorization Integration within a general-purpose framework Design space exploration – Picking the right algorithm variant

