Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer
Outline
Motivation and Vision
Related Work and Background
Linear Algebra Processor
Power/Performance Analysis
Conclusion and Future Work
9/30/2014 ©Ardavan Pedram 2012
Trends in Processors
Technology scaling has reached physical limits – power now limits performance.
Chips may contain dark silicon – only a fraction of the chip can be active at any time.
Heterogeneous Solution
Increase power efficiency (GFLOPS/W):
– More cores at lower frequency and power
– Specialized cores: orders of magnitude better power efficiency (GFLOPS/W), but expensive, with a long time-to-market
(Figure: Nvidia Tegra system on chip)
Linear Algebra Processor Design Goals
Efficiency of full-custom hardware:
– Orders of magnitude improvement
– Approaching the upper limits of the power/performance ratio
Flexibility to execute a whole class of coarse-grain operations
Co-optimized and co-designed across all layers
Targeting linear algebra applications
Source: Andreas Olofsson
Linear Algebra Routines
Linear Algebra Package (LAPACK) level
– Cholesky and QR factorization
Basic Linear Algebra Subprograms (BLAS)
– General matrix-matrix multiplication (GEMM)
Inner kernels
– Hand-optimized
GEMM is often what delivers high performance to many crucial applications.
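As a reference point for the routines above, GEMM computes C += A·B. A minimal, unoptimized Python sketch (function and variable names are hypothetical, for illustration only):

```python
def gemm(C, A, B):
    """Naive GEMM: C += A*B for dense matrices stored as lists of rows."""
    m, k = len(A), len(A[0])   # A is m x k
    n = len(B[0])              # B is k x n, C is m x n
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C
```

Hand-optimized BLAS implementations compute exactly this, but restructured for the memory hierarchy and SIMD units.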
GEMM Implementations
CPUs: 95% of peak – [Goto et al. 2008], [Intel MKL]
– Intel quad-core: 40 GFLOPS @ 2.6 GHz
GPUs: 70% of peak – [Nath et al. 2010] (Nvidia Fermi), [Volkov et al. 2008] (Nvidia Tesla)
– Nvidia Fermi: 350 GFLOPS @ 1.15 GHz
FPGAs: 99% of peak – [Zikari et al. 2007], [Zhuo et al. 2008]
– Altera Stratix IV: 100 GFLOPS @ 0.4 GHz
Specialized architectures
– ClearSpeed CSX: 78% of peak; CSX700: 75 GFLOPS @ 0.25 GHz
– Systolic arrays: [Lippert et al. 2001]
Common Sources of Inefficiency in Conventional Architectures
CPUs & GPUs
– Instruction handling
– Multi-ported register file
– Cache overheads: tags and coherence
– Thread scheduling
FPGAs
– Low area efficiency
Specialized architectures
– Data communication overheads
Matrix Multiplication Hierarchy
The fastest general-purpose implementation of GEMM [GotoBLAS].
(Figure: layered blocking of C, A, and B.)
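The GotoBLAS hierarchy partitions the operands so that each loop level works on a block sized for one layer of the memory hierarchy. A simplified Python sketch of this layered blocking (block sizes, loop order, and names are illustrative assumptions, not the exact GotoBLAS kernel):

```python
def blocked_gemm(C, A, B, mc=2, kc=2, nc=2):
    """C += A*B with Goto-style blocking: nc-wide panels of B,
    kc x nc panels staged for cache, mc x kc blocks of A."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):              # panel of B columns
        for pc in range(0, k, kc):          # kc x nc panel of B
            for ic in range(0, m, mc):      # mc x kc block of A
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        for p in range(pc, min(pc + kc, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

In a real implementation, each blocked operand is also packed into contiguous buffers, and the innermost loops are a hand-tuned micro-kernel.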
Rank-1 Update
A rank-1 update adds the outer product of two vectors to a matrix.
Matrix multiplication as a series of rank-1 updates: let C, A, and B be 4×4, 4×k_c, and k_c×4 matrices. C += AB can be computed as:
for i = 0 to k_c−1: C += (column i of A) × (row i of B)
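The loop above can be sketched directly: each iteration adds one outer product (column i of A times row i of B) to all of C. An illustrative Python version (names are hypothetical):

```python
def gemm_rank1(C, A, B):
    """C += A*B expressed as k_c successive rank-1 updates."""
    m, n, kc = len(C), len(C[0]), len(B)
    for i in range(kc):            # one rank-1 update per iteration
        for r in range(m):
            for c in range(n):
                C[r][c] += A[r][i] * B[i][c]   # outer-product term
    return C
```

Note the loop interchange relative to the naive kernel: the reduction dimension is outermost, so every element of C is touched on every iteration — which is exactly what a grid of MAC units can do in parallel.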
Linear Algebra Core (LAC) Design
Customized for the rank-1 update
– 2D arrangement of PEs
– Broadcast buses
Integrates into the memory hierarchy
Memory Hierarchy
(Figures: GEMM staged through the memory hierarchy – main memory, on-chip memory, core-local stores.)
In main memory: C += A_0 B_0 + … + A_{K-1} B_{K-1}
In on-chip memory: C_i += A_{i,p} B_p
In core-local stores: C_{i,j} += A_{i,p} B_{p,j}
Design of the Linear Algebra Core (LAC)
Distributed-memory architecture
Broadcast buses
Data Mapping on LAC
PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
Mapping of a 16×16 matrix onto a 4×4 2D arrangement of PEs
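One way to realize the mapping shown is a 2D cyclic (round-robin) distribution — an assumption consistent with a 16×16 matrix leaving a 4×4 tile in each PE's local store. Hypothetical helpers:

```python
def pe_of(i, j, nr=4):
    """PE that owns matrix element (i, j) under a 2D cyclic mapping."""
    return (i % nr, j % nr)

def local_of(i, j, nr=4):
    """Coordinates of element (i, j) inside that PE's local store."""
    return (i // nr, j // nr)
```

With this mapping, a column of the matrix is spread across one column of PEs and a row across one row of PEs, which lines up with the row/column broadcast buses.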
Rank-1 Update (per-PE operations)
c11 += a1i × bi1   c12 += a1i × bi2   c13 += a1i × bi3   c14 += a1i × bi4
c21 += a2i × bi1   c22 += a2i × bi2   c23 += a2i × bi3   c24 += a2i × bi4
c31 += a3i × bi1   c32 += a3i × bi2   c33 += a3i × bi3   c34 += a3i × bi4
c41 += a4i × bi1   c42 += a4i × bi2   c43 += a4i × bi3   c44 += a4i × bi4
Orange: elements of A; Green: elements of B; Blue: elements of C
GEMM on LAP
Multi-LAC on Chip
The same panel of B is shared by all cores.
On-chip memory stores a complete n×n block of C.
Each core computes a different panel of C.
(Figure: LAC 0 / LAC 1 / LAC 2, each with its own memory.)
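The multi-LAC partitioning can be mimicked in software: each core owns a set of row panels of C (and the matching rows of A), while all cores read the same B. A sketch with a hypothetical round-robin row assignment (the real scheduler assigns whole panels):

```python
def multi_lac_gemm(C, A, B, n_cores=3):
    """C += A*B with rows of C partitioned round-robin across cores;
    every 'core' reuses the same shared B."""
    m, k, n = len(A), len(B), len(B[0])
    for core in range(n_cores):
        for i in range(core, m, n_cores):   # rows owned by this core
            for j in range(n):
                for p in range(k):
                    C[i][j] += A[i][p] * B[p][j]
    return C
```

Because B is shared, adding cores increases compute without duplicating the B traffic — the point of the slide's organization.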
Performance and Power Analysis
Analytical formulas for:
– Utilization
– Bandwidth
– Size of local stores
Cycle-accurate simulator:
– Matrix multiplication
– Cholesky factorization
Component selections:
– MAC units (45nm) [Galal et al. 2010]
– Storage modeled with [CACTI 6.0] (pure SRAM model)
– Interconnect: AMBA AHB [Lahiri 2004], [Wolkotte 2009]
– Component activity based on GEMM
– Leakage modeled as 25%–30% of dynamic power
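For intuition about the bandwidth formulas, a rough arithmetic-intensity model (an illustrative assumption, not the paper's exact formula): with an mc × nc block of C resident on chip, each rank-1 update streams mc + nc new words (a column of A, a row of B) while performing 2·mc·nc flops.

```python
def min_offchip_bw(gflops, mc, nc, bytes_per_word=8):
    """Rough minimum off-chip bandwidth (GB/s) to sustain `gflops`
    when an mc x nc block of C stays resident on chip."""
    words_per_flop = (mc + nc) / (2.0 * mc * nc)
    return gflops * words_per_flop * bytes_per_word
```

For 600 DP-GFLOPS with 256×256 blocks of C this gives 18.75 GB/s — the same order as the 33 GB/s figure quoted in the deck, which also covers traffic this simple model ignores.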
Core Utilization Trade-off
(Figures: bandwidth vs. local-memory size trade-off at 100% utilization; core-dimension trade-off.)
Multi-LAC Solution Trade-off
On-chip memory limits performance.
The on-chip bandwidth requirement grows exponentially to maintain peak performance.
Performance vs. External Bandwidth
With 33 GB/s off-chip bandwidth: over 600 DP-GFLOPS at over 90% utilization.
(Curves for 256×256 / 512×512 / 768×768 / 1024×1024 matrices.)
PE Efficiency at Different Frequencies
Area: mostly occupied by SRAM
Power: mostly consumed by the MAC units
120 GFLOPS/W – upper limit for a single-precision PE
60 GFLOPS/W – upper limit for a double-precision PE
1 GHz is the sweet spot of performance vs. efficiency
At low voltages, SRAM power consumption limits efficiency
LAP vs. Intel Core 2 Duo (Penryn)
Power breakdown [V. George et al. 2007]:
– Out-of-order logic and frontend: 40% of core power (over 5 W)
– Execution logic
– Register file
LAP vs. GTX280 (Nvidia Tesla), single-precision GEMM
LAP vs. GTX480 (Nvidia Fermi)
Summary of LAP
– 600/1200 DP/SP GFLOPS
– One/two orders of magnitude improvement vs. GPUs/CPUs
GEMM Performance and Efficiency on Different Platforms

Platform                  GFLOPS   W/mm²   GFLOPS/mm²   GFLOPS/W   Utilization
Cell BE (SP)                 200     0.3          1.5        5            88%
NVidia GTX480 SM (SP)        780     0.2          0.9        5.2          70%
NVidia GTX480 SM (DP)        390     0.2          0.5        2.6          70%
Intel Core-i7 960 (SP)        91    88.8          0.13       1.5          25%
Intel Core-i7 960 (DP)        41    84.1          8.4        0.20         65%
Altera Stratix IV (DP)       100     0.02         0.05       3.5          90+%
ClearSpeed CSX700 (DP)        75     0.02         0.2       12.5          78%
LAP (SP)                    1200     0.2          6–11      55            90+%
LAP (DP)                     600     0.2          3–5       25            90+%
Conclusion
Linear Algebra Processor:
– Algorithm/architecture co-design
– Power and efficiency estimation
– Generalized to more complex algorithms (Cholesky)
Results @ 1 GHz, double precision:
– 32 GFLOPS, 47 GFLOPS/W
– 0.6 W, 2.8 mm² in 45nm
– 4 GB/s external bandwidth
– Orders of magnitude improvement
Conclusion (cont.)
Studied conventional architectures and their sources of power consumption.
Future Work
Implementation
– Hardware synthesis
Generalization
– Level-3 BLAS
– LU and QR factorization
Integration within a general-purpose framework
Design-space exploration
– Picking the right algorithm variant