Presentation is loading. Please wait.

Presentation is loading. Please wait.

Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach,

Similar presentations

Presentation on theme: "Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach,"— Presentation transcript:

1 Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek

2 UC Berkeley Future Application Drivers 2

3 UC Berkeley Compute Energy “Iron Law”  When power is constrained, need better energy efficiency for more performance  Where performance is constrained (real-time), want better energy efficiency to lower power Improving energy Efficiency is critical goal for all future systems and workloads 3 Performance = Power * Energy Efficiency (Tasks/Second) (Joules/Second) (Tasks/Joule)

4 UC Berkeley Good News: Moore’s Law Continues 4 “Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965 More Transistors/Chip Cheaper!

5 UC Berkeley Bad News: Dennard (Voltage) Scaling Over 5 Dennard Scaling Post-Dennard Scaling Moore, ISSCC Keynote, 2003

6 UC Berkeley 1 st Impact of End of Scaling: End of Sequential Processor Era 6

7 UC Berkeley Parallelism: A one-time gain Use more, slower cores for better energy efficiency. Either  simpler cores, or  run cores at lower Vdd/frequency  Even simpler general-purpose microarchitectures? -Limited by smallest sensible core  Even Lower Vdd/Frequency? -Limited by Vdd/Vt scaling, errors  Now what? 7

8 UC Berkeley [Muller, ARM CTO, 2009] 2 nd Impact of End of Scaling: “Dark Silicon” Cannot switch all transistors at full frequency! 8 No savior device technology on horizon. Future energy-efficiency innovations must be above transistor level.

9 UC Berkeley The End of General-Purpose Processors?  Most computing happens in specialized, heterogeneous processors -Can be X more efficient than general-purpose processor  Challenges: -Hardware design costs -Software development costs 9 NVIDIA Tegra2

10 UC Berkeley The Real Scaling Challenge: Communication As transistors become smaller and cheaper, communication dominates performance and energy 10 All scales:  Across chip  Up and down memory hierarchy  Chip-to-chip  Board-to-board  Rack-to-rack

11 UC Berkeley ASPIRE: From Better to Best  What is the best we can do? -For a fixed target technology (e.g., 7nm)  Can we prove a bound?  Can we design implementation approaching bound?  P rovably Optimal I mplementations 11 Specialize and optimize communication and computation across whole stack from applications to hardware

12 UC Berkeley Communication-Avoiding Algorithms: Algorithm Cost Measures 1. Arithmetic (FLOPS) 2. Communication: moving data between -levels of a memory hierarchy (sequential case) -processors over a network (parallel case). 12 CPU Cache DRAM CPU DRAM CPU DRAM CPU DRAM CPU DRAM

13 UC Berkeley Modeling Runtime & Energy 13

14 UC Berkeley A few examples of speedups  Matrix multiplication -Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication  QR decomposition (used in least squares, data mining, …) -Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10 -Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200 -Up to 13x on Tesla C2050 / Fermi, for 110k x 100 -Up to 4x on Grid of 4 cities (Dongarra, Langou et al) -“infinite speedup” for out-of-core on PowerPC laptop -LAPACK thrashed virtual memory, didn’t finish  Eigenvalues of band symmetric matrices -Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)  Iterative sparse linear equations solvers (GMRES) -Up to 4.3x on Intel Clovertown, 8 core  N-body (direct particle interactions with cutoff distance) -Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs. 14

15 UC Berkeley Modeling Energy: Dynamic 15

16 UC Berkeley Modeling Energy: Memory Retention 16

17 UC Berkeley Modeling Energy: Background Power 17

18 UC Berkeley Energy Lower Bounds 18

19 UC Berkeley Early Result: Perfect Strong Scaling in Time and Energy  Every time you add processor, use its memory M too  Start with minimal number of procs: PM = 3n 2  Increase P by factor c  total memory increases by factor c  Notation for timing model: -γ t, β t, α t = secs per flop, per word_moved, per message of size m T(cP) = n 3 /(cP) [ γ T + β t /M 1/2 + α t /(mM 1/2 ) ] = T(P)/c  Notation for energy model: -γ e, β e, α e = Joules for same operations -δ e = Joules per word of memory used per sec -ε e = Joules per sec for leakage, etc. E(cP) = cP { n 3 /(cP) [ γ e + β e /M 1/2 + α e /(mM 1/2 ) ] + δ e MT(cP) + εET(cP) } = E(P)  Perfect scaling extends to n-body, Strassen, … [IPDPS, 2013] 19

20 UC Berkeley C-A Algorithms Not Just for HPC  In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …  Initial results on lower bounds of database join algorithms 20

21 UC Berkeley From C-A Algorithms to Provably Optimal Systems?  1) Prove lower bounds on communication for a computation  2) Develop algorithm that achieves lower bound on a system  3) Find that communication time/energy cost is >90% of resulting implementation  4) We know we’re within 10% of optimal!  Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs 21

22 UC Berkeley ESP: An Applications Processor Architecture for ASPIRE  Future server and mobile SoCs will have many fixed- function accelerators and a general-purpose programmable multicore  Well-known how to customize hardware engines for specific task  ESP challenge is using specialized engines for general-purpose code 22 Intel Ivy Bridge (22nm) Qualcomm Snapdragon MSM8960 (28nm)

23 UC Berkeley ESP: Ensembles of Specialized Processors  General-purpose hardware, flexible but inefficient  Fixed-function hardware, efficient but inflexible  Par Lab Insight: Patterns capture common operations across many applications, each with unique communication & computation structure  Build an ensemble of specialized engines, each individually optimized for particular pattern but collectively covering application needs  Bet: Will give us efficiency plus flexibility -Any given core can have a different mix of these depending on workload 23

24 UC Berkeley Par Lab: Motifs common across apps 24 DenseGraphSparse … Audio Recognition Object Recognition Scene Analysis

25 UC Berkeley 25 Par Lab Apps Blue Cool Motif (nee “Dwarf”) Popularity (Red Hot / Blue Cool) Computing Domains

26 UC Berkeley Pipe-and-Filter Agent-and-Repository Event-based Bulk Synchronous Map-Reduce Layered Systems Model-view controller Arbitrary Task Graphs Puppeteer Model-View-Controller Application Graph Algorithms Dynamic programming Dense/Spare Linear Algebra Un/Structured Grids Graphical Models Finite State Machines Backtrack Branch-and-Bound N-Body Methods Circuits Spectral Methods Monte-Carlo Architecting Parallel Software Identify the Software Structure Identify the Key Computations

27 UC Berkeley Mapping Software to ESP: Specializers  Capture desired functionality at high-level using patterns in a productive high-level language  Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code  ASP specializer infrastructure, open-source download 27 ILP Engine Dense Engine Sparse Engine Graph Engine Glue Code Dense Code Sparse Code Graph Code DenseGraphSparse … Audio Recognition Object Recognition Scene Analysis

28 UC Berkeley Replacing Fixed Accelerators with Programmable Fabric  Future server and mobile SoCs will have many fixed- function accelerators and a general-purpose programmable multicore  Fabric challenge is retaining extreme energy efficiency while retaining programmability 28 Intel Ivy Bridge (22nm) Qualcomm Snapdragon MSM8960 (28nm)

29 UC Berkeley Strawman Fabric Architecture 29 M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R M A R  Will never have a C compiler  Only programmed using pattern-based DSLs  More dynamic, less static than earlier approaches  Dynamic dataflow-driven execution  Dynamic routing  Large memory support

30 UC Berkeley “Agile Hardware” Development  Current hardware design slow and arduous  But now have huge design space to explore  How to examine many design points efficiently?  Build parameterized generators, not point designs!  Adopt and adapt best practices from Agile Software -Complete LVS-DRC clean physical design of current version every ~ two weeks (“tapein”) -Incremental feature addition -Test & Verification first step 30

31 UC Berkeley Chisel: Constructing Hardware In a Scala Embedded Language  Embed a hardware-description language in Scala, using Scala’s extension facilities  A hardware module is just a data structure in Scala  Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation  Full power of Scala for writing hardware generators -Object-Oriented: Factory objects, traits, overloading etc -Functional: Higher-order funcs, anonymous funcs, currying -Compiles to JVM: Good performance, Java interoperability 31

32 UC Berkeley Chisel Design Flow 32 Chisel Program C++ code FPGA Verilog ASIC Verilog Software Simulator C++ Compiler Scala/JVM FPGA Emulation FPGA Tools GDS Layout ASIC Tools

33 UC Berkeley Chisel is much more than an HDL  The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL  But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric  Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)  Only ~6,000 lines of code in current version including libraries!  BSD-licensed open source at: 33

34 UC Berkeley Many processor tapeouts in few years with small group (45nm, 28nm) 34 Clock test site SRAM test site DCDC test site Processor Site CORE 0 VC0 CORE 1 VC1 CORE 2 VC2 CORE 3 VC3 512KB L2 VFIXED Test Sites

35 UC Berkeley Resilient Circuits & Modeling  Future scaled technologies have high variability but want to run with lowest-possible margins to save energy  Significant increase in soft errors, need resilient systems  Technology modeling to determine tradeoff between MTBF and energy per task for logic, SRAM, & interconnect. 35 Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors

36 UC Berkeley Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency 36 DenseGraphSparse … C++ Simulation FPGA Emulation Audio Recognition Object Recognition Scene Analysis Hardware Cache Coherence ASIC SoC FPGA Computer Communication-Avoiding Algorithms C-A GEMM C-A BFS C-A SpMV Pipe&FilterMap-Reduce … … ILP Engine Dense Engine Sparse Engine Graph Engine Local Stores + DMA Glue Code Dense Code Sparse Code Graph Code

37 UC Berkeley ASPIRE Project  Initial $15.6M/5.5 year funding from DARPA PERFECT program -Started 9/28/2012 -Located in Par Lab space + BWRC  Looking for industrial affiliates (see Krste!)  Open House today, 5 th floor Soda Hall 37 Research funded by DARPA Award Number HR Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.

Download ppt "Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency Elad Alon, Krste Asanovic (Director), Jonathan Bachrach,"

Similar presentations

Ads by Google