1 Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms
Chita R. Das, High Performance Computing Laboratory, Department of Computer Science & Engineering, The Pennsylvania State University. EEHiPC, December 19, 2010

2 Talk Outline
• Technology Scaling Challenges
• State-of-the-art Design Challenges
• Opportunity: Heterogeneous Architectures
  – Technology: 3D, TFET, optics, STT-RAM
  – Processor: new devices, core heterogeneity
  – Memory: STT-RAM, PCM, etc.
  – Interconnect: network heterogeneity
• Conclusions

3 Computing Walls: Moore's Law
[Figure: transistor count scaling per Moore's Law. Data from ITRS 2008]

4 Computing Walls: Utilization and Power Wall
Dynamic power P ≈ CV²f, so lowering V reduces P, but speed also decreases with V. High-performance MOS started out with a 12 V supply; current high-performance microprocessors use a 1 V supply, i.e., (12/1)² = 144× power savings over 28 years. Only (1/0.6)² ≈ 2.8× is left in the next 12 years. Data from ITRS 2008.
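A quick check of the voltage-scaling arithmetic above, as a minimal Python sketch; it assumes only the quadratic V-dependence of P ≈ CV²f, with C and f held fixed:

```python
# Headroom left from supply-voltage scaling, assuming only the quadratic
# V-dependence of dynamic power P ~ C * V^2 * f, with C and f held fixed.

def power_ratio(v_old: float, v_new: float) -> float:
    """Factor by which dynamic power drops when the supply scales v_old -> v_new."""
    return (v_old / v_new) ** 2

print(power_ratio(12.0, 1.0))  # ~144x achieved over ~28 years
print(power_ratio(1.0, 0.6))   # ~2.8x of headroom left in the next 12 years
```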

5 Computing Walls: Memory Bandwidth
Pin count increases only 4× compared to a 25× increase in core count. Data from ITRS 2009.

6 Computing Walls: Reliability Wall
Failure rate per transistor must decrease exponentially as we go deeper into the nanometer regime. Data from ITRS 2007.

7 Computing Walls: Global wires no longer scale

8 State-of-the-Art in Architecture Design
A strawman multi-core processor in 16 nm technology: 25K 64b FPUs, 37.5 TFLOPS, 150 W (compute only).
• 64b FPU: 0.015 mm², 4 pJ/op, 3 GHz
• 64b word over a 1 mm on-chip channel: 2 pJ/word
• Die: 20 mm × 10 mm; moving 64b across 10 mm: 20 pJ, 4 cycles
• 64b off-chip channel: 64 pJ/word
The energy required to move a 64b word across the die is equivalent to the energy for 10 FLOPs. Traditional designs have approximately 75% of energy consumed by overhead.
Performance = Parallelism; Efficiency = Locality.
Source: Bill Harrod, DARPA IPTO, 2009

9 Energy Cost of Operations (Dally)
Operation                          Energy (pJ)
64b floating-point FMA (2 ops)     100
64b integer add                    1
Write 64b DFF                      0.5
Read 64b register (64×32 bank)     3.5
Read 64b RAM (64×2K)               25
Read tags (24×2K)                  8
Move 64b 1 mm                      6
Move 64b 20 mm                     120
Move 64b off-chip                  256
Read 64b from DRAM                 2000

10 Energy Cost for Different Operations
Operation                    Energy (pJ)   DP FLOPs   Insts*
I$ fetch                     33            0.67       2.0
Register access              10.5          0.2        0.6
Access 3 operands in D$      100           2          6
Access 3 operands in L2 D$   460           9          27
Access 3 operands off-chip   762           15         45
Access 3 operands from DRAM  6000          120        360
Energy is dominated by data and instruction movement.
* The Insts column gives the number of average instructions that could be performed for this energy.
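The equivalence columns follow from dividing each access energy by a per-unit cost; a minimal sketch of that conversion is below. The ~50 pJ/DP FLOP and ~16.7 pJ/instruction unit costs are assumptions inferred from the table's ratios, not values stated on the slide:

```python
# Convert each access energy into "how many DP FLOPs / average instructions
# could this energy have paid for". The per-unit costs below are assumptions
# inferred from the table's ratios (~50 pJ/DP FLOP, ~16.7 pJ/instruction).
PJ_PER_DP_FLOP = 50.0
PJ_PER_INSTRUCTION = 16.7

operations_pj = {
    "I$ fetch": 33,
    "register access": 10.5,
    "3 operands in D$": 100,
    "3 operands in L2 D$": 460,
    "3 operands off-chip": 762,
    "3 operands from DRAM": 6000,
}

for name, pj in operations_pj.items():
    flops = pj / PJ_PER_DP_FLOP
    insts = pj / PJ_PER_INSTRUCTION
    print(f"{name}: {pj} pJ ~ {flops:.2f} DP FLOPs ~ {insts:.1f} instructions")
```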

11 Conventional Architecture (90 nm): Energy is dominated by overhead
[Chart: energy per operation broken down into FPU, local, global, off-chip, DRAM, and overhead components (values on the order of 10⁻¹⁰ to 10⁻⁸ J); the overhead component dominates. Source: Dally]

12 Where is this overhead coming from?
• Complex microarchitecture: out-of-order execution, register renaming, branch prediction, ...
• Complex memory hierarchy
• High/unnecessary data movement
• Orthogonal design style
• Limited understanding of application requirements

13 Both Put Together...
Power (joules/operation) becomes the deciding factor for designing HPC systems; hardware acquisition cost no longer dominates the TCO.

14 IT Infrastructure Optimization: The New Math
[Chart: worldwide server installed base (M units) and spending (US$B), 1996-2010, split into new server spending, server management cost, and power & cooling costs; the latter two are becoming comparable to new server spending. Source: IDC]
Until now: minimize equipment, software/license, and service/management costs. Going forward: the power and physical infrastructure costs to house the IT become equally important.
Power cascade (source: Emerson): 1 W consumed at the server component grows to roughly 2.84 W of total consumption after adding DC-DC conversion (0.18 W), AC-DC conversion (0.31 W), power distribution (0.04 W), UPS (0.14 W), cooling (1.07 W), and building switchgear/transformer (0.10 W).
Become "greener" in the process.
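A minimal sketch of the cascade arithmetic above, using the per-stage overheads from the slide's Emerson figure:

```python
# Cumulative power for 1 W consumed at the server component, adding each
# overhead stage in order (values in watts, from the Emerson breakdown).
stages = [
    ("server component", 1.00),
    ("DC-DC conversion", 0.18),
    ("AC-DC conversion", 0.31),
    ("power distribution", 0.04),
    ("UPS", 0.14),
    ("cooling", 1.07),
    ("building switchgear/transformer", 0.10),
]

total = 0.0
for name, watts in stages:
    total += watts
    print(f"after {name}: {total:.2f} W")  # ends at ~2.84 W total
```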

15 A Holistic Green Strategy
• Facilities: operations, office spaces, factories, travel and transportation, energy sourcing, ...
• Support: UPS, power distribution, chillers, lighting, real estate
• Core: servers, storage, networking
Both "technology for greening" and "greening of technology". Source: A. Sivasubramaniam

16 Power-Efficient Supercomputing: Goal
• 200 pJ/FLOP (5 GFLOPS/W) sustained
• 25% of energy in the FPU
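The two numbers in the goal are consistent; a minimal sketch of the standard conversion between energy per operation and power efficiency (only the 200 pJ figure comes from the slide):

```python
# Relate energy per FLOP to sustained power efficiency: 200 pJ/FLOP <-> 5 GFLOPS/W.
def gflops_per_watt(pj_per_flop: float) -> float:
    joules_per_flop = pj_per_flop * 1e-12
    flops_per_joule = 1.0 / joules_per_flop  # 1 W sustained = 1 J/s
    return flops_per_joule / 1e9

print(gflops_per_watt(200))  # 5.0, matching the stated goal
```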

17 Power-Efficient Supercomputing: Approach
– Eliminate overhead • Hide latency explicitly • Simple control – Energy-optimized architecture Efficient data supply Efficient instruction supply Reduce the number of instructions Agile memory hierarchy Over-provisioned architecture – Optimized components • Low-energy interconnect • Low-energy memories

18 Processor Power Efficiency Based on the ExtremeScale Study
Bill Dally's strawman processor architecture: a possible processor design methodology for achieving 28 pJ/FLOP. It requires optimization of the communication, computation, and memory components.
[Figure: from a conventional design at 2.5 nJ/FLOP, minimizing overhead brings the design to 631 pJ/FLOP, and minimizing DRAM energy brings it to 28 pJ/FLOP.]
Source: Bill Harrod, DARPA IPTO, 2009

19 Opportunity: Heterogeneous Architectures
In the multicore era, heterogeneous multicore architectures provide the most compelling architectural trajectory to mitigate these problems:
• Hybrid memory sub-system: SRAM, TFET, STT-RAM
• Hybrid cores: big, small, accelerators, GPUs
• Heterogeneous interconnect

20 A Holistic Design Paradigm
• Heterogeneity in device/circuits
• Heterogeneity in micro-architecture
• Heterogeneity in memory design
• Heterogeneity in interconnect

21 Technology Heterogeneity
CMOS-based scaling is expected to continue until 2022. Exploiting emerging technologies to design different cores/components is promising because it can enable cores with power/performance tradeoffs that were not possible before. TFETs provide higher performance than CMOS-based designs at lower voltages.
[Figure: V/F scaling of CMOS and TFET devices]

22 Heterogeneous Compute Nodes
Processor cores:
• Big core: latency critical
• Small core: throughput critical
• GPGPUs: bandwidth critical
• Accelerators/ASICs: latency/time critical

23 Memory Architecture
The role of novel technologies in memory systems. [Figure: comparison of memory technologies]

24 Heterogeneous Interconnect
[Figures: buffer utilization and link utilization across the network]
Non-uniformity is due to the non-edge-symmetric network and X-Y routing. So:
• Why clock all the routers at the same frequency? Variable-frequency routers for designing NoCs.
• Why allocate all routers similar area/buffer/link resources? Heterogeneous routers/NoCs.

25 Software Support
Compiler support:
• Thread remapping to minimize power: migrate threads to TFET cores to reduce power.
• Dynamic instruction morphing: instructions of a thread are morphed to match the heterogeneous hardware the thread is mapped to by the runtime system.
OS support:
• Heterogeneity-aware scheduling support
• Run-time thread migration support

26 Current Research in HPCL: Problems with Current NoCs
NoC power consumption is a concern today. With technology scaling, NoC power can be as high as 40-60 W for 128 nodes [2]. [Figure: Intel 80-core tile power profile [1]]
[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, 2007.
[2] S. Borkar, "Networks for Multi-core Chips: A Contrarian View," Special Session at ISLPED 2007.

27 Network Performance/Power
Observation: at low load, power consumption is low; at high load, power consumption and congestion are both high.
The proposed approach [1]: at low load, optimize for performance (reduce zero-load latency and accelerate flits); at high load, manage congestion and power.
[1] A Case for Dynamic Frequency Tuning in On-Chip Networks, MICRO 2009.

28 Frequency Tuning Rationale
An upstream router throttles (lowers its frequency) depending upon its total buffer utilization when the downstream router is congested; at low utilization the frequency is boosted; otherwise no change is made. A sketch of this rule follows below.
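A minimal sketch of the tuning rule described above; the thresholds, frequency range, and step factors are illustrative assumptions, not values from the RAFT paper:

```python
# Illustrative per-router frequency-tuning rule driven by buffer utilization.
# Thresholds, frequency range, and step factors are hypothetical placeholders.
LOW_UTIL, HIGH_UTIL = 0.2, 0.8
F_MIN, F_MAX = 1.0, 4.0  # GHz, assumed operating range

def next_frequency(current_ghz: float, buffer_utilization: float) -> float:
    """Boost at low load, throttle when congested, otherwise hold frequency."""
    if buffer_utilization < LOW_UTIL:
        return min(current_ghz * 1.25, F_MAX)  # FreqBoost: accelerate flits
    if buffer_utilization > HIGH_UTIL:
        return max(current_ghz * 0.8, F_MIN)   # FreqThrtl: manage congestion/power
    return current_ghz                         # no change

print(next_frequency(2.0, 0.1))  # lightly loaded -> boosted to 2.5 GHz
print(next_frequency(2.0, 0.9))  # congested -> throttled to 1.6 GHz
```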

29 Performance/Power Improvement with RAFT
FreqBoost at low load (optimize performance); FreqThrtl at high load (optimize performance and power). FreqTune gives both power reduction and throughput improvement: a 36% reduction in latency, a 31% increase in throughput, and a 14% power reduction across all traffic patterns.

30 A Case for Heterogeneous NoCs
Using the same amount of link resources and fewer buffer resources than a homogeneous network, this proposal demonstrates that a carefully designed heterogeneous network can reduce average latency, improve network throughput, and reduce power. It explores the types, number, and placement of heterogeneous routers (small and big routers, narrow and wide links) in the network.

31 HeteroNoC Performance-Power Envelope
• 22% throughput improvement
• 25% latency reduction
• 28% power reduction

32 3D Stacking = Increased Locality!
Many more neighbors within close reach!

33 Reduced Global Interconnect Length
• Delay/power reduction
• Bandwidth increase
• Smaller footprint
• Mixed-technology integration

34 3D Routers for 3D Networks
• One router per grid node in 2D: total area = 4L²
• Stack the layers in 3D: total area = L²
• Stack the router components in 3D: total area = L²
Results from MIRA: A Multi-layered On-Chip Interconnect Router Architecture, ISCA 2008. A sketch of the footprint arithmetic follows below.
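A minimal sketch of that footprint arithmetic; it assumes total silicon area is conserved and only the 2D footprint shrinks when folding into layers (the four-layer count is implied by the 4L² to L² reduction, not stated explicitly):

```python
# 2D footprint of a fixed total silicon area when folded into stacked 3D layers
# (assumes the area divides evenly across layers; 4 layers implied by 4L^2 -> L^2).
def footprint(total_area: float, layers: int) -> float:
    return total_area / layers

L = 1.0  # normalized tile side length
print(footprint(4 * L**2, 1))  # planar design: 4L^2
print(footprint(4 * L**2, 4))  # four stacked layers: L^2
```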

35 Conclusions
• We need a coherent approach to address the sub-micron technology problems in designing energy-efficient HPC systems.
• Heterogeneous multicores can address these problems and would be the future architecture trajectory.
• But the design of such systems is extremely complex; it needs an integrated technology-hardware-software-application approach.

36 HPCL Collaborators
Faculty: Vijaykrishnan Narayanan, Yuan Xie, Anand Sivasubramaniam, Mahmut Kandemir
Students: Sueng-Hwan Lim, Bikash Sharma, Adwait Jog, Asit Mishra, Reetuparna Das, Dongkook Park, Jongman Kim
Partially supported by: NSF, DARPA, DOE, Intel, IBM, HP, Google, Samsung

37 THANK YOU !!! Questions???

38 Efficiency of Conventional Processors
[Figure: measured efficiency of conventional processors] Many of these would be much lower without SSE.

39 Conclusion
• Supercomputers are energy limited, on the chip and in the data center.
• Conventional processors are very inefficient: 15 nJ/op in 90 nm, 5 nJ/op in 45 nm, largely due to overhead.
• Fundamentally, energy is dominated by data and instruction movement.
• Efficient supercomputing: 200 pJ/op in 45 nm; efficient communication and memory circuits; efficient data and instruction supply; an agile memory system.

40 Application Demand Source: S. Scott, PACT’06

41 Application Demand Source: S. Scott, PACT’06

42 Frequency Scaling Party is Over!
[Figure: the end of frequency scaling] Source: ISAT Last Classical Computer Study 2001 (cited by S. Scott, HPCA'04)

43 Then How Do We Meet the Performance Demand?
[Figure: the gap widens — growth rates of 75%/year vs. 52%/year and 19%/year lead to a 30× difference and eventually a 1000× difference.] Source: ISAT Last Classical Computer Study 2001 (cited by S. Scott, HPCA'04)

44 Interconnect Latency Nightmare
A Major Bottleneck to High Performance! Source: Saman Amarasinghe, MIT

45 Trends in Power Dissipation
Source: David Brooks, Harvard University

