Climate Machine Update David Donofrio RAMP Retreat 8/20/2008.


1 Climate Machine Update David Donofrio RAMP Retreat 8/20/2008

2 Agenda
Project Overview
Tensilica Architecture and Design Flow
Tensilica Tools Demo
Why we need RAMP
Current Progress
Next Steps

3 A New Approach to HPC
Current HPC design approach:
– Leverage commodity processors from Intel, AMD, etc.
– Once the machine is built, optimize problems to run on it
– Power wall prevents scaling to exaflop performance
– Power is the new design point
Credit: Olukotun and Sutter
Moore's Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate

4 A New Approach to HPC
Our approach:
– Identify application, then tailor machine using semi-custom design
– Optimize CPU architecture and further extend with semi-custom ISA
– Leverage auto-tuning to access architecture-specific optimizations
– Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
Learn from the embedded market, where flops/watt and rapid design cycles are crucial
– Start with building blocks from embedded designs rather than full-custom ASIC
– Preserve ability to run general-purpose C code
Application target: 1 km scale climate model
Tailor machine architecture to application to reduce waste

5 Climate Model Resource Requirements
DOE has identified high-resolution climate modeling as a leading justification for exascale computing
Must express 20M-way parallelism
Requires performance of 200 Pflops peak
Simulation must run 1000x faster than real time
Amenable to massively concurrent architectures composed of power-efficient embedded cores
Actively working with the climate science community to enable new icosahedral model
Credit: Randall / CSU, NASA
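The requirements on this slide imply two simple derived figures. The sketch below is illustrative arithmetic only, not part of the original deck. It assumes the peak rate is divided evenly across the 20M parallel threads and that a year is 365.25 days; both are assumptions made here, and the function names are hypothetical.

```python
# Figures quoted on the slide.
PEAK_FLOPS = 200e15           # 200 Pflops peak
PARALLELISM = 20e6            # 20M-way parallelism
SPEEDUP_VS_REAL_TIME = 1000   # simulation must run 1000x faster than real time

# Assumption (not from the slide): length of a year in hours.
HOURS_PER_YEAR = 365.25 * 24


def flops_per_thread(peak_flops, parallelism):
    """Peak rate each thread would need if work were split evenly."""
    return peak_flops / parallelism


def wall_clock_hours_per_simulated_year(speedup, hours_per_year=HOURS_PER_YEAR):
    """Wall-clock hours needed to simulate one year at the given speedup."""
    return hours_per_year / speedup


per_thread = flops_per_thread(PEAK_FLOPS, PARALLELISM)
hours = wall_clock_hours_per_simulated_year(SPEEDUP_VS_REAL_TIME)

print(f"{per_thread / 1e9:.0f} Gflops peak per thread")
print(f"{hours:.2f} wall-clock hours per simulated year")
```

Under these assumptions each thread needs roughly 10 Gflops of peak throughput, and a simulated year takes a little under nine wall-clock hours.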

6 Tensilica Processor Design Flow
Complete solution: hardware, software and verification
Fully customizable
– Required base ISA ensures general-purpose applications still run
Processor configuration submitted to Tensilica's servers, where synthesis is performed
– Returned design can be spun for ASIC or FPGA
– Bit file available for Avnet boards
Building-block approach drastically reduces design cycle time compared to full-custom design
Credit: Tensilica Inc.

7 Tensilica Architecture Features
Verilog-like TIE language allows for custom ISA extensions
– Functional and performance verification built in
– Auto-generated compiler intrinsics
– 64-bit IEEE-DP floating point coded up in TIE and available
Custom VLIW support
Inter-processor communication easily enabled through:
– TIE Ports
– TIE Queues: access to direct HW support for inter-processor communication
– TIE Lookups: allow interface to external ROMs or other RTL blocks

8 Tensilica Architecture Overview Tensilica Inc.

9 Tensilica Performance Debug
Processor viewed as black box
State can be compressed (via HW) and pushed out JTAG port
– Intended for program replay
Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
– Cache hit / miss with virtual address
– Branch taken / not taken
– Call / return
– Resource dependency
– Etc.
Opportunity for hundreds of performance counters to be made available
Credit: Tensilica Inc.

10 Tensilica Tools Demo

11 Why we need RAMP
Fast, accurate emulation enables:
– Dual nested loop of HW / SW co-design
Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW / SW co-tuning
RAMP critical to accelerate:
– Rapid prototyping and analysis of Tensilica architectural options
– Inter-processor communication architecture exploration
– Running FULL climate code, providing a more complete performance picture
Cycle-accurate simulator currently running at ~100 kHz vs. 50 MHz on V5
– Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
Tensilica-provided emulation environment kick-starts this effort
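To make the simulator-versus-FPGA gap concrete, here is a rough sketch of the arithmetic using the two clock rates quoted on this slide. The workload size of 10^12 target cycles is a hypothetical figure chosen only for illustration, and the sketch assumes one target cycle per host clock tick on both platforms, which the slide does not state.

```python
# Rates quoted on the slide.
SIMULATOR_HZ = 100e3   # cycle-accurate software simulator, ~100 kHz
FPGA_HZ = 50e6         # FPGA emulation on V5, 50 MHz

# Hypothetical workload size, chosen only for illustration.
WORKLOAD_CYCLES = 1e12


def run_time_seconds(cycles, rate_hz):
    """Time to execute `cycles` target cycles at `rate_hz` cycles per second."""
    return cycles / rate_hz


speedup = FPGA_HZ / SIMULATOR_HZ
sim_days = run_time_seconds(WORKLOAD_CYCLES, SIMULATOR_HZ) / 86400
fpga_hours = run_time_seconds(WORKLOAD_CYCLES, FPGA_HZ) / 3600

print(f"FPGA emulation is {speedup:.0f}x faster than the simulator")
print(f"Simulator: {sim_days:.1f} days, FPGA: {fpga_hours:.2f} hours")
```

With these numbers the FPGA is 500x faster: a run that would occupy the software simulator for about 116 days finishes in under six hours of emulation.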

12 Current Status
ML505 used for initial design exploration
– Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex-5 50T
– Runs at 50 MHz
ASIC in 65G process runs at 650 MHz
On-chip debug working
Can load / run programs using main memory synthesized from BRAM
DRAM interface coded, currently being debugged
RTL license recently obtained; full simulation environment (in ModelSim) being brought up

13 Next Steps…
Transition to BEE3 from ML505
Bring up XTOS environment on single Xtensa processor on BEE3
Run single column of climate code on single processor
– Demo at SC'08 in November
– Continue HW / SW co-tuning optimization
Begin multi-processor emulation
– Emulation of single socket, 32 core, using networked BEE3s
– Running full 2-million-line climate model

14 Backup

15 The Need for Exascale Computing
DOE has identified high-resolution climate modeling as a leading justification for exascale computing
– 1 km resolution targeted for accurate cloud-resolving model
Difficult to scale existing systems
– HPC design using commodity processors estimated to draw 179 MW
– BlueGene design estimated to draw 20 MW
– Leveraging embedded cores and more application-specific design, a power envelope of 3-5 MW is projected
LBNL will seek an external vendor to build the machine if our approach is proven valid; LBNL is not entering the commercial HPC market.
Figure: Icosahedral (credit: Randall / CSU)
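The three power estimates on this slide can be compared directly. The sketch below is illustrative arithmetic only: it takes the slide's figures at face value and computes how many times smaller the projected 3-5 MW envelope is than each alternative. The helper name `reduction_range` is hypothetical.

```python
# Estimated power draw in megawatts, as quoted on the slide.
COMMODITY_MW = 179.0             # commodity-processor HPC design
BLUEGENE_MW = 20.0               # BlueGene design
EMBEDDED_MW_RANGE = (3.0, 5.0)   # projected envelope, low and high end


def reduction_range(baseline_mw, target_range_mw):
    """Return (min, max) power reduction factor versus a baseline."""
    low_mw, high_mw = target_range_mw
    # The smallest reduction comes from the high end of the target range.
    return baseline_mw / high_mw, baseline_mw / low_mw


vs_commodity = reduction_range(COMMODITY_MW, EMBEDDED_MW_RANGE)
vs_bluegene = reduction_range(BLUEGENE_MW, EMBEDDED_MW_RANGE)

print(f"vs. commodity: {vs_commodity[0]:.1f}x to {vs_commodity[1]:.1f}x less power")
print(f"vs. BlueGene:  {vs_bluegene[0]:.1f}x to {vs_bluegene[1]:.1f}x less power")
```

On these estimates the embedded-core design would draw roughly 36x to 60x less power than the commodity design and about 4x to 7x less than the BlueGene design.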

