1 Beyond Multi-core: The Dawning of the Era of Tera
Intel™ 80-core Tera-scale Research Processor
Jim Held, Intel Fellow & Director, Tera-scale Computing Research, Intel Corporation

2 Agenda
Tera-scale Computing: Motivation, Platform Vision
Teraflops Research Processor: Key Ingredients, Power Management, Performance, Programming
Key Learnings
Work in Progress
Summary

3 Emerging Applications will demand Tera-scale performance
[Chart: computational demand (GFLOPs) vs. memory bandwidth (GB/s), both on log scales from 0.1 to 10,000, for emerging workloads: computer vision (body tracking with 2 webcams, 4-camera video surveillance), ray tracing (beetle and bar scenes at 1M pixels), physical simulation (CFD at 150x100x100 / 30 fps and 75x50x50 / 10 fps), and financial analytics (ALM, 6 assets, 10 branches, at 1 min and 1 sec). Images courtesy of Prof. Ron Fedkiw, Stanford University.]

4 A Tera-scale Platform Vision
Special-purpose engines
Integrated I/O devices
Last-level cache
Scalable on-die interconnect fabric
Integrated memory controllers
High-bandwidth off-die interconnect: I/O and socket interconnect

5 Tera-scale Computing Research
Applications – identify, characterize & optimize
Programming – empower the mainstream
System software – scalable services
Memory hierarchy – feed the compute engine
On-die interconnect – high bandwidth, low latency
Cores – power-efficient general & special function

6 Teraflops Research Processor
Goals:
Deliver Tera-scale performance: single-precision TFLOP at desktop power; frequency target 5 GHz; bisection bandwidth on the order of terabits/s; link bandwidth in hundreds of GB/s
Prototype two key technologies: on-die interconnect fabric and 3D stacked memory
Develop a scalable design methodology: tiled design approach, mesochronous clocking, power-aware capability
[Die photo: 12.64 mm x 21.72 mm die with PLL, TAP, and I/O areas; each tile is 1.5 mm x 2.0 mm.]
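The TFLOP goal can be sanity-checked from the figures in the deck: 80 tiles, two FPMACs per tile, and two floating-point operations per multiply-accumulate. A minimal sketch of that arithmetic (the helper name is mine; the counts come from the slides):

```c
#include <stdio.h>

/* Back-of-the-envelope peak single-precision FLOPS for the 80-tile
 * array: each tile has two FPMACs, and a multiply-accumulate counts
 * as two floating-point operations per cycle. */
static double peak_gflops(int tiles, double ghz)
{
    const int fpmacs_per_tile = 2;
    const int flops_per_mac   = 2;   /* multiply + add */
    return tiles * fpmacs_per_tile * flops_per_mac * ghz;
}

int main(void)
{
    printf("%.0f GFLOPS at 3.16 GHz\n", peak_gflops(80, 3.16)); /* ~1011 => ~1 TFLOP   */
    printf("%.0f GFLOPS at 5.67 GHz\n", peak_gflops(80, 5.67)); /* ~1814 => ~1.8 TFLOP */
    return 0;
}
```

These two frequencies match the 1 TFLOPS and 1.81 TFLOPS operating points reported on the Power Performance Results slide later in the deck.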

7 Key Ingredients
Special-purpose cores: high-performance dual FPMACs per tile
2D mesh interconnect: high-bandwidth, low-latency router; phase-tolerant tile-to-tile communication through a mesochronous interface (MSINT)
Mesochronous clocking: modular & scalable, lower power
Workload-aware power management: sleep instructions and packets; chip voltage & frequency control
[Tile diagram: processing engine (PE) with FPMAC0/FPMAC1, 2 KB data memory (DMEM), 3 KB instruction memory (IMEM), 6-read/4-write 32-entry register file, and RIB, connected through an MSINT to a crossbar router with 40 GB/s links.]
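To give a feel for tile-to-tile communication on the mesh, here is a minimal hop-count sketch. It assumes dimension-ordered (XY) routing and a 10x8 coordinate layout purely for illustration; the deck itself only specifies a high-bandwidth, low-latency crossbar router per tile.

```c
#include <stdio.h>

/* A minimal sketch of tile-to-tile hop counts on the 2D mesh,
 * assuming dimension-ordered (XY) routing.  The routing discipline
 * and the 10x8 coordinate layout are illustrative assumptions. */
typedef struct { int x, y; } tile_coord;

static int abs_diff(int a, int b) { return a > b ? a - b : b - a; }

/* Router hops between two tiles under XY routing. */
static int mesh_hops(tile_coord src, tile_coord dst)
{
    return abs_diff(src.x, dst.x) + abs_diff(src.y, dst.y);
}

int main(void)
{
    tile_coord corner = {0, 0}, opposite = {9, 7};   /* far corners of a 10x8 mesh */
    printf("worst-case hops: %d\n", mesh_hops(corner, opposite)); /* 16 */
    return 0;
}
```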

8 Fine Grain Power Management
21 sleep regions per tile (not all shown):
FP engine 1 and FP engine 2 sleeping: 90% less power each
Router: 10% less power (stays on to pass traffic)
Data memory sleeping: 57% less power
Instruction memory sleeping: 56% less power
Dynamic sleep:
STANDBY – memory retains data, 50% less power per tile
FULL SLEEP – memories fully off, 80% less power per tile
Scalable power to match workload demands
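A toy model of how these per-block savings compose into a tile-level figure. Only the percentage reductions are taken from the slide; the baseline watts per block are invented placeholders for illustration.

```c
#include <stdio.h>

/* Toy model of the per-block sleep savings quoted on the slide.
 * Only the saving fractions come from the deck; the baseline
 * per-block power split is an assumed placeholder. */
struct sleep_region {
    const char *name;
    double baseline_w;    /* assumed awake power, watts */
    double sleep_saving;  /* fraction saved when asleep (from the slide) */
};

int main(void)
{
    struct sleep_region tile[] = {
        { "FP engine 1",        0.30, 0.90 },
        { "FP engine 2",        0.30, 0.90 },
        { "Router",             0.90, 0.10 },  /* stays on to pass traffic */
        { "Data memory",        0.20, 0.57 },
        { "Instruction memory", 0.20, 0.56 },
    };
    double awake = 0.0, idle = 0.0;
    for (unsigned i = 0; i < sizeof tile / sizeof tile[0]; i++) {
        awake += tile[i].baseline_w;
        idle  += tile[i].baseline_w * (1.0 - tile[i].sleep_saving);
    }
    printf("tile power: %.2f W awake, %.2f W with regions asleep\n", awake, idle);
    return 0;
}
```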

9 2X-5X leakage power reduction
Dynamic sleep cuts total measured idle power from 13 W to ~7 W
Regulated sleep for the memory arrays with state retention; memory clamping further reduces leakage
[Diagram: estimated leakage savings per block at 1.2 V, 110°C for the MSINT/crossbar router, data memory, instruction memory, register file, and FPMACs; individual blocks drop by roughly 2X-5X (e.g., 100 mW to 20 mW).]

10 Router Power Management
Activity-based power management: individual port enables; queues put on sleep and clock-gated when a port is idle
Active router power of 924 mW; 7X power reduction for idle routers
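A rough sketch of what activity-based port gating amounts to, assuming five router ports (four mesh directions plus the local processing engine); the data structure and field names below are illustrative, not the actual RTL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of activity-based router power management:
 * a port's queue clock is enabled only while it has flits pending,
 * mirroring the per-port enables described on the slide. */
#define NUM_PORTS 5   /* assumed: N, S, E, W + local PE port */

typedef struct {
    bool    queue_has_flits[NUM_PORTS];
    uint8_t port_clock_enable;   /* one gating bit per port */
} router_state;

static void update_port_gating(router_state *r)
{
    r->port_clock_enable = 0;
    for (int p = 0; p < NUM_PORTS; p++)
        if (r->queue_has_flits[p])
            r->port_clock_enable |= (uint8_t)(1u << p);
    /* ports with nothing queued stay clock-gated / asleep */
}

int main(void)
{
    router_state r = { { true, false, false, false, true }, 0 };
    update_port_gating(&r);           /* only ports 0 and 4 stay clocked */
    return (int)r.port_clock_enable;  /* 0b10001 = 17 */
}
```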

11 Power Performance Results
[Charts, measured at 80°C with N=80 tiles:
Peak performance vs. Vcc/frequency – 0.32 TFLOPS at 1 GHz, 1 TFLOPS at 3.16 GHz, 1.63 TFLOPS at 5.1 GHz, 1.81 TFLOPS at 5.67 GHz
Average power efficiency – peaking at 19.4 GFLOPS/W, ~10.5 GFLOPS/W at 394 GFLOPS, ~5.8 GFLOPS/W near full performance
Measured power vs. Vcc (stencil workload, all tiles awake) – from 78 W (15.6 W leakage) through 97 W at 1.07 V and 152 W (26 W leakage) up to 230 W
Leakage as a percentage of total power, sleep disabled vs. enabled – roughly 2X lower with sleep enabled]
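As a cross-check on the efficiency chart, dividing the 1.01 TFLOPS / 62 W operating point quoted on the Key Learnings slide gives roughly 16 GFLOPS/W, consistent with the 19.4 GFLOPS/W peak measured at lower voltage. A trivial check:

```c
#include <stdio.h>

/* GFLOPS per watt for the 1.01 TFLOPS / 62 W operating point quoted
 * on the Key Learnings slide; the peak-efficiency figure of
 * 19.4 GFLOPS/W comes from the efficiency chart above. */
int main(void)
{
    double gflops = 1010.0;   /* 1.01 TFLOPS */
    double watts  = 62.0;
    printf("%.1f GFLOPS/W at the 1 TFLOPS point\n", gflops / watts); /* ~16.3 */
    return 0;
}
```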

12 Programming Results
Not designed as a general software development vehicle: small memory, ISA limitations, limited data ports
Four kernels hand-coded to explore delivered performance:
Stencil – 2D heat-diffusion equation
SGEMM for 100x100 matrices
Spreadsheet doing weighted sums
64-point 2D FFT (with 64 tiles)
Demonstrated the utility and high scalability of message-passing programming models on many-core
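For reference, the stencil kernel is the classic 5-point 2D heat-diffusion update. The real kernel was hand-coded for the chip's VLIW ISA; the plain-C version below, including its grid layout and diffusion constant, is only an illustrative sketch.

```c
/* Plain-C sketch of the 2D heat-diffusion (5-point) stencil kernel
 * named on the slide.  Grid layout, alpha, and boundary handling
 * are illustrative assumptions, not the hand-coded kernel. */
void heat_step(int nx, int ny, const float *in, float *out, float alpha)
{
    for (int y = 1; y < ny - 1; y++) {
        for (int x = 1; x < nx - 1; x++) {
            int i = y * nx + x;
            out[i] = in[i] + alpha * (in[i - 1] + in[i + 1] +
                                      in[i - nx] + in[i + nx] -
                                      4.0f * in[i]);
        }
    }
}
```

On the research chip, each tile would presumably own a block of the grid and exchange halo rows with its mesh neighbors, in line with the message-passing model the deck highlights.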

13 Key Learnings
Teraflop performance is possible within a mainstream power envelope: peak of 1.01 TFLOPS at 62 watts; measured peak power efficiency of 19.4 GFLOPS/W
The tile-based methodology fulfilled its promise: design completed with half the team in half the time; pre- and post-silicon debug reduced – fully functional on A0
Fine-grained power management pays off: hierarchical clock gating and sleep-transistor techniques; up to 3X measured reduction in standby leakage power; scalable low-power mesochronous clocking
Excellent software performance is possible in this message-based architecture; further improvements possible with additional instructions, larger memory, and wider data ports

14 Work in Progress: Stacked Memory Prototype
256 KB SRAM per core; 4X the C4 bump density; 3200 through-silicon vias
[Diagram: 80-tile "Polaris" processor with Cu bumps (denser than C4 pitch) stacked on the "Freya" memory die at C4 pitch, connected by through-silicon vias to the package.]
Memory access to match the compute power

15 Teraflops on IA
Pat Gelsinger – Intel Developer Forum 2007

16 Summary
Emerging applications will demand teraflop performance
Teraflop performance is possible within a mainstream power envelope
Intel is developing technologies to enable Tera-scale computing

17 Questions

18 Acknowledgments Sriram Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Priya Iyer, Arvind Singh, Tiju Jacob, Shailendra Jain, Sriram Venkataraman, Yatin Hoskote, Nitin Borkar, Rob van der Wijngaart, Michael Frumkin and Tim Mattson

