The Future of Many Core Computing: A tale of two processors
IT WAS THE BEST of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.
Tim Mattson, Intel Labs
Disclosure The views expressed in this talk are those of the speaker and not his employer. I am in a research group and know nothing about Intel products. So anything I say about them is highly suspect. This was a team effort, but if I say anything really stupid, it’s all my fault … don’t blame my collaborators.
A common view of many-core chips
An Intel exec's slide from IDF 2006
Challenging the sacred cows
Assumes a cache-coherent shared address space! Is that the right choice?
Many expert programmers do not fully understand the relaxed-consistency memory models required to make cache-coherent architectures work.
Programming models proven to scale non-trivial apps to 100's to 1000's of cores are all based on distributed memory.
Coherence incurs additional architectural overhead … fundamentally unscalable.
(Slide figure: streamlined IA cores optimized for multithreading, each with local cache plus a shared cache … don't infer that the drawing is to scale architecturally.)
Speaker notes: I mentioned before that we cannot continue to scale existing architectures to take advantage of the increasing number of transistors and to meet the increasing demands for performance. In the past we scaled clock frequency, creating a faster pipeline. Then we added features like out-of-order or speculative execution to keep the pipeline full; we call that instruction-level parallelism. Most software was single-threaded, so high single-thread performance was most important. Increasingly, applications are becoming multi-threaded. Today's processors, like our Core 2 Duo and Core 2 Quad, use cores that give good single-thread performance and, by their multi-core design, give good performance for multi-threaded apps. In the future, as applications become more highly parallel, the number of cores will increase to meet the performance demand. At the same time, energy efficiency must continue to improve so that these devices will operate in a reasonable, and lower than today's, power envelope.
Isn’t shared memory programming easier? Not necessarily.
Message passing (effort vs. time): extra work up front, but easier optimization and debugging mean less total time to solution.
Multi-threading (effort vs. time): initial parallelization can be quite easy, but difficult debugging and optimization mean the overall project takes longer.
Proving that a shared address space program using semaphores is race free is an NP-complete problem*
*P. N. Klein, H. Lu, and R. H. B. Netzer, "Detecting Race Conditions in Parallel Programs that Use Semaphores," Algorithmica, vol. 35, pp. 321–345, 2003.
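To make the debugging point concrete, here is a minimal illustration (not from the talk) of the kind of defect that makes shared-memory programs hard to reason about: two threads increment a shared counter with no synchronization, so updates are silently lost and the result changes from run to run.

```c
/* Illustrative only -- not from the talk. Two threads update a shared
 * counter without a lock; the read-modify-write interleaves, so the
 * final count is usually less than 2,000,000 and differs run to run. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;          /* shared, unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                /* racy read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```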
The many core design challenge
Scalable architecture: How should we connect the cores so we can scale as far as we need (O(100's to 1000) should be enough)?
Software: Can "general purpose programmers" write software that takes advantage of the cores? Will ISVs actually write scalable software?
Manufacturability: Validation costs grow steeply as the number of transistors grows. Can we use tiled architectures to address this problem? Validating a tile and the connections between tiles drops validation costs from K·O(N) to K·O(√N) (warning: K can be very large).
Intel's "TeraScale" processor research program is addressing these questions with a series of test chips … two so far: the 80 core research processor and the 48 core SCC processor.
Agenda
The 80 core research processor: max FLOPS/Watt in a tiled architecture
The 48 core SCC processor: scalable IA cores for software/platform research
Software in a many core world
Intel's 80 core terascale processor: die photo and chip details
Basic statistics:
65 nm CMOS process
100 million transistors in 275 mm²
8x10 tiles, 3 mm²/tile
Mesochronous clock
1.6 SP TFLOPS at 5 GHz and 1.2 V
320 GB/s bisection bandwidth
Variable voltage and multiple sleep states for explicit power management
The 80 core processor tile
(Tile diagram: a 5-port router, 2 KB data memory, 3 KB instruction memory, and a compute core with 2 floating point engines.)
All memory "on tile" … 256 instructions, 512 floats, 32 registers.
One-sided anonymous message passing into instruction or data memory.
2 FP units … 4 flops/cycle/tile; 2 loads per cycle.
No divide. No ints. 1D array indices, no nested loops.
This is an architecture concept that may or may not be reflected in future products from Intel Corp.
Programming results: application kernel implementation efficiency
(Chart: actual vs. theoretical single-precision TFLOPS for four kernels at 4.27 GHz; absolute peak = 1.37 TFLOPS.)

Kernel | Actual (SP TFLOPS) | Theoretical (SP TFLOPS) | % of theoretical
Stencil | 1.00 | 1.3 | ~77%
SGEMM | 0.51 | 0.68 | 75%
Spreadsheet | 0.45 | 0.68 | 66%
2D FFT | 0.02 | 0.24 | 8%

This was not a general SDK; we were using a simple core specialized for floating point to test the power and the fabric.
Hardware constraints:
Small private memory: 256 instructions operating on 512 floating point numbers.
2 SP FMAC units per tile → 4 FLOP/cycle/tile; maximum two loads per cycle per tile.
32 SP registers.
No integer instructions.
No general branch; a single branch-on-zero (i.e. one loop).
A single wait-for-data synchronization primitive.

Impact of the limitations:
The inability to nest multiple arbitrary loops forced us to use an SGEMM algorithm with poor data reuse, limiting us to 50% of peak instead of 95% or better. This also limited FFT.
Only 2 loads from data memory per cycle doesn't balance with 4 flops per cycle, which limited the spreadsheet kernel to 50%.
The lack of integer instructions limited indexing into arrays; the fixed memory sizes constrained loop unrolling and other optimizations that bloat code, and limited problem sizes.

How the bounds were computed: "peak" is just driving the FP units with no useful work. Reducing peak by the architectural characteristics that impact the algorithm gives the theoretical bound; reducing further by the features that impact the implementation of the algorithm gives the actual number. Theoretical numbers come from operation/communication counts and from rate-limiting bandwidths.

The kernels:
Stencil: replace each grid value with a weighted sum of its neighbors. Used in graphics (filters) and scientific computing (relaxation).
SGEMM: dense matrix-matrix multiply, a key building block of dense linear algebra.
Spreadsheet: given a table of data v and weights w, stored by columns, compute weighted row and column sums (dot products). Stresses the parallel part of some Monte Carlo and map-reduce-like algorithms.
2D FFT: a 64x64 FFT, a fundamental image processing algorithm.

The first three were expected to perform well; the FFT pushed the envelope on programmability. The 2D FFT is a very loopy algorithm, but with only one counted loop there was no room to unroll, and transcendental functions had to be implemented as tables (and indexing into arrays for table lookups is hard here).

James Demmel, a leader in the BeBOP (Berkeley Benchmarking and Optimization) project, notes that 99% of dense-multiply implementations reach less than 66% of peak, and 90% reach less than 33%. The 2D FFT number will raise comment: 2D FFT is easily parallelizable, and SGEMM and FFT have history. Being first to hit 1 TFLOP is a big deal; we look good on the absolute numbers but poor on performance as a percentage (Cell reaches about 200 GFLOPS, very close to its peak).
Measured at 1.07 V, 4.27 GHz, 80 °C.
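For reference, a plain-C sketch of the "spreadsheet" kernel as described above — weighted row and column sums over a column-stored table. The function name, array names, and column-major layout are illustrative assumptions; this is not the code that ran on the 80-core tiles.

```c
/* Sketch of the "spreadsheet" kernel: weighted row and column sums of a
 * table v with weights w, both stored by columns (column-major).
 * Names and layout are illustrative assumptions, not the 80-core code. */
void spreadsheet(int nrows, int ncols,
                 const float *v,      /* nrows*ncols values, column-major  */
                 const float *w,      /* nrows*ncols weights, column-major */
                 float *row_sum,      /* nrows weighted row sums (output)  */
                 float *col_sum)      /* ncols weighted column sums (output)*/
{
    for (int i = 0; i < nrows; i++) row_sum[i] = 0.0f;
    for (int j = 0; j < ncols; j++) {
        float csum = 0.0f;
        for (int i = 0; i < nrows; i++) {
            float prod = v[j * nrows + i] * w[j * nrows + i];
            csum       += prod;       /* dot product down one column */
            row_sum[i] += prod;       /* accumulate across the row   */
        }
        col_sum[j] = csum;
    }
}
```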
Why this is so exciting!
First TeraScale* computer (1997): Intel's ASCI Option Red supercomputer … 9000 CPUs, one megawatt of electricity, 1600 square feet of floor space (double precision TFLOPS running MP-Linpack).
First TeraScale* chip (2007, 10 years later): Intel's 80 core teraScale chip … 1 CPU, 97 watts, 275 mm² (single precision TFLOPS running stencil).
To visualize a trillion: a penny is about 1 mm thick, and 10^12 mm = 10^9 m = 10^6 km, while the mean radius of the moon's orbit is 3.884x10^5 km … a stack of a trillion pennies would reach well past the moon.
Source: Intel
Lessons: Application programmers should help design chips
On-die memory is great: 2-cycle latency compared to ~100 ns for DRAM.
Minimize message passing overhead: routers wrote directly into memory without interrupting computation, i.e. any core could write directly into the memory of any other core. This led to extremely small communication latency, on the order of 2 cycles.
Programmers can assist in keeping power low if sleep/wake instructions are exposed and if switching latency is low (~ a couple of cycles).
Application programmers should help design chips: this chip was presented to us as a completed package, yet small changes to the instruction set could have had a large impact on its programmability. A simple computed jump would have allowed us to add nested loops; a second offset parameter would have allowed us to program general 2D array computations (see the sketch below).
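To see what the missing nested-loop support means in practice, here is a small C sketch of collapsing a 2D traversal into the single counted loop and 1D indexing the tile supports. The real kernels were hand-written for the tile's own instruction set; this only shows the shape of the transformation.

```c
/* Illustration of collapsing a 2D traversal into the single counted loop
 * the 80-core tile supports (one branch-on-zero loop, 1D indices only).
 * Written in ordinary C to show the indexing; the real tile code was
 * hand-scheduled for the tile ISA, so this is a sketch of the idea only. */
void scale_matrix(float *a, int rows, int cols, float s)
{
    /* Instead of:  for (i...) for (j...) a[i][j] *= s;              */
    int n = rows * cols;              /* single trip count            */
    for (int k = 0; k < n; k++)       /* one loop, one 1D index       */
        a[k] *= s;                    /* a[i][j] lives at a[i*cols+j] */
}
```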
Agenda
The 80 core research processor: max FLOPS/Watt in a tiled architecture
The 48 core SCC processor: scalable IA cores for software/platform research
Software in a many core world
SCC full chip: 24 tiles in a 6x4 mesh with 2 cores per tile (48 cores total).
Technology: 45 nm process
Interconnect: 1 poly, 9 metal (Cu)
Transistors: die 1.3B, tile 48M
Tile area: 18.7 mm²
Die area: 567.1 mm² (26.5 mm x 21.4 mm)
(Die photo: the tile array surrounded by four DDR3 memory controllers, the voltage regulator controller (VRC), PLL, JTAG, and the system interface I/O.)
Hardware view of SCC: 48 cores in a 6x4 mesh with 2 cores per tile
45 nm, 1.3 B transistors, 25 to 125 W
16 to 64 GB total main memory using 4 DDR3 MCs
Tile area: ~17 mm²; SCC die area: ~567 mm²
(Tile diagram: two P54C cores (16 KB L1 each) with cache controllers and 256 KB L2 caches, a message buffer, and a traffic generator on the P54C front-side bus, connected through a mesh interface to the tile's router; the mesh connects to the memory controllers and a bus to PCI.)
R = router, MC = memory controller, P54C = second generation Pentium core, CC = cache controller.
Programmer’s view of SCC
48 x86 cores with the familiar x86 memory model for private DRAM.
Three memory spaces, with fast message passing between cores:
Private off-chip DRAM (variable size)
Shared off-chip DRAM (variable size)
Shared on-chip message passing buffer (8 KB/core)
Plus shared test-and-set registers.
(Diagram: CPU_0 … CPU_47, each with L1$, L2$, private DRAM, and a test-and-set (t&s) register, all attached to the shared on-chip message passing buffer.)
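A test-and-set register is all that is needed to build a simple lock. The sketch below shows the idea with C11 atomics standing in for the memory-mapped SCC register; it is a generic illustration, not SCC-specific code.

```c
/* Generic spin-lock sketch built on a test-and-set primitive, the kind of
 * use the SCC t&s registers enable.  The atomic_flag below stands in for
 * a read of the memory-mapped register (which atomically returns the old
 * value and sets the bit); it is NOT a real SCC or RCCE API. */
#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* stands in for the t&s register */

static void lock(void)
{
    /* spin until the previous value was 0, i.e. we grabbed the lock */
    while (atomic_flag_test_and_set(&lock_word))
        ;                                          /* busy-wait */
}

static void unlock(void)
{
    atomic_flag_clear(&lock_word);                 /* release: write 0 */
}
```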
RCCE: message passing library for SCC
Treat the message passing buffer (MPB) as 48 smaller buffers … one per core.
Symmetric name space … allocate memory as a collective op; each core gets a variable with the given name at a fixed offset from the beginning of its MPB.
A = (double *) RCCE_malloc(size) is called on all cores, so any core can put/get A at any Core_ID without error-prone explicit offsets.
Flags are allocated and used to coordinate memory ops.
How does RCCE work?
The foundation of RCCE is a one-sided put/get interface.
Symmetric name space … allocate memory as a collective and put a variable with a given name into each core's MPB.
(Diagram: cores move data through the MPB slots with Put(A, 0) and Get(A, 0) operations.)
… and use flags to make the put's and get's "safe".
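A rough sketch of the pattern described above: collective symmetric allocation, then a one-sided put toward a neighbor. The call names follow the slides (RCCE_malloc, put/get, flags), but the exact signatures and helper calls below are assumptions for illustration; consult the RCCE documentation for the real interface.

```c
/* Sketch of the RCCE pattern on the slide: collective symmetric
 * allocation, a one-sided put into a neighbor's MPB slot, and a flag to
 * make the transfer "safe".  The argument lists are ASSUMED for
 * illustration; check the RCCE documentation for the real signatures. */
#include "RCCE.h"   /* assumed header name */

void exchange_example(void)
{
    int me    = RCCE_ue();                  /* this core's ID (assumed call) */
    int right = (me + 1) % RCCE_num_ues();  /* neighbor core (assumed call)  */
    int n     = 64 * sizeof(double);        /* bytes to move                 */

    /* Collective: every core gets "A" at the same offset in its MPB. */
    double *A = (double *) RCCE_malloc(n);

    /* One-sided put of my private buffer into the neighbor's copy of A;
     * a flag write/wait pair would then signal that the data has landed. */
    double mydata[64] = {0};
    RCCE_put((t_vcharp) A, (t_vcharp) mydata, n, right);
    /* ... RCCE flag routines would follow here to synchronize ... */
}
```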
NAS Parallel benchmarks
1. BT: multipartition decomposition (x-, y-, and z-sweeps). Each core owns multiple blocks (3 in this case): update all blocks in a plane of 3x3 blocks, send data to the neighbor blocks in the next plane, then update the next plane of 3x3 blocks.
2. LU: pencil decomposition, defining a 2D-pipeline process: await data (from bottom and left), compute the new tile, send data (to top and right).
(Diagrams show the block and pencil assignments per unit of execution, UE.)
Third party names are the property of their owners.
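The LU pencil decomposition is a classic 2D software pipeline. A schematic C sketch of the per-step loop follows; recv_from, compute_tile, and send_to are placeholder stubs, not RCCE calls.

```c
/* Schematic of the LU 2D-pipeline step on the slide: each core waits for
 * boundary data from its bottom and left neighbors, updates its tile,
 * then forwards boundary data to its top and right neighbors.
 * The three helpers below are PLACEHOLDER stubs, not RCCE calls. */
static void recv_from(int row, int col, int step) { (void)row; (void)col; (void)step; }
static void send_to  (int row, int col, int step) { (void)row; (void)col; (void)step; }
static void compute_tile(int step)                { (void)step; }

void lu_pipeline_sweep(int my_row, int my_col, int nrows, int ncols, int nsteps)
{
    for (int t = 0; t < nsteps; t++) {
        if (my_row > 0)         recv_from(my_row - 1, my_col, t); /* bottom */
        if (my_col > 0)         recv_from(my_row, my_col - 1, t); /* left   */
        compute_tile(t);                                          /* update */
        if (my_row < nrows - 1) send_to(my_row + 1, my_col, t);   /* top    */
        if (my_col < ncols - 1) send_to(my_row, my_col + 1, t);   /* right  */
    }
}
```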
LU/BT NAS Parallel Benchmarks, SCC
Problem size: Class A, 64 x 64 x 64 grid*
Using latency-optimized, whole-cache-line flags.
SCC processor configuration: 500 MHz cores, 1 GHz routers, 25 MHz system interface, and DDR3 memory at 800 MHz.
* These are not official NAS Parallel Benchmark results.
Third party names are the property of their owners. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference <…> or call (U.S.) <…>.
Power and memory-controller domains
Power ~ F·V²
Power control domains (RPC): 7 voltage domains … six 4-tile blocks and one for the on-die network.
One clock-divider register per tile (i.e. 24 frequency domains).
One RPC register, so only one voltage request can be processed at a time; other requestors block.
(Diagram: the 6x4 tile/router mesh with its four memory controllers and the bus to PCI, annotated with the voltage and frequency domains and the RC package.)
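Because switching power scales roughly as F·V², lowering frequency and voltage together pays off superlinearly. A tiny back-of-the-envelope illustration follows; the two operating points are made up for the arithmetic and are not SCC data.

```c
/* Back-of-the-envelope illustration of P ~ F * V^2.  The two operating
 * points below are MADE UP for the arithmetic; they are not SCC data. */
#include <stdio.h>

int main(void)
{
    double f_hi = 1.0, v_hi = 1.1;     /* nominal frequency (GHz) and volts */
    double f_lo = 0.5, v_lo = 0.8;     /* half frequency at reduced voltage */

    double p_hi = f_hi * v_hi * v_hi;
    double p_lo = f_lo * v_lo * v_lo;

    /* Halving the frequency alone would halve power; dropping the voltage
     * too cuts it to roughly a quarter in this example. */
    printf("relative power: %.2f -> %.2f (%.0f%% of nominal)\n",
           p_hi, p_lo, 100.0 * p_lo / p_hi);
    return 0;
}
```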
Power breakdown
Conclusions
RCCE software works: RCCE's restrictions (the symmetric MPB memory model and blocking communications) have not been a fundamental obstacle.
SCC architecture: the on-chip MPB was effective for scalable message passing applications. Software-controlled power management works … but it's challenging to use because of (1) the 8-core granularity and (2) the high latencies for voltage changes.
Future work: the interesting work is yet to come … we will make ~100 of these systems available to industry and academic partners for research on scalable many core OSes and user friendly programming models that don't depend on coherent shared memory.
Agenda
The 80 core research processor: max FLOPS/Watt in a tiled architecture
The 48 core SCC processor: scalable IA cores for software/platform research
Software in a many core world
Third party names are the property of their owners.
The Future of Many Core Computing: A tale of two processors
Hardware: "It was the best of times, it was the age of wisdom, it was the epoch of belief, it was the season of Light, it was the spring of hope, we had everything before us, we were all going direct to Heaven …"
Software: "… it was the worst of times, it was the age of foolishness, it was the epoch of incredulity, it was the season of Darkness, it was the winter of despair, we had nothing before us, we were all going direct the other way."
Tim Mattson, Intel Labs
The many-core challenge
We have arrived at many-core solutions not because of the success of our parallel software but because of our failure to keep increasing CPU frequency. The result is a fundamental and dangerous mismatch: parallel hardware is ubiquitous, but parallel software is rare. Our challenge … make parallel software as routine as our parallel hardware.
And remember … it’s the platform we care about, not just “the chip”
A modern platform has: CPU(s), GPU(s), DSP processors … other?
(Block diagram: CPUs, GMCH, GPU, ICH, and DRAM; photos of an Intel dual-core and an AMD Istanbul processor.)
Programmers need to make the best use of all the available resources from within a single program: one program that runs well (i.e. reasonably close to "hand-tuned" performance) on a heterogeneous mixture of processors.
GMCH = graphics memory control hub, ICH = input/output control hub
Solution: Find A Good parallel programming model, right?
Models from the golden age of parallel programming (~1995):
ABCPL ACE ACT++ Active messages Adl Adsmith ADDAP AFAPI ALWAN AM AMDC AppLeS Amoeba ARTS Athapascan-0b Aurora Automap bb_threads Blaze BSP BlockComm C*. "C* in C C** CarlOS Cashmere C4 CC++ Chu Charlotte Charm Charm++ Cid Cilk CM-Fortran Converse Code COOL CORRELATE CPS CRL CSP Cthreads CUMULVS DAGGER DAPPLE Data Parallel C DC++ DCE++ DDD DICE. DIPC DOLIB DOME DOSMOS. DRL DSM-Threads Ease . ECO Eiffel Eilean Emerald EPL Excalibur Express Falcon Filaments FM FLASH The FORCE Fork Fortran-M FX GA GAMMA Glenda GLU GUARD HAsL. Haskell HPC++ JAVAR. HORUS HPC IMPACT ISIS. JAVAR JADE Java RMI javaPG JavaSpace JIDL Joyce Khoros Karma KOAN/Fortran-S LAM Lilac Linda JADA WWWinda ISETL-Linda ParLin P4-Linda POSYBL Objective-Linda LiPS Locust Lparx Lucid Maisie Manifold Mentat Legion Meta Chaos Midway Millipede CparPar Mirage MpC MOSIX Modula-P Modula-2* Multipol MPI MPC++ Munin Nano-Threads NESL NetClasses++ Nexus Nimrod NOW Objective Linda Occam Omega OpenMP Orca OOF90 P++ P3L Pablo PADE PADRE Panda Papers AFAPI. Para++ Paradigm Parafrase2 Paralation Parallel-C++ Parallaxis ParC ParLib++ Parmacs Parti pC PCN PCP: PH PEACE PCU PET PENNY Phosphorus POET. Polaris POOMA POOL-T PRESTO P-RIO Prospero Proteus QPC++ PVM PSI PSDM Quake Quark Quick Threads Sage++ SCANDAL SAM pC++ SCHEDULE SciTL SDDA. SHMEM SIMPLE Sina SISAL. distributed smalltalk SMI. SONiC Split-C. SR Sthreads Strand. SUIF. Synergy Telegrphos SuperPascal TCGMSG. Threads.h++. TreadMarks TRAPPER uC++ UNITY UC V ViC* Visifold V-NUS VPE Win32 threads WinPar XENOOPS XPC Zounds ZPL
We learned more about creating programming models than how to use them. Please save us from ourselves … demand standards (or open source)!
Speaker notes: Let's look at the history of supercomputing and see if we can get some clues about what we should do. Back in the late 80's and early 90's, we thought all the ISVs needed to enter parallel computing was a portable application programming interface. So the universities and a number of small companies (3 of them, 2 of which I used to work for) came up with all sorts of APIs. It is my firm belief that this backfired on us. The fact that there were so many APIs made parallel computing computer scientists look stupid. If we couldn't agree on an effective approach to parallel computing, how could we expect non-specialist ISVs to figure it out?
Third party names are the property of their owners.
How to program the heterogeneous platform? Let history be our guide … consider the origins of OpenMP (1997).
SGI and Cray had merged, needed commonality across products, and wrote a rough draft straw man SMP API.
DEC, IBM, Intel, and HP, along with other vendors, were invited to join.
KAI, an ISV, needed a larger market.
ASCI was tired of recoding for SMPs and forced the vendors to standardize.
Third party names are the property of their owners.
OpenCL: Can history repeat itself?
As ASCI did for OpenMP, Apple is doing for GPU/CPU with OpenCL.
AMD and ATI merged, needed commonality across products, and wrote a rough draft straw man API.
Nvidia, a GPU vendor, wants to steal market share from the CPU; Intel, a CPU vendor, wants to steal market share from the GPU.
Apple was tired of recoding for many core chips and GPUs and pushed the vendors to standardize.
The Khronos Compute group formed (Dec 2008), with Ericsson, Sony, Blizzard, Nokia, Freescale, TI, IBM, and many more.
Third party names are the property of their owners.
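To make the OpenCL point concrete, a minimal host-side sketch using the standard OpenCL C API: one program enumerates every platform and device (CPU, GPU, or otherwise) at run time and could then choose among them. Error handling is mostly omitted.

```c
/* Minimal sketch: enumerate OpenCL platforms and devices so one program
 * can discover CPUs and GPUs at run time.  Uses only standard OpenCL 1.x
 * host-API calls; error handling is mostly omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);                 /* count platforms */
    if (nplat == 0) { printf("no OpenCL platforms\n"); return 0; }
    if (nplat > 8) nplat = 8;
    cl_platform_id plats[8];
    clGetPlatformIDs(nplat, plats, NULL);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
        if (ndev == 0) continue;
        if (ndev > 16) ndev = 16;
        cl_device_id devs[16];
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, NULL);

        for (cl_uint d = 0; d < ndev; d++) {
            char name[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}
```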
Conclusion
HW/SW co-design is the key to a successful transition to a many core future.
HW is in good shape … SW is in a tough spot.
If you (the users) do not DEMAND good standards … our many core future will be uncertain.
(Photo: "a noble SW professional" meets "our many core future" … Tim Mattson getting clobbered in Ilwaco, Dec 2007.)
80-core Research Processor teams
The software team: Tim Mattson, Rob van der Wijngaart (Intel); Michael Frumkin (then at Intel, now at Google)
Implementation: Circuit Research Lab Advanced Prototyping team (Hillsboro, OR and Bangalore, India)
PLL design: Logic Technology Development (Hillsboro, OR)
Package design: Assembly Technology Development (Chandler, AZ)
A special thanks to our "optimizing compiler" … Yatin Hoskote, Jason Howard, and Saurabh Dighe of Intel's Microprocessor Technology Laboratory.
SCC SW teams
SCC application software:
RCCE library, apps, and HW/SW co-design: Rob Van der Wijngaart, Tim Mattson
Developer tools (icc and MKL): Patrick Kennedy
SCC system software:
Management console software: Michael Riepen
BareMetalC workflow, Linux for SCC: Thomas Lehnig, Paul Brett
System interface FPGA development: Matthias Steidl
TCP/IP network driver: Werner Haas
And the HW team that worked closely with the SW group: Jason Howard, Yatin Hoskote, Sriram Vangal, Nitin Borkar, Greg Ruhl