Multicore – The Future of Computing
Chief Engineer Terje Mathisen
Moore’s Law «The number of transistors we can put on a chip will double every two years» – Originally from 1965, modified in 1975 – Up to around the turn of the century this meant a doubling in performance every 18 months. – Power has become the worst problem. – Bipolar transistors -> NMOS -> CMOS -> (lots of tweaks) -> 3D – Voltage scaling – Today, leakage current is a limiter – Even CMOS transistors leak when they get really tiny
Moore's Law has held for 40 years – Haswell: 5.6e9 transistors, 22 nm
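The doubling claim is easy to sanity-check. A minimal Python sketch (taking the 8088's ~29,000 transistors in 1979 as the starting point, an assumption chosen for illustration) projects counts under a strict two-year doubling:

```python
# Sketch of a Moore's-law projection; the start point (8088, 29K
# transistors, 1979) and the strict two-year period are illustrative
# assumptions, not a fitted model.
def moores_law(start_count, start_year, year, period=2.0):
    """Projected transistor count after doubling every `period` years."""
    return start_count * 2 ** ((year - start_year) / period)

projected = moores_law(29e3, 1979, 2013)  # Haswell era
print(f"{projected:.2e}")  # 3.80e+09 -- same order as Haswell's 5.6e9
```

The projection lands within a factor of two of Haswell's actual count, which is about as close as a straight exponential can be expected to track four decades of process changes.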
What could we use all the transistors for? Increase scalar performance with increasingly complicated CPUs. Multiple cycles/instruction: – 8088 (29K) – 80286 (134K) – 80386 (275K) Pipeline, one cycle/instruction – 80486 (1.2M) Superscalar: multiple instructions/cycle – Pentium (3.1M) (two in-order pipelines) Out-of-order/superscalar/multithreaded – Pentium Pro/Pentium III/Pentium 4/Core/etc (5.5M -> 5.6B)
Pentium 4 had the fastest pipeline ever: 3 GHz clock – Inner core ran at 2x, i.e. 6 GHz – Only for simple instructions like ADD/SUB/AND/OR Guessing at branches – if (a > b) {...} else {...} – Mistakes were very costly, both in time and power – 10 to 200 wasted instructions each time the CPU guessed wrong!
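The cost of guessing wrong can be put in a back-of-envelope model. All three numbers below (branch frequency, misprediction rate, flush penalty) are illustrative assumptions, not measured Pentium 4 figures:

```python
# Back-of-envelope branch-misprediction overhead; the parameter values
# used below are illustrative assumptions, not Pentium 4 measurements.
def mispredict_overhead(branch_freq, mispredict_rate, penalty_cycles):
    """Extra cycles lost per instruction due to mispredicted branches."""
    return branch_freq * mispredict_rate * penalty_cycles

# One branch every 5 instructions, 5% mispredicted, ~20-cycle flush:
overhead = mispredict_overhead(1 / 5, 0.05, 20)
print(round(overhead, 3))  # 0.2
```

Even these modest numbers lose 0.2 cycles per instruction; with the deeper flush and higher clock of a very long pipeline the penalty grows accordingly.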
Core 2: Multiple complicated cores. Running two independent processes in parallel wastes fewer instructions and leads to more power-efficient computing. – Shorter pipelines are better at branching – Object-oriented programming uses many branches Every two years: double the number of cores – Core 2 -> Core 2 Duo -> Core 2 Quad – Latest server CPUs have up to 18 cores, using 5.6e9 transistors
Vector operations SIMD: work on more data in each instruction – SSE uses 16-byte vectors (4 float/2 double) – AVX uses 32-byte vectors (8 float/4 double) Each core can do two SSE operations/cycle – Quad-core CPU: 4*2*4 = 32 fp operations/cycle – At 2 GHz, that is 64 Gflops – A high-end AVX implementation doubles this, and more cores add another multiplier
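The arithmetic above generalizes to a one-line peak-throughput formula (a sketch; the function and parameter names are mine):

```python
def peak_gflops(cores, floats_per_vector, ops_per_cycle, ghz):
    """Theoretical peak single-precision Gflops:
    cores x vector width x vector ops per cycle x clock (GHz)."""
    return cores * floats_per_vector * ops_per_cycle * ghz

# Quad-core SSE (4 floats/vector, 2 vector ops/cycle) at 2 GHz:
print(peak_gflops(4, 4, 2, 2.0))  # 64.0, matching the slide
```

Plugging in AVX's 8-float vectors, or more cores, shows directly where the extra multipliers on the slide come from.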
Other CPU architectures Sun Sparc – 2005: Niagara: 8 cores, 4 threads/core, low clock speed – Multithreaded server workloads – 2014: Oracle Sparc M7: 32 cores, 8 threads/core – Optimized for DB operations
Other CPU architectures Sparc – Multithreaded server workloads IBM/Sony Cell – 2005: Playstation 3 – 1 PPE + 8 SPE cores, each SPE capable of 25 Gflops – SPEs work on 16-byte vectors (4 float/2 double) – ~200 Gflops SP -> 14 Gflops DP – Special HPC version with 100+ Gflops DP
Other CPU architectures Sun Sparc IBM/Sony Cell GPGPU – Graphics cards with semi-general fp pipelines
Intel Larrabee / Many Integrated Core / Xeon Phi Project started 2003 – Architecture review Oct 2006 – Announced 2007 – 64-bit, x86 compatible Similar to Pentium – Dual in-order pipelines – More flexible mixing of instructions Special graphics instructions, incl. scatter/gather – S/G are very useful for HPC applications
LRB cont. Even longer vectors – Works with 64-byte blocks (16 float/8 double) – Combined FMUL/FADD (FMA) instruction More than 50 cores on first product – 4 threads/core – 16x2x51 = 1632 flops/cycle – 1.3 GHz core -> ~2 Tflops (Seismic cluster is ~10 Tflops) First product will be a graphics coprocessor card Will use the same 125 watts (max) as a single P4 New name: Many Integrated Core (MIC) / Knights Corner / Xeon Phi
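The same peak-throughput arithmetic applied to the Larrabee numbers above (16-float vectors, FMA counted as 2 flops per lane, 51 cores, 1.3 GHz; a sketch with my own naming):

```python
def peak_tflops(cores, floats_per_vector, flops_per_lane, ghz):
    """Theoretical peak Tflops: cores x vector width x flops/lane x clock."""
    return cores * floats_per_vector * flops_per_lane * ghz / 1000.0

# 51 cores, 16-float vectors, FMA = 2 flops per lane, 1.3 GHz:
print(round(peak_tflops(51, 16, 2, 1.3), 2))  # 2.12 -- the ~2 Tflops above
```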
Future directions Heterogeneous CPUs: – Maybe 2-4 Core cores + Larrabee? – Run single-threaded applications on Core, multi-threaded/vector-based on Xeon Phi (fastest computer in the world: Ivy Bridge + Phi) – OS threads without fp operations can also use the simple in-order LRB cores Power-efficient processing – Both laptops/mobiles and servers are limited by power use – Simpler/slower cores with mostly in-order processing can use 80% less power
Conclusion Multicore will give us an extra factor of ~10 increase in fp processing power – Most current forms of simulation become possible on a single workstation with 2-4 CPUs MIPS/Watt is crucial – Easier to make many simpler cores than one complex one – Less wasted work – Server farms and laptops
What are the consequences? High performance requires multithreading – Currently this is mostly server workloads – Games are next; today they use 2-4 threads High performance requires vector programming – Can we work on 4, 16 or more variables simultaneously? Many programs (and most programmers) don't care! – If it is fast enough today, it will surely be OK in the future as well? Not necessarily, because – Data grows exponentially!
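The scalar-versus-vector distinction can be sketched in pure Python. This is a model of the programming style only, not real SIMD; saxpy (y = a*x + y) is a standard example, and the 4-wide grouping mirrors one SSE float vector:

```python
# Conceptual sketch: the same saxpy computed one element at a time
# versus in 4-wide batches (mimicking one SSE vector per step).
def saxpy_scalar(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]  # one element per "op"

def saxpy_vector(a, x, y, width=4):
    out = []
    for i in range(0, len(x), width):  # one 4-wide "vector op" per step
        out.extend(a * xi + yi
                   for xi, yi in zip(x[i:i + width], y[i:i + width]))
    return out

x, y = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
print(saxpy_scalar(2.0, x, y) == saxpy_vector(2.0, x, y))  # True
```

A SIMD machine retires the whole 4-wide batch in a single instruction, which is where the 4x (SSE) and 8x (AVX) factors on the earlier slides come from.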
HPC applications Seismic processing – On a PC: complete model of small fields – Reduced-resolution test runs for larger fields – Deskside server with nearly the same capability as the current 2048-cpu seismic cluster Crash simulation – Everything could fit on a laptop Financial modelling, incl. Monte Carlo risk analysis Dynamic global process control
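Monte Carlo risk analysis is a natural fit for multicore because every trial is independent, so batches can be farmed out one per core. A minimal sketch (estimating pi rather than an actual risk measure):

```python
import random

# Minimal Monte Carlo sketch: estimate pi by sampling the unit square.
# Each batch of trials is independent, so batches could run one per core.
def monte_carlo_pi(trials, seed=0):
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(trials))
    return 4.0 * hits / trials

print(monte_carlo_pi(100_000))  # close to 3.14159
```

The error shrinks as 1/sqrt(trials), so throwing more cores at independent batches buys accuracy almost linearly in hardware.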
From current Unix cluster…
… to deskside workstation in 5 years?
Summary Multicore will give us an extra factor of ~10 increase in fp processing power Moore's Law will go on MIPS/Watt is crucial Evry is at the leading edge of this development
Thank you!
Do we have the required programmers? Will we get them from the universities in the future? – Possibly – Today, most graduates learn only Java, which isn't very suitable There's hope: – LRB is on the NTNU CS curriculum today – Similar situation at most universities Can our standard vendors deliver updated SW? – Eclipse, GeoFrame, Sismage, Ansys, Finite Element
Smaller transistors & slightly larger chips