Multicore – The future of Computing Chief Engineer Terje Mathisen.


1 Multicore – The future of Computing Chief Engineer Terje Mathisen

2 Moore’s Law
• «The number of transistors we can put on a chip will double every two years»
  – Originally from 1965, modified in 1975
  – Up to around the turn of the century this meant a doubling in performance every 18 months.
  – Power has become the worst problem.
  – Bipolar transistors -> NMOS -> CMOS -> (lots of tweaks) -> 3D
  – Voltage scaling
  – Today, leakage current is a limiter
  – Even CMOS transistors leak when they get really tiny

3 Moore's Law has held for 40 years
Haswell: 5.6e9 transistors, 22 nm
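As a rough check of the two-year doubling rule, the sketch below counts how many doublings separate two transistor counts. It uses the 8088 figure (29K) quoted on the next slide and the 5.6e9 figure above; the function names are hypothetical and chosen only for this illustration.

```python
import math


def doublings(start_count: float, end_count: float) -> float:
    """Number of doublings needed to grow from start_count to end_count."""
    return math.log2(end_count / start_count)


def years_needed(start_count: float, end_count: float,
                 years_per_doubling: float = 2.0) -> float:
    """Years required at a fixed doubling period (assumed: two years)."""
    return doublings(start_count, end_count) * years_per_doubling


d = doublings(29e3, 5.6e9)
y = years_needed(29e3, 5.6e9)
print(f"{d:.1f} doublings, about {y:.0f} years at one doubling every two years")
# prints "17.6 doublings, about 35 years at one doubling every two years"
```

Roughly 35 years at one doubling every two years is in line with the multi-decade span the slide claims.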

4 What could we use all the transistors for?
• Increase scalar performance
• Increasingly complicated CPUs
• Multiple cycles/instruction:
  – 8088 (29K)
  – 80286 (134K)
  – 80386 (275K)
• Pipeline, one cycle/instruction
  – 80486 (1.2M)
• Superscalar: Multiple instructions/cycle
  – Pentium (3.1M) (two in-order pipelines)
• Out of order/superscalar/multithreaded
  – Pentium Pro/Pentium III/Pentium4/Core/etc (5.5M --> 5.6B)

5 Pentium4 had the fastest pipeline ever
• 3 GHz clock
  – Inner core ran at 2x, i.e. 6 GHz
  – Only simple instructions, like ADD/SUB/AND/OR
• Guessing at branches
  – if (a > b) {...} else {…}
• Mistakes were very costly, both in time and power
  – 10 to 200 wasted instructions each time the CPU guessed wrong!
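To get a feel for what a wrong guess costs, here is a minimal cost model. The 10 and 200 penalties come from the slide; the workload numbers (one branch per five instructions, 5% guessed wrong) are assumptions for illustration only, and `wasted_instructions` is a hypothetical helper.

```python
def wasted_instructions(instructions: int, branch_fraction: float,
                        mispredict_rate: float, penalty: float) -> float:
    """Expected number of instructions discarded because of wrong branch guesses."""
    branches = instructions * branch_fraction
    mispredicted = branches * mispredict_rate
    return mispredicted * penalty


# Assumed workload: 1 million instructions, 20% branches, 5% of them mispredicted.
for penalty in (10, 200):
    lost = wasted_instructions(1_000_000, 0.20, 0.05, penalty)
    print(f"penalty {penalty:>3}: about {lost:,.0f} wasted instructions")
```

Under these assumptions the high-end penalty throws away more work than the program itself contains, which is why deep pipelines are expensive in both time and power.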

6 Core 2: Multiple complicated cores
• Running two individual processes in parallel causes fewer wasted instructions and leads to more power-efficient computing.
  – Shorter pipelines are better at branching
  – Object-oriented programming uses many branches
• Every two years: Double the number of cores
  – Core 2 -> Core 2 Duo -> Core 2 Quad
  – Latest server CPUs have up to 18 cores, using 5.6e9 transistors

7 Vector operations
• SIMD: Work with more data in each instruction
  – SSE uses 16-byte vectors (4 float/2 double)
  – AVX uses 32-byte vectors (8 float/4 double)
• Each core can do two SSE operations/cycle
  – Quad-core CPU with 4*2*4 = 32 fp operations/cycle
  – 64 Gflops @ 2 GHz, 100 Gflops @ 3+ GHz
  – High-end AVX implementation doubles this, 12-18 cores add another multiplier
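The peak numbers on this slide are just a product of four factors. A small sketch, with `peak_gflops` as a hypothetical helper name, reproduces them for single-precision floats:

```python
def peak_gflops(cores: int, ops_per_cycle: int, lanes: int, clock_ghz: float) -> float:
    """Theoretical peak: cores x vector operations per cycle x lanes per vector x clock."""
    return cores * ops_per_cycle * lanes * clock_ghz


# Quad-core, two SSE operations per cycle, 4 floats per 16-byte vector.
print(peak_gflops(cores=4, ops_per_cycle=2, lanes=4, clock_ghz=2.0))  # 64.0
print(peak_gflops(cores=4, ops_per_cycle=2, lanes=4, clock_ghz=3.0))  # 96.0

# AVX doubles the lanes to 8 floats per 32-byte vector.
print(peak_gflops(cores=4, ops_per_cycle=2, lanes=8, clock_ghz=2.0))  # 128.0
```

At exactly 3 GHz the product is 96 Gflops, so the slide's 100 Gflops figure corresponds to a clock slightly above 3 GHz. These are theoretical peaks; real code reaches them only if every cycle issues full-width vector operations.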

8 Other CPU architectures
• Sun Sparc
  – 2005: Niagara: 8 cores, 4 threads/core, low clock speed
  – Multithreaded server workloads
• Oracle Sparc M7
  – 2014: 32 cores, 8 threads/core
  – Optimized for DB operations

9

10 Other CPU architectures
• Sparc
  – Multithreaded server workloads
• IBM/Sony Cell
  – 2005: PlayStation 3
  – 1 PPE + 7-8 SPE cores, each capable of 25 Gflops
  – Works on 16-byte vectors (4 float/2 double)
  – ~200 Gflops SP -> 14 Gflops DP
  – Special HPC version with 100+ Gflops DP

11

12 Other CPU architectures
• Sun Sparc
• IBM/Sony Cell
• GPGPU
  – Graphics cards with semi-general fp pipelines

13 Intel Larrabee/Many Integrated Core/Xeon Phi
• Project started 2003
  – Architecture review Oct 2006
• Announced 2007
  – 64-bit
  – x86 compatible
• Similar to Pentium
  – Dual in-order pipelines
  – More flexible mixing of instructions
• Special graphics instructions, incl. scatter/gather
  – S/G are very useful for HPC applications
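For readers who have not met scatter/gather: a gather loads a vector from arbitrary memory positions, and a scatter stores a vector back to arbitrary positions. The sketch below shows only the semantics with plain Python lists. It is not how the hardware instructions are invoked, and the function names are hypothetical.

```python
def gather(memory: list, indices: list) -> list:
    """Load one vector whose elements come from non-contiguous positions."""
    return [memory[i] for i in indices]


def scatter(memory: list, indices: list, values: list) -> None:
    """Store each vector element to its own, possibly non-contiguous, position."""
    for i, v in zip(indices, values):
        memory[i] = v


memory = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
vec = gather(memory, [5, 0, 3])          # [50.0, 0.0, 30.0]
scatter(memory, [1, 2, 4], [v + 1 for v in vec])
print(memory)                            # [0.0, 51.0, 1.0, 30.0, 31.0, 50.0]
```

Sparse matrices and irregular grids access memory in this indexed way, which is one reason such instructions matter for HPC code.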

14 LRB cont.
• Even longer vectors
  – Works with 64-byte blocks (16 float/8 double)
  – Combined FMUL/FADD instruction
• More than 50 cores on first product
  – 4 threads/core
  – 16x2x51 = 1632 flops/cycle
  – 1.3 GHz core -> 2 Tflops (Seismic cluster is ~10 Tflops)
• First product will be graphics coprocessor card
• Will use the same 125 watts (max) as a single P4
• New name: Many Integrated Core (MIC)/Knights Corner/Xeon Phi

15 Future directions
• Heterogeneous CPUs:
  – Maybe 2-4 Core2 + 20-60 Larrabee?
  – Run single-threaded applications on Core, multi-threaded/vector-based on Xeon Phi. (2013 - Fastest computer in the world: Ivy Bridge+Phi)
  – OS threads without fp operations can also use simple in-order LRB cores
• Power-efficient processing
  – Both laptops/mobiles and servers are limited by power use
• Simpler/slower cores with mostly in-order processing can use 80% less power

16 Conclusion
• Multicore will give us an extra factor of ~10 increase in fp processing power
  – Most current forms of simulation become possible on a single workstation with 2-4 CPUs
• MIPS/Watt is crucial
  – Easier to make many simpler cores than one complex core
  – Less wasted work
  – Server farms and laptops

17 What are the consequences?
• High performance requires multithreading
  – Currently this is mostly server workloads
  – Games are next, today they use 2-4 threads
• High performance requires vector programming
  – Can we work on 4, 16 or more variables simultaneously?
• Many programs (and most programmers) don't care!
  – If it is fast enough today, it will surely be OK in the future as well?
• Not necessarily, because
  – Data grows exponentially!
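What "requires multithreading" means for the programmer is mostly a change in how work is structured: split the data into independent chunks, process each chunk on its own worker, then combine the partial results. The sketch below shows that pattern with Python's standard `concurrent.futures`; the helper names and the sum-of-squares workload are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data: list, n_chunks: int) -> list:
    """Split data into at most n_chunks contiguous pieces of near-equal size."""
    if not data:
        return []
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk: list) -> int:
    """Work done independently on one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: list, workers: int = 4) -> int:
    """Map the chunks onto a pool of workers, then reduce the partial sums."""
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


print(parallel_sum_of_squares(list(range(10))))  # 285
```

Note that CPython threads do not execute Python bytecode in parallel, so this shows the decomposition rather than a real speedup; for CPU-bound work a process pool or native vectorized code would be the usual choice.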

18 HPC applications
• Seismic processing
  – PC with
  – Complete model of small fields
  – Reduced resolution test runs for larger fields
  – Deskside server with nearly the same capability as current 2048-cpu seismic cluster
• Crash simulation
  – Everything could fit on a laptop in 2012-2015
• Financial modelling, incl Monte Carlo risk analysis
• Dynamic global process control

19 From current Unix cluster…

20 … to deskside workstation in 5 years?

21 Summary
• Multicore will give us an extra factor of ~10 increase in fp processing power
• Moore's law will go on
• MIPS/Watt is crucial
• Evry is at the leading edge of this development

22 Thank you!

23

24 Do we have the required programmers?
• Will we get them from the universities in the future?
  – Possibly
  – Today, most graduates learn only Java, which isn't very suitable
• There's hope:
  – LRB on the NTNU CS curriculum today
• Similar situation at most universities
• Can our standard vendors deliver updated SW?
  – Eclipse, GeoFrame, Sismage, Ansys, Finite Element

25 Smaller transistors & slightly larger chips

