Multicore – The Future of Computing
Chief Engineer Terje Mathisen

Presentation transcript:

Multicore – The Future of Computing
Chief Engineer Terje Mathisen

Moore’s Law
 «The number of transistors we can put on a chip will double every two years»
– Originally from 1965, modified in 1975
– Up to around the turn of the century this meant a doubling in performance every 18 months
– Power has become the worst problem
– Bipolar transistors -> NMOS -> CMOS -> (lots of tweaks) -> 3D
– Voltage scaling
– Today, leakage current is a limiter
– Even CMOS transistors leak when they get really tiny

Moore's Law has held for 40 years
– Haswell: 5.6e9 transistors, 22 nm

What could we use all the transistors for?
 Increase scalar performance
 Increasingly complicated CPUs
 Multiple cycles/instruction:
– 8088 (29K) – 80286 (134K) – 80386 (275K)
 Pipeline, one cycle/instruction
– 80486 (1.2M)
 Superscalar: multiple instructions/cycle
– Pentium (3.1M) (two in-order pipelines)
 Out-of-order/superscalar/multithreaded
– Pentium Pro/Pentium III/Pentium 4/Core/etc. (5.5M -> 5.6B)

Pentium 4 had the fastest pipeline ever
 3 GHz clock
– Inner core ran at 2x, i.e. 6 GHz
– Only simple instructions, like ADD/SUB/AND/OR
 Guessing at branches
– if (a > b) {...} else {…}
 Mistakes were very costly, both in time and power
– 10 to 200 wasted instructions each time the CPU guessed wrong!
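To make the branch-guessing cost concrete, here is a minimal C sketch (not from the talk): the first loop makes the CPU predict a data-dependent branch on every iteration, while the second computes both alternatives and selects with a mask, leaving nothing to mispredict.

```c
#include <stddef.h>

/* Branchy version: the CPU must guess the outcome of (a[i] > b[i])
 * every iteration; on unpredictable data, a long pipeline like the
 * Pentium 4's throws away many in-flight instructions per miss. */
long sum_max_branchy(const int *a, const int *b, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > b[i])
            sum += a[i];
        else
            sum += b[i];
    }
    return sum;
}

/* Branch-free version: compute a selection mask instead of jumping,
 * so there is no branch to mispredict. */
long sum_max_branchless(const int *a, const int *b, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        int gt = -(a[i] > b[i]);           /* -1 if a[i] > b[i], else 0 */
        sum += (a[i] & gt) | (b[i] & ~gt); /* picks a[i] or b[i] */
    }
    return sum;
}
```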

Core 2: Multiple complicated cores
 Running two independent processes in parallel wastes fewer instructions and gives more power-efficient computing
– Shorter pipelines are better at branching
– Object-oriented programming uses many branches
 Every two years: double the number of cores
– Core 2 -> Core 2 Duo -> Core 2 Quad
– Latest server CPUs have up to 18 cores, using 5.6e9 transistors
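As a hedged sketch of running two independent streams of work in parallel (the talk shows no code; POSIX threads are my assumption), each thread below sums half of an array. On a dual-core chip the two halves run on separate cores at full speed.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

typedef struct { int lo, hi; double sum; } chunk_t;

/* Each thread sums its own half of the array independently. */
static void *partial_sum(void *arg) {
    chunk_t *c = arg;
    c->sum = 0.0;
    for (int i = c->lo; i < c->hi; i++)
        c->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    chunk_t a = { 0, N / 2, 0.0 }, b = { N / 2, N, 0.0 };
    pthread_t ta, tb;
    pthread_create(&ta, NULL, partial_sum, &a);
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("sum = %f\n", a.sum + b.sum); /* 1000000.0 */
    return 0;
}
```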

Vector operations
 SIMD: work on more data in each instruction
– SSE uses 16-byte vectors (4 float / 2 double)
– AVX uses 32-byte vectors (8 float / 4 double)
 Each core can do two SSE operations/cycle
– Quad-core CPU: 4*2*4 = 32 fp operations/cycle
– 32 ops/cycle x 2 GHz = 64 Gflops
 A high-end AVX implementation doubles this, and extra cores add another multiplier
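A minimal illustration of the 16-byte SSE vectors mentioned above, using compiler intrinsics (my addition, not from the slides); it assumes n is a multiple of 4 and 16-byte-aligned pointers.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: 16-byte vectors, 4 floats each */

/* dst[i] = a[i] + b[i], four elements per instruction.
 * Assumes n % 4 == 0 and 16-byte-aligned pointers. */
void vec_add(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb)); /* 4 adds at once */
    }
}
```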

Other CPU architectures
 Sun Sparc
– 2005: Niagara: 8 cores, 4 threads/core, low clock speed
– Multithreaded server workloads
– 2014: Oracle Sparc M7: 32 cores, 8 threads/core
– Optimized for DB operations

Other CPU architectures
 Sparc
– Multithreaded server workloads
 IBM/Sony Cell
– 2005: Playstation 3
– 1 PPE + 8 SPE cores, each SPE capable of 25 Gflops/s
– Works on 16-byte vectors (4 float / 2 double)
– ~200 Gflops SP -> 14 Gflops DP
– Special HPC version with 100+ Gflops DP

Other CPU architectures
 Sun Sparc
 IBM/Sony Cell
 GPGPU
– Graphics cards with semi-general fp pipelines

Intel Larrabee / Many Integrated Core / Xeon Phi
 Project started 2003
– Architecture review Oct 2006
 Announced 2007
– 64-bit
– x86 compatible
 Similar to Pentium
– Dual in-order pipelines
– More flexible mixing of instructions
 Special graphics instructions, incl. scatter/gather
– S/G are very useful for HPC applications

LRB cont.
 Even longer vectors
– Works with 64-byte blocks (16 float / 8 double)
– Combined FMUL/FADD (FMA) instruction
 More than 50 cores on first product
– 4 threads/core
– 16 x 2 x 51 = 1632 flops/cycle
– 1.3 GHz core -> ~2 Tflops (Seismic cluster is ~10 Tflops)
 First product will be a graphics coprocessor card
 Will use the same 125 watts (max) as a single P4
 New name: Many Integrated Core (MIC) / Knights Corner / Xeon Phi
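A sketch of the combined FMUL/FADD idea on 64-byte vectors: the original Larrabee instruction set is not what compilers expose today, so the code below substitutes the AVX-512 intrinsics found on later Xeon Phi and Xeon parts. Treat the exact calls as an assumption; the shape (16 floats per fused multiply-add) is the point.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX-512: 64-byte vectors, 16 floats each */

/* dst[i] = a[i] * b[i] + c[i], 16 floats per fused multiply-add.
 * Assumes n % 16 == 0; unaligned loads keep the sketch simple. */
void fma16(float *dst, const float *a, const float *b,
           const float *c, size_t n) {
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_loadu_ps(c + i);
        _mm512_storeu_ps(dst + i, _mm512_fmadd_ps(va, vb, vc));
    }
}
```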

Future directions
 Heterogeneous CPUs:
– Maybe 2-4 Core cores + Larrabee?
– Run single-threaded applications on the Core side, multi-threaded/vector-based code on Xeon Phi (fastest computer in the world: Ivy Bridge + Phi)
– OS threads without fp operations can also use the simple in-order LRB cores
 Power-efficient processing
– Both laptops/mobiles and servers are limited by power use
– Simpler/slower cores with mostly in-order processing can use 80% less power

Conclusion
 Multicore will give us an extra factor of ~10 increase in fp processing power
– Most current forms of simulation become possible on a single workstation with 2-4 CPUs
 MIPS/Watt is crucial
– Easier to make many simple cores than one complex core
– Less wasted work
– Server farms and laptops

What are the consequences?
 High performance requires multithreading
– Currently this is mostly server workloads
– Games are next; today they use 2-4 threads
 High performance requires vector programming
– Can we work on 4, 16 or more variables simultaneously?
 Many programs (and most programmers) don't care!
– If it is fast enough today, it will surely be OK in the future as well?
 Not necessarily, because data grows exponentially!
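As a sketch of combining both requirements, multithreading and vector programming, the loop below uses OpenMP (my choice of API; the talk names none): one pragma spreads iterations across all cores and asks the compiler to vectorize each core's share, so the same source scales with both core count and vector width.

```c
#include <stddef.h>

/* Compile with -fopenmp. The `parallel for` part distributes the
 * loop across all cores; the `simd` part lets the compiler vectorize
 * each core's share of the iterations. */
void scale(float *x, float s, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}
```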

HPC applications
 Seismic processing
– PC with …
– Complete model of small fields
– Reduced-resolution test runs for larger fields
– Deskside server with nearly the same capability as the current 2048-cpu seismic cluster
 Crash simulation
– Everything could fit on a laptop in …
 Financial modelling, incl. Monte Carlo risk analysis
 Dynamic global process control

From current Unix cluster…

… to deskside workstation in 5 years?

Summary
 Multicore will give us an extra factor of ~10 increase in fp processing power
 Moore's law will go on
 MIPS/Watt is crucial
 Evry is at the leading edge of this development

Thank you!

Do we have the required programmers?
 Will we get them from the universities in the future?
– Possibly
– Today, most graduates learn only Java, which isn't very suitable
 There's hope:
– LRB on the NTNU CS curriculum today
 Similar situation at most universities
 Can our standard vendors deliver updated SW?
– Eclipse, GeoFrame, Sismage, Ansys, Finite Element

Smaller transistors & slightly larger chips