Lynn Choi School of Electrical Engineering

Slides:



Advertisements
Similar presentations
CS136, Advanced Architecture Limits to ILP Simultaneous Multithreading.
Advertisements

Multicore Architectures Michael Gerndt. Development of Microprocessors Transistor capacity doubles every 18 months © Intel.
Multithreading processors Adapted from Bhuyan, Patterson, Eggers, probably others.
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
Microprocessors. Von Neumann architecture Data and instructions in single read/write memory Contents of memory addressable by location, independent of.
Instruction Level Parallelism (ILP) Colin Stevens.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
1 Lecture 26: Case Studies Topics: processor case studies, Flash memory Final exam stats:  Highest 83, median 67  70+: 16 students, 60-69: 20 students.
CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.
Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.
Computer performance.
Semiconductor Memory 1970 Fairchild Size of a single core –i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core.
8 – Simultaneous Multithreading. 2 Review from Last Time Limits to ILP (power efficiency, compilers, dependencies …) seem to limit to 3 to 6 issue for.
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.
Computer Architecture Introduction Lynn Choi Korea University.
Lynn Choi School of Electrical Engineering Microprocessor Microarchitecture The Past, Present, and Future of CPU Architecture.
Copyright © 2007 Heathkit Company, Inc. All Rights Reserved PC Fundamentals Presentation 27 – A Brief History of the Microprocessor.
History of Microprocessor MPIntroductionData BusAddress Bus
Microprocessor Microarchitecture Limits of Instruction-Level Parallelism Lynn Choi Dept. Of Computer and Electronics Engineering.
Computer Architecture Lec 10 –Simultaneous Multithreading.
Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.
Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.
Computer Architecture Introduction Lynn Choi Korea University.
Microprocessor Microarchitecture Introduction Lynn Choi School of Electrical Engineering.
CSC 7080 Graduate Computer Architecture Lec 8 – Multiprocessors & Thread- Level Parallelism (3) – Sun T1 Dr. Khalaf Notes adapted from: David Patterson.
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Lecture 3 Dr. Muhammad Ayaz Computer Organization and Assembly Language. (CSC-210)
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Use of Pipelining to Achieve CPI < 1
Chapter Overview General Concepts IA-32 Processor Architecture
Itanium® 2 Processor Architecture
Microprocessor Microarchitecture Introduction
CS 352H: Computer Systems Architecture
Lynn Choi School of Electrical Engineering
Electrical and Computer Engineering
Limits to ILP How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor.
Visit for more Learning Resources
CIT 668: System Architecture
CPE 731 Advanced Computer Architecture Thread Level Parallelism
Multi-core processors
Lynn Choi Dept. Of Computer and Electronics Engineering
HISTORY OF MICROPROCESSORS
Scalable Processor Design
CC 423: Advanced Computer Architecture Limits to ILP
Chapter 3: ILP and Its Exploitation
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
/ Computer Architecture and Design
Lec 10 – Example Architectures - Simultaneous Multithreading
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
John Kubiatowicz Electrical Engineering and Computer Sciences
Hyperthreading Technology
Advanced Computer Architecture 5MD00 / 5Z033 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Lecture on High Performance Processor Architecture (CS05162)
Lecture 2: Performance Today’s topics: Technology wrap-up
Levels of Parallelism within a Single Processor
Limits to ILP Conflicting studies of amount
Lec 9 – Limits to ILP and Simultaneous Multithreading
Alpha Microarchitecture
CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
* From AMD 1996 Publication #18522 Revision E
Embedded Computer Architecture 5SAI0 Chip Multi-Processors (ch 8)
8 – Simultaneous Multithreading
Lecture 3 (Microprocessor)
Advanced Computer Architecture 5MD00 / 5Z032 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.
Chapter 3: Limits of Instruction-Level Parallelism
Spring’19 Prof. Eric Rotenberg
Presentation transcript:

Lynn Choi School of Electrical Engineering Microprocessor Microarchitecture CPU Architecture: The Past and Present Lynn Choi School of Electrical Engineering

Contents Performance of Microprocessors Past: ILP Saturation I. Superscalar Hardware Complexity II. Limits of ILP III. Power Inefficiency Present: TLP Era I. Multithreading II. Multicore Present: Today’s Microprocessor Intel Core 2 Quad, Sun Niagara II, and ARM Cortex A-9 MPCore Future: Looking into the Future I. Manycores II. Multiple Systems on Chip III. Trend – Change of Wisdoms

CPU Performance Texe (Execution time per program) = NI * CPIexecution * Tcycle NI = # of instructions / program (program size) CPI = clock cycles / instruction Tcycle = second / clock cycle (clock cycle time) To increase performance Decrease NI (or program size) Instruction set architecture (CISC vs. RISC), compilers Decrease CPI (or increase IPC) Instruction-level parallelism (Superscalar, VLIW) Decrease Tcycle (or increase clock speed) Pipelining, process technology

Advances in Intel Microprocessors 80 81.3 (projected) Pentium IV 2.8GHz (superscalar, out-of-order) 70 60 42X Clock Speed ↑ 2X IPC ↑ 45.2 (projected) Pentium IV 1.7GHz (superscalar, out-of-order) SPECInt95 Performance 50 40 24 Pentium III 600MHz (superscalar, out-of-order) 30 8.09 11.6 PPro 200MHz (superscalar, out-of-order) 20 3.33 Pentium 100MHz (superscalar, in-order) 1 Pentium II 300MHz (superscalar, out-of-order) 80486 DX2 66MHz (pipelined) 10 1992 1993 1994 1995 1996 1997 1998 1999 2000 2002

Microprocessor Performance Curve

ILP Saturation I – Hardware Complexity Palacharla & Smith, IEEE, All rights reserved Superscalar hardware is not scalable in terms of issue width! Limited instruction fetch bandwidth Renaming complexity ∝ issue width2 Wakeup & selection logic ∝ instruction window2 Bypass logic complexity ∝ # of FUs2 Also, on-chip wire delays, # register and memory access ports, etc. Higher IPC implies lowering the Clock Speed!

ILP Saturation II – Limits of ILP Even with a very aggressive superscalar microarchitecture 2K window Max. 64 instruction issues per cycle 8K entry tournament predictors 2K jump and return predictors 256 integer and 256 FP registers Available ILP is only 3 ~ 6!

ILP Saturation III – Power Inefficiency Increasing issue rate is not energy efficient Increasing clock rate is also not energy efficient Increasing clock rate will increase transistor switching frequency Faster clock needs deeper pipeline, but the pipelining overhead grows faster Existing processors already reach the power limit 1.6GHz Itanium 2 consumes 130W of power! Temperature problem –Pentium power density passes that of a hot plate (‘98) and would pass a nuclear reactor in 2005, and a rocket nozzle in 2010. Higher IPC and higher clock speed have been pushed to their limit! Hardware complexity & Power Peak issue rate Sustained issue rate & Performance

TLP Era I - Multithreading Interleave multiple independent threads into the pipeline every cycle Each thread has its own PC, RF, branch prediction structures but shares instruction pipelines and backend execution units Increase resource utilization & throughput for multiple-issue processors Improve total system throughput (IPC) at the expense of compromised single program performance Superscalar Fine-Grain Multithreading SMT

TLP Era I - Multithreading IBM 8-processor Power 5 with SMT (2 threads per core) Run two copies of an application in SMT mode versus single-thread mode 23% improvement in SPECintRate and 16% improvement in SPECfpRate

TLP Era II - Multicore Multicore Pdyn = αCL * VDD2 * F Single-chip multiprocessing Easy to design and verify functionally Excellent performance/watt Pdyn = αCL * VDD2 * F Dual core at half clock speed can achieve the same performance (throughput) but with only ¼ of the power consumption ! Dual core consumes 2 * C * 0.52V * 0.5F = 0.25 CV2F Packaging, cooling, reliability Power also determines the cost of packaging/cooling. Chip temperature must be limited to avoid reliability issue and leakage power dissipation. Improved throughput with minor degradation in single program performance For multiprogramming workloads and multi-threaded applications

Today’s Microprocessor Intel Core 2 Quad Processor (code name “Yorkfield”) Technology 45nm process, 820M transistors, 2x107 mm² dies 2.83 GHz, two 64-bit dual-core dies in one MCM package Core microarchitecture Next generation multi-core microarchitecture introduced in Q1 2006 Derived from P6 microarchitecture Optimized for multi-cores and lower power consumption Lower clock speeds for lower power but higher performance 1/2 power (up to 65W) but more performance compared to dual-core Pentium D 14-stage 4-issue out-of-order (OOO) pipeline 64bit Intel architecture (x86-64) 2 unified 6MB L2 Caches 1333MHz system bus

Today’s Microprocessor Sun UltraSPARC T2 processor (“Niagara II”) Multithreaded multicore technology Eight 1.4 GHz cores, 8 threads per core → total 64 threads 65nm process, 1831 pin BGA, 503M transistors, 84W power consumption Core microarchitecture Two issue 8-stage instruction pipelines & pipelined FPU per core 4MB L2 – 8 banks, 64 FB DIMMs, 60+ GB/s memory bandwidth Security coprocessor per core and dual 10GB Ethernet, PCI Express Oracle. All rights reserved

Today’s Microprocessor Cortex A-9 MPCore ARMv7 ISA Support complex OS and multiuser applications 2-issue superscalar 8-stage OOO pipeline FPU supports both SP and DP operations NEON SIMD media processing engine MPCore technology that can support 1 ~ 4 cores ARM Ltd. All rights reserved