CPU Architecture Overview Varun Sampath CIS 565 Spring 2012.

Slides:

Advertisements

Similar presentations

Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan.

Advertisements

CSCI 4717/5717 Computer Architecture

CS136, Advanced Architecture Limits to ILP Simultaneous Multithreading.

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.

© 2006 Edward F. Gehringer ECE 463/521 Lecture Notes, Spring 2006 Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Spring 2006.

CIS 501: Comp. Arch. | Prof. Joe Devietti | Superscalar1 CIS 501: Computer Architecture Unit 8: Superscalar Pipelines Slides developed by Joe Devietti,

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.

Lecture 6: Multicore Systems

AMD OPTERON ARCHITECTURE Omar Aragon Abdel Salam Sayyad This presentation is missing the references used.

Multithreading processors Adapted from Bhuyan, Patterson, Eggers, probably others.

THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.

Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.

Single-Chip Multiprocessor Nirmal Andrews. Case for single chip multiprocessors Advances in the field of integrated chip processing. - Gate density (More.

1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

1 Pipelining for Multi- Core Architectures. 2 Multi-Core Technology Single Core Dual CoreMulti-Core + Cache + Cache Core 4 or more cores.

1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)

CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.

1 Lecture 16: Cache Innovations / Case Studies Topics: prefetching, blocking, processor case studies (Section 5.2)

Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.

1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.

Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,

8 – Simultaneous Multithreading. 2 Review from Last Time Limits to ILP (power efficiency, compilers, dependencies …) seem to limit to 3 to 6 issue for.

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar.

By Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas This paper appears in: Micro, IEEE March/April 2011 (vol. 31 no. 2) pp 마이크로 프로세서.

Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.

GPU Architecture Overview

Comparing Intel’s Core with AMD's K8 Microarchitecture IS 3313 December 14 th.

1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Thread Level Parallelism Since ILP has inherent limitations, can we exploit multithreading? –a thread is defined as a separate process with its own instructions.

Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.

On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.

1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.

EKT303/4 Superscalar vs Super-pipelined.

Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.

GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS Fall 2013.

Computer Structure 2015 – Intel ® Core TM μArch 1 Computer Structure Multi-Threading Lihu Rappoport and Adi Yoaz.

My Coordinates Office EM G.27 contact time:

On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.

Modern general-purpose processors. Post-RISC architecture Instruction & arithmetic pipelining Superscalar architecture Data flow analysis Branch prediction.

GPU Architecture Overview Patrick Cozzi University of Pennsylvania CIS Fall 2012.

Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.

PipeliningPipelining Computer Architecture (Fall 2006)

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

Use of Pipelining to Achieve CPI < 1

CS 352H: Computer Systems Architecture

Instruction Level Parallelism

Lynn Choi School of Electrical Engineering

Computer Structure Multi-Threading

/ Computer Architecture and Design

Instructional Parallelism

Hyperthreading Technology

Levels of Parallelism within a Single Processor

Comparison of Two Processors

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

Lecture 20: OOO, Memory Hierarchy

* From AMD 1996 Publication #18522 Revision E

Overview Prof. Eric Rotenberg

Mattan Erez The University of Texas at Austin

8 – Simultaneous Multithreading

Lecture 1 An Overview of High-Performance Computer Architecture

The University of Adelaide, School of Computer Science

Advanced Computer Architecture 5MD00 / 5Z032 SMT Simultaneously Multi-Threading Henk Corporaal TUEindhoven.

Spring’19 Prof. Eric Rotenberg

Presentation transcript:

CPU Architecture Overview Varun Sampath CIS 565 Spring 2012

Objectives Performance tricks of a modern CPU – Pipelining – Branch Prediction – Superscalar – Out-of-Order (OoO) Execution – Memory Hierarchy – Vector Operations – SMT – Multicore

What is a CPU anyways? Execute instructions Now so much more – Interface to main memory (DRAM) – I/O functionality Composed of transistors

Instructions

Desktop Programs Lightly threaded Lots of branches Lots of memory accesses vimls Conditional branches13.6%12.5% Memory accesses45.7% Vector instructions1.1%0.2% Profiled with psrun on ENIAC

Source: intel.comintel.com

What is a Transistor? Approximation: a voltage-controlled switch Typical channel lengths (for 2012): 22-32nm Channel Image: Penn ESE370Penn ESE370

Moore’s Law “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year” Self-fulfilling prophecy What do we do with our transistor budget? Source: intel.comintel.com

Intel Core i7 3960X (Codename Sandy Bridge-E) – 2.27B transistors, Total Size 435mm 2 Source:

A Simple CPU Core Image: Penn CIS501Penn CIS501 PC I$ Register File s1s2d D$ +4+4 control

A Simple CPU Core Image: Penn CIS501Penn CIS501 Fetch  Decode  Execute  Memory  Writeback PC I$ Register File s1s2d D$ +4+4 control

Pipelining Image: Penn CIS501Penn CIS501 PC Insn Mem Register File s1s2d Data Mem +4+4 T insn-mem T regfile T ALU T data-mem T regfile T singlecycle

Pipelining Capitalize on instruction-level parallelism (ILP) + Significantly reduced clock period – Slight latency & area increase (pipeline latches) ? Dependent instructions ? Branches Alleged Pipeline Lengths: – Core 2: 14 stages – Pentium 4 (Prescott): > 20 stages – Sandy Bridge: in between

Bypassing Image: Penn CIS501Penn CIS501 sub R2,R3  R7add R1,R7  R2 Register File SXSX s1s2d Data Mem a d IR A B IR O B IR O D IR F/DD/XX/MM/W stall nop

Stalls Image: Penn CIS501Penn CIS501 load [R3]  R7add R1,R7  R2 Register File SXSX s1s2d Data Mem a d IR A B IR O B IR O D IR F/DD/XX/MM/W stall nop

Branches Image: Penn CIS501Penn CIS501 jeq loop ??? Register File SXSX s1s2d Data Mem a d IR A B IR O B IR O D IR F/DD/XX/MM/W nop

Branch Prediction Guess what instruction comes next Based off branch history Example: two-level predictor with global history – Maintain history table of all outcomes for M successive branches – Compare with past N results (history register) – Sandy Bridge employs 32-bit history register

Branch Prediction + Modern predictors > 90% accuracy o Raise performance and energy efficiency (why?) – Area increase – Potential fetch stage latency increase

Another option: Predication Replace branches with conditional instructions ; if (r1==0) r3=r2 cmoveq r1, r2 -> r3 + Avoids branch predictor o Avoids area penalty, misprediction penalty – Avoids branch predictor o Introduces unnecessary nop if predictable branch GPUs also use predication

Improving IPC IPC (instructions/cycle) bottlenecked at 1 instruction / clock Superscalar – increase pipeline width Image: Penn CIS371

Superscalar + Peak IPC now at N (for N-way superscalar) o Branching and scheduling impede this o Need some more tricks to get closer to peak (next) – Area increase o Doubling execution resources o Bypass network grows at N 2 o Need more register & memory bandwidth

Superscalar in Sandy Bridge Image © David Kanter, RWTRWT

Scheduling Consider instructions: xor r1,r2 -> r3 add r3,r4 -> r4 sub r5,r2 -> r3 addi r3,1 -> r1 xor and add are dependent (Read-After- Write, RAW) sub and addi are dependent (RAW) xor and sub are not (Write-After-Write, WAW)

Register Renaming How about this instead: xor p1,p2 -> p6 add p6,p4 -> p7 sub p5,p2 -> p8 addi p8,1 -> p9 xor and sub can now execute in parallel

Out-of-Order Execution Reordering instructions to maximize throughput Fetch  Decode  Rename  Dispatch  Issue  Register-Read  Execute  Memory  Writeback  Commit Reorder Buffer (ROB) – Keeps track of status for in-flight instructions Physical Register File (PRF) Issue Queue/Scheduler – Chooses next instruction(s) to execute

OoO in Sandy Bridge Image © David Kanter, RWTRWT

Out-of-Order Execution + Brings IPC much closer to ideal – Area increase – Energy increase Modern Desktop/Mobile In-order CPUs – Intel Atom – ARM Cortex-A8 (Apple A4, TI OMAP 3) – Qualcomm Scorpion Modern Desktop/Mobile OoO CPUs – Intel Pentium Pro and onwards – ARM Cortex-A9 (Apple A5, NV Tegra 2/3, TI OMAP 4) – Qualcomm Krait

Memory Hierarchy Memory: the larger it gets, the slower it gets Rough numbers: LatencyBandwidthSize SRAM (L1, L2, L3)1-2ns200GBps1-20MB DRAM (memory)70ns20GBps1-20GB Flash (disk)70-90µs200MBps GB HDD (disk)10ms1-150MBps GB SRAM & DRAM latency, and DRAM bandwidth for Sandy Bridge from LostcircuitsLostcircuits Flash and HDD latencies from AnandTechAnandTech Flash and HDD bandwidth from AnandTech BenchBench SRAM bandwidth guesstimated.

Caching Keep data you need close Exploit: – Temporal locality Chunk just used likely to be used again soon – Spatial locality Next chunk to use is likely close to previous

Cache Hierarchy Hardware-managed – L1 Instruction/Data caches – L2 unified cache – L3 unified cache Software-managed – Main memory – Disk I$ D$ L2 L3 Main Memory Disk (not to scale) LargerLarger FasterFaster

Intel Core i7 3960X – 15MB L3 (25% of die). 4-channel Memory Controller, 51.2GB/s total Source:

Some Memory Hierarchy Design Choices Banking – Avoid multi-porting Coherency Memory Controller – Multiple channels for bandwidth

Parallelism in the CPU Covered Instruction-Level (ILP) extraction – Superscalar – Out-of-order Data-Level Parallelism (DLP) – Vectors Thread-Level Parallelism (TLP) – Simultaneous Multithreading (SMT) – Multicore

Vectors Motivation for (int i = 0; i < N; i++) A[i] = B[i] + C[i];

CPU Data-level Parallelism Single Instruction Multiple Data (SIMD) – Let’s make the execution unit (ALU) really wide – Let’s make the registers really wide too for (int i = 0; i < N; i+= 4) { // in parallel A[i] = B[i] + C[i]; A[i+1] = B[i+1] + C[i+1]; A[i+2] = B[i+2] + C[i+2]; A[i+3] = B[i+3] + C[i+3]; }

Vector Operations in x86 SSE2 – 4-wide packed float and packed integer instructions – Intel Pentium 4 onwards – AMD Athlon 64 onwards AVX – 8-wide packed float and packed integer instructions – Intel Sandy Bridge – AMD Bulldozer

Thread-Level Parallelism Thread Composition – Instruction streams – Private PC, registers, stack – Shared globals, heap Created and destroyed by programmer Scheduled by programmer or by OS

Simultaneous Multithreading Instructions can be issued from multiple threads Requires partitioning of ROB, other buffers + Minimal hardware duplication + More scheduling freedom for OoO – Cache and execution resource contention can reduce single-threaded performance

Multicore Replicate full pipeline Sandy Bridge-E: 6 cores + Full cores, no resource sharing other than last- level cache + Easier way to take advantage of Moore’s Law – Utilization

Locks, Coherence, and Consistency Problem: multiple threads reading/writing to same data A solution: Locks – Implement with test-and-set, load-link/store- conditional instructions Problem: Who has the correct data? A solution: cache coherency protocol Problem: What is the correct data? A solution: memory consistency model

Conclusions CPU optimized for sequential programming – Pipelines, branch prediction, superscalar, OoO – Reduce execution time with high clock speeds and high utilization Slow memory is a constant problem Parallelism – Sandy Bridge-E great for 6-12 active threads – How about 12,000?

References Milo Martin, Penn CIS501 Fall David Kanter, “Intel's Sandy Bridge Microarchitecture.” 9/25/10. D=RWT D=RWT Agner Fog, “The microarchitecture of Intel, AMD and VIA CPUs.” 6/8/ e.pdf e.pdf

Bibliography Classic Jon Stokes’ articles introducing basic CPU architecture, pipelining (1, 2), and Moore’s Lawarchitecture12 Moore’s Law CMOV discussion on Mozilla mailing listmailing list Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” linklink