Presentation transcript:

CS61C: Machine Structures, Lecture 28: Intra-machine Parallelism
TA Tom Magrino, Summer 2010 © UCB
inst.eecs.berkeley.edu/~cs61c
Parallel Programming Boot Camp with Par Lab next week! You should be able to understand the basics of what they'll be teaching after taking this course. It is free for UC Berkeley students, so take advantage of this amazing opportunity!

Slide 2: Review
- Parallelism is necessary for performance. It is the future of computing: it is unlikely that serial computing will ever catch up with parallel computing.
- Software parallelism: grids and clusters of networked computers. MPI & MapReduce are two common ways to program them.
- Parallelism is often difficult: speedup is limited by the serial portion of the code and by communication overhead (see the worked example below).
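A worked instance of that last point, Amdahl's law (the values P = 0.9 and N = 16 below are illustrative assumptions, not from the lecture):

\[ \text{Speedup}(P, N) = \frac{1}{(1 - P) + P/N} \]

With a parallel fraction P = 0.9 on N = 16 processors, speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4x. Even as N grows without bound, speedup can never exceed 1 / (1 - P) = 10x: the 10% serial portion caps everything.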

Slide 3: Today
- Why we're going parallel
- Flynn's Taxonomy of parallel architectures
- Superscalars
- Multithreading
- Vector processors
- Multicore
- Challenges for parallelism

Slide 4: Disclaimers
- Please don't let today's material confuse what you have already learned about CPUs and pipelining.
- When "the programmer" is mentioned today, it means whoever is generating the assembly code (so it is probably a compiler).
- Many of the concepts described today are difficult to implement, so if something sounds easy, think about the possible hazards.

Slide 5: Scaling
[Figure-only slide; no text survives in the transcript.]

Slide 6: Conventional Wisdom (CW) in Computer Architecture
1. Old CW: Power is free, but transistors are expensive.
   New CW: the Power Wall. Power is expensive, transistors are "free": we can put more transistors on a chip than we have the power to turn on.
2. Old CW: Multiplies are slow, but loads are fast.
   New CW: the Memory Wall. Loads are slow, multiplies are fast: ~200 clocks to DRAM, but even an FP multiply takes only ~4 clocks (see the sketch below).
3. Old CW: More ILP via compiler/architecture innovation (branch prediction, speculation, out-of-order execution, VLIW, ...).
   New CW: the ILP Wall. Diminishing returns on more ILP.
4. Old CW: 2x CPU performance every 18 months.
   New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall.
Implication: we must go parallel!
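To make the Memory Wall concrete, here is a minimal C microbenchmark sketch (the array size, seed, and compile line are illustrative assumptions, not from the lecture): dependent loads through a shuffled permutation pay DRAM latency on nearly every access, while an equal number of FP multiplies stays inside the core.

```c
/* Memory Wall sketch: chase pointers through a randomly permuted array
 * (each load depends on the previous one, defeating the prefetcher),
 * then do the same number of FP multiplies. Compile: cc -O2 memwall.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M entries: far larger than any cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {         /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];  /* chain of dependent loads */
    clock_t t1 = clock();

    double x = 1.000001, prod = 1.0;
    for (size_t i = 0; i < N; i++) prod *= x;    /* same count of multiplies */
    clock_t t2 = clock();

    printf("dependent loads: %.3fs   multiplies: %.3fs   (p=%zu prod=%f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, p, prod);
    free(next);
    return 0;
}
```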

Slide 7: Flynn's Taxonomy
Classification of parallel architectures (source: www.wikipedia.org):

                        Single Data    Multiple Data
  Single Instruction    SISD           SIMD
  Multiple Instruction  MISD           MIMD

Slide 8: Superscalar
- Add more functional units or pipelines to the CPU.
- Directly reduces CPI by doing more per cycle.
- Consider what happens if we:
  - Added another ALU
  - Added 2 more read ports to the RegFile
  - Added 1 more write port to the RegFile

Slide 9: Simple Superscalar MIPS CPU
[Datapath diagram: a two-issue MIPS pipeline fetching two instructions (Inst0, Inst1) per cycle, with two ALUs, a register file with four read ports (Ra-Rd) and two write ports (W0, W1), and shared instruction and data memories.]
Can now do 2 instructions in 1 cycle!

Slide 10: Simple Superscalar MIPS CPU (cont.)
Considerations:
- We now have to figure out what will use the other functional unit.
- Forwarding for pipelining is now harder.
Limitations of our example:
- Improvement only if other instructions can fill the slots.
- Doesn't scale well.
- Who decides how to use the new slots?

Slide 11: Superscalars in Practice
- Modern superscalar processors usually hide behind a scalar ISA:
  - Gives the illusion of a scalar processor.
  - Dynamically schedules instructions: tries to fill all slots with useful instructions, and detects and avoids hazards.
- Instruction Level Parallelism (ILP): multiple instructions from the same instruction stream in flight at the same time (see the C sketch below).
- Where in this course did we already make use of ILP? Pipelining.
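A small C sketch of what a dynamic scheduler hunts for (the function and variable names are illustrative, not from the slides): independent operations can issue together, while a dependence chain leaves extra functional units idle.

```c
/* Plenty of ILP: the four multiplies in each iteration share no operands
 * or results, so a dual-issue machine can execute them in pairs. */
void scale2(float *a, float *b, float s, int n) {
    for (int i = 0; i + 1 < n; i += 2) {
        a[i]     = a[i]     * s;   /* independent of the line below */
        a[i + 1] = a[i + 1] * s;   /* can issue in the same cycle   */
        b[i]     = b[i]     * s;
        b[i + 1] = b[i + 1] * s;
    }
}

/* Almost no ILP: each multiply needs the previous result, so issue
 * width beyond 1 buys nothing on this loop. */
float product(const float *a, int n) {
    float p = 1.0f;
    for (int i = 0; i < n; i++)
        p = p * a[i];              /* dependence chain through p */
    return p;
}
```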

Slide 12: Multithreading
- There's only so much ILP we can get out of a single program.
- Why not have separate tasks or threads running at the same time on the CPU?
- There's no use letting CPU time go to waste waiting on a single task.
- Let the programmer/OS figure out which tasks can happen in parallel; the CPU can then just assume they're independent.

Slide 13: Multithreading (cont.)
- The idea is to have the processor keep multiple threads in flight at the same time.
- Less work for the CPU to figure out conflicts: the programmer/OS already figured out what is independent!
- All the CPU has to do is make sure the threads don't trample each other (see the sketch below).
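A minimal POSIX-threads sketch of handing the OS two independent tasks (the function and variable names are illustrative assumptions, not from the lecture); because each thread writes only its own slot, the hardware never has to untangle a conflict:

```c
/* Two independent threads, each summing a disjoint half of [0, N).
 * Compile with: cc demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N 1000000L

static long sums[2];                  /* one slot per thread: no sharing */

static void *partial_sum(void *arg) {
    long id = (long)arg;
    long s = 0;
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
        s += i;                       /* independent work per thread */
    sums[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, partial_sum, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("sum = %ld\n", sums[0] + sums[1]);  /* expect N*(N-1)/2 */
    return 0;
}
```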

Slide 14: Static Multithreading
- First attempt: just keep switching threads every N clock cycles!
- We can probably do better...

Slide 15: Dynamic Multithreading
- Improve on this by deciding at run time when to switch threads.
- That way we make even more use of the CPU, even when some threads take a long time on a single instruction.
- This can be hard to do: some threads can "starve" if you aren't "fair".
- We can still do better! How? Hint: superscalars.

Slide 16: Simultaneous Multithreading
- Have different threads active at the SAME time, every cycle.
- Use the CPU's resources as much as possible.
- We'll probably still have some nops, but that's okay.
- Very complex hardware to do this!
[Figure: functional units (columns) vs. clock cycles (rows), with instructions from different threads filling issue slots in the same cycle.]

Slide 17: Administrivia
- Proj3 and HW9 are due tonight at 11:59pm. A few of you have found some of the questions vague or unclear; if you're ever not sure, just note what you're assuming in your answer.
- Final is Thursday 8–11am in 10 Evans.
- We'll be having two review lectures between now and then.
- We're ending a bit early for course evaluations.
- Don't forget: Paul has office hours today.

Slide 18: Multicore
- Put multiple CPUs on the same die.
- Cost of multicore: complexity and slower single-thread execution.
- Task/Thread Level Parallelism (TLP): multiple instruction streams in flight at the same time.

Slide 19: Multicore Helps Energy Efficiency
- Power ~ CV²f
- Circuit delay is roughly linear with V.
(William Holt, HOT Chips 2005; see the worked example below.)
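A quick back-of-the-envelope sketch of why this favors multicore (the assumption that frequency, and hence required voltage, scales down by the same factor is an idealization, not a claim from the slide):

\[ P \propto C V^2 f \qquad\Rightarrow\qquad P_{\text{two cores}} \propto 2\,C\left(\tfrac{V}{2}\right)^{2}\left(\tfrac{f}{2}\right) = \tfrac{1}{4}\,C V^2 f \]

Two cores at half the frequency and roughly half the voltage deliver the same aggregate throughput as one core at full speed, at about one quarter of the dynamic power.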

Slide 20: Cache Coherence Problem (from John Lazzaro)
Setup: CPU0 and CPU1 each have a private cache and share main memory; address 16 initially holds the value 5.
1. CPU0: LW R2, 16(R0)  ->  CPU0's cache now holds addr 16 = 5.
2. CPU1: LW R2, 16(R0)  ->  CPU1's cache now holds addr 16 = 5.
3. CPU1: SW R0, 16(R0)  ->  CPU1 writes 0; its cache holds addr 16 = 0, but CPU0's cache still holds 5.
The view of memory is no longer "coherent": loads of location 16 from CPU0 and CPU1 see different values!

Slide 21: Real World Example: IBM POWER7
[Figure-only slide; no text survives in the transcript.]

Slide 22: Vector Processors
- Vector processors implement SIMD: one operation applied to a whole vector.
  - Not all programs can easily fit into this model.
  - Can get high performance and energy efficiency by amortizing instruction fetch: spends more silicon and power on ALUs.
- Data Level Parallelism (DLP): compute on multiple pieces of data with the same instruction stream.

Slide 23: Vectors (continued)
- Vector extensions: extra instructions added to a scalar ISA to provide short vectors. Examples: MMX, SSE, AltiVec (see the SSE sketch below).
- Graphics Processing Units (GPUs) also leverage SIMD. Initially very fixed-function for graphics, but now becoming more flexible to support general-purpose computation.
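A minimal sketch of a short-vector extension in use: one SSE instruction performs four float additions at once, amortizing instruction fetch over four data elements (the array names are illustrative):

```c
/* SSE sketch: _mm_add_ps issues one addps instruction that adds
 * four floats in parallel (DLP). Compile on x86: cc -msse demo.c */
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats             */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* 4 adds, one instruction   */
    _mm_storeu_ps(c, vc);             /* store 4 results           */

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}
```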

Slide 24: Parallelism at Different Levels
- Intra-processor (within the core): pipelining & superscalar (ILP); vector extensions & GPUs (DLP).
- Intra-system (within the box): multicore, multiple cores per socket (TLP); multisocket, multiple sockets per system.
- Intra-cluster (within the room): multiple systems per rack and multiple racks per cluster.

Slide 25: Peer Instruction
1) A GPU is an example of SIMD.
2) With simultaneous multithreading, every part of the processor will be active at all times.
Answer choices: a) FF   b) FT   c) TF   d) TT

Slide 26: Peer Instruction Answer
1) A GPU is an example of SIMD: GPUs work on large batches of data at a time ... TRUE
2) With simultaneous multithreading, every part of the processor will be active at all times: there is still the possibility that your threads all compete for a single part of the processor and nothing else ... FALSE
Correct answer: c) TF

Slide 27: "And in conclusion..."
- Modern processors now focus more and more on getting a lot done all at once.
- Types of parallelism: ILP, TLP, DLP.
- Parallelism already exists at many levels.
- This is a great opportunity for change.
- Please stick around for the surveys.