Lecture 9: Pipelining.

Notes
Questions about the reading? The book? The paper? Questions about PopNet? The project?
Finish Chapter 6 (Pipelining) today.
Schedule: finish pipelining today; Thursday: caches; Tuesday: review; Thursday: midterm.
The midterm covers all chapters, including some basic cache problems.

Chapter Six -- Pipelining

Pipelining
Improve performance by increasing instruction throughput.
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
Note: timing assumptions changed for this example.
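A quick numerical check of the "ideal speedup" claim (a minimal sketch with made-up instruction counts; it assumes every stage takes one cycle and there are no hazards):

    # Sketch: ideal pipeline speedup for k stages and n instructions.
    def speedup(k, n):
        unpipelined = n * k            # each instruction uses all k stages serially
        pipelined = k + (n - 1)        # fill the pipe once, then one finish per cycle
        return unpipelined / pipelined

    print(speedup(5, 1_000_000))       # approaches 5 (the number of stages) as n grows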

Pipelining
What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues.
We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.

Pipelined Datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

Corrected Datapath

Graphically Representing Pipelines
Can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
Use this representation to help understand datapaths.
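A small sketch of this multiple-clock-cycle diagram: print which stage each instruction occupies in each cycle. The instruction strings are placeholders, and it assumes one instruction issued per cycle with no stalls:

    # Sketch: print a multiple-clock-cycle pipeline diagram (no hazards assumed).
    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def diagram(instructions):
        cycles = len(instructions) + len(STAGES) - 1
        for i, instr in enumerate(instructions):
            row = ["    "] * cycles
            for s, stage in enumerate(STAGES):
                row[i + s] = f"{stage:4}"   # instruction i is in stage s at cycle i+s
            print(f"{instr:16}" + "".join(row))

    diagram(["lw $10,20($1)", "sub $11,$2,$3", "add $12,$3,$4"])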

Pipeline Control

Pipeline Control
We have 5 stages. What needs to be controlled in each stage?
- Instruction Fetch and PC Increment
- Instruction Decode / Register Fetch
- Execution
- Memory Stage
- Write Back
How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine?

Pipeline Control
Pass control signals along just like the data.
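One way to picture this (a sketch, not the book's exact datapath; the signal names follow the usual MIPS control, but the values are illustrative): decode produces all control signals at once, and each pipeline register carries forward only the signals that later stages still need.

    # Sketch: control signals ride along in the pipeline registers with the data.
    def decode_control(opcode):
        if opcode == "lw":
            return {"RegWrite": 1, "MemRead": 1, "MemWrite": 0, "ALUOp": "add"}
        if opcode == "sw":
            return {"RegWrite": 0, "MemRead": 0, "MemWrite": 1, "ALUOp": "add"}
        return {"RegWrite": 1, "MemRead": 0, "MemWrite": 0, "ALUOp": "rtype"}

    ctrl = decode_control("lw")                    # produced in ID...
    ex_mem = {k: ctrl[k] for k in ("RegWrite", "MemRead", "MemWrite")}  # ...to MEM
    mem_wb = {"RegWrite": ex_mem["RegWrite"]}      # ...only WB's signal survives to WB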

Datapath with Control

Dependencies
Problem with starting the next instruction before the first is finished: dependencies that "go backward in time" are data hazards.

Forwarding
Use temporary results; don't wait for them to be written:
- register file forwarding to handle read/write to the same register
- ALU forwarding
What if this $2 was $13?
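The ALU forwarding conditions can be written down directly (a sketch of the textbook-style EX-stage forwarding unit, with register fields as plain integers):

    # Sketch: forwarding-unit logic for one ALU operand (textbook conditions).
    # Returns where the operand should come from: the register file,
    # the EX/MEM pipeline register, or the MEM/WB pipeline register.
    def forward(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, rs):
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
            return "EX/MEM"       # forward the just-computed ALU result
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
            return "MEM/WB"       # forward the value about to be written back
        return "REGFILE"          # no hazard: read the register file normally

    # e.g., sub writing $2 followed immediately by an instruction reading $2:
    print(forward(True, 2, False, 0, 2))   # "EX/MEM"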

Forwarding

Can't always forward
Load word can still cause a hazard: an instruction tries to read a register immediately following a load instruction that writes to the same register. Thus, we need a hazard detection unit to "stall" the pipeline (the instruction after the load waits one cycle).
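The detection condition itself is small (a sketch of the textbook load-use check performed in the ID stage):

    # Sketch: load-use hazard detection (checked in ID).
    # If the instruction in EX is a load and its destination matches either
    # source of the instruction in ID, stall one cycle (insert a bubble).
    def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
        return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

    # lw $2, 20($1) followed immediately by and $4, $2, $5 -> stall
    print(must_stall(True, 2, 2, 5))   # True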

Stalling
We can stall the pipeline by keeping an instruction in the same stage.

Hazard Detection Unit
Stall by letting an instruction that won't write anything go forward.

Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting "branch not taken"; need to add hardware for flushing instructions if we are wrong.

Flushing Instructions
Note: we've also moved the branch decision to the ID stage.

Branches
If the branch is taken, we have a penalty of one cycle.
For our simple design, this is reasonable.
With deeper pipelines, the penalty increases, and static branch prediction drastically hurts performance.
Solution: dynamic branch prediction, e.g., a 2-bit prediction scheme.
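A minimal sketch of one 2-bit saturating counter, as in the classic scheme (states 0-3; 0 and 1 predict not taken, 2 and 3 predict taken; the starting state is an arbitrary choice):

    # Sketch: a single 2-bit saturating-counter branch predictor.
    class TwoBitPredictor:
        def __init__(self):
            self.state = 2            # start weakly "taken" (arbitrary)

        def predict(self):
            return self.state >= 2    # 2,3 -> predict taken; 0,1 -> not taken

        def update(self, taken):
            # Move toward the observed outcome, saturating at 0 and 3, so one
            # anomalous outcome doesn't flip a strongly biased branch.
            self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

    p = TwoBitPredictor()
    for outcome in [True, True, False, True]:   # a mostly-taken branch
        print(p.predict(), outcome)
        p.update(outcome)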

Branch Prediction
Sophisticated techniques:
- A "branch target buffer" to help us look up the destination
- Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
- Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
- A "branch delay slot" which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective! Modern processors predict correctly 95% of the time!

Improving Performance
Try to avoid stalls! E.g., reorder instructions.
Dynamic pipeline scheduling:
- Hardware chooses which instructions to execute next
- Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
- Speculates on branches and keeps the pipeline full (may need to roll back if a prediction is incorrect)
Trying to exploit instruction-level parallelism.

Advanced Pipelining
- Increase the depth of the pipeline
- Start more than one instruction each cycle (multiple issue)
- Loop unrolling to expose more ILP (better scheduling)
- "Superscalar" processors, e.g., DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes")
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
This class has given you the background you need to learn more!
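Loop unrolling, sketched in Python purely for shape (in practice the compiler does this on machine code; the point is that independent accumulators expose ILP that a scheduler or superscalar hardware can overlap):

    # Sketch: the same reduction, rolled and unrolled by 4.
    def rolled(a):
        s = 0
        for x in a:
            s += x                    # one long dependence chain
        return s

    def unrolled(a):                  # assumes len(a) is a multiple of 4
        s0 = s1 = s2 = s3 = 0
        for i in range(0, len(a), 4):
            s0 += a[i]                # four independent chains that can
            s1 += a[i + 1]            # execute in parallel
            s2 += a[i + 2]
            s3 += a[i + 3]
        return s0 + s1 + s2 + s3

    assert rolled(list(range(8))) == unrolled(list(range(8)))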

Chapter 6 Summary
Pipelining does not improve latency, but does improve throughput.

Chapter Seven

Memories: Review
SRAM:
- value is stored on a pair of inverting gates
- very fast, but takes up more space than DRAM (4 to 6 transistors)
DRAM:
- value is stored as a charge on a capacitor (must be refreshed)
- very small, but slower than SRAM (factor of 5 to 10)

Exploiting Memory Hierarchy
Users want large and fast memories!
- SRAM access times are 0.5 – 5 ns, at a cost of $4,000 to $10,000 per GB
- DRAM access times are 50 – 70 ns, at a cost of $100 to $200 per GB
- Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB
Try to give it to them anyway: build a memory hierarchy. (2004 figures)

Locality
A principle that makes having a memory hierarchy a good idea. If an item is referenced:
- temporal locality: it will tend to be referenced again soon
- spatial locality: nearby items will tend to be referenced soon
Why does code have locality? (See the sketch below.)
Our initial focus: two levels (upper, lower)
- block: minimum unit of data
- hit: data requested is in the upper level
- miss: data requested is not in the upper level
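Code has locality largely because of loops and arrays; a minimal sketch:

    # Sketch: why typical code has locality.
    data = [0] * 1024
    total = 0
    for i in range(len(data)):   # the loop body's instructions are re-fetched
        total += data[i]         # every iteration: temporal locality (code).
                                 # data[i] walks consecutive addresses:
                                 # spatial locality (data). `total` and `i`
                                 # are touched every iteration: temporal (data).
    print(total)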

Cache
Two issues:
- How do we know if a data item is in the cache?
- If it is, how do we find it?
Our first example: block size is one word of data, "direct mapped"
- For each item of data at the lower level, there is exactly one location in the cache where it might be
- i.e., lots of items at the lower level share locations in the upper level

Direct Mapped Cache
Mapping: cache index = (block address) modulo (the number of blocks in the cache).
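For a byte address, the mapping is just bit slicing (a sketch; the sizes are assumptions for illustration, e.g., 4-byte blocks and a 1,024-block cache):

    # Sketch: splitting a byte address for a direct-mapped cache.
    BLOCK_SIZE = 4        # bytes per block (assumed)
    NUM_BLOCKS = 1024     # blocks in the cache (assumed)

    def split(addr):
        block_addr = addr // BLOCK_SIZE
        index = block_addr % NUM_BLOCKS   # "address modulo number of blocks"
        tag = block_addr // NUM_BLOCKS    # what's left identifies the block uniquely
        offset = addr % BLOCK_SIZE        # byte within the block
        return tag, index, offset

    print(split(0x1234ABCD))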

Direct Mapped Cache
For MIPS: what kind of locality are we taking advantage of?

Direct Mapped Cache
Taking advantage of spatial locality:

Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
Write hits:
- replace data in cache and memory (write-through)
- write the data only into the cache, and write it back to memory later (write-back)
Write misses: read the entire block into the cache, then write the word.

Hardware Issues
Make reading multiple words easier by using banks of memory.
It can get a lot more complicated...

Performance
Increasing the block size tends to decrease the miss rate:
Use split caches because there is more spatial locality in code:

Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
(Worked numbers below.)
Two ways of improving performance:
- decreasing the miss ratio
- decreasing the miss penalty
What happens if we increase block size?
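Plugging assumed numbers into the simplified model (every figure here is made up for illustration):

    # Sketch: the simplified performance model with assumed numbers.
    instructions = 1_000_000
    base_cpi     = 1.0        # execution cycles per instruction (assumed)
    miss_ratio   = 0.05       # misses per instruction (assumed)
    miss_penalty = 100        # cycles per miss (assumed)
    cycle_time   = 0.5e-9     # seconds per cycle (2 GHz, assumed)

    execution_cycles = instructions * base_cpi
    stall_cycles = instructions * miss_ratio * miss_penalty
    exec_time = (execution_cycles + stall_cycles) * cycle_time
    print(exec_time)          # stalls dominate: 5x the base execution cycles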

Decreasing miss ratio with associativity
Compared to direct mapped, give a series of references that:
- results in a lower miss ratio using a 2-way set associative cache
- results in a higher miss ratio using a 2-way set associative cache
assuming we use the "least recently used" (LRU) replacement strategy.
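A sketch that answers both questions with concrete reference streams, using a tiny 4-block cache (references are block addresses; LRU replacement):

    # Sketch: direct mapped vs. 2-way set associative (LRU), 4 blocks total.
    def misses(refs, num_sets, ways):
        sets = [[] for _ in range(num_sets)]   # each set: LRU list, oldest first
        count = 0
        for b in refs:
            s = sets[b % num_sets]
            if b in s:
                s.remove(b)                    # hit: move to most-recent position
            else:
                count += 1                     # miss
                if len(s) == ways:
                    s.pop(0)                   # evict least recently used
            s.append(b)
        return count

    lower  = [0, 8, 0, 8, 0, 8]   # conflicts in DM; both fit in one 2-way set
    higher = [0, 2, 4, 0, 2, 4]   # DM keeps block 2; 2-way LRU thrashes set 0
    for refs in (lower, higher):
        print(misses(refs, 4, 1), misses(refs, 2, 2))  # direct mapped vs. 2-way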

An implementation

Performance

Decreasing miss penalty with multilevel caches
Add a second level cache:
- often the primary cache is on the same chip as the processor
- use SRAMs to add another cache above primary memory (DRAM)
- miss penalty goes down if the data is in the 2nd level cache
Example: CPI of 1.0 on a 5 GHz machine with a 5% miss rate and 100 ns DRAM access. Adding a 2nd level cache with 5 ns access time decreases the miss rate (to main memory) to 0.5%. (Worked numbers below.)
Using multilevel caches:
- try to optimize the hit time on the 1st level cache
- try to optimize the miss rate on the 2nd level cache
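The arithmetic, sketched out (the cycle counts follow from the 5 GHz clock: 0.2 ns per cycle):

    # Sketch: effective CPI with and without a second-level cache.
    clock = 5e9                       # 5 GHz -> 0.2 ns per cycle
    base_cpi = 1.0
    dram_penalty = 100e-9 * clock     # 100 ns -> 500 cycles
    l2_penalty   = 5e-9 * clock       # 5 ns   -> 25 cycles

    cpi_l1_only = base_cpi + 0.05 * dram_penalty      # 1 + 5% * 500 = 26
    cpi_with_l2 = (base_cpi
                   + 0.05 * l2_penalty                # miss in L1, hit in L2
                   + 0.005 * dram_penalty)            # miss in both: go to DRAM
    print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)   # speedup ~5.5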

Cache Complexities
Not always easy to understand the implications of caches:
- Theoretical behavior of Radix sort vs. Quicksort
- Observed behavior of Radix sort vs. Quicksort

Cache Complexities
Here is why:
- Memory system performance is often the critical factor
- Multilevel caches and pipelined processors make it harder to predict outcomes
- Compiler optimizations to increase locality sometimes hurt ILP
- Difficult to predict the best algorithm: need experimental data

Virtual Memory
Main memory can act as a cache for the secondary storage (disk).
Advantages:
- illusion of having more physical memory
- program relocation
- protection

Pages: virtual memory blocks
Page faults: the data is not in memory; retrieve it from disk.
- huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
- reducing page faults is important (LRU is worth the price)
- can handle the faults in software instead of hardware
- using write-through is too expensive, so we use write-back

Page Tables

Making Address Translation Fast
A cache for address translations: the translation lookaside buffer (TLB).
Typical values: 16 – 512 entries; miss rate: 0.01% – 1%; miss penalty: 10 – 100 cycles.
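A sketch of the lookup path (4 KB pages assumed; the TLB is modeled as a small dict and the page table as a bigger one, with a made-up mapping):

    # Sketch: virtual-to-physical translation through a TLB, 4 KB pages.
    PAGE_SIZE = 4096
    page_table = {5: 42}     # virtual page 5 -> physical page 42 (made up)
    tlb = {}                 # starts empty (cold)

    def translate(vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in tlb:                # TLB hit: fast path, no page-table walk
            ppn = tlb[vpn]
        else:                         # TLB miss: 10-100 extra cycles in practice
            ppn = page_table[vpn]     # (a miss all the way to disk is a page fault)
            tlb[vpn] = ppn            # cache the translation
        return ppn * PAGE_SIZE + offset

    print(hex(translate(5 * PAGE_SIZE + 0x123)))   # TLB miss, translation cached
    print(hex(translate(5 * PAGE_SIZE + 0x456)))   # TLB hit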

TLBs and Caches

Modern Systems
Things are getting complicated!

Some Issues
Processor speeds continue to increase very fast — much faster than either DRAM or disk access times.
Design challenge: dealing with this growing disparity.
- Prefetching?
- 3rd level caches and more?
- Memory design?

Chapters 8 & 9 (partial coverage)

Interfacing Processors and Peripherals
I/O design is affected by many factors (expandability, resilience).
Performance depends on:
— access latency
— throughput
— connection between devices and the system
— the memory hierarchy
— the operating system
A variety of different users (e.g., banks, supercomputers, engineers).

I/O: Important but Neglected
"The difficulties in assessing and designing I/O systems have often relegated I/O to second class status."
"Courses in every aspect of computing, from programming to computer architecture, often ignore I/O or give it scanty coverage."
"Textbooks leave the subject to near the end, making it easier for students and instructors to skip it!"
GUILTY!
— we won't be looking at I/O in much detail
— be sure to read Chapter 8 in its entirety
— you should probably take a networking class!

I/O Devices
Very diverse devices:
— behavior (i.e., input vs. output)
— partner (who is at the other end?)
— data rate

I/O Example: Disk Drives
To access data:
— seek: position the head over the proper track (3 to 14 ms avg.)
— rotational latency: wait for the desired sector (on average half a rotation, i.e., 0.5 / (RPM/60) seconds)
— transfer: grab the data (one or more sectors), 30 to 80 MB/sec
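Adding the three pieces up with assumed numbers (a 10,000 RPM drive, 6 ms average seek, 50 MB/sec transfer, one 512-byte sector — all chosen from the ranges above):

    # Sketch: average time to read one sector from disk (assumed parameters).
    rpm = 10_000
    avg_seek_s = 6e-3                          # 6 ms, within the 3-14 ms range
    transfer_rate = 50e6                       # 50 MB/sec, within 30-80 MB/sec
    sector_bytes = 512

    rotational_s = 0.5 / (rpm / 60)            # half a rotation on average = 3 ms
    transfer_s = sector_bytes / transfer_rate  # ~0.01 ms: tiny next to the mechanics
    print((avg_seek_s + rotational_s + transfer_s) * 1e3, "ms")   # ~9 ms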

I/O Example: Buses
Shared communication link (one or more wires).
Difficult design:
— may be a bottleneck
— length of the bus
— number of devices
— tradeoffs (buffering for higher bandwidth increases latency)
— support for many different devices
— cost
Types of buses:
— processor-memory (short, high speed, custom design)
— backplane (high speed, often standardized, e.g., PCI)
— I/O (lengthy, different devices, e.g., USB, Firewire)
Synchronous vs. asynchronous:
— synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short
— asynchronous: don't use a clock; instead, use handshaking

I/O Bus Standards Today we have two dominant bus standards:

Other Important Issues
Bus arbitration:
— daisy chain arbitration (not very fair)
— centralized arbitration (requires an arbiter), e.g., PCI
— collision detection, e.g., Ethernet
Operating system:
— polling
— interrupts
— direct memory access (DMA)
Performance analysis techniques:
— queuing theory
— simulation
— analysis, i.e., find the weakest link (see "I/O System Design")
Many new developments.

Pentium 4 I/O Options

Fallacies and Pitfalls
Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail.
Fallacy: magnetic disk storage is on its last legs and will be replaced.
Fallacy: a 100 MB/sec bus can transfer 100 MB/sec.
Pitfall: moving functions from the CPU to the I/O processor, expecting to improve performance without analysis.

Multiprocessors
Idea: create powerful computers by connecting many smaller ones.
Good news: works for timesharing (better than a supercomputer).
Bad news: it's really hard to write good concurrent programs; many commercial failures.

Questions
How do parallel processors share data?
— single address space (SMP vs. NUMA)
— message passing
How do parallel processors coordinate?
— synchronization (locks, semaphores)
— built into send / receive primitives
— operating system protocols
How are they implemented?
— connected by a single bus
— connected by a network

Supercomputers Plot of top 500 supercomputer sites over a decade:

Using multiple processors: an old idea
Some SIMD designs: costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972, despite completion of only ¼ of the machine. It took three more years before it was operational!
"For better or worse, computer architects are not easily discouraged."
Lots of interesting designs and ideas, lots of failures, few successes.

Topologies

Clusters
Constructed from whole computers; independent, scalable networks.
Strengths:
- Many applications amenable to loosely coupled machines
- Exploit local area networks
- Cost effective / easy to expand
Weaknesses:
- Administration costs not necessarily lower
- Connected using the I/O bus
Highly available due to separation of memories.
In theory, we should be able to do better.

Google
- Serves an average of 1,000 queries per second
- Google uses 6,000 processors and 12,000 disks
- Two sites in Silicon Valley, two in Virginia
- Each site connected to the Internet using OC48 (2,488 Mbit/sec)
Reliability:
- On an average day, 20 machines need to be rebooted (software error)
- 2% of the machines are replaced each year
In some sense, simple ideas well executed. Better (and cheaper) than other approaches involving increased complexity.

Concluding Remarks
Evolution vs. revolution: "More often the expense of innovation comes from being too disruptive to computer users."
"Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers."