
1 Cost/Performance, DLX, Pipelining
Prof. Fred Chong

2 Computer Architecture Is …
the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. (Amdahl, Blaauw, and Brooks, 1964)

3 Computer Architecture’s Changing Definition
1950s to 1960s: Computer Architecture Course = Computer Arithmetic
1970s to mid-1980s: Computer Architecture Course = Instruction Set Design, especially ISAs appropriate for compilers
1990s: Computer Architecture Course = Design of CPU, memory system, I/O system, Multiprocessors

4 Computer Architecture Topics
Input/Output and Storage: disks, WORM, tape; RAID; emerging technologies
Memory Hierarchy: L1 cache, L2 cache, DRAM; interleaving, bus protocols; coherence, bandwidth, latency; addressing, protection, exception handling; VLSI
Instruction Set Architecture
Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP

5 Computer Architecture Topics
Multiprocessors: shared memory, message passing, data parallelism; processor-memory-switch organization
Networks and Interconnections: network interfaces, interconnection network topologies, routing, bandwidth, latency, reliability

6 Measurement & Evaluation
ECS 250A Course Focus: understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
Computer Architecture (instruction set design, organization, hardware/software interface design) sits at the intersection of technology, parallelism, programming languages, applications, operating systems, measurement & evaluation, and history.

7 Topic Coverage Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed., 1996. Performance/Cost, DLX, Pipelining, Caches, Branch Prediction ILP, Loop Unrolling, Scoreboarding, Tomasulo, Dynamic Branch Prediction Trace Scheduling, Speculation Vector Processors, DSPs Memory Hierarchy I/O Interconnection Networks Multiprocessors

8 ECS250A: Staff Instructor: Fred Chong Office: EUII-3031 chong@cs
Office Hours: Mon 4-6pm or by appt. T. A: Diana Keen Office: EUII-2239 TA Office Hours: Fri 1-3pm Class: Mon 6:10-9pm Text: Computer Architecture: A Quantitative Approach, Second Edition (1996) Web page: Lectures available online before 1PM day of lecture Newsgroup: ucd.class.cs250a{.d} This slide is for the 3-min class administrative matters. Make sure we update Handout #1 so it is consistent with this slide.

9 Grading
Problem Sets: 35%
1 in-class exam (prelim simulation): 20%
Project Proposals and Drafts: 10%
Project Final Report: 25%
Project Poster Session (CS colloquium): 10%

10 Assignments
Read Ch 1-3. Problem Set 1 - due Mon 1/25/99 (alone or in pairs).
Project Proposals - due Mon 1/25/99, groups of 2 or 3; see the web page and links; email me and cc: diana about ideas - due Mon 1/18/99; pick 3 research papers.

11 VLSI Transistors
[Schematic: MOS transistor symbols, each with gate G and terminals A and B]

12 CMOS Inverter
[Schematic: CMOS inverter with input In and output Out]

13 CMOS NAND Gate
[Schematic: CMOS NAND gate with inputs A and B and output C]

14 Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = π x (Wafer_diam / 2)^2 / Die_Area - π x Wafer_diam / sqrt(2 x Die_Area) - Test dies
Die yield = Wafer yield x (1 + Defects_per_unit_area x Die_Area / α)^(-α), with α typically around 3
Die cost goes roughly with (die area)^4
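A minimal sketch of these cost formulas in Python; the α ≈ 3 yield exponent and the sample wafer numbers below are illustrative assumptions, not values from the lecture:

import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Gross dies per wafer: usable wafer area minus edge loss and test dies."""
    wafer_area = math.pi * (wafer_diam_cm / 2) ** 2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss - test_dies

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=3.0):
    """Empirical yield model from the slide; alpha ~ 3 is an assumed default."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2,
             wafer_yield=1.0, defects_per_cm2=1.0, test_dies=0):
    """Wafer cost spread over the good dies on that wafer."""
    good_dies = (dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies)
                 * die_yield(wafer_yield, defects_per_cm2, die_area_cm2))
    return wafer_cost / good_dies

# Hypothetical inputs: a $1500, 20 cm wafer, 1 defect/cm^2, 1 cm^2 die.
print(round(die_cost(1500, 20, 1.0), 2))   # roughly $13 per good die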

15 Real World Examples
Die costs circa 1993 (the original table also listed metal layers, line width, wafer cost, defect density, die area in mm², dies per wafer, and yield for each chip): 386DX $4; 486DX $12; PowerPC 601 $53; HP PA 7100 $73; DEC Alpha $149; SuperSPARC $272; Pentium $417.
From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15.

16 Cost/Performance What is Relationship of Cost to Price?
Component Costs
Direct Costs (add 25% to 40%): recurring costs - labor, purchasing, scrap, warranty
Gross Margin (add 82% to 186%): nonrecurring costs - R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
Average Discount to get List Price (add 33% to 66%): volume discounts and/or retailer markup
As a share of list price: average discount 25% to 40%, gross margin 34% to 39%, direct cost 6% to 8%, component cost 15% to 33%
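A hedged sketch of how these markups compound, in Python; the specific percentages passed in below are hypothetical values chosen from within the slide's ranges:

def list_price(component_cost, direct_pct=0.33, gross_margin_pct=1.00, discount_pct=0.50):
    """Chain the markups: component cost -> direct cost -> avg. selling price -> list price."""
    direct_cost = component_cost * (1 + direct_pct)          # add 25%-40%
    avg_selling_price = direct_cost * (1 + gross_margin_pct)  # add 82%-186%
    return avg_selling_price * (1 + discount_pct)             # add 33%-66%

print(round(list_price(500), 2))  # e.g. $500 of components -> about $1995 list price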

17 Chip Prices (August 1993): assume purchase of 10,000 units
Chip: die area (mm²), mfg. cost, comment
386DX: 43, $9, intense competition
486DX2: 81, $35, no competition
PowerPC 601: $77
DEC Alpha: 234, $202, recoup R&D?
Pentium: 296, $473, early in shipments

18 Summary: Price vs. Cost

19 Technology Trends: Microprocessor Capacity
Transistor counts: Alpha 21264: 15 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Sparc Ultra: 5.2 million
Moore's Law. CMOS improvements: die size 2X every 3 years; line width halves every 7 years

20 Memory Capacity (Single Chip DRAM)
[Table: year, DRAM size (Mb), and cycle time (ns) per DRAM generation]

21 Technology Trends (Summary)
Capacity Speed (latency) Logic 2x in 3 years 2x in 3 years DRAM 4x in 3 years 2x in 10 years Disk 4x in 3 years 2x in 10 years

22 Processor Performance Trends
[Chart: relative performance (log scale, 0.1 to 1000) vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors]

23 Processor Performance (1.35X before, 1.55X now)
1.54X/yr

24 Performance Trends (Summary)
Workstation performance (measured in Spec Marks) improves roughly 50% per year (2X every 18 months) Improvement in cost performance estimated at 70% per year

25 Computer Engineering Methodology
Technology Trends

26 Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks Benchmarks Where to start: existing systems bottlenecks Technology Trends

27 Computer Engineering Methodology
Evaluate Existing Systems for Bottlenecks Benchmarks Technology Trends Simulate New Designs and Organizations Workloads

28 Computer Engineering Methodology
Design cycle: evaluate existing systems for bottlenecks (benchmarks), simulate new designs and organizations (workloads), implement the next generation system, guided by technology trends.
Implementation complexity: how hard to build; importance of simplicity (wearing a seat belt) and avoiding a personal disaster; theory vs. practice.

29 Measurement Tools Benchmarks, Traces, Mixes
Hardware: Cost, delay, area, power estimation Simulation (many levels) ISA, RT, Gate, Circuit Queuing Theory Rules of Thumb Fundamental “Laws”/Principles

30 The Bottom Line: Performance (and Cost)
Boeing 747: DC to Paris 6.5 hours, speed 610 mph, 470 passengers, throughput 286,700 passenger-mph
BAC/Sud Concorde: DC to Paris 3 hours, speed 1350 mph, 132 passengers, throughput 178,200 passenger-mph
Which is fastest for 1 person? Which takes less time to transport 470 passengers?
Time to run the task (ExTime): execution time, response time, latency
Tasks per day, hour, week, sec, ns, ... (Performance): throughput, bandwidth

31 The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means ExTime(Y) Performance(X) = ExTime(X) Performance(Y) Speed of Concorde vs. Boeing 747 Throughput of Boeing 747 vs. Concorde 1350 / 610 = 2.2X 286,700/ 178, X

32 Amdahl's Law Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected

33 Amdahl's Law
ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

34 Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew = Speedupoverall =

35 Amdahl's Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
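The same calculation as a small Python helper (a sketch of the formula above, not course-provided code):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up by a given factor."""
    return 1.0 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# The slide's FP example: 10% of the time sped up 2x.
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053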

36 Metrics of Performance
Application: answers per month, operations per second
Programming Language / Compiler: (millions of) instructions per second (MIPS), (millions of) FP operations per second (MFLOP/s)
ISA / Datapath: megabytes per second
Control / Function Units: cycles per second (clock rate)
Transistors, wires, pins

37 Aspects of CPU Performance
CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
What affects each factor (Inst Count, CPI, Clock Rate):
Program: Inst Count
Compiler: Inst Count, (CPI)
Instruction Set: Inst Count, CPI
Organization: CPI, Clock Rate
Technology: Clock Rate

38 Cycles Per Instruction
"Average Cycles per Instruction"
CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count
CPU time = CycleTime x Σ (CPI_i x I_i), for i = 1..n
"Instruction Frequency"
CPI = Σ (CPI_i x F_i), for i = 1..n, where F_i = I_i / Instruction Count
Invest resources where time is spent!

39 Example: Calculating CPI
Base Machine (Reg / Reg)
Op, Freq, Cycles, CPI(i), (% Time): ALU 50%, 1, 0.5 (33%); Load 20%, 2, 0.4 (27%); Store 10%, 2, 0.2 (13%); Branch 20%, 2, 0.4 (27%)
Typical mix; overall CPI = 1.5
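A quick Python check of this mix, using the per-op cycle counts shown above (1 for ALU, 2 for load, store, and branch):

def cpi_from_mix(mix):
    """mix: list of (frequency, cycles) pairs; returns the average CPI."""
    return sum(freq * cycles for freq, cycles in mix)

mix = [(0.50, 1), (0.20, 2), (0.10, 2), (0.20, 2)]  # ALU, Load, Store, Branch
cpi = cpi_from_mix(mix)
print(round(cpi, 2))                                 # 1.5
print([round(f * c / cpi, 2) for f, c in mix])       # share of time: 0.33, 0.27, 0.13, 0.27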

40 SPEC: System Performance Evaluation Cooperative
First Round 1989: 10 programs yielding a single number ("SPECmarks")
Second Round 1992: SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs). Compiler flags unlimited; March 93 flags for a DEC 4000 Model 610 included: spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"; wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200; nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
Third Round 1995: new set of programs, SPECint95 (8 integer programs) and SPECfp95 (10 floating point); "benchmarks useful for 3 years"; single flag setting for all programs: SPECint_base95, SPECfp_base95

41 How to Summarize Performance
Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi x Ti)
Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/Σ(1/Ri), or 1/Σ(Wi/Ri) with weights summing to 1
Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π xi)^(1/n)
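A small Python sketch of these summaries; the sample ratios at the end are made-up numbers for illustration:

from math import prod

def arithmetic_mean(times, weights=None):
    """Mean of execution times, optionally weighted (weights should sum to 1)."""
    weights = weights or [1 / len(times)] * len(times)
    return sum(w * t for w, t in zip(weights, times))

def harmonic_mean(rates, weights=None):
    """Mean of rates (e.g., MFLOPS) that tracks total execution time."""
    weights = weights or [1 / len(rates)] * len(rates)
    return 1 / sum(w / r for w, r in zip(weights, rates))

def geometric_mean(ratios):
    """The right way to average normalized execution-time ratios."""
    return prod(ratios) ** (1 / len(ratios))

print(round(geometric_mean([1.2, 0.8, 2.0]), 3))  # hypothetical ratios vs. a reference machine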

42 SPEC First Round One program: 99% of time in single line of code
New front-end compiler could improve dramatically

43 Impact of Means on SPECmark89 for IBM 550
Columns: Ratio to VAX, Time, and Weighted Time, each before and after the compiler change, for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv.
Summary of the improvement as reported by each mean: Geometric, ratio 1.33; Arithmetic, ratio 1.16; Weighted Arithmetic, ratio 1.09

44 Performance Evaluation
"For better or worse, benchmarks shape a field"
Good products are created when you have: good benchmarks and good ways to summarize performance
Given that sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
If the benchmarks/summary are inadequate, then a company must choose between improving the product for real programs vs. improving the product to get more sales; sales almost always wins!
Execution time is the measure of computer performance!

45 Instruction Set Architecture (ISA)
software / instruction set / hardware: the instruction set is the interface between software and hardware

46 Interface Design A good interface:
Lasts through many implementations (portability, compatibility)
Is used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels
[Diagram: over time, one interface supports many uses and many implementations (imp 1, imp 2, imp 3)]

47 Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
High-level Language Based (Burroughs B5000, 1963); Concept of a Family (IBM 360, 1964)
General Purpose Register Machines
Load/Store Architecture (CDC 6600, Cray-1, 1963-76); Complex Instruction Sets (VAX, Intel 432, 1977-80)
RISC (MIPS, SPARC, HP-PA, IBM RS6000, 1987)

48 Evolution of Instruction Sets
Major advances in computer architecture are typically associated with landmark instruction set designs Ex: Stack vs GPR (System 360) Design decisions must take into account: technology machine organization programming languages compiler technology operating systems And they in turn influence these

49 A "Typical" RISC 32-bit fixed format instruction (3 formats)
32 32-bit GPR (R0 contains zero, DP take pair) 3-address, reg-reg arithmetic instruction Single address mode for load/store: base + displacement no indirection Simple branch conditions Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

50 Example: MIPS Instruction Formats
Register-Register: Op (bits 31-26), Rs1 (25-21), Rs2 (20-16), Rd (15-11), Opx (10-0)
Register-Immediate: Op (31-26), Rs1 (25-21), Rd (20-16), immediate (15-0)
Branch: Op (31-26), Rs1 (25-21), Rs2/Opx (20-16), immediate (15-0)
Jump / Call: Op (31-26), target (25-0)

51 Summary, #1 Designing to Last through Trends Time to run the task
Capacity Speed Logic 2x in 3 years 2x in 3 years DRAM 4x in 3 years 2x in 10 years Disk 4x in 3 years 2x in 10 years 6 yrs to graduate => 16X CPU speed, DRAM/Disk size Time to run the task Execution time, response time, latency Tasks per day, hour, week, sec, ns, … Throughput, bandwidth “X is n times faster than Y” means ExTime(Y) Performance(X) = ExTime(X) Performance(Y)

52 Summary, #2 Amdahl’s Law: CPI Law:
Amdahl's Law: Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
CPI Law: CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
Execution time is the REAL measure of computer performance!
Good products are created when you have: good benchmarks, good ways to summarize performance
Die cost goes roughly with (die area)^4
Can the PC industry support engineering/research investment?

53 Pipelining: It's Natural!
Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D

54 Sequential Laundry
[Timeline, 6 PM to midnight: loads A, B, C, D run back to back, each taking 30 min wash + 40 min dry + 20 min fold]
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

55 Pipelined Laundry: Start work ASAP
[Timeline, 6 PM onward: washes for A, B, C, D start every 40 minutes and the wash/dry/fold stages overlap]
Pipelined laundry takes 3.5 hours for 4 loads

56 Pipelining Lessons
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" the pipeline and time to "drain" it reduce speedup

57 Computer Pipelines Execute billions of instructions, so throughput is what matters DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

58 5 Steps of DLX Datapath Figure 3.1, Page 130
Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc Memory Access Write Back IR L M D

59 Pipelined DLX Datapath Figure 3.4, page 137
Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Write Back Memory Access Data stationary control local decode for each instruction phase / pipeline stage

60 Visualizing Pipelining Figure 3.3, Page 133
[Figure 3.3: pipeline diagram, instruction order vs. time in clock cycles, showing successive instructions overlapped in the five stages]

61 It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
Control hazards: pipelining of branches and other instructions that change the PC; the common solution is to stall the pipeline until the hazard is resolved, inserting "bubbles" in the pipeline

62 One Memory Port/Structural Hazards Figure 3.6, Page 142
[Figure 3.6 pipeline diagram: a Load followed by Instr 1, Instr 2, Instr 3, Instr 4, all sharing one memory port; instruction order vs. time in clock cycles]

63 One Memory Port/Structural Hazards Figure 3.7, Page 143
[Figure 3.7 pipeline diagram: a Load followed by Instr 1, Instr 2, a stall bubble, then Instr 3]

64 Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
Speedup = [(Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)] x (Clock Cycle_unpipelined / Clock Cycle_pipelined)
If the ideal CPI is 1: Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] x (Clock Cycle_unpipelined / Clock Cycle_pipelined)

65 Example: Dual-port vs. Single-port
Machine A: dual-ported memory
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both; loads are 40% of instructions executed
SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster
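A short Python sketch reproducing this comparison; the pipeline depth of 10 is an arbitrary placeholder, since it cancels out of the final ratio:

def pipeline_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * (T_unpipelined / T_pipelined)."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * (cycle_unpipelined / cycle_pipelined)

depth = 10
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)                          # dual-ported memory
speedup_b = pipeline_speedup(depth, stall_cpi=0.4, cycle_pipelined=1/1.05)  # 40% loads, 1-cycle stall, 5% faster clock
print(round(speedup_a / speedup_b, 2))  # 1.33: machine A wins despite B's faster clock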

66 Data Hazard on R1 Figure 3.9, page 147
Time (clock cycles) IF ID/RF EX MEM WB I n s t r. O r d e add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

67 Three Generic Data Hazards
InstrI followed by InstrJ Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

68 Three Generic Data Hazards
InstrI followed by InstrJ
Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand
Can't happen in the DLX 5-stage pipeline because: all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5

69 Three Generic Data Hazards
InstrI followed by InstrJ Write After Write (WAW) InstrJ tries to write operand before InstrI writes it Leaves wrong result ( InstrI not InstrJ ) Can’t happen in DLX 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 Will see WAR and WAW in later more complicated pipes

70 Forwarding to Avoid Data Hazard Figure 3.10, Page 149
Time (clock cycles) I n s t r. O r d e add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

71 HW Change for Forwarding Figure 3.20, Page 161

72 Data Hazard Even with Forwarding Figure 3.12, Page 153
Time (clock cycles); instruction order:
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
(MIPS actually didn't interlock: MPU without Interlocked Pipelined Stages)

73 Data Hazard Even with Forwarding Figure 3.13, Page 154
Time (clock cycles) I n s t r. O r d e lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9

74 Software Scheduling to Avoid Load Hazards
Try producing fast code for: a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

75 Control Hazard on Branches Three Stage Stall

76 Branch Stall Impact
If CPI = 1 and 30% of instructions are branches that stall 3 cycles => new CPI = 1.9!
Two-part solution: determine whether the branch is taken sooner, AND compute the taken-branch address earlier
DLX branch tests if a register = 0 or ≠ 0
DLX solution: move the zero test to the ID/RF stage; add an adder to calculate the new PC in the ID/RF stage; 1 clock cycle penalty for branch versus 3
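A one-line model of this arithmetic in Python (a sketch, not course code):

def cpi_with_branches(base_cpi, branch_freq, branch_penalty):
    """Effective CPI when a fraction of instructions are branches that each stall."""
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(1.0, 0.30, 3))  # 1.9 with a three-cycle stall
print(cpi_with_branches(1.0, 0.30, 1))  # 1.3 once the zero test and adder move to ID/RF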

77 Pipelined DLX Datapath Figure 3.22, page 163
Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc. Memory Access Write Back This is the correct 1 cycle latency implementation! Does MIPS test affect clock (add forwarding logic too!)

78 Four Branch Hazard Alternatives
#1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% DLX branches not taken on average PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken 53% DLX branches taken on average But haven’t calculated branch target address in DLX DLX still incurs 1 cycle branch penalty Other machines: branch target known before outcome

79 Four Branch Hazard Alternatives
#4: Delayed Branch: define the branch to take place AFTER a following instruction
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n    (branch delay of length n)
  branch target if taken
A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline; DLX uses this

80 Delayed Branch Where to get instructions to fill branch delay slot?
Before branch instruction From the target address: only valuable when branch taken From fall through: only valuable when branch not taken Cancelling branches allow more slots to be filled Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

81 Evaluating Branch Alternatives
Comparison of scheduling schemes (columns: branch penalty, CPI, speedup vs. unpipelined, speedup vs. stall): stall pipeline, predict taken, predict not taken, delayed branch
Conditional & unconditional branches = 14% of instructions; 65% change the PC

82 Pipelining Summary
Just overlap tasks; easy if the tasks are independent
Speedup ≤ Pipeline Depth; if the ideal CPI is 1, then:
Speedup = [Pipeline Depth / (1 + Pipeline stall CPI)] x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
Hazards limit performance on computers:
Structural: need more HW resources
Data (RAW, WAR, WAW): need forwarding, compiler scheduling
Control: delayed branch, prediction

83 Lecture 2: Caches and Advanced Pipelining
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

84 Review, #1 Designing to Last through Trends Time to run the task
Capacity Speed Logic 2x in 3 years 2x in 3 years DRAM 4x in 3 years 2x in 10 years Disk 4x in 3 years 2x in 10 years Processor ( n.a.) 2x in 1.5 years Time to run the task Execution time, response time, latency Tasks per day, hour, week, sec, ns, … Throughput, bandwidth “X is n times faster than Y” means ExTime(Y) Performance(X) = ExTime(X) Performance(Y)

85 Assignments
Chapter 4
Problem Set 2 (can work in project groups)
Project Drafts in 3 weeks

86 Review, #2 Amdahl’s Law: CPI Law:
Amdahl's Law: Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
CPI Law: CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
Execution time is the REAL measure of computer performance!
Good products are created when you have: good benchmarks, good ways to summarize performance
Die cost goes roughly with (die area)^4

87 Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency): CPU performance grows 60%/yr (2X/1.5 yr, "Moore's Law"), DRAM only 9%/yr (2X/10 yrs), so the processor-memory performance gap grows about 50% per year
[Chart: performance (log scale, 1 to 1000) vs. time, 1980-2000, for CPU and DRAM]
Latency cliché: note that the x86 didn't have on-chip cache until 1989

88 Levels of the Memory Hierarchy
Upper levels: smaller capacity, faster access, higher cost per bit; lower levels: larger and slower. Each level lists capacity, access time, cost, staging/transfer unit, and who manages the transfer.
CPU registers: 100s of bytes, <10s ns; instruction operands (1-8 bytes), managed by the program/compiler
Cache: K bytes, roughly 1-0.1 cents/bit; blocks (8-128 bytes), managed by the cache controller
Main memory: M bytes, 200ns-500ns; pages (512-4K bytes), managed by the OS
Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; files (Mbytes), managed by the user/operator
Tape: infinite capacity, sec-min access, 10^-8 cents/bit

89 The Principle of Locality
Program access a relatively small portion of the address space at any instant of time. Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access) Last 15 years, HW relied on localilty for speed The principle of locality states that programs access a relatively small portion of the address space at any instant of time. This is kind of like in real life, we all have a lot of friends. But at any given time most of us can only keep in touch with a small group of them. There are two different types of locality: Temporal and Spatial. Temporal locality is the locality in time which says if an item is referenced, it will tend to be referenced again soon. This is like saying if you just talk to one of your friends, it is likely that you will talk to him or her again soon. This makes sense. For example, if you just have lunch with a friend, you may say, let’s go to the ball game this Sunday. So you will talk to him again soon. Spatial locality is the locality in space. It says if an item is referenced, items whose addresses are close by tend to be referenced soon. Once again, using our analogy. We can usually divide our friends into groups. Like friends from high school, friends from work, friends from home. Let’s say you just talk to one of your friends from high school and she may say something like: “So did you hear so and so just won the lottery.” You probably will say NO, I better give him a call and find out more. So this is an example of spatial locality. You just talked to a friend from your high school days. As a result, you end up talking to another high school friend. Or at least in this case, you hope he still remember you are his friend. +3 = 10 min. (X:50)

90 Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on the 21264!)
A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that are hits is defined as the HIT rate. HIT Time is the time to access the Upper Level where the data is found (X). It consists of: (a) Time to access this level. (b) AND the time to determine if this is a Hit or Miss. If the data the processor wants cannot be found in the Upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level. By definition (definition of Hit: Fraction), the miss rate is just 1 minus the hit rate. This miss penalty also consists of two parts: (a) The time it takes to replace a block (Blk Y to Blk X) in the upper level. (b) And then the time it takes to deliver this new block to the processor. It is very important that your Hit Time be much, much smaller than your miss penalty. Otherwise, there will be no reason to build a memory hierarchy. +2 = 14 min. (X:54)
[Diagram: Block X in upper-level memory, Block Y in lower-level memory, with data flowing to and from the processor]

91 Cache Measures Hit rate: fraction found in that level
So high that we usually talk about the miss rate instead
Miss rate fallacy: miss rate can mislead about memory performance just as MIPS can mislead about CPU performance; what matters is average memory access time
Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including time to replace it in the CPU
  access time: time to reach the lower level = f(latency to lower level)
  transfer time: time to transfer the block = f(bandwidth between upper & lower levels)
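The AMAT formula as a tiny Python helper; the hit time, miss rate, and miss penalty used below are hypothetical values:

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as the inputs (ns or clocks)."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 50))  # 3.5 cycles for a 1-cycle hit, 5% miss rate, 50-cycle penalty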

92 Simplest Cache: Direct Mapped
Memory Address Memory 4 Byte Direct Mapped Cache 1 Cache Index 2 3 1 4 2 5 3 6 Let’s look at the simplest cache one can build. A direct mapped cache that only has 4 bytes. In this direct mapped cache with only 4 bytes, location 0 of the cache can be occupied by data form memory location 0, 4, 8, C, ... and so on. While location 1 of the cache can be occupied by data from memory location 1, 5, 9, ... etc. So in general, the cache location where a memory location can map to is uniquely determined by the 2 least significant bits of the address (Cache Index). For example here, any memory location whose two least significant bits of the address are 0s can go to cache location zero. With so many memory locations to chose from, which one should we place in the cache? Of course, the one we have read or write most recently because by the principle of temporal locality, the one we just touch is most likely to be the one we will need again soon. Of all the possible memory locations that can be placed in cache Location 0, how can we tell which one is in the cache? +2 = 22 min. (Y:02) Location 0 can be occupied by data from: Memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s Address<1:0> => cache index Which one should we place in the cache? How can we tell which one is in the cache? 7 8 9 A B C D E F

93 1 KB Direct Mapped Cache, 32B blocks
For a 2**N byte cache: the uppermost (32 - N) bits are always the Cache Tag; the lowest M bits are the Byte Select (Block Size = 2**M)
Here: bits <31:10> are the Cache Tag (example: 0x50, stored as part of the cache "state"), bits <9:5> are the Cache Index (ex: 0x01), and bits <4:0> are the Byte Select (ex: 0x00)
[Diagram: cache array with a Valid Bit, Cache Tag, and 32 bytes of Cache Data (Byte 0 ... Byte 31) per entry; entry 1 holds tag 0x50 with Byte 32 ... Byte 63, continuing down to Byte 992 ... Byte 1023 in entry 31]
Let's use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with block size equal to 32 bytes. In other words, each block associated with the cache tag will have 32 bytes in it (Row 1). With Block Size equal to 32 bytes, the 5 least significant bits of the address will be used as byte select within the cache block. Since the cache size is 1K byte, the upper 32 minus 10 bits, or 22 bits of the address will be stored as cache tag. The rest of the address bits in the middle, that is bits 5 through 9, will be used as Cache Index to select the proper cache entry. +2 = 30 min. (Y:10)
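A Python sketch of this address decomposition for the 1 KB, 32-byte-block example; the helper name and the test address are made up for illustration:

def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Decompose an address into (tag, index, byte select) for a direct-mapped cache.
    Assumes cache_bytes and block_bytes are powers of two."""
    offset_bits = block_bytes.bit_length() - 1                   # 5 bits for 32-byte blocks
    num_blocks = cache_bytes // block_bytes                      # 32 blocks
    index_bits = num_blocks.bit_length() - 1                     # 5 bits of index
    byte_select = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_blocks - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select

# An address built to match the slide's example fields: tag 0x50, index 0x01, byte 0x00.
print([hex(x) for x in split_address((0x50 << 10) | (0x01 << 5) | 0x00)])  # ['0x50', '0x1', '0x0']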

94 Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index N direct mapped caches operates in parallel (N typically 2 to 4) Example: Two-way set associative cache Cache Index selects a “set” from the cache The two tags in the set are compared in parallel Data is selected based on the tag result This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped cache works in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we will select the data on the side where the tag matches occur. This is simple enough. What is its disadvantages? +1 = 36 min. (Y:16) Cache Index Valid Cache Tag Cache Data Cache Data Cache Block 0 Cache Tag Valid : Cache Block 0 : : : Compare Adr Tag Compare 1 Sel1 Mux Sel0 OR Cache Block Hit

95 Disadvantage of Set Associative Cache
N-way Set Associative Cache v. Direct Mapped Cache: N comparators vs. 1 Extra MUX delay for the data Data comes AFTER Hit/Miss In a direct mapped cache, Cache Block is available BEFORE Hit/Miss: Possible to assume a hit and continue. Recover later if miss. First of all, a N-way set associative cache will need N comparators instead of just one comparator (use the right side of the diagram for direct mapped cache). A N-way set associative cache will also be slower than a direct mapped cache because of this extra multiplexer delay. Finally, for a N-way set associative cache, the data will be available AFTER the hit/miss signal becomes valid because the hit/mis is needed to control the data MUX. For a direct mapped cache, that is everything before the MUX on the right or left side, the cache block will be available BEFORE the hit/miss signal (AND gate output) because the data does not have to go through the comparator. This can be an important consideration because the processor can now go ahead and use the data without knowing if it is a Hit or Miss. Just assume it is a hit. Since cache hit rate is in the upper 90% range, you will be ahead of the game 90% of the time and for those 10% of the time that you are wrong, just make sure you can recover. You cannot play this speculation game with a N-way set-associatvie cache because as I said earlier, the data will not be available to you until the hit/miss signal is valid. +2 = 38 min. (Y:18) Cache Data Cache Block 0 Cache Tag Valid : Cache Index Mux 1 Sel1 Sel0 Cache Block Compare Adr Tag OR Hit

96 4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

97 Q1: Where can a block be placed in the upper level?
direct mapped - 1 place n-way set associative - n places fully-associative - any place

98 Q2: How is a block found if it is in the upper level?
Tag on each block No need to check index or block offset Increasing associativity shrinks index, expands tag Block Address Block offset Tag Index

99 Q3: Which block should be replaced on a miss?
Easy for Direct Mapped Set Associative or Fully Associative: Random LRU (Least Recently Used) Associativity: 2-way 4-way 8-way Size LRU Random LRU Random LRU Random 16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0% 64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5% 256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

100 Q4: What happens on a write?
Write through—The information is written to both the block in the cache and to the block in the lower-level memory. Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. is block clean or dirty? Pros and Cons of each? WT: read misses cannot result in writes WB: no repeated writes to same location WT always combined with write buffers so that don’t wait for lower level memory

101 Write Buffer for Write Through
Cache Processor DRAM Write Buffer A Write Buffer is needed between the Cache and Memory Processor: writes data into the cache and the write buffer Memory controller: write contents of the buffer to memory Write buffer is just a FIFO: Typical number of entries: 4 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle Memory system designer’s nightmare: Store frequency (w.r.t. time) -> 1 / DRAM write cycle Write buffer saturation You are right, memory is too slow. We really didn't writ e to the memory directly. We are writing to a write buffer. Once the data is written into the write buffer and assuming a cache hit, the CPU is done with the write. The memory controller will then move the write buffer’s contents to the real memory behind the scene. The write buffer works as long as the frequency of store is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to number of instructions. Remember the DRAM cycle time we talked about last time. It sets the upper limit on how frequent you can write to the main memory. If the store are too close together or the CPU time is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait. +2 = 60 min. (Y:40)

102 Impact of Memory Hierarchy on Algorithms
Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms? “The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, Quicksort: fastest comparison based sorting algorithm when all keys fit in memory Radix sort: also called “linear time” sort because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, from 4000 to Let’s do a short review of what you learned last time. Virtual memory was originally invented as another level of memory hierarchy such that programers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading portions of their program in and out of memory. It was a controversial proposal at that time because very few programers believed software can manage the limited amount of memory resource as well as human. This all changed as DRAM size grows exponentially in the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don’t have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we like to emphasis not the only technique, to translate virtual memory address to physical memory address is to use a page table. TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since TLB is so effective in reducing the address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

103 Quicksort vs. Radix as vary number keys: Instructions
[Chart: instructions per key vs. set size in keys, for radix sort and quicksort]

104 Quicksort vs. Radix as vary number keys: Instrs & Time
[Chart: instructions and time per key vs. set size in keys, for radix sort and quicksort]

105 Quicksort vs. Radix as vary number keys: Cache misses
[Chart: cache misses per key vs. set size in keys, for radix sort and quicksort]
What is the proper approach to fast algorithms?

106 A Modern Memory Hierarchy
By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology, while providing access at the speed offered by the fastest technology.
[Hierarchy diagram] Processor with registers (100s of bytes, ~1 ns) and on-chip cache (Ks of bytes, ~10s of ns); second-level cache (SRAM) and main memory (DRAM) (Ms of bytes, ~100s of ns); secondary storage on disk (Gs of bytes, ~10,000,000s of ns, i.e., 10s of ms); tertiary storage on disk/tape (Ts of bytes, ~10,000,000,000s of ns, i.e., 10s of sec).
The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk). While by taking advantage of the principle of locality, we like to provide the user an average access speed that is very close to the speed that is offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches). +1 = 16 min. (X:56)

107 Basic Issues in VM System Design
Size of the information blocks that are transferred from secondary to main storage (M)
If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
Which region of M is to hold the new block --> placement policy
Missing items are fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size, pages (virtual) and page frames (physical), moving through the reg / cache / mem / disk hierarchy

108 Address Map V = {0, 1, . . . , n - 1} virtual address space
M = {0, 1, ..., m - 1} physical address space, with n > m
MAP: V --> M ∪ {0}, the address mapping function
MAP(a) = a' if data at virtual address a is present at physical address a' in M
MAP(a) = 0 if data at virtual address a is not present in M: a missing-item fault
[Diagram: the processor presents virtual address a in name space V to the address translation mechanism; on a hit it yields physical address a' in main memory, on a fault the fault handler is invoked and the OS transfers the missing item from secondary memory]

109 Paging Organization
Page size 1K: the page is the unit of mapping and also the unit of transfer from virtual to physical memory
Virtual memory pages 0, 1, ..., 31 (virtual addresses 0, 1024, ..., 31744) are mapped by the address translation MAP onto physical memory frames 0, 1, ..., 7 (physical addresses 0, 1024, ..., 7168)
Address mapping: a virtual address = page number + 10-bit displacement; the page number plus the Page Table Base Reg index into the page table, which is located in physical memory; each entry holds Access Rights, a valid bit V, and the physical frame number; the physical memory address is formed from the frame number and the displacement (actually, concatenation is more likely than addition)
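A minimal Python sketch of this page-table lookup with 1 KB pages; the page-table contents and the helper names are hypothetical:

PAGE_SIZE = 1024  # 1 KB pages, as in the slide

def translate(virtual_addr, page_table):
    """Translate a virtual address to a physical address via a flat page table.
    page_table maps virtual page number -> (valid, physical frame number)."""
    vpn, disp = divmod(virtual_addr, PAGE_SIZE)
    valid, frame = page_table.get(vpn, (False, None))
    if not valid:
        raise LookupError(f"page fault on virtual page {vpn}")
    return frame * PAGE_SIZE + disp   # concatenate frame number and displacement

page_table = {0: (True, 7), 31: (True, 0)}   # hypothetical mappings
print(translate(31 * 1024 + 5, page_table))  # virtual address 31749 -> physical address 5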

110 Virtual Address and a Cache
[Diagram: the CPU issues a VA; translation yields a PA; a cache hit returns data, a miss goes to main memory]
It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
ASIDE: Why access the cache with the PA at all? VA caches have a problem: the synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address! On an update you must update all cache entries with the same physical address or memory becomes inconsistent; determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits.

111 TLBs A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB Virtual Address Physical Address Dirty Ref Valid Access Really just a cache on the page table mappings TLB access time comparable to cache access time (much less than main memory access time)

112 Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically no more than 128-256 entries even on high-end machines; this permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.
[Diagram: the CPU presents a VA; a TLB lookup hit supplies the PA for the cache access, a TLB miss falls back to translation; roughly, a TLB lookup takes 1/2 t, a cache access t, and a full translation 20 t]

113 Reducing Translation Time
Machines with TLBs go one step further to reduce # cycles/cache access They overlap the cache access with the TLB access: high order bits of the VA are used to look in the TLB while low order bits are used as index into cache

114 Overlapped Cache & TLB Access
High-order bits of the VA go to the TLB (associative lookup on the 20-bit page number) while the low-order bits (the 12-bit displacement) index the cache in parallel; in the example, a 1K x 4-byte direct-mapped cache uses a 10-bit index and a 2-bit byte offset, and the cache tag is compared against the PA delivered by the TLB.
IF cache hit AND (cache tag = PA) THEN deliver data to CPU
ELSE IF [cache miss OR (cache tag ≠ PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do standard VA translation

115 Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation
This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache
Example: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB. The cache index plus byte offset now needs 13 bits, but only the 12-bit page displacement is untranslated, so the top index bit comes from the virtual page number: that bit is changed by VA translation but is needed for the cache lookup.
Solutions: go to 8 KB page sizes; or go to a 2-way set associative cache (a 1K x 2-way set-associative cache keeps the index within the displacement)

116 Summary #1/4: The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time.
Temporal Locality: locality in time
Spatial Locality: locality in space
Three major categories of cache misses:
Compulsory misses: sad facts of life; example: cold start misses
Capacity misses: increase cache size
Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping-pong effect!
Write policy:
Write Through: needs a write buffer. Nightmare: WB saturation
Write Back: control can be complex

117 Summary #2 / 4: The Cache Design Space
Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back write allocation The optimal choice is a compromise depends on access characteristics workload use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Cache Size Associativity No fancy replacement policy is needed for the direct mapped cache. As a matter of fact, that is what cause direct mapped trouble to begin with: only one place to go in the cache--causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistic have shown that if a pilot crashed after an engine failure, he or she is more likely to get killed in a multi-engine light airplane than a single engine airplane. The joke among us flight instructors is that: sure, when the engine quit in a single engine stops, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stops, you have a lot of options. It is the need to make a decision that kills those people. Block Size Bad Good Factor A Factor B Less More

118 Summary #3/4: TLB, Virtual Memory
Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
Page tables map virtual addresses to physical addresses
TLBs are important for fast translation
TLB misses are significant in processor performance: funny times, as most systems can't access all of the 2nd-level cache without TLB misses!
Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of memory hierarchy such that programmers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading of portions of their program in and out of memory. It was a controversial proposal at that time because very few programmers believed software could manage the limited amount of memory resource as well as a human. This all changed as DRAM sizes grew exponentially in the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don't have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we like to emphasize not the only technique, to translate virtual memory addresses to physical memory addresses is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since the TLB is so effective in reducing the address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

119 Summary #4/4: Memory Hierachy
Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy
Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy
Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?

120 Case Study: MIPS R4000 (200 MHz)
8-stage pipeline:
IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access
IS: second half of the access to the instruction cache
RF: instruction decode and register fetch, hazard checking, and also instruction cache hit detection
EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation
DF: data fetch, first half of the access to the data cache
DS: second half of the access to the data cache
TC: tag check, determine whether the data cache access hit
WB: write back for loads and register-register operations
8 stages: what is the impact on load delay? Branch delay? Why?
Answer is 3 stages between branch and new instruction fetch and 2 stages between load and use (even though, if you looked at the red insertions, it would be 3 for load and 2 for branch). Reasons: 1) Load: TC just does the tag check, data is available after DS; thus supply the data and forward it, restarting the pipeline on a data cache miss. 2) The EX phase does the address calculation even though we just added one phase; the presumed reason is that, wanting a fast clock cycle, they didn't want to stick the RF phase with reading registers AND testing for zero, so it was just moved back one phase.

121 Case Study: MIPS R4000
TWO-cycle load latency and THREE-cycle branch latency
[Pipeline diagrams: successive instructions advance through IF IS RF EX DF DS TC WB one cycle apart, showing two cycles between a load and its use and three cycles between a branch and the redirected fetch]
(conditions evaluated during the EX phase)
Delay slot plus two stalls; branch likely cancels the delay slot if not taken

122 MIPS R4000 Floating Point FP Adder, FP Multiplier, FP Divider
The last step of the FP Multiplier/Divider uses the FP Adder HW
8 kinds of stages in the FP units:
A - FP adder: mantissa ADD stage
D - FP divider: divide pipeline stage
E - FP multiplier: exception test stage
M - FP multiplier: first stage of multiplier
N - FP multiplier: second stage of multiplier
R - FP adder: rounding stage
S - FP adder: operand shift stage
U - unpack FP numbers

123 MIPS FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8 …
FP instruction and its stages per clock cycle (1, 2, 3, ...):
Add, Subtract: U, S+A, A+R, R+S
Multiply: U, E+M, M, M, M, N, N+A, R
Divide: U, A, R, D (28 cycles) ..., D+A, D+R, D+R, D+A, D+R, A, R
Square root: U, E, (A+R) repeated 108 times ..., A, R
Negate: U, S
Absolute value: U, S
FP compare: U, A, R
Stage key: M first stage of multiplier; N second stage of multiplier; R rounding stage; S operand shift stage; U unpack FP numbers; A mantissa ADD stage; D divide pipeline stage; E exception test stage

124 R4000 Performance Not ideal CPI of 1:
Load stalls (1 or 2 clock cycles) Branch stalls (2 cycles + unfilled slots) FP result stalls: RAW data hazard (latency) FP structural stalls: Not enough FP hardware (parallelism)

125 Advanced Pipelining and Instruction Level Parallelism (ILP)
ILP: Overlap execution of unrelated instructions gcc 17% control transfer 5 instructions + 1 branch Beyond single block to get more instruction level parallelism Loop level parallelism one opportunity, SW and HW Do examples and then explain nomenclature DLX Floating Point as example Measurements suggests R4000 performance FP execution has room for improvement

126 FP Loop: Where are the Hazards?
Loop: LD F0,0(R1)    ;F0=vector element
      ADDD F4,F0,F2  ;add scalar from F2
      SD 0(R1),F4    ;store result
      SUBI R1,R1,8   ;decrement pointer 8B (DW)
      BNEZ R1,Loop   ;branch R1!=zero
      NOP            ;delayed branch slot
Latencies (instruction producing result -> instruction using result, in clock cycles): FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1; load double -> store double: 0; integer op -> integer op: 0
Where are the stalls?

127 FP Loop Hazards
Loop: LD F0,0(R1)    ;F0=vector element
      ADDD F4,F0,F2  ;add scalar in F2
      SD 0(R1),F4    ;store result
      SUBI R1,R1,8   ;decrement pointer 8B (DW)
      BNEZ R1,Loop   ;branch R1!=zero
      NOP            ;delayed branch slot
Latencies: FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1; load double -> store double: 0; integer op -> integer op: 0

128 FP Loop Showing Stalls 9 clocks: Rewrite code to minimize stalls?
1 Loop: LD F0,0(R1)    ;F0=vector element
2       stall
3       ADDD F4,F0,F2  ;add scalar in F2
4       stall
5       stall
6       SD 0(R1),F4    ;store result
7       SUBI R1,R1,8   ;decrement pointer 8B (DW)
8       BNEZ R1,Loop   ;branch R1!=zero
9       stall          ;delayed branch slot
Latencies: FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1
9 clocks: rewrite the code to minimize stalls?

129 Revised FP Loop Minimizing Stalls
1 Loop: LD F0,0(R1)
2       stall
3       ADDD F4,F0,F2
4       SUBI R1,R1,8
5       BNEZ R1,Loop   ;delayed branch
6       SD 8(R1),F4    ;altered when moved past SUBI
Swap BNEZ and SD by changing the address of SD
Latencies: FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1
6 clocks: unroll the loop 4 times to make the code faster?

130 Unroll Loop Four Times (straightforward way)
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4      ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8     ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12   ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32   ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP
15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration
Assumes R1 is a multiple of 4
Rewrite the loop to minimize stalls?

131 Unrolled Loop That Minimizes Stalls
1 Loop: LD F0,0(R1)
2       LD F6,-8(R1)
3       LD F10,-16(R1)
4       LD F14,-24(R1)
5       ADDD F4,F0,F2
6       ADDD F8,F6,F2
7       ADDD F12,F10,F2
8       ADDD F16,F14,F2
9       SD 0(R1),F4
10      SD -8(R1),F8
11      SD -16(R1),F12
12      SUBI R1,R1,#32
13      BNEZ R1,LOOP
14      SD 8(R1),F16   ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
When is it safe to move instructions? What assumptions were made when the code was moved? OK to move the store past SUBI even though it changes a register; OK to move loads before stores: do we get the right data? When is it safe for the compiler to do such changes?

132 Compiler Perspectives on Code Movement
Definitions: compiler concerned about dependencies in program, whether or not a HW hazard depends on a given pipeline Try to schedule to avoid hazards (True) Data dependencies (RAW if a hazard for HW) Instruction i produces a result used by instruction j, or Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. If dependent, can’t execute in parallel Easy to determine for registers (fixed names) Hard for memory: Does 100(R4) = 20(R6)? From different loop iterations, does 20(R6) = 20(R6)?

133 Where are the data dependencies?
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,8
4       BNEZ R1,Loop   ;delayed branch
5       SD 8(R1),F4    ;altered when moved past SUBI

134 Compiler Perspectives on Code Movement
Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don’t exchange data Antidependence (WAR if a hazard for HW) Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first Output dependence (WAW if a hazard for HW) Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.

135 Where are the name dependencies?
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4      ;drop SUBI & BNEZ
4       LD F0,-8(R1)
5       ADDD F4,F0,F2
6       SD -8(R1),F4     ;drop SUBI & BNEZ
7       LD F0,-16(R1)
8       ADDD F4,F0,F2
9       SD -16(R1),F4    ;drop SUBI & BNEZ
10      LD F0,-24(R1)
11      ADDD F4,F0,F2
12      SD -24(R1),F4
13      SUBI R1,R1,#32   ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP
How can we remove them?

136 Where are the name dependencies?
1 Loop: LD F0,0(R1)
2       ADDD F4,F0,F2
3       SD 0(R1),F4      ;drop SUBI & BNEZ
4       LD F6,-8(R1)
5       ADDD F8,F6,F2
6       SD -8(R1),F8     ;drop SUBI & BNEZ
7       LD F10,-16(R1)
8       ADDD F12,F10,F2
9       SD -16(R1),F12   ;drop SUBI & BNEZ
10      LD F14,-24(R1)
11      ADDD F16,F14,F2
12      SD -24(R1),F16
13      SUBI R1,R1,#32   ;alter to 4*8
14      BNEZ R1,LOOP
15      NOP
Called "register renaming"

137 Compiler Perspectives on Code Movement
Again Name Dependenceis are Hard for Memory Accesses Does 100(R4) = 20(R6)? From different loop iterations, does 20(R6) = 20(R6)? Our example required compiler to know that if R1 doesn’t change then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1) There were no dependencies between some loads and stores so they could be moved by each other

138 Compiler Perspectives on Code Movement
Final kind of dependence called control dependence Example if p1 {S1;}; if p2 {S2;}; S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

139 Compiler Perspectives on Code Movement
Two (obvious) constraints on control dependences: An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. Control dependencies relaxed to get parallelism; get same effect if preserve order of exceptions (address in register checked by branch before use) and data flow (value in register depends on branch)

140 Where are the control dependencies?
1 Loop: LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 SUBI R1,R1,8 5 BEQZ R1,exit 6 LD F0,0(R1) 7 ADDD F4,F0,F2 8 SD 0(R1),F4 9 SUBI R1,R1,8 10 BEQZ R1,exit 11 LD F0,0(R1) 12 ADDD F4,F0,F2 13 SD 0(R1),F4 14 SUBI R1,R1,8 15 BEQZ R1,exit ....

141 When Safe to Unroll Loop?
Example: Where are data dependencies? (A,B,C distinct & nonoverlapping) for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1];} /* S2 */ 1. S2 uses the value, A[i+1], computed by S1 in the same iteration. 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a “loop-carried dependence”: between iterations Implies that iterations are dependent, and can’t be executed in parallel Not the case for our prior example; each iteration was distinct
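A short C sketch contrasting the two cases (A, B, C are the slide's arrays; X and s stand in for the earlier unrolled DLX loop and are illustrative): the first loop carries a dependence between iterations, the second does not.

void dependence_demo(double A[], double B[], double C[], double X[], double s) {
    /* Loop-carried: iteration i writes A[i+1], which iteration i+1 reads,
       so iterations cannot simply run in parallel as written.
       (Caller supplies arrays of at least 102 elements.) */
    for (int i = 1; i <= 100; i = i + 1) {
        A[i + 1] = A[i] + C[i];        /* S1 */
        B[i + 1] = B[i] + A[i + 1];    /* S2 */
    }

    /* No loop-carried dependence: each iteration touches only X[i],
       so the loop can be unrolled or its iterations overlapped. */
    for (int i = 1; i <= 100; i = i + 1)
        X[i] = X[i] + s;
}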

142 HW Schemes: Instruction Parallelism
Why in HW at run time? Works when can’t know real dependence at compile time Compiler simpler Code for one machine runs well on another Key idea: Allow instructions behind stall to proceed DIVD F0,F2,F4 ADDD F10,F0,F8 SUBD F12,F8,F14 Enables out-of-order execution => out-of-order completion ID stage checked both for structural and data hazards; Scoreboard dates to CDC 6600 in 1963

143 HW Schemes: Instruction Parallelism
Out-of-order execution divides ID stage: 1. Issue—decode instructions, check for structural hazards 2. Read operands—wait until no data hazards, then read operands Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions CDC 6600: In order issue, out of order execution, out of order commit ( also called completion)

144 Scoreboard Implications
Out-of-order completion => WAR, WAW hazards? Solutions for WAR Queue both the operation and copies of its operands Read registers only during Read Operands stage For WAW, must detect hazard: stall until other completes Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units Scoreboard keeps track of dependencies, state of operations Scoreboard replaces ID, EX, WB with 4 stages

145 Four Stages of Scoreboard Control
1. Issue—decode instructions & check for structural hazards (ID1) If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared. 2. Read operands—wait until no data hazards, then read operands (ID2) A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

146 Four Stages of Scoreboard Control
3. Execution—operate on operands (EX) The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. 4. Write result—finish execution (WB) Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction. Example: DIVD F0,F2,F4 ADDD F10,F0,F8 SUBD F8,F8,F14 CDC 6600 scoreboard would stall SUBD until ADDD reads operands

147 Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in 2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy—Indicates whether the unit is busy or not Op—Operation to perform in the unit (e.g., + or –) Fi—Destination register Fj, Fk—Source-register numbers Qj, Qk—Functional units producing source registers Fj, Fk Rj, Rk—Flags indicating when Fj, Fk are ready 3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register What you might have thought 1. 4 stages of instruction execution 2. Status of FU: Normal things to keep track of (RAW & structural, for busy): Fi from instruction format of the machine (Fi is dest) Add unit can Add or Sub Rj, Rk - status of registers (Yes means ready) Qj, Qk - If a no in Rj, Rk, means waiting for a FU to write result; Qj, Qk means which FU it is waiting for 3. Status of register result (WAW & WAR): which FU is going to write into registers Scoreboard on 6600 = size of FU 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17 FU latencies: Add 2, Mult 10, Div 40 clocks
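A small C sketch of this bookkeeping (a plausible encoding for illustration, not the CDC 6600's actual hardware): one functional-unit status entry plus the check the issue stage makes for structural and WAW hazards.

#include <stdbool.h>

#define NUM_REGS 32

typedef struct {            /* one functional-unit status entry */
    bool busy;              /* Busy: unit in use? */
    int  op;                /* Op: operation to perform */
    int  Fi, Fj, Fk;        /* destination and source register numbers */
    int  Qj, Qk;            /* FUs producing Fj, Fk (-1 = none pending) */
    bool Rj, Rk;            /* source operands ready to read? */
} FUStatus;

int reg_result[NUM_REGS];   /* register result status: FU that will write each reg, -1 if none */

/* Issue stalls on a structural hazard (unit busy) or a WAW hazard
   (some active instruction already has the same destination register). */
bool can_issue(const FUStatus *fu, int dest_reg) {
    return !fu->busy && reg_result[dest_reg] == -1;
}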

148 Scoreboard Example

149 Scoreboard Example Cycle 1

150 Scoreboard Example Cycle 2
Issue 2nd LD?

151 Scoreboard Example Cycle 3
Issue MULT?

152 Scoreboard Example Cycle 4

153 Scoreboard Example Cycle 5

154 Scoreboard Example Cycle 6

155 Scoreboard Example Cycle 7
Read multiply operands?

156 Scoreboard Example Cycle 8a

157 Scoreboard Example Cycle 8b

158 Scoreboard Example Cycle 9
Read operands for MULT & SUBD? Issue ADDD?

159 Scoreboard Example Cycle 11

160 Scoreboard Example Cycle 12
Read operands for DIVD?

161 Scoreboard Example Cycle 13

162 Scoreboard Example Cycle 14

163 Scoreboard Example Cycle 15

164 Scoreboard Example Cycle 16

165 Scoreboard Example Cycle 17
Write result of ADDD?

166 Scoreboard Example Cycle 18

167 Scoreboard Example Cycle 19

168 Scoreboard Example Cycle 20

169 Scoreboard Example Cycle 21

170 Scoreboard Example Cycle 22

171 Scoreboard Example Cycle 61

172 Scoreboard Example Cycle 62

173 CDC 6600 Scoreboard Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) limits benefit Limitations of 6600 scoreboard: No forwarding hardware Limited to instructions in basic block (small window) Small number of functional units (structural hazards), especially integer/load store units

174 Summary Instruction Level Parallelism (ILP) in SW or HW
Loop level parallelism is easiest to see SW parallelism dependencies defined for program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops Memory dependencies hardest to determine HW exploiting ILP Works when can’t know dependence at compile time Code for one machine runs well on another Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands) Enables out-of-order execution => out-of-order completion

175 ECS 250A Computer Architecture
Lecture 3: Tomasulo Algorithm, Dynamic Branch Prediction, VLIW, Software Pipelining, and Limits to ILP Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

176 Assignments Read Ch 5 Problem Set 3 out on Wed Problem Set 2 back soon
Proposal comments by soon

177 Review: Summary Instruction Level Parallelism (ILP) in SW or HW
Loop level parallelism is easiest to see SW parallelism dependencies defined for program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops Memory dependencies hardest to determine HW exploiting ILP Works when can’t know dependence at compile time Code for one machine runs well on another Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands) Enables out-of-order execution => out-of-order completion

178 Review: Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in 2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each functional unit Busy—Indicates whether the unit is busy or not Op—Operation to perform in the unit (e.g., + or –) Fi—Destination register Fj, Fk—Source-register numbers Qj, Qk—Functional units producing source registers Fj, Fk Rj, Rk—Flags indicating when Fj, Fk are ready 3. Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register What you might have thought 1. 4 stages of instruction execution 2. Status of FU: Normal things to keep track of (RAW & structural, for busy): Fi from instruction format of the machine (Fi is dest) Add unit can Add or Sub Rj, Rk - status of registers (Yes means ready) Qj, Qk - If a no in Rj, Rk, means waiting for a FU to write result; Qj, Qk means which FU it is waiting for 3. Status of register result (WAW & WAR): which FU is going to write into registers Scoreboard on 6600 = size of FU 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17 FU latencies: Add 2, Mult 10, Div 40 clocks

179 Review: Scoreboard Example Cycle 3
Issue MULT? No, stall on structural hazard

180 Review: Scoreboard Example Cycle 9
Read operands for MULT & SUBD? Issue ADDD?

181 Review: Scoreboard Example Cycle 17
Write result of ADDD? No, WAR hazard

182 Review: Scoreboard Example Cycle 62
In-order issue; out-of-order execute & commit

183 Review: Scoreboard Summary
Speedup 1.7 from compiler; 2.5 by hand BUT slow memory (no cache) Limitations of 6600 scoreboard No forwarding (First write register then read it) Limited to instructions in basic block (small window) Number of functional units(structural hazards) Wait for WAR hazards Prevent WAW hazards

184 Another Dynamic Algorithm: Tomasulo Algorithm
For IBM 360/91 about 3 years after CDC 6600 (1966) Goal: High Performance without special compilers Differences between IBM 360 & CDC 6600 ISA IBM has only 2 register specifiers/instr vs. 3 in CDC 6600 IBM has 4 FP registers vs. 8 in CDC 6600 Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

185 Tomasulo Algorithm vs. Scoreboard
Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard; FU buffers called “reservation stations”; have pending operands Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming ; avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers can’t Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue

186 Tomasulo Organization
(Figure: the Tomasulo datapath: FP Op Queue, FP Registers, Load Buffers, Store Buffers, and FP Add / FP Mul reservation stations, all connected by the Common Data Bus. RAW memory conflicts are resolved via the addresses held in the memory buffers; the integer unit executes in parallel.)

187 Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –) Vj, Vk—Value of Source operands Store buffers has V field, result to be stored Qj, Qk—Reservation stations producing source registers (value to be written) Note: No ready flags as in Scoreboard; Qj,Qk=0 => ready Store buffers only have Qi for RS producing result Busy—Indicates reservation station or FU is busy Register result status—Indicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.

188 Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers). 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available Common data bus: data + source (“come from” bus) 64 bits of data + 4 bits of Functional Unit source address Write if matches expected Functional Unit (produces result) Does the broadcast
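A minimal C sketch (field names follow the reservation-station slide above; the sizes, types, and "0 = ready" encoding are assumptions) of a reservation station and of the CDB broadcast that the write-result stage performs.

typedef struct {            /* one reservation station */
    int    busy;            /* station holds a pending operation? */
    int    op;
    double Vj, Vk;          /* operand values, once they are known */
    int    Qj, Qk;          /* numbers of the RSs producing them; 0 = value ready */
} ResStation;

/* Write result: broadcast (producer RS number, value) on the Common Data
   Bus; every waiting station that matches captures the value. */
void cdb_broadcast(ResStation rs[], int n, int producer, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].Qj == producer) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == producer) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
}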

189 Tomasulo Example Cycle 0

190 Tomasulo Example Cycle 1
Yes

191 Tomasulo Example Cycle 2
Note: Unlike 6600, can have multiple loads outstanding

192 Tomasulo Example Cycle 3
Note: registers names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard Load1 completing; what is waiting for Load1?

193 Tomasulo Example Cycle 4
Load2 completing; what is waiting for it?

194 Tomasulo Example Cycle 5

195 Tomasulo Example Cycle 6
Issue ADDD here vs. scoreboard?

196 Tomasulo Example Cycle 7
Add1 completing; what is waiting for it?

197 Tomasulo Example Cycle 8

198 Tomasulo Example Cycle 9

199 Tomasulo Example Cycle 10
Add2 completing; what is waiting for it?

200 Tomasulo Example Cycle 11
Write result of ADDD here vs. scoreboard?

201 Tomasulo Example Cycle 12
Note: all quick instructions complete already

202 Tomasulo Example Cycle 13

203 Tomasulo Example Cycle 14

204 Tomasulo Example Cycle 15
Mult1 completing; what is waiting for it?

205 Tomasulo Example Cycle 16
Note: Just waiting for divide

206 Tomasulo Example Cycle 55

207 Tomasulo Example Cycle 56
Mult 2 completing; what is waiting for it?

208 Tomasulo Example Cycle 57
Again, in-order issue, out-of-order execution & completion

209 Compare to Scoreboard Cycle 62
Why takes longer on Scoreboard/6600?

210 Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)
Tomasulo vs. Scoreboard:
Functional units: pipelined (6 load, 3 store, 3 +, 2 x/÷) vs. multiple (1 load/store, 1 +, 2 x, 1 ÷)
Window size: ≤ 14 instructions vs. ≤ 5 instructions
No issue on structural hazard: same for both
WAR: renaming avoids vs. stall completion
WAW: renaming avoids vs. stall completion
Results: broadcast from FU vs. write/read registers
Control: reservation stations vs. central scoreboard

211 Tomasulo Drawbacks Complexity
delays of 360/91, MIPS 10000, IBM 620? Many associative stores (CDB) at high speed Performance limited by Common Data Bus Multiple CDBs => more FU logic for parallel assoc stores

212 Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1
SUBI R1 R1 #8 BNEZ R1 Loop Assume Multiply takes 4 clocks Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)

213 Loop Example Cycle 0

214 Loop Example Cycle 1

215 Loop Example Cycle 2

216 Loop Example Cycle 3 Note: MULT1 has no registers names in RS

217 Loop Example Cycle 4

218 Loop Example Cycle 5

219 Loop Example Cycle 6 Note: F0 never sees Load1 result

220 Loop Example Cycle 7 Note: MULT2 has no registers names in RS

221 Loop Example Cycle 8

222 Loop Example Cycle 9 Load1 completing; what is waiting for it?

223 Loop Example Cycle 10 Load2 completing; what is waiting for it?

224 Loop Example Cycle 11

225 Loop Example Cycle 12

226 Loop Example Cycle 13

227 Loop Example Cycle 14 Mult1 completing; what is waiting for it?

228 Loop Example Cycle 15 Mult2 completing; what is waiting for it?

229 Loop Example Cycle 16

230 Loop Example Cycle 17

231 Loop Example Cycle 18

232 Loop Example Cycle 19

233 Loop Example Cycle 20

234 Loop Example Cycle 21

235 Tomasulo Summary Reservations stations: renaming to larger set of registers + buffering source operands Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions Dynamic scheduling Register renaming Load/store disambiguation 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

236 Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction) Branch History Table: lower bits of PC address index table of 1-bit values Says whether or not branch taken last time No address check Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit): End of loop case, when it exits instead of looping as before First time through loop on next time through code, when it predicts exit instead of looping

237 Dynamic Branch Prediction
Solution: 2-bit scheme where change prediction only if get misprediction twice (Figure 4.13, p. 264). Four states: Predict Taken (strong and weak) and Predict Not Taken (strong and weak), with T/NT transitions between them; red means stop (not taken), green means go (taken)
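A minimal C sketch of the 2-bit scheme (the table size and PC indexing are assumptions): counters saturate at 0 and 3, so a single misprediction in a "strong" state does not flip the prediction.

#define BHT_ENTRIES 4096

static unsigned char bht[BHT_ENTRIES];          /* 2-bit counters, values 0..3 */

int bht_predict_taken(unsigned pc) {            /* index with low-order PC bits */
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
}

void bht_update(unsigned pc, int taken) {       /* saturating increment/decrement */
    unsigned char *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}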

238 BHT Accuracy Mispredict because either:
Wrong guess for that branch Got branch history of wrong branch when index the table 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% 4096 about as good as infinite table (in Alpha )

239 Correlating Branches Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table In general, an (m,n) predictor means record last m branches to select between 2^m history tables, each with n-bit counters Old 2-bit BHT is then a (0,2) predictor

240 Correlating Branches (2,2) predictor
Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction (Figure: the branch address plus a 2-bit global branch history select among the 2-bits-per-branch predictors to form the prediction)
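A C sketch of a (2,2) predictor in the same style (the 1024-entry size and PC indexing are assumptions for illustration): the 2-bit global history picks one of four 2-bit counters per branch entry.

#define CB_ENTRIES 1024

static unsigned char pht[CB_ENTRIES][4];   /* 2^m = 4 two-bit counters per branch entry */
static unsigned global_hist = 0;           /* outcomes of the last m = 2 branches */

int corr_predict_taken(unsigned pc) {
    return pht[(pc >> 2) % CB_ENTRIES][global_hist] >= 2;
}

void corr_update(unsigned pc, int taken) {
    unsigned char *c = &pht[(pc >> 2) % CB_ENTRIES][global_hist];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    global_hist = ((global_hist << 1) | (taken ? 1u : 0u)) & 0x3;  /* keep last 2 outcomes */
}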

241 Accuracy of Different Schemes (Figure 4.21, p. 272)
(Figure: frequency of mispredictions, from 0% up to 18%, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT)

242 Re-evaluating Correlation
Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches. Branch %: compress 14%, eqntott 25%, gcc 15%, mpeg 10%, real gcc 13% Real programs + OS more like gcc Small benefits beyond benchmarks for correlation? problems with branch aliases?

243 Need Address at Same Time as Prediction
Branch Target Buffer (BTB): Address of branch indexes to get prediction AND branch address (if taken) Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p. 273) Return instruction addresses predicted with stack (Each BTB entry holds the branch prediction, taken or not taken, plus the predicted PC)

244 HW support for More ILP
Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP If false, then neither store result nor cause exception Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr. IA-64: 64 1-bit condition fields selected, so any instruction can be conditionally executed Drawbacks to conditional instructions Still takes a clock even if “annulled” Stall if condition evaluated late Complex conditions reduce effectiveness; condition becomes known late in pipeline

245 Dynamic Branch Prediction Summary
Branch History Table: 2 bits for loop accuracy Correlation: Recently executed branches correlated with next branch Branch Target Buffer: include branch address & prediction Predicated Execution can reduce number of branches, number of mispredicted branches

246 HW support for More ILP
Speculation: allow an instruction to execute without any consequences (including exceptions) if branch is not actually taken (“HW undo”); called “boosting” Combine branch prediction with dynamic scheduling to execute before branches resolved Separate speculative bypassing of results from real bypassing of results When instruction no longer speculative, write boosted results (instruction commit) or discard boosted results execute out-of-order but commit in-order to prevent irrevocable action (update state or exception) until instruction commits

247 HW support for More ILP Need HW buffer for results of uncommitted instructions: reorder buffer 3 fields: instr, destination, value Reorder buffer can be operand source => more registers like RS Use reorder buffer number instead of reservation station when execution completes Supplies operands between execution complete & commit Once operand commits, result is put into register Instructions commit As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions (Figure: the reorder buffer sits between the FP Op Queue / FP registers and the reservation stations feeding the FP adders)

248 Four Steps of Speculative Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”) 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”) 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available. 4. Commit—update register with reorder result When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
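A C sketch of the reorder buffer's commit step (the size and the circular-queue encoding are assumptions): execution may finish out of order, but only the entry at the head may update architectural state.

#include <stdbool.h>

#define ROB_SIZE 16

typedef struct {        /* reorder-buffer entry: instruction, destination, value */
    bool ready;         /* result already written back on the CDB? */
    int  dest;          /* architectural register to update at commit */
    long value;
} ROBEntry;

static ROBEntry rob[ROB_SIZE];
static int head = 0, tail = 0;      /* commit from head, allocate at tail */
static long arch_regs[32];

void try_commit(void) {
    if (head != tail && rob[head].ready) {          /* in-order commit only */
        arch_regs[rob[head].dest] = rob[head].value;
        head = (head + 1) % ROB_SIZE;
    }
    /* On a mispredicted branch the whole buffer is flushed (head = tail). */
}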

249 Renaming Registers Common variation of speculative design
Reorder buffer keeps instruction information but not the result Extend register file with extra renaming registers to hold speculative results Rename register allocated at issue; result into rename register on execution complete; rename register into real register on commit Operands read either from register file (real or speculative) or via Common Data Bus Advantage: operands are always from single source (extended register file)

250 Dynamic Scheduling in PowerPC 604 and Pentium Pro
Both In-order Issue, Out-of-order execution, In-order Commit Pentium Pro more like a scoreboard since central control vs. distributed

251 Dynamic Scheduling in PowerPC 604 and Pentium Pro
Parameter: PPC / PPro
Max. instructions issued/clock: 4 / 3
Max. instr. complete exec./clock: 6 / 5
Max. instr. committed/clock: 6 / 3
Window (instrs in reorder buffer): 16 / 40
Number of reservation stations: 12 / 20
Number of rename registers: 8 int + 12 FP / 40
No. integer functional units (FUs): 2 / 2
No. floating point FUs: 1 / 1
No. branch FUs:
No. complex integer FUs: 1 / 0
No. memory FUs: 1 / 1 load + 1 store
Q: How pipeline 1 to 17 byte x86 instructions?

252 Dynamic Scheduling in Pentium Pro
PPro doesn’t pipeline 80x86 instructions PPro decode unit translates the Intel instructions into 72-bit micro-operations (≈ DLX) Sends micro-operations to reorder buffer & reservation stations Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations 12-14 clocks in total pipeline (≈ 3 state machines) Many instructions translate to 1 to 4 micro-operations Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

253 Getting CPI < 1: Issuing Multiple Instructions/Cycle
Two variations Superscalar: varying no. instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo) IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000 (Very) Long Instruction Words (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; put ops into wide templates Joint HP/Intel agreement in 1999/2000? Intel Architecture-64 (IA-64) 64-bit address Style: “Explicitly Parallel Instruction Computer (EPIC)” Anticipated success lead to use of Instructions Per Clock cycle (IPC) vs. CPI

254 Getting CPI < 1: Issuing Multiple Instructions/Cycle
Superscalar DLX: 2 instructions, 1 FP & 1 anything else – Fetch 64-bits/clock cycle; Int on left, FP on right – Can only issue 2nd instruction if 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair Type Pipe Stages Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB Int. instruction IF ID EX MEM WB FP instruction IF ID EX MEM WB The 1-cycle load delay expands to 3 instructions in SS: the instruction in the right half of the pair can’t use it, nor can the instructions in the next slot

255 Review: Unrolled Loop that Minimizes Stalls for Scalar
1 Loop: LD F0,0(R1) 2 LD F6,-8(R1) 3 LD F10,-16(R1) 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 6 ADDD F8,F6,F2 7 ADDD F12,F10,F2 8 ADDD F16,F14,F2 9 SD 0(R1),F4 10 SD -8(R1),F8 11 SD -16(R1),F12 12 SUBI R1,R1,#32 13 BNEZ R1,LOOP 14 SD 8(R1),F16 ; 8-32 = -24 14 clock cycles, or 3.5 per iteration LD to ADDD: 1 Cycle ADDD to SD: 2 Cycles

256 Loop Unrolling in Superscalar
Integer instruction          FP instruction       Clock cycle
Loop: LD F0,0(R1)                                 1
      LD F6,-8(R1)                                2
      LD F10,-16(R1)         ADDD F4,F0,F2        3
      LD F14,-24(R1)         ADDD F8,F6,F2        4
      LD F18,-32(R1)         ADDD F12,F10,F2      5
      SD 0(R1),F4            ADDD F16,F14,F2      6
      SD -8(R1),F8           ADDD F20,F18,F2      7
      SD -16(R1),F12                              8
      SD -24(R1),F16                              9
      SUBI R1,R1,#40                              10
      BNEZ R1,LOOP                                11
      SD 8(R1),F20 ; 8-40 = -32                   12
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration (1.5X)

257 Multiple Issue Challenges
While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: Exactly 50% FP operations No hazards If more instructions issue at same time, greater difficulty of decode and issue Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue VLIW: tradeoff instruction space for simple decoding The long instruction word has room for many operations By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide Need compiling technique that schedules across several branches

258 Loop Unrolling in VLIW Unrolled 7 times to avoid delays
Memory reference 1   Memory reference 2   FP operation 1    FP operation 2    Int. op/branch    Clock
LD F0,0(R1)          LD F6,-8(R1)                                                               1
LD F10,-16(R1)       LD F14,-24(R1)                                                             2
LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2     ADDD F8,F6,F2                       3
LD F26,-48(R1)                            ADDD F12,F10,F2   ADDD F16,F14,F2                     4
                                          ADDD F20,F18,F2   ADDD F24,F22,F2                     5
SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                       6
SD -16(R1),F12       SD -24(R1),F16                                                             7
SD -32(R1),F20       SD -40(R1),F24                                           SUBI R1,R1,#48    8
SD -0(R1),F28                                                                 BNEZ R1,LOOP      9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)

259 Trace Scheduling Parallelism across IF branches vs. LOOP branches
Two steps: Trace Selection Find likely sequence of basic blocks (trace) of (statically predicted or profile predicted) long sequence of straight-line code Trace Compaction Squeeze trace into few VLIW instructions Need bookkeeping code in case prediction is wrong Compiler undoes bad guess (discards values in registers) Subtle compiler bugs mean wrong answer vs. poorer performance; no hardware interlocks

260 Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
HW determines address conflicts HW better branch prediction HW maintains precise exception model HW does not execute bookkeeping instructions Works across multiple implementations SW speculation is much easier for HW design

261 Superscalar v. VLIW Smaller code size
Binary compatibility across generations of hardware Simplified Hardware for decoding, issuing instructions No Interlock Hardware (compiler checks?) More registers, but simplified Hardware for Register Ports (multiple independent register files?)

262 Intel/HP “Explicitly Parallel Instruction Computer (EPIC)”
3 Instructions in 128 bit “groups”; field determines if instructions dependent or independent Smaller code size than old VLIW, larger than x86/RISC Groups can be linked to show independence > 3 instr 64 integer registers + 64 floating point registers Not separate files per functional unit as in old VLIW Hardware checks dependencies (interlocks => binary compatibility over time) Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions? IA-64: name of instruction set architecture; EPIC is type Merced is name of first implementation (1999/2000?) LIW = EPIC?

263 Dynamic Scheduling in Superscalar
Dependencies stop instruction issue Code compiled for old version will run poorly on newest version May want code to vary depending on how superscalar

264 Dynamic Scheduling in Superscalar
How to issue two instructions and keep in-order instruction issue for Tomasulo? Assume 1 integer + 1 floating point 1 Tomasulo control for integer, 1 for floating point Issue 2X Clock Rate, so that issue remains in order Only FP loads might cause dependency between integer and FP issue: Replace load reservation station with a load queue; operands must be read in the order they are fetched Load checks addresses in Store Queue to avoid RAW violation Store checks addresses in Load Queue to avoid WAR,WAW Called “decoupled architecture”

265 Performance of Dynamic SS
Iteration / Instruction / Issues / Executes / Writes result (clock-cycle number):
1  LD F0,0(R1)     1  2  4
1  ADDD F4,F0,F2
1  SD 0(R1),F4     2  9
1  SUBI R1,R1,#8
1  BNEZ R1,LOOP    4  5
2  LD F0,0(R1)     5  6  8
2  ADDD F4,F0,F2
2  SD 0(R1),F4     6  13
2  SUBI R1,R1,#8
2  BNEZ R1,LOOP    8  9
≈ 4 clocks per iteration; only 1 FP instr/iteration
Branches and decrements still take 1 issue clock cycle each
How get more performance?

266 Software Pipelining Observation: if iterations from loops are independent, then can get more ILP by taking instructions from different iterations Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (­ Tomasulo in SW)

267 Software Pipelining Example
Before: Unrolled 3 times 1 LD F0,0(R1) 2 ADDD F4,F0,F2 3 SD 0(R1),F4 4 LD F6,-8(R1) 5 ADDD F8,F6,F2 6 SD -8(R1),F8 7 LD F10,-16(R1) 8 ADDD F12,F10,F2 9 SD -16(R1),F12 10 SUBI R1,R1,#24 11 BNEZ R1,LOOP After: Software Pipelined 1 SD 0(R1),F4 ; Stores M[i] 2 ADDD F4,F0,F2 ; Adds to M[i-1] 3 LD F0,-16(R1); Loads M[i-2] 4 SUBI R1,R1,#8 5 BNEZ R1,LOOP Symbolic Loop Unrolling Maximize result-use distance Less code space than unrolling Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling (Figure: over time, the SW-pipelined loop keeps overlapped ops in flight, while the unrolled loop fills and drains repeatedly)

268 Limits to Multi-Issue Machines
Inherent limitations of ILP 1 branch in 5: How to keep a 5-way VLIW busy? Latencies of units: many operations must be scheduled Need about Pipeline Depth x No. Functional Units of independent instructions Difficulties in building HW Easy: More instruction bandwidth Easy: Duplicate FUs to get parallel execution Hard: Increase ports to Register File (bandwidth) VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg Harder: Increase ports to memory (bandwidth) Decoding Superscalar and impact on clock rate, pipeline depth?

269 Limits to Multi-Issue Machines
Limitations specific to either Superscalar or VLIW implementation Decode issue in Superscalar: how wide practical? VLIW code size: unroll loops + wasted fields in VLIW IA-64 compresses dependent instructions, but still larger VLIW lock step => 1 hazard & all instructions stall IA-64 not lock step? Dynamic pipeline? VLIW & binary compatibility: IA-64 promises binary compatibility

270 Limits to ILP Conflicting studies of amount
Benchmarks (vectorized Fortran FP vs. integer C programs) Hardware sophistication Compiler sophistication How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on processor performance curve?

271 Limits to ILP Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start: 1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided 2. Branch prediction–perfect; no mispredictions 3. Jump prediction–all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal 1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle

272 Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)
(Figure: IPC achievable on the ideal machine for FP and integer SPEC programs)

273 More Realistic HW: Branch Impact Figure 4.40, Page 323
Change from infinite window to a window of 2000 instructions to examine and maximum issue of 64 instructions per clock cycle (Figure: IPC for FP and integer programs under Perfect, Pick Correlating or BHT, BHT (512), Profile, and No prediction branch schemes)

274 Selective History Predictor
(Figure: a selective history predictor. An 8K x 2-bit selector, indexed by the branch address, chooses between a non-correlating predictor (8096 x 2 bits) and a correlating predictor indexed by 2 bits of global history (2048 x 4 x 2 bits); the chosen 2-bit counter gives the Taken/Not Taken prediction.)

275 More Realistic HW: Register Impact Figure 4.44, Page 328
Instr window, 64 instr issue, 8K 2-level prediction; vary the registers available for renaming (Figure: IPC for FP and integer programs with Infinite, 256, 128, 64, 32, and None renaming registers)

276 More Realistic HW: Alias Impact Figure 4.46, Page 330
Instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers; vary memory-alias analysis (Figure: IPC under Perfect, Global/stack perfect (heap conflicts), Inspection/assembly, and None alias models; FP programs are Fortran with no heap, integer programs reach 4-9 IPC)

277 Realistic HW for ‘9X: Window Impact (Figure 4.48, Page 332)
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many as the window allows (Figure: IPC for FP and integer programs with window sizes Infinite, 256, 128, 64, 32, 16, 8, and 4)

278 Brainiac vs. Speed Demon (1993)
8-scalar IBM 71.5 MHz (5 stage pipe) vs. 2-scalar 200 MHz (7 stage pipe)

279 3 1996 Era Machines Alpha 21164 PPro HP PA-8000 Year 1995 1995 1996
Clock: 400 MHz / 200 MHz / 180 MHz
Cache: 8K/8K/96K/2M / 8K/8K/0.5M / 0/0/2M
Issue rate: 2 int + 2 FP / 3 instr (x86) / 4 instr
Pipe stages:
Out-of-order: 6 loads / 40 instr (µop) / 56 instr
Rename regs: none / 40 / 56

280 SPECint95base Performance (July 1996)

281 SPECfp95base Performance (July 1996)

282 3 1997 Era Machines Alpha 21164 Pentium II HP PA-8000
Clock: 600 MHz (‘97) / 300 MHz (‘97) / 236 MHz (‘97)
Cache: 8K/8K/96K/2M / 16K/16K/0.5M / 0/0/4M
Issue rate: 2 int + 2 FP / 3 instr (x86) / 4 instr
Pipe stages:
Out-of-order: 6 loads / 40 instr (µop) / 56 instr
Rename regs: none / 40 / 56

283 SPECint95base Performance (Oct. 1997)

284 SPECfp95base Performance (Oct. 1997)

285 Summary Branch Prediction
Branch History Table: 2 bits for loop accuracy Recently executed branches correlated with next branch? Branch Target Buffer: include branch address & prediction Predicated Execution can reduce number of branches, number of mispredicted branches Speculation: Out-of-order execution, In-order commit (reorder buffer) SW Pipelining Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead Superscalar and VLIW: CPI < 1 (IPC > 1) Dynamic issue vs. Static issue More instructions issue at same time => larger hazard penalty

286 Lecture 4: Memory Systems
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

287 Assignments Read Appendix B Problem Set 3 due Wed
Problem 4.18 CANCELLED Problem Set 4 available Wed -- due in 2 weeks 5-page project draft due in 2 weeks (M 2/22)

288 Review: Who Cares About the Memory Hierarchy?
Processor Only Thus Far in Course: CPU cost/performance, ISA, Pipelined Execution CPU-DRAM Gap: 1980, no cache in µproc; today, multiple levels of cache on chip (1989: first Intel µproc with a cache on chip) (Figure: processor performance grows 60%/yr (“Moore’s Law”) while DRAM improves 7%/yr, so the processor-memory performance gap grows about 50%/year over 1980-2000; Y-axis is performance, X-axis is time) Latency cliché: note that x86 didn’t have cache on chip until 1989

289 Processor-Memory Performance Gap “Tax”
Processor: % Area (≈cost), % Transistors (≈power). Alpha: 77% of transistors; StrongARM SA110: 61% area, 94% transistors; Pentium Pro: 64% area, 88% transistors (2 dies per package: Proc/I$/D$ + L2$) Caches have no inherent value, only try to close the performance gap; they take 1/3 to 2/3 of the area of the processor The 386 had 0; its successor PPro has a second die!

290 Generations of Microprocessors
Time of a full cache miss in instructions executed: 1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2 or 136 2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4 or 320 3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6 or 648 1/2X latency x 3X clock rate x 3X Instr/clock => ≈5X 1st generation Latency 1/2 but Clock rate 3X and IPC is 3X Now move to other 1/2 of industry

291 Review: Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement) Fully Associative, Set Associative, Direct Mapped Q2: How is a block found if it is in the upper level? (Block identification) Tag/Block Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU Q4: What happens on a write? (Write strategy) Write Back or Write Through (with Write Buffer)

292 Review: Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty) Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

293 Review: Cache Performance
CPUtime = Instruction Count x (CPIexecution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time Misses per instruction = Memory accesses per instruction x Miss rate CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x Clock cycle time
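A tiny C worked example of the CPU-time formula above; the numbers plugged in are illustrative assumptions, not measurements from the text.

#include <stdio.h>

int main(void) {
    double ic = 1e9;                  /* instruction count (assumed) */
    double cpi_exec = 1.0;            /* base CPI without memory stalls (assumed) */
    double accesses_per_instr = 1.3;  /* memory accesses per instruction (assumed) */
    double miss_rate = 0.02;
    double miss_penalty = 50.0;       /* clocks */
    double clock = 2e-9;              /* 500 MHz clock (assumed) */

    double cpi = cpi_exec + accesses_per_instr * miss_rate * miss_penalty;
    printf("CPI = %.2f, CPU time = %.2f s\n", cpi, ic * cpi * clock);
    return 0;
}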

294 Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

295 Reducing Misses Classifying Misses: 3 Cs
Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache) Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache) Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache) Intuitive Model by Mark Hill

296 3Cs Absolute Miss Rate (SPEC92)
Conflict Compulsory vanishingly small

297 2:1 Cache Rule miss rate 1-way associative cache size X = miss rate 2-way associative cache size X/2 Conflict

298 How Can Reduce Misses? 3 Cs: Compulsory, Capacity, Conflict
In all cases, assume total cache size not changed: What happens if: 1) Change Block Size: Which of 3Cs is obviously affected? 2) Change Associativity: Which of 3Cs is obviously affected? 3) Change Compiler: Which of 3Cs is obviously affected? Ask which affected? Block size 1) Compulsory 2) More subtle, will change mapping

299 1. Reduce Misses via Larger Block Size

300 2. Reduce Misses via Higher Associativity
2:1 Cache Rule: Miss Rate of a direct-mapped cache of size N ≈ Miss Rate of a 2-way cache of size N/2 Beware: Execution time is only final measure! Will Clock Cycle time increase? Hill [1988] suggested hit time for 2-way vs. 1-way external cache +10%, internal + 2%

301 Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped (Table: A.M.A.T. by cache size in KB and associativity 1-way / 2-way / 4-way / 8-way; red entries mean A.M.A.T. is not improved by more associativity)
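A hedged C worked example of the comparison the table makes: only the clock-cycle-time factors come from the slide; the miss rates and miss penalty below are assumed for illustration.

#include <stdio.h>

int main(void) {
    double cct[4]       = {1.00, 1.10, 1.12, 1.14};      /* 1-, 2-, 4-, 8-way (slide) */
    double miss_rate[4] = {0.098, 0.076, 0.071, 0.068};  /* assumed small-cache miss rates */
    double penalty      = 25.0;                          /* assumed miss penalty, clocks */

    /* AMAT = hit time (scaled by CCT factor) + miss rate x miss penalty */
    for (int i = 0; i < 4; i++)
        printf("%d-way: AMAT = %.2f clocks\n", 1 << i, cct[i] + miss_rate[i] * penalty);
    return 0;
}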

302 3. Reducing Misses via a “Victim Cache”
How to combine fast hit time of direct mapped yet still avoid conflict misses? Add buffer to place data discarded from cache Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache Used in Alpha, HP machines

303 4. Reducing Misses via “Pseudo-Associativity”
How to combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache? Divide cache: on a miss, check other half of cache to see if there, if so have a pseudo-hit (slow hit) Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles Better for caches not tied directly to processor (L2) Used in MIPS R1000 L2 cache, similar in UltraSPARC (Figure: access time is the hit time on a hit, hit time + pseudo hit time on a pseudo-hit, and the full miss penalty on a miss)

304 5. Reducing Misses by Hardware Prefetching of Instructions & Data
E.g., Instruction Prefetching Alpha fetches 2 blocks on a miss Extra block placed in “stream buffer” On miss check stream buffer Works with data blocks too: Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4 streams got 43% Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches Prefetching relies on having extra memory bandwidth that can be used without penalty

305 6. Reducing Misses by Software Prefetching Data
Data Prefetch Load data into register (HP PA-RISC loads) Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9) Special prefetching instructions cannot cause faults; a form of speculative execution Issuing Prefetch Instructions takes time Is cost of prefetch issues < savings in reduced misses? Higher superscalar reduces difficulty of issue bandwidth

306 7. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4 byte blocks in software Instructions Reorder procedures in memory so as to reduce conflict misses Profiling to look at conflicts(using tools they developed) Data Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays Loop Interchange: change nesting of loops to access data in order stored in memory Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows

307 Merging Arrays Example
/* Before: 2 sequential arrays */ int val[SIZE]; int key[SIZE]; /* After: 1 array of structures */ struct merge { int val; int key; }; struct merge merged_array[SIZE]; Reducing conflicts between val & key; improve spatial locality

308 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
/* After: interchange the i and j loops */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality

309 Loop Fusion Example
/* Before: two separate loop nests */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After: one fused loop nest */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j]; }
2 misses per access to a & c vs. one miss per access; improve temporal locality

310 Blocking Example Two Inner Loops:
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { r = 0;
      for (k = 0; k < N; k = k+1)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r; };
Two Inner Loops: Read all NxN elements of z[] Read N elements of 1 row of y[] repeatedly Write N elements of 1 row of x[] Capacity Misses a function of N & Cache Size: 3 NxNx4 => no capacity misses; otherwise ... Idea: compute on BxB submatrix that fits

311 Blocking Example B called Blocking Factor
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1)
        { r = 0;
          for (k = kk; k < min(kk+B-1,N); k = k+1)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r; };
B called the Blocking Factor
Capacity misses go from 2N³ + N² to 2N³/B + N²
Conflict Misses Too?

312 Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size: Lam et al. [1991] found that a blocking factor of 24 had one-fifth the misses of a factor of 48, despite both fitting in the cache

313 Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

314 Summary 3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size 2. Reduce Misses via Higher Associativity 3. Reducing Misses via Victim Cache 4. Reducing Misses via Pseudo-Associativity 5. Reducing Misses by HW Prefetching Instr, Data 6. Reducing Misses by SW Prefetching Data 7. Reducing Misses by Compiler Optimizations Remember danger of concentrating on just one parameter when evaluating performance

315 Review: Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

316 1. Reducing Miss Penalty: Read Priority over Write on Miss
Write through with write buffers offer RAW conflicts with main memory reads on cache misses If simply wait for write buffer to empty, might increase read miss penalty (old MIPS 1000 by 50% ) Check write buffer contents before read; if no conflicts, let the memory access continue Write Back? Read miss replacing dirty block Normal: Write dirty block to memory, and then do the read Instead copy the dirty block to a write buffer, then do the read, and then do the write CPU stall less since restarts as soon as do read

317 2. Reduce Miss Penalty: Subblock Placement
Don’t have to load full block on a miss Have valid bits per subblock to indicate valid Valid Bits Subblocks

318 3. Reduce Miss Penalty: Early Restart and Critical Word First
Don’t wait for full block to be loaded before restarting CPU Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first Generally useful only in large blocks, Spatial locality a problem; tend to want next sequential word, so not clear if benefit by early restart block

319 4. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
Non-blocking cache or lockup-free cache allow data cache to continue to supply cache hits during a miss requires out-of-order execution CPU “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses Requires multiple memory banks (otherwise cannot support) Pentium Pro allows 4 outstanding memory misses

320 Value of Hit Under Miss for SPEC
(Figure: AMAT under “hit under n misses” for the transitions 0->1, 1->2, 2->64 vs. the base, for integer and floating point SPEC programs) FP programs on average: AMAT falls to 0.26 Int programs on average: AMAT falls to 0.19 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

321 5. Miss Penalty L2 Equations Definitions:
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1 Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2 AMAT = Hit TimeL1 + Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2) Definitions: Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss RateL2) Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss RateL1 x Miss RateL2) Global Miss Rate is what matters
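A small C example of these equations with assumed parameters, also showing why the global miss rate, not the local one, is what matters to the CPU.

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;  /* clocks, assumed */
    double miss_l1 = 0.04;      /* L1 misses per CPU access (assumed) */
    double local_l2 = 0.25;     /* L2 misses per L2 access = local miss rate (assumed) */

    double global_l2 = miss_l1 * local_l2;                         /* L2 misses per CPU access */
    double amat = hit_l1 + miss_l1 * (hit_l2 + local_l2 * penalty_l2);

    printf("global L2 miss rate = %.3f, AMAT = %.2f clocks\n", global_l2, amat);
    return 0;
}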

322 Comparing Local and Global Miss Rates
32 KByte 1st level cache; increasing 2nd level cache Global miss rate close to single-level cache rate provided L2 >> L1 Don’t use local miss rate L2 not tied to CPU clock cycle! Consider cost & A.M.A.T. Generally fast hit times and fewer misses; since hits are few, target miss reduction

323 Reducing Misses: Which apply to L2 Cache?
Reducing Miss Rate 1. Reduce Misses via Larger Block Size 2. Reduce Conflict Misses via Higher Associativity 3. Reducing Conflict Misses via Victim Cache 4. Reducing Conflict Misses via Pseudo-Associativity 5. Reducing Misses by HW Prefetching Instr, Data 6. Reducing Misses by SW Prefetching Data 7. Reducing Capacity/Conf. Misses by Compiler Optimizations

324 L2 cache block size & A.M.A.T.
32KB L1, 8 byte path to memory

325 Reducing Miss Penalty Summary
Five techniques Read priority over write on miss Subblock placement Early Restart and Critical Word First on miss Non-blocking Caches (Hit under Miss, Miss under Miss) Second Level Cache Can be applied recursively to Multilevel Caches Danger is that time to DRAM will grow with multiple levels in between First attempts at L2 caches can make things worse, since increased worst case is worse

326 What is the Impact of What You’ve Learned About Caches?
: Speed = ƒ(no. operations) 1990 Pipelined Execution & Fast Clock Rate Out-of-Order execution Superscalar Instruction Issue 1998: Speed = ƒ(non-cached memory accesses) Superscalar, Out-of-Order machines hide L1 data cache miss (≈5 clocks) but not L2 cache miss (≈50 clocks)?

327 Cache Optimization Summary
Technique / miss rate (MR) / miss penalty (MP) / hit time (HT) / Complexity
Miss rate techniques: Larger Block Size (MR +, MP –, complexity 0); Higher Associativity (MR +, HT –, complexity 1); Victim Caches; Pseudo-Associative Caches; HW Prefetching of Instr/Data; Compiler Controlled Prefetching; Compiler Reduce Misses (MR +, complexity 0)
Miss penalty techniques: Priority to Read Misses; Subblock Placement; Early Restart & Critical Word 1st; Non-Blocking Caches; Second Level Caches (MP +, complexity 2)

328 Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

329 1. Fast Hit times via Small and Simple Caches
Why Alpha has 8KB Instruction and 8KB data cache + 96KB second level cache? Small data cache and clock rate Direct Mapped, on chip

330 2. Fast hits by Avoiding Address Translation
Send virtual address to cache? Called Virtually Addressed Cache or just Virtual Cache vs. Physical Cache Every time process is switched logically must flush the cache; otherwise get false hits Cost is time to flush + “compulsory” misses from empty cache Dealing with aliases (sometimes called synonyms); Two different virtual addresses map to same physical address I/O must interact with cache, so need virtual address Solution to aliases SW guarantees all aliases share last n bits for a n-bit indexed direct-mapped cache; called page coloring Solution to cache flush Add process identifier tag that identifies process as well as address within process: can’t get a hit if wrong process

331 Virtually Addressed Caches
(Figure: three organizations. Conventional: the CPU sends the VA to the TB and the cache is accessed with the PA. Virtually Addressed Cache: the cache is indexed and tagged with the VA and translates only on a miss, which raises the synonym problem. Overlapped: cache access proceeds in parallel with VA translation, which requires the cache index to remain invariant across translation.)

332 2. Fast Cache Hits by Avoiding Translation: Process ID impact
Black is uniprocess Light Gray is multiprocess when flush cache Dark Gray is multiprocess when use Process ID tag Y axis: Miss Rates up to 20% X axis: Cache size from 2 KB to 1024 KB

333 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag Limits cache to page size: what if want bigger caches and use same trick? Higher associativity moves barrier to right; page coloring

334 3. Fast Hit Times Via Pipelined Writes
Pipeline Tag Check and Update Cache as separate stages; current write tag check & previous write cache update Only STORES in the pipeline; empty during a miss Example: Store r2, (r1) checks r1; Add and Sub do nothing in the cache; Store r4, (r3) performs M[r1]<-r2 and checks r3 In shade is “Delayed Write Buffer”; must be checked on reads; either complete write or read from buffer

335 4. Fast Writes on Misses Via Small Subblocks
If most writes are 1 word, subblock size is 1 word, & write through then always write subblock & tag immediately Tag match and valid bit already set: Writing the block was proper, & nothing lost by setting valid bit on again. Tag match and valid bit not set: The tag match means that this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on. Tag mismatch: This is a miss and will modify the data portion of the block. Since write-through cache, no harm was done; memory still has an up-to-date copy of the old value. Only the tag to the address of the write and the valid bits of the other subblock need be changed because the valid bit for this subblock has already been set Doesn’t work with write back due to last case

336 Cache Optimization Summary
Technique / miss rate (MR) / miss penalty (MP) / hit time (HT) / Complexity
Miss rate techniques: Larger Block Size (MR +, MP –, complexity 0); Higher Associativity (MR +, HT –, complexity 1); Victim Caches; Pseudo-Associative Caches; HW Prefetching of Instr/Data; Compiler Controlled Prefetching; Compiler Reduce Misses (MR +, complexity 0)
Miss penalty techniques: Priority to Read Misses; Subblock Placement; Early Restart & Critical Word 1st; Non-Blocking Caches; Second Level Caches (MP +, complexity 2)
Hit time techniques: Small & Simple Caches (MR –, HT +, complexity 0); Avoiding Address Translation; Pipelining Writes (HT +, complexity 1)

337 What is the Impact of What You’ve Learned About Caches?
: Speed = ƒ(no. operations) 1990 Pipelined Execution & Fast Clock Rate Out-of-Order execution Superscalar Instruction Issue 1998: Speed = ƒ(non-cached memory accesses) What does this mean for Compilers?,Operating Systems?, Algorithms? Data Structures?

338 DRAM

339 DRAM logical organization (4 Mbit)
(Figure: a 2,048 x 2,048 memory array; the row address (A0…A10) drives a word line into the storage cells, sense amps & I/O feed D/Q, and a column decoder selects the column. The number of row/column address bits is the square root of the bits per RAS/CAS.)

340 DRAM physical organization (4 Mbit)
(Figure: the 4 Mbit DRAM is split into 4 blocks (Block 0 … Block 3), each with its own 9:512 row decoder; the row address and column address are distributed to all blocks, and D/Q connect through 8 I/Os on each side, 2 I/Os per block.)

341 DRAM History DRAMs: capacity +60%/yr, cost –30%/yr
2.5X cells/area, 1.5X die size in ≈3 years ‘98 DRAM fab line costs $2B DRAM only: density, leakage v. speed Rely on increasing no. of computers & memory per computer (60% market) SIMM or DIMM is replaceable unit => computers use any generation DRAM Commodity, second source industry => high volume, low profit, conservative Little organization innovation in 20 years Order of importance: 1) Cost/bit 2) Capacity First RAMBUS: 10X BW, +30% cost => little impact

342 DRAM Future: 1 Gbit DRAM (ISSCC ‘96; production ‘02?)
Mitsubishi / Samsung:
Blocks: 512 x 2 Mbit / x 1 Mbit
Clock: 200 MHz / 250 MHz
Data Pins: 64 / 16
Die Size: 24 x 24 mm / 31 x 21 mm (sizes will be much smaller in production)
Metal Layers: 3 / 4
Technology: 0.15 micron / micron
Wish could do this for Microprocessors!

343 Main Memory Performance
Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits) Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512) Interleaved: CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

344 Main Memory Performance
Timing model (word size is 32 bits): 1 to send address, 6 access time, 1 to send data Cache Block is 4 words Simple M.P. = 4 x (1+6+1) = 32 Wide M.P. = 1+6+1 = 8 Interleaved M.P. = 1 + 6 + 4x1 = 11

345 Independent Memory Banks
Memory banks for independent accesses vs. faster sequential accesses Multiprocessor I/O CPU with Hit under n Misses, Non-blocking Cache Superbank: all memory active on one block transfer (or Bank) Bank: portion within a superbank that is word interleaved (or Subbank)

346 Independent Memory Banks
How many banks? Number of banks ≥ number of clocks to access a word in a bank, for sequential accesses; otherwise will return to original bank before it has next word ready (like in vector case) Increasing DRAM => fewer chips => harder to have banks

347 DRAMs per PC over Time DRAM Generation ‘86 ‘89 ‘92 ‘96 ‘99 ‘02
(Table: DRAM generations ’86 1 Mb, ’89 4 Mb, ’92 16 Mb, ’96 64 Mb, ’99 256 Mb, ’02 1 Gb vs. minimum memory size, 4 MB up to 256 MB; the number of DRAMs needed per PC shrinks with each generation, from 32 down toward 1.)

348 Avoiding Bank Conflicts
Lots of banks int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses SW: loop interchange or declaring array not power of 2 (“array padding”) HW: Prime number of banks bank number = address mod number of banks address within bank = address / number of words in bank modulo & divide per memory access with prime no. banks? address within bank = address mod number of words in bank bank number? easy if 2^N words per bank
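A C sketch of the SW fix named above, "array padding" (the padded width is an assumption chosen for illustration; loop interchange would work as well): making the row length not a multiple of the bank count spreads a column walk across the banks.

#define ROWS 256
#define COLS 512

static int x[ROWS][COLS + 1];   /* pad each row by one word: 513 is not a multiple of 128 banks */

void scale_columns(void) {
    for (int j = 0; j < COLS; j = j + 1)        /* same column-major walk as the slide */
        for (int i = 0; i < ROWS; i = i + 1)
            x[i][j] = 2 * x[i][j];
}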

349 Fast Memory Systems: DRAM specific
Multiple CAS accesses: several names (page mode) Extended Data Out (EDO): 30% faster in page mode New DRAMs to address gap; what will they cost, will they survive? RAMBUS: startup company; reinvent DRAM interface Each Chip a module vs. slice of memory Short bus between CPU and chips Does own refresh Variable amount of data returned 1 byte / 2 ns (500 MB/s per chip) Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock ( MHz) Intel claims RAMBUS Direct (16 b wide) is future PC memory Niche memory or main memory? e.g., Video RAM for frame buffers, DRAM + fast serial output

350 DRAM Latency >> BW
Applications want lower DRAM latency; RAMBUS and Synchronous DRAM increase BW but have higher latency
BW from DRAM to the cache is then the latency plus the transfer time of the block
EDO: 30% higher DRAM BW => less than 5% improvement on PC benchmarks

351 Potential DRAM Crossroads?
After 20 years of 4X every 3 years, running into wall? (64Mb - 1 Gb) How can keep $1B fab lines full if buy fewer DRAMs per computer? Cost/bit –30%/yr if stop 4X/3 yr? What will happen to $40B/yr DRAM industry?

352 Main Memory Summary Wider Memory
Interleaved Memory: for sequential or independent accesses Avoiding bank conflicts: SW & HW DRAM specific optimizations: page mode & Specialty DRAM DRAM future less rosy?

353 Cache Cross Cutting Issues
Superscalar CPU & Number Cache Ports must match: number memory accesses/cycle? Speculative Execution and non-blocking caches/TLB Parallel Execution vs. Cache locality Want far separation to find independent operations vs. want reuse of data accesses to avoid misses I/O and Caches => multiple copies of data Consistency

354 Victim Buffer

355 Alpha Memory Performance: Miss Rates of SPEC92
Cache sizes: 8 KB I$, 8 KB D$, 2 MB L2
Miss rates vary widely across the SPEC92 programs: I$ 1%-6%, D$ 13%-32%, L2 0.3%-10%

356 Alpha CPI Components Instruction stall: branch mispredict (green);
Data cache (blue); Instruction cache (yellow); L2$ (pink) Other: compute + reg conflicts, structural conflicts

357 Pitfall: Predicting Cache Performance from Different Prog
Pitfall: predicting the cache performance of one program from another (different ISA, compiler, ...)
4 KB data cache miss rate: 8%, 12%, or 28% depending on the program
1 KB instruction cache miss rate: 0%, 3%, or 10%
Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%; why 2X for Alpha vs. MIPS?
(Chart curves: D$ and I$ miss rates for gcc, espresso, and Tomcatv)

358 Pitfall: Simulating Too Small an Address Trace
I$ = 4 KB, B = 16 B; D$ = 4 KB, B = 16 B; L2 = 512 KB, B = 128 B; MP = 12, 200

359 Main Memory Summary Wider Memory
Interleaved Memory: for sequential or independent accesses Avoiding bank conflicts: SW & HW DRAM specific optimizations: page mode & Specialty DRAM DRAM future less rosy?

360 Cache Optimization Summary
Technique (MR = miss rate, MP = miss penalty, HT = hit time; 0-3 = relative complexity):
Larger Block Size + - 0
Higher Associativity + - 1
Victim Caches
Pseudo-Associative Caches
HW Prefetching of Instr/Data
Compiler Controlled Prefetching
Compiler Reduce Misses + 0
Priority to Read Misses
Subblock Placement
Early Restart & Critical Word 1st
Non-Blocking Caches
Second Level Caches + 2
Small & Simple Caches - + 0
Avoiding Address Translation
Pipelining Writes + 1

361 Intelligent Memory

362 IRAM Vision Statement Proc L o g i c f a b
Microprocessor & DRAM on a single chip:
improve on-chip memory latency 5-10X and bandwidth substantially
improve energy efficiency 2X-4X (no off-chip bus)
serial I/O 5-10X vs. buses
smaller board area/volume
adjustable memory size/width
Fab lines for logic and for memory each cost billions, so a single chip means either putting the processor in a DRAM fab or the memory in a logic fab

363 App #1: Intelligent PDA ( 2003?)
Pilot PDA (todo,calendar, calculator, addresses,...) + Gameboy (Tetris, ...) + Nikon Coolpix (camera) + Cell Phone, Pager, GPS, tape recorder, TV remote, am/fm radio, garage door opener, ... + Wireless data (WWW) + Speech, vision recog. + Speech output for conversations Speech control of all devices Vision to see surroundings, scan documents, read bar codes, measure room

364 App #2: “Intelligent Disk”(IDISK): Scalable Decision Support?
1 IRAM/disk + crossbar + fast serial links vs. a conventional SMP
Move the function to the data vs. the data to the CPU (scan, sort, join, ...)
Network latency = f(SW overhead), not link distance
Avoids the I/O bus bottleneck of an SMP
Cheaper, faster, more scalable (~1/3 the cost, 3X the performance)
(Figure: IRAM nodes connected through crossbars; 75.0 GB/s and 6.0 GB/s links shown)
Notes: How does TPC-D scale with dataset size? Compare the NCR 5100M 20-node system (Pentium CPUs per node), March 28, 1997, at 100 GB, 300 GB, and 1000 GB. Of the 19 queries, all but 2 go up linearly with database size (3-5 vs. 300, 7-15 vs. 1000); e.g., interval time ratios 300/100 = 3.35, 1000/100 = 9.98, 1000/300 = 2.97. How much memory per IBM SP2 node? 100 GB: 12 processors with 24 GB; 300 GB: 128 thin nodes with 32 GB total; 256 MB/node (2 boards/processor). TPC-D is business analysis vs. business operation: 17 read-only queries, results in queries per Gigabyte-Hour. The Scale Factor (SF) multiplies each portion of the data from 10 to 10000; SF 10 is about 10 GB; indices + temp tables increase size 3X-5X.

365 V-IRAM-2: 0.13 µm, Fast Logic, 1GHz 16 GFLOPS(64b)/64 GOPS(16b)/128MB
+ 8 x 64 or 16 x 32 32 x 16 2-way Superscalar x Vector Instruction Processor ÷ Queue I/O Load/Store I/O 1Gbit technology Put in perspective 10X of Cray T90 today 8K I cache 8K D cache Vector Registers 8 x 64 8 x 64 Serial I/O Memory Crossbar Switch M M M M M M M M M M M M M M M M M M M M I/O 8 x 64 8 x 64 8 x 64 8 x 64 8 x 64 I/O M M M M M M M M M M

366 Tentative VIRAM-1 Floorplan
0.18 µm DRAM 32 MB in 16 banks x 256b, 128 subbanks 0.25 µm, 5 Metal Logic ­ 200 MHz MIPS, K I$, 16K D$ ­ MHz FP/int. vector units die: ­ 16x16 mm xtors: ­ 270M power: ­2 Watts Memory (128 Mbits / 16 MBytes) 4 Vector Pipes/Lanes C P U +$ Ring- based Switch Floor plan showing memory in purple Crossbar in blue (need to match vector unit, not maximum memory system) vector units in pink CPU in orange I/O in yellow How to spend 1B transistors vs. all CPU! VFU size based on looking at 3 MPUs in 0.25 micron technology; MIPS mm2 for 1FPU (Mul,Add, misc) IBM Power3 48 mm2 for 2 FPUs (2 mul/add units) HAL SPARC III 40 mm2 for 2 FPUs (2 multiple, add units) I/O Memory (128 Mbits / 16 MBytes)

367 Active Pages: up to 1000X speedup!

368 Active-Page Interface
Memory Operations write(address,data) and read(address) Active-Page functions (AP_funcs) Allocation AP_alloc(group_id, AP_funcs, vaddr) Shared Variables Synchronization Variables

369 Partitioning an Application: Processor-centric vs. Memory-centric

370 Processor-Centric Applications
Matrix: matrix multiply for Simplex and finite element methods
MPEG-MMX: MPEG decoder using MMX instructions

371 Memory-Centric Applications
Array: C++ standard template library class
Database: unindexed search of an address database
Median: median filter for images
Dynamic Programming: protein sequence matching

372 Reconfigurable Architecture DRAM (RADram)
Bit-Cell Array Row Select Sense Amps Reconfigurable Logic

373 RADram vs IRAM Higher yield through redundancy
IRAMs will fabricate at processor costs RADrams will be closer to DRAMs RADram exploits parallelism RADram can be application-specific RADram supports commodity processors

374 RADram Technology 1 Gbit DRAM in 2001 (SIA Roadmap)
50% for logic = 32M transistors 32K LEs (1K transistors / LE) 512Mbit = 128 x 512K superpages 256 LEs / superpage

375 RADram Logic

376 RADram Parameters

377 Speedup vs Data Size

378 ActiveOS Minimal OS on SimpleScalar Mixed Workload Virtual memory
Process scheduling Mixed Workload conventional: perl, gcc, gzip Active Page: database, matrix, DNA

379 Process Time

380 Wall-Clock Time

381 Algorithmic Scaling

382 Status OS - larger workloads, scheduling for power MERL Prototype
Hybrid Page Processor Power Estimation System Integration / Cache Coherence Driving Applications - Java GC Parallelizing/Partitioning Compiler

383 Other Projects Impulse (Utah) - smart memory controller
Smart Memory (Stanford) - general tech RAW (MIT) - ILP vs speculation Active Disks (CMU, UMD, UCSB) Active Networks (Penn, MIT)

384 Lecture 5: Vector Processors and DSPs
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

385 Review Speculation: Out-of-order execution, In-order commit (reorder buffer+rename registers)=>precise exceptions Branch Prediction Branch History Table: 2 bits for loop accuracy Recently executed branches correlated with next branch? Branch Target Buffer: include branch address & prediction Predicated Execution can reduce number of branches, number of mispredicted branches Software Pipelining Symbolic loop unrolling (instructions from different iterations) to optimize pipeline with little code expansion, little overhead Superscalar and VLIW(“EPIC”): CPI < 1 (IPC > 1) Dynamic issue vs. Static issue More instructions issue at same time => larger hazard penalty # independent instructions = # functional units X latency

386 Review: Theoretical Limits to ILP? (Figure 4.48, Page 332)
Perfect disambiguation (HW), 1K-entry selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as fit in the window
(Chart: IPC for the FP and integer programs as the instruction window shrinks from infinite to 256, 128, 64, 32, 16, 8, and 4 entries)

387 Review: Instruction Level Parallelism
High speed execution based on instruction level parallelism (ilp): potential of short instruction sequences to execute in parallel High-speed microprocessors exploit ILP by: 1) pipelined execution: overlap instructions 2) superscalar execution: issue and execute multiple instructions per clock cycle 3) Out-of-order execution (commit in-order) Memory accesses for high-speed microprocessor? Data Cache, possibly multiported, multiple levels

388 Problems with conventional approach
Limits to conventional exploitation of ILP: 1) pipelined clock rate: at some point, each increase in clock rate has corresponding CPI increase (branches, other hazards) 2) instruction fetch and decode: at some point, its hard to fetch and decode more instructions per clock cycle 3) cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality

389 Alternative Model: Vector Processing
Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
SCALAR (1 operation): add r3, r1, r2 adds the two scalars r1 and r2
VECTOR (N operations): add.vv v3, v1, v2 adds vectors v1 and v2 element by element, over the vector length

390 Properties of Vector Processors
Each result is independent of previous results => long pipeline, compiler ensures no dependencies => high clock rate
Vector instructions access memory with a known pattern => highly interleaved memory => amortize memory latency over ~64 elements => no (data) caches required! (Do use an instruction cache)
Reduces branches and branch problems in pipelines
A single vector instruction implies lots of work (~ a whole loop) => fewer instruction fetches

391 Operation & Instruction Count: RISC v. Vector Processor (from F
Operation & Instruction Count: RISC v. Vector Processor (from F. Quintana, U. Barcelona)
Spec92fp programs measured (operations and instructions, in millions, for RISC vs. vector): swim, hydro2d, nasa, su2cor, tomcatv, wave, mdljdp
Bottom line: vector reduces operations by 1.2X and instructions by 20X

392 Styles of Vector Architectures
memory-memory vector processors: all vector operations are memory to memory vector-register processors: all vector operations between vector registers (except load and store) Vector equivalent of load-store architectures Includes all vector machines since late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC We assume vector-register for rest of lectures

393 Components of Vector Processor
Vector Register: fixed length bank holding a single vector has at least 2 read and 1 write ports typically 8-32 vector registers, each holding bit elements Vector Functional Units (FUs): fully pipelined, start new operation every clock typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of same unit Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector; may have multiple LSUs Scalar registers: single element for FP scalar or address Cross-bar to connect FUs , LSUs, registers

394 “DLXV” Vector Instructions
Instr.  Operands   Operation                       Comment
ADDV    V1,V2,V3   V1 = V2 + V3                    vector + vector
ADDSV   V1,F0,V2   V1 = F0 + V2                    scalar + vector
MULTV   V1,V2,V3   V1 = V2 x V3                    vector x vector
MULSV   V1,F0,V2   V1 = F0 x V2                    scalar x vector
LV      V1,R1      V1 = M[R1..R1+63]               load, stride = 1
LVWS    V1,R1,R2   V1 = M[R1..R1+63*R2]            load, stride = R2
LVI     V1,R1,V2   V1 = M[R1+V2(i), i = 0..63]     indirect ("gather")
CeqV    VM,V1,V2   VMASK(i) = (V1(i) == V2(i))?    compare, set mask
MOV     VLR,R1     Vec. Len. Reg. = R1             set vector length
MOV     VM,R1      Vec. Mask = R1                  set vector mask

395 Memory operations
Load/store operations move groups of data between registers and memory
Three types of addressing:
Unit stride (fastest)
Non-unit (constant) stride
Indexed (gather-scatter): the vector equivalent of register indirect; good for sparse arrays of data; increases the number of programs that vectorize

396 DAXPY (Y = a * X + Y) Assuming vectors X, Y are length 64
Vector (DLXV):
LD    F0,a        ;load scalar a
LV    V1,Rx       ;load vector X
MULTS V2,F0,V1    ;vector-scalar multiply
LV    V3,Ry       ;load vector Y
ADDV  V4,V2,V3    ;add
SV    Ry,V4       ;store the result

Scalar (DLX):
      LD    F0,a
      ADDI  R4,Rx,#512   ;last address to load
loop: LD    F2,0(Rx)     ;load X(i)
      MULTD F2,F0,F2     ;a*X(i)
      LD    F4,0(Ry)     ;load Y(i)
      ADDD  F4,F2,F4     ;a*X(i) + Y(i)
      SD    F4,0(Ry)     ;store into Y(i)
      ADDI  Rx,Rx,#8     ;increment index to X
      ADDI  Ry,Ry,#8     ;increment index to Y
      SUB   R20,R4,Rx    ;compute bound
      BNZ   R20,loop     ;check if done

578 (2 + 9*64) scalar operations vs. 321 (1 + 5*64) vector operations (1.8X fewer)
578 scalar instructions vs. 6 vector instructions (96X fewer)
64-element vector operations, no loop overhead, and also 64X fewer pipeline hazards
(a plain C version follows below)
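For reference, the same DAXPY loop in plain C (a sketch, not from the slides); this is the loop a vectorizing compiler maps onto the LV/MULTS/ADDV/SV sequence above:

/* Y = a*X + Y over 64 double-precision elements, written as a C loop. */
void daxpy64(double a, const double *x, double *y) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}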

397 Example Vector Machines
Machine Year Clock Regs Elements FUs LSUs Cray MHz Cray XMP MHz L, 1 S Cray YMP MHz L, 1 S Cray C MHz Cray T MHz Conv. C MHz Conv. C MHz Fuj. VP MHz Fuj. VP MHz NEC SX/ MHz 8+8K 256+var 16 8 NEC SX/ MHz 8+8K 256+var 16 8 Cray 1; fastest scalar computer + 1st commercially successful vector computer, offered another 10X 6600 1st scoreboard Cray XMP: 3 LSUs, Multiprocessor 4 way (not by Cray) => YMP, C-90, T-90; 2X processors, 1.5X clock Cray 2 went to DRAM to get more memory, not so great Like parallel teams as Intel (486, PPro, Pentium, next one) Japan Fujitsu, vary number of registers elements (8x1024 or 32x256) NEC, 8x K of varying elements

398 Vector Linpack Performance (MFLOPS)
Machine Year Clock 100x100 1kx1k Peak(Procs) Cray MHz (1) Cray XMP MHz (4) Cray YMP MHz ,667(8) Cray C MHz ,238(16) Cray T MHz ,600(32) Conv. C MHz (1) Conv. C MHz (4) Fuj. VP MHz (1) NEC SX/ MHz (1) NEC SX/ MHz ,600(4) 6X in 20 years; 32X in 20 years; Peak is 360X speedup Weighed tons

399 Vector Surprise Use vectors for inner loop parallelism (no surprise)
One dimension of array: A[0, 0], A[0, 1], A[0, 2], ... think of machine as, say, 32 vector regs each with 64 elements 1 instruction updates 64 elements of 1 vector register and for outer loop parallelism! 1 element from each column: A[0,0], A[1,0], A[2,0], ... think of machine as 64 “virtual processors” (VPs) each with 32 scalar registers! (­ multithreaded processor) 1 instruction updates 1 scalar register in 64 VPs Hardware identical, just 2 compiler perspectives

400 Virtual Processor Vector Model
Vector operations are SIMD (single instruction, multiple data) operations
Each element is computed by a virtual processor (VP)
The number of VPs is given by the vector length (vector control register)

401 Vector Architectural State
Per virtual processor (VP0 .. VP$vlr-1): general-purpose vector registers vr0-vr31 ($vdw bits wide) and flag registers vf0-vf31 (1 bit each)
Plus 32 control registers vcr0-vcr31 (32 bits each)

402 Vector Implementation
Vector register file: each register is an array of elements; the size of each register determines the maximum vector length; the vector-length register determines the vector length for a particular operation
Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes")

403 Vector Terminology: 4 lanes, 2 vector functional units

404 Vector Execution Time
Time = f(vector length, data dependencies, structural hazards)
Initiation rate: the rate at which an FU consumes vector elements (= number of lanes; usually 1 or 2 on the Cray T-90)
Convoy: a set of vector instructions that can begin execution in the same clock (no structural or data hazards)
Chime: the approximate time for one vector operation
m convoys take m chimes; if each vector length is n, they take approximately m x n clock cycles (ignores overhead; a good approximation for long vectors)
1: LV V1,Rx       ;load vector X
2: MULV V2,F0,V1  ;vector-scalar multiply
   LV V3,Ry       ;load vector Y
3: ADDV V4,V2,V3  ;add
4: SV Ry,V4       ;store the result
4 convoys, 1 lane, VL = 64 => about 4 x 64 = 256 clocks (or 4 clocks per result)

405 DLXV Start-up Time
Start-up time: pipeline latency (depth of the FU pipeline); another source of overhead
Operation           Start-up penalty (from CRAY-1)
Vector load/store   12
Vector multiply      7
Vector add           6
Assume convoys don't overlap; vector length = n:
Convoy        Start     1st result   Last result
1. LV         0         12           11+n
2. MULV, LV   12+n      24+n         23+2n     (single load/store unit, so load start-up dominates; starts 1 clock after convoy 1 finishes)
3. ADDV       24+2n     30+2n        29+3n     (waits for convoy 2)
4. SV         30+3n     42+3n        41+4n     (waits for convoy 3)

406 Why startup time for each vector instruction?
Why not overlap startup time of back-to-back vector instructions? Cray machines built from many ECL chips operating at high clock rates; hard to do? Berkeley vector design (“T0”) didn’t know it wasn’t supposed to do overlap, so no startup times for functional units (except load)

407 Vector Load/Store Units & Memories
Start-up overheads are usually longer for LSUs
The memory system must sustain (# lanes x word size) per clock cycle
Many vector processors use banks (vs. simple interleaving):
1) support multiple loads/stores per cycle => multiple banks, addressed independently
2) support non-sequential accesses (see stride, below)
Note: number of memory banks > memory latency, to avoid stalls
m banks => m words per memory latency of l clocks; if m < l, there is a gap in the memory pipeline: words 0..m-1 return in clocks l..l+m-1, then nothing until the banks are ready again around clock 2l
May have 1024 banks in an SRAM memory system

408 Vector Length What to do when vector length is not exactly 64?
vector-length register (VLR) controls the length of any vector operation, including a vector load or store. (cannot be > the length of vector registers) do 10 i = 1, n 10 Y(i) = a * X(i) + Y(i) Don't know n until runtime! n > Max. Vector Length (MVL)?

409 Strip Mining Suppose Vector Length > Max. Vector Length (MVL)?
Strip mining: generation of code such that each vector operation is done for a size <= the MVL
The 1st loop iteration does the short piece (n mod MVL); the rest use VL = MVL (see the C sketch below)
      low = 1
      VL = (n mod MVL)          /*find the odd-size piece*/
      do 1 j = 0,(n / MVL)      /*outer loop*/
        do 10 i = low,low+VL-1  /*runs for length VL*/
          Y(i) = a*X(i) + Y(i)  /*main operation*/
10      continue
        low = low+VL            /*start of next vector*/
        VL = MVL                /*reset the length to max*/
1     continue
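The same strip-mining idea in C (a sketch, assuming an MVL of 64; the names are ours): the first pass handles the odd-sized piece of n mod MVL elements, every later pass handles a full strip:

#define MVL 64                       /* assumed maximum vector length */

void daxpy_stripmined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl  = n % MVL;               /* odd-sized first piece (may be 0) */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + vl; i++)   /* one vector op of length vl */
            y[i] = a * x[i] + y[i];
        low += vl;                   /* start of the next strip */
        vl = MVL;                    /* every remaining strip is full length */
    }
}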

410 Common Vector Metrics Rinf: MFLOPS rate on an infinite-length vector
an upper bound; real problems do not have unlimited vector lengths, and the start-up penalties encountered in real problems will be larger
(Rn is the MFLOPS rate for a vector of length n)
N1/2: the vector length needed to reach one-half of Rinf; a good measure of the impact of start-up
NV: the vector length needed to make vector mode faster than scalar mode; measures both start-up and the speed of scalars relative to vectors, and the quality of the connection of the scalar unit to the vector unit
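In symbols (one common way to write these definitions, not taken verbatim from the slides): if a vector operation of length n takes T_n clock cycles at clock rate f and performs ops(n) floating-point operations, then

\[ R_n = \frac{\mathrm{ops}(n)}{T_n}\, f, \qquad R_\infty = \lim_{n\to\infty} R_n, \]
\[ N_{1/2} = \min\{\, n : R_n \ge R_\infty/2 \,\}, \qquad N_v = \min\{\, n : T_n^{\mathrm{vector}} < T_n^{\mathrm{scalar}} \,\}. \]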

411 Vector Stride Suppose adjacent elements not sequential in memory
do 10 i = 1,100 do 10 j = 1,100 A(i,j) = 0.0 do 10 k = 1,100 10 A(i,j) = A(i,j)+B(i,k)*C(k,j) Either B or C accesses not adjacent (800 bytes between) stride: distance separating elements that are to be merged into a single vector (caches do unit stride) => LVWS (load vector with stride) instruction Strides => can cause bank conflicts (e.g., stride = 32 and 16 banks)

412 Compiler Vectorization on Cray XMP
Benchmark %FP %FP in vector ADM 23% 68% DYFESM 26% 95% FLO % 100% MDG 28% 27% MG3D 31% 86% OCEAN 28% 58% QCD 14% 1% SPICE 16% 7% (1% overall) TRACK 9% 23% TRFD 22% 10%

413 Vector Opt #1: Chaining Suppose: MULV V1,V2,V3
ADDV V4,V1,V5   ; separate convoy?
Chaining: the vector register (V1) is treated not as a single entity but as a group of individual registers, so pipeline forwarding can work on individual elements of a vector
Flexible chaining: allow a vector to chain to any other active vector operation => needs more read/write ports
As long as there is enough HW, chaining increases the effective convoy size
Unchained: MULTV (7 + 64) then ADDV (6 + 64): total = 141 clocks
Chained: ADDV starts as soon as MULTV's first result is ready: 7 + 6 + 64 = 77 clocks

414 Vector Opt #2: Conditional Execution
Suppose:
    do 100 i = 1, 64
      if (A(i) .ne. 0) then
        A(i) = A(i) - B(i)
      endif
100 continue
Vector-mask control takes a Boolean vector: when the vector-mask register is loaded from a vector test, vector instructions operate only on the vector elements whose corresponding entries in the vector-mask register are 1 (see the sketch below)
Still requires a clock per element even if the result is not stored; if the operation is still performed, what about divide by 0?
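In C terms, a scalar sketch of what the compiler generates (illustrative only, not from the slides): a compare builds the mask, then the operation executes under it; under the model above every element still costs a cycle whether or not its mask bit is 1:

/* Scalar picture of the vector-mask sequence for
   if (A(i) .ne. 0) A(i) = A(i) - B(i), over 64 elements. */
void masked_sub(double *a, const double *b) {
    int mask[64];
    for (int i = 0; i < 64; i++)      /* CeqV-style compare builds the mask */
        mask[i] = (a[i] != 0.0);
    for (int i = 0; i < 64; i++)      /* subtract executes under the mask */
        if (mask[i])
            a[i] = a[i] - b[i];
}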

415 Vector Opt #3: Sparse Matrices
Suppose:
    do 100 i = 1, n
100   A(K(i)) = A(K(i)) + C(M(i))
gather (LVI): takes an index vector and fetches the vector whose elements are at the addresses given by adding a base address to the offsets in the index vector => a nonsparse (dense) vector in a vector register
After these elements are operated on in dense form, the sparse vector can be stored back in expanded form by a scatter store (SVI), using the same index vector (see the sketch below)
Can't be done by the compiler alone, since the compiler can't know that the K(i) are distinct and hence that there are no dependencies; enabled by a compiler directive
Use CVI to create an index vector 0, 1xm, 2xm, ..., 63xm; CVI gets used under mask
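A scalar C sketch of the gather / dense-operate / scatter sequence (illustrative only; on DLXV the LVI and SVI instructions do the indexed loads and stores in hardware, 64 elements at a time):

/* A(K(i)) = A(K(i)) + C(M(i)) via explicit gather and scatter,
   64 elements per strip (indices assumed 0-based here). */
void sparse_update(int n, double *a, const double *c,
                   const int *k, const int *m) {
    double va[64], vc[64];                  /* stand-ins for vector registers */
    for (int s = 0; s < n; s += 64) {
        int vl = (n - s < 64) ? (n - s) : 64;
        for (int i = 0; i < vl; i++) {      /* gather (LVI) */
            va[i] = a[k[s + i]];
            vc[i] = c[m[s + i]];
        }
        for (int i = 0; i < vl; i++)        /* dense vector add */
            va[i] += vc[i];
        for (int i = 0; i < vl; i++)        /* scatter (SVI) */
            a[k[s + i]] = va[i];
    }
}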

416 Sparse Matrix Example Cache (1993) vs. Vector (1988)
IBM RS6000 Cray YMP Clock 72 MHz 167 MHz Cache 256 KB 0.25 KB Linpack 140 MFLOPS 160 (1.1) Sparse Matrix 17 MFLOPS 125 (7.3) (Cholesky Blocked ) Memory bandwidth is the key: Cache: 1 value per cache block (32B to 64B) Vector: 1 value per element (4B)

417 Limited to scientific computing?
Applications: limited to scientific computing?
Multimedia processing (compression, graphics, audio synthesis, image processing)
Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
Lossy compression (JPEG, MPEG video and audio)
Lossless compression (zero removal, RLE, differencing, LZW)
Cryptography (RSA, DES/IDEA, SHA/MD5)
Speech and handwriting recognition
Operating systems/networking (memcpy, memset, parity, checksum)
Databases (hash/join, data mining, image/video serving)
Language run-time support (stdlib, garbage collection)
even SPECint95

418 Vector for Multimedia? +
Intel MMX: 57 new 80x86 instructions (1st since the 386)
similar to the multimedia extensions of the Intel i860, Motorola, HP PA-7100LC, and UltraSPARC
3 data types: 8 8-bit, 4 16-bit, or 2 32-bit values packed in 64 bits
reuses the 8 FP registers (FP and MMX cannot mix)
~ short vectors: load, add, store 8 8-bit operands
Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
used in drivers or added to library routines; no compiler support

419 MMX Instructions Move 32b, 64b
Add, Subtract in parallel: 8 8b, 4 16b, 2 32b; optional signed/unsigned saturate (set to max) on overflow
Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
Multiply, Multiply-Add in parallel: 4 16b
Compare =, > in parallel: 8 8b, 4 16b, 2 32b; sets each field to 0s (false) or 1s (true); removes branches
Pack/Unpack: convert 32b <-> 16b, 16b <-> 8b; Pack saturates (sets to max) if the number is too large

420 Vectors and Variable Data Width
Programmer thinks in terms of vectors of data of some width (8, 16, 32, or 64 bits)
Good for multimedia; more elegant than MMX-style extensions
Don't have to worry about how the data is stored in hardware: no need for explicit pack/unpack operations; just think of more virtual processors operating on narrower data
Expand the maximum vector length as data width decreases: 64 x 64-bit, 128 x 32-bit, 256 x 16-bit, 512 x 8-bit

421 Mediaprocessing: Vectorizable? Vector Lengths?
Kernel                               Vector length
Matrix transpose/multiply            # vertices at once
DCT (video, communication)           image width
FFT (audio)
Motion estimation (video)            image width, image width/16
Gamma correction (video)             image width
Haar transform (media mining)        image width
Median filter (image processing)     image width
Separable convolution (img. proc.)   image width
(from Pradeep Dubey, IBM)

422 Vector Pitfalls Pitfall: Concentrating on peak performance and ignoring start-up overhead: NV (length faster than scalar) > 100! Pitfall: Increasing vector performance, without comparable increases in scalar performance (Amdahl's Law) failure of Cray competitor from his former company Pitfall: Good processor vector performance without providing good memory bandwidth MMX?

423 Vector Advantages Easy to get high performance; N operations:
are independent
use the same functional unit
access disjoint registers
access registers in the same order as previous instructions
access contiguous memory words or follow a known pattern
can exploit large memory bandwidth and hide memory latency (and any other latency)
Scalable (get higher performance as more HW resources become available)
Compact: describe N operations with 1 short instruction (vs. VLIW)
Predictable (real-time) performance vs. statistical performance (cache)
Multimedia ready: choose N x 64b, 2N x 32b, 4N x 16b, 8N x 8b
Mature, developed compiler technology
Vector disadvantage: out of fashion
Notes: Why MPP? Best potential performance, but few successes. Vectors operate on vectors of registers; it's easier to vectorize than to parallelize; scales well with more hardware and a slower clock rate. Crazy research.

424 Vector Summary
Alternate model accommodates long memory latency and doesn't rely on caches the way out-of-order superscalar/VLIW designs do
Much easier for hardware: more powerful instructions, more predictable memory accesses, fewer hazards, fewer branches, fewer mispredicted branches, ...
What % of computation is vectorizable?
Is vector a good match to new apps such as multimedia and DSP?
425 More Vector Processing
Hard vector example Vector vs. Superscalar Krste Asanovic’s dissertation: designing a vector processor issues Vector vs. Superscalar: area, energy Real-time vs. Average time

426 Vector Example with dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i<m; i++) {
  for (j=1; j<n; j++) {
    sum = 0;
    for (t=1; t<k; t++)
      sum += a[i][t] * b[t][j];
    c[i][j] = sum;
  }
}

427 Straightforward Solution
How do we sum all the elements of a vector, other than grabbing one element at a time from a vector register and putting it in the scalar unit?
In T0, the vector extract instruction vext.v shifts elements within a vector register
This kind of operation is called a "reduction"

428 Novel Matrix Multiply Solution
You don't need to do reductions for matrix multiply
You can calculate multiple independent sums within one vector register
You can vectorize the j loop to perform 32 dot products at the same time
Or you can think of each of 32 virtual processors doing one of the dot products (assume the maximum vector length is 32)
Shown in C source code below, but you can imagine the corresponding assembly vector instructions

429 Original Vector Example with dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i<m; i++) {
  for (j=1; j<n; j++) {
    sum = 0;
    for (t=1; t<k; t++)
      sum += a[i][t] * b[t][j];
    c[i][j] = sum;
  }
}

430 Optimized Vector Example
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i=1; i<m; i++) {
  for (j=1; j<n; j+=32) {                    /* Step j 32 at a time. */
    sum[0:31] = 0;                           /* Initialize a vector register to zeros. */
    for (t=1; t<k; t++) {
      a_scalar = a[i][t];                    /* Get scalar from a matrix. */
      b_vector[0:31] = b[t][j:j+31];         /* Get vector from b matrix. */
      prod[0:31] = b_vector[0:31]*a_scalar;  /* Do a vector-scalar multiply. */

431 Optimized Vector Example cont’d
      sum[0:31] += prod[0:31];               /* Vector-vector add into results. */
    }
    c[i][j:j+31] = sum[0:31];                /* Unit-stride store of a vector of results. */
  }
}
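For comparison, the same computation written as plain C with the 32-element vector slices expanded into explicit element loops (a sketch, not the slides' code; n is assumed to be a multiple of 32, so no strip mining of a remainder is shown). Each inner element loop corresponds to one vector instruction of length 32:

/* c = a * b with j stepped 32 at a time. */
void matmul_vec32(int m, int n, int k,
                  double a[m][k], double b[k][n], double c[m][n]) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j += 32) {
            double sum[32] = {0};                 /* the vector register of sums */
            for (int t = 0; t < k; t++) {
                double a_scalar = a[i][t];
                for (int e = 0; e < 32; e++)      /* vector-scalar multiply-add */
                    sum[e] += a_scalar * b[t][j + e];
            }
            for (int e = 0; e < 32; e++)          /* unit-stride store */
                c[i][j + e] = sum[e];
        }
    }
}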

432 Designing a Vector Processor
Changes to scalar How Pick Vector Length? How Pick Number of Vector Registers? Context switch overhead Exception handling Masking and Flag Instructions

433 Changes to scalar processor to run vector instructions
Decode vector instructions Send scalar registers to vector unit (vector-scalar ops) Synchronization for results back from vector register, including exceptions Things that don’t run in vector don’t have high ILP, so can make scalar CPU simple

434 How Pick Vector Length?
To keep all VFUs busy: vector length >= (# lanes) x (# VFUs) / (# vector instructions issued per cycle)

435 How Pick Vector Length? Longer good because:
1) Hide vector startup 2) lower instruction bandwidth 3) if know max length of app. is < max vector length, no strip mining overhead 4) Better spatial locality for memory access Longer not much help because: 1) diminishing returns on overhead savings as keep doubling number of elements 2) need natural app. vector length to match physical register length, or no help

436 How Pick Number of Vector Registers?
More vector registers:
1) reduces vector register "spills" (save/restore): a 20% reduction going to 16 registers for su2cor and tomcatv, a 40% reduction going to 32 registers for tomcatv, and 10%-15% for the others
2) allows aggressive scheduling of vector instructions: better compiling to take advantage of ILP
Fewer vector registers:
1) fewer bits in the instruction format (usually 3 register fields)
2) lower context-switch overhead

437 Context switch overhead
Extra dirty bit per processor If vector registers not written, don’t need to save on context switch Extra valid bit per vector register, cleared on process start Don’t need to restore on context switch until needed

438 Exception handling: External
If external exception, can just put pseudo-op into pipeline and wait for all vector ops to complete Alternatively, can wait for scalar unit to complete and begin working on exception code assuming that vector unit will not cause exception and interrupt code does not use vector unit

439 Exception handling: Arithmetic
Arithmetic traps harder Precise interrupts => large performance loss Alternative model: arithmetic exceptions set vector flag registers, 1 flag bit per element Software inserts trap barrier instructions from SW to check the flag bits as needed

440 Exception handling: Page Faults
Page Faults must be precise Instruction Page Faults not a problem Data Page Faults harder Option 1: Save/restore internal vector unit state Freeze pipeline, dump vector state perform needed ops Restore state and continue vector pipeline

441 Exception handling: Page Faults
Option 2: expand the memory pipeline to check addresses before sending them to memory
+ a memory buffer between the address check and the registers
multiple queues transfer from the memory buffer to the registers; check the last address in the queues before loading the 1st element from the buffer
A Pre-Address Instruction Queue (PAIQ) sends requests to the TLB and memory while, in parallel, instructions go to an Address Check Instruction Queue (ACIQ)
When an instruction passes its checks, it moves to a Committed Instruction Queue (CIQ), ready when the data returns
On a page fault, only the instructions in the PAIQ and ACIQ need to be saved

442 Masking and Flag Instructions
Flag have multiple uses (conditional, arithmetic exceptions) Downside is: 1) extra bits in instruction to specify the flag register 2) extra interlock early in the pipeline for RAW hazards on Flag registers

443 Vectors Are Inexpensive
Scalar: N ops per cycle => O(N^2) circuitry
  HP PA-8000: 4-way issue; the reorder buffer alone is 850K transistors, including 6,720 5-bit register number comparators
Vector: N ops per cycle => O(N + eN^2) circuitry
  T0 vector microprocessor: 24 ops per cycle, 730K transistors total, only 23 5-bit register number comparators, no floating point

444 Vectors Lower Power Single-issue Scalar Vector
Vector:
one instruction fetch, decode, dispatch per vector
structured register accesses
smaller code for high performance; less power spent on instruction cache misses
can bypass the cache
one TLB lookup per group of loads or stores
moves only the necessary data across the chip boundary
Single-issue scalar:
one instruction fetch, decode, dispatch per operation
arbitrary register accesses add area and power
loop unrolling and software pipelining for high performance increase the instruction cache footprint
all data passes through the cache; wastes power if there is no temporal locality
one TLB lookup per load or store
off-chip access in whole cache lines

445 Superscalar Energy Efficiency Even Worse
Superscalar:
control logic grows quadratically with issue width
control logic consumes energy regardless of available parallelism
speculation to increase visible parallelism wastes energy
Vector:
control logic grows linearly with issue width
the vector unit switches off when not in use
vector instructions expose parallelism without speculation
software controls speculation when desired: whether to use vector mask or compress/expand for conditionals

446 New Architecture Directions
“…media processing will become the dominant force in computer arch. & microprocessor design.” “... new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.” Needs include high memory BW, high network BW, continuous media data types, real-time response, fine grain parallelism “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)

447 VLIW/Out-of-Order vs. Modest Scalar+Vector
(Where are crossover points on these curves?) VLIW/OOO Modest Scalar (Where are important applications on this axis?) Very Sequential Very Parallel

448 Cost-performance of simple vs. OOO
MIPS MPUs R5000 R k/5k Clock Rate 200 MHz 195 MHz 1.0x On-Chip Caches 32K/32K 32K/32K 1.0x Instructions/Cycle 1(+ FP) x Pipe stages x Model In-order Out-of-order --- Die Size (mm2) x without cache, TLB x Development (man yr.) x SPECint_base x

449 Summary Vector is alternative model for exploiting ILP
If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than Out-of-order machines Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations Will multimedia popularity revive vector architectures?

450 Processor Classes General Purpose - high performance
– Pentiums, Alpha's, SPARC – Used for general purpose software – Heavy weight OS - UNIX, NT – Workstations, PC's Embedded processors and processor cores – ARM, 486SX, Hitachi SH7000, NEC V800 – Single program – Lightweight, often realtime OS – DSP support – Cellular phones, consumer electronics (e. g. CD players) Microcontrollers – Extremely cost sensitive – Small word size - 8 bit common – Highest volume processors by far – Automobiles, toasters, thermostats, ... Increasing Cost Increasing Volume

451 DSP Outline Intro Sampled Data Processing and Filters Evolution of DSP
DSP vs. GP Processor

452 DSP Introduction
Digital Signal Processing: the application of mathematical operations to digitally represented signals
Signals are represented digitally as sequences of samples
Digital signals are obtained from physical signals via transducers (e.g., microphones) and analog-to-digital converters (ADC)
Digital signals are converted back to physical signals via digital-to-analog converters (DAC)
Digital Signal Processor (DSP): an electronic system that processes digital signals
453 Common DSP algorithms and applications
Applications – Instrumentation and measurement – Communications – Audio and video processing – Graphics, image enhancement, 3- D rendering – Navigation, radar, GPS – Control - robotics, machine vision, guidance Algorithms – Frequency domain filtering - FIR and IIR – Frequency- time transformations - FFT – Correlation

454 What Do DSPs Need to Do Well?
Most DSP tasks require: Repetitive numeric computations Attention to numeric fidelity High memory bandwidth, mostly via array accesses Real-time processing DSPs must perform these tasks efficiently while minimizing: Cost Power Memory use Development time

455 Who Cares? DSP is a key enabling technology for many types of electronic products DSP-intensive tasks are the performance bottleneck in many computer applications today Computational demands of DSP-intensive tasks are increasing very rapidly In many embedded applications, general-purpose microprocessors are not competitive with DSP-oriented processors today 1997 market for DSP processors: $3 billion

456 A Tale of Two Cultures General Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC) DSP evolved from Analog Signal Processors, using analog hardware to transform phyical signals (classical electrical engineering) ASP to DSP because DSP insensitive to environment (e.g., same response in snow or desert if it works at all) DSP performance identical even with variations in components; 2 analog systems behavior varies even if built with same components with 1% variation Different history and different applications led to different terms, different metrics, some new inventions Increasing markets leading to cultural warfare

457 DSP vs. General Purpose MPU
DSPs tend to be written for 1 program, not many programs. Hence OSes are much simpler, there is no virtual memory or protection, ... DSPs sometimes run hard real-time apps You must account for anything that could happen in a time slot All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval. Therefore, exceptions are BAD! DSPs have an infinite continuous data stream

458 Today’s DSP “Killer Apps”
In terms of dollar volume, the biggest markets for DSP processors today include: Digital cellular telephony Pagers and other wireless systems Modems Disk drive servo control Most demand good performance All demand low cost Many demand high energy efficiency Trends are towards better support for these (and similar) major applications.

459 Digital Signal Processing in General Purpose Microprocessors
Speech and audio compression Filtering Modulation and demodulation Error correction coding and decoding Servo control Audio processing (e.g., surround sound, noise reduction, equalization, sample rate conversion) Signaling (e.g., DTMF detection) Speech recognition Signal synthesis (e.g., music, speech synthesis)

460 Decoding DSP Lingo DSP culture has a graphical format to represent formulas. Like a flowchart for formulas, inner loops, not programs. Some seem natural:  is add, X is multiply Others are obtuse: z–1 means take variable from earlier iteration. These graphs are trivial to decode

461 Decoding DSP Lingo Uses “flowchart” notation instead of equations
Multiply is or X Add is or +  Delay/Storage is or or Delay z–1 D designed to keep computer architects without the secret decoder ring out of the DSP field?

462 FIR Filtering: A Motivating Problem
M most recent samples in the delay line (Xi) New sample moves data down delay line “Tap” is a multiply-add Each tap (M+1 taps total) nominally requires: Two data fetches Multiply Accumulate Memory write-back to update delay line Goal: 1 FIR Tap / DSP instruction cycle

463 DSP Assumptions of the World
Machines issue/execute/complete in order Machines issue 1 instruction per clock Each line of assembly code = 1 instruction Clocks per Instruction = 1.000 Floating Point is slow, expensive

464 FIR filter on (simple) General Purpose Processor
loop: lw  x0, 0(r0)
      lw  y0, 0(r1)
      mul a, x0, y0
      add y0, a, b
      sw  y0, (r2)
      inc r0
      inc r1
      inc r2
      dec ctr
      tst ctr
      jnz loop
Problems: bus/memory bandwidth bottleneck, control code overhead (see the C version below)
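The same kernel in C (a sketch, not from the slides): each output is one multiply-accumulate per tap, plus the delay-line update that the general-purpose loop above also has to pay for:

/* One output of an (ntaps)-tap FIR filter plus the delay-line update. */
double fir_step(int ntaps, const double *h, double *x_delay, double x_new) {
    for (int i = ntaps - 1; i > 0; i--)   /* shift the delay line */
        x_delay[i] = x_delay[i - 1];
    x_delay[0] = x_new;                   /* newest sample at x_delay[0] */

    double y = 0.0;
    for (int i = 0; i < ntaps; i++)       /* one multiply-accumulate per tap */
        y += h[i] * x_delay[i];
    return y;
}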

465 First Generation DSP (1982): Texas Instruments TMS32010
Instruction Memory 16-bit fixed-point “Harvard architecture” separate instruction, data memories Accumulator Specialized instruction set Load and Accumulate 390 ns Multiple-Accumulate (MAC) time; 228 ns today Processor Data Memory Datapath: Mem T-Register Multiplier P-Register ALU Accumulator

466 TMS32010 FIR Filter Code Here X4, H4, ... are direct (absolute) memory addresses: LT X4 ; Load T with x(n-4) MPY H4 ; P = H4*X4 LTD X3 ; Load T with x(n-3); x(n-4) = x(n-3); ; Acc = Acc + P MPY H3 ; P = H3*X3 LTD X2 MPY H2 ... Two instructions per tap, but requires unrolling

467 Features Common to Most DSP Processors
Data path configured for DSP Specialized instruction set Multiple memory banks and buses Specialized addressing modes Specialized execution control Specialized peripherals for DSP

468 DSP Data Path: Arithmetic
DSPs deal with numbers representing the real world => want "reals"/fractions
DSPs deal with numbers for addresses => want integers
Support "fixed point" as well as integers:
fraction (radix point to the left of the magnitude bits): -1 <= x < 1
integer (radix point to the right): -2^(N-1) <= x < 2^(N-1)

469 DSP Data Path: Precision
Word size affects precision of fixed point numbers DSPs have 16-bit, 20-bit, or 24-bit data words Floating Point DSPs cost 2X - 4X vs. fixed point, slower than fixed point DSP programmers will scale values inside code SW Libraries Separate explicit exponent “Blocked Floating Point” single exponent for a group of fractions Floating point support simplifies development

470 DSP Data Path: Overflow?
DSPs are descended from analog: what should happen to the output when you "peg" an input? (e.g., turn up the volume control knob on a stereo)
Modulo (wrap-around) arithmetic???
Better: set the result to the most positive value (2^(N-1) - 1) or the most negative value (-2^(N-1)): "saturation" (see the sketch below)
Many algorithms were developed in this model
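A minimal C sketch of 16-bit saturating addition (illustrative only, not any particular DSP's datapath):

#include <stdint.h>

/* Add two 16-bit values, clamping to the most positive or most negative
   representable value instead of wrapping around. */
static int16_t sat_add16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;   /* exact in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;   /*  2^15 - 1 */
    if (sum < INT16_MIN) return INT16_MIN;   /* -2^15     */
    return (int16_t)sum;
}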

471 DSP Data Path: Multiplier
Specialized hardware performs all key arithmetic operations in 1 cycle 50% of instructions can involve multiplier => single cycle latency multiplier Need to perform multiply-accumulate (MAC) n-bit multiplier => 2n-bit product

472 DSP Data Path: Accumulator
Don't want to overflow or have to scale the accumulator
Option 1: accumulator wider than the product: "guard bits" (Motorola DSP: 24b x 24b => 48b product, 56b accumulator)
Option 2: shift right and round the product before the adder

473 DSP Data Path: Rounding
Even with guard bits, will need to round when storing the accumulator into memory
3 standard DSP options (sketched below):
Truncation: chop the result => biases the results
Round to nearest: < 1/2 rounds down, >= 1/2 rounds up (more positive) => smaller bias
Convergent: < 1/2 rounds down, > 1/2 rounds up (more positive), exactly 1/2 rounds to make the lsb a zero (+1 if 1, +0 if 0) => no bias; IEEE 754 calls this round to nearest even
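A C sketch of the three options, dropping k fraction bits from a wide accumulator (illustrative only, not a specific DSP's hardware):

#include <stdint.h>

/* Drop k fraction bits (1 <= k <= 31) from a signed accumulator value.
   Assumes arithmetic right shift of negative operands, which most
   compilers provide but which is implementation-defined in C. */
static int32_t round_truncate(int64_t acc, int k) {
    return (int32_t)(acc >> k);                        /* just chop the low bits */
}

static int32_t round_nearest(int64_t acc, int k) {
    int64_t half = (int64_t)1 << (k - 1);
    return (int32_t)((acc + half) >> k);               /* ties go toward +infinity */
}

static int32_t round_convergent(int64_t acc, int k) {  /* round to nearest even */
    int64_t half = (int64_t)1 << (k - 1);
    int64_t frac = acc & (((int64_t)1 << k) - 1);      /* discarded fraction bits */
    int64_t q    = acc >> k;
    if (frac > half || (frac == half && (q & 1)))      /* above 1/2, or exactly 1/2 with lsb odd */
        q += 1;
    return (int32_t)q;
}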

474 DSP Memory FIR Tap implies multiple memory accesses
DSPs want multiple data ports Some DSPs have ad hoc techniques to reduce memory bandwdith demand Instruction repeat buffer: do 1 instruction 256 times Often disables interrupts, thereby increasing interrupt response time Some recent DSPs have instruction caches Even then may allow programmer to “lock in” instructions into cache Option to turn cache into fast program memory No DSPs have data caches May have multiple data memories

475 DSP Addressing Have standard addressing modes: immediate, displacement, register indirect Want to keep MAC datapath busy Assumption: any extra instructions imply clock cycles of overhead in inner loop => complex addressing is good => don’t use datapath to calculate fancy address Autoincrement/Autodecrement register indirect lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1 Option to do it before addressing, positive or negative

476 DSP Addressing: Buffers
DSPs dealing with continuous I/O Often interact with an I/O buffer (delay lines) To save memory, buffer often organized as circular buffer What can do to avoid overhead of address checking instructions for circular buffer? Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of buffer Every DSP has “modulo” or “circular” addressing
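What the modulo/circular addressing hardware saves, sketched in C (illustrative; the names are ours): without it, every delay-line access in the inner loop pays for an explicit wrap-around check or modulo:

#define SIZE 128                     /* hypothetical delay-line length */

typedef struct {
    double buf[SIZE];
    int    head;                     /* index of the newest sample */
} circ_buf;

static void circ_push(circ_buf *c, double sample) {
    c->head = (c->head + 1) % SIZE;  /* the wrap check a DSP does for free */
    c->buf[c->head] = sample;
}

static double circ_get(const circ_buf *c, int age) {  /* age 0 = newest */
    return c->buf[(c->head - age + SIZE) % SIZE];
}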

477 DSP Addressing: FFT FFTs start or end with data in weird bufferfly order 0 (000) => 0 (000) 1 (001) => 4 (100) 2 (010) => 2 (010) 3 (011) => 6 (110) 4 (100) => 1 (001) 5 (101) => 5 (101) 6 (110) => 3 (011) 7 (111) => 7 (111) What can do to avoid overhead of address checking instructions for FFT? Have an optional “bit reverse” address addressing mode for use with autoincrement addressing Many DSPs have “bit reverse” addressing for radix-2 FFT
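A C sketch of the bit-reversed reordering that this addressing mode gives for free (illustrative only; shown for 8 points, i.e. 3 address bits, matching the mapping on this slide):

/* Reverse the low `bits` bits of index i (e.g. 001 -> 100 for bits = 3). */
static unsigned bit_reverse(unsigned i, int bits) {
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

/* Reorder an 8-point buffer into bit-reversed order before/after an FFT. */
static void bitrev_permute8(double *x) {
    for (unsigned i = 0; i < 8; i++) {
        unsigned j = bit_reverse(i, 3);
        if (j > i) {                  /* swap each pair only once */
            double t = x[i]; x[i] = x[j]; x[j] = t;
        }
    }
}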

478 DSP Instructions May specify multiple operations in a single instruction Must support Multiply-Accumulate (MAC) Need parallel move support Usually have special loop support to reduce branch overhead Loop an instruction or sequence 0 value in register usually means loop maximum number of times Must be sure if calculate loop count that 0 does not mean 0 May have saturating shift left arithmetic May have conditional execution to reduce branches

479 DSP vs. General Purpose MPU
DSPs are like embedded MPUs: very concerned about energy and cost
So concerned about cost that they might even use a 4.0 micron process (not 0.40) to shrink wafer costs, by using a fab line with no overhead costs
DSPs that fail are often claimed to be good for something other than the highest-volume application, but that's just designers fooling themselves
Very recently, conventional wisdom has changed: do everything you can digitally at low voltage to save energy; 3 years ago people thought doing everything in analog reduced power, but advances in low-power digital design flipped that bit

480 DSP vs. General Purpose MPU
The “MIPS/MFLOPS” of DSPs is speed of Multiply-Accumulate (MAC). DSP are judged by whether they can keep the multipliers busy 100% of the time. The "SPEC" of DSPs is 4 algorithms: Infinite Impulse Response (IIR) filters Finite Impulse Response (FIR) filters FFT, and convolvers In DSPs, algorithms are king! Binary compatibility not an issue Software is not (yet) king in DSPs. People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.

481 Summary: How are DSPs different?
Essentially infinite streams of data which need to be processed in real time Relatively small programs and data storage requirements Intensive arithmetic processing with low amount of control and branching (in the critical loops) High amount of I/ O with analog interface

482 Summary: How are DSPs different?
Single cycle multiply accumulate (multiple busses and array multipliers) Complex instructions for standard DSP functions (IIR and FIR filters, convolvers) Specialized memory addressing Modular arithmetic for circular buffers (delay lines) Bit reversal (FFT) Zero overhead loops and repeat instructions I/ O support – Serial and parallel ports

483 Summary: Unique Features in DSP architectures
Continuous I/O stream, real time requirements Multiple memory accesses Autoinc/autodec addressing Datapath Multiply width Wide accumulator Guard bits/shifting rounding Saturation Weird things Circular addressing Reverse addressing Special instructions shift left and saturate (arithmetic left-shift)

484 Conclusions DSP processor performance has increased by a factor of about 150x over the past 15 years (~40%/year) Processor architectures for DSP will be increasingly specialized for applications, especially communication applications General-purpose processors will become viable for many DSP applications Users of processors for DSP will have an expanding array of choices Selecting processors requires a careful, application-specific analysis

485 For More Information
Collection of BDTI's papers on DSP processors, tools, and benchmarking; links to other good DSP sites
Microprocessor Report: for info on newer DSP processors
DSP Processor Fundamentals (BDTI): textbook on DSP processors
IEEE Spectrum, July: article on DSP benchmarks
Embedded Systems Programming, October: article on choosing a DSP processor

486 Lecture 6: Storage Devices, Metrics, RAID, I/O Benchmarks, and Busses
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

487 Motivation: Who Cares About I/O?
CPU Performance: 60% per year I/O system performance limited by mechanical delays (disk I/O) < 10% per year (IO per sec or MB per sec) Amdahl's Law: system speed-up limited by the slowest part! 10% IO & 10x CPU => 5x Performance (lose 50%) 10% IO & 100x CPU => 10x Performance (lose 90%) I/O bottleneck: Diminishing fraction of time in CPU Diminishing value of faster CPUs Ancestor of Java had no I/O CPU vs. Peripheral Primary vs. Secondary What maks portable, PDA exciting?

488 Storage System Issues Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Busses

489 I/O Systems Processor Cache Memory - I/O Bus Main Memory I/O
interrupts Processor Cache Memory - I/O Bus Main Memory I/O Controller I/O Controller I/O Controller Graphics Disk Disk Network

490 Technology Trends Disk Capacity now doubles every 18 months; before
• Today: Processing Power Doubles Every 18 months • Today: Memory Size Doubles Every 18 months(4X/3yr) • Today: Disk Capacity Doubles Every 18 months • Disk Positioning Rate (Seek + Rotate) Doubles Every Ten Years! The I/O GAP

491 Storage Technology Drivers
Driven by the prevailing computing paradigm 1950s: migration from batch to on-line processing 1990s: migration to ubiquitous computing computers in phones, books, cars, video cameras, … nationwide fiber optical network with wireless tails Effects on storage industry: Embedded storage smaller, cheaper, more reliable, lower power Data utilities high capacity, hierarchically managed storage

492 Historical Perspective
1956 IBM Ramac — early 1970s Winchester Developed for mainframe computers, proprietary interfaces Steady shrink in form factor: 27 in. to 14 in. 1970s developments 5.25 inch floppy disk form factor early emergence of industry standard disk interfaces ST506, SASI, SMD, ESDI Early 1980s PCs and first generation workstations Mid 1980s Client/server computing Centralized storage on file server accelerates disk downsizing: 8 inch to 5.25 inch Mass market disk drives become a reality industry standards: SCSI, IPI, IDE 5.25 inch drives for standalone PCs, End of proprietary interfaces

493 Disk History 1973: 1. 7 Mbit/sq. in 140 MBytes 1979: 7. 7 Mbit/sq. in
Data density Mbit/sq. in. Capacity of Unit Shown Megabytes 1973: 1. 7 Mbit/sq. in 140 MBytes 1979: 7. 7 Mbit/sq. in 2,300 MBytes source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”

494 Historical Perspective
Late 1980s/Early 1990s: Laptops, notebooks, (palmtops) 3.5 inch, 2.5 inch, (1.8 inch form factors) Form factor plus capacity drives market, not so much performance Recently Bandwidth improving at 40%/ year Challenged by DRAM, flash RAM in PCMCIA cards still expensive, Intel promises but doesn’t deliver unattractive MBytes per cubic inch Optical disk fails on performance (e.g., NEXT) but finds niche (CD ROM)

495 Disk History 1989: 63 Mbit/sq. in 60,000 MBytes 1997: 1450 Mbit/sq. in
source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even mroe data into even smaller spaces”

496 MBits per square inch: DRAM as % of Disk over time
9 v. 22 Mb/si 470 v Mb/si 0.2 v. 1.7 Mb/si source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even mroe data into even smaller spaces”

497 Alternative Data Storage Technologies: Early 1990s
Cap BPI TPI BPI*TPI Data Xfer Access Technology (MB) (Million) (KByte/s) Time Conventional Tape: Cartridge (.25") minutes IBM 3490 (.5") seconds Helical Scan Tape: Video (8mm) secs DAT (4mm) secs Magnetic & Optical Disk: Hard Disk (5.25") ms IBM (10.5") ms Sony MO (5.25") ms

498 Devices: Magnetic Disks
Purpose: Long-term, nonvolatile storage Large, inexpensive, slow level in the storage hierarchy Characteristics: Seek Time (~8 ms avg) ~4 ms positional latency ~4 ms rotational latency Transfer rate About a sector per ms (5-15 MB/s) Blocks Capacity Gigabytes Quadruples every 3 years Track Sector Cylinder Platter Head 7200 RPM = 120 RPS => 8 ms per rev avg rot. latency = 4 ms 128 sectors per track => 0.25 ms per sector 1 KB per sector => 16 MB / s Response time = Queue + Controller + Seek + Rot + Xfer Service time

499 Disk Device Terminology
Disk Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Xfer Time Order of magnitude times for 4K byte transfers: Seek: 8 ms or less Rotate: rpm Xfer: rpm

500 Advantages of Small Form-factor Disk Drives
Low cost/MB High MB/volume High MB/watt Low cost/Actuator Cost and Environmental Efficiencies

501 Tape vs. Disk • Longitudinal tape uses same technology as
hard disk; tracks its density improvements Disk head flies above surface, tape head lies on surface Disk fixed, tape removable • Inherent cost-performance based on geometries: fixed rotating platters with gaps (random access, limited area, 1 media / reader) vs. removable long strips wound on spool (sequential access, "unlimited" length, multiple / reader) • New technology trend: Helical Scan (VCR, Camcorder, DAT) Spins head at angle to tape to improve density

502 Current Drawbacks to Tape
Tape wear out: Helical 100s of passes to 1000s for longitudinal Head wear out: 2000 hours for helical Both must be accounted for in economic / reliability model Long rewind, eject, load, spin-up times; not inherent, just no need in marketplace (so far) Designed for archival

503 Automated Cartridge System
STC 4400: 8 feet x 10 feet
6000 x 0.8 GB 3490 tapes = 5 TBytes; $500,000 O.E.M. price
6000 x 10 GB D3 tapes = 60 TBytes in 1998
Library of Congress: all the information in the world; in 1992, the ASCII of all books = 30 TB

504 Relative Cost of Storage Technology—Late 1995/Early 1996
Magnetic Disks 5.25” 9.1 GB $2129 $0.23/MB $1985 $0.22/MB 3.5” 4.3 GB $1199 $0.27/MB $999 $0.23/MB 2.5” 514 MB $299 $0.58/MB GB $345 $0.33/MB Optical Disks 5.25” 4.6 GB $ $0.41/MB $ $0.39/MB PCMCIA Cards Static RAM 4.0 MB $700 $175/MB Flash RAM 40.0 MB $1300 $32/MB 175 MB $3600 $20.50/MB

505 Outline Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Busses

506 Disk I/O Performance Metrics: Response Time Throughput Response
Time (ms) Metrics: Response Time Throughput 300 200 100 100% 0% Throughput (% total BW) Proc Queue IOC Device Response time = Queue + Device Service time

507 Response Time vs. Productivity
Interactive environments: Each interaction or transaction has 3 parts: Entry Time: time for user to enter command System Response Time: time between user entry & system replies Think Time: Time from response until user begins next command 1st transaction 2nd transaction What happens to transaction time as shrink system response time from 1.0 sec to 0.3 sec? With Keyboard: 4.0 sec entry, 9.4 sec think time With Graphics: sec entry, 1.6 sec think time

508 Response Time & Productivity
0.7sec off response saves 4.9 sec (34%) and 2.0 sec (70%) total time per transaction => greater productivity Another study: everyone gets more done with faster response, but novice with fast response = expert with slow

509 Disk Time Example Disk Parameters: Controller overhead is 2 ms
Transfer size is 8K bytes Advertised average seek is 12 ms Disk spins at 7200 RPM Transfer rate is 4 MB/sec Controller overhead is 2 ms Assume that disk is idle so no queuing delay What is Average Disk Access Time for a Sector? Avg seek + avg rot delay + transfer time + controller overhead 12 ms + 0.5/(7200 RPM/60) + 8 KB/4 MB/s + 2 ms = 20 ms Advertised seek time assumes no locality: typically 1/4 to 1/3 advertised seek time: 20 ms => 12 ms
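Writing the same calculation out term by term (a worked version of the line above):

\[ t_{\mathrm{avg}} = 12\ \mathrm{ms} + \frac{0.5\ \mathrm{rev}}{7200/60\ \mathrm{rev/s}} + \frac{8\ \mathrm{KB}}{4\ \mathrm{MB/s}} + 2\ \mathrm{ms} \approx 12 + 4.2 + 2 + 2 = 20.2\ \mathrm{ms} \approx 20\ \mathrm{ms}. \]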

510 Outline Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Busses

511 Processor Interface Issues
Interrupts Memory mapped I/O I/O Control Structures Polling DMA I/O Controllers I/O Processors Capacity, Access Time, Bandwidth Interconnections Busses

512 I/O Interface CPU Memory Independent I/O Bus memory bus Interface
Separate I/O instructions (in,out) Peripheral Peripheral CPU Lines distinguish between I/O and memory transfers common memory & I/O bus 40 Mbytes/sec optimistically 10 MIP processor completely saturates the bus! VME bus Multibus-II Nubus Memory Interface Interface Peripheral Peripheral

513 Memory Mapped I/O CPU Single Memory & I/O Bus
No Separate I/O Instructions ROM Memory Interface Interface RAM Peripheral Peripheral CPU $ I/O L2 $ Memory Bus I/O bus Memory Bus Adaptor

514 Programmed I/O (Polling)
CPU Is the data ready? busy wait loop not an efficient way to use the CPU unless the device is very fast! no Memory IOC yes read data device but checks for I/O completion can be dispersed among computationally intensive code store data done? no yes

515 Interrupt Driven Data Transfer
(Figure: a user program is interrupted for I/O: (1) I/O interrupt arrives, (2) save the PC, (3) jump to the interrupt service address, read/store/rti in the interrupt service routine, (4) resume the user program)
User program progress is only halted during the actual transfer
1000 transfers at 1 msec each: 1000 interrupts plus 1000 interrupt service routines at 98 µsec each = 0.1 CPU seconds
Device transfer rate = 10 MBytes/sec => 0.1 x 10^-6 sec/byte => 0.1 µsec/byte => 1000 bytes take 100 µsec; 1000 transfers x 100 µsec = 100 ms = 0.1 CPU seconds
Still far from the device transfer rate! Half the time is interrupt overhead

516 Direct Memory Access Time to do 1000 xfers at 1 msec each:
1 DMA set-up sequence @ 50 µsec + 1 interrupt @ 2 µsec + 1 interrupt service @ 48 µsec = 0.0001 seconds of CPU time. The CPU sends a starting address, direction, and length count to the DMAC, then issues "start". (Diagram: CPU, ROM, RAM, memory-mapped I/O, DMAC, IOC, device, peripherals.) The DMAC provides handshake signals for the peripheral controller, and memory addresses and handshake signals for memory.
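A short Python sketch comparing CPU time for the two schemes, using the figures quoted on these two slides (the per-interrupt cost in the interrupt-driven case is inferred from the stated 0.1-second total):

```python
# CPU time to move 1000 blocks, interrupt-driven vs. DMA.
N = 1000

# Interrupt-driven: one interrupt (2 us) + one service routine (98 us) per transfer
interrupt_cpu_s = N * (2e-6 + 98e-6)     # = 0.1 CPU seconds

# DMA: one set-up (50 us) + one completion interrupt (2 us) + one service (48 us)
dma_cpu_s = 50e-6 + 2e-6 + 48e-6         # = 0.0001 CPU seconds

print(interrupt_cpu_s, dma_cpu_s, interrupt_cpu_s / dma_cpu_s)  # DMA uses ~1000x less CPU
```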

517 Input/Output Processors
IOP D1 CPU D2 main memory bus Mem Dn I/O bus target device where cmnds are CPU IOP issues instruction to IOP interrupts when done (4) OP Device Address (1) looks in memory for commands (2) (3) memory OP Addr Cnt Other Device to/from memory transfers are controlled by the IOP directly. IOP steals memory cycles. what to do special requests where to put data how much

518 Relationship to Processor Architecture
I/O instructions have largely disappeared Interrupts: Stack replaced by shadow registers Handler saves registers and re-enables higher priority int's Interrupt types reduced in number; handler must query interrupt controller

519 Relationship to Processor Architecture
Caches required for processor performance cause problems for I/O Flushing is expensive, I/O pollutes cache Solution is borrowed from shared memory multiprocessors "snooping" Virtual memory frustrates DMA Stateful processors hard to context switch

520 Summary Disk industry growing rapidly, improves:
bandwidth 40%/yr, areal density 60%/year, $/MB improving even faster? Access time = queue + controller + seek + rotate + transfer. Advertised average seek time is a benchmark figure, much greater than the average seek time seen in practice. Response time vs. bandwidth tradeoffs. Value of faster response time: 0.7 sec off response saves 4.9 sec (34%) and 2.0 sec (70%) total time per transaction => greater productivity; everyone gets more done with faster response, but a novice with fast response = an expert with slow. Processor interface: today peripheral processors, DMA, I/O bus, interrupts

521 Summary: Relationship to Processor Architecture
I/O instructions have disappeared Interrupt stack replaced by shadow registers Interrupt types reduced in number Caches required for processor performance cause problems for I/O Virtual memory frustrates DMA Stateful processors hard to context switch

522 Outline Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Busses

523 Network Attached Storage
Decreasing Disk Diameters 14" » 10" » 8" » 5.25" » 3.5" » 2.5" » 1.8" » 1.3" » . . . high bandwidth disk systems based on arrays of disks High Performance Storage Service on a High Speed Network Network provides well defined physical and logical interfaces: separate CPU and storage system! Network File Services OS structures supporting remote file access 3 Mb/s » 10Mb/s » 50 Mb/s » 100 Mb/s » 1 Gb/s » 10 Gb/s networks capable of sustaining high bandwidth transfers Increasing Network Bandwidth

524 Manufacturing Advantages of Disk Arrays
Disk Product Families. Conventional: multiple disk designs across the product line (14", 10", 5.25", 3.5") from low end to high end. Disk Array: 1 disk design (3.5")

525 Replace Small # of Large Disks with Large # of Small Disks
Replace Small # of Large Disks with Large # of Small Disks! (1988 Disks)
Metric        | IBM 3390 (K) | IBM 3.5" 0061 | x70 (array of 70)
Data Capacity | 20 GBytes    | 320 MBytes    | 23 GBytes
Volume        | 97 cu. ft.   | 0.1 cu. ft.   | 11 cu. ft.
Power         | 3 KW         | 11 W          | 1 KW
Data Rate     | 15 MB/s      | 1.5 MB/s      | 120 MB/s
I/O Rate      | 600 I/Os/s   | 55 I/Os/s     | 3900 I/Os/s
MTTF          | 250 KHrs     | 50 KHrs       | ??? Hrs
Cost          | $250K        | $2K           | $150K
Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?

526 Array Reliability Reliability of N disks = Reliability of 1 Disk ÷ N
50,000 Hours ÷ 70 disks = 700 hours Disk system MTTF: Drops from 6 years to 1 month! • Arrays (without redundancy) too unreliable to be useful! Hot spares support reconstruction in parallel with access: very high media availability can be achieved
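A two-line Python sketch of the arithmetic behind this drop:

```python
# MTTF of an N-disk array with no redundancy = MTTF of one disk / N.
disk_mttf_hours = 50_000
n_disks = 70
array_mttf_hours = disk_mttf_hours / n_disks
print(array_mttf_hours)                 # ~714 hours
print(array_mttf_hours / (24 * 30))     # ~1 month, down from ~6 years for a single disk
```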

527 Redundant Arrays of Disks
• Files are "striped" across multiple spindles • Redundancy yields high data availability Disks will fail Contents reconstructed from data redundantly stored in the array Capacity penalty to store it Bandwidth penalty to update Mirroring/Shadowing (high capacity cost) Parity Techniques:

528 Redundant Arrays of Disks RAID 1: Disk Mirroring/Shadowing
recovery group • Each disk is fully duplicated onto its "shadow" Very high availability can be achieved • Bandwidth sacrifice on write: Logical write = two physical writes • Reads may be optimized • Most expensive solution: 100% capacity overhead Targeted for high I/O rate , high availability environments

529 Redundant Arrays of Disks RAID 3: Parity Disk
(Figure: a logical record striped as physical records across the data disks plus a parity disk P.) Parity computed across the recovery group to protect against hard disk failures; 33% capacity cost for parity in this configuration; wider arrays reduce capacity costs but decrease expected availability and increase reconstruction time. Arms logically synchronized, spindles rotationally synchronized: logically a single high-capacity, high-transfer-rate disk. Targeted for high bandwidth applications: scientific computing, image processing

530 Redundant Arrays of Disks RAID 5+: High I/O Rate Parity
Increasing Logical Disk Addresses P A logical write becomes four physical I/Os Independent writes possible because of interleaved parity D4 D5 D6 P D7 D8 D9 P D10 D11 D12 P D13 D14 D15 Stripe P D16 D17 D18 D19 Targeted for mixed applications Stripe Unit D20 D21 D22 D23 P . . . . . Disk Columns

531 Problems of Disk Arrays: Small Writes
RAID-5: Small Write Algorithm 1 Logical Write = 2 Physical Reads + 2 Physical Writes D0' D0 D1 D2 D3 P new data old data old parity (1. Read) (2. Read) + XOR + XOR (3. Write) (4. Write) D0' D1 D2 D3 P'
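A small Python sketch of the parity update behind the "2 reads + 2 writes" count; the block contents and names below are illustrative, not from the slide:

```python
def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity = old parity XOR old data XOR new data.
    This is why one logical write costs 2 reads (old data, old parity)
    plus 2 writes (new data, new parity)."""
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

d0, d1, d2, d3 = b"\x11", b"\x22", b"\x33", b"\x44"
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(d0, d1, d2, d3))

new_d0 = b"\x55"
new_parity = raid5_small_write(d0, parity, new_d0)
# Same result as recomputing parity over the whole stripe:
assert new_parity == bytes(a ^ b ^ c ^ d for a, b, c, d in zip(new_d0, d1, d2, d3))
```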

532 Subsystem Organization
host host adapter array controller single board disk controller manages interface to host, DMA single board disk controller control, buffering, parity logic single board disk controller physical device control single board disk controller striping software off-loaded from host to array controller no applications modifications no reduction of host performance often piggy-backed in small format devices

533 System Availability: Orthogonal RAIDs
Array Controller String Controller String Controller String Controller String Controller String Controller String Controller Data Recovery Group: unit of data redundancy Redundant Support Components: fans, power supplies, controller, cables End to End Data Integrity: internal parity protected data paths

534 System-Level Availability
host host I/O Controller Fully dual redundant I/O Controller Array Controller Array Controller . . . . . . . . . Goal: No Single Points of Failure . . . . . . . with duplicated paths, higher performance can be obtained when there are no failures Recovery Group

535 Summary: Redundant Arrays of Disks (RAID) Techniques
• Disk Mirroring, Shadowing (RAID 1) 1 1 Each disk is fully duplicated onto its "shadow" Logical write = two physical writes 100% capacity overhead • Parity Data Bandwidth Array (RAID 3) 1 1 1 1 Parity computed horizontally Logically a single high data bw disk • High I/O Rate Parity Array (RAID 5) Interleaved parity blocks Independent reads and writes Logical write = 2 reads + 2 writes Parity + Reed-Solomon codes

536 Outline Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Buses

537 ABCs of UNIX File Systems
Key Issues File vs. Raw I/O File Cache Size Policy Write Policy Local Disk vs. Server Disk File vs. Raw: File system access is the norm: standard policies apply Raw: alternate I/O system to avoid file system, used by data bases % of main memory dedicated to file cache is fixed at system generation (e.g., 10%) % of main memory for file cache varies depending on amount of file I/O (e.g., up to 80%)

538 ABCs of UNIX File Systems
Write Policy File Storage should be permanent; either write immediately or flush file cache after fixed period (e.g., 30 seconds) Write Through with Write Buffer Write Back Write Buffer often confused with Write Back Write Through with Write Buffer, all writes go to disk Write Through with Write Buffer, writes are asynchronous, so processor doesn’t have to wait for disk write Write Back will combine multiple writes to same page; hence can be called Write Cancelling

539 ABCs of UNIX File Systems
Local vs. Server Unix File systems have historically had different policies (and even file systems) for local client vs. remote server NFS local disk allows 30 second delay to flush writes NFS server disk writes through to disk on file close Cache coherency problem if allow clients to have file caches in addition to server file cache NFS just writes through on file close Other file systems use cache coherency with write back to check state and selectively invalidate or update

540 Typical File Server Architecture
Limits to performance: data copying read data staged from device to primary memory copy again into network packet templates copy yet again to network interface No specialization for fast processing between network and disk

541 AUSPEX NS5000 File Server Special hardware/software architecture for high performance NFS I/O Functional multiprocessing I/O buffers UNIX frontend specialized for protocol processing dedicated FS software manages 10 SCSI channels

542 Berkeley RAID-II Disk Array File Server
to UltraNet Low latency transfers mixed with high bandwidth transfers to 120 disk drives

543 I/O Benchmarks For better or worse, benchmarks shape a field
Processor benchmarks classically aimed at response time for a fixed-size problem; I/O benchmarks typically measure throughput, possibly with an upper limit on response times (or on 90% of response times). What if we fix problem size, given the 60%/year increase in DRAM capacity?
Benchmark | Size of Data | % Time in I/O | Year
I/OStones | 1 MB | 26% | 1990
Andrew | 4.5 MB | 4% | 1988
Not much time in I/O; not really measuring the disk (or even main memory)

544 I/O Benchmarks Alternative: self-scaling benchmark; automatically and dynamically increase aspects of workload to match characteristics of system measured Measures wide range of current & future Describe three self-scaling benchmarks Transaction Processing: TPC-A, TPC-B, TPC-C NFS: SPEC SFS (LADDIS) Unix I/O: Willy

545 I/O Benchmarks: Transaction Processing
Transaction Processing (TP) (or On-line TP=OLTP) Changes to a large body of shared information from many terminals, with the TP system guaranteeing proper behavior on a failure If a bank’s computer fails when a customer withdraws money, the TP system would guarantee that the account is debited if the customer received the money and that the account is unchanged if the money was not received Airline reservation systems & banks use TP Atomic transactions makes this work Each transaction => 2 to 10 disk I/Os & 5,000 to 20,000 CPU instructions per disk I/O Efficiency of TP SW & avoiding disks accesses by keeping information in main memory Classic metric is Transactions Per Second (TPS) Under what workload? how machine configured?

546 I/O Benchmarks: Transaction Processing
Early 1980s great interest in OLTP Expecting demand for high TPS (e.g., ATM machines, credit cards) Tandem’s success implied medium range OLTP expands Each vendor picked own conditions for TPS claims, report only CPU times with widely different I/O Conflicting claims led to disbelief of all benchmarks=> chaos 1984 Jim Gray of Tandem distributed paper to Tandem employees and 19 in other industries to propose standard benchmark Published “A measure of transaction processing power,” Datamation, 1985 by Anonymous et. al To indicate that this was effort of large group To avoid delays of legal department of each author’s firm Still get mail at Tandem to author

547 I/O Benchmarks: TP by Anon et. al
Proposed 3 standard tests to characterize commercial OLTP TP1: OLTP test, DebitCredit, simulates ATMs (TP1) Batch sort Batch scan Debit/Credit: One type of transaction: 100 bytes each Recorded 3 places: account file, branch file, teller file + events recorded in history file (90 days) 15% requests for different branches Under what conditions, how report results?

548 I/O Benchmarks: TP1 by Anon et. al
DebitCredit Scalability: the size of the account, branch, teller, and history files is a function of throughput.
TPS | Number of ATMs | Account-file size
10 | 1,000 | 0.1 GB
100 | 10,000 | 1.0 GB
1,000 | 100,000 | 10.0 GB
10,000 | 1,000,000 | 100.0 GB
Each input TPS => 100,000 account records, 10 branches, 100 ATMs. Accounts must grow since a person is not likely to use the bank more frequently just because the bank has a faster computer! Response time: 95% of transactions take < 1 second. Configuration control: just report price (initial purchase price + 5 year maintenance = cost of ownership). By publishing, results are in the public domain

549 I/O Benchmarks: TP1 by Anon et. al
Problems Often ignored the user network to terminals Used transaction generator with no think time; made sense for database vendors, but not what customer would see Solution: Hire auditor to certify results Auditors soon saw many variations of ways to trick system Proposed minimum compliance list (13 pages); still, DEC tried IBM test on different machine with poorer results than claimed by auditor Created Transaction Processing Performance Council in 1988: founders were CDC, DEC, ICL, Pyramid, Stratus, Sybase, Tandem, and Wang; 46 companies today Led to TPC standard benchmarks in 1990,

550 I/O Benchmarks: Old TPC Benchmarks
TPC-A: Revised version of TP1/DebitCredit Arrivals: Random (TPC) vs. uniform (TP1) Terminals: Smart vs. dumb (affects instruction path length) ATM scaling: 10 terminals per TPS vs. 100 Branch scaling: 1 branch record per TPS vs. 10 Response time constraint: 90% < 2 seconds vs. 95% < 1 Full disclosure, approved by TPC Complete TPS vs. response time plots vs. single point TPC-B: Same as TPC-A but without terminals—batch processing of requests Response time makes no sense: plots tps vs. residence time (time of transaction resides in system) These have been withdrawn as benchmarks

551 I/O Benchmarks: TPC-C Complex OLTP
Models a wholesale supplier managing orders Order-entry conceptual model for benchmark Workload = 5 transaction types Users and database scale linearly with throughput Defines full-screen end-user interface Metrics: new-order rate (tpmC) and price/performance ($/tpmC) Approved July 1992

552 I/O Benchmarks: TPC-D Complex Decision Support Workload
OLTP: business operation; Decision support: business analysis (historical). Workload = 17 ad hoc queries, e.g., impact on revenue of eliminating a company-wide discount? Synthetic data generator. Size determined by Scale Factor (SF): 100 GB, 300 GB, 1 TB, 3 TB, 10 TB. Metrics ("queries per gigabyte-hour"): Power = 3600 x SF / geometric mean of the query times; Throughput = 17 x SF / (elapsed time/3600); Price/Performance = $ / geometric mean of Power and Throughput. Report time to load the database (indices, stats) too. Approved April 1995
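A sketch of these metrics in Python; the scale factor and the 17 query times below are invented for illustration and are not published results, and the composite follows my reading of the slide's price/performance formula:

```python
from statistics import geometric_mean

SF = 300                                   # e.g., a 300 GB database
query_times_s = [120, 45, 300, 80, 60, 150, 95, 200, 70, 110,
                 55, 90, 130, 40, 75, 160, 85]       # 17 ad hoc queries, in seconds

power = 3600 * SF / geometric_mean(query_times_s)    # single-stream "power"
throughput = 17 * SF / (sum(query_times_s) / 3600)   # queries per GB-hour
composite = geometric_mean([power, throughput])      # basis for price/performance ($/composite)
print(round(power), round(throughput), round(composite))
```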

553 I/O Benchmarks: TPC-W Transactional Web Benchmark
Represents any business (retail store, software distribution, airline reservation, electronic stock trades, etc.) that markets and sells over the Internet/Intranet. Measures systems supporting users browsing, ordering, and conducting transaction-oriented business activities. Security (including user authentication and data encryption) and dynamic page generation are important. Before: processing of a customer order by a terminal operator working on a LAN connected to the database system. Today: the customer accesses the company site over an Internet connection, browses both static and dynamically generated Web pages, and searches the database for product or customer information; customers also initiate, finalize, and check on product orders and deliveries. Started 1/97; hope to release Fall, 1998

554 TPC-C Performance tpm(c)
Rank Config tpmC $/tpmC Database 1 IBM RS/6000 SP (12 node x 8-way) 57, $ Oracle 2 HP HP 9000 V2250 (16-way) 52, $81.17 Sybase ASE 3 Sun Ultra E6000 c/s (2 node x 22-way) 51, $ Oracle 4 HP HP 9000 V2200 (16-way) 39, $94.18 Sybase ASE 5 Fujitsu GRANPOWER 7000 Model , $57, Oracle8 6 Sun Ultra E6000 c/s (24-way) 31, $ Oracle 7Digital AlphaS8400 (4 node x 8-way) 30, $ Oracle7 V7.3 8 SGI Origin2000 Server c/s (28-way) 25, $ INFORMIX 9 IBM AS/400e Server (12-way) 25, $ DB2 10 Digital AlphaS8400 5/625 (10-way) 24, $ Sybase SQL

555 TPC-C Price/Performance $/tpm(c)
Rank Config $/tpmC tpmC Database 1 Acer AcerAltos 19000Pro4 $ , M/S SQL 6.5 2 Dell PowerEdge 6100 c/s $ , M/S SQL 6.5 3 Compaq ProLiant 5500 c/s $ , M/S SQL 6.5 4 ALR Revolution 6x6 c/s $ , M/S SQL 6.5 5 HP NetServer LX Pro $ , M/S SQL 6.5 6 Fujitsu teamserver M796i $ , M/S SQL 6.5 7 Fujitsu GRANPOWER 5000 Model 670 $ , M/S SQL 6.5 8 Unisys Aquanta HS/6 c/s $ , M/S SQL 6.5 9 Compaq ProLiant 7000 c/s $ , M/S SQL 6.5 10 Unisys Aquanta HS/6 c/s $ , M/S SQL 6.5

556 TPC-D Performance/Price 300 GB
Rank Config. Qppd QthD $/QphD Database 1 NCR WorldMark , , , Teradata 2 HP 9000 EPS22 (16 node) 5, , , Informix-XPS 3 DG AViiON AV , , , Oracle8 v8.0.4 4 Sun - Ultra Enterprise , , , Informix-XPS 5 Sequent NUMA-Q 2000 (32 way) 3, , , Oracle8 v8.0.4 Rank Config. Qppd QthD $/QphD Database 1 DG AViiON AV , , , Oracle8 v8.0.4 2 Sun Ultra Enterprise , , , Informix-XPS 3 HP 9000 EPS22 (16 node) 5, , , Informix-XPS 4 NCR WorldMark , , , Teradata 5 Sequent NUMA-Q 2000 (32 way) 3, , , Oracle8 v8.0.4

557 TPC-D Performance 1TB Rank Config. Qppd QthD $/QphD Database 1 Sun Ultra E6000 (4 x 24-way) 12, , , Infomix Dyn 2 NCR WorldMark (32 x 4-way) 12, , Teradata 3 IBM RS/6000 SP (32 x 8-way) 7, , DB2 UDB, V5 NOTE: Inappropriate to compare results from different database sizes.

558 SPEC SFS/LADDIS Predecessor: NFSstones
NFSStones: synthetic benchmark that generates series of NFS requests from single client to test server: reads, writes, & commands & file sizes from other studies Problem: 1 client could not always stress server Files and block sizes not realistic Clients had to run SunOS

559 SPEC SFS/LADDIS 1993 Attempt by NFS companies to agree on standard benchmark: Legato, Auspex, Data General, DEC, Interphase, Sun. Like NFSstones but Run on multiple clients & networks (to prevent bottlenecks) Same caching policy in all clients Reads: 85% full block & 15% partial blocks Writes: 50% full block & 50% partial blocks Average response time: 50 ms Scaling: for every 100 NFS ops/sec, increase capacity 1GB Results: plot of server load (throughput) vs. response time Assumes: 1 user => 10 NFS ops/sec

560 Example SPEC SFS Result: DEC Alpha
200 MHz 21064: 8KI + 8KD + 2MB L2; 512 MB; 1 Gigaswitch DEC OSF1 v2.0 4 FDDI networks; 32 NFS Daemons, 24 GB file size 88 Disks, 16 controllers, 84 file systems 4817

561 Willy UNIX File System Benchmark that gives insight into I/O system behavior (Chen and Patterson, 1993) Self-scaling to automatically explore system size Examines five parameters Unique bytes touched: ­ data size; locality via LRU Gives file cache size Percentage of reads: %writes = 1 – % reads; typically 50% 100% reads gives peak throughput Average I/O Request Size: Bernoulli distrib., Coeff of variance=1 Percentage sequential requests: typically 50% Number of processes: concurrency of workload (number processes issuing I/O requests) Fix four parameters while vary one parameter Searches space to find high throughput

562 Example Willy: DS 5000 Avg. Access Size 32 KB 13 KB
Metric | Sprite | Ultrix
Avg. Access Size | 32 KB | 13 KB
Data touched (file cache) | 2 MB, 15 MB | 2 MB
Data touched (disk) | 36 MB | 6 MB
% reads = 50%, % sequential = 50%; DS 5000, MB memory. Ultrix: fixed file cache size, write through. Sprite: dynamic file cache size, write back (write cancelling)

563 Sprite's Log Structured File System
Large file caches effective in reducing disk reads Disk traffic likely to be dominated by writes Write-Optimized File System Only representation on disk is log Stream out files, directories, maps without seeks Advantages: Speed Stripes easily across several disks Fast recovery Temporal locality Versioning Problems: Random access retrieval Log wrap Disk space utilization

564 Willy: DS 5000 Number Bytes Touched
W+R Cached R Cached None Cached Log Structured File System: effective write cache of LFS much smaller (5-8 MB) than read cache (20 MB) Reads cached while writes are not => 3 plateaus

565 Summary: I/O Benchmarks
Scaling to track technological change TPC: price performance as normalizing configuration feature Auditing to ensure no foul play Throughput with restricted response time is normal measure

566 Outline Historical Context of Storage I/O
Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues A Little Queuing Theory Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks Comparing UNIX File System Performance I/O Busses

567 Interconnect Trends Interconnect = glue that interfaces computer system components High speed hardware interfaces + logical protocols Networks, channels, backplanes message-based narrow pathways distributed arb memory-mapped wide pathways centralized arb

568 Backplane Architectures
Distinctions begin to blur: SCSI channel is like a bus FutureBus is like a channel (disconnect/reconnect) HIPPI forms links in high speed switching fabrics

569 Bus-Based Interconnect
Bus: a shared communication link between subsystems Low cost: a single set of wires is shared multiple ways Versatility: Easy to add new devices & peripherals may even be ported between computers using common bus Disadvantage A communication bottleneck, possibly limiting the maximum I/O throughput Bus speed is limited by physical factors the bus length the number of devices (and, hence, bus loading). these physical limits prevent arbitrary bus speedup.

570 Bus-Based Interconnect
Two generic types of busses: I/O busses: lengthy, many types of devices connected, wide range in the data bandwidth), and follow a bus standard (sometimes called a channel) CPU–memory buses: high speed, matched to the memory system to maximize memory–CPU bandwidth, single device (sometimes called a backplane) To lower costs, low cost (older) systems combine together Bus transaction Sending address & receiving or sending data

571 Bus Protocols Multibus: 20 address, 16 data, 5 control
Master Slave ° ° ° Control Lines Address Lines Data Lines Multibus: 20 address, 16 data, 5 control Bus Master: has ability to control the bus, initiates transaction Bus Slave: module activated by the transaction Bus Communication Protocol: specification of sequence of events and timing requirements in transferring information. Asynchronous Bus Transfers: control lines (req., ack.) serve to orchestrate sequencing Synchronous Bus Transfers: sequence relative to common clock

572 Synchronous Bus Protocols
Clock Address Data Read Wait Read complete begin read Pipelined/Split transaction Bus Protocol Address Data Wait addr 1 addr 2 addr 3 data 0 data 1 data 2 wait 1 OK 1

573 Asynchronous Handshake
Write Transaction Address Data Read Req. Ack. Master Asserts Address Next Address Master Asserts Data 4 Cycle Handshake t t t t3 t4 t5 t0 : Master has obtained control and asserts address, direction, data Waits a specified amount of time for slaves to decode target\ t1: Master asserts request line t2: Slave asserts ack, indicating data received t3: Master releases req t4: Slave releases ack

574 Read Transaction Time Multiplexed Bus: address and data share lines
Req Ack Master Asserts Address Next Address 4 Cycle Handshake t t t t3 t4 t5 t0 : Master has obtained control and asserts address, direction, data Waits a specified amount of time for slaves to decode target\ t1: Master asserts request line t2: Slave asserts ack, indicating ready to transmit data t3: Master releases req, data received t4: Slave releases ack Time Multiplexed Bus: address and data share lines

575 Bus Arbitration BR=Bus Request BG=Bus Grant
Parallel (Centralized) Arbitration Serial Arbitration (daisy chaining) Polling (decentralized) BR=Bus Request BG=Bus Grant BR BG BR BG BR BG M M M BG BR BGi BGo BGi BGo BGi BGo M M M A.U. BR BR BR Busy On BGi BGo BGi BGo BGi BGo M M M BR BR BR

576 Bus Options Option High performance Low cost
Option | High performance | Low cost
Bus width | Separate address & data lines | Multiplex address & data lines
Data width | Wider is faster (e.g., 32 bits) | Narrower is cheaper (e.g., 8 bits)
Transfer size | Multiple words have less bus overhead | Single-word transfer is simpler
Bus masters | Multiple (requires arbitration) | Single master (no arbitration)
Split transaction? | Yes: separate request and reply packets give higher bandwidth (needs multiple masters) | No: a continuous connection is cheaper and has lower latency
Clocking | Synchronous | Asynchronous

577 SCSI: Small Computer System Interface
Clock rate: 5 MHz / 10 MHz (fast) / 20 MHz (ultra) Width: n = 8 bits / 16 bits (wide); up to n – 1 devices to communicate on a bus or “string” Devices can be slave (“target”) or master(“initiator”) SCSI protocol: a series of “phases”, during which specif- ic actions are taken by the controller and the SCSI disks Bus Free: No device is currently accessing the bus Arbitration: When the SCSI bus goes free, multiple devices may request (arbitrate for) the bus; fixed priority by address Selection: informs the target that it will participate (Reselection if disconnected) Command: the initiator reads the SCSI command bytes from host memory and sends them to the target Data Transfer: data in or out, initiator: target Message Phase: message in or out, initiator: target (identify, save/restore data pointer, disconnect, command complete) Status Phase: target, just before command complete

578 1993 I/O Bus Survey (P&H, 2nd Ed)
Bus SBus TurboChannel MicroChannel PCI Originator Sun DEC IBM Intel Clock Rate (MHz) async 33 Addressing Virtual Physical Physical Physical Data Sizes (bits) 8,16,32 8,16,24,32 8,16,24,32,64 8,16,24,32,64 Master Multi Single Multi Multi Arbitration Central Central Central Central 32 bit read (MB/s) Peak (MB/s) (222) Max Power (W)

579 1993 MP Server Memory Bus Survey
Bus Summit Challenge XDBus Originator HP SGI Sun Clock Rate (MHz) Split transaction? Yes Yes Yes? Address lines ?? Data lines (parity) Data Sizes (bits) Clocks/transfer 4 5 4? Peak (MB/s) Master Multi Multi Multi Arbitration Central Central Central Addressing Physical Physical Physical Slots Busses/system 1 1 2 Length 13 inches 12? inches 17 inches

580 Lecture 7: Interconnection Networks
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Patterson CS252 Copyright 1998 UCB)

581 Review: Storage System Issues
Historical Context of Storage I/O Secondary and Tertiary Storage Devices Storage I/O Performance Measures Processor Interface Issues Redundant Arrays of Inexpensive Disks (RAID) ABCs of UNIX File Systems I/O Benchmarks

582 Review: I/O Benchmarks
Scaling to track technological change TPC: price performance as normalizing configuration feature Auditing to ensure no foul play Throughput with restricted response time is normal measure

583 I/O to External Devices and Other Computers
(Diagram: Processor with Cache; Memory-I/O Bus; Main Memory; I/O Controllers for Graphics, Disk, Disk, Network; interrupts back to the processor.) I/O used to be neglected, now enjoying a renaissance. Ideal: high bandwidth, low latency

584 Networks Goal: Communication between computers
Eventual Goal: treat collection of computers as if one big computer, distributed resource sharing Theme: Different computers must agree on many things Overriding importance of standards and protocols Fault tolerance critical as well Warning: Terminology-rich environment Obvious goal 1 big coherent file system as future goal? Standards critical

585 Example Major Networks
30 acronyms on these slides. T1 = a standard leased-line speed. FDDI = Fiber Distributed Data Interface, intended 10X successor to Ethernet; the standard took so long that faster versions of Ethernet became available: 100 Mbit Ethernet, soon 1 Gbit Ethernet. Software is hard to change, so hardware should be the part that changes

586 Networks Facets people talk a lot about: What really matters:
direct (point-to-point) vs. indirect (multi-hop) topology (e.g., bus, ring, DAG) routing algorithms switching (aka multiplexing) wiring (e.g., choice of media, copper, coax, fiber) What really matters: latency bandwidth cost reliability

587 Interconnections (Networks)
Examples: MPP networks (SP2): 100s nodes; Š 25 meters per link Local Area Networks (Ethernet): 100s nodes; Š 1000 meters Wide Area Network (ATM): 1000s nodes; Š 5,000,000 meters a.k.a. end systems, hosts a.k.a. network, communication subnet Interconnection Network

588 More Network Background
Connection of 2 or more networks: Internetworking 3 cultures for 3 classes of networks MPP: performance, latency and bandwidth LAN: workstations, cost WAN: telecommunications, phone call revenue

589 ABCs of Networks Starting Point: Send bits between 2 computers
Queue (FIFO) on each end Information sent called a “message” Can send both ways (“Full Duplex”) Rules for communication? “protocol” Inside a computer: Loads/Stores: Request (Address) & Response (Data) Need Request & Response signaling

590 A Simple Example What is the format of a message?
Fixed? Number bytes? Request/ Response Address/Data 1 bit 32 bits 0: Please send data from Address 1: Packet contains data corresponding to request Header/Trailer: information to deliver a message Payload: data in message (1 word above)

591 Questions About Simple Example
What if more than 2 computers want to communicate? Need computer “address field” (destination) in packet What if packet is garbled in transit? Add “error detection field” in packet (e.g., CRC) What if packet is lost? More “elaborate protocols” to detect loss (e.g., NAK, ARQ, time outs) What if multiple processes/machine? Queue per process to provide protection Simple questions such as these lead to more complex protocols and packet formats => complexity

592 A Simple Example Revisited
What is the format of packet? Fixed? Number bytes? Request/ Response Address/Data CRC 1 bit 32 bits 4 bits 00: Request—Please send data from Address 01: Reply—Packet contains data corresponding to request 10: Acknowledge request 11: Acknowledge reply

593 Software to Send and Receive
SW Send steps 1: Application copies data to OS buffer 2: OS calculates checksum, starts timer 3: OS sends data to network interface HW and says start SW Receive steps 3: OS copies data from network interface HW to OS buffer 2: OS calculates checksum, if matches send ACK; if not, deletes message (sender resends when timer expires) 1: If OK, OS copies data to user address space and signals application to continue Sequence of steps for SW: protocol Example similar to UDP/IP protocol in UNIX

594 Network Performance Measures
Link bandwidth: 10 Mbit Ethernet What is interconnet BW? bus => link speed switch => 100 * link speed What gets quoted? link BW? What is latency? Interconnect? Overhead: latency of interface vs. Latency: network

595 Universal Performance Metrics
Sender Overhead Transmission time (size ÷ bandwidth) Sender (processor busy) Easy to get these things confused! Colors should help Min (link BW, bisection BW) Assumes no congestion Receiver usually longer Better to send then receive Store Like send Read like Receive Time of Flight Transmission time (size ÷ bandwidth) Receiver Overhead Receiver (processor busy) Transport Latency Total Latency Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead Includes header/trailer in BW calculation?

596 Example Performance Measures
Interconnect MPP LAN WAN Example CM-5 Ethernet ATM Bisection BW N x 5 MB/s MB/s N x 10 MB/s Int./Link BW 20 MB/s MB/s 10 MB/s Transport Latency 5 µsec 15 µsec 50 to 10,000 µs HW Overhead to/from 0.5/0.5 µs 6/6 µs 6/6 µs SW Overhead to/from 1.6/12.4 µs 200/241 µs 207/360 µs (TCP/IP on LAN/WAN) Software overhead dominates in LAN, WAN

597 Total Latency Example 10 Mbit/sec network, sending overhead of 230 µsec & receiving overhead of 270 µsec; a 1000 byte message (including the header); the network allows 1000 bytes in a single message. 2 situations: distance 0.1 km vs. 1000 km. Speed of light = 299,792.5 km/sec (signals travel at about 1/2 of that in the media). Latency_0.1km = ? Latency_1000km = ? Long time of flight => complex WAN protocol

598 Total Latency Example 10 Mbit/sec network, sending overhead of 230 µsec & receiving overhead of 270 µsec; a 1000 byte message (including the header); 1000 bytes fit in a single message. 2 situations: distance 0.1 km vs. 1000 km. Speed of light = 299,792.5 km/sec, roughly halved in the media.
Latency_0.1km = 230 + 270 + 0.1 km / (50% x 299,792.5 km/s) + 1000 x 8 bits / 10 Mbit/s = 230 + 270 + 0.67 + 800 ≈ 1301 µsec
Latency_1000km = 230 + 270 + 1000 km / (50% x 299,792.5 km/s) + 1000 x 8 bits / 10 Mbit/s = 230 + 270 + 6671 + 800 ≈ 7971 µsec
Long time of flight => complex WAN protocol
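The same two numbers can be reproduced with a few lines of Python (the function name is mine):

```python
def total_latency_us(dist_km, msg_bytes=1000, bw_bit_s=10e6,
                     send_oh_us=230, recv_oh_us=270):
    c_km_s = 299_792.5
    time_of_flight_us = dist_km / (0.5 * c_km_s) * 1e6   # signal at ~1/2 c in the media
    transmit_us = msg_bytes * 8 / bw_bit_s * 1e6          # message size / bandwidth
    return send_oh_us + time_of_flight_us + transmit_us + recv_oh_us

print(total_latency_us(0.1))    # ~1301 us
print(total_latency_us(1000))   # ~7971 us
```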

599 Simplified Latency Model
Total Latency ≈ Overhead + Message Size / BW, where Overhead = Sender Overhead + Time of Flight + Receiver Overhead. Example: show what happens as we vary Overhead: 1, 25, 500 µsec; BW: 10, 100, 1000 Mbit/sec (factors of 10); Message Size: 16 Bytes to 4 MB (factors of 4). If overhead is 500 µsec, how big must a message be before delivered BW exceeds 10 Mb/s?
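A small sketch exploring that question under the simplified model; the ~630-byte break-even point follows from the model itself, not from the slides:

```python
def delivered_mbit_s(msg_bytes, overhead_us, peak_mbit_s):
    """Delivered BW = message size / (overhead + message size / peak BW)."""
    time_s = overhead_us * 1e-6 + msg_bytes * 8 / (peak_mbit_s * 1e6)
    return msg_bytes * 8 / time_s / 1e6

for size in [16, 256, 4096, 65536, 4 * 2**20]:
    print(size, round(delivered_mbit_s(size, overhead_us=500, peak_mbit_s=1000), 2))
# With 500 us of overhead, messages must be roughly 630 bytes or larger before
# delivered bandwidth exceeds 10 Mbit/s, even on a 1000 Mbit/s link.
```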

600 Impact of Overhead on Delivered BW
BW model: Time = overhead + msg size/peak BW. To deliver more than 50% of peak BW, data must be transferred in packets of about 8 KB or larger

601 Building a Better Butterfly: The Multiplexed Metabutterfly
Chong, Brewer, Leighton, and Knight

602 Transformations

603 Outline Expander Networks Hierarchical Construction Multiplexing
Good, but hard to build Hierarchical Construction Much simpler Almost as good in theory Just as good in simulation Multiplexing Much better grouping Randomize for efficiency The cost of a butterfly the performance of a multibutterfly

604 Butterfly

605 Splitter Network

606 Dilated Butterfly

607 Wiring Splitters

608 Expansion Definition: A splitter has expansion if every set of S < aM inputs reaches at least bS outputs in each of r directions, where b < 1 and a < 1 / (br) Random wiring produces the best expansion

609 Faults and Congestion

610 Multibutterflies Randomly wired Extensively studied:
[Bassalygo & Pinsker 74] [Upfal 89] [Leighton & Maggs 89][Arora, Leighton & Maggs 90] Tremendous fault and congestion tolerance

611 What’s Wrong?

612 Wiring Complexity

613 Relative Complexity

614 Metabutterfly

615 Hierarchy Constant degree at each level

616 Random K-Extensions

617 K-Extension Properties
Preserve Expansion (with high probability): [Brewer, Chong, & Leighton, STOC94]

618 Empirical Results Methodology of [Chong & Knight, SPAA92]
Uniformly distributed router faults 1024-processor networks with 5 stages shared-memory traffic pattern Metabutterflies (with metanode sizes 4, 16, 32) perform as well as the Multibutterfly

619 Multiplexing Multiplex cables Random Destinations
Not possible with multibutterfly Random Destinations Can remove half the wires! 2X performance of comparable butterfly

620 Multiplexing (Bit-Inverse)
Over 5X better on bit-inverse Multiple logical paths without excess physical bandwidth

621 Load Balancing Why is bit-inverse worse than random?

622 Unbalanced Loading Solutions: balance bit-inverse
more wires in first stage more bandwidth in first stages

623 Randomized Multiplexing
Within cables, packet destination unimportant Could be random Assign each packet to any output Better bandwidth No fixed time slots No extra headers Dynamic randomness

624 Summary Metabutterfly Multiplexed Metabutterfly
best fault and congestion tolerance Multiplexed Metabutterfly comparable cost to butterfly much better fault and congestion tolerance K-extensions and Multiplexing applicable to other networks (eg fat trees)

625 Conclusions Other Expander-Based Networks Non-Random K-extensions
Fat-Trees Deterministic Constructions Non-Random K-extensions How many permutations? Other Networks with Multiplicity Expanders are great, but were hard to build K-extensions are the solution Allow Fixed Cabling Degree Retain Theoretical Properties Equal Multibutterflies in Simulation

626 Interconnect Outline Performance Measures (Metabutterfly Example)
Interface Issues

627 HW Interface Issues Where to connect network to computer?
Cache consistent to avoid flushes? (=> memory bus) Latency and bandwidth? (=> memory bus) Standard interface card? (=> I/O bus) MPP => memory bus; LAN, WAN => I/O bus CPU Network Network $ ideal: high bandwidth, low latency, standard interface I/O Controller I/O Controller L2 $ Memory Bus I/O bus Memory Bus Adaptor

628 SW Interface Issues How to connect network to software?
Programmed I/O?(low latency) DMA? (best for large messages) Receiver interrupted or received polls? Things to avoid Invoking operating system in common case Operating at uncached memory speed (e.g., check status of network interface)

629 CM-5 Software Interface
Overhead CM-5 example (MPP) Time per poll 1.6 µsecs; time per interrupt 19 µsecs Minimum time to handle message: 0.5 µsecs Enable/disable 4.9/3.8 µsecs As rate of messages arrving changes, use polling or interrupt? Solution: Always enable interrupts, have interrupt routine poll until until no messages pending Low rate => ­ interrupt High rate => ­ polling Time between messages

630 Interconnect Issues Performance Measures Interface Issues
Network Media

631 Network Media
Twisted Pair: copper, 1 mm thick, twisted to avoid the antenna effect (telephone wiring). Coaxial Cable: used by cable companies; high BW, good noise immunity (plastic covering, braided outer conductor, insulator, copper core). Fiber Optics: 3 parts are the cable, a light source, and a light detector; light is guided by total internal reflection in the silica. Multimode fiber: light disperses; LED source. Single mode fiber: single wavelength; laser diode source. Light sources: LED or laser diode; detector: photodiode

632 Costs of Network Media (1995)
Media | twisted pair copper wire | coaxial cable | multimode optical fiber | single mode optical fiber
Bandwidth | 1 Mb/s (20 Mb/s) | 10 Mb/s | 600 Mb/s | 2000 Mb/s
Distance | 2 km (0.1 km) | 1 km | 2 km | 100 km
Cost/meter | $0.23 | $1.64 | $1.03 |
Cost/interface | $2 | $5 | $1000 |
Note: more elaborate signal processing allows higher BW from copper (ADSL). Single mode fiber measures: BW * distance improving about 3X/year

633 Interconnect Issues Performance Measures Interface Issues
Network Media Connecting Multiple Computers

634 Connecting Multiple Computers
Shared Media vs. Switched: pairs communicate at same time: “point-to-point” connections Aggregate BW in switched network is many times shared point-to-point faster since no arbitration, simpler interface Arbitration in Shared network? Central arbiter for LAN? Listen to check if being used (“Carrier Sensing”) Listen to check if collision (“Collision Detection”) Random resend to avoid repeated collisions; not fair arbitration; OK if low utilization (A. K. A. data switching interchanges, multistage interconnection networks, interface message processors)

635 Example Interconnects
Interconnect | MPP | LAN | WAN
Example | CM-5 | Ethernet | ATM
Maximum length between nodes | 25 m | 500 m (≤ 5 repeaters) | copper: 100 m; optical: 2 km to 25 km
Number of data lines | 4 | 1 | 1
Clock Rate | 40 MHz | 10 MHz | MHz
Shared vs. Switch | Switch | Shared | Switch
Maximum number of nodes | | | > 10,000
Media Material | Copper | Twisted pair copper wire or coaxial cable | Twisted pair copper wire or optical fiber

636 Switch Topology Structure of the interconnect Determines
Degree: number of links from a node Diameter: max number of links crossed between nodes Average distance: number of hops to random destination Bisection: minimum number of links that separate the network into two halves (worst case) Warning: these three-dimensional drawings must be mapped onto chips and boards which are essentially two-dimensional media Elegant when sketched on the blackboard may look awkward when constructed from chips, cables, boards, and boxes (largely 2D) Networks should not be interesting!

637 Important Topologies (N = 1024)
Type | Degree | Diameter | Ave Dist | Bisection | Diam (N=1024) | Ave D (N=1024)
1D mesh | ≤ 2 | N-1 | ~N/3 | 1 | |
2D mesh | ≤ 4 | 2(N^(1/2) - 1) | ~2N^(1/2)/3 | N^(1/2) | |
3D mesh | ≤ 6 | 3(N^(1/3) - 1) | ~3N^(1/3)/3 | N^(2/3) | ~30 | ~10
nD mesh (N = k^n) | ≤ 2n | n(N^(1/n) - 1) | ~nN^(1/n)/3 | N^((n-1)/n) | |
Ring | 2 | N/2 | N/4 | 2 | |
2D torus | 4 | N^(1/2) | N^(1/2)/2 | 2N^(1/2) | |
k-ary n-cube (N = k^n) | 2n | nk/2 | nk/4 | 2k^(n-1) | |
Hypercube | n | n = log2 N | n/2 | N/2 | 10 | 5
(Figures: a 2^3 hypercube; cube-connected cycles)
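A quick sanity check of two of these rows for N = 1024 (structure names are mine):

```python
from math import log2, sqrt

N = 1024
hypercube = dict(degree=int(log2(N)), diameter=int(log2(N)),
                 ave_dist=log2(N) / 2, bisection=N // 2)
mesh_2d = dict(degree=4, diameter=int(2 * (sqrt(N) - 1)),
               ave_dist=2 * sqrt(N) / 3, bisection=int(sqrt(N)))
print(hypercube)   # degree 10, diameter 10, ave dist 5.0, bisection 512
print(mesh_2d)     # degree 4, diameter 62, ave dist ~21.3, bisection 32
```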

638 Topologies (cont) (N = 1024)
Type | Degree | Diameter | Ave Dist | Bisection | Diam (N=1024) | Ave D (N=1024)
2D Tree | 3 | 2 log2 N | ~2 log2 N | 1 | 20 | ~20
4D Tree | 5 | 2 log4 N | 2 log4 N - 2/ | | |
kD Tree | k+1 | logk N | | | |
2D fat tree | 4 | log2 N | | N | |
2D butterfly | 4 | log2 N | | N/2 | |
(Figures: fat tree; CM-5 thinned fat tree)

639 Butterfly Multistage: nodes at ends, switches in middle
All paths equal length Unique path from any input to any output Conflicts that try to avoid Don’t want algorithm to have to know paths N/2 Butterfly

640 Example MPP Networks No standard MPP topology!
Name Number Topology Bits Clock Link Bisect. Year nCube/ten cube 1 10 MHz iPSC/ cube 1 16 MHz MP D grid 1 25 MHz 3 1, Delta 540 2D grid MHz CM fat tree 4 40 MHz 20 10, CS fat tree 8 70 MHz 50 50, Paragon D grid MHz 200 6, T3D D Torus MHz , MBytes/second No standard MPP topology!

641 Summary: Interconnections
Communication between computers Packets for standards, protocols to cover normal and abnormal events Performance issues: HW & SW overhead, interconnect latency, bisection BW Media sets cost, distance Shared vs. Switched Media determines BW HW and SW Interface to computer affects overhead, latency, bandwidth Topologies: many to chose from, but (SW) overheads make them look alike; cost issues in topologies, not algorithms

642 Connection-Based vs. Connectionless
Telephone: operator sets up connection between the caller and the receiver Once the connection is established, conversation can continue for hours Share transmission lines over long distances by using switches to multiplex several conversations on the same lines “Time division multiplexing” divide B/W transmission line into a fixed number of slots, with each slot assigned to a conversation Problem: lines busy based on number of conversations, not amount of information sent Advantage: reserved bandwidth Before computers How share lines for multiple conversation Works until today; Continuous

643 Connection-Based vs. Connectionless
Connectionless: every package of information must have an address => packets Each package is routed to its destination by looking at its address Analogy, the postal system (sending a letter) also called “Statistical multiplexing” Note: “Split phase buses” are sending packets Analogy: post

644 Routing Messages Shared Media
Broadcast to everyone Switched Media needs real routing. Options: Source-based routing: message specifies path to the destination (changes of direction) Virtual Circuit: circuit established from source to destination, message picks the circuit to follow Destination-based routing: message specifies destination, switch must pick the path deterministic: always follow same path adaptive: pick different paths to avoid congestion, failures Randomized routing: pick between several good paths to balance network load Effect of a connection without it (not reserved) Adaptive leads to deadlocks; 2 nodes waiting for other

645 Deterministic Routing Examples
mesh: dimension-order routing: to go (x1, y1) -> (x2, y2), first travel Δx = x2 - x1 in the x dimension, then Δy = y2 - y1 in the y dimension. hypercube: e-cube routing: for X = x0 x1 x2 ... xn -> Y = y0 y1 y2 ... yn, compute R = X xor Y and traverse the dimensions with differing address bits in a fixed order. tree: route up to the common ancestor, then down. Deadlock free? (Figure: 3-bit hypercube with nodes 000, 001, 010, 011, 100, 101, 110, 111)
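A toy Python sketch of the two deterministic schemes named above (node coordinates and ids are illustrative):

```python
def mesh_dimension_order(src, dst):
    """2D mesh dimension-order routing: correct all of X first, then all of Y."""
    (x1, y1), (x2, y2) = src, dst
    hops = [('x', 1 if x2 > x1 else -1)] * abs(x2 - x1)
    hops += [('y', 1 if y2 > y1 else -1)] * abs(y2 - y1)
    return hops

def hypercube_ecube(src, dst, n_dims):
    """Hypercube e-cube routing: flip differing address bits in a fixed dimension order."""
    route = src ^ dst                              # R = X xor Y
    return [d for d in range(n_dims) if route >> d & 1]

print(mesh_dimension_order((0, 0), (2, 3)))        # two x hops, then three y hops
print(hypercube_ecube(0b000, 0b101, 3))            # flip dimensions 0 and 2
```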

646 Store and Forward vs. Cut-Through
Store-and-forward policy: each switch waits for the full packet to arrive in switch before sending to the next switch (good for WAN) Cut-through routing or worm hole routing: switch examines the header, decides where to send the message, and then starts forwarding it immediately In worm hole routing, when head of message is blocked, message stays strung out over the network, potentially blocking other messages (needs only buffer the piece of the packet that is sent between switches). CM-5 uses it, with each switch buffer being 4 bits per port. Cut through routing lets the tail continue when head is blocked, accordioning the whole message into a single switch. (Requires a buffer large enough to hold the largest packet).

647 Store and Forward vs. Cut-Through
Advantage Latency reduces from function of: number of intermediate switches X by the size of the packet to time for 1st part of the packet to negotiate the switches + the packet size ÷ interconnect BW

648 Congestion Control Packet-switched networks do not reserve bandwidth; this leads to contention (connection-based networks limit input instead). Solution: prevent packets from entering until contention is reduced (e.g., freeway on-ramp metering lights). Options: Packet discarding: if a packet arrives at a switch and there is no room in the buffer, the packet is discarded (e.g., UDP). Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet. Back-pressure: separate wires to tell the sender to stop. Window: give the original sender the right to send N packets before getting permission to send more; overlaps the latency of the interconnect with the overhead to send & receive a packet (e.g., TCP), adjustable window. Choke packets: aka "rate-based"; each packet received by a busy switch in warning state causes a choke packet to be sent back to the source, and the source reduces traffic to that destination by a fixed % (e.g., ATM). Mother's Day: "all circuits are busy". ATM packets: trying to avoid the FDDI problem; hope that ATM is used for local-area packets; rate-based for WAN, back-pressure for LAN. Telecommunications is standards based: decisions are made so that no one has an advantage; all have a disadvantage. Europe wanted 32 B and the US 64 B => 48 B payload
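A toy sketch of the window option above, showing only the sender-side bookkeeping; the function name and the FIFO "ACK" behavior are invented for illustration:

```python
from collections import deque

def send_with_window(packets, window=4):
    """Sender may have at most `window` unACKed packets outstanding; here the
    'network' ACKs in FIFO order, so this just illustrates the bookkeeping."""
    unacked = deque()
    log = []
    for seq, _pkt in enumerate(packets):
        if len(unacked) == window:            # window full: wait for the oldest ACK
            log.append(f"ack {unacked.popleft()}")
        unacked.append(seq)
        log.append(f"send {seq}")
    log.extend(f"ack {s}" for s in unacked)   # drain remaining ACKs
    return log

print(send_with_window(["p"] * 6, window=2))  # never more than 2 packets in flight
```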

649 Practical Issues for Interconnection Networks
Standardization advantages: low cost (components used repeatedly) stability (many suppliers to chose from) Standardization disadvantages: Time for committees to agree When to standardize? Before anything built? => Committee does design? Too early suppresses innovation Perfect interconnect vs. Fault Tolerant? Will SW crash on single node prevent communication? (MPP typically assume perfect) Reliability (vs. availability) of interconnect

650 Practical Issues Interconnection MPP LAN WAN Example CM-5 Ethernet ATM
Standard No Yes Yes Fault Tolerance? No Yes Yes Hot Insert? No Yes Yes Standards: required for WAN, LAN! Fault Tolerance: Can nodes fail and still deliver messages to other nodes? required for WAN, LAN! Hot Insert: If the interconnection can survive a failure, can it also continue operation while a new node is added to the interconnection? required for WAN, LAN!

651 Cross-Cutting Issues for Networking
Efficient Interface to Memory Hierarchy vs. to Network SPEC ratings => fast to memory hierarchy Writes go via write buffer, reads via L1 and L2 caches Example: 40 MHz SPARCStation(SS)-2 vs 50 MHz SS-20, no L2$ vs 50 MHz SS-20 with L2$ I/O bus latency; different generations SS-2: combined memory, I/O bus => 200 ns SS-20, no L2$: 2 busses +300ns => ns SS-20, w L2$: cache miss+500ns => 1000ns

652 Protocols: HW/SW Interface
Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently; Enabling technologies: SW standards that allow reliable communications without reliable networks Hierarchy of SW layers, giving each layer responsibility for portion of overall communications task, called protocol families or protocol suites Transmission Control Protocol/Internet Protocol (TCP/IP) This protocol family is the basis of the Internet IP makes best effort to deliver; TCP guarantees delivery TCP/IP used even when communicating locally: NFS uses IP even though communicating across homogeneous LAN WS companies used TCP/IP even over LAN Because early Ethernet controllers were cheap, but not reliable

653 FTP From Stanford to Berkeley
Hennessy FDDI FDDI Ethernet T3 FDDI Patterson Ethernet Ethernet BARRNet is WAN for Bay Area T1 is 1.5 mbps leased line; T3 is 45 mbps; FDDI is 100 mbps LAN IP sets up connection, TCP sends file

654 Protocol Key to protocol families is that communication occurs logically at the same level of the protocol, called peer-to-peer, but is implemented via services at the lower level Danger is each level increases latency if implemented as hierarchy (e.g., multiple check sums)

655 TCP/IP packet Application sends message
TCP breaks into 64KB segements, adds 20B header IP adds 20B header, sends to network If Ethernet, broken into 1500B packets with headers, trailers Header, trailers have length field, destination, window number, version, ... Ethernet IP Header TCP Header IP Data TCP data (Š 64KB)

656 Example Networks Ethernet: shared media, 10 Mbit/s, proposed in 1978; carrier sensing with exponential backoff on collision detection. 15 years with no improvement; higher BW? Multiple Ethernets with devices that allow the Ethernets to operate in parallel! 10 Mbit Ethernet successors? FDDI: shared media (too late); ATM (too late?); Switched Ethernet; 100 Mbit Ethernet (Fast Ethernet); Gigabit Ethernet

657 Connecting Networks Bridges: connect LANs together, passing traffic from one side to another depending on the addresses in the packet. operate at the Ethernet protocol level usually simpler and cheaper than routers Routers or Gateways: these devices connect LANs to WANs or WANs to WANs and resolve incompatible addressing. Generally slower than bridges, they operate at the internetworking protocol (IP) level Routers divide the interconnect into separate smaller subnets, which simplifies manageability and improves security Cisco is major supplier; basically special purpose computers

658 Example Networks MPP LAN WAN IBM SP-2 10 8 40 MHz Yes Š512 copper
320xNodes 320 284 100 Mb Ethernet 200 1 100 MHz No Š254 copper 100 -- ATM 100/1000 1 155/622… Yes ­10000 copper/fiber 155xNodes 155 80 Length (meters) Number data lines Clock Rate Switch? Nodes (N) Material Bisection BW (Mbit/s) Peak Link BW (Mbits/s) Measured Link BW

659 Example Networks (cont’d)
MPP LAN WAN IBM SP-2 1 39 Fat tree Yes No Back-pressure 100 Mb Ethernet 1.5 440 Line Yes No Carrier Sense ATM ­50 630 Star No Yes Choke packets Latency (µsecs) Send+Receive Ovhd (µsecs) Topology Connectionless? Store & Forward? Congestion Control Standard Fault Tolerance

660 Examples: Interface to Processor

661 Packet Formats Fields: Destination, Checksum(C), Length(L), Type(T)
Fixed size thought would keep hardware simpler, like fixed length Virtual circuit, in-oder delivery for ATM On top of ATM, there are own set of protocol layers, 5 AAL layers, closer to WAN (LAN like AAL, better for TCP/IP) TCP/IP vs. AAL wars; never understood what TCP/IP bigot Fields: Destination, Checksum(C), Length(L), Type(T) Data/Header Sizes in bytes: (4 to 20)/4, (0 to 1500)/26, 48/5

662 Example Switched LAN Performance
Network Interface Switch Link BW AMD Lance Ethernet Baynetworks 10 Mb/s EtherCell 28115 Fore SBA-200 ATM Fore ASX Mb/s Myricom Myrinet Myricom Myrinet 640 Mb/s On SPARCstation-20 running Solaris 2.4 OS Myrinet is example of “System Area Network”: networks for a single room or floor: 25m limit shorter => wider faster, less need for optical short distance => source-based routing => simpler switches Compaq-Tandem/Microsoft also sponsoring SAN, called “ServerNet”

663 Example Switched LAN Performance (1995)
Switch Switch Latency Baynetworks µsecs EtherCell 28115 Fore ASX-200 ATM µsecs Myricom Myrinet 0.5 µsecs Measurements taken from “LogP Quantyified: The Case for Low-Overhead Local Area Networks”, K. Keeton, T. Anderson, D. Patterson, Hot Interconnects III, Stanford California, August 1995.

664 UDP/IP performance Network UDP/IP roundtrip, N=8B Formula
Bay. EtherCell µsecs *N Fore ASX-200 ATM µsecs *N Myricom Myrinet µsecs *N Formula from simple linear regression for tests from N = 8B to N = 8192B Software overhead not tuned for Fore, Myrinet; EtherCell using standard driver for Ethernet

665 NFS performance Network Avg. NFS response LinkBW/Ether UDP/E.
Bay. EtherCell ms Fore ASX-200 ATM ms Myricom Myrinet ms Last 2 columns show ratios of link bandwidth and UDP roundtrip times for 8B message to Ethernet

666 Estimated Database performance (1995)
Network Avg. TPS LinkBW/E. TCP/E. Bay. EtherCell 77 tps Fore ASX-200 ATM 67 tps Myricom Myrinet 66 tps Number of Transactions per Second (TPS) for DebitCredit Benchmark; front end to server with entire database in main memory (256 MB) Each transaction => 4 messages via TCP/IP DebitCredit Message sizes < 200 bytes Last 2 columns show ratios of link bandwidth and TCP/IP roundtrip times for 8B message to Ethernet

667 Summary: Networking Protocols allow heterogeneous networking
Protocols allow operation in the presence of failures Internetworking protocols used as LAN protocols => large overhead for LAN Integrated circuit revolutionizing networks as well as processors Switch is a specialized computer Faster networks and slow overheads violate of Amdahl’s Law

668 Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” Almasi and Gottlieb, Highly Parallel Computing ,1989 Questions about parallel computers: How large a collection? How powerful are processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are HW and SW primitives for programmer? Does it translate into performance?

669 Parallel Processors “Religion”
The dream of computer architects since 1960: replicate processors to add performance vs. design a faster processor Led to innovative organization tied to particular programming models since “uniprocessors can’t keep going” e.g., uniprocessors must stop getting faster due to limit of speed of light: 1972, … , 1989 Borders religious fervor: you must believe! Fervor damped some when 1990s companies went out of business: Thinking Machines, Kendall Square, ... Argument instead is the “pull” of opportunity of scalable performance, not the “push” of uniprocessor performance plateau

670 Opportunities: Scientific Computing
Nearly Unlimited Demand (Grand Challenge problems): applications vs. performance (GFLOPS) and memory (GB): 48 hour weather; 72 hour weather: 3 GFLOPS, 1 GB; pharmaceutical design; global change, genome (Figure 1-2, page 25, of Culler, Singh, Gupta [CSG97]). Successes in some real industries: Petroleum: reservoir modeling; Automotive: crash simulation, drag analysis, engine; Aeronautics: airflow analysis, engine, structural mechanics; Pharmaceuticals: molecular modeling; Entertainment: full length movies ("Toy Story")

671 Example: Scientific Computing
Molecular Dynamics on Intel Paragon with 128 processors (1994) (see Chapter 1, Figure 1-3, page 27 of Culler, Singh, Gupta [CSG97]) Classic MPP slide: processors v. speedup Improve over time: load balancing, other 128 processor Intel Paragon = 406 MFLOPS C90 vector = 145 MFLOPS (or ­ 45 Intel processors)

672 Opportunities: Commercial Computing
Transaction processing & the TPC-C benchmark (see Chapter 1, Figure 1-4, page 28 of [CSG97]): small scale parallel processors to large scale; throughput (transactions per minute) vs. time (1996). Speedup: IBM RS/6000 and Tandem Himalaya; IBM takes a performance hit going 1=>4 processors, good 4=>8; Tandem scales: 112/16 = 7.0. Others: file servers, electronic CAD simulation (multiple processes), WWW search engines

673 What level Parallelism?
Bit level parallelism: 1970 to ~1985: 4-bit, 8-bit, 16-bit, 32-bit microprocessors. Instruction level parallelism (ILP): from then through today: pipelining, superscalar, VLIW, out-of-order execution. Limits to the benefits of ILP? Process-level or thread-level parallelism: mainstream for general purpose computing? Servers are parallel (see Fig. 1-8, p. 37 of [CSG97]). High-end desktop: dual processor PC soon?? (or just sell the socket?)

674 Whither Supercomputing?
Linpack (dense linear algebra) for Vector Supercomputers vs. Microprocessors “Attack of the Killer Micros” (see Chapter 1, Figure 1-10, page 39 of [CSG97]) 100 x 100 vs x 1000 MPPs vs. Supercomputers when rewrite linpack to get peak performance (see Chapter 1, Figure 1-11, page 40 of [CSG97]) 500 fastest machines in the world: parallel vector processors (PVP), bus-based shared memory (SMP), and MPPs (see Chapter 1, Figure 1-12, page 41 of [CSG97])

675 Parallel Architecture
Parallel Architecture extends traditional computer architecture with a communication architecture abstractions (HW/SW interface) organizational structure to realize abstraction efficiently

676 Parallel Framework Layers:
(see Chapter 1, Figure 1-13, page 42 of [CSG97]) Programming Model: Multiprogramming: lots of jobs, no communication; Shared address space: communicate via memory; Message passing: send and receive messages; Data Parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing). Communication Abstraction: Shared address space: e.g., load, store, atomic swap; Message passing: e.g., send, receive library calls. Debate over this topic (ease of programming, scaling) => many hardware designs tied 1:1 to a programming model
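A toy contrast of the first two communication abstractions, using Python threads and a queue purely as stand-ins for a parallel machine (all names are illustrative):

```python
import threading, queue

# Shared address space: communicate via loads/stores to shared data, plus a lock.
shared = {"sum": 0}
lock = threading.Lock()
def sm_worker(vals):
    for v in vals:
        with lock:
            shared["sum"] += v          # "store" into the shared address space

# Message passing: communicate only by send/receive on an explicit channel.
chan = queue.Queue()
def mp_worker(vals):
    chan.put(sum(vals))                 # "send" a partial result

data = [list(range(0, 5)), list(range(5, 10))]
threads = [threading.Thread(target=sm_worker, args=(d,)) for d in data] + \
          [threading.Thread(target=mp_worker, args=(d,)) for d in data]
for t in threads: t.start()
for t in threads: t.join()
print(shared["sum"], chan.get() + chan.get())   # both styles compute 45
```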

677 Shared Address Model Summary
Each processor can name every physical location in the machine; each process can name all data it shares with other processes. Data transfer via load and store; data size: byte, word, ... or cache blocks. Uses virtual memory to map virtual addresses to local or remote physical addresses. The memory hierarchy model applies: communication now moves data to the local processor cache (just as a load moves data from memory to cache). Latency, BW, and scalability when communicating?

678 Next Time Interconnect Networks Introduction to Multiprocessing

679 Lecture 8: Parallel Processing
Prof. Fred Chong ECS 250A Computer Architecture Winter 1999 (Adapted from Culler CS258 and Dally EE282)

680 ECS250B Goal: Cover Advanced Architecture Topics
Focus on ILP (Instruction Level Parallelism) Data Parallelism in 250C Exploit existing knowledge base Review Tomasulo’s, not learn it Look at most recent research advances Maybe make our own!

681 Administrative Issues
Text: New text by Sohi, Hill and Jouppi Important papers plus their commentary Additional Papers of importance Workload Daily “quiz” Project at end Plan to publish!

682 Some Topics we will cover
Speculation (key to ILP) Value prediction Branch prediction Controlling Speculation (Confidence) Power considerations Feeding the beast (Instruction Fetch issues)

683 More Topics we will cover
Memory System Issues Smarter Cache Management Type-specific caches Cache utilization issues Memory Prefetch Bandwidth and Compression (hello, Justin!)

684 Still More Topics we will cover
What else is there? If time, we will cover it

685 Research Plan Learn Tools to support investigations
Simplescalar SimOS Many others Create Tools to support investigations You tell me, I’m all ears! (Well, mostly) Investigate!

686 Carrot Micro32 is in Haifa in November Submission Deadline is June 1st
see Submission Deadline is June 1st If your paper is accepted, somebody has to present it ...

687 Parallel Programming Motivating Problems (application case studies)
Process of creating a parallel program. What a simple parallel program looks like in three major programming models. What primitives must a system support?

688 Simulating Ocean Currents
(a) Cross sections (b) Spatial discretization of a cross section Model as two-dimensional grids Discretize in space and time finer spatial and temporal resolution => greater accuracy Many different computations per time step set up and solve equations Concurrency across and within grid computations Static and regular

689 Simulating Galaxy Evolution
Simulate the interactions of many stars evolving over time. Computing forces is expensive: O(n²) for the brute-force approach (sketched below). Hierarchical methods take advantage of the force law F = G m1 m2 / r². Many time-steps, plenty of concurrency across stars within one time-step
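The brute-force approach is just a doubly nested loop over pairs of stars; a minimal C sketch (the Body struct, 2-D coordinates, and the softening term are illustrative choices, not from the slide):

```c
#include <math.h>

typedef struct { double x, y, m, fx, fy; } Body;

/* Brute-force pairwise gravitational forces: O(n^2) work per time step. */
void compute_forces(Body *b, int n, double G)
{
    for (int i = 0; i < n; i++) { b[i].fx = 0.0; b[i].fy = 0.0; }
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
            double r2 = dx * dx + dy * dy + 1e-9;   /* softening to avoid divide-by-zero */
            double f  = G * b[i].m * b[j].m / r2;   /* F = G m1 m2 / r^2 */
            double r  = sqrt(r2);
            b[i].fx += f * dx / r;  b[i].fy += f * dy / r;
            b[j].fx -= f * dx / r;  b[j].fy -= f * dy / r;   /* equal and opposite */
        }
    }
}
```

Hierarchical methods such as Barnes-Hut replace the inner loop over all stars with a walk over a spatial tree, trading the O(n²) cost for O(n log n).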

690 Rendering Scenes by Ray Tracing
Shoot rays into the scene through pixels in the image plane. Follow their paths as they bounce around; as they strike objects they generate new rays: a ray tree per input ray. Result is color and opacity for that pixel. Parallelism across rays. How much concurrency in these examples?

691 Creating a Parallel Program
Pieces of the job: Identify work that can be done in parallel work includes computation, data access and I/O Partition work and perhaps data among processes Manage data access, communication and synchronization

692 Definitions Task: Process (thread): Processor:
Task: arbitrary piece of work in a parallel computation; executed sequentially, so concurrency is only across tasks; e.g., a particle/cell in Barnes-Hut, a ray or ray group in Raytrace; fine-grained versus coarse-grained tasks. Process (thread): abstract entity that performs the tasks assigned to it; processes communicate and synchronize to perform their tasks. Processor: physical engine on which a process executes; processes virtualize the machine to the programmer: write the program in terms of processes, then map to processors

693 4 Steps in Creating a Parallel Program
Decomposition of computation in tasks Assignment of tasks to processes Orchestration of data access, comm, synch. Mapping processes to processors

694 Decomposition Identify concurrency and decide level at which to exploit it Break up computation into tasks to be divided among processes Tasks may become available dynamically No. of available tasks may vary with time Goal: Enough tasks to keep processes busy, but not too many Number of tasks available at a time is upper bound on achievable speedup

695 Limited Concurrency: Amdahl’s Law
Most fundamental limitation on parallel speedup: if fraction s of sequential execution is inherently serial, speedup <= 1/s. Example: 2-phase calculation: sweep over an n-by-n grid and do some independent computation, then sweep again and add each value to a global sum. Time for the first phase = n²/p; the second phase is serialized at the global variable, so its time = n². Speedup <= 2n² / (n²/p + n²), or at most 2. Trick: divide the second phase into two: accumulate into a private sum during the sweep, then add the per-process private sums into the global sum. Parallel time is n²/p + n²/p + p, and speedup is at best 2n² / (2n²/p + p) = 2n²p / (2n² + p²), which approaches p for large n (a numeric check follows below)
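A small C check of the two bounds under the slide's cost model (the values of n and p are arbitrary illustrations):

```c
#include <stdio.h>

int main(void)
{
    double n = 1024.0, p = 64.0;        /* illustrative problem and machine size */
    double seq = 2.0 * n * n;           /* two sweeps over the n-by-n grid */

    /* Serialized global sum: phase 1 takes n^2/p, phase 2 takes n^2. */
    double naive = seq / (n * n / p + n * n);

    /* Private partial sums: n^2/p + n^2/p + p. */
    double improved = seq / (2.0 * n * n / p + p);

    printf("naive speedup    = %.2f (bounded by 2)\n", naive);
    printf("improved speedup = %.2f (approaches p = %.0f)\n", improved, p);
    return 0;
}
```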

696 Understanding Amdahl’s Law
(Figure: concurrency profiles for the three versions, panels (a), (b), (c): work done concurrently vs. time, with concurrency levels 1, p, n²/p, and n² marked.)

697 Concurrency Profiles Area under curve is total work done, or time with 1 processor Horizontal extent is lower bound on time (infinite processors) Amdahl’s law applies to any overhead, not just limited concurrency

698 Orchestration: naming data, structuring communication, synchronization, organizing data structures and scheduling tasks temporally. Goals: reduce cost of communication and synch.; preserve locality of data reference; schedule tasks to satisfy dependences early; reduce overhead of parallelism management. Choices depend on prog. model, comm. abstraction, efficiency of primitives. Architects should provide appropriate primitives efficiently

699 Mapping Two aspects: space-sharing System allocation Real world
Which process runs on which particular processor? (mapping to a network topology) Will multiple processes run on the same processor? Space-sharing: machine divided into subsets, only one app at a time in a subset; processes can be pinned to processors, or left to the OS (system allocation). Real world: the user specifies desires in some aspects, the system handles some. Usually adopt the view: process <-> processor

700 Parallelizing Computation vs. Data
Computation is decomposed and assigned (partitioned); partitioning data is often a natural view too. Computation follows data: owner computes (grid example; data mining). Distinction between computation and data is stronger in many applications: Barnes-Hut, Raytrace

701 Architect’s Perspective
What can be addressed by better hardware design? What is fundamentally a programming issue?

702 High-level Goals High performance (speedup over sequential program)
But low resource usage and development effort Implications for algorithm designers and architects?

703 What Parallel Programs Look Like

704 Example: iterative equation solver
Simplified version of a piece of Ocean simulation Illustrate program in low-level parallel language C-like pseudocode with simple extensions for parallelism Expose basic comm. and synch. primitives State of most real parallel programming today

705 Grid Solver Gauss-Seidel (near-neighbor) sweeps to convergence
interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep; updates done in place in the grid; difference from the previous value computed; partial diffs accumulated into a global diff at the end of every sweep; check if it has converged to within a tolerance parameter

706 Sequential Version
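The code for this slide is not reproduced in the transcript; a minimal C sketch of the sequential sweep described under Grid Solver above (the 0.2 five-point stencil weight and the TOL value are the conventional choices and are assumptions here):

```c
#include <math.h>

#define TOL 1e-3   /* convergence tolerance (illustrative value) */

/* A is an (n+2)-by-(n+2) grid; only the interior n-by-n points are updated. */
void solve(double **A, int n)
{
    int done = 0;
    while (!done) {
        double diff = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j]);   /* in-place near-neighbor update */
                diff += fabs(A[i][j] - temp);               /* accumulate the difference */
            }
        }
        if (diff / (n * n) < TOL) done = 1;                 /* converged? */
    }
}
```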

707 Decomposition Simple way to identify concurrency is to look at loop iterations dependence analysis; if not enough concurrency, then look further Not much concurrency here at this level (all loops sequential) Examine fundamental dependences Concurrency O(n) along anti-diagonals, serialization O(n) along diag. Retain loop structure, use pt-to-pt synch; Problem: too many synch ops. Restructure loops, use global synch; imbalance and too much synch

708 Exploit Application Knowledge
Reorder grid traversal: red-black ordering. A different ordering of updates may converge quicker or slower. Red sweep and black sweep are each fully parallel: global synch between them (conservative but convenient). Ocean uses red-black; we use the simpler, asynchronous one to illustrate: no red-black, simply ignore dependences within a sweep, so the parallel program is nondeterministic

709 Decomposition Decomposition into elements: degree of concurrency n2
Decompose into rows?

710 Assignment Static assignment: decomposition into rows
block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋, i.e., a contiguous block of n/p rows per process; cyclic assignment of rows: process i is assigned rows i, i+p, i+2p, ... Dynamic assignment: get a row index, work on the row, get a new row, ... What is the mechanism? Concurrency? Volume of communication? (a sketch of the static mappings follows below)
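As referenced above, a minimal sketch of the two static row-to-process mappings (helper names are illustrative; the block case assumes p divides n):

```c
/* Block assignment: process pid owns the contiguous rows [pid*(n/p)+1, (pid+1)*(n/p)]. */
int block_owner(int row, int n, int p)  { return (row - 1) / (n / p); }

void my_block_rows(int pid, int n, int p, int *lo, int *hi)
{
    *lo = pid * (n / p) + 1;          /* grid rows are numbered 1..n */
    *hi = *lo + (n / p) - 1;
}

/* Cyclic assignment: process pid owns rows pid+1, pid+1+p, pid+1+2p, ... */
int cyclic_owner(int row, int p)        { return (row - 1) % p; }
```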

711 Data Parallel Solver

712 Shared Address Space Solver
Single Program Multiple Data (SPMD) Assignment controlled by values of variables used as loop bounds

713 Generating Threads

714 Assignment Mechanism

715 SAS Program SPMD: not lockstep. Not necessarily same instructions
Assignment controlled by values of variables used as loop bounds unique pid per process, used to control assignment done condition evaluated redundantly by all Code that does the update identical to sequential program each process has private mydiff variable Most interesting special operations are for synchronization accumulations into shared diff have to be mutually exclusive why the need for all the barriers? Good global reduction? Utility of this parallel accumulate???
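A sketch of the SPMD shared-address-space solver these notes describe, in the [CSG97] pseudocode style: LOCK/UNLOCK/BARRIER are the text's parallel macros, and the declarations of the shared grid A, diff, diff_lock, bar, and TOL are assumed to exist elsewhere:

```c
#include <math.h>

extern double **A, diff;          /* shared grid and global difference */
/* diff_lock, bar, TOL: shared lock, barrier, and tolerance, assumed declared elsewhere */

void solve_sas(int n, int p, int pid)
{
    int done = 0;                                        /* private: evaluated redundantly by all */
    int lo = pid * (n / p) + 1, hi = lo + (n / p) - 1;   /* my block of rows (assumes p divides n) */
    while (!done) {
        double mydiff = 0.0;                             /* private partial sum */
        if (pid == 0) diff = 0.0;                        /* reset the shared accumulator... */
        BARRIER(bar, p);                                 /* ...before anyone adds to it */
        for (int i = lo; i <= hi; i++)
            for (int j = 1; j <= n; j++) {
                double temp = A[i][j];
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                 A[i][j+1] + A[i+1][j]);
                mydiff += fabs(A[i][j] - temp);
            }
        LOCK(diff_lock);                                 /* one mutually exclusive add per process */
        diff += mydiff;
        UNLOCK(diff_lock);
        BARRIER(bar, p);                                 /* all contributions in before the test */
        if (diff / (n * n) < TOL) done = 1;              /* same test done by every process */
        BARRIER(bar, p);                                 /* nobody may reset diff while others still read it */
    }
}
```

The three barriers answer the slide's question: one protects the reset of diff, one ensures all partial sums are in before the convergence test, and one keeps the next iteration's reset from racing with stragglers still reading diff.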

716 Mutual Exclusion Why is it needed?
Provided by LOCK-UNLOCK around critical section Set of operations we want to execute atomically Implementation of LOCK/UNLOCK must guarantee mutual excl. Serialization? Contention? Non-local accesses in critical section? use private mydiff for partial accumulation!

717 Global Event Synchronization
BARRIER(nprocs): wait here till nprocs processes get here. Built using lower-level primitives. Global sum example: wait for all to accumulate before using the sum. Often used to separate phases of computation: every process P_1 ... P_nprocs does: set up eqn system; Barrier(name, nprocs); solve eqn system; Barrier(name, nprocs); apply results; Barrier(name, nprocs). Conservative form of preserving dependences, but easy to use. WAIT_FOR_END(nprocs-1)

718 Pt-to-pt Event Synch (Not Used Here)
One process notifies another of an event so it can proceed. Common example: producer-consumer (bounded buffer). Concurrent programming on a uniprocessor: semaphores. Shared address space parallel programs: semaphores, or use ordinary variables as flags. P1: A = 1; b: flag = 1;  P2: a: while (flag is 0) do nothing; print A; Busy-waiting or spinning (an atomics version is sketched below)
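As noted above, the same flag idiom written with C11 atomics, which makes the required ordering between the data write and the flag write explicit (a sketch; the slide's plain-variable version leaves that ordering to the machine):

```c
#include <stdatomic.h>
#include <stdio.h>

int A;                                   /* the data being communicated */
atomic_int flag = 0;                     /* the event flag */

void producer(void)                      /* process P1 */
{
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);   /* b: flag = 1 */
}

void consumer(void)                      /* process P2 */
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                /* a: busy-wait (spin) */
    printf("%d\n", A);                   /* sees A == 1 once the flag is observed */
}
```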

719 Group Event Synchronization
Subset of processes involved Can use flags or barriers (involving only the subset) Concept of producers and consumers Major types: Single-producer, multiple-consumer Multiple-producer, single-consumer

720 Message Passing Grid Solver
Cannot declare A to be global shared array compose it logically from per-process private arrays usually allocated in accordance with the assignment of work process assigned a set of rows allocates them locally Transfers of entire rows between traversals Structurally similar to SPMD SAS Orchestration different data structures and data access/naming communication synchronization Ghost rows

721 Data Layout and Orchestration
(Figure: the grid rows partitioned in contiguous blocks across the processes.) Data partition allocated per processor. Send edge rows to neighbors; add ghost rows to hold boundary data; receive into ghost rows; compute as in the sequential program

722

723 Notes on Message Passing Program
Use of ghost rows Receive does not transfer data, send does unlike SAS which is usually receiver-initiated (load fetches data) Communication done at beginning of iteration, so no asynchrony Communication in whole rows, not element at a time Core similar, but indices/bounds in local rather than global space Synchronization through sends and receives Update of global diff and event synch for done condition Could implement locks and barriers with messages Can use REDUCE and BROADCAST library calls to simplify code
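A sketch of the message-passing solver these notes describe, using generic SEND/RECEIVE primitives in the [CSG97] pseudocode style (the tag names ROW/DIFF/DONE and TOL are assumed constants; sends are assumed asynchronous and buffered):

```c
/* Each process owns rows 1..nlocal of its private array myA[0..nlocal+1][0..n+1];
   rows 0 and nlocal+1 are ghost copies of the neighbors' boundary rows. */
void solve_mp(double **myA, int nlocal, int n, int pid, int p)
{
    int done = 0;
    while (!done) {
        double mydiff = 0.0;

        /* Exchange boundary rows with neighbors before the sweep (would deadlock
           with synchronous sends; the exchange would then need reordering). */
        if (pid > 0)     SEND(&myA[1][0],      (n + 2) * sizeof(double), pid - 1, ROW);
        if (pid < p - 1) SEND(&myA[nlocal][0], (n + 2) * sizeof(double), pid + 1, ROW);
        if (pid > 0)     RECEIVE(&myA[0][0],          (n + 2) * sizeof(double), pid - 1, ROW);
        if (pid < p - 1) RECEIVE(&myA[nlocal + 1][0], (n + 2) * sizeof(double), pid + 1, ROW);

        for (int i = 1; i <= nlocal; i++)              /* local indices, local bounds */
            for (int j = 1; j <= n; j++) {
                double temp = myA[i][j];
                myA[i][j] = 0.2 * (myA[i][j] + myA[i][j-1] + myA[i-1][j] +
                                   myA[i][j+1] + myA[i+1][j]);
                mydiff += fabs(myA[i][j] - temp);
            }

        /* Global diff and done test via messages; REDUCE/BROADCAST calls would simplify this. */
        if (pid != 0) {
            SEND(&mydiff, sizeof(double), 0, DIFF);
            RECEIVE(&done, sizeof(int), 0, DONE);
        } else {
            double diff = mydiff, d;
            for (int k = 1; k < p; k++) { RECEIVE(&d, sizeof(double), k, DIFF); diff += d; }
            done = (diff / (n * n) < TOL);
            for (int k = 1; k < p; k++) SEND(&done, sizeof(int), k, DONE);
        }
    }
}
```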

724 Send and Receive Alternatives
Can extend functionality: stride, scatter-gather, groups. Semantic flavors, based on when control is returned, affect when data structures or buffers can be reused at either end. Send/Receive flavors: synchronous; asynchronous (blocking asynch. and nonblocking asynch.). Affect event synch (mutual excl. by fiat: only one process touches the data). Affect ease of programming and performance. Synchronous messages provide built-in synch. through the match; separate event synchronization is needed with asynch. messages. With synch. messages, our code is deadlocked. Fix? (see the MPI illustration below)
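As referenced above, one concrete realization of these flavors is MPI (shown purely as an illustration, not part of the slide): MPI_Ssend is synchronous, MPI_Send is blocking and may buffer, MPI_Isend/MPI_Irecv are nonblocking, and MPI_Sendrecv is one standard fix for the deadlock with synchronous sends:

```c
#include <mpi.h>

/* Deadlock-free boundary-row exchange: send in one direction while receiving
   from the other. up/down are neighbor ranks, or MPI_PROC_NULL at the edges. */
void exchange(double *top, double *bot, double *ghost_top, double *ghost_bot,
              int n, int up, int down)
{
    MPI_Sendrecv(top,       n, MPI_DOUBLE, up,   0,
                 ghost_bot, n, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(bot,       n, MPI_DOUBLE, down, 1,
                 ghost_top, n, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```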

725 Orchestration: Summary
Shared address space Shared and private data explicitly separate Communication implicit in access patterns No correctness need for data distribution Synchronization via atomic operations on shared data Synchronization explicit and distinct from data communication Message passing Data distribution among local address spaces needed No explicit shared structures (implicit in comm. patterns) Communication is explicit Synchronization implicit in communication (at least in synch. case) mutual exclusion by fiat

726 Correctness in Grid Solver Program
Decomposition and Assignment similar in SAS and message-passing Orchestration is different Data structures, data access/naming, communication, synchronization Performance?

727 History of Parallel Architectures
Parallel architectures tied closely to programming models. Divergent architectures, with no predictable pattern of growth. Mid-80s renaissance. (Figure: application software and system software atop divergent architecture camps: systolic arrays, SIMD, message passing, dataflow, shared memory.)

728 Convergence Look at major programming models
where did they come from? The 80s architectural renaissance! What do they provide? How have they converged? Extract general structure and fundamental issues. Reexamine traditional camps from a new perspective. (Figure: systolic arrays, SIMD, message passing, dataflow, and shared memory converging toward a generic architecture.)

729 Programming Model Conceptualization of the machine that programmer uses in coding applications How parts cooperate and coordinate their activities Specifies communication and synchronization operations Multiprogramming no communication or synch. at program level Shared address space like bulletin board Message passing like letters or phone calls, explicit point to point Data parallel: more regimented, global actions on data Implemented with shared address space or message passing

730 Structured Shared Address Space
(Figure: virtual address spaces for a collection of processes communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine's physical address space, the private portions do not.) Ad hoc parallelism used in system code. Most parallel applications have structured SAS: same program on each processor; shared variable X means the same thing to each thread

731 Engineering: Intel Pentium Pro Quad
All coherence and multiprocessing glue in processor module Highly integrated, targeted at high volume Low latency and bandwidth

732 Engineering: SUN Enterprise
Proc + mem card, or I/O card; 16 cards of either type. All memory accessed over the bus, so symmetric. Higher bandwidth, higher latency bus

733 Scaling Up “Dance hall” Distributed memory
(Figure: "dance hall" organization with processors and caches on one side of the network and memories on the other, vs. distributed memory with a memory attached to each processor node.) Problem is the interconnect: cost (crossbar) or bandwidth (bus). Dance hall: bandwidth still scalable at lower cost than a crossbar; latencies to memory uniform, but uniformly large. Distributed memory or non-uniform memory access (NUMA): construct a shared address space out of simple message transactions across a general-purpose network (e.g., read-request, read-response). Caching shared (particularly nonlocal) data?

734 Engineering: Cray T3E Scale up to 1024 processors, 480MB/s links
Memory controller generates request message for non-local references No hardware mechanism for coherence SGI Origin etc. provide this

735 Message Passing Architectures
Complete computer as building block, including I/O. Communication via explicit I/O operations. Programming model: direct access only to private address space (local memory); communication via explicit messages (send/receive). High-level block diagram: communication integration? Mem, I/O, LAN, Cluster. Easier to build and scale than SAS. Programming model more removed from basic hardware operations: library or OS intervention

736 Message-Passing Abstraction
(Figure: process P executes Send X, Q, t and process Q executes Receive Y, t; the match is on process and tag, and X and Y are addresses in the two local address spaces.) Send specifies the buffer to be transmitted and the receiving process; Recv specifies the sending process and the application storage to receive into. Memory-to-memory copy, but need to name processes. Optional tag on send and matching rule on receive. User process names local data and entities in process/tag space too. In simplest form, the send/recv match achieves a pairwise synch event; other variants too. Many overheads: copying, buffer management, protection

737 Evolution of Message-Passing Machines
Early machines: FIFO on each link; HW close to the programming model; synchronous ops; topology central (hypercube algorithms). CalTech Cosmic Cube (Seitz, CACM Jan 85)

738 Diminishing Role of Topology
Shift to general links: DMA, enabling non-blocking ops (buffered by the system at the destination until recv); store-and-forward routing. Diminishing role of topology: any-to-any pipelined routing, so the node-network interface dominates communication time; simplifies programming; allows a richer design space (grids vs. hypercubes). Intel iPSC/1 -> iPSC/2 -> iPSC/860. Cost: H x (T0 + n/B) for store-and-forward vs. T0 + H*d + n/B for pipelined routing (H hops, per-hop delay d, n bytes, bandwidth B)

739 Example Intel Paragon

740 Building on the mainstream: IBM SP-2
Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bw limited by I/O bus)

741 Berkeley NOW 100 Sun Ultra2 workstations Intelligent network interface
proc + mem Myrinet Network 160 MB/s per link 300 ns per hop

742 Toward Architectural Convergence
Evolution and role of software have blurred boundary Send/recv supported on SAS machines via buffers Can construct global address space on MP (GA -> P | LA) Page-based (or finer-grained) shared virtual memory Hardware organization converging too Tighter NI integration even for MP (low-latency, high-bandwidth) Hardware SAS passes messages Even clusters of workstations/SMPs are parallel systems Emergence of fast system area networks (SAN) Programming models distinct, but organizations converging Nodes connected by general network and communication assists Implementations also converging, at least in high-end machines

743 Programming Models Realized by Protocols
(Figure: layers.) Parallel applications: CAD, database, scientific modeling. Programming models: multiprogramming, shared address, message passing, data parallel. Communication abstraction (user/system boundary): compilation or library, operating systems support. Communication hardware (hardware/software boundary): physical communication medium. Network transactions

744 Shared Address Space Abstraction
Fundamentally a two-way request/response protocol writes have an acknowledgement Issues fixed or variable length (bulk) transfers remote virtual or physical address, where is action performed? deadlock avoidance and input buffer full coherent? consistent?

745 The Fetch Deadlock Problem
Even if a node cannot issue a request, it must sink network transactions. Incoming transaction may be a request, which will generate a response. Closed system (finite buffering)

746 Consistency write-atomicity violated without caching

747 Key Properties of Shared Address Abstraction
Source and destination data addresses are specified by the source of the request: a degree of logical coupling and trust; no storage logically "outside the address space" (may employ temporary buffers for transport). Operations are fundamentally request-response. A remote operation can be performed on remote memory; logically it does not require intervention of the remote processor

748 Message passing Bulk transfers Complex synchronization semantics
more complex protocols, more complex action. Synchronous: send completes after the matching recv and the source data is sent; receive completes after the data transfer from the matching send is complete. Asynchronous: send completes after the send buffer may be reused

749 Synchronous Message Passing
Processor Action? Constrained programming model. Deterministic! Destination contention very limited.

750 Asynch. Message Passing: Optimistic
More powerful programming model Wildcard receive => non-deterministic Storage required within msg layer?

751 Asynch. Msg Passing: Conservative
Where is the buffering? Contention control? Receiver initiated protocol? Short message optimizations

752 Key Features of Msg Passing Abstraction
Source knows the send data address, dest. knows the receive data address; after the handshake they both know both. Arbitrary storage "outside the local address spaces": may post many sends before any receives; non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors (fine print says these are limited too). Fundamentally a 3-phase transaction: includes a request / response; can use optimistic 1-phase in limited "safe" cases (credit scheme)

753 Active Messages User-level analog of network transaction Request/Reply
(Figure: request handler and reply handler at the two ends of the transaction.) User-level analog of a network transaction: transfer a data packet and invoke a handler to extract it from the network and integrate it with the on-going computation. Request/Reply. Event notification: interrupts, polling, events? May also perform memory-to-memory transfer (a sketch follows below)
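A sketch of the active-message idea with a hypothetical API (am_request/am_reply and the handler signature are invented for illustration; real interfaces differ in detail):

```c
#include <stddef.h>

typedef void (*am_handler_t)(int src, void *payload, int nbytes);

/* Hypothetical injection primitives: stubs standing in for the messaging layer,
   which would format a packet and launch it into the network. */
void am_request(int dest, am_handler_t h, void *payload, int nbytes) { /* format + inject */ }
void am_reply  (int src,  am_handler_t h, void *payload, int nbytes) { /* same, reply path */ }

double global_sum;                                        /* part of the ongoing computation */

void ack_handler(int src, void *payload, int nbytes) { /* e.g., mark the request complete */ }

/* Runs at the destination when the request arrives: extract the data from the
   network, integrate it with the computation, then send the reply. */
void sum_handler(int src, void *payload, int nbytes)
{
    global_sum += *(double *)payload;
    am_reply(src, ack_handler, NULL, 0);
}

/* Sender side: contribute my partial sum to node 0. */
void contribute(double mysum) { am_request(0, sum_handler, &mysum, sizeof mysum); }
```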

754 Common Challenges Input buffer overflow
N-1 queue over-commitment => must slow sources. Options: reserve space per source (credit): when is it available for reuse? (ack, or higher level); refuse input when full: backpressure in a reliable network, tree saturation, must stay deadlock free, and what happens to traffic not bound for the congested dest?; reserve an ack back channel; drop packets; utilize higher-level semantics of the programming model (a credit sketch follows below)
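As referenced above, a minimal sketch of the per-source credit idea (SEND and the DATA tag are the generic primitives used in the earlier sketches; the constants are illustrative):

```c
#define MAX_NODES        1024        /* illustrative system size */
#define CREDITS_PER_SRC     8        /* input-buffer slots reserved per source (illustrative) */

static int credits[MAX_NODES];       /* my remaining reserved slots at each destination */

void init_credits(void)
{
    for (int d = 0; d < MAX_NODES; d++) credits[d] = CREDITS_PER_SRC;
}

/* Send only if the destination still has reserved space for us; otherwise hold off. */
int try_send(int dest, void *msg, int nbytes)
{
    if (credits[dest] == 0) return 0;          /* would over-commit the destination */
    credits[dest]--;
    SEND(msg, nbytes, dest, DATA);             /* generic send primitive */
    return 1;
}

/* The destination returns a credit once it has drained the buffered message
   (piggybacked on an ack or a higher-level reply). */
void credit_returned(int dest) { credits[dest]++; }
```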

755 Challenges (cont) Fetch Deadlock
For the network to remain deadlock free, nodes must continue accepting messages even when they cannot source msgs. What if the incoming transaction is a request? Each may generate a response, which cannot be sent! What happens when internal buffering is full? Options: logically independent request/reply networks (separate physical networks, or virtual channels with separate input/output queues); bound requests and reserve input buffer space (K(P-1) requests + K responses per node; service discipline to avoid fetch deadlock?); NACK on input buffer full (NACK delivery?)

756 Challenges in Realizing Prog. Models in the Large
One-way transfer of information. No global knowledge, nor global control: barriers, scans, reduce, global-OR give only fuzzy global state. Very large number of concurrent transactions. Management of input buffer resources: many sources can issue a request and over-commit the destination before any see the effect. Latency is large enough that you are tempted to "take risks": optimistic protocols, large transfers, dynamic allocation. Many, many more degrees of freedom in the design and engineering of these systems

757 Network Transaction Processing
(Figure: scalable network connecting nodes, each with processor P, memory M, and a communication assist CA.) Input processing: checks, translation, buffering, action. Output processing: checks, translation, formatting, scheduling. Key design issues: How much interpretation of the message? How much dedicated processing in the communication assist?

758 Spectrum of Designs
None (physical bit stream): blind, physical DMA: nCUBE, iPSC, ...
User/System: user-level port: CM-5, *T; user-level handler: J-Machine, Monsoon, ...
Remote virtual address (processing, translation): Paragon, Meiko CS-2
Global physical address (proc + memory controller): RP3, BBN, T3D
Cache-to-cache (cache controller): Dash, KSR, Flash
Increasing HW support, specialization, intrusiveness, performance (???)

759 Net Transactions: Physical DMA
DMA controlled by regs, generates interrupts. Physical => OS initiates transfers. Send side constructs a system "envelope" (sender, auth, dest addr) around the user data in the kernel area. Receive must receive into a system buffer, since there is no interpretation in the CA

760 nCUBE Network Interface
independent DMA channel per link direction; leave input buffers always open; segmented messages. Routing interprets the envelope: dimension-order routing on the hypercube; bit-serial with 36-bit cut-through. Os (send overhead): 16 instructions, 260 cycles, 13 us. Or (receive overhead): ... cycles, 15 us, includes interrupt

761 Conventional LAN NI
(Figure: NIC on the I/O bus with controller, transceiver, and TX/RX DMA engines; host memory holds queues of descriptors, each with Addr, Len, Status, Next fields, walked by the processor and the NIC.)

762 User Level Ports initiate transaction at user level
deliver to user without OS intervention; network port in user space. User/system flag in the envelope: protection check, translation, routing, media access in the source CA; user/sys check in the dest CA, interrupt on system

763 User Level Network ports
Appears to user as logical message queues plus status What happens if no user pop?

764 Example: CM-5 Input and output FIFO for each network 2 data networks
tag per message indexes an NI mapping table; context switching? *T: integrated NI on chip; iWARP also. Os: 50 cycles (1.5 us); Or: 53 cycles (1.6 us); interrupt: 10 us

765 User Level Handlers (Figure: user/system path delivering message data and dispatch address directly to the processor of a node with processor P and memory M.) Hardware support to vector to the address specified in the message; message ports in registers

766 J-Machine: Msg-Driven Processor
Each node a small msg driven processor HW support to queue msgs and dispatch to msg handler task

767 Communication Comparison
Message passing (active messages) interrupts (int-mp) polling (poll-mp) bulk transfer (bulk) Shared memory (sequential consistency) without prefetching (sm) with prefetching (pre-sm)

768 Motivation Comparison over a range of parameters
latency and bandwidth emulation hand-optimized code for each mechanism 5 versions of 4 applications

769 The Alewife Multiprocessor

770 Alewife Mechanisms Int-mp -- 100-200 cycles Send/Rec ovrhd
Poll-mp -- saves the receive interrupt overhead cycles. Bulk -- gather/scatter. Sm -- ... cycles, ... cycles/hop. Pre-sm -- 2 cycles, 16-entry buffer

771 Applications Irregular Computations Little data re-use Data driven

772 Application Descriptions
EM3D: 3D electromagnetic wave. ICCG: irregular sparse matrix solver. Unstruc: 3D fluid flow. Moldyn: molecular dynamics

773 Performance Breakdown

774 Performance Summary

775 Traffic Breakdown

776 Traffic Summary

777 Effects of Bandwidth

778 Bandwidth Emulation Lower bisection by introducing cross-traffic

779 Sensitivity to Bisection

780 Effects of Latency

781 Latency Emulation Clock variation Context switch on miss
processor has a tunable clock while the network is asynchronous, which results in variations in relative latency. Context switch on miss: add delay

782 Sensitivity to Latency

783 Sensitivity to Higher Latencies

784 Communication Comparison Summary
Low overhead in shared memory performs well even with: irregular, data-driven applications little re-use Bisection and latency can cause crossovers

785 Future Technology Technology changes the cost and performance of computer elements in a non-uniform manner: logic and arithmetic are becoming plentiful and cheap; wires are becoming slow and scarce. This changes the tradeoffs between alternative architectures: superscalar doesn't scale well (global control and data). So what will the architectures of the future be? (Roadmap 1998, 2001, 2004, 2007: roughly 64x the area and 4x the speed, but slower wires: cross-chip communication goes from 1 clk to about 20 clks.)

786 Single-Chip Multiprocessors
Build a multiprocessor on a single chip: linear increase in peak performance; advantage of fast interaction between processors. But the memory bandwidth problem is multiplied. (Figure: several processors, each with a private cache, sharing an on-chip cache and memory.)

787 Exploiting fine-grain threads
Where will the parallelism come from to keep all of these processors busy? ILP - limited to about 5 Outer-loop parallelism e.g., domain decomposition requires big problems to get lots of parallelism Fine threads make communication and synchronization very fast (1 cycle) break the problem into smaller pieces more parallelism

788 Processor with DRAM (PIM)
Put the processor and the main memory on a single chip: much lower memory latency, much higher memory bandwidth. But need to build systems with more than one chip. 64Mb SDRAM chip: internal: ... subarrays of 4 bits each at 10 ns, 51.2 Gb/s aggregate; external: 8 bits at 10 ns, 800 Mb/s. 1 integer processor ~ 100 KBytes of DRAM; 1 FP processor ~ 500 KBytes of DRAM

789 Reconfigurable processors
Adapt the processor to the application: special function units, special wiring between function units. Builds on FPGA technology, but FPGAs are inefficient: a multiplier built from an FPGA is about 100x larger and 10x slower than a custom multiplier. Need to raise the granularity: configure ALUs, or whole processors. Memory and communication are usually the bottleneck, and that is not addressed by configuring a lot of ALUs

790 EPIC - explicit (instruction-level) parallelism aka VLIW
(Figure: instruction cache, wide instruction issue, and a global register file.) The compiler schedules instructions and encodes dependencies explicitly, which saves having the hardware repeatedly rediscover them. Support speculation: speculative loads, branch prediction. Really need to make communication explicit too: still has global registers and global instruction issue

791 Summary Parallelism is inevitable Commodity forces
Parallelism is inevitable at every scale: ILP, medium, massive. Commodity forces: SMPs; NOWs, CLUMPs. Technological trends: MP chips, intelligent memory

