
1 Parallel Architectures and Systems, Edwin Sha 1 Review of Computer Architecture -- Instruction sets, pipelines and caches

2 Parallel Architectures and Systems, Edwin Sha 2 Review, #1
Designing to Last through Trends:
              Capacity         Speed
  Logic       2x in 3 years    2x in 3 years
  DRAM        4x in 3 years    2x in 10 years
  Disk        4x in 3 years    2x in 10 years
  Processor   (n.a.)           2x in 1.5 years
Time to run the task
  – Execution time, response time, latency
Tasks per day, hour, week, sec, ns, …
  – Throughput, bandwidth
"X is n times faster than Y" means:
  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n

3 Parallel Architectures and Systems, Edwin Sha 3 Review, #2
Amdahl's Law:
  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
CPI Law:
  CPU time = Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle
Execution time is the REAL measure of computer performance!
Good products created when have:
  – Good benchmarks
  – Good ways to summarize performance
Die Cost goes roughly with die area^4
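
The two formulas above translate directly into a few lines of C. This is only an illustrative sketch: the inputs in main (40% of the work enhanced by 10x, 10^9 instructions, CPI 1.5, a 1 ns cycle) are made-up numbers, not figures from the slides.

#include <stdio.h>

/* Amdahl's Law: overall speedup when only a fraction of the work is enhanced. */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

/* CPI law: CPU time = instruction count x CPI x clock cycle time. */
double cpu_time(double instructions, double cpi, double cycle_time_s) {
    return instructions * cpi * cycle_time_s;
}

int main(void) {
    printf("speedup = %.2f\n", amdahl_speedup(0.4, 10.0));   /* 1/(0.6 + 0.04) = 1.5625 */
    printf("time    = %.2f s\n", cpu_time(1e9, 1.5, 1e-9));  /* 1.5 s */
    return 0;
}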

4 Parallel Architectures and Systems, Edwin Sha 4 Computer Architecture Is … the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. Amdahl, Blaauw, and Brooks, 1964

5 Parallel Architectures and Systems, Edwin Sha 5 Evolution of Instruction Sets Major advances in computer architecture are typically associated with landmark instruction set designs –Ex: Stack vs GPR (System 360) Design decisions must take into account: –technology –machine organization –programming languages –compiler technology –operating systems And they in turn influence these

6 Parallel Architectures and Systems, Edwin Sha 6 A "Typical" RISC 32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair) 3-address, reg-reg arithmetic instruction Single address mode for load/store: base + displacement –no indirection Simple branch conditions Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

7 Parallel Architectures and Systems, Edwin Sha 7 Example: MIPS (≈ DLX)
Register-Register:   Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
Register-Immediate:  Op (31-26) | Rs1 (25-21) | Rd (20-16) | immediate (15-0)
Branch:              Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | immediate (15-0)
Jump / Call:         Op (31-26) | target (25-0)

8 Parallel Architectures and Systems, Edwin Sha 8 Pipelining Lessons
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" pipeline and time to "drain" it reduce speedup
(Figure: four tasks A-D overlapped in time starting at 6 PM, with stage times of 30, 40, and 20 minutes.)

9 Parallel Architectures and Systems, Edwin Sha 9 Computer Pipelines Execute billions of instructions, so throughput is what matters DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

10 Parallel Architectures and Systems, Edwin Sha 10 5 Steps of DLX Datapath (Figure 3.1, Page 130)
Instruction Fetch -> Instr. Decode / Reg. Fetch -> Execute / Addr. Calc -> Memory Access -> Write Back

11 Parallel Architectures and Systems, Edwin Sha 11 It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  – Data hazards: instruction depends on result of prior instruction still in the pipeline (missing sock)
  – Control hazards: pipelining of branches & other instructions that change the PC
  – Common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline

12 Parallel Architectures and Systems, Edwin Sha 12 Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction

Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Clock Cycle unpipelined / Clock Cycle pipelined)

If the ideal CPI is 1:

Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle unpipelined / Clock Cycle pipelined)
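
As a sanity check of the formula, a small C sketch; the function name and the example numbers in the comment are illustrative, not from the slides.

/* Speedup of a pipelined machine over an unpipelined one, per the slide's formula. */
double pipeline_speedup(double ideal_cpi, double pipeline_depth, double stall_cpi,
                        double cycle_unpipelined, double cycle_pipelined) {
    return (ideal_cpi * pipeline_depth) / (ideal_cpi + stall_cpi)
           * (cycle_unpipelined / cycle_pipelined);
}
/* Example: 5-stage pipe, ideal CPI 1, 0.5 stall cycles per instruction, equal cycle times:
   pipeline_speedup(1.0, 5.0, 0.5, 1.0, 1.0) = 5 / 1.5 = 3.33                              */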

13 Parallel Architectures and Systems, Edwin Sha 13 Data Hazard on R1 (Figure 3.9, page 147)
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
(Pipeline diagram: each instruction proceeds through IF, ID/RF, EX, MEM, WB one cycle behind the previous one.)

14 Parallel Architectures and Systems, Edwin Sha 14 Three Generic Data Hazards Instr I followed by Instr J Read After Write (RAW) Instr J tries to read operand before Instr I writes it

15 Parallel Architectures and Systems, Edwin Sha 15 Three Generic Data Hazards Instr I followed by Instr J Write After Read (WAR) Instr J tries to write operand before Instr I reads it –Gets wrong operand Can't happen in DLX 5 stage pipeline because: – All instructions take 5 stages, and – Reads are always in stage 2, and – Writes are always in stage 5

16 Parallel Architectures and Systems, Edwin Sha 16 Three Generic Data Hazards Instr I followed by Instr J Write After Write (WAW) Instr J tries to write operand before Instr I writes it – Leaves wrong result ( Instr I not Instr J ) Can’t happen in DLX 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5 Will see WAR and WAW in later more complicated pipes

17 Parallel Architectures and Systems, Edwin Sha 17 Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

18 Parallel Architectures and Systems, Edwin Sha 18 Data Hazard Even with Forwarding (Figure 3.12, Page 153)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

19 Parallel Architectures and Systems, Edwin Sha 19 Data Hazard Even with Forwarding (Figure 3.13, Page 154)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

20 Parallel Architectures and Systems, Edwin Sha 20 Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
  LW   Rb,b
  LW   Rc,c
  ADD  Ra,Rb,Rc
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  SUB  Rd,Re,Rf
  SW   d,Rd
Fast code:
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd

21 Parallel Architectures and Systems, Edwin Sha 21 Control Hazard on Branches Three Stage Stall

22 Parallel Architectures and Systems, Edwin Sha 22 Branch Stall Impact
If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1 + 0.30 x 3 = 1.9!
Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
DLX branch tests if register = 0 or ≠ 0
DLX Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

23 Parallel Architectures and Systems, Edwin Sha 23 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken –Execute successor instructions in sequence –“Squash” instructions in pipeline if branch actually taken –Advantage of late pipeline state update –47% DLX branches not taken on average –PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken –53% DLX branches taken on average –But haven’t calculated branch target address in DLX »DLX still incurs 1 cycle branch penalty »Other machines: branch target known before outcome

24 Parallel Architectures and Systems, Edwin Sha 24 Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define branch to take place AFTER a following instruction
        branch instruction
        sequential successor 1
        sequential successor 2
        ........
        sequential successor n      <-- branch delay of length n
        branch target if taken
  – 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  – DLX uses this

25 Parallel Architectures and Systems, Edwin Sha 25 Delayed Branch Where to get instructions to fill branch delay slot? –Before branch instruction –From the target address: only valuable when branch taken –From fall through: only valuable when branch not taken –Cancelling branches allow more slots to be filled Compiler effectiveness for single branch delay slot: –Fills about 60% of branch delay slots –About 80% of instructions executed in branch delay slots useful in computation –About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

26 Parallel Architectures and Systems, Edwin Sha 26 Pipelining Introduction Summary
Just overlap tasks, and easy if tasks are independent
Speedup ≤ Pipeline Depth; if ideal CPI is 1, then:
  Speedup = Pipeline Depth / (1 + Pipeline stall CPI) x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
Hazards limit performance on computers:
  – Structural: need more HW resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction

27 Parallel Architectures and Systems, Edwin Sha 27 Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency):
  – µProc: 60%/yr. (2X/1.5yr) ("Moore's Law" curve)
  – DRAM: 9%/yr. (2X/10 yrs)
  – Processor-Memory Performance Gap grows 50% / year
(Chart: relative performance of CPU vs DRAM, 1980-2000, log scale 1 to 1000.)

28 Parallel Architectures and Systems, Edwin Sha 28 Levels of the Memory Hierarchy
Level (Upper -> Lower)   Capacity     Access Time             Cost
  Registers              100s Bytes   <10s ns
  Cache                  K Bytes      10-100 ns               1-0.1 cents/bit
  Main Memory            M Bytes      200ns-500ns             $.0001-.00001 cents/bit
  Disk                   G Bytes      10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit
  Tape                   infinite     sec-min                 10^-8 cents/bit
Staging / Xfer Unit between levels:
  Registers <-> Cache:      Instr. Operands, 1-8 bytes (prog./compiler)
  Cache <-> Main Memory:    Blocks, 8-128 bytes (cache cntl)
  Main Memory <-> Disk:     Pages, 512-4K bytes (OS)
  Disk <-> Tape:            Files, Mbytes (user/operator)
Upper levels are faster; lower levels are larger.

29 Parallel Architectures and Systems, Edwin Sha 29 The Principle of Locality The Principle of Locality: –Programs access a relatively small portion of the address space at any instant of time. Two Different Types of Locality: –Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse) –Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access) Last 15 years, HW relied on locality for speed
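
A small C fragment, not from the slides, showing both kinds of locality at once:

/* Summing an array: s is reused every iteration (temporal locality),
   and a[0], a[1], ... are consecutive addresses (spatial locality).  */
double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}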

30 Parallel Architectures and Systems, Edwin Sha 30 Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on 21264!)
(Figure: Block X in the upper level memory, Block Y in the lower level memory, with data moving to/from the processor.)

31 Parallel Architectures and Systems, Edwin Sha 31 Cache Measures
Hit rate: fraction found in that level
  – So high that we usually talk about the Miss rate
  – Miss rate fallacy: miss rate is as misleading a metric for memory as MIPS is for CPU performance; what matters is average memory access time
Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including time to replace in CPU
  – access time: time to lower level = f(latency to lower level)
  – transfer time: time to transfer block = f(BW between upper & lower levels)
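
The average memory-access time formula as a one-line C helper; the numbers in the usage comment are assumptions for illustration only.

double avg_mem_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
/* Example: 1-cycle hit time, 5% miss rate, 50-cycle miss penalty:
   avg_mem_access_time(1.0, 0.05, 50.0) = 1 + 0.05 x 50 = 3.5 cycles */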

32 Parallel Architectures and Systems, Edwin Sha 32 Simplest Cache: Direct Mapped
(Figure: 4 Byte Direct Mapped Cache, cache indexes 0-3, memory addresses 0-F.)
Location 0 can be occupied by data from:
  – Memory location 0, 4, 8, ... etc.
  – In general: any memory location whose 2 LSBs of the address are 0s
  – Address => cache index
Which one should we place in the cache?
How can we tell which one is in the cache?

33 Parallel Architectures and Systems, Edwin Sha 33 1 KB Direct Mapped Cache, 32B blocks
For a 2**N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2**M)
(Figure: the address splits into Cache Tag (bits 31-10, ex: 0x50), Cache Index (bits 9-5, ex: 0x01), and Byte Select (bits 4-0, ex: 0x00); the tag and a Valid Bit are stored as part of the cache "state", and each line holds Byte 0 ... Byte 31.)
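
A sketch of how the address split on this slide can be computed in C. The field widths follow the slide's 1 KB cache / 32-byte block example; the function names and the example address are made up for illustration.

#include <stdint.h>

#define BLOCK_BITS 5u    /* 32-byte block -> 5 byte-select bits     */
#define INDEX_BITS 5u    /* 1 KB / 32 B = 32 lines -> 5 index bits  */

uint32_t byte_select(uint32_t addr) { return addr & ((1u << BLOCK_BITS) - 1); }
uint32_t cache_index(uint32_t addr) { return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1); }
uint32_t cache_tag(uint32_t addr)   { return addr >> (BLOCK_BITS + INDEX_BITS); }
/* Example: addr = 0x00014020 -> tag 0x50, index 0x01, byte select 0x00, as on the slide. */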

34 Parallel Architectures and Systems, Edwin Sha 34 Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index
  – N direct mapped caches operate in parallel (N typically 2 to 4)
Example: Two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared in parallel
  – Data is selected based on the tag result
(Figure: two banks of Valid/Cache Tag/Cache Data; the address tag is compared against both, the compare results are ORed into Hit, and a mux (Sel0/Sel1) picks the Cache Block.)
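
A minimal C sketch of a two-way set-associative lookup; the structure layout, set count, and names are assumptions chosen for illustration, not a description of any particular cache.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 64u

struct way   { bool valid; uint32_t tag; /* data block omitted */ };
struct set   { struct way w[2]; };
struct cache { struct set sets[NUM_SETS]; };

/* Returns true on a hit and reports which way matched. */
bool lookup(const struct cache *c, uint32_t index, uint32_t tag, int *way_hit) {
    const struct set *s = &c->sets[index % NUM_SETS];
    for (int i = 0; i < 2; i++) {          /* in HW the two tags are compared in parallel */
        if (s->w[i].valid && s->w[i].tag == tag) { *way_hit = i; return true; }
    }
    return false;                          /* miss: pick a victim (e.g., LRU) and refill  */
}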

35 Parallel Architectures and Systems, Edwin Sha 35 Disadvantage of Set Associative Cache
N-way Set Associative Cache vs. Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER Hit/Miss
In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue. Recover later if miss.

36 Parallel Architectures and Systems, Edwin Sha 36 4 Questions for Memory Hierarchy Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

37 Parallel Architectures and Systems, Edwin Sha 37 Q1: Where can a block be placed in the upper level? Block 12 placed in 8 block cache: –Fully associative, direct mapped, 2-way set associative –S.A. Mapping = Block Number Modulo Number Sets (e.g., direct mapped: block 12 goes to frame 12 mod 8 = 4; 2-way set associative: to set 12 mod 4 = 0)

38 Parallel Architectures and Systems, Edwin Sha 38 Q2: How is a block found if it is in the upper level? Tag on each block –No need to check index or block offset Increasing associativity shrinks index, expands tag

39 Parallel Architectures and Systems, Edwin Sha 39 Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)
Miss rates by associativity:
              2-way             4-way             8-way
  Size        LRU     Random    LRU     Random    LRU     Random
  16 KB       5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
  64 KB       1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
  256 KB      1.15%   1.17%     1.13%   1.13%     1.12%   1.12%
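
For the 2-way case, LRU needs only one bit of state per set. A small C sketch (the set count and names are illustrative):

#include <stdint.h>

#define LRU_SETS 64u
static uint8_t lru_way[LRU_SETS];   /* lru_way[s] = the way in set s that was used least recently */

void touch(uint32_t set, int way) { lru_way[set] = (uint8_t)(1 - way); }  /* the other way is now LRU */
int  victim(uint32_t set)         { return lru_way[set]; }               /* way to replace on a miss */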

40 Parallel Architectures and Systems, Edwin Sha 40 Q4: What happens on a write?
Write through: the information is written to both the block in the cache and to the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  – is the block clean or dirty?
Pros and Cons of each?
  – WT: read misses cannot result in writes
  – WB: no repeated writes to same location
WT is always combined with write buffers so that the processor doesn't wait for the lower level memory

41 Parallel Architectures and Systems, Edwin Sha 41 Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory
  – Processor: writes data into the cache and the write buffer
  – Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO:
  – Typical number of entries: 4
  – Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
  – Store frequency (w.r.t. time) -> 1 / DRAM write cycle
  – Write buffer saturation
(Figure: Processor -> Cache, with the Write Buffer between Cache and DRAM.)

42 Parallel Architectures and Systems, Edwin Sha 42 A Modern Memory Hierarchy
By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
(Figure: Processor (Control, Datapath, Registers, On-Chip Cache) -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Disk/Tape).
 Speed (ns): 1s ... 10s ... 100s ... 10,000,000s (10s ms) ... 10,000,000,000s (10s sec).
 Size (bytes): Ks ... Ms ... Gs ... Ts.)

43 Parallel Architectures and Systems, Edwin Sha 43 Basic Issues in VM System Design
Size of information blocks that are transferred from secondary to main storage (M)
If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
Which region of M is to hold the new block --> placement policy
Missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging Organization: virtual and physical address space partitioned into blocks of equal size; physical memory is divided into page frames, virtual memory into pages
(Figure: reg - cache - mem - disk hierarchy.)

44 Parallel Architectures and Systems, Edwin Sha 44 Address Map
V = {0, 1, ..., n - 1}  virtual address space
M = {0, 1, ..., m - 1}  physical address space     (n > m)
MAP: V --> M U {0}  address mapping function
  MAP(a) = a'  if data at virtual address a is present in physical address a' and a' in M
  MAP(a) = 0   if data at virtual address a is not present in M
(Figure: the processor issues address a in name space V; the address translation mechanism either yields physical address a' in main memory or raises a missing-item fault, and the OS/fault handler transfers the item from secondary memory.)

45 Parallel Architectures and Systems, Edwin Sha 45 Paging Organization
Physical memory: 1K frames, frame 0 at P.A. 0, frame 1 at 1024, ..., frame 7 at 7168
Virtual memory:  1K pages,  page 0 at V.A. 0,  page 1 at 1024, ..., page 31 at 31744
The page is the unit of mapping, and also the unit of transfer from virtual to physical memory
Address Mapping:
  – V.A. = page no. + disp (10 bits)
  – Page Table Base Reg + page no. indexes into the Page Table
  – each Page Table entry holds V (valid), Access Rights, PA; the table is located in physical memory
  – PA + disp gives the physical memory address (actually, concatenation is more likely)
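
A C sketch of the translation this slide describes, assuming the slide's 1K pages (a 10-bit displacement). The page-table entry layout and the names are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 10u
#define PAGE_SIZE (1u << PAGE_BITS)

struct pte { bool valid; uint8_t access_rights; uint32_t frame; };

/* Returns true and fills *pa on success; false means a missing-item (page) fault. */
bool translate(const struct pte *page_table, uint32_t va, uint32_t *pa) {
    uint32_t page = va >> PAGE_BITS;           /* page number indexes the page table        */
    uint32_t disp = va & (PAGE_SIZE - 1);      /* 10-bit displacement within the page       */
    const struct pte *e = &page_table[page];
    if (!e->valid) return false;               /* fault: OS brings the page in from disk    */
    *pa = (e->frame << PAGE_BITS) | disp;      /* concatenate frame number and displacement */
    return true;
}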

46 Parallel Architectures and Systems, Edwin Sha 46 Virtual Address and a Cache
(Figure: CPU issues VA -> translation -> PA -> cache; on a miss, main memory supplies the data.)
It takes an extra memory access to translate VA to PA
This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
ASIDE: Why access the cache with PA at all? VA caches have a problem!
  – synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  – for update: must update all cache entries with the same physical address or memory becomes inconsistent
  – determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or a software enforced alias boundary: same LSBs of VA & PA over a range larger than the cache size

47 Parallel Architectures and Systems, Edwin Sha 47 TLBs
A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB
A TLB entry holds: Virtual Address, Physical Address, Dirty, Ref, Valid, Access
Really just a cache on the page table mappings
TLB access time comparable to cache access time (much less than main memory access time)

48 Parallel Architectures and Systems, Edwin Sha 48 Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than 128 - 256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
(Figure, "Translation with a TLB": CPU -> TLB lookup (about 1/2 t); on a hit the cache is accessed with the PA (t), with main memory at 20 t on a cache miss; on a TLB miss the full translation is performed first.)
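
A hedged C sketch of a TLB sitting in front of the page-table walk; the entry count, layout, and names are assumptions, and the miss path could fall back to the translate() sketch shown after the Paging Organization slide.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16
#define PAGE_BITS   10u

struct tlb_entry { bool valid; uint32_t page; uint32_t frame; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit (no page-table access needed); false means do the full translation. */
bool tlb_translate(uint32_t va, uint32_t *pa) {
    uint32_t page = va >> PAGE_BITS;
    uint32_t disp = va & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++) {         /* small, fully associative lookup */
        if (tlb[i].valid && tlb[i].page == page) {
            *pa = (tlb[i].frame << PAGE_BITS) | disp;
            return true;
        }
    }
    return false;   /* TLB miss: walk the page table, then refill an entry of the TLB */
}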

49 Parallel Architectures and Systems, Edwin Sha 49 Reducing Translation Time Machines with TLBs go one step further to reduce # cycles/cache access. They overlap the cache access with the TLB access: high order bits of the VA are used to look in the TLB while low order bits are used as index into the cache

50 Parallel Architectures and Systems, Edwin Sha 50 Summary #1/4: The Principle of Locality: –Programs access a relatively small portion of the address space at any instant of time. »Temporal Locality: Locality in Time »Spatial Locality: Locality in Space Three Major Categories of Cache Misses: –Compulsory Misses: sad facts of life. Example: cold start misses. –Capacity Misses: increase cache size –Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! Write Policy: –Write Through: needs a write buffer. Nightmare: WB saturation –Write Back: control can be complex

51 Parallel Architectures and Systems, Edwin Sha 51 Summary #2/4: The Cache Design Space Several interacting dimensions –cache size –block size –associativity –replacement policy –write-through vs write-back –write allocation The optimal choice is a compromise –depends on access characteristics »workload »use (I-cache, D-cache, TLB) –depends on technology / cost Simplicity often wins

52 Parallel Architectures and Systems, Edwin Sha 52 Summary #3/4: TLB, Virtual Memory Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is replaced on miss? 4) How are writes handled? Page tables map virtual address to physical address TLBs are important for fast translation TLB misses are significant in processor performance –funny times, as most systems can't access all of 2nd level cache without TLB misses!

53 Parallel Architectures and Systems, Edwin Sha 53 Summary #4/4: Memory Hierarchy Virtual memory was controversial at the time: can SW automatically manage 64KB across many programs? –1000X DRAM growth removed the controversy Today VM allows many processes to share single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?

54 Parallel Architectures and Systems, Edwin Sha 54 Review: Summary of Pipelining Basics Hazards limit performance –Structural: need more HW resources –Data: need forwarding, compiler scheduling –Control: early evaluation & PC, delayed branch, prediction Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency Interrupts, Instruction Set, FP make pipelining harder Compilers reduce cost of data and control hazards –Load delay slots –Branch delay slots –Branch prediction Today: Longer pipelines (R4000) => Better branch prediction, more instruction parallelism?

55 Parallel Architectures and Systems, Edwin Sha 55 Advanced Pipelining and Instruction Level Parallelism (ILP) ILP: Overlap execution of unrelated instructions gcc: 17% control transfer –about 5 instructions + 1 branch –Beyond single block to get more instruction level parallelism Loop level parallelism one opportunity, SW and HW Do examples and then explain nomenclature DLX Floating Point as example –Measurements suggest R4000 FP execution performance has room for improvement

56 Parallel Architectures and Systems, Edwin Sha 56 FP Loop: Where are the Hazards?
Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar from F2
      SD    0(R1),F4    ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot
Latencies:
  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0
Where are the stalls?

57 Parallel Architectures and Systems, Edwin Sha 57 FP Loop Hazards
  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0
Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar in F2
      SD    0(R1),F4    ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot

58 Parallel Architectures and Systems, Edwin Sha 58 FP Loop Showing Stalls
  1 Loop: LD    F0,0(R1)   ;F0=vector element
  2       stall
  3       ADDD  F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       SD    0(R1),F4   ;store result
  7       SUBI  R1,R1,8    ;decrement pointer 8B (DW)
  8       BNEZ  R1,Loop    ;branch R1!=zero
  9       stall            ;delayed branch slot
Latencies:
  FP ALU op   -> Another FP ALU op : 3
  FP ALU op   -> Store double      : 2
  Load double -> FP ALU op         : 1
9 clocks: Rewrite code to minimize stalls?
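
For reference, a C loop of the kind this DLX code implements; the names x, s, and n are illustrative, since the slides show only the assembly.

/* Walk a vector of doubles backwards and add a scalar to each element. */
void add_scalar(double *x, double s, long n) {
    for (long i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}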

59 Parallel Architectures and Systems, Edwin Sha 59 Revised FP Loop Minimizing Stalls
  1 Loop: LD    F0,0(R1)
  2       stall
  3       ADDD  F4,F0,F2
  4       SUBI  R1,R1,8
  5       BNEZ  R1,Loop    ;delayed branch
  6       SD    8(R1),F4   ;altered when move past SUBI
Swap BNEZ and SD by changing the address of SD
Latencies:
  FP ALU op   -> Another FP ALU op : 3
  FP ALU op   -> Store double      : 2
  Load double -> FP ALU op         : 1
6 clocks: Unroll the loop 4 times to make it faster?

60 Parallel Architectures and Systems, Edwin Sha 60 Unroll Loop Four Times (straightforward way)
  1  Loop: LD    F0,0(R1)
  2        ADDD  F4,F0,F2
  3        SD    0(R1),F4     ;drop SUBI & BNEZ
  4        LD    F6,-8(R1)
  5        ADDD  F8,F6,F2
  6        SD    -8(R1),F8    ;drop SUBI & BNEZ
  7        LD    F10,-16(R1)
  8        ADDD  F12,F10,F2
  9        SD    -16(R1),F12  ;drop SUBI & BNEZ
  10       LD    F14,-24(R1)
  11       ADDD  F16,F14,F2
  12       SD    -24(R1),F16
  13       SUBI  R1,R1,#32    ;alter to 4*8
  14       BNEZ  R1,LOOP
  15       NOP
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes R1 is a multiple of 4
Rewrite loop to minimize stalls?
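
The same loop unrolled four times in C, for comparison with the assembly above (again with illustrative names, and assuming the iteration count is a multiple of 4):

void add_scalar_unrolled4(double *x, double s, long n) {
    for (long i = n - 1; i >= 3; i -= 4) {   /* one trip of this loop = four original iterations */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}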

61 Parallel Architectures and Systems, Edwin Sha 61 Unrolled Loop That Minimizes Stalls
  1  Loop: LD    F0,0(R1)
  2        LD    F6,-8(R1)
  3        LD    F10,-16(R1)
  4        LD    F14,-24(R1)
  5        ADDD  F4,F0,F2
  6        ADDD  F8,F6,F2
  7        ADDD  F12,F10,F2
  8        ADDD  F16,F14,F2
  9        SD    0(R1),F4
  10       SD    -8(R1),F8
  11       SD    -16(R1),F12
  12       SUBI  R1,R1,#32
  13       BNEZ  R1,LOOP
  14       SD    8(R1),F16   ;8-32 = -24
14 clock cycles, or 3.5 per iteration
What assumptions made when moved code?
  – OK to move store past SUBI even though it changes the register
  – OK to move loads before stores: get right data?
  – When is it safe for the compiler to do such changes?

62 Parallel Architectures and Systems, Edwin Sha 62 Compiler Perspectives on Code Movement Definitions: compiler concerned about dependencies in program, whether or not a HW hazard depends on a given pipeline Try to schedule to avoid hazards (True) Data dependencies (RAW if a hazard for HW) –Instruction i produces a result used by instruction j, or –Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i. If dependent, can't execute in parallel Easy to determine for registers (fixed names) Hard for memory: –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)?

63 Parallel Architectures and Systems, Edwin Sha 63 Compiler Perspectives on Code Movement Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don’t exchange data Antidependence (WAR if a hazard for HW) –Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first Output dependence (WAW if a hazard for HW) –Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.

64 Parallel Architectures and Systems, Edwin Sha 64 Where are the name dependencies?
  1  Loop: LD    F0,0(R1)
  2        ADDD  F4,F0,F2
  3        SD    0(R1),F4     ;drop SUBI & BNEZ
  4        LD    F0,-8(R1)
  5        ADDD  F4,F0,F2
  6        SD    -8(R1),F4    ;drop SUBI & BNEZ
  7        LD    F0,-16(R1)
  8        ADDD  F4,F0,F2
  9        SD    -16(R1),F4   ;drop SUBI & BNEZ
  10       LD    F0,-24(R1)
  11       ADDD  F4,F0,F2
  12       SD    -24(R1),F4
  13       SUBI  R1,R1,#32    ;alter to 4*8
  14       BNEZ  R1,LOOP
  15       NOP
How can we remove them?

65 Parallel Architectures and Systems, Edwin Sha 65 Where are the name dependencies?
  1  Loop: LD    F0,0(R1)
  2        ADDD  F4,F0,F2
  3        SD    0(R1),F4     ;drop SUBI & BNEZ
  4        LD    F6,-8(R1)
  5        ADDD  F8,F6,F2
  6        SD    -8(R1),F8    ;drop SUBI & BNEZ
  7        LD    F10,-16(R1)
  8        ADDD  F12,F10,F2
  9        SD    -16(R1),F12  ;drop SUBI & BNEZ
  10       LD    F14,-24(R1)
  11       ADDD  F16,F14,F2
  12       SD    -24(R1),F16
  13       SUBI  R1,R1,#32    ;alter to 4*8
  14       BNEZ  R1,LOOP
  15       NOP
Called "register renaming"

66 Parallel Architectures and Systems, Edwin Sha 66 Compiler Perspectives on Code Movement Again Name Dependencies are Hard for Memory Accesses –Does 100(R4) = 20(R6)? –From different loop iterations, does 20(R6) = 20(R6)? Our example required the compiler to know that if R1 doesn't change then: 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1) There were no dependencies between some loads and stores, so they could be moved past each other

67 Parallel Architectures and Systems, Edwin Sha 67 Compiler Perspectives on Code Movement Final kind of dependence called control dependence Example if p1 {S1;}; if p2 {S2;}; S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

68 Parallel Architectures and Systems, Edwin Sha 68 Compiler Perspectives on Code Movement Two (obvious) constraints on control dependences: –An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. –An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch. Control dependencies relaxed to get parallelism; get same effect if preserve order of exceptions (address in register checked by branch before use) and data flow (value in register depends on branch)

69 Parallel Architectures and Systems, Edwin Sha 69 Where are the control dependencies?
  1  Loop: LD    F0,0(R1)
  2        ADDD  F4,F0,F2
  3        SD    0(R1),F4
  4        SUBI  R1,R1,8
  5        BEQZ  R1,exit
  6        LD    F0,0(R1)
  7        ADDD  F4,F0,F2
  8        SD    0(R1),F4
  9        SUBI  R1,R1,8
  10       BEQZ  R1,exit
  11       LD    F0,0(R1)
  12       ADDD  F4,F0,F2
  13       SD    0(R1),F4
  14       SUBI  R1,R1,8
  15       BEQZ  R1,exit
  ....

70 Parallel Architectures and Systems, Edwin Sha 70 When Safe to Unroll Loop?
Example: Where are the data dependencies? (A, B, C distinct & nonoverlapping)
  for (i = 1; i <= 100; i = i + 1) {
      A[i+1] = A[i] + C[i];      /* S1 */
      B[i+1] = B[i] + A[i+1];    /* S2 */
  }
1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a "loop-carried dependence": between iterations
Implies that iterations are dependent, and can't be executed in parallel
Not the case for our prior example; each iteration was distinct

71 Parallel Architectures and Systems, Edwin Sha 71 HW Schemes: Instruction Parallelism
Why in HW at run time?
  – Works when can't know real dependence at compile time
  – Compiler simpler
  – Code for one machine runs well on another
Key idea: Allow instructions behind a stall to proceed
  DIVD  F0,F2,F4
  ADDD  F10,F0,F8
  SUBD  F12,F8,F14
  – Enables out-of-order execution => out-of-order completion
  – ID stage checked both for structural and data hazards
Scoreboard dates to CDC 6600 in 1963

72 Parallel Architectures and Systems, Edwin Sha 72 HW Schemes: Instruction Parallelism Out-of-order execution divides ID stage: 1. Issue: decode instructions, check for structural hazards 2. Read operands: wait until no data hazards, then read operands Scoreboards allow instructions to execute whenever 1 & 2 hold, not waiting for prior instructions CDC 6600: In order issue, out of order execution, out of order commit (also called completion) Reference: "Computer Architecture: A Quantitative Approach", Chapters 1-4, by John L. Hennessy & David A. Patterson

