ILP, Memory and Synchronization Joseph B. Manzano.


1 ILP, Memory and Synchronization Joseph B. Manzano

2 Instruction Level Parallelism Parallelism that is found between instructions Dynamic and Static Exploitation – Dynamic: Hardware related. – Static: Software related (compiler and system software) VLIW and Superscalar Micro-Dataflow and Tomasulo’s Algorithm

3 Hazards Structural Hazards – Non-pipelined functional units – One-port register bank and one-port memory bank Data Hazards – For some: forwarding – For others: pipeline interlock Example: LD R1, A followed by + R4, R1, R7 needs a bubble / stall

4 Data Dependency: A Review
  Flow dependency (RAW conflicts):    B + C → A    A + D → E
  Anti dependency (WAR conflicts):    A + C → B    E + D → A
  Output dependency (WAW conflicts):  B + C → A    E + D → A
  RAR are not really a problem
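
The three conflict types above can be checked mechanically. The following is a minimal sketch, assuming each instruction is written as a (destination, sources) pair; the function name and the representation are illustrative and not part of the slides.

```python
def classify_dependencies(first, second):
    """Return the hazards that arise when `second` follows `first`.

    Each instruction is a (destination, sources) pair, e.g. ("A", {"B", "C"})
    stands for  B + C -> A.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("RAW")   # flow: second reads what first writes
    if dest2 in srcs1:
        found.append("WAR")   # anti: second overwrites a source of first
    if dest1 == dest2:
        found.append("WAW")   # output: both write the same location
    return found


# The three instruction pairs from the slide
flow = classify_dependencies(("A", {"B", "C"}), ("E", {"A", "D"}))
anti = classify_dependencies(("B", {"A", "C"}), ("A", {"E", "D"}))
output = classify_dependencies(("A", {"B", "C"}), ("A", {"E", "D"}))
print(flow, anti, output)
```

Read-after-read pairs return an empty list, which matches the remark that RAR is not a problem.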

5 Instruction Level Parallelism Static Scheduling – Simple Scheduling – Loop Unrolling – Loop Unrolling + Scheduling – Software Pipelining Dynamic Scheduling – Out of order execution – Data Flow computers Speculation

6 Advanced Pipelining Instruction Reordering and scheduling within loop body Loop Unrolling – Code size suffers Superscalar – Compact code – Multiple issue of different instruction types VLIW

7 An Example: X[i] + a
  Loop: LD    F0, 0(R1)     ; load the vector element
        ADDD  F4, F0, F2    ; add the scalar in F2
        SD    0(R1), F4     ; store the vector element
        SUB   R1, R1, #8    ; decrement the pointer by 8 bytes (per DW)
        BNEZ  R1, Loop      ; branch when it's not zero

  Instruction Producer   Instruction Consumer   Latency
  FP ALU op              Another FP ALU op      3
  FP ALU op              Store Double           2
  Load Double            FP ALU op              1
  Load Double            Store Double           0

  Load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.

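8 An Example: X[i] + a
  Loop: LD    F0, 0(R1)     ; cycle 1
        STALL               ; cycle 2 (load latency)
        ADDD  F4, F0, F2    ; cycle 3
        STALL               ; cycle 4 (FP ALU latency)
        STALL               ; cycle 5 (FP ALU latency)
        SD    0(R1), F4     ; cycle 6
        SUB   R1, R1, #8    ; cycle 7
        BNEZ  R1, Loop      ; cycle 8
        STALL               ; cycle 9 (branch delay)
  This requires 9 cycles per iteration.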

9 An Example: X[i] + a (Scheduling)
  Loop: LD    F0, 0(R1)     ; cycle 1
        STALL               ; cycle 2
        ADDD  F4, F0, F2    ; cycle 3
        SUB   R1, R1, #8    ; cycle 4
        BNEZ  R1, Loop      ; cycle 5
        SD    8(R1), F4     ; cycle 6 (fills the branch delay slot)
  This requires 6 cycles per iteration.

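10 An Example: X[i] + a (Unrolling)
  Loop: LD    F0, 0(R1)       ; cycle 1
        NOP                   ; cycle 2
        ADDD  F4, F0, F2      ; cycle 3
        NOP                   ; cycle 4
        NOP                   ; cycle 5
        SD    0(R1), F4       ; cycle 6
        LD    F6, -8(R1)      ; cycle 7
        NOP                   ; cycle 8
        ADDD  F8, F6, F2      ; cycle 9
        NOP                   ; cycle 10
        NOP                   ; cycle 11
        SD    -8(R1), F8      ; cycle 12
        LD    F10, -16(R1)    ; cycle 13
        NOP                   ; cycle 14
        ADDD  F12, F10, F2    ; cycle 15
        NOP                   ; cycle 16
        NOP                   ; cycle 17
        SD    -16(R1), F12    ; cycle 18
        LD    F14, -24(R1)    ; cycle 19
        NOP                   ; cycle 20
        ADDD  F16, F14, F2    ; cycle 21
        NOP                   ; cycle 22
        NOP                   ; cycle 23
        SD    -24(R1), F16    ; cycle 24
        SUB   R1, R1, #32     ; cycle 25
        BNEZ  R1, Loop        ; cycle 26
        NOP                   ; cycle 27
  This requires 27 cycles for 4 iterations, about 6.8 cycles per iteration.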

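11 An Example: X[i] + a (Unrolling + Scheduling)
  Loop: LD    F0, 0(R1)       ; cycle 1
        LD    F6, -8(R1)      ; cycle 2
        LD    F10, -16(R1)    ; cycle 3
        LD    F14, -24(R1)    ; cycle 4
        ADDD  F4, F0, F2      ; cycle 5
        ADDD  F8, F6, F2      ; cycle 6
        ADDD  F12, F10, F2    ; cycle 7
        ADDD  F16, F14, F2    ; cycle 8
        SD    0(R1), F4       ; cycle 9
        SD    -8(R1), F8      ; cycle 10
        SD    -16(R1), F12    ; cycle 11
        SUB   R1, R1, #32     ; cycle 12
        BNEZ  R1, Loop        ; cycle 13
        SD    8(R1), F16      ; cycle 14 (fills the branch delay slot)
  This requires 14 cycles for 4 iterations, 3.5 cycles per iteration.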
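
The cycle counts of the four versions of the loop can be reproduced with a small model. The sketch below assumes an in-order, single-issue pipeline in which the only stalls come from the producer/consumer latency table and from an unfilled branch delay slot. The instruction encoding and all names are illustrative, not taken from the slides.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Pairs that are not listed (integer ops, branches) are assumed to be 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
}


def cycles(program):
    """Cycles for one pass over `program`; instructions are (kind, dest, sources)."""
    last_writer = {}                 # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in last_writer:
                producer_kind, producer_cycle = last_writer[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                issue = max(issue, producer_cycle + stall + 1)
        cycle = issue
        if dest is not None:
            last_writer[dest] = (kind, cycle)
    if program[-1][0] == "branch":   # nothing scheduled into the delay slot
        cycle += 1
    return cycle


def body(f_in, f_out):
    """LD / ADDD / SD for one vector element."""
    return [("load", f_in, ["R1"]),
            ("fp_alu", f_out, [f_in, "F2"]),
            ("store", None, [f_out, "R1"])]


SUB = ("int", "R1", ["R1"])
BNEZ = ("branch", None, ["R1"])
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

original = body("F0", "F4") + [SUB, BNEZ]
ld, addd, sd = body("F0", "F4")
scheduled = [ld, addd, SUB, BNEZ, sd]            # SD fills the delay slot
groups = [body(*pair) for pair in PAIRS]
unrolled = [instr for g in groups for instr in g] + [SUB, BNEZ]
unrolled_scheduled = ([g[0] for g in groups] + [g[1] for g in groups]
                      + [g[2] for g in groups[:3]] + [SUB, BNEZ, groups[3][2]])

for name, prog, iterations in [("original", original, 1),
                               ("scheduled", scheduled, 1),
                               ("unrolled", unrolled, 4),
                               ("unrolled + scheduled", unrolled_scheduled, 4)]:
    print(name, cycles(prog) / iterations, "cycles per iteration")
```

Under these assumptions the model gives 9, 6, 6.75 and 3.5 cycles per iteration, in line with the figures quoted on the slides.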

12 ILP ILP of a program – Average Number of Instructions that a superscalar processor might be able to execute at the same time Data dependencies Latencies and other processor difficulties ILP of a machine – The ability of a processor to take advantage of the ILP Number of instructions that can be fetched and executed at the same time by such processor
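
One common way to put a number on the ILP of a program is to divide the instruction count by the length of the longest chain of true (RAW) dependencies. The sketch below makes that assumption explicit: unit latency, unlimited functional units, and perfect renaming so that only RAW dependencies count. The slide does not prescribe this formula; it is one possible estimate.

```python
def program_ilp(instructions):
    """Instruction count divided by the longest RAW dependence chain.

    Each instruction is a (destination, sources) pair, in program order.
    """
    level_of = {}     # register -> dataflow level of its latest producer
    depth = 0
    for dest, sources in instructions:
        level = 1 + max((level_of.get(s, 0) for s in sources), default=0)
        level_of[dest] = level
        depth = max(depth, level)
    return len(instructions) / depth


# Two independent two-instruction chains: two instructions can run per step
parallel = [("R1", ["R2", "R3"]), ("R4", ["R1", "R5"]),
            ("R6", ["R7", "R8"]), ("R9", ["R6", "R5"])]
# One chain in which every instruction needs the previous result
serial = [("R1", ["R0"]), ("R2", ["R1"]), ("R3", ["R2"])]
print(program_ilp(parallel), program_ilp(serial))
```

The ILP of a machine then limits how much of this figure can actually be exploited.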

13 Multi Issue Architectures Super Scalar – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler Very Long Instruction Word – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue Patterson & Hennessy P317 and P318

14 Multiple Instruction Issue Multiple Issue + Static Scheduling → VLIW Dynamic Scheduling – Tomasulo – Scoreboarding Multiple Issue + Dynamic Scheduling → Superscalar Decoupled Architectures – Static Scheduling of R-R Instructions – Dynamic Scheduling of Memory Ops Buffers

15 Software Pipeline Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations Uses less code space – Compared to unrolling Some architectures have specific support for it – Rotating register banks – Predicated instructions

16 Software Pipelining Overlap instructions without unrolling the loop. Given the vector M in memory, and ignoring the start-up and finishing code, we have:
  Loop: SD    0(R1), F4     ; stores into M[i]
        ADDD  F4, F0, F2    ; adds to M[i + 1]
        LD    F0, -8(R1)    ; loads M[i + 2]
        BNEZ  R1, Loop
        SUB   R1, R1, #8    ; subtract in delay slot
  This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
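
The same restructuring can be written in a high-level language to show where the start-up and clean-up code come from. This is a minimal sketch under the assumption that the loop adds a scalar to every element. It walks the array forward rather than decrementing a pointer, and the function name is illustrative.

```python
def add_scalar_pipelined(x, a):
    """Compute [v + a for v in x] with a software-pipelined loop body."""
    n = len(x)
    if n < 3:                         # too short to fill the pipeline
        return [v + a for v in x]
    out = [None] * n
    # Prologue: start iterations 0 and 1
    loaded = x[0]
    summed = loaded + a
    loaded = x[1]
    # Kernel: one stage from each of three different iterations per trip
    for i in range(n - 2):
        out[i] = summed               # store for iteration i
        summed = loaded + a           # add for iteration i + 1
        loaded = x[i + 2]             # load for iteration i + 2
    # Epilogue: drain the two iterations still in flight
    out[n - 2] = summed
    out[n - 1] = loaded + a
    return out


print(add_scalar_pipelined([1, 2, 3, 4, 5], 10))
```

The prologue and epilogue run once each, independent of the trip count, which is the fixed overhead that software pipelining pays.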

17 Software Pipeline Overhead for software pipelining: two times the cost → once for the prologue and once for the epilogue Overhead for an unrolled loop: M / N times the cost → M loop executions and N-fold unrolling [Figure: number of overlapped instructions over time for software-pipelined code (with prologue and epilogue) and for the unrolled loop]

18 Loop Unrolling vs. Software Pipelining When not running at maximum rate – Unrolling: Pay m/n times overhead for m iterations and n-fold unrolling – Software Pipelining: Pay two times, once at the prologue and once at the epilogue Moreover – Code compactness – Optimal runtime – Storage constraints

19 Comparison of Static Methods [Figure: cycles per iteration for: w/o scheduling, scheduling, unrolling, unrolling + scheduling, 2-issue, 4-issue, SP 1-issue, SP 5-issue]

20 Limitations of VLIW Limited parallelism in (statically scheduled) code – Basic blocks may be too small – Global code motion is difficult Limited hardware resources Code size Memory port limitations A stall is serious Cache is difficult to use effectively – I-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width – The cache miss penalty is increased because of the length of the instruction word

21 A VLIW Example TMS320C62x/C67x Block Diagram Source: TMS320C6000 Technical Brief. February 1999

22 A VLIW Example TMS320C62x/C67x Data Paths Source: TMS320C6000 Technical Brief. February 1999 Assembly Example

23 Introduction to SuperScalar

24 Instruction Issue Policy It determines the processor's look-ahead policy – Ability to examine instructions beyond the current PC Look-ahead must ensure correctness at all costs Issue policy – Protocol used to issue instructions Note: Issue, execution and completion

25 Achieving High Performance in Multiple-Issue Machines Detection and resolution of storage conflicts – Extra “Shadow” registers – Special bit for reservation Organization and control of the buses between the various units in the PU – Special controllers to detect write backs and reads

26 Data Dependencies & SuperScalar Hardware Mechanism (dynamic scheduling) -Scoreboarding -limited out-of-order issue/completion -centralized control -Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm) -Micro dataflow Advantage: exact runtime information -Load/cache miss -resolve storage location related dependence

27 Scoreboarding Named after CDC 6600 Effective when there are enough resources and no data dependencies Out-of-order execution Issue: checking scoreboard and WAW will cause a stall Read operand -checking availability of operand and resolve RAW dynamically at this step -WAR will not cause stall EX Write result -WAR will be checked and will cause stall
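
The hazard checks made at each scoreboard stage can be sketched as three predicates. This is a simplified illustration, assuming the scoreboard state is given as plain sets. The instruction dictionaries, unit names and function names are hypothetical.

```python
def issue_ok(instr, busy_units, pending_dests):
    """Issue: stall if the functional unit is busy or on a WAW hazard."""
    return instr["unit"] not in busy_units and instr["dest"] not in pending_dests


def read_ok(instr, earlier_pending_dests):
    """Read operands: wait while an earlier instruction still has to write a source (RAW)."""
    return all(src not in earlier_pending_dests for src in instr["srcs"])


def write_ok(instr, unread_sources_of_earlier):
    """Write result: stall while an earlier instruction has not yet read this register (WAR)."""
    return instr["dest"] not in unread_sources_of_earlier


divd = {"unit": "fp_div", "dest": "F0", "srcs": ["F2", "F4"]}     # executing
addd = {"unit": "fp_add", "dest": "F10", "srcs": ["F0", "F8"]}    # waits for F0
multd = {"unit": "fp_mult", "dest": "F8", "srcs": ["F2", "F6"]}   # wants to write F8
ld = {"unit": "integer", "dest": "F0", "srcs": ["R1"]}            # second write to F0

busy = {"fp_div", "fp_add", "fp_mult"}
pending = {"F0", "F10", "F8"}

waw_stall = not issue_ok(ld, busy, pending)          # F0 already has a pending write
raw_wait = not read_ok(addd, {"F0"})                 # DIVD has not produced F0 yet
war_stall = not write_ok(multd, {"F0", "F8"})        # ADDD has not read F8 yet
print(waw_stall, raw_wait, war_stall)
```

Each check maps to one stage on the slide: WAW at issue, RAW at read operands, WAR at write result.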

28 [Figure: The basic structure of a DLX processor with a scoreboard. Registers, integer unit, FP add, FP divide and FP mult units are connected by data buses, with control/status lines to the scoreboard]

29 Scoreboarding [CDC6600, Thornton70], [WeissSmith84] A bit (called the “scoreboard bit”) is associated with each register; bit = 1: the register is reserved by a write An instruction that has a source operand with bit = 1 will be issued, but put into an instruction window, with the register identifier to denote the “to-be-written” operand Copies of valid operands are also read with the pending inst (solves anti-dependence) When the missing operand is finally written, the register id in the pending inst will be compared and the value written, so it can be issued An inst whose result register R is reserved will stall, so the output dependence (WAW) is correctly handled by the stall!

30 Micro Data Flow Fundamental Concepts – “Data Flow” Instructions can only be fired when operands are available – Single assignment and register renaming Implementation – Tomasulo’s Algorithm – Reorder Buffer

31 Renaming/Single Assignment
  Before renaming:
    R0 = R2 / R4       (1)
    R6 = R0 + R8       (2)
    R1[0] = R6         (3)
    R8 = R10 - R14     (4)
    R6 = R10 * R8      (5)
  After renaming:
    R0 = R2 / R4       (1)
    S = R0 + R8        (2)
    R1[0] = S          (3)
    T = R10 - R14      (4)
    R6 = R10 * T       (5)
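
Renaming can be automated with a map from each architectural register to its most recent name. The sketch below renames every destination to a fresh name P0, P1, ..., which is more aggressive than the slide (the slide only introduces S and T). The representation and names are illustrative assumptions.

```python
from itertools import count


def rename(instructions):
    """Give each destination a fresh name; sources use the latest mapping.

    Instructions are (destination, sources) pairs; a store has destination None.
    """
    fresh = count()
    current = {}                      # architectural register -> current name
    renamed = []
    for dest, sources in instructions:
        new_sources = [current.get(s, s) for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = f"P{next(fresh)}"
            current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


program = [("R0", ["R2", "R4"]),      # (1) R0 = R2 / R4
           ("R6", ["R0", "R8"]),      # (2) R6 = R0 + R8
           (None, ["R6", "R1"]),      # (3) R1[0] = R6
           ("R8", ["R10", "R14"]),    # (4) R8 = R10 - R14
           ("R6", ["R10", "R8"])]     # (5) R6 = R10 * R8
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, no two instructions share a destination and instruction (4) no longer overwrites the R8 read by instruction (2), so only the true dependencies remain.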

32 Baseline Superscalar Model [Figure: pipeline stages: Inst Fetch, Inst Decode, Renaming, Issue Window (Wake Up, Select), Register File, Execution / Bypass, Data Cache Access, Register Write & Instruction Commit]

33 Micro Data Flow Conceptual Model A → R1; R1 * B → R2; R2 / C → R1; R4 + R1 → R4 [Figure: dataflow graph with Load, *, / and + nodes, inputs A, B and C, operand registers OR1 to OR6 and registers R1 to R4]

34 ROB Stages Issue – Dispatch an instruction from the instruction queue – Reserve a ROB entry and a reservation station Execute – Stall for operands – RAW resolved Write Result – Write back to any reservation stations waiting for it and to the ROB Commit – Normal Commit: Update registers – Store Commit: Update memory – Mispredicted Branch: Flush the ROB and restart execution
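
The key property of the commit stage is that results may be produced out of order but are retired in program order. A minimal sketch of that behavior follows, assuming instructions are identified by labels. Branch flushes and stores are left out, and the names are illustrative.

```python
from collections import deque


def commit_trace(issue_order, completion_order):
    """For each finishing instruction, list what the ROB commits at that point.

    Entries enter the ROB in issue (program) order. Only the entry at the
    head may commit, and only once its result has been written.
    """
    rob = deque(issue_order)
    done = set()
    trace = []
    for finished in completion_order:
        done.add(finished)
        committed_now = []
        while rob and rob[0] in done:
            committed_now.append(rob.popleft())
        trace.append((finished, committed_now))
    return trace


trace = commit_trace(["I1", "I2", "I3", "I4"], ["I3", "I1", "I2", "I4"])
for step in trace:
    print(step)
```

Here I3 finishes first but has to wait in the ROB until I1 and I2 have committed.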

35 Tomasulo’s Algorithm Tomasulo, R.M. “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM J. of R&D 11:1 (Jan. 1967) IBM 360/91 (three years after the CDC 6600 and just before caches) Features: CDB: Common Data Bus Reservation Units: Hardware features which allow the fetch, use and reuse of data as soon as it becomes available. It allows register renaming and it is decentralized in nature (as opposed to Scoreboarding)

36 Tomasulo’s Algorithm Control and Buffers distributed with Functional Units. HW renaming of registers CDB broadcasting Load / Store buffers → Functional Units Reservation Stations: – Hazard detection and Instruction control – 4-bit tag field to specify which station or buffer will produce the result Register Renaming – Tag Assigned on IS – Tag discarded after write back

37 Comparison Scoreboarding – Centralized Data structure and control – Register bit: simple, low cost – Structural hazards solved by FU – Solve RAW by register bit – Solve WAR in write – Solve WAW: stalls on issue Tomasulo’s Algorithm – Distributed control – Tagged Registers + register renaming – Structural Hazard stalls on Reservation Station – Solve RAW by CDB – Solve WAR by copying operand to Reservation Station – Solve WAW by renaming – Limited: CDB Broadcast 1 per cycle

38 The Architecture [Figure: FP operations arrive from the instruction unit into the FP queue; load buffers are fed from memory; the FP registers, operation bus and operand buses feed the reservation stations of the FP adders and FP multipliers; store buffers write to memory; results return on the common data bus (CDB)] - 3 Adders - 2 Multipliers - Load buffers (6) - Store buffers (3) - FP Queue - FP registers - CDB: Common Data Bus

39 Tomasulo’s Algorithm’s Steps Issue -Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag -If no empty reservation station is found, stall and wait for one to become free -Renaming is performed here and WAW and WAR are resolved Execute – If operands are not ready, monitor the CDB for them – RAWs are resolved – When they are ready, execute the op in the FU Write Back – Send the results to the CDB and update registers and the Store buffers – Store Buffers will write to memory during this step Exception Behavior – During Execute: No instruction is allowed to be issued until all branches before it have completed

40 Tomasulo’s Algorithm Note that: Upon entering a reservation station, source operands are either filled with values or renamed The new names are in 1-to-1 correspondence with FU names Question: How are output dependencies resolved? Two pending writes to a register How to determine that a read will get the most recent value if they complete out of order
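
One way to see the answer to the question above is to model the register status table. Each register remembers only the tag of its latest pending writer, and a CDB broadcast updates a register only if its tag still matches. The sketch below is an illustration under that assumption; the class and the station names Mult1 and Add1 are hypothetical.

```python
class RegisterStatus:
    def __init__(self):
        self.value = {}    # register -> last value written back
        self.tag = {}      # register -> station that will produce its next value

    def issue_write(self, dest, station):
        # A later writer simply replaces the tag of an earlier pending writer.
        self.tag[dest] = station

    def read(self, reg):
        # A consumer captures the pending producer's tag, or the value if none.
        if reg in self.tag:
            return ("tag", self.tag[reg])
        return ("value", self.value.get(reg))

    def broadcast(self, station, result):
        # CDB broadcast: only registers still waiting on this tag take the value.
        for reg, waiting_on in list(self.tag.items()):
            if waiting_on == station:
                self.value[reg] = result
                del self.tag[reg]


regs = RegisterStatus()
regs.issue_write("F0", "Mult1")       # older, slow write to F0
early_reader = regs.read("F0")        # will take its operand from Mult1
regs.issue_write("F0", "Add1")        # younger write to F0 replaces the tag
late_reader = regs.read("F0")         # will take its operand from Add1
regs.broadcast("Add1", 2.0)           # younger instruction finishes first
regs.broadcast("Mult1", 9.0)          # older result no longer matches F0's tag
print(early_reader, late_reader, regs.value["F0"])
```

Readers are bound to a producer by tag at issue time, so each gets the value that program order requires, and the register ends up holding the result of the last writer even though it completed first.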

41 Features of T. Alg. The value of an operand (for any inst already issued in a reservation station) will be read from CDB. it will not be read from the reg. field. Instructions can be issued without even the operands produced (but know they are coming from CDB)

42 Memory Models

43 Programming Execution Models A set of rules to create programs Message Passing Model – De Facto Multicomputer Programming Model – Multiple Address Space – Explicit Communication / Implicit Synchronization Shared Memory Models – De Facto Multiprocessor Programming Model – Single Address Space – Implicit Communication / Explicit Synchronization

44 Shared Memory Execution Model Thread Virtual Machine – Thread Model: A set of rules for thread creation, scheduling and destruction – Memory Model: A group of rules that deals with data replication, coherency, and memory ordering – Synchronization Model: Rules that deal with access to shared data Private Data: Data that is not visible to other threads Shared Data: Data that can be accessed by other threads

45 Grand Challenge Problems Shared Memory Multiprocessor → Effective at a scale of several thousand units Optimize and compile parallel applications Main Areas: Assumptions about – Memory Coherency – Memory Consistency

46 Memory Consistency & Coherence

47 Memory [Cache] Coherency The Problem [Figure: processors P1, P2 and P3 sharing memory location U, initially U:5, with numbered accesses] What value will P1 and P2 read?

48 MCM Category of Access As presented in Mosberger 93 Memory Access → Private | Shared; Shared → Competing | Non-Competing; Competing → Synchronization | Non-synchronization; Synchronization → Acquire | Release; Acquire → Exclusive | Non-exclusive Uniform vs. Hybrid

49 Conventional MCM (10/03/2007, ELEG652-07F) Sequential Consistency – “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79]

50 Memory Consistency Problem
    B = 0              A = 0
    …                  …
    A = 1              B = 1
    L1: print B        L2: print A
  Assume that L1 and L2 are issued only after the other 4 instructions have been completed. What are the possible values that are printed on the screen? Is 0, 0 a possible combination? The MCM: A software and hardware contract
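
Under sequential consistency the question can be answered by brute force. The sketch below assumes the slide shows two threads: one executing B = 0, A = 1 and then L1, the other executing A = 0, B = 1 and then L2, with both prints running after all four writes. It enumerates every interleaving that preserves program order. The thread split is an assumption read off the slide layout.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps the order within each list."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


thread1 = [("B", 0), ("A", 1)]     # followed by L1: print B
thread2 = [("A", 0), ("B", 1)]     # followed by L2: print A

outcomes = set()
for order in interleavings(thread1, thread2):
    memory = {}
    for variable, value in order:
        memory[variable] = value
    # both prints run after the four writes, so they see the final values
    outcomes.add((memory["B"], memory["A"]))

print(sorted(outcomes))            # (value printed at L1, value printed at L2)
```

With these assumptions the combination 0, 0 never appears. Seeing it on a real machine would mean the memory model is weaker than sequential consistency.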

51 MCM Attributes Memory Operations Location of Access – Near memory (cache, near memory modules, etc.) vs. far memory Direction of Access – Write or Read Value Transmitted in Access – Size Causality of Access – Check if two accesses are “causally” related and, if they are, in which order they are completed Category of Access – Static Property of Accesses

52 Synchronization and Its Cost

53 Synchronization The orchestration of two or more threads (or processes) to complete a task in a correct manner and to avoid any data races Data Race or Race Condition – “There is an anomaly of concurrent accesses by two or more threads to a shared memory and at least one of the accesses is a write” Atomicity and / or serializability

54 Atomicity Atomic → From the Greek “Atomos”, which means indivisible An “All or None” scheme An instruction (or a group of them) will appear as if it was (they were) executed in a single try – All side effects of the instruction(s) in the block are seen in their totality or not at all Side effects → Writes and (Causal) Reads to the variables inside the atomic block

55 Synchronization Applied to Shared Variables Synchronization might enforce ordering or not High level Synchronization types – Semaphores – Mutex – Barriers – Critical Sections – Monitors – Condition Variables

56 Types of (Software) Locks The Spin Lock Family The Simple Test and Set Lock – Polls a shared Boolean variable: A binary semaphore – Uses Fetch and Φ operations to operate on the binary semaphore – Expensive!!!! Wastes bandwidth Generates extra bus transactions – The test-and-test-and-set approach: just poll while the lock is in use

57 Types of (Software) Locks The Spin Lock Family Delay based Locks – Spin Locks in which a delay has been introduced in testing the lock – Constant delay – Exponential back-off: best results – The test-and-test-and-set scheme is not needed

58 Types of (Software) Locks The Spin Lock Family Pseudo code:
    typedef enum { UNLOCKED, LOCKED } lock_t;

    void acquire_lock(lock_t *L) {
        int delay = 1;
        /* test_and_set is assumed to return nonzero when the lock was acquired */
        while (!test_and_set(L, LOCKED)) {
            sleep(delay);
            delay *= 2;        /* exponential back-off */
        }
    }

    void release_lock(lock_t *L) {
        *L = UNLOCKED;
    }

59 Types of (Software) Locks The Ticket Lock Reduce the # of Fetch and Φ operations – Only one per lock acquisition Strongly fair lock – No starvation A FIFO service Implementation: Two counters – A request counter and a release counter

60 Types of (Software) Locks The Ticket Lock Pseudocode:
    unsigned int next_ticket = 0;
    unsigned int now_serving = 0;

    void acquire_lock() {
        /* atomically returns the old value of next_ticket and adds one */
        unsigned int my_ticket = fetch_and_increment(&next_ticket);
        while (1) {
            sleep(my_ticket - now_serving);    /* proportional back-off */
            if (now_serving == my_ticket)
                return;
        }
    }

    void release_lock() {
        now_serving = now_serving + 1;
    }

61 Types of (Software) Locks The Array Based Queue Lock Contention on the release counter Cache Coherence and memory traffic – Invalidation of the counter variable and the request to a single memory bank Two elements – An array and a tail pointer that indexes the array – The array is as big as the number of processors – Fetch and store → Address of the array element – Fetch and increment → Tail pointer FIFO ordering

62 Types of (Software) Locks The Queue Locks It uses too much memory – Linear space (relative to the number of processors) per lock. Array – Easy to implement Linked List: QNODE – Cache management

63 Types of (Software) Locks The MCS Lock Characteristics – FIFO ordering – Spins on locally accessible flag variables – Small amount of space per lock – Works equally well on machines with and without coherent caches Similar to the QNODE implementation of queue locks – QNODES are assigned to local memory – Threads spin on local memory

64 MCS: How it works? Each processor enqueues its own private lock variable into a queue and spins on it – key: spin locally CC model: spin in local cache DSM model: spin in local private memory – No contention On lock release, the releaser unlocks the next lock in the queue – Only have bus/network contention on actual unlock – No starvation (order of lock acquisitions defined by the list)

65 MCS Lock Requires atomic instructions: – compare-and-swap – fetch-and-store If there is no compare-and-swap – an alternative release algorithm: extra complexity, loss of strict FIFO ordering, theoretical possibility of starvation Detail: Mellor-Crummey and Scott’s 1991 paper

66 Implementation Modern Alternatives Fetch and Φ operations – They are restrictive – Not all architectures support all of them Problem: A general single atomic op is hard!!! Solution: Provide two primitives to generate atomic operations Load Linked and Store Conditional – Remember the PowerPC lwarx and stwcx. instructions

67 Performance Penalty Example Suppose there are 10 processors on a bus that each try to lock a variable simultaneously. Assume that each bus transaction (read miss or write miss) is 100 clock cycles long. You can ignore the time of the actual read or write of a lock held in the cache, as well as the time the lock is held (they won’t matter much!) Determine the performance penalty.

68 Answer It takes over 12,000 cycles in total for all processors to pass through the lock! Note: the contention for the lock and the serialization of the bus transactions. See the example on p. 596, Hennessy/Patterson, 3rd Ed.
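
One way to arrive at a figure of this size is to count bus transactions. The count below assumes a load-linked / store-conditional spin lock in which, while i processors contend, each of them issues one load linked and one store conditional, and the winner later issues one store to release the lock. That counting rule is an assumption made here for illustration; the slide itself only states the result.

```python
def lock_bus_cycles(processors, cycles_per_transaction):
    """Return (bus transactions, total cycles) until every processor got the lock."""
    transactions = 0
    for contenders in range(processors, 0, -1):
        # one load linked and one store conditional per contender,
        # plus one store by the winner to release the lock
        transactions += 2 * contenders + 1
    return transactions, transactions * cycles_per_transaction


transactions, total_cycles = lock_bus_cycles(10, 100)
print(transactions, total_cycles)
```

The sum is n^2 + 2n transactions for n processors, so the cost grows quadratically with the number of contenders. For 10 processors and 100-cycle transactions this gives 120 transactions and 12,000 cycles.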

