Multi-Threaded Architectures - Sima, Fountain and Kacsuk, Chapter 16 - CSE462


1 Multi-Threaded Architectures (Sima, Fountain and Kacsuk, Chapter 16) CSE462

2 Memory and Synchronization Latency
(Slides: David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley, 1997.)
- The scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
- The overall solution is well known: do something else whilst waiting
- Remote memory accesses are much slower than local ones, with a varying delay depending on network traffic and memory traffic

3 Processor Utilization
- Utilization = P/T, where P is the time spent processing and T is the total time
- Equivalently, utilization = P/(P + I + S), where I is the time spent waiting on other tasks and S is the time spent switching tasks
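A quick numeric sketch of the second utilization formula; the cycle counts below are made-up values chosen purely for illustration.

```python
# Utilization U = P / (P + I + S), where P = time spent processing,
# I = time waiting on other tasks, S = time spent switching tasks.
# The numbers used below are illustrative, not measurements.
def utilization(p, i, s):
    """Fraction of total time the processor spends doing useful work."""
    return p / (p + i + s)

# 80 units of work, 15 waiting, 5 switching: 80% utilization.
print(utilization(80, 15, 5))  # 0.8
```

Note how any time spent waiting (I) or switching (S) directly erodes utilization, which is why the chapter's architectures try to overlap waiting with other work.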

4 Basic Ideas - Multithreading
- Fine grain: task switch every cycle
- Coarse grain: task switch every n cycles (e.g. when a thread blocks), paying a task-switch overhead

5 Design Space of Multi-Threaded Architectures
- Computational model
  - Von Neumann (sequential control flow)
  - Hybrid von Neumann/dataflow
  - Parallel control flow based on parallel control operators
  - Parallel control flow based on control tokens
- Granularity
  - Fine grain
  - Coarse grain
- Memory organization
  - Physical shared memory
  - Distributed shared memory
  - Cache-coherent distributed shared memory
- Number of threads per processor
  - Small (4-10)
  - Middle (10-100)
  - Large (over 100)

6 Classification of Multi-Threaded Architectures
- Von Neumann based architectures
  - HEP
  - Tera
  - MIT Alewife & Sparcle
- Hybrid von Neumann/dataflow architectures
  - RISC-like: P-RISC, *T
  - Decoupled: USC, McGill MGDA & SAM
  - Macro dataflow architectures: MIT Hybrid Machine, EM-4

7 Computational Models

8 Sequential Control Flow (von Neumann)
- Flow of control and data are separated
- Instructions execute sequentially (or at least with sequential semantics - see Chapter 7)
- Control flow is changed with JUMP/GOTO/CALL instructions
- Data is stored in rewritable memory; the flow of data does not affect execution order

9 Sequential Control Flow Model: R = (A - B) * (B + 1)
  L1: m1 = A - B
  L2: m2 = B + 1
  L3: R = m1 * m2
  (control flows sequentially from L1 to L3)
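The slide's three-address sequence can be run directly; the statement order alone determines execution order. A and B are given invented example values here.

```python
# The slide's sequential control flow for R = (A - B) * (B + 1):
# in the von Neumann model, program order alone fixes execution order.
A, B = 5, 2       # example inputs, not from the slide
m1 = A - B        # L1
m2 = B + 1        # L2
R = m1 * m2       # L3
print(R)  # 9
```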

10 Dataflow
- Control is tied to data: an instruction "fires" when its data is available, otherwise it is suspended
- The order of instructions in the program has no effect on execution order (cf. von Neumann)
- No shared rewritable memory: write-once semantics
- Code is stored as a dataflow graph; data is transported as tokens
- Parallelism occurs if multiple instructions can fire at the same time; this needs a parallel processor
- Nodes are self-scheduling
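A minimal sketch of the firing rule (the `Node` class and its names are illustrative, not from the book): each node fires as soon as all of its input tokens have arrived, so execution order follows data availability rather than program order.

```python
# Illustrative dataflow node: collects input tokens and "fires" (computes)
# the moment all operands are present. Names are invented for this sketch.
class Node:
    def __init__(self, op, arity):
        self.op, self.arity, self.tokens = op, arity, []
        self.result = None

    def receive(self, value):
        self.tokens.append(value)
        if len(self.tokens) == self.arity:   # all operands present: fire
            self.result = self.op(*self.tokens)

# R = (A - B) * (B + 1) with example values A = 5, B = 2
sub = Node(lambda a, b: a - b, 2)
add = Node(lambda a, b: a + b, 2)
mul = Node(lambda a, b: a * b, 2)

sub.receive(5); sub.receive(2)                     # fires: 3
add.receive(2); add.receive(1)                     # fires: 3
mul.receive(sub.result); mul.receive(add.result)   # fires: 9
print(mul.result)  # 9
```

The sub and add nodes here could fire in either order, or simultaneously on a parallel machine; only the multiply must wait for both.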

11 Dataflow - arbitrary execution order: R = (A - B) * (B + 1) (figure: dataflow graph of -, + and * nodes)

12 Dataflow - arbitrary execution order, continued: R = (A - B) * (B + 1)

13 Dataflow - parallel execution: R = (A - B) * (B + 1) (figure: the - and + nodes fire simultaneously)

14 Implementation
- The dataflow model requires a very different execution engine
- Data must be stored in a special matching store
- Instructions must be triggered when both operands are available
- Parallel operations must be scheduled to processors dynamically; we don't know a priori when they will be available
- Instruction operands are pointers: to an instruction, plus an operand number
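A sketch of the matching store idea (function and key names are invented for illustration): tokens are keyed by their destination instruction, and a two-operand instruction is dispatched only once its partner token is already waiting.

```python
# Illustrative matching store for dyadic instructions: the first token to
# arrive waits under its destination key; the second completes the pair.
def match(store, dest, operand_no, value):
    """Return (operand1, operand2) when both have arrived, else None."""
    if dest in store:
        other_no, other_val = store.pop(dest)
        pair = [None, None]
        pair[operand_no - 1] = value
        pair[other_no - 1] = other_val
        return tuple(pair)
    store[dest] = (operand_no, value)
    return None

store = {}
assert match(store, "L4", 1, 3) is None   # first token waits in the store
print(match(store, "L4", 2, 4))           # partner arrives: (3, 4)
```

Real machines implement this with associative hardware precisely because this lookup sits on the critical path of every instruction.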

15 Dataflow Model of Execution (figure: A and B are computed at L1 and L2, and result tokens are addressed to instruction/operand-number pairs such as L2/2, L3/1, L4/1, L4/2 and L6/1)

16 Parallel Control Flow
- Sometimes called macro dataflow: data flows between blocks of sequential code
- Has the advantages of both dataflow and von Neumann:
  - Context switch overhead is reduced
  - The compiler can schedule instructions statically
  - No need for a fast matching store
- Requires additional control instructions: FORK/JOIN

17 Macro Dataflow (Hybrid Control/Dataflow): R = (A - B) * (B + 1)
  L1: FORK L4
  L2: m1 = A - B
  L3: GOTO L5
  L4: m2 = B + 1
  L5: JOIN 2
  L6: R = m1 * m2
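The FORK/JOIN pattern above can be sketched with ordinary threads; this is a rough software analogy, not the hardware mechanism, and the values of A and B are invented.

```python
# FORK starts a second sequential block; JOIN 2 waits for both control
# tokens before the final multiply runs. Sketched here with Python threads.
import threading

A, B = 5, 2        # example inputs
results = {}

def block_sub():   # L2: m1 = A - B
    results["m1"] = A - B

def block_add():   # L4: m2 = B + 1
    results["m2"] = B + 1

t = threading.Thread(target=block_add)   # L1: FORK L4
t.start()
block_sub()                              # fall through to L2
t.join()                                 # L5: JOIN 2 - both blocks done
R = results["m1"] * results["m2"]        # L6: R = m1 * m2
print(R)  # 9
```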

18 Issues for Hybrid Dataflow
- Blocks of sequential instructions need to be large enough to absorb the overheads of context switching
- Data memory is the same as in MIMD: it can be partitioned or shared
- Synchronization instructions are required (semaphores, test-and-set)
- Control tokens are required to synchronize threads

19 Some Examples

20 Denelcor HEP
- Designed to tolerate memory latency
- Fine-grain interleaving of threads: the processor pipeline contains 8 stages, and each time step a new thread enters the pipeline
- Threads are taken from the Process Status Word (PSW) queue; after a thread is taken from the queue, its instruction and operands are fetched
- When an instruction is executed, another entry is placed on the PSW queue
- Threads are interleaved at the instruction level

21 Denelcor HEP (continued)
- Memory latency toleration is handled by the Scheduler Function Unit (SFU)
- Memory words are tagged as full or empty
- Attempting to read an empty word suspends the current thread; its PSW entry is moved to the SFU
- When the data is written, the entry is taken from the SFU and placed back on the PSW queue
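The full/empty protocol can be modelled in a few lines (the `TaggedWord` class and thread IDs are invented for illustration): a read of an empty word suspends the reader, and the write that fills the word releases every waiter back to the run queue.

```python
# Toy model of a HEP-style full/empty tagged memory word. A read of an
# empty word queues the reader (standing in for the SFU); the filling
# write returns the released readers (standing in for the PSW queue).
class TaggedWord:
    def __init__(self):
        self.full, self.value, self.waiters = False, None, []

    def read(self, thread_id):
        if not self.full:
            self.waiters.append(thread_id)   # thread suspends into the SFU
            return None
        return self.value

    def write(self, value):
        self.full, self.value = True, value
        released, self.waiters = self.waiters, []
        return released                      # back onto the PSW queue

w = TaggedWord()
assert w.read("t1") is None   # word empty: t1 suspends
print(w.write(42))            # ['t1'] resumed by the write
print(w.read("t2"))           # 42
```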

22 Synchronization on the HEP
- All registers have a Full/Empty/Reserved bit
- Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
- Thread synchronization is busy-wait, but other threads can run in the meantime

23 HEP Architecture (figure: the PSW queue and increment control drive instruction fetch from program memory; operand handlers, registers and a matching unit feed function units 1 to N; the SFU handles requests to/from data memory)

24 HEP Configuration
- Up to 16 processors and up to 128 data memories, connected by a high-speed switch
- Limitations:
  - A thread can have only 1 outstanding memory request
  - Thread synchronization puts bubbles in the pipeline
  - The maximum of 64 threads causes problems for software (loops need to be throttled)
  - If parallelism is lower than 8, full utilisation is not possible

25 MIT Alewife Processor
- 512 processors in a 2-dimensional mesh
- Sparcle processor
- Physically distributed memory, logically shared memory
- Hardware-supported cache coherence
- Hardware-supported user-level message passing
- Multi-threading

26 Threading in Alewife
- Coarse-grained multithreading: the pipeline works on a single thread as long as no remote memory access or synchronization is required
- Can exploit register optimization in the pipeline
- Integrates multi-threading with hardware-supported cache coherence

27 The Sparcle Processor
- An extension of the Sun SPARC architecture
- Tolerant of memory latency
- Fine-grained synchronisation
- Efficient user-level message passing

28 Fast Context Switching
- SPARC has 8 overlapping register windows; Sparcle uses them in pairs to represent 4 independent, non-overlapping contexts
  - Three for user threads, one for traps and message handlers
- Each context contains 32 general-purpose registers plus
  - PSR (Processor State Register)
  - PC (Program Counter)
  - nPC (next Program Counter)
- Thread states:
  - Active
  - Loaded: state stored in registers, can become active
  - Ready: not suspended and not loaded
  - Suspended
- Thread switching is fast if one thread is active and the other is loaded, but the pipeline must be flushed (cf. HEP)
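A toy model of the loaded-context idea (the data layout is invented for illustration): switching threads just repoints CP at another register frame, so no registers are saved to or restored from memory.

```python
# Rough model of Sparcle's four hardware contexts: a context switch moves
# the CP (context pointer) to another frame instead of spilling registers.
contexts = [{"PC": 0, "PSR": 0, "regs": [0] * 32} for _ in range(4)]
cp = 0                                # context pointer: active frame index

def context_switch(new_cp):
    """Switch threads by repointing CP; no register save/restore needed."""
    global cp
    cp = new_cp
    return contexts[cp]

contexts[0]["regs"][1] = 111          # thread 0's state stays in its frame
active = context_switch(2)            # switch to the thread loaded in frame 2
print(cp)                             # 2
print(contexts[0]["regs"][1])         # 111 - untouched by the switch
```

This is why only loaded threads switch quickly: a ready-but-unloaded thread must first have its state brought into one of the four frames.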

29 Sparcle Architecture (figure: four register frames 0:R0-0:R31 to 3:R0-3:R31, each with its own PSR, PC and nPC; the CP pointer selects the frame of the active thread)

30 MIT Alewife and Sparcle (figure: the Sparcle processor and FPU connect through the CMMU to a 64-kbyte cache, main memory and a network router; NR = network router, CMMU = communication & memory management unit, FPU = floating-point unit)

31 From here on, the figures are drawn by Tim

32 Figure 16.10: Thread states in Sparcle (figure: four loaded contexts, each a PC/PSR frame plus a register frame 0:R0-0:R31 to 3:R0-3:R31, with global registers G0-G7; CP marks the active thread; unloaded threads live in memory on ready and suspended queues)

33 Figure 16.11: Structure of a typical static dataflow PE (figure: a fetch unit takes enabled instructions from the instruction queue and activity store and dispatches them to function units 1 to N; an update unit writes results back and exchanges tokens to/from other PEs)

34 Figure 16.12: Structure of a typical tagged-token dataflow PE (figure: a matching unit with a matching store pairs tokens from the token queue; a fetch unit reads instruction/data memory and dispatches to function units 1 to N; an update unit sends tokens to other PEs)

35 Figure 16.13: Organization of the I-structure storage (figure: data storage cells k to k+4 with presence bits A = absent, P = present, W = waiting; a waiting cell holds a deferred-read list of tags, e.g. tag X, tag Z, tag Y)
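The presence-bit protocol in the figure can be sketched as follows (class and method names are illustrative): reads of an absent element are deferred under the W state, and the single permitted write satisfies them all at once.

```python
# Toy I-structure element with the figure's states: A = absent,
# P = present, W = waiting (deferred readers queued on the cell).
class IStructure:
    def __init__(self, n):
        self.state = ["A"] * n
        self.cell = [None] * n       # datum, or list of deferred readers

    def iread(self, k, continuation):
        if self.state[k] == "P":
            return self.cell[k]
        if self.state[k] == "A":
            self.state[k], self.cell[k] = "W", []
        self.cell[k].append(continuation)    # defer the read
        return None

    def iwrite(self, k, datum):
        assert self.state[k] != "P", "write-once semantics violated"
        deferred = self.cell[k] if self.state[k] == "W" else []
        self.state[k], self.cell[k] = "P", datum
        return deferred                      # continuations to resume

s = IStructure(5)
assert s.iread(2, "tag X") is None   # absent: read deferred, state -> W
print(s.iwrite(2, 7))                # ['tag X'] resumed with the datum
print(s.iread(2, None))              # 7
```

The write-once check mirrors the dataflow rule from slide 10: there is no shared rewritable memory, so a second write to a present cell is an error.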

36 Figure 16.14: Coding in explicit token-store architectures, parts (a) and (b) (figure: a small graph of +, - and * nodes, and the firing of one node)

37 Figure 16.14: Coding in explicit token-store architectures, part (c) (figure: instruction memory holds SUB, ADD and MUL with frame offsets and destination offsets such as +1, +2 and +7; frame memory slots FP to FP+4 carry presence bits and operand values; IP and FP point at the current instruction and frame)

38 Figure 16.15: Structure of a typical explicit token-store dataflow PE (figure: a fetch unit computes an effective address into frame memory, tests presence bits, performs the frame-store operation and dispatches to function units 1 to N; a form-token unit sends tokens to/from other PEs)

39 Figure 16.16: Scale of von Neumann/dataflow architectures, from pure dataflow through macro dataflow and decoupled hybrid dataflow to RISC-like hybrid and von Neumann

40 Figure 16.17: Structure of a typical macro dataflow PE (figure: a matching unit and token queue feed a fetch unit backed by instruction/frame memory; an internal control pipeline provides program counter-based sequential execution within the function unit; a form-token unit connects to/from other PEs)

41 Figure 16.18: Organization of a PE in the MIT Hybrid Machine (figure: instruction fetch driven by PC and FBR, a decode unit, operand fetch from frame memory and registers, and an execution unit; an enabled continuation queue (token queue) feeds continuations back via a +1 increment path; connects to/from global memory)

42 Figure 16.19: Comparison of (a) SQ and (b) SCB macro nodes (figure: the same graph of nodes l1 to l6 with inputs a, b and c, partitioned as SQ1/SQ2 and as SCB1/SCB2)

43 Figure 16.20: Structure of the USC Decoupled Architecture (figure: clusters, each containing cluster graph memory, a GC, a DFGE, a CE and a CC linked by AQ and RQ queues; the clusters connect to the network through the graph virtual space and the computation virtual space)

44 Figure 16.21: Structure of a node in the SAM (figure: main memory and the APU, LEU, ASU and SEU units, coordinated by fire/done signals; connects to/from the network)

45 Figure 16.22: Structure of the P-RISC processing element (figure: a token queue and frame/instruction memory feed an internal control pipeline of a conventional RISC processor - instruction fetch, operand fetch, function unit, operand store; load/store messages go to/from other PEs' memory)

46 Figure 16.23: Transformation of dataflow graphs into control flow graphs: (a) a dataflow graph of -, + and * nodes; (b) the equivalent control flow graph with explicit fork L1 and join operations

47 Figure 16.24: Structure of a *T node (figure: a data processor (dIP, dFP, dV1, dV2) and a synchronization coprocessor (sIP, sFP, sV1, sV2) share a continuation queue and local memory; a remote memory request coprocessor and message formatter/queues connect through the network interface to the network)

