MULTIPLEX: UNIFYING CONVENTIONAL AND SPECULATIVE THREAD-LEVEL PARALLELISM ON A CHIP MULTIPROCESSOR
Presented by: Ashok Venkatesan
Authors: Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi and T. N. Vijaykumar

Outline
• Background
• Thread Level Parallelism (TLP)
• Explicit & Implicit TLP
• An Example
• Multiplex
• Threading Model
• MUCS protocol
• Key Performance Factors
• Performance Analysis
• Conclusion

Thread Level Parallelism
• ILP Wall
  ◦ Increasing CPI with increasing clock rates
  ◦ Limited ILP in applications
  ◦ Insufficient memory locality
• Using TLP
  ◦ Increased granularity of parallelism
  ◦ Exploitation of multi-cores
• Threads
  ◦ A logical sub-process that carries its own state
  ◦ State – instructions, data, PC, register file, stack, etc.

Explicit & Implicit TLP
• Explicit TLP
  ◦ The programmer explicitly partitions the program into threads, and an API is used to dispatch and execute them on multiple cores
  ◦ Static – defined in the program
  ◦ Main overhead – thread dispatch
• Implicit or Speculative TLP
  ◦ Threads are peeled off from the program's sequential execution stream by hardware prediction
  ◦ Dynamic – runtime prediction
  ◦ Main overhead – speculative state overflow

Example – Exec Explicit Threads
• Data dependence is resolved here using a barrier
• Threads are dispatched using a fork (system API) call
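The explicit model on this slide can be illustrated with a minimal Python sketch. It is not the slide's original example (which was a figure); the thread count, the data, and the names `worker` and `run` are assumptions chosen for illustration. Each thread is dispatched explicitly, and a barrier enforces the dependence between the two phases.

```python
import threading

N_THREADS = 4                      # assumed: one thread per CPU of a 4-CPU CMP
data = list(range(16))             # illustrative input, not from the slides
partial = [0] * N_THREADS          # phase-1 results, one slot per thread
totals = [0] * N_THREADS           # phase-2 results, one slot per thread
barrier = threading.Barrier(N_THREADS)

def worker(tid):
    # Phase 1: each thread sums its own statically assigned slice.
    chunk = len(data) // N_THREADS
    partial[tid] = sum(data[tid * chunk:(tid + 1) * chunk])
    # The barrier resolves the cross-thread data dependence:
    # phase 2 reads every thread's phase-1 result.
    barrier.wait()
    # Phase 2: safe to read all partial sums now.
    totals[tid] = sum(partial)

def run():
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
    for t in threads:
        t.start()                  # explicit dispatch, standing in for "fork"
    for t in threads:
        t.join()
    return totals
```

The partitioning is fixed in the program text, which is what the earlier slide calls static; the cost paid at run time is the dispatch and the wait at the barrier.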

Example – Exec Implicit Threads
• Both data dependence and dispatch are handled by a hardware predictor

Multiplex
• Unifies explicit and implicit threading on a CMP
• Obviates the need for serializing unanalyzable program segments by using speculative TLP
• Avoids implicit threading's speculation overhead and performance loss in compiler-analyzable program segments by using explicit threading
• Implements a single snoopy bus protocol to unify cache coherence with memory renaming and disambiguation

Anatomy of a Multiplex CMP

Threading Model
• Thread selection
  ◦ Partitioning code into distinct instruction sequences
• Thread dispatch
  ◦ Assigning threads to execute on different CPUs
• Data communication and speculation
  ◦ Propagating data between independent threads

Thread Selection in Multiplex
• Methodology
  ◦ The compiler chooses between threading models
  ◦ Explicit threading is prioritized over implicit threading
  ◦ Hardware selects implicit threads at run time by speculation; software, however, specifies the implicit thread boundaries
• Pros – minimizes explicit and implicit overheads
• Scenarios
  ◦ Executing loops with small bodies implicitly
  ◦ Executing tail ends of unevenly partitioned segments implicitly
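The selection policy above can be sketched as a small decision function. This is only an illustration of the priorities stated on the slide: the function name, its parameters, and the size threshold `min_explicit_body` are hypothetical and do not come from the Multiplex compiler.

```python
def select_threading_mode(compiler_analyzable, body_instructions,
                          min_explicit_body=100):
    """Pick "explicit" or "implicit" for one code segment.

    min_explicit_body is a hypothetical threshold: below it, the dispatch
    overhead of an explicit thread is assumed to outweigh the benefit.
    """
    if not compiler_analyzable:
        # Unanalyzable segments cannot be proven parallel, so they run
        # speculatively instead of being serialized.
        return "implicit"
    if body_instructions < min_explicit_body:
        # Small loop bodies are one of the scenarios the slide lists
        # for implicit execution.
        return "implicit"
    # Otherwise explicit threading has priority.
    return "explicit"
```

For example, `select_threading_mode(True, 500)` yields `"explicit"`, while an analyzable loop with a 20-instruction body falls back to `"implicit"`.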

Thread Dispatch – An Overview
• Dispatching conventional threads involves
  ◦ Setting each CPU's PC to the address of the thread's first instruction
  ◦ Assigning a private SP to each CPU
  ◦ Copying stacks and register values prior to dispatch
• Thread Descriptor – holds thread information
  ◦ Stores the addresses of possible subsequent dispatch target threads
  ◦ Holds register dependency information

Thread Dispatch in Multiplex
• Methodology
  ◦ Predict subsequent threads based on current threads
  ◦ Dispatch, execute and commit sequentially
  ◦ Re-dispatch on squashing
  ◦ Suspend dispatch upon mode switch to allow thread commits to complete
• Instruction set changes – fork, stop and setsp
• A Thread Predictor unit added to handle speculative prediction
• A mode bit added to the Thread Descriptor
• A TD Cache caches recently referenced descriptors
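The predict, commit, and re-dispatch cycle can be sketched as a toy simulation. Everything here is an assumption made for illustration: the `ThreadDescriptor` fields, the `run_sequence` helper, and the trivial "first successor" predictor are hypothetical and much simpler than the hardware units named on the slide. Mode switching and the TD Cache are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadDescriptor:
    name: str
    # Possible subsequent dispatch targets, as on the overview slide.
    successors: list = field(default_factory=list)
    # Mode bit (True = implicit); carried for completeness, unused below.
    implicit: bool = True

def run_sequence(descriptors, start, actual_next, predict):
    """Walk threads in program order and count mispredicted dispatches."""
    committed, squashes = [], 0
    current = start
    while current is not None:
        committed.append(current)          # threads commit in program order
        td = descriptors[current]
        if not td.successors:
            break                          # last thread: nothing to dispatch
        guess = predict(td)                # speculative dispatch target
        real = actual_next[current]        # resolved control flow
        if guess != real:
            squashes += 1                  # squash the wrong thread, re-dispatch
        current = real
    return committed, squashes

# Hypothetical four-thread region: A branches to B or C, both reach D.
descriptors = {
    "A": ThreadDescriptor("A", ["B", "C"]),
    "B": ThreadDescriptor("B", ["D"]),
    "C": ThreadDescriptor("C", ["D"]),
    "D": ThreadDescriptor("D"),
}
actual_next = {"A": "C", "B": "D", "C": "D"}
order, squashes = run_sequence(descriptors, "A", actual_next,
                               predict=lambda td: td.successors[0])
```

With this naive predictor the walk commits A, C, D and records one squash, caused by guessing B after A.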

MUCS Protocol
• Multiplex Unified Coherence and Speculation (MUCS)
• Offers data coherence as well as versioning support
• Key design objective – minimize speculation overheads in two respects
  ◦ Dependence resolution in the common case should be handled within the cache, thereby minimizing bus transactions
  ◦ Thread commits and squashes should be done en masse, not one cache block at a time

MUCS Protocol

| State | Action | State bits affected | Mode |
| Speculative | 1. Load/Read Miss 2. Fill cache with latest version of cache block as per program order 3. Set use bit if load is executed before a store 4. Clear commit bit 5. Clear squash bit | use, commit, squash | implicit |
| Speculative | 1. Store/Write Miss 2. Fill cache with latest version from L2, write and store 3. Do not invalidate other caches 4. Set dirty bit 5. Set preceding cache stale bit 6. Clear commit bit | dirty, stale, commit | implicit |
| Committed | 1. Commit Thread 2. Set commit bit en masse 3. Clear use bit | commit, use | implicit |
| Squashed | 1. Squash Thread 2. Set squash bit en masse 3. Clear use bit en masse | squash | implicit |

MUCS Protocol
• 6 bits used for monitoring the state of each cache block
  ◦ Use – set per speculative load executed before a store
  ◦ Dirty – set per speculative store in both modes
  ◦ Commit – set en masse on commit of speculative blocks
  ◦ Stale – set on a cache block when a newer version of the data is available in another CPU
  ◦ Squash – set en masse on a cache touched by a squashed thread
  ◦ Valid – set per cache fill upon misses in both modes to determine validity of the tag (not the data)
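The six state bits and the implicit-mode actions listed on the protocol slides can be modeled in a few lines of Python. This is a simplified sketch, not the hardware protocol: the class and function names are hypothetical, bus transactions and L2 fills are omitted, "load executed before a store" is approximated by checking the dirty bit, and the loops stand in for what the slides describe as en-masse bit updates.

```python
from dataclasses import dataclass

@dataclass
class CacheBlock:
    valid: bool = False
    use: bool = False
    dirty: bool = False
    commit: bool = False
    stale: bool = False
    squash: bool = False

def speculative_load(block):
    block.valid = True             # cache fill on a miss
    if not block.dirty:            # assumption: no earlier store by this thread
        block.use = True
    block.commit = False
    block.squash = False

def speculative_store(block, preceding_copy=None):
    block.valid = True
    block.dirty = True
    block.commit = False
    if preceding_copy is not None:
        # The older version held by a preceding thread's cache is marked
        # stale; it is not invalidated.
        preceding_copy.stale = True

def commit_thread(cache):
    for block in cache.values():   # stands in for a single en-masse update
        block.commit = True
        block.use = False

def squash_thread(cache):
    for block in cache.values():   # stands in for a single en-masse update
        block.squash = True
        block.use = False

# Two hypothetical per-CPU caches holding the same block address.
cache0 = {0x40: CacheBlock()}
cache1 = {0x40: CacheBlock()}
speculative_load(cache0[0x40])
speculative_store(cache1[0x40], preceding_copy=cache0[0x40])
commit_thread(cache0)
squash_thread(cache1)
```

After this sequence the block in `cache0` is committed and stale with its use bit cleared, while the block in `cache1` is dirty and marked squashed.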

Key Performance Factors
• Thread size
• Load imbalance
• Data dependence
• Thread dispatch/completion overhead
• Speculative state overflow

Performance Analysis – System Info

Performance Analysis – Best Case
• Class 1 applications favor implicit-only CMPs
• Class 2 applications favor explicit-only CMPs
• Average speedup of a CMP with four dual-issue CPUs over a single dual-issue CPU
  ◦ Implicit-only = 1.14, Explicit-only = 2.17, Multiplex = 2.3

Performance Analysis – Overheads
• I – implicit-only, M – Multiplex
• fpppp: provably parallel code = 0%, low squash buffer hits
• wave5, tomcatv and swim have control flow irregularities in the inner loop, i.e., I/O stalls

Performance Analysis – Cache Size
• Effect of increasing cache size – performance increases
• Multiplex incurs less overflow than an implicit-only CMP
• Effect of increasing data rates – performance decreases

Conclusion
• Combining implicit and explicit multithreading yields better speedup, reaching 2.63 in simulation
• The MUCS protocol makes this possible by mapping the coherence protocol needed for explicit threading onto a subset of the states required for implicit threading, which eliminates the need for extra hardware
• The dominant overheads for implicit and explicit threading are speculative state overflow and thread dispatch, respectively

Questions?

Thank you
