MULTIPLEX: UNIFYING CONVENTIONAL AND SPECULATIVE THREAD-LEVEL PARALLELISM ON A CHIP MULTIPROCESSOR
Authors: Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi and T. N. Vijaykumar
Presented by: Ashok Venkatesan

Outline
- Background
  - Thread-Level Parallelism (TLP)
  - Explicit & Implicit TLP
  - An Example
- Multiplex
  - Threading Model
  - MUCS Protocol
  - Key Performance Factors
- Performance Analysis
- Conclusion

Thread-Level Parallelism
- ILP Wall
  - Increasing CPI with increasing clock rates
  - Limited ILP in applications
  - Insufficient memory locality
- Using TLP
  - Increased granularity of parallelism
  - Exploitation of multi-cores
- Threads
  - A logical sub-process that carries its own state
  - State: instructions, data, PC, register file, stack, etc.
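
To make "carries its own state" concrete, here is a minimal C sketch of per-thread state; the struct layout, field names, and sizes are illustrative assumptions, not taken from the paper.

/* Hypothetical per-thread state: PC, register file, and a private stack. */
#include <stdint.h>

#define NUM_REGS 32

typedef struct thread_state {
    uint64_t pc;              /* program counter */
    uint64_t sp;              /* private stack pointer */
    uint64_t regs[NUM_REGS];  /* architectural register file */
    uint8_t  *stack;          /* thread-private stack memory */
} thread_state_t;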

Explicit & Implicit TLP
- Explicit TLP
  - The program is explicitly partitioned into threads by the programmer, and an API is used to dispatch and execute them on multiple cores
  - Static: defined in the program
  - Main overhead: thread dispatch
- Implicit (Speculative) TLP
  - Threads are peeled off a sequential execution stream of the program by hardware prediction
  - Dynamic: runtime prediction
  - Main overhead: speculative state overflow

Example – Executing Explicit Threads
- Data dependence is resolved using a barrier in this example
- Threads are dispatched using a fork (system API) call
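
The slide's code figure is not preserved in this transcript; the following is a hypothetical pthreads sketch of the same idea, with the programmer explicitly dispatching worker threads and a barrier resolving the cross-thread data dependence. All names (worker, NTHREADS, etc.) are illustrative.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1024

static double a[N], b[N];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);

    for (long i = lo; i < hi; i++)     /* phase 1: produce a[] */
        a[i] = i * 0.5;

    pthread_barrier_wait(&barrier);    /* all of a[] is ready past this point */

    for (long i = lo; i < hi; i++)     /* phase 2: consume other threads' a[] */
        b[i] = a[(i + 1) % N] + 1.0;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long id = 0; id < NTHREADS; id++)   /* explicit thread dispatch */
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    printf("b[0] = %f\n", b[0]);
    return 0;
}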

Example – Executing Implicit Threads
- Both data dependence and dispatch are handled by a hardware predictor
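
For contrast, a minimal sketch of the implicit case: the source stays sequential, and the hardware (which needs no API calls) peels iterations off as speculative threads.

/* The loop carries a cross-iteration dependence through a[]. Under
   implicit TLP the hardware dispatches groups of iterations as
   speculative threads, predicts across the dependence, and squashes
   and re-executes a thread when the prediction is wrong. */
#define N 1024

static double a[N], b[N];

void compute(void) {
    for (int i = 1; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = a[i - 1] + 1.0;  /* reads the previous iteration's a[] */
    }
}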

Multiplex
- Unifies explicit and implicit threading on a CMP
- Obviates the need to serialize unanalyzable program segments, by using speculative TLP
- Avoids implicit threading's speculation overhead and performance loss in compiler-analyzable program segments, by using explicit threading
- Implements a single snoopy bus protocol to unify cache coherence with memory renaming and disambiguation

Anatomy of a Multiplex CMP

Threading Model
- Thread selection: partitioning code into distinct instruction sequences
- Thread dispatch: assigning threads to execute on different CPUs
- Data communication and speculation: propagating data between independent threads

Thread Selection in Multiplex
- Methodology (sketched below)
  - The compiler chooses between threading models, prioritizing explicit threading over implicit threading
  - Implicit threads are selected by runtime speculation in hardware; however, software specifies implicit thread boundaries
  - Pro: minimizes both explicit and implicit overheads
- Scenarios
  - Executing loops with small bodies implicitly
  - Executing tail ends of unevenly partitioned segments implicitly
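
A minimal sketch of this selection policy as a compiler-side heuristic; the predicate and threshold names are assumptions, not the paper's actual compiler pass.

typedef enum { THREAD_EXPLICIT, THREAD_IMPLICIT } thread_mode_t;

/* Prefer explicit threads for provably parallel regions whose bodies
   are large enough to amortize dispatch overhead; fall back to
   implicit (speculative) threads otherwise. */
thread_mode_t select_mode(int provably_parallel, int body_insns,
                          int min_explicit_body) {
    if (provably_parallel && body_insns >= min_explicit_body)
        return THREAD_EXPLICIT;
    return THREAD_IMPLICIT;  /* small bodies, uneven tails, unanalyzable code */
}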

Thread Dispatch – An Overview
- Dispatching conventional threads involves
  - Assigning the address of the thread's first instruction to a CPU's PC
  - Assigning a private SP to each CPU
  - Copying stack and register values prior to dispatch
- Thread Descriptor: holds thread information
  - Stores the addresses of possible subsequent dispatch target threads
  - Holds register dependency information
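
A minimal C sketch of the dispatch steps listed above; cpu_context_t and its fields are illustrative assumptions.

#include <stdint.h>
#include <string.h>

#define NUM_REGS 32

typedef struct cpu_context {
    uint64_t pc;               /* program counter */
    uint64_t sp;               /* stack pointer */
    uint64_t regs[NUM_REGS];   /* register file */
} cpu_context_t;

void dispatch_thread(cpu_context_t *cpu, uint64_t first_insn,
                     uint64_t private_stack_top,
                     const uint64_t parent_regs[NUM_REGS]) {
    cpu->pc = first_insn;        /* PC <- address of thread's first instruction */
    cpu->sp = private_stack_top; /* assign a private SP */
    memcpy(cpu->regs, parent_regs, sizeof(cpu->regs));  /* copy register values */
}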

Thread Dispatch in Multiplex
- Methodology
  - Predict subsequent threads based on the current threads
  - Dispatch, execute, and commit sequentially
  - Re-dispatch on squashing
  - Suspend dispatch upon a mode switch, to allow outstanding thread commits to complete
- Instruction set changes: fork, stop, and setsp
- A thread predictor unit is added to handle speculative prediction
- A mode bit is added to the Thread Descriptor
- A TD cache caches recently referenced descriptors
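
Combining the fields named on the two dispatch slides, a hypothetical thread-descriptor layout might look as follows; the exact encoding is an assumption, not the paper's.

#include <stdint.h>

#define MAX_TARGETS 4

typedef struct thread_descriptor {
    uint64_t start_pc;              /* first instruction of the thread */
    uint64_t targets[MAX_TARGETS];  /* possible subsequent dispatch targets */
    uint32_t reg_dep_mask;          /* register dependency information */
    uint8_t  implicit_mode;         /* mode bit: 1 = implicit, 0 = explicit */
} thread_descriptor_t;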

MUCS Protocol
- Mux Unified Coherence and Speculation (MUCS)
- Offers data coherence as well as versioning support
- Key design objective: minimize speculation overheads in two respects
  - Dependence resolution in the common case should be handled within the cache, minimizing bus transactions
  - Thread commits and squashes should be performed en masse, not block by block

MUCS Protocol

MUCS state transitions (all in implicit mode):

- Speculative: load/read miss (state bits affected: use, commit, squash)
  1. Fill the cache with the latest version of the cache block as per program order
  2. Set the use bit if the load is executed before a store
  3. Clear the commit bit
  4. Clear the squash bit
- Speculative: store/write miss (state bits affected: dirty, stale, commit)
  1. Fill the cache with the latest version from L2, then perform the write and store
  2. Do not invalidate other caches
  3. Set the dirty bit
  4. Set the stale bit in the preceding cache
  5. Clear the commit bit
- Committed: commit thread (state bits affected: commit, use)
  1. Set the commit bit en masse
  2. Clear the use bit
- Squashed: squash thread (state bits affected: squash)
  1. Set the squash bit en masse
  2. Clear the use bit en masse

MUCS Protocol
- Six bits are used to track the state of each cache block
  - Use: set per speculative load executed before a store
  - Dirty: set per speculative store, in both modes
  - Commit: set en masse on commit of speculative blocks
  - Stale: set on a cache block when a newer version of the data is available in another CPU
  - Squash: set en masse on a cache touched by a squashed thread
  - Valid: set per cache fill upon a miss, in both modes, to indicate validity of the tag (not the data)
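
A minimal sketch of the six per-block state bits and the en-masse commit/squash operations; in hardware these bits would be flash-set across the cache, so the loops below are only a software illustration.

#include <stddef.h>

typedef struct cache_block_state {
    unsigned use    : 1;  /* speculative load executed before a store */
    unsigned dirty  : 1;  /* speculatively written (either mode) */
    unsigned commit : 1;  /* set en masse when the thread commits */
    unsigned stale  : 1;  /* newer version exists in another CPU's cache */
    unsigned squash : 1;  /* set en masse when the thread is squashed */
    unsigned valid  : 1;  /* tag (not data) is valid */
} cache_block_state_t;

void commit_thread(cache_block_state_t *blocks, size_t n) {
    for (size_t i = 0; i < n; i++) {  /* en masse, not per-block bus transactions */
        blocks[i].commit = 1;
        blocks[i].use = 0;
    }
}

void squash_thread(cache_block_state_t *blocks, size_t n) {
    for (size_t i = 0; i < n; i++) {
        blocks[i].squash = 1;
        blocks[i].use = 0;
    }
}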

Key Performance Factors
- Thread size
- Load imbalance
- Data dependence
- Thread dispatch/completion overhead
- Speculative state overflow

Performance Analysis – System Info

Performance Analysis – Best Case
- Class 1 applications favor implicit-only CMPs
- Class 2 applications favor explicit-only CMPs
- Average speedup of a 4-CPU dual-issue CMP over a single dual-issue processor:
  - Implicit-only = 1.14, explicit-only = 2.17, Multiplex = 2.3

Performance Analysis – Overheads
- i = implicit-only, m = multiplex
- fpppp: provably parallel code = 0%, low squash buffer hits
- wave5, tomcatv, and swim have control-flow irregularities in the inner loop, i.e., I/O stalls

Performance Analysis – Cache Size
- Effect of increasing cache size: performance increases
- Multiplex incurs less overflow than an implicit-only CMP
- Effect of increasing data rates: performance decreases

Conclusion
- The coexistence of implicit and explicit multithreading yields better speedup, reaching 2.63 in simulation
- The MUCS protocol enables this implementation by mapping the coherence protocol needed for explicit threading onto a subset of the states required for implicit threading, eliminating the need for extra hardware
- The dominant overheads for implicit and explicit threading are speculative state overflow and thread dispatch, respectively

Questions?

Thank you