Thread-Level Speculation
Karan Singh, CS 612, February 23, 2006

Introduction
- extraction of parallelism at compile time is limited
- Thread-Level Speculation (TLS) is a form of optimistic parallelization
- TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations

Introduction
- Zhang et al.: extend the cache coherence protocol hardware of a multiprocessor to detect dependence violations
- Pickett et al.: design a Java-specific software TLS system that operates at the bytecode level

Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Ye Zhang, Lawrence Rauchwerger, Josep Torrellas

Outline
- Loop parallelization basics
- Speculative Run-Time Parallelization in Software
- Speculative Run-Time Parallelization in Hardware
- Evaluation and Comparison

Loop parallelization basics
- a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations
- we need to analyze data dependences across iterations: flow, anti, and output dependences
- if there are no dependences: doall loop (contrasted in the sketch below)
- if there are only anti or output dependences: privatization, scalar expansion, ...
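To make the dependence categories concrete, here is a minimal C sketch (an illustration, not from the slides; it assumes arrays a, b, c of length n) contrasting a doall loop with a loop carrying a cross-iteration flow dependence:

    /* doall: every iteration touches only its own elements,
       so the outcome is independent of the iteration order */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* flow dependence: iteration i reads a[i-1], which
       iteration i-1 writes, so the iteration order matters */
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];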

Loop parallelization basics
To parallelize a loop speculatively, we need:
- a way of saving and restoring state
- a method to detect cross-iteration dependences

Speculative Run-Time Parallelization in Software
- mechanism for saving/restoring state (sketched in code below):
  - before executing speculatively, save the state of the arrays that will be modified
  - dense access: save the whole array
  - sparse access: save individual elements
  - only modified shared arrays are saved
  - in all cases, if the loop is not found parallel after execution, the arrays are restored from their backups
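A minimal C sketch of this save/restore mechanism (the function names and the double element type are assumptions for illustration):

    #include <stdlib.h>
    #include <string.h>

    /* Save a copy of shared array A (n elements) before speculation. */
    double *save_state(const double *A, size_t n) {
        double *backup = malloc(n * sizeof *backup);
        memcpy(backup, A, n * sizeof *A);
        return backup;
    }

    /* If the loop is found not parallel, restore A from the backup. */
    void restore_state(double *A, const double *backup, size_t n) {
        memcpy(A, backup, n * sizeof *A);
    }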

Speculative Run-Time Parallelization in Software
- LRPD test to detect dependences:
  - flags the existence of cross-iteration dependences
  - applied to those arrays whose dependences cannot be analyzed at compile time
  - two phases: Marking and Analysis

LRPD test: setup
- back up A(1:s)
- initialize the shadow arrays Ar(1:s), Aw(1:s) to zero
- initialize the scalar Atw to zero

LRPD test: marking (performed for each iteration during the speculative parallel execution of the loop)
- on a write to A(i): set Aw(i)
- on a read from A(i): if A(i) has not been written in this iteration, set Ar(i)
- at the end of each iteration, count how many distinct elements of A were written and add the count to Atw

LRPD test: analysis (performed after the speculative execution; both phases are sketched in code below)
- compute Atm = number of non-zero Aw(i) over all elements i of the shadow array
- if any(Aw(:) ^ Ar(:)) (elementwise AND), the loop is not a doall; abort execution
- else if Atw == Atm, the loop is a doall
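The marking and analysis phases can be sketched in C for a single array under test (a behavioral sketch following the slides' Aw/Ar/Atw/Atm names; the real implementation distributes the shadow arrays across processors, and the size S is an assumption):

    #include <stdbool.h>

    #define S 1024                  /* size of the array under test (assumed) */

    static bool Aw[S], Ar[S];       /* write / read shadow arrays            */
    static bool wi[S];              /* written in the current iteration      */
    static int  Atw;                /* writes, summed iteration by iteration */

    /* Marking phase: invoked on each access during speculative execution. */
    void mark_write(int i) { Aw[i] = true; wi[i] = true; }

    void mark_read(int i) {
        if (!wi[i])                 /* set Ar(i) only if A(i) was not        */
            Ar[i] = true;           /* already written in this iteration     */
    }

    void end_iteration(void) {      /* count distinct elements written       */
        for (int i = 0; i < S; i++) {
            if (wi[i]) { Atw++; wi[i] = false; }
        }
    }

    /* Analysis phase: run after the speculative execution completes. */
    bool loop_is_doall(void) {
        int Atm = 0;
        for (int i = 0; i < S; i++) {
            if (Aw[i] && Ar[i]) return false;  /* any(Aw ^ Ar) != 0          */
            if (Aw[i]) Atm++;
        }
        return Atw == Atm;          /* fails if some element was written     */
    }                               /* in more than one iteration            */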

Example 1: two parallel iterations each write element x of the array (w(x), w(x))
- Aw = 1, Ar = 0
- Aw ^ Ar = 0, so any(Aw ^ Ar) = 0
- Atw = 2, Atm = 1
- since Atw ≠ Atm, parallelization fails (the same element was written in more than one iteration)

Example 2: one iteration writes element x, another iteration reads it (w(x), r(x))
- Aw = 1, Ar = 1
- Aw ^ Ar = 1, so any(Aw ^ Ar) = 1
- Atw = 1, Atm = 1
- since any(Aw ^ Ar) == 1, parallelization fails

Example 3: the same accesses as Example 2 under a different interleaving (w(x), then r(x) in another iteration)
- Aw = 1, Ar = 1
- Aw ^ Ar = 1, so any(Aw ^ Ar) = 1
- Atw = 1, Atm = 1
- since any(Aw ^ Ar) == 1, parallelization fails; the marking does not depend on the interleaving order

Example 4: one iteration writes element x and then reads it within the same iteration (w(x), r(x))
- Aw = 1, Ar = 0* (*Ar(i) is set only if A(i) was not already written in this iteration)
- Aw ^ Ar = 0, so any(Aw ^ Ar) = 0
- Atw = 1, Atm = 1
- since Atw == Atm, the loop is a doall
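Replaying Example 1 through the marking calls of the LRPD sketch above confirms the outcome (a hypothetical driver; index 42 stands in for element x):

    /* Example 1 replayed: two iterations write the same element x. */
    mark_write(42); end_iteration();    /* iteration 1: w(x) */
    mark_write(42); end_iteration();    /* iteration 2: w(x) */
    /* loop_is_doall() now returns false: Atw == 2 but Atm == 1. */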

Speculative Run-Time Parallelization in Software
- implementation:
  - in a DSM system, each processor allocates a private copy of the shadow arrays
  - the marking phase is performed locally
  - for the analysis phase, the private shadow arrays are merged in parallel
- compiler integration:
  - part of a front-end parallelizing compiler
  - loops to parallelize are chosen based on user feedback or on heuristics about the previous success rate

Speculative Run-Time Parallelization in Software
- improvements:
  - privatization
  - iteration-wise vs. processor-wise testing
- shortcomings:
  - overhead of the analysis phase and of the extra marking instructions
  - we only learn that parallelization failed after the loop completes execution

Privatization example

    for i = 1 to N
        tmp = f(i)    /* f is some operation */
        A(i) = A(i) + tmp
    enddo

In privatization, for each processor we create private copies of the variables causing anti or output dependences (here, tmp).
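The same loop in C with tmp privatized; this sketch uses OpenMP as one concrete way to realize privatization (not the slides' own mechanism, which makes per-processor copies by hand), and f is the slide's placeholder operation:

    #include <omp.h>

    double f(int i);                /* "some operation" from the slide */

    void privatized_loop(double *A, int N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double tmp = f(i);      /* tmp is declared inside the loop body,
                                       so each thread gets a private copy and
                                       the anti/output dependences on tmp vanish */
            A[i] = A[i] + tmp;
        }
    }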

Speculative Run-Time Parallelization in Hardware
- extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions that flag any cross-iteration data dependences
- on detection, parallel execution is immediately aborted
- requires extra state in the tags of all caches
- requires fast memory in the directories

Speculative Run-Time Parallelization in Hardware
- two sets of transactions:
  - non-privatization algorithm
  - privatization algorithm

Non-privatization algorithm
- identifies as parallel those loops where each element of the array under test is either read-only or accessed by only one processor (modeled in the sketch below)
- a pattern where an element is read by several processors and later written by one is flagged as not parallel
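This rule can be modeled as a per-element access record in C (a behavioral sketch of what the access bits encode, not the hardware itself; the names are hypothetical):

    #include <stdbool.h>

    enum { NOBODY = -1, MANY = -2 };

    typedef struct {
        int  accessor;   /* NOBODY, a single processor id, or MANY */
        bool written;    /* element has been written               */
    } Elem;

    /* Record an access by processor proc; returns false as soon as
       the element violates the "read-only or single-processor" rule. */
    bool record_access(Elem *e, int proc, bool is_write) {
        if (e->accessor == NOBODY)
            e->accessor = proc;
        else if (e->accessor != proc)
            e->accessor = MANY;
        if (is_write)
            e->written = true;
        return !(e->written && e->accessor == MANY);
    }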

Non-privatization algorithm
- the fast memory has three entries per element: ROnly, NoShr, First
- these entries are also sent to the cache and stored in the tags of the corresponding cache line
- the per-element bits in the tags of the different caches and directories are kept coherent

Speculative Run-Time Parallelization in Hardware: implementation
- three supports are needed: storage for the access bits, logic to test and change the bits, and a table in the directory to find the access bits for a given physical address
- three parts are modified: the primary cache, the secondary cache, and the directory

Implementation: primary cache
- the access bits are stored in an SRAM table called the Access Bit Array
- the algorithm operation to perform is determined by the Control input
- the Test Logic performs the operations

Implementation: secondary cache
- also needs an Access Bit Array
- on an L1 miss that hits in L2, L2 provides the data and the access bits to L1
- the access bits are sent directly to the test logic in L1
- the bits generated are stored in the Access Bit Array of L1

Implementation: directory
- a small dedicated memory holds the access bits, with a lookup table
- the access bits generated by the logic are sent to the processor
- the transaction is overlapped with the memory and directory access

Evaluation
- execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite
- loops from applications in the Perfect Club suite and one application from NCSA: Ocean, P3m, Adm, Track
- four environments compared: Serial, Ideal, SW, HW
- loops run with 16 processes (except Ocean, which runs with 8)

Evaluation: loop execution speedup [figure elided from transcript]

Evaluation: slowdown due to parallelization failure [figure elided from transcript]

Evaluation: scalability [figure elided from transcript]

Software vs. Hardware
- in hardware, failure to parallelize is detected on the fly
- several operations are performed in hardware, which reduces overheads
- the hardware scheme scales better with the number of processors
- the hardware scheme has less space overhead

Software vs. Hardware
- in hardware, the non-privatization test is processor-wise without requiring static scheduling
- the hardware scheme can be applied to pointer-based C code more efficiently
- however, the software implementation requires no special hardware!