Increasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing Salman Khan 1, Nikolas Ioannou 2, Polychronis Xekalakis 3 and Marcelo.

Slides:

Advertisements

Similar presentations

Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors Ali ShafieeNarges Shahidi Amirali Baniasadi Sharif University of Technology.

Advertisements

Dynamic Power Redistribution in Failure-Prone CMPs Paula Petrica, Jonathan A. Winter * and David H. Albonesi Cornell University *Google, Inc.

Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi * Chang Joo Lee * + Onur Mutlu Yale N. Patt * * HPS Research Group The.

Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.

NC STATE UNIVERSITY Transparent Control Independence (TCI) Ahmed S. Al-Zawawi Vimal K. Reddy Eric Rotenberg Haitham H. Akkary* *Dept. of Electrical & Computer.

Warm-Up Methodology for HW/SW Co-Designed Processors A. Brankovic, K. Stavrou, E. Gibert, A. Gonzalez.

Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC) Ran Manevich, Isask har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny Technion – Israel.

Performance Analysis of NUCA Policies for CMPs Using Parsec v2.0 Benchmark Suite Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research.

OS-aware Tuning Improving Instruction Cache Energy Efficiency on System Workloads Authors : Tao Li, John, L.K. Published in : Performance, Computing, and.

Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism Nikolas Ioannou, Marcelo Cintra School of Informatics University.

Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept.

Performance Evaluation of Cache Replacement Policies for the SPEC CPU2000 Benchmark Suite Hussein Al-Zoubi.

Javier Lira (UPC, Spain)Carlos Molina (URV, Spain) David Brooks (Harvard, USA)Antonio González (Intel-UPC,

CRUISE: Cache Replacement and Utility-Aware Scheduling

Bypass and Insertion Algorithms for Exclusive Last-level Caches

1 ICCD 2010 Amsterdam, the Netherlands Rami Sheikh North Carolina State University Mazen Kharbutli Jordan Univ. of Science and Technology Improving Cache.

Read-Write Lock Allocation in Software Transactional Memory Amir Ghanbari Bavarsad and Ehsan Atoofian Lakehead University.

Combining Thread Level Speculation, Helper Threads, and Runahead Execution Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra University of Edinburgh.

Handling Branches in TLS Systems with Multi-Path Execution Polychronis Xekalakis and Marcelo Cintra University of Edinburgh

Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas.

Mixed Speculative Multithreaded Execution Models Marcelo Cintra University of Edinburgh

Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization for Multiprocessors Marcelo Cintra and Josep Torrellas University.

KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin.

Roy Ernest Database Administrator Pinnacle Sports Worldwide SQL Server High Availability.

UPC Compiler Support for Trace-Level Speculative Multithreaded Architectures Antonio González λ,ф Carlos Molina ψ Jordi Tubella ф INTERACT-9, San Francisco.

1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.

High Performing Cache Hierarchies for Server Workloads

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Matching Memory Access Patterns and Data Placement for NUMA Systems Zoltán Majó Thomas R. Gross Computer Science Department ETH Zurich, Switzerland.

UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,

Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.

Address-Value Delta (AVD) Prediction Onur Mutlu Hyesoon Kim Yale N. Patt.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González.

Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps José A. Joao *‡ Onur Mutlu ‡* Hyesoon Kim § Rishi Agarwal.

Prospector : A Toolchain To Help Parallel Programming Minjang Kim, Hyesoon Kim, HPArch Lab, and Chi-Keung Luk Intel This work will be also supported by.

Revisiting Load Value Speculation:

NVSleep: Using Non-Volatile Memory to Enable Fast Sleep/Wakeup of Idle Cores Xiang Pan and Radu Teodorescu Computer Architecture Research Lab

Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor José-María Arnau, Joan-Manuel Parcerisa (UPC) Polychronis Xekalakis.

Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.

Achieving Non-Inclusive Cache Performance with Inclusive Caches Temporal Locality Aware (TLA) Cache Management Policies Aamer Jaleel,

University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.

1 Reducing DRAM Latencies with an Integrated Memory Hierarchy Design Authors Wei-fen Lin and Steven K. Reinhardt, University of Michigan Doug Burger, University.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.

(1) Scheduling for Multithreaded Chip Multiprocessors (Multithreaded CMPs)

ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.

Power and Frequency Analysis for Data and Control Independence in Embedded Processors Farzad Samie Amirali Baniasadi Sharif University of Technology University.

ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan.

CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR

Sampling Dead Block Prediction for Last-Level Caches

Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.

Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.

Analysis of NUCA Policies for CMPs Using Parsec Benchmark Suite Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research Center Intel.

Power Awareness through Selective Dynamically Optimized Traces Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz and Avi Mendelson – Intel Labs,

Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.

CML Branch Penalty Reduction by Software Branch Hinting Jing Lu Yooseong Kim, Aviral Shrivastava, and Chuan Huang Compiler Microarchitecture Lab Arizona.

Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Mihai Burcea, J. Gregory Steffan, Cristiana Amza

Exploiting Postdominance for Speculative Parallelization

Multiscalar Processors

Address-Value Delta (AVD) Prediction

Dynamic Code Mapping Techniques for Limited Local Memory Systems

Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

Mengjia Yan† , Jiho Choi† , Dimitrios Skarlatos,

Lois Orosa, Rodolfo Azevedo and Onur Mutlu

Presentation transcript:

Increasing the Energy Efficiency of TLS Systems Using Intermediate Checkpointing Salman Khan 1, Nikolas Ioannou 2, Polychronis Xekalakis 3 and Marcelo Cintra 2 1 University of Manchester 2 University of Edinburgh 3 Intel Labs Barcelona - UPC

HiPC Introduction  Power efficiency, complexity and time-to-market reasons lead to CMPs  Problem: –No benefits for sequential applications –Even for mostly parallel applications Amdahl’s Law limits performance gains with many cores  Solution: Thread Level Speculation(TLS) –But performance through TLS costs in energy Can we reduce the wastefulness of re-execution due to misspeculation without losing performance?

3 Key Contributions  Propose checkpointing to improve efficiency of speculative execution  Evaluate dependence prediction techniques to guide checkpoint placement  Our approach results in an energy saving of up to 14%, with 7% on average over normal TLS execution, with no significant effect on speedup. HiPC 2011

4 Outline  Introduction  Checkpointing  Dependence Predictors  Checkpointing Policy  Experimental Setup and Results  Conclusions HiPC 2011

Thread Level Speculation 5HiPC 2011

Thread Level Speculation with Checkpointing 6HiPC 2011

7 Outline  Introduction  Checkpointing  Dependence Predictors  Checkpointing Policy  Experimental Setup and Results  Conclusions HiPC 2011

Placing Checkpoints  Stride  Dependence Prediction –Address based –Program Counter Based –Hybrid HiPC 20118

Dependence Prediction HiPC 20119

Hybrid Dependence Predictor HiPC

11 Outline  Introduction  Checkpointing  Dependence Predictors  Checkpointing Policy  Experimental Setup and Results  Conclusions HiPC 2011

Placing Checkpoints  Limited number of checkpoints  Placing a checkpoint has a cost  Checkpointing on every positive prediction results in too many checkpoints HiPC

13 Outline  Introduction  Checkpointing  Dependence Predictors  Checkpointing Policy  Experimental Setup and Results  Conclusions HPCA 2010

Setup  Simulator, Compiler and Benchmarks: –SESC ( –POSH (Liu et al. PPoPP ‘06) –Spec 2000 Int.  Architecture: –Four way CMP, 4-Issue cores –16KB L1 Data (multi-versioned) and Instruction Caches –1MB unified L2 Caches –Cycles from Violation to Kill/Restart: 12 –Cycles to Spawn: 12 HiPC

Measuring Dependence Prediction HiPC

ICS

HiPC  Wasted Instructions: Unnecessarily squashed instructions.

HiPC

HiPC

20 Outline  Introduction  Checkpointing  Dependence Predictors  Checkpointing Policy  Experimental Setup and Results  Conclusions HPCA 2010

Conclusions  Effective checkpointing improves the efficiency of TLS  Placing checkpoints by stride is not sufficient to reduce waste significantly  Checkpointing using dependence predication obtains energy saving of up to 14%, with 7% on average over normal TLS execution, with no significant effect on speedup. HiPC

Read the paper for…  Complete results  Microarchitectural issues that arise from checkpointing running tasks  Modified squash/restart mechanism that is needed to avoid performance degradation from checkpointing HiPC