Efficient software checkpointing framework for speculative techniques

Presentation transcript:

Efficient Software Checkpointing Framework for Speculative Techniques
ECE Connections 2006
Co-Supervisors: Prof. Greg Steffan, Prof. Cristiana Amza
Chuck (Chengyan) Zhao
Department of Computer Science, University of Toronto
Jun. 09, 2006
(speaker note: might need to introduce my supervisors to the audience)

Chip Multi-Processors (CMPs) Are Now Everywhere
IBM: Power 4, Power 5
Intel: Montecito, Smithfield
AMD: dual-core Opteron, Athlon X2, four-core Opteron
Sun: UltraSPARC T1 (32 hardware threads), UltraSPARC T2 (64 hardware threads)
Sony, Toshiba, IBM: Cell (9 cores)
(figures: Power 4, dual-core Intel chip, dual-core Opteron, Cell)
(speaker note: we are interested in improving the performance of a single application using the abundant CMP resources, most of which would otherwise sit idle most of the time)
Goal: use CMPs for single-threaded applications through parallelization

Parallelization Techniques
Automatic parallelization: conservative and precise; requires a proof of non-dependence; limited in the domains it can handle
Speculative parallelization: non-conservative; must recover from mis-speculation
Our focus: speculative parallelization, using TLS

Thread-Level Speculation (TLS) Parallelism
Code example:
    for (…) {
        …
        *p = …;
        … = … *q;
        …
    }
This loop is difficult to parallelize automatically: the dependence between *p and *q is uncertain and might be runtime- or user-input-dependent.
(speaker note: point at the slide while talking)
2. turn each loop iteration into a thread
3. checkpointing scheme + dependence testing

How Thread-Level Speculation Works
(diagram: two threads run "… *q" and "*p …" in parallel; a violation is detected, the offending thread recovers and re-executes "… *q"; overall execution time still shrinks)
We take a sequential program, carve it into threads, and execute the threads speculatively in parallel, watching for violations. The speculative part is that we do not know whether these threads are actually independent; instead, we rely on runtime support to tell us whether they were. Whenever we have violated a data dependence, we simply re-execute that thread so it runs with the proper value; otherwise we commit the speculative work. Even when speculation fails, we can still reduce overall execution time by exploiting the available parallelism: if you are usually right, it is faster to apologize when you are wrong than to always ask for permission.
Key idea: exploit available thread-level parallelism.

Memory Checkpointing: Compiler Transformations
1. mark the region of interest
2. back up each memory write (store)
3. generate buffer-refresh calls
4. generate recovery code
5. remove the region-marking delimiters
Transformed code:
    start_instrument();
    setjmp(buf1);
    for (…) {
        refresh_ckpt();
        backup_mem(a); a = …;
        backup_mem(b); b = …;
        …
    }
    if (error_spec()) {
        ckp_restore();
        longjmp(buf1);
    }
    stop_instrument();
(speaker note: mention that these function calls are currently organized into a runtime library)

Preliminary Results: MCF from SPEC2000 INT
    index | function(s) instrumented
    ------+--------------------------
      1   | refresh_potential()
      2   | bea_compute_red_cost()
      3   | primal_bea_mpp()
      4   | 1 + 2
      5   | 1 + 3
      6   | 2 + 3
      7   | 1 + 2 + 3
Benchmark suite: SPEC2000 CPU INT (10 of the 12 applications build in our framework)
(speaker note: remember the key point — performance degradation can be up to 50%, but there is large room for improvement)

Challenges and Future Work
Challenge: software overhead
Proposed solutions (optimizations):
- inlining
- optimal buffer sizing and refresh placement
- memory optimizations
Applications:
- value prediction
- debugging support
- reliability enhancement
- TLS (long term)
- …
(speaker note: mention that the challenge of software-only checkpointing is to significantly reduce the software overhead through aggressive optimization)

Questions and Answers