iThreads: A Threading Library for Parallel Incremental Computation
Pramod Bhatotia, Pedro Fonseca, Björn Brandenburg (MPI-SWS); Umut Acar (CMU); Rodrigo Rodrigues (NOVA University of Lisbon)

Common workflow: rerun the same application with evolving input
- Scientific computing
- Reactive systems
- Data analytics
A small input change should require only a small amount of work to update the output.
Goal: efficient execution of apps in successive runs

Existing approaches
- Re-compute from scratch: (+) easy to design, (-) inefficient
- Design application-specific incremental algorithms: (+) efficient, (-) hard to design
Wanted: automatic + efficient

Auto "incrementalization"
- A well-studied topic in the PL community
- Compiler- and language-based approaches
- But they target sequential programs!

Parallelism
Proposals for parallel incremental computation: Hammer et al. [DAMP'07], Burckhardt et al. [OOPSLA'11]
Limitations:
- Require a new language with special data types
- Restrict the programming model to strict fork-join
Existing multi-threaded programs are not supported!

Design goals
1. Transparency: target unmodified pthreads-based programs
2. Practicality: support the full range of synchronization primitives
3. Efficiency: design a parallel infrastructure for incremental computation

iThreads
$ LD_PRELOAD="iThreads.so"
$ ./myProgram          (initial run)
$ echo " " >> changes.txt
$ ./myProgram          (incremental run)
Speedups w.r.t. pthreads of up to 8x
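
The program being preloaded is an ordinary pthreads binary. As a concrete illustration, a hypothetical myProgram.c of that kind might look like the following parallel sum; the program logic is an assumption for this sketch, not code from the paper.

/* myProgram.c: a hypothetical, unmodified pthreads program (illustration only). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define LEN 1000000

static long data[LEN];
static long partial[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = id; i < LEN; i += NTHREADS)   /* strided partition of the input */
        sum += data[i];
    partial[id] = sum;                          /* each thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < LEN; i++)
        data[i] = i % 7;                        /* stand-in for reading the input file */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    long total = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("total = %ld\n", total);
    return 0;
}

Built with gcc -pthread myProgram.c -o myProgram, it would run unchanged under iThreads via the LD_PRELOAD line above; changes.txt presumably describes the modified part of the input before the incremental run.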

Outline
- Motivation
- Design
- Evaluation

Behind the scenes
Initial run:
- Step #1: Divide the computation into sub-computations
- Step #2: Build the dependence graph
Incremental run:
- Step #3: Perform change propagation

A simple example (shared variables: x, y, and z)

thread-1:
  lock(); z = ++y; unlock();

thread-2:
  lock(); x++; unlock();
  local_var++;
  lock(); y = 2*x + z; unlock();
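
A runnable pthreads rendering of this example follows (a sketch: the single mutex m, the zero-initialized variables, and the main() scaffolding are assumptions added so the fragment compiles; the slide only shows the lock()/unlock() pattern).

#include <pthread.h>
#include <stdio.h>

/* shared variables from the example */
static int x, y, z;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;   /* assumed: one lock guards x, y, z */

static void *thread1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    z = ++y;                                   /* thread-1: single critical section */
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *thread2(void *arg) {
    int local_var = 0;                         /* thread-local, never shared */
    (void)arg;
    pthread_mutex_lock(&m);
    x++;
    pthread_mutex_unlock(&m);
    local_var++;                               /* no synchronization needed */
    pthread_mutex_lock(&m);
    y = 2 * x + z;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x=%d y=%d z=%d\n", x, y, z);
    return 0;
}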

Step #1: Divide (computation → sub-computations)
How do we divide a computation into sub-computations?

Sub-computations
(the same two-thread example as above)
- Entire thread? "Coarse-grained": a small change implies recomputing the entire thread
- Single instruction? "Fine-grained": requires tracking individual load/store instructions

Sub-computation granularity
The design space ranges from a single instruction (fine-grained, expensive to track) to an entire thread (coarse-grained).
A sub-computation is a sequence of instructions between pthreads synchronization points.
iThreads uses the Release Consistency (RC) memory model to define the granularity of a sub-computation.

Sub-computations in the example
thread-1 forms a single sub-computation: lock(); z = ++y; unlock();
thread-2 is divided at its synchronization points into three sub-computations: lock(); x++; unlock();  then local_var++;  then lock(); y = 2*x + z; unlock();

Step #2: Build (sub-computations → dependence graph)
How do we build the dependence graph?

(1) Record happens-before dependencies
Example: changed schedule (shared variables: x, y, and z)
Each sub-computation is recorded with its read and write sets:
- thread-1: z = ++y;  with Read={y}, Write={y,z}
- thread-2: x++;  with Read={x}, Write={x}
- thread-2: local_var++;  with Read={local_var}, Write={local_var}
- thread-2: y = 2*x + z;  with Read={x,z}, Write={y}
A different schedule changes the recorded happens-before order between these sub-computations.

Example: same schedule (shared variables: x, y, and z)
The sub-computations and their read/write sets are the same as above; the schedule, and hence the recorded happens-before order, is unchanged.

(2) Record data dependencies
Example: changed input (shared variables: x, y, and z)
With a different input y', thread-1's sub-computation (Read={y}, Write={y,z}) produces new values for y and z, and thread-2's sub-computation y = 2*x + z; (Read={x,z}, Write={y}) reads z, so a data dependency links the two.

Dependence graph
- Vertices: sub-computations
- Edges: happens-before order between sub-computations
  - Intra-thread: totally ordered based on execution order
  - Inter-thread: partially ordered based on synchronization primitives
- Data dependency between sub-computations: they can be ordered by happens-before, and the earlier one writes data that the later one reads
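
To make this structure concrete, here is a minimal C sketch (hypothetical types and names, not the iThreads implementation) of a graph vertex carrying read/write sets and of the data-dependency test. happens_before is simplified to intra-thread program order; the real graph also has inter-thread edges induced by synchronization primitives.

#include <stdbool.h>
#include <stddef.h>

#define MAX_VARS 64                 /* illustrative bound on tracked shared variables */

/* A vertex of the dependence graph: one sub-computation, its position in
 * its thread's total order, and the shared data it read and wrote. */
typedef struct {
    int  thread_id;
    int  seq_in_thread;             /* intra-thread total order */
    bool read_set[MAX_VARS];
    bool write_set[MAX_VARS];
} subcomp;

/* Simplified happens-before: same thread, earlier in program order.
 * (The full graph also orders sub-computations across threads via the
 * partial order induced by lock/unlock and other sync primitives.) */
static bool happens_before(const subcomp *a, const subcomp *b) {
    return a->thread_id == b->thread_id && a->seq_in_thread < b->seq_in_thread;
}

/* Data dependency: 'earlier' happens-before 'later' and writes a variable
 * that 'later' reads. */
static bool data_dependent(const subcomp *earlier, const subcomp *later) {
    if (!happens_before(earlier, later))
        return false;
    for (int v = 0; v < MAX_VARS; v++)
        if (earlier->write_set[v] && later->read_set[v])
            return true;
    return false;
}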

Step #3: Perform (dependence graph → change propagation)
How do we perform change propagation?

Change propagation
Dirty set ← {changed input}
For each sub-computation in a thread:
  Check its validity in the recorded happens-before order
  If (read set ∩ dirty set ≠ ∅):
    - Re-compute the sub-computation
    - Add its write set to the dirty set
  Else:
    - Skip execution of the sub-computation
    - Apply the memoized effects of the sub-computation
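
A minimal C sketch of this loop (hypothetical names throughout: recompute() and apply_memoized() are placeholders for re-executing a sub-computation and for replaying its memoized writes; they are not the iThreads API).

#include <stdbool.h>
#include <stddef.h>

#define MAX_VARS 64

typedef struct {
    bool read_set[MAX_VARS];
    bool write_set[MAX_VARS];
} subcomp;

/* Placeholder actions for the sketch. */
static void recompute(subcomp *sc)      { (void)sc; /* re-execute the sub-computation */ }
static void apply_memoized(subcomp *sc) { (void)sc; /* replay its memoized effects */ }

static bool intersects(const bool *a, const bool *b) {
    for (int v = 0; v < MAX_VARS; v++)
        if (a[v] && b[v]) return true;
    return false;
}

/* Change propagation over one thread's sub-computations, visited in the
 * recorded happens-before order; 'dirty' is seeded with the changed input. */
static void propagate(subcomp *subs, size_t n, bool dirty[MAX_VARS]) {
    for (size_t i = 0; i < n; i++) {
        if (intersects(subs[i].read_set, dirty)) {
            recompute(&subs[i]);                      /* its inputs changed: re-run it */
            for (int v = 0; v < MAX_VARS; v++)        /* and its outputs become dirty */
                if (subs[i].write_set[v]) dirty[v] = true;
        } else {
            apply_memoized(&subs[i]);                 /* unchanged: skip and reuse */
        }
    }
}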

Outline
- Motivation
- Design
- Evaluation

Evaluation
Evaluating iThreads:
1. Speedups for the incremental run
2. Overheads for the initial run
Implementation: a 32-bit dynamically linkable shared library for Linux
Platform: a 12-core Intel Xeon machine running Linux

1. Speedup for the incremental run
[Plot: speedups of the incremental run vs. pthreads across benchmarks; higher is better]
Speedups vs. pthreads of up to 8x

Memoization overhead
[Plot: space overheads w.r.t. input size across benchmarks]
iThreads writes a lot of intermediate state.

2. Overheads for the initial run
[Plot: runtime overheads of the initial run vs. pthreads across benchmarks; lower is better]

Summary
A case for parallel incremental computation
iThreads: incremental multithreading
- Transparent: targets unmodified programs
- Practical: supports the full range of synchronization primitives
- Efficient: employs parallelism for change propagation
- Usage: a dynamically linkable shared library

iThreads: Transparent + Practical + Efficient
Thank you!