PFunc: Modern Task Parallelism For Modern High Performance Computing Prabhanjan Kambadur, Open Systems Lab, Indiana University.


Overview
- Motivate the problem: the need for another task-parallel solution
- PFunc, a library-based solution for task parallelism
  - Introduce the Cilk model
  - Discuss PFunc's features using Fibonacci
- Case studies
  - Demand-driven DAG execution
  - Frequent pattern mining
  - Sparse CG
- Conclusion and future work

Motivation
- Parallelize a wide variety of applications: traditional HPC, informatics, mainstream
- Parallelize for modern architectures: multi-core, many-core, and GPGPUs
- Enable user-driven optimizations: fine-tune application performance with no runtime penalties
- Mix SPMD-style programming with tasks

Task parallelism and Cilk
- The program is broken down into smaller tasks; independent tasks are executed in parallel
- A generic model of parallelism: subsumes data parallelism and SPMD parallelism
- Cilk is the most successful implementation (Leiserson et al.)
  - Base languages: C and C++
  - Work-stealing scheduler
  - Guaranteed bounds on space and time
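As a concrete reference point, the canonical Cilk example is recursive Fibonacci, shown here in the Cilk++/Cilk Plus spelling (a sketch; the PFunc version later in the talk mirrors this structure):

    #include <cilk/cilk.h>

    // Canonical Cilk Fibonacci: each call spawns one branch and keeps the other.
    int fib (int n) {
      if (n < 2) return n;
      int x = cilk_spawn fib (n - 1);  // child may run in parallel on a thief
      int y = fib (n - 2);             // parent continues depth-first
      cilk_sync;                       // join: wait for the spawned child
      return x + y;
    }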

Cilk-style parallelization
[Figure: one thread executing the Fibonacci task tree (n, n-1, n-2, ..., n-6). Order of discovery vs. order of completion: depth-first discovery, post-order finish.]

Cilk-style parallelization
[Figure: two threads (Thd 1, Thd 2) working from thread-local deques on the Fibonacci task tree; Thd 2 steals n-1, then n-3, from Thd 1's deque.]
1. Breadth-first theft.
2. Steal one task at a time.
3. Stealing is expensive.
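The deque discipline behind the diagram, as a minimal C++ sketch (hypothetical types; not Cilk's or PFunc's actual internals): the owner pushes and pops at one end, giving depth-first execution, while thieves take single, oldest tasks from the other end, giving breadth-first theft. Real runtimes use a lock-free design (Cilk's THE protocol) so the owner's fast path avoids locking; the mutex here only keeps the sketch simple.

    #include <deque>
    #include <mutex>

    struct Task { /* application-specific work */ };

    class WorkDeque {                 // one per worker thread
      std::deque<Task*> dq;
      std::mutex m;
     public:
      void push (Task* t) {           // owner: newest work at the back
        std::lock_guard<std::mutex> g (m);
        dq.push_back (t);
      }
      Task* pop () {                  // owner: LIFO -> depth-first execution
        std::lock_guard<std::mutex> g (m);
        if (dq.empty ()) return nullptr;
        Task* t = dq.back (); dq.pop_back (); return t;
      }
      Task* steal () {                // thief: FIFO -> breadth-first theft,
        std::lock_guard<std::mutex> g (m);   // one task at a time
        if (dq.empty ()) return nullptr;
        Task* t = dq.front (); dq.pop_front (); return t;
      }
    };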

Drawbacks of Cilk
- The scheduling policy is hard-coded: tasks cannot have priorities, and it is difficult to switch the task scheduling policy
- Divide and conquer is a must: algorithms must be refactored to fit it, otherwise data locality between tasks is not exploited
- Fully-strict computation model: the task graph is always a tree-DAG, so general DAG structures cannot be executed directly, and SPMD and task parallelism cannot be mixed

PFunc: An overview
- A library-based solution for task parallelism, with C and C++ APIs
- Extends the existing task-parallel feature set: Cilk, Threading Building Blocks (TBB), Fortran M, etc.
- Fully customizable: built on generic and generative programming principles, with no runtime penalty for customizations
- Portable: Linux, OS X, and AIX; Windows release soon!

PFunc: Feature set

Feature            Explanation
Scheduling Policy  Determines task scheduling (e.g., cilkS)
Compare            Ordering function for the tasks (e.g., std::less)
Functor            Type of the function to be parallelized

    struct fibonacci;
    typedef pfunc::generator<cilkS,              // Scheduling policy
                             pfunc::use_default, // Compare
                             fibonacci>          // Functor
        my_pfunc;
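The point of the generator is that all three parameters are swappable. Based only on the interface above and the scheduler names listed later in the talk (cilkS, prioS, fifoS, lifoS), a priority-scheduled instantiation would presumably look like the following; the choice of integer priorities and std::less ordering is an assumption for illustration:

    #include <functional>

    struct fibonacci;

    // Hypothetical variant: priority scheduling instead of Cilk-style stealing.
    typedef pfunc::generator<prioS,          // assumed priority-based policy
                             std::less<int>, // assumed: integer task priorities
                             fibonacci>
        my_prio_pfunc;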

PFunc: Nested types

Type       Explanation
Attribute  Attached to each task; used for affinity, priority, etc.
Group      Attached to each task; used for SPMD-style programming
Task       Handle to a spawned task; used for status checks
Taskmgr    Represents PFunc's runtime; encapsulates threads and queues

    typedef my_pfunc::attribute my_attr;
    typedef my_pfunc::group     my_group;
    typedef my_pfunc::task      my_task;
    typedef my_pfunc::taskmgr   my_taskmgr;

Fibonacci numbers

    my_taskmgr* gbl_taskmgr;  // global runtime handle, initialized at startup

    struct fibonacci {
      fibonacci (const int& n) : n(n), fib_n(0) {}
      int get_number () const { return fib_n; }
      void operator() (void) {
        if (0 == n || 1 == n) fib_n = n;
        else {
          my_task tsk;
          fibonacci fib_n_1 (n-1), fib_n_2 (n-2);
          pfunc::spawn (*gbl_taskmgr, tsk, fib_n_1); // fib(n-1) as a task
          fib_n_2 ();                                // fib(n-2) inline
          pfunc::wait (*gbl_taskmgr, tsk);           // join the spawned task
          fib_n = fib_n_1.get_number () + fib_n_2.get_number ();
        }
      }
     private:
      const int n;
      int fib_n;
    };
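A minimal sketch of a driver for the functor above, assuming a default-constructible taskmgr and the global pointer shown in the listing (both assumptions; the slide shows only the spawn/wait calls):

    #include <iostream>

    int main () {
      my_taskmgr tm;        // assumption: default-constructed PFunc runtime
      gbl_taskmgr = &tm;    // wire up the global used inside fibonacci
      fibonacci fib (30);
      fib ();               // the root task runs inline; children are spawned
      std::cout << "fib(30) = " << fib.get_number () << std::endl;
      return 0;
    }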

PFunc: Fibonacci performance
- 2x faster than TBB
- 2x slower than Cilk
- Provides more flexibility than TBB or Cilk
- Platform: 4-socket quad-core AMD 8356 running Linux
[Table: runtime by thread count — columns Threads, Cilk (secs), PFunc/Cilk, PFunc/TBB]

New features in PFunc
- Customizable task scheduling and task priorities: cilkS, prioS, fifoS, and lifoS provided (see the queue-discipline sketch below)
- Multiple task-completion notifications on demand: deviates from the strict computation model
- Task groups: SPMD-style parallelization
- Task affinities for heterogeneous computers: attach tasks to queues and queues to processors
- Exception handling and profiling
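To make the four policies concrete, here is a generic sketch of the queue disciplines their names suggest (illustrative only, not PFunc's internals):

    #include <deque>
    #include <queue>
    #include <vector>

    struct Task { int priority; /* ... */ };

    // lifoS / fifoS: one queue, popped from opposite ends.
    std::deque<Task*> q;
    Task* lifo_next () { Task* t = q.back ();  q.pop_back ();  return t; }
    Task* fifo_next () { Task* t = q.front (); q.pop_front (); return t; }

    // prioS: a priority queue ordered by the user-supplied Compare.
    struct ByPriority {
      bool operator() (const Task* a, const Task* b) const {
        return a->priority < b->priority;  // larger priority served first
      }
    };
    std::priority_queue<Task*, std::vector<Task*>, ByPriority> pq;

    // cilkS: per-thread deques, owner pops LIFO, thieves steal FIFO
    // (see the work-stealing sketch earlier in the talk).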

Case Studies

Demand-driven DAG execution
- Data-driven DAG execution has shortcomings: increased memory consumption in many applications, and over-parallelization (e.g., sparse Cholesky factorization)
- The strict computation model precludes demand-driven execution of general DAGs: only tree-DAGs are supported
- PFunc enables demand-driven DAG execution via multiple task-completion notifications and task priorities to control execution order (see the sketch below)
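A generic illustration of the demand-driven idea, independent of PFunc's API: a node is computed only when a consumer demands it, and it recursively demands just its predecessors, so only nodes on demanded paths are ever materialized. The combine function is a toy stand-in for application-specific work.

    #include <vector>

    struct Node {
      std::vector<Node*> preds;        // dependencies in the DAG
      bool computed = false;
      double value = 0.0;
      double compute_from_preds () {   // toy combine; app-specific in practice
        double s = 1.0;
        for (Node* p : preds) s += p->value;
        return s;
      }
    };

    double demand (Node& n) {
      if (!n.computed) {
        for (Node* p : n.preds)
          demand (*p);                 // each of these could be a spawned task
        n.value = n.compute_from_preds ();
        n.computed = true;             // later demands reuse the cached value
      }
      return n.value;
    }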

DAG execution: Runtime

DAG execution: Peak memory usage

Frequent pattern mining (FPM)
- FPM algorithms are not always recursive: the best-known algorithm (Apriori) is breadth-first (see the skeleton below)
- Optimal execution depends on memory reuse between tasks
- Current solutions do not support task affinities: affinities are exploited only in divide-and-conquer execution, with an emphasis on recursive parallelism
- PFunc allows custom scheduling and task priorities: a nearest-neighbor scheduling algorithm, a hash-table-based common-prefix scheduling algorithm, and task priorities that double as keys for tasks
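A skeletal Apriori level loop, for contrast with recursive divide-and-conquer (generic, not the talk's implementation; generate_candidates and count_support are assumed helpers shown as declarations only): parallelism is across the candidates of one level rather than down a recursion tree, and candidates sharing a prefix touch the same database slices, which is what prefix- and affinity-aware scheduling exploits.

    #include <utility>
    #include <vector>

    using Itemset = std::vector<int>;

    std::vector<Itemset> generate_candidates (const std::vector<Itemset>& freq);
    long count_support (const Itemset& c);  // scans the transaction database

    std::vector<Itemset> apriori (std::vector<Itemset> frequent, long minsup) {
      std::vector<Itemset> all = frequent;
      while (!frequent.empty ()) {          // one pass per level: breadth-first
        std::vector<Itemset> next;
        for (const Itemset& c : generate_candidates (frequent)) {
          // Each support count is an independent task; running tasks with a
          // common prefix on the same thread reuses cached database slices.
          if (count_support (c) >= minsup) next.push_back (c);
        }
        all.insert (all.end (), next.begin (), next.end ());
        frequent = std::move (next);
      }
      return all;
    }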

Frequent pattern mining

Iterative sparse solvers
- Krylov-subspace methods such as CG and GMRES
- Efficient parallelization requires both styles: SPMD for unpreconditioned iterative sparse solvers, and task parallelism for preconditioners (e.g., incomplete factorization methods); a structural sketch follows
- Current solutions do not support the SPMD model
- PFunc supports SPMD through task groups: barrier operation and group cancellation, with point-to-point operations coming soon!
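A structural sketch of the SPMD pattern CG needs (the linear-algebra bodies are elided; std::barrier stands in for PFunc's task groups, an assumption about their role rather than their API):

    #include <barrier>
    #include <thread>
    #include <vector>

    // SPMD skeleton: every worker runs the same iteration body on its own
    // block of matrix rows, in lockstep, separated by barriers.
    void spmd_cg (int nthreads, int iters) {
      std::barrier<> bar (nthreads);
      std::vector<std::thread> pool;
      for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back ([&bar, iters] {
          for (int k = 0; k < iters; ++k) {
            // 1) local sparse matvec on this worker's row block
            bar.arrive_and_wait ();
            // 2) contribute partial sums to the shared dot products
            bar.arrive_and_wait ();
            // 3) update the local pieces of x, r, p with the new scalars
            bar.arrive_and_wait ();
          }
        });
      }
      for (auto& th : pool) th.join ();
    }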

Conjugate gradient

Conclusions
- PFunc increases tasking support for:
  - Modern HPC applications: DAG execution, frequent pattern mining, sparse CG, and SPMD-style programming
  - Modern computer architectures
- Future work: parallelize more applications and incorporate support for GPGPUs