Theoretical Modeling of Multicore Computation
DIMACS Workshop on Parallelism: A 2020 Vision
Alejandro Salinger, University of Waterloo, March 16, 2011


Multicore Challenges
- "The purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction" [Maggs et al. '95]
- A model should provide clear, productive design incentives, while providing strong messages to platform designers about the characteristics required for efficient solutions.
- The development of a unifying paradigm also requires a somewhat unified and stable technological environment.

Multicore Challenges
We would like a model that:
- Reflects the characteristics of the architecture
- Is relatively flexible
- Admits easy theoretical analysis
- Has a cost model linked to the programming model
- Is easy to learn
- Is easy to program
- Others? (parameter-oblivious?)

Multicore models
[Figure: existing multicore models positioned along a spectrum from simple to accurate.]

Low Degree PRAM (LoPRAM) [Dorrigiv, López-Ortiz, Salinger '08]
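For context, a hedged paraphrase (from memory, not from the slide) of the headline LoPRAM claim: with a low degree of parallelism, p = O(log n) processors, broad classes of divide-and-conquer and dynamic-programming algorithms achieve optimal speedup without redesign:

```latex
% Paraphrase from memory of [Dorrigiv, López-Ortiz, Salinger '08]; treat as a pointer, not a quote.
T_p(n) \;=\; O\!\left(\frac{T_1(n)}{p}\right) \quad \text{for } p = O(\log n),
% where T_1(n) is the running time of the underlying sequential algorithm.
```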

Communication is key
- "Parallel computing is as much about communicating data between processors as it is about partitioning the computing load between processors" [Pal]
It's all about the cache
- Not only time complexity, but also cache complexity: number of cache misses, parallel transfers
- Reducing misses can lead to an overall faster running time even if processors are not fully utilized
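A back-of-the-envelope cost model makes the last point concrete. This is an illustration added here, not from the talk, and the constants are hypothetical:

```latex
% Illustrative only: t_op is a hypothetical per-operation cost and t_miss a per-miss penalty.
T \;\approx\; \frac{W}{P}\, t_{\mathrm{op}} \;+\; Q_P\, t_{\mathrm{miss}}
% If t_miss is on the order of 100 t_op, halving the parallel miss count Q_P
% can reduce T more than doubling the processor count P does.
```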

Cache models
[Figure: four alternative organizations of four cores, cache, and RAM (e.g., a single shared cache vs. per-core private caches).]

Parallel External Memory (PEM) [Arge, Goodrich, Nelson, Sitchinava '08]
- P synchronized processors
- Each with a private memory (cache) of M words
- Data moved in blocks of B words
Measures:
- Computational complexity: maximum number of memory accesses to cache
- I/O complexity: number of parallel block transfers from main memory
[Figure: four cores, each with a private memory of size M, connected to shared RAM.]
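A simple worked example of the I/O measure (added for illustration; it follows from the model definition rather than from the slide): scanning N contiguous items, with each of the P processors handling a contiguous chunk of size N/P, costs

```latex
O\!\left(\frac{N}{PB} + 1\right) \ \text{parallel block transfers,}
```

since each processor reads its N/P items B words at a time.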

PEM I/O-complexity results are known for the following problems [Arge, Goodrich, Sitchinava '10; Ajwani, Sitchinava, Zeh '11]:
- Sorting
- Weighted list ranking
- Euler tour
- Tree contraction
- Expression tree evaluation
- Lowest common ancestor
- Minimum spanning tree
- Connected and biconnected components
- Ear decomposition
- Line segment intersection reporting
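For reference, and not taken from the slide: the PEM sorting bound from the SPAA '08 paper cited above is, to the best of my recollection,

```latex
\mathrm{sort}_P(N) \;=\; O\!\left(\frac{N}{PB}\,\log_{M/B}\frac{N}{B}\right)
```

parallel I/Os (for suitable ranges of P), i.e. the sequential external-memory sorting bound divided by P; several of the graph bounds in the cited papers are stated in terms of this sorting term.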

DAG model
[Figure: a computation represented as a directed acyclic graph of tasks.]
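A worked statement of the standard work/depth bound for this model, added here for context (it is a classical fact rather than text from the slide): if the computation DAG has total work W and depth (critical-path length) D, then any greedy schedule on P processors completes in time

```latex
T_P \;\le\; \frac{W}{P} + D,
```

so speedup is limited by both W/P and D, and near-linear speedup requires W/P to dominate D.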

Schedulers
It's all about the scheduler:
- Multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently
- Restrict the computation: fully strict computations, in which all data dependencies go to the thread's parent
- Work stealing

Schedulers: Work-Stealing [Blumofe, Leiserson '94] [Blumofe, Frigo, Joerg, Leiserson, Randall '96] [Acar, Blelloch, Blumofe '02]
[Figure: four cores, each with a private cache (C), connected to RAM.]
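Two additions for context, neither taken from the slide itself. First, the classical guarantee usually credited to Blumofe and Leiserson: for a fully strict computation with work T_1 and critical-path length T_infinity, randomized work stealing on P processors runs in expected time

```latex
\mathbb{E}[T_P] \;\le\; \frac{T_1}{P} + O(T_\infty).
```

Second, a minimal sketch of the fork-join style of computation such a scheduler executes, using Java's ForkJoinPool (whose runtime uses work stealing). The array-summing task, the cutoff value, and all class and method names are illustrative choices, not anything from the talk:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork-join sum of an array; ForkJoinPool's work-stealing scheduler
// lets idle workers steal the forked subtasks.
public class ParallelSum extends RecursiveTask<Long> {
    private static final int CUTOFF = 1 << 14;   // below this size, sum sequentially
    private final long[] a;
    private final int lo, hi;

    ParallelSum(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= CUTOFF) {                  // base case: sequential sum
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ParallelSum left = new ParallelSum(a, lo, mid);
        left.fork();                              // spawn left half; an idle core may steal it
        long right = new ParallelSum(a, mid, hi).compute();  // work on right half locally
        return right + left.join();               // wait for the (possibly stolen) left half
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 20];
        java.util.Arrays.fill(a, 1L);
        long total = ForkJoinPool.commonPool().invoke(new ParallelSum(a, 0, a.length));
        System.out.println(total);                // 1048576
    }
}
```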

Schedulers: Parallel Depth-First (PDF) [Blelloch, Gibbons '04]
[Figure: four cores sharing a cache of size C_p, backed by RAM.]
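The shared-cache guarantee I associate with PDF scheduling (paraphrased from memory of the cited SPAA '04 paper, so treat it as a pointer rather than a precise statement): if the sequential depth-first schedule incurs Q_1 cache misses with a cache of size M_1, then the PDF schedule on P processors incurs at most Q_1 misses with a shared cache of size

```latex
M_P \;=\; M_1 + P \cdot D,
```

where D is the depth of the computation DAG.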

Schedulers [Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch '08]
[Figure: four cores with private L1 caches and a shared L2 cache, backed by RAM.]

Schedulers: Controlled-PDF

Cache obliviousness [Blelloch, Gibbons, Simhadri '10]
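A minimal sketch of the cache-oblivious design style this line of work builds on (a standard textbook example, not code from the talk; the class and method names are mine): recursive matrix transposition, which achieves O(nm/B) misses under the tall-cache assumption M = Omega(B^2) without the code ever mentioning M or B.

```java
// Cache-oblivious out-of-place transpose: recursively split the larger
// dimension; the recursion automatically reaches subproblems that fit in
// every level of the cache hierarchy, with no knowledge of M or B.
public class CacheObliviousTranspose {
    private static final int BASE = 32;   // small base case handled directly

    // Transpose the r x c block of a with top-left corner (ri, ci) into b.
    static void transpose(double[][] a, double[][] b, int ri, int ci, int r, int c) {
        if (r <= BASE && c <= BASE) {
            for (int i = 0; i < r; i++)
                for (int j = 0; j < c; j++)
                    b[ci + j][ri + i] = a[ri + i][ci + j];
        } else if (r >= c) {               // split the longer dimension
            transpose(a, b, ri, ci, r / 2, c);
            transpose(a, b, ri + r / 2, ci, r - r / 2, c);
        } else {
            transpose(a, b, ri, ci, r, c / 2);
            transpose(a, b, ri, ci + c / 2, r, c - c / 2);
        }
    }

    public static void main(String[] args) {
        int n = 500, m = 700;
        double[][] a = new double[n][m], b = new double[m][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                a[i][j] = i * m + j;
        transpose(a, b, 0, 0, n, m);
        System.out.println(b[3][2] == a[2][3]);   // true
    }
}
```

The two recursive calls in each branch are independent, so the same structure parallelizes directly by forking both halves, which is why work, depth, and sequential cache complexity can all be analyzed on a single algorithm.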

Low-depth cache-oblivious algorithms
Results are given as depth and cache complexity (cache size M, block size B) for:
- Sorting
- List ranking
- Euler tour on trees
- Tree contraction
- Lowest common ancestor (k queries)
- Minimum spanning forest
- Connected components

Resource-Oblivious Algorithms: HM [Chowdhury, Silvestri, Blakeley, Ramachandran '10]
- Hierarchical model (HM), extended to a multicore model
- Efficient oblivious algorithms for: matrix transposition, FFT, sorting, the Gaussian Elimination Paradigm, list ranking, connected components
- Scheduler hints
[Figure: four cores with cache and RAM.]

Multi-BSP [Valiant '08]
- d levels, each with parameters (p_j, L_j, m_j, g_j):
  - p_j: number of components
  - L_j: synchronization cost
  - m_j: size of memory
  - g_j: data rate
- Level 0: cores
- Portable algorithms, "immortal algorithms"
- Optimal algorithms for matrix multiplication, FFT, and sorting
- L is closer to latency than to synchronization
- Prescriptive: e.g., support for a synchronization operation
[Figure: a level-j component built from p_j level-(j-1) components, with memory m_j and data rate g_j; at the bottom level, cores, cache, and RAM.]

Models Summary
- Modeling parallel computation is hard
- Multicore architectures are constantly changing
- The cache should be part of the equation
- Maybe later: inter-processor communication, synchronization, energy

Models Summary
Good:
- No need to reinvent everything
- A large class of algorithms with good cache complexity for shared or private caches
- Some relatively simple designs in terms of work, depth, and sequential cache complexity
- Machine parameters are known only to the scheduler
- Cilk Plus: model, scheduler, and tools widely available
Needs improvement:
- More algorithms, or a scheduler, with good shared and private cache complexities
- How to choose the scheduler?
- Theory needs to be accessible to the masses

Parallel training
- The current CS degree prepares students to program an obsolete model
- Change of mentality: parallel thinking (algorithms, programming), but also I/O complexity and locality of reference
- Programming languages
- What is the right balance between practical skills and underlying theory?
- How can we add new concepts without sacrificing too much?
- More specialized majors?

Final thoughts
- Constant-factor speedups: an opportunity for simplicity
- Use more efficient, low-level algorithms where appropriate (library tools)
- Should we marry multicores? What's the next thing?

Bibliography
- U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.
- D. Ajwani, N. Sitchinava, and N. Zeh. I/O-optimal algorithms for orthogonal problems for private-cache chip multiprocessors. In IPDPS '11, 2011.
- L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA '08, 2008.
- L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IPDPS '10, 2010.
- G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA '08, 2008.

Bibliography (2)
- G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA '04, 2004.
- G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA '10, 2010.
- R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.
- R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In SPAA '96, 1996.

Bibliography (3)
- R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS '10, 2010.
- R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. In ICALP '10, 2010.
- R. Dorrigiv, A. López-Ortiz, and A. Salinger. Optimal speedup on a low-degree multi-core parallel architecture (LoPRAM). In ACM SPAA '08, 2008.
- B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: a survey and synthesis. In HICSS '95, 1995.
- L. G. Valiant. A bridging model for multicore computing. Journal of Computer and System Sciences, 2010.