DIMACS Workshop on Parallelism: A 2020 Vision
Alejandro Salinger
University of Waterloo
March 16, 2011

Multicore Challenges
The purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction [Maggs et al. ’95]
A model should provide clear, productive design incentives while providing strong messages to platform designers about the quality of characteristics required for efficient solution
The development of a unifying paradigm also requires a somewhat unified and stable technological environment
Theoretical Modeling of Multicore Computation - Alejandro Salinger

Multicore Challenges
We would like a model that:
Reflects the characteristics of the architecture
Is relatively flexible
Allows easy theoretical analysis
Has a cost model linked to the programming model
Is easy to learn
Is easy to program
Others? (parameter-oblivious?)

Multicore models
[Figure: multicore models placed between "Simple" and "Accurate"]

Low Degree PRAM (LoPRAM) [Dorrigiv, López-Ortiz, Salinger ’08]

Communication is key
Parallel computing is as much about communicating data between processors as it is about partitioning computing load between processors [Pal]
It’s all about the cache
Not only time complexity, also cache complexity: number of cache misses, parallel transfers
Reducing misses can lead to overall faster running time even if processors are not fully utilized
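To make "cache complexity" concrete, the sketch below counts misses for two traversal orders of the same matrix. It is only an illustration under stated assumptions: the cache is modeled as fully associative with LRU replacement, the matrix is stored in row-major order, and the names `count_misses`, `row_major`, and `column_major` are hypothetical helpers, not part of any model cited in the talk.

```python
from collections import OrderedDict

def count_misses(addresses, cache_blocks, block_size):
    """Count misses in an idealized, fully associative LRU cache (assumption)."""
    cache = OrderedDict()  # block id -> None, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

def row_major(n):
    """Addresses touched when an n x n row-major matrix is read row by row."""
    return [i * n + j for i in range(n) for j in range(n)]

def column_major(n):
    """Addresses touched when the same matrix is read column by column."""
    return [i * n + j for j in range(n) for i in range(n)]

n, block_size, cache_blocks = 64, 8, 16
print(count_misses(row_major(n), cache_blocks, block_size))     # 512
print(count_misses(column_major(n), cache_blocks, block_size))  # 4096
```

Both traversals perform the same 4096 reads, but the row-by-row order misses once per block while the column-by-column order misses on every access, because the 64 blocks touched by one column do not fit in the 16-block cache.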

Cache models
[Figure: four diagrams, each showing Core 1 to Core 4 with Cache and RAM]

Parallel External Memory (PEM) model [Arge, Goodrich, Nelson, Sitchinava ’08]
P synchronized processors
Private memory of M words
Blocks of size B words
Measures:
Computational complexity: maximum memory accesses to cache
I/O complexity: parallel block transfers from memory
[Figure: Core 1 to Core 4, each with a private memory M, above RAM]
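As a small worked example of the I/O measure, the sketch below counts parallel block transfers for a plain scan of N contiguous words. It assumes the blocks are divided as evenly as possible among the P processors and that each processor transfers at most one block per parallel I/O; the function name `pem_scan_ios` is hypothetical.

```python
def pem_scan_ios(n, p, b):
    """Parallel block transfers needed to scan n contiguous words.

    Assumes p processors that each move one block of b words per parallel I/O,
    with the blocks spread as evenly as possible (illustrative assumption).
    """
    if n < 0 or p < 1 or b < 1:
        raise ValueError("need n >= 0, p >= 1, b >= 1")
    blocks = -(-n // b)      # ceil(n / b) blocks hold the input
    return -(-blocks // p)   # ceil(blocks / p) rounds of parallel transfers

print(pem_scan_ios(1_000_000, 1, 64))  # 15625 transfers with a single processor
print(pem_scan_ios(1_000_000, 4, 64))  # 3907 parallel transfers with four processors
```

The count grows as N/(PB), which is the scanning cost usually quoted for this model.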

PEM I/O complexity [Arge, Goodrich, Sitchinava ’10] [Ajwani, Sitchinava, Zeh ’11]
Problems:
Sorting
Weighted list ranking
Euler tour
Tree contraction
Expression tree evaluation
Lowest Common Ancestor
Minimum Spanning Tree
Connected and biconnected components
Ear decomposition
Line Segment Intersection Reporting

DAG model

Schedulers
It’s all about the scheduler
Multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently
Restrict computation
Fully strict computation: all data dependencies go to thread’s parent
Work-stealing
[Figure: Core 1 to Core 4]

Schedulers: Work-Stealing [Acar, Blelloch, Blumofe ’02] [Blumofe, Leiserson ’94] [Blumofe, Frigo, Joerg, Leiserson, Randall ’96]
[Figure: Core 1 to Core 4 with caches C and RAM]
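The idea behind work stealing can be sketched with a small sequential simulation. Each worker owns a deque: it pushes and pops its own tasks at one end, and an idle worker steals the oldest task from the other end of a randomly chosen victim. Everything here is an illustrative assumption rather than the schedulers analyzed in the cited papers: the workload (a divide-and-conquer sum), the turn-based stepping, and the names `work_stealing_sum`, `grain`, and `seed`.

```python
import random
from collections import deque

def work_stealing_sum(data, workers=4, grain=4, seed=0):
    """Simulate work stealing on a divide-and-conquer sum of `data`.

    Returns (total, successful_steals, steps). Workers take turns inside each
    step, which is a simplification of truly concurrent execution.
    """
    rng = random.Random(seed)
    grain = max(grain, 1)                 # ranges of this size are summed directly
    deques = [deque() for _ in range(workers)]
    deques[0].append((0, len(data)))      # the root task starts on worker 0
    total = steals = steps = 0
    while any(deques):                    # tasks run as soon as they are taken
        steps += 1
        for w in range(workers):
            if deques[w]:
                lo, hi = deques[w].pop()            # owner takes its newest task
            else:
                victim = rng.randrange(workers)
                if victim == w or not deques[victim]:
                    continue                        # failed steal attempt
                lo, hi = deques[victim].popleft()   # thief takes the oldest task
                steals += 1
            if hi - lo <= grain:
                total += sum(data[lo:hi])           # leaf task: do the work
            else:
                mid = (lo + hi) // 2                # fork two subtasks locally
                deques[w].append((mid, hi))
                deques[w].append((lo, mid))
    return total, steals, steps

total, steals, steps = work_stealing_sum(list(range(1000)))
print(total)  # 499500
```

Stealing the oldest task tends to hand over a large piece of work, so steals stay rare while the owner keeps working on recently created, and therefore likely cached, subproblems.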

Schedulers: Parallel Depth First [Blelloch, Gibbons ’04]
[Figure: Core 1 to Core 4, cache C_p, RAM]
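A parallel depth-first (PDF) schedule gives priority to ready tasks in the order a sequential depth-first execution would reach them. The sketch below is a simplified, assumption-laden version of that idea: tasks take unit time, the DAG is given as a dictionary from each node to its ordered list of children, there is a single root, and the names `sequential_depth_first_order` and `pdf_schedule` are hypothetical.

```python
import heapq

def sequential_depth_first_order(children, root):
    """Order in which one processor runs the DAG depth first (left child first)."""
    pending = {v: 0 for v in children}          # unfinished parents per node
    for v in children:
        for c in children[v]:
            pending[c] += 1
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for c in reversed(children[v]):         # leftmost child ends up on top
            pending[c] -= 1
            if pending[c] == 0:
                stack.append(c)
    return order

def pdf_schedule(children, root, p):
    """Each step runs up to p ready nodes with the smallest sequential rank."""
    rank = {v: i for i, v in enumerate(sequential_depth_first_order(children, root))}
    pending = {v: 0 for v in children}
    for v in children:
        for c in children[v]:
            pending[c] += 1
    ready = [(rank[root], root)]
    steps = []
    while ready:
        batch = [heapq.heappop(ready)[1] for _ in range(min(p, len(ready)))]
        steps.append(batch)
        for v in batch:
            for c in children[v]:
                pending[c] -= 1
                if pending[c] == 0:
                    heapq.heappush(ready, (rank[c], c))
    return steps

# Hypothetical fork-join DAG used only for illustration.
dag = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"], "d": ["g"],
       "e": ["g"], "f": ["h"], "g": ["h"], "h": []}
print(sequential_depth_first_order(dag, "a"))
print(pdf_schedule(dag, "a", 2))
```

With one processor the schedule coincides with the sequential order; with more processors it stays close to that order, which is the intuition for why cores sharing a cache touch similar data.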

Schedulers [Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch ’08]
[Figure: Core 1 to Core 4, L1 and L2 caches, RAM]

Schedulers: Controlled-PDF

Cache obliviousness [Blelloch, Gibbons, Simhadri ’10]

Low-depth cache oblivious
Problem | Depth | Cache (size M, block B)
Sorting
List ranking
Euler tour on trees
Tree contraction
Lowest Common Ancestor (k queries)
Minimum Spanning Forest
Connected components

Resource Oblivious Algorithms - HM [Chowdhury, Silvestri, Blakeley, Ramachandran ’10]
Hierarchical model (HM)
Extension to multicore model
Efficient oblivious algorithms for:
Matrix transposition
FFT
Sorting
Gaussian Elimination Paradigm
List ranking
Connected components
Scheduler hints
[Figure: Core 1 to Core 4, Cache, RAM]
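For a flavor of what "oblivious" means, here is the classic sequential cache-oblivious recursive matrix transposition. It is not the multicore algorithm from the cited paper, only a well-known building block of the same style: the code never refers to the cache size M or block size B. The names `transpose_into` and `transposed` and the `base` cutoff are assumptions of this sketch; the cutoff only limits recursion overhead in Python.

```python
def transpose_into(a, out, r0, r1, c0, c1, base=8):
    """Write the transpose of a[r0:r1][c0:c1] into out, splitting the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        for i in range(r0, r1):            # small tile: transpose directly
            for j in range(c0, c1):
                out[j][i] = a[i][j]
    elif rows >= cols:
        mid = (r0 + r1) // 2               # split the rows in half
        transpose_into(a, out, r0, mid, c0, c1, base)
        transpose_into(a, out, mid, r1, c0, c1, base)
    else:
        mid = (c0 + c1) // 2               # split the columns in half
        transpose_into(a, out, r0, r1, c0, mid, base)
        transpose_into(a, out, r0, r1, mid, c1, base)

def transposed(a, base=8):
    """Return the transpose of a rectangular list-of-lists matrix."""
    n = len(a)
    m = len(a[0]) if a else 0
    out = [[None] * n for _ in range(m)]
    transpose_into(a, out, 0, n, 0, m, max(base, 1))
    return out

matrix = [[i * 5 + j for j in range(5)] for i in range(3)]
print(transposed(matrix, base=2))
```

The two recursive calls in each branch touch disjoint parts of the input and output, so they are natural candidates for parallel execution under a scheduler.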

Multi-BSP [Valiant ’08]
d levels, each with parameters (p_j, L_j, m_j, g_j)
p_j: number of components
L_j: synchronization cost
m_j: size of memory
g_j: data rate
Level 0: cores
Portable algorithms
“Immortal algorithms”
Optimal algorithms for matrix multiplication, FFT, and sorting
L closer to latency than synchronization
Prescriptive: e.g. support for synchronization operation
[Figure: a level j component made of p_j level j-1 components, with memories m_j and data rate g_j, shown over Core 1 to Core 4, Cache, and RAM]
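The parameter tuples can be written down as a small data structure. The sketch below assumes that a level j component consists of p_j components of level j-1 plus a memory of size m_j, so the number of cores is the product of all p_j. The class and function names and all numeric values are hypothetical and do not describe any real machine.

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Level:
    p: int    # number of level j-1 components inside one level j component
    L: float  # synchronization cost at level j
    m: int    # memory size of one level j component, in words
    g: float  # data rate parameter at level j

def total_cores(levels):
    """Level 0 components (cores) in the whole machine."""
    return prod(level.p for level in levels)

def components_at_level(levels, j):
    """Number of level j components in the machine; levels[0] describes level 1."""
    return prod(level.p for level in levels[j:])

def total_memory_at_level(levels, j):
    """Combined size of all level j memories, for 1 <= j <= d."""
    return levels[j - 1].m * components_at_level(levels, j)

# Hypothetical three-level machine: made-up numbers for illustration only.
machine = [
    Level(p=1, L=1, m=32 * 1024, g=1),       # level 1: one core with a private cache
    Level(p=4, L=10, m=8 * 1024 * 1024, g=4),  # level 2: four level 1 components share a cache
    Level(p=2, L=100, m=2 ** 34, g=16),      # level 3: two chips share main memory
]
print(total_cores(machine))                  # 8
print(components_at_level(machine, 2))       # 2
print(total_memory_at_level(machine, 1))     # 262144
```

An algorithm written against these symbols rather than fixed numbers is what makes it portable across machines with different depths and sizes.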

Models Summary
Modeling parallel computation is hard
Multicore architecture constantly changing
Cache should be part of the equation
Maybe later inter-processor communication, synchronization, energy

Models Summary
Good:
No need to reinvent everything
Large class of algorithms with good cache complexity for shared or private caches
Some relatively simple design in terms of work, depth, and sequential cache complexity
Parameters of the machine only known by scheduler
Cilk Plus: model, scheduler, tools widely available
Needs improvement:
More algorithms or scheduler with good shared and private cache complexities
How to choose the scheduler?
Theory needs to be accessible to the masses

Parallel training
Current CS degree prepares for programming on obsolete model
Change of mentality:
Parallel thinking (algorithms, programming), but also I/O complexity, locality of reference
Programming languages
Right balance between practical skills and underlying theory?
How to add new concepts without too much sacrifice?
More specialized majors?

Final thoughts
Constant factor speedup, opportunity for simplicity
Use of more efficient, low-level algorithms where appropriate (library tools)
Should we marry multicores? What’s the next thing?

Bibliography
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.
D. Ajwani, N. Sitchinava, and N. Zeh. I/O-optimal algorithms for orthogonal problems for private-cache chip multiprocessors. In IPDPS ’11, 2011.
L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In ACM SPAA ’08, 2008.
L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IPDPS ’10, 2010.
G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM SODA ’08, 2008.

Bibliography (2)
G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In ACM SPAA ’04, 2004.
G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low-depth cache oblivious algorithms. In ACM SPAA ’10, 2010.
R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.
R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An Analysis of Dag-Consistent Distributed Shared-Memory Algorithms. In SPAA ’96, 1996.

Bibliography (3)
R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. In IEEE IPDPS ’10, 2010.
R. Cole and V. Ramachandran. Resource Oblivious Sorting on Multicores. In ICALP ’10, 2010.
R. Dorrigiv, A. López-Ortiz, and A. Salinger. Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM). In ACM SPAA ’08, 2008.
B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of Parallel Computation: A Survey and Synthesis. In HICSS ’95, 1995.
L. G. Valiant. A bridging model for multicore computing. Journal of Computer and System Sciences, 2010.
