Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) Alejandro Salinger Cheriton School of Computer Science University of Waterloo.


1 Optimal Speedup on a Low-Degree Multi-Core Parallel Architecture (LoPRAM) Alejandro Salinger Cheriton School of Computer Science University of Waterloo Joint work with Alejandro López-Ortiz and Reza Dorrigiv

2 Multicore Challenge
The RAM model will no longer accurately reflect the architecture on which algorithms are executed.
The PRAM facilitates design and analysis; however:
–It is unrealistic.
–It is difficult to derive work-optimal algorithms for Θ(n) processors.
Chips now have 2, 4, or 8 cores: low-degree parallelism.
Thread-based parallelism.

3 Multicore Challenge
Design a model that:
–Reflects the available degree of parallelism.
–Is multi-threaded.
–Admits easy theoretical analysis.
–Is easy to program.
“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law” [Gartner]

4 The LoPRAM Model
The number of cores is not a constant: it is modeled as O(log n).
This is analogous to bit-level parallelism, where the word size is w = O(log n) bits.
LoPRAM: a PRAM with p = O(log n) processors running in MIMD mode.
Concurrent Read, Exclusive Write (CREW).
Simplest form: high-level thread-based parallelism.
Semaphores and automatic serialization are available and transparent to the programmer.
Note that p = O(log n), but not necessarily p = Θ(log n).

5 PAL-threads
Sequential merge sort:

void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}

6 PAL-threads
Merge sort with PAL-threads:

void mergeSort(int numbers[], int temp[], int array_size) {
    m_sort(numbers, temp, 0, array_size - 1);
}

void m_sort(int numbers[], int temp[], int left, int right) {
    int mid = (right + left) / 2;
    if (right > left) {
        palthreads {    // do in parallel if possible
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
        }               // implicit join
        merge(numbers, temp, left, mid + 1, right);
    }
}

(Diagram: thread states pending, active, waiting.)
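As an illustration only (not part of the paper), the palthreads block above could be mapped onto POSIX threads roughly as follows; the names sort_args and psort are hypothetical. One recursive half runs in a spawned thread, the other in the current thread, and pthread_join plays the role of the implicit join. A real PAL-threads runtime would additionally cap the number of live threads at p and fall back to serial execution beyond that.

```c
#include <pthread.h>
#include <string.h>

/* Illustrative sketch: palthreads simulated with raw pthreads.
   sort_args and psort are hypothetical names, not from the paper. */

typedef struct { int *a; int *tmp; int left; int right; } sort_args;

/* merge a[left..mid-1] and a[mid..right] via tmp, then copy back */
static void merge(int *a, int *tmp, int left, int mid, int right) {
    int i = left, j = mid, k = left;
    while (i < mid && j <= right) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j <= right) tmp[k++] = a[j++];
    memcpy(a + left, tmp + left, (size_t)(right - left + 1) * sizeof(int));
}

static void *psort(void *arg) {
    sort_args *s = arg;
    if (s->right > s->left) {
        int mid = (s->left + s->right) / 2;
        sort_args lo = { s->a, s->tmp, s->left, mid };
        sort_args hi = { s->a, s->tmp, mid + 1, s->right };
        pthread_t t;
        pthread_create(&t, NULL, psort, &lo); /* "do in parallel" */
        psort(&hi);                           /* other half in this thread */
        pthread_join(t, NULL);                /* the implicit join */
        merge(s->a, s->tmp, s->left, mid + 1, s->right);
    }
    return NULL;
}

void mergeSort(int numbers[], int temp[], int array_size) {
    sort_args root = { numbers, temp, 0, array_size - 1 };
    psort(&root);
}
```

The two recursive calls write to disjoint halves of both arrays, so no locking is needed before the join.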

7 Work-Optimal Algorithms: Divide & Conquer
Recursive divide-and-conquer algorithms with running time given by:
T(n) = a·T(n/b) + f(n)
By the master theorem:
–If f(n) = O(n^(log_b a − ε)) for some ε > 0, then T(n) = Θ(n^(log_b a)).
–If f(n) = Θ(n^(log_b a)), then T(n) = Θ(n^(log_b a) · log n).
–If f(n) = Ω(n^(log_b a + ε)) for some ε > 0 (and f satisfies the regularity condition), then T(n) = Θ(f(n)).
For example, merge sort has a = b = 2 and f(n) = Θ(n), the second case, so T(n) = Θ(n log n).

8 Divide & Conquer

9 Divide & Conquer
Parallel master theorem on the LoPRAM: in the first two cases, T_p(n) = Θ(T(n)/p); in the third case, T_p(n) = Θ(f(n)).
If we assume parallel merging, the third case becomes T_p(n) = Θ(f(n)/p).
This gives optimal speedup [i.e., T_p(n) = Θ(T(n)/p)] so long as p = O(log n).

10 Matrix Multiplication
Strassen's algorithm: T(n) = 7·T(n/2) + O(n^2).
By the master theorem, T(n) = O(n^(log_2 7)) = O(n^2.81), so T_p(n) = O(n^2.81/p).

11 Dynamic Programming
A generic parallel algorithm exploits the parallelism available in the DAG of subproblem dependencies during execution.
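To make the DAG idea concrete, here is a hypothetical example of my own choosing (the slides do not specify one): edit distance, whose DP table has a dependency DAG in which all cells on one anti-diagonal are mutually independent. On a LoPRAM, each diagonal's cells could be split among the p = O(log n) cores; the sketch below walks the diagonals sequentially and marks the loop that would be parallelized.

```c
#include <string.h>

#define MIN3(a, b, c) ((a) < (b) ? ((a) < (c) ? (a) : (c)) \
                                 : ((b) < (c) ? (b) : (c)))

/* Edit distance computed by anti-diagonal "wavefronts" of the DP DAG.
   Cells on diagonal d = i + j depend only on diagonals d-1 and d-2,
   so the inner loop's iterations are independent of each other. */
int edit_distance(const char *s, const char *t) {
    int n = (int)strlen(s), m = (int)strlen(t);
    static int D[64][64];                 /* assumes n, m < 64 */
    for (int i = 0; i <= n; i++) D[i][0] = i;
    for (int j = 0; j <= m; j++) D[0][j] = j;
    for (int d = 2; d <= n + m; d++)      /* sweep diagonals in order */
        for (int i = 1; i <= n; i++) {    /* parallelizable across cores */
            int j = d - i;
            if (j < 1 || j > m) continue;
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            D[i][j] = MIN3(D[i - 1][j] + 1,
                           D[i][j - 1] + 1,
                           D[i - 1][j - 1] + cost);
        }
    return D[n][m];
}
```

Since each diagonal has at most min(n, m) cells and p = O(log n), every core gets Θ(diagonal length / p) cells per wavefront, which is the source of the speedup.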

12 Conclusions
Computers currently have a small number of processors.
The assumption that p = O(log n), or even p = O(log^2 n), will remain realistic for a while.
Designing work-optimal algorithms for a small number of processors is easy.

