Slide 1: 7 Questions for Parallelism
Applications:
 1. What are the apps?
 2. What are kernels of apps?
Hardware:
 3. What are the HW building blocks?
 4. How to connect them?
Programming Model & Systems Software:
 5. How to describe apps and kernels?
 6. How to program the HW?
Evaluation:
 7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)

Slide 2: How do we describe apps and kernels?
Observation 1: use Dwarfs. Dwarfs are of 2 types:
- Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines
- Patterns/Frameworks: MapReduce; graph traversal, graphical models; dynamic programming; backtracking/B&B; N-body; (un)structured grid
Algorithms in the dwarfs can be implemented either as:
- Compact parallel computations within a traditional library
- A compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (a generalized reduce).
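To make that composition concrete, here is a minimal Python sketch (not the Par Lab framework; the helper names are invented) of building a 2D FFT by instantiating a map-style framework with serial 1D FFT plug-ins and transposing between the two passes, as the slide describes.

```python
# Minimal sketch: a generic parallel-map framework whose serial plug-ins are
# 1D FFTs, composed into a 2D FFT by mapping over rows, transposing, and
# mapping over columns. Names are illustrative only.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, workers=4):
    """Framework piece: apply a serial plug-in independently to each item."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

def fft2_via_map(matrix):
    """2D FFT from 1D FFT plug-ins: map over rows, transpose, map again."""
    rows = parallel_map(np.fft.fft, list(matrix))   # 1D FFT of every row
    cols = np.transpose(np.array(rows))             # transpose (generalized reduce)
    cols = parallel_map(np.fft.fft, list(cols))     # 1D FFT of every column
    return np.transpose(np.array(cols))

if __name__ == "__main__":
    a = np.random.rand(8, 8)
    assert np.allclose(fft2_via_map(a), np.fft.fft2(a))  # matches the library 2D FFT
```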

Slide 3: Composing dwarfs to build apps
Any parallel application of arbitrary complexity may be built by composing parallel and serial components:
- Parallel patterns with serial plug-ins, e.g., MapReduce
- Serial code invoking parallel libraries, e.g., FFT, matrix ops., ...
Composition is hierarchical.
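The second composition style, serial code invoking parallel libraries, can be sketched just as briefly; the functions below are hypothetical stand-ins, with numpy kernels playing the role of tuned parallel libraries.

```python
# Minimal sketch of serial glue code composing parallel library kernels.
# preprocess/score are invented names; numpy's FFT and matrix multiply stand
# in for tuned parallel libraries.
import numpy as np

def preprocess(signal):
    # Serial glue: normalize, then hand off to a (parallel) library kernel.
    centered = signal - signal.mean()
    return np.fft.fft(centered)

def score(spectra, weights):
    # Another library call (matrix multiply); the serial driver never sees
    # the parallelism inside the kernels, so composition stays hierarchical.
    return np.abs(np.stack(spectra)) @ weights

if __name__ == "__main__":
    signals = [np.random.rand(1024) for _ in range(16)]
    spectra = [preprocess(s) for s in signals]   # serial loop over parallel kernels
    print(score(spectra, np.random.rand(1024)).shape)
```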

Slide 4: Programming the HW
2 types of programmers / 2 layers: "the right tool for the right time."
Productivity Layer (90% of programmers):
- Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
- Frameworks & libraries composed using a Coordination & Composition (C&C) Language to provide app frameworks
Efficiency Layer (10% of programmers):
- Expert programmers build:
  - Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
  - Libraries: software that supports compact computational expressions, e.g., Sketch for combinational or grid computation
- "Bare metal" efficiency possible at the Efficiency Layer
Effective composition techniques allow the efficiency programmers to be highly leveraged.

5 5 Coordination & Composition in CBIR Application Parallelism in CBIR is hierarchical Mostly independent tasks/data with combining DCT extractor Face Recog ? DWT ? … stream parallel over images task parallel over extraction algorithms data parallel map DCT over tiles combine concatenate feature vectors output stream of feature vectors DCT output stream of images feature extraction combine reduction on histograms from each tile output one histogram (feature vector)

6 6 Coordination & Composition Language Coordination & Composition language for productivity 2 key challenges 1. Correctness: ensuring independence using decomposition operators, copying and requirements specifications on frameworks 2. Efficiency: resource management during composition; domain- specific OS/runtime support Language control features hide core resources, e.g.,  Map DCT over tiles in language becomes set of DCTs/tiles per core  Hierarchical parallelism managed using OS mechanisms Data structure hide memory structures  Partitioners on arrays, graphs, trees produce independent data  Framework interfaces give independence requirements: e.g., map- reduce function must be independent, either by copying or application to partitioned data object (set of tiles from partitioner)

Slide 7: How do we program the HW? What are the problems?
For parallelism to succeed, must provide productivity, efficiency, and correctness simultaneously:
- Can't make SW productivity even worse!
- Why do it in parallel if efficiency doesn't matter?
- Correctness usually considered an orthogonal problem
- Productivity slows if code is incorrect or inefficient
- Correctness and efficiency slow if programming is unproductive
Most programmers are not ready for parallel programming:
- IBM SP customer escalations: concurrency bugs are the worst, can take months to fix
- How to make ≈90% of today's programmers productive on parallel computers?
- How to make code written by ≈90% of programmers efficient?

Slide 8: Ensuring Correctness
Productivity Layer: enforce independence of tasks using decomposition (partitioning) and copying operators
- Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
- E.g., the race-free program "atomic delete" + "atomic insert" does not compose to an "atomic replace"; need higher-level properties, rather than just locks or transactions
Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, etc.)
- Mixture of verification and automated directed testing
- Error detection on frameworks and libraries; some techniques applicable to third-party software
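A small Python illustration of the composition problem: each operation below is atomic under its own lock, yet the replace built from them exposes an intermediate state to other threads, so locks alone do not give the higher-level property.

```python
# Sketch of why atomic delete + atomic insert != atomic replace.
import threading

class ConcurrentSet:
    def __init__(self, items=()):
        self._items = set(items)
        self._lock = threading.Lock()

    def delete(self, x):                 # atomic on its own
        with self._lock:
            self._items.discard(x)

    def insert(self, x):                 # atomic on its own
        with self._lock:
            self._items.add(x)

    def contains(self, x):
        with self._lock:
            return x in self._items

def replace(s, old, new):
    # NOT atomic: between delete and insert, the set holds neither element.
    s.delete(old)
    s.insert(new)

if __name__ == "__main__":
    s = ConcurrentSet({"old"})
    t = threading.Thread(target=replace, args=(s, "old", "new"))
    t.start()
    # An observer here may see neither "old" nor "new" -- the broken window.
    observed = (s.contains("old"), s.contains("new"))
    t.join()
    print(observed)
```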

Slide 9: Support Software: What are the problems?
Compilers and operating systems are large, complex, and resistant to innovation:
- Takes a decade for compiler innovations to show up in production compilers?
- Time for an idea in SOSP to appear in a production OS?
Traditional OSes are brittle, insecure, memory hogs:
- A traditional monolithic OS image uses lots of precious memory × 100s-1000s of times (e.g., AIX uses GBs of DRAM per CPU)

Slide 10: 21st Century Code Generation
[Figure: search space for matmul block sizes; axes are the block dimensions, color ("temperature") is speed]
Problem: generating optimal code is like searching for a needle in a haystack.
New approach: "auto-tuners" first run variations of the program on the computer to heuristically search for the best combination of optimizations (blocking, padding, ...) and data structures, then produce C code to be compiled for that computer.
- E.g., PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW
- Can achieve 10X over a conventional compiler
Example: sparse matrix-vector multiply (SpMV) for 3 multicores
- Fastest SpMV: 2X OSKI/PETSc on Clovertown, 4X on Opteron
- Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO vs. BCSR data structures, 16b vs. 32b indices, ...
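The auto-tuning loop can be illustrated with a toy sketch (nothing like PHiPAC, ATLAS, or OSKI in sophistication): run blocked matrix-multiply variants on the target machine, time them, and keep the best block size; a real autotuner would then emit specialized C code for that configuration.

```python
# Toy autotuner: search block sizes empirically on this machine, keep the fastest.
import time
import numpy as np

def blocked_matmul(A, B, bs):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

def autotune(n=256, candidates=(16, 32, 64, 128)):
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    best_bs, best_t = None, float("inf")
    for bs in candidates:                      # heuristic search over the space
        t0 = time.perf_counter()
        blocked_matmul(A, B, bs)
        t = time.perf_counter() - t0
        print(f"block size {bs:4d}: {t:.3f} s")
        if t < best_t:
            best_bs, best_t = bs, t
    return best_bs

if __name__ == "__main__":
    print("best block size on this machine:", autotune())
```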

Slides 11-13: Example: Sparse Matrix * Vector (SpMV)

                        Clovertown              Opteron                 Cell
Chips * Cores           2 * 4 = 8               2 * 2 = 4               1 * 8 = 8
Architecture            4-issue, 2-SSE3, OOO,   3-issue, 1-SSE3, OOO,   2-VLIW, SIMD,
                        caches, prefetch        caches, prefetch        local store, DMA
Clock Rate              2.3 GHz                 2.2 GHz                 3.2 GHz
Peak MemBW              21.3 GB/s               21.3 GB/s               25.6 GB/s
Peak GFLOPS             74.6 GF                 17.6 GF                 14.6 GF (DP Fl. Pt.)
Naïve SpMV (median
  of many matrices)     1.0 GF                  0.6 GF                  --
Efficiency %            1%                      3%                      --
Autotuned SpMV          1.5 GF                  1.9 GF                  3.4 GF
Autotuning speedup      1.5X                    3.2X                    ∞

Slide 14: Greater productivity and efficiency for SpMV?
Two stacks compared:
- Parallelizing compiler + multicore + caches + prefetching
- Autotuner + multicore + local store + DMA
Originally, caches were introduced to improve programmer productivity; that is not always the case for manycore + autotuner: it is easier to autotune a single local store + DMA than multilevel caches + HW and SW prefetching.

Slide 15: Deconstructing Operating Systems
Resurgence of interest in virtual machines:
- VM monitor: a thin SW layer between the guest OS and the HW
Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources.
Partitioning support both enables very thin hypervisors and allows software full access to the hardware within its partition.

