
1  7 Questions for Parallelism
Applications:
  1. What are the apps?
  2. What are kernels of apps?
Hardware:
  3. What are the HW building blocks?
  4. How to connect them?
Programming Model & Systems Software:
  5. How to describe apps and kernels?
  6. How to program the HW?
Evaluation:
  7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)

2  How do we describe apps and kernels?
Observation 1: use Dwarfs. Dwarfs are of 2 types:
  Libraries: Dense matrices, Sparse matrices, Spectral, Combinational, Finite state machines
  Patterns/Frameworks: MapReduce; Graph traversal, graphical models; Dynamic programming; Backtracking/B&B; N-Body; (Un)Structured Grid
Algorithms in the dwarfs can be implemented either as:
  Compact parallel computations within a traditional library
  A compute/communicate pattern implemented as a framework
Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a Map-Reduce framework, mapping 1D FFTs and then transposing (a generalized reduce).
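To make that last point concrete, here is a minimal sketch (my own, not Par Lab code) of a 2-D transform built by instantiating a generic "map over rows" framework with a serial 1-D kernel and transposing as the generalized reduce; map_rows, dft_1d, fft_2d, and the naive DFT standing in for a tuned 1-D FFT are all illustrative.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <thread>
#include <vector>

using cd = std::complex<double>;
using Matrix = std::vector<std::vector<cd>>;

// Serial plug-in: a naive 1-D DFT in place of a tuned 1-D FFT library routine.
std::vector<cd> dft_1d(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, -2.0 * pi * double(k * j) / double(n));
    return y;
}

// Framework piece: apply a serial kernel to every row, one task per row.
template <typename Kernel>
void map_rows(Matrix& m, Kernel kern) {
    std::vector<std::thread> workers;
    for (auto& row : m)
        workers.emplace_back([&row, kern] { row = kern(row); });
    for (auto& w : workers) w.join();
}

// "Generalized reduce": transpose so the next map works along the other dimension.
Matrix transpose(const Matrix& m) {
    Matrix t(m.at(0).size(), std::vector<cd>(m.size()));
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            t[j][i] = m[i][j];
    return t;
}

// 2-D transform = map 1-D transforms over rows, transpose, map again, transpose.
Matrix fft_2d(Matrix m) {
    map_rows(m, dft_1d);
    m = transpose(m);
    map_rows(m, dft_1d);
    return transpose(m);
}
```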

3  Composing dwarfs to build apps
Any parallel application of arbitrary complexity may be built by composing parallel and serial components:
  Parallel patterns with serial plug-ins, e.g., MapReduce
  Serial code invoking parallel libraries, e.g., FFT, matrix ops., …
Composition is hierarchical.

4  Programming the HW
2 types of programmers / 2 layers: “the right tool for the right time.”
Productivity Layer (90% of programmers)
  Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  Frameworks & libraries are composed using the C&C (Coordination & Composition) Language to provide app frameworks
Efficiency Layer (10% of programmers)
  Expert programmers build:
    Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
    Libraries: software that supports compact computational expressions, e.g., Sketch for Combinational or Grid computation
  “Bare metal” efficiency is possible at the Efficiency Layer
Effective composition techniques allow the efficiency programmers to be highly leveraged.
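As a hedged illustration of the two layers (not code from the slides): an efficiency-layer structured-grid library could encapsulate the threaded traversal once, so a productivity-layer user only supplies a serial per-point stencil. apply_stencil, jacobi_sweep, and the fixed thread count are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Efficiency layer: the expert writes the threaded traversal once.
template <typename Stencil>
void apply_stencil(const Grid& in, Grid& out, Stencil f, unsigned nthreads = 4) {
    const std::size_t n = in.size();
    if (n < 3) return;                       // nothing interior to update
    auto work = [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = 1; j + 1 < in[i].size(); ++j)
                out[i][j] = f(in, i, j);     // serial plug-in from the productivity layer
    };
    std::vector<std::thread> pool;
    const std::size_t chunk = (n - 2 + nthreads - 1) / nthreads;
    for (std::size_t lo = 1; lo + 1 < n; lo += chunk)
        pool.emplace_back(work, lo, std::min(lo + chunk, n - 1));
    for (auto& t : pool) t.join();
}

// Productivity layer: a 5-point Jacobi sweep is just one serial lambda.
void jacobi_sweep(const Grid& a, Grid& b) {
    apply_stencil(a, b, [](const Grid& g, std::size_t i, std::size_t j) {
        return 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
    });
}
```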

5  Coordination & Composition in CBIR Application
Parallelism in CBIR (content-based image retrieval) is hierarchical: mostly independent tasks/data, with combining.
[Dataflow diagram: an input stream of images is processed stream-parallel over images; feature extraction is task-parallel over extraction algorithms (DCT extractor, DWT, face recognition, …); the DCT extractor is data-parallel, mapping DCT over tiles and combining via a reduction on the histograms from each tile into one histogram (feature vector); per-image feature vectors are concatenated into an output stream of feature vectors.]
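A rough sketch of how those three levels might compose in code, assuming a C++ futures runtime; all names (split_into_tiles, dct_histogram, dwt_features, face_features, extract_features) and the stub kernel bodies are illustrative, not the actual CBIR implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

using Image      = std::vector<double>;   // stand-in pixel buffer
using Tile       = std::vector<double>;
using Histogram  = std::vector<double>;
using FeatureVec = std::vector<double>;

// Placeholder kernels so the sketch compiles; the real CBIR kernels go here.
std::vector<Tile> split_into_tiles(const Image& img) {
    std::vector<Tile> tiles;
    for (std::size_t i = 0; i < img.size(); i += 64)
        tiles.emplace_back(img.begin() + i, img.begin() + std::min(img.size(), i + 64));
    return tiles;
}
Histogram  dct_histogram(const Tile& t)    { return {std::accumulate(t.begin(), t.end(), 0.0)}; }
FeatureVec dwt_features(const Image& img)  { return {double(img.size())}; }
FeatureVec face_features(const Image& img) { return {img.empty() ? 0.0 : img.front()}; }

// Data-parallel level: map DCT over tiles, then reduce the per-tile histograms
// into one histogram (the DCT feature vector).
FeatureVec dct_features(const Image& img) {
    std::vector<std::future<Histogram>> parts;
    for (const Tile& t : split_into_tiles(img))
        parts.push_back(std::async(std::launch::async, dct_histogram, t));
    Histogram combined;
    for (auto& p : parts) {
        Histogram h = p.get();
        if (combined.size() < h.size()) combined.resize(h.size(), 0.0);
        for (std::size_t i = 0; i < h.size(); ++i) combined[i] += h[i];
    }
    return combined;
}

// Task-parallel level: run the extraction algorithms concurrently for one image,
// then concatenate their feature vectors.
FeatureVec extract_features(const Image& img) {
    auto f1 = std::async(std::launch::async, dct_features, std::cref(img));
    auto f2 = std::async(std::launch::async, dwt_features, std::cref(img));
    auto f3 = std::async(std::launch::async, face_features, std::cref(img));
    FeatureVec out;
    for (std::future<FeatureVec>* f : {&f1, &f2, &f3}) {
        FeatureVec v = f->get();
        out.insert(out.end(), v.begin(), v.end());
    }
    return out;
}

// Stream-parallel level (outermost): apply extract_features to each image of the
// input stream, e.g., one in-flight task per image, emitting feature vectors.
```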

6  Coordination & Composition Language
A Coordination & Composition language for productivity. 2 key challenges:
  1. Correctness: ensuring independence using decomposition operators, copying, and requirements specifications on frameworks
  2. Efficiency: resource management during composition; domain-specific OS/runtime support
Language control features hide core resources, e.g.:
  “Map DCT over tiles” in the language becomes a set of DCTs/tiles per core
  Hierarchical parallelism is managed using OS mechanisms
Data structures hide memory structures:
  Partitioners on arrays, graphs, and trees produce independent data
  Framework interfaces give independence requirements: e.g., a map-reduce function must be independent, either by copying or by application to a partitioned data object (a set of tiles from the partitioner)
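A minimal sketch of the partitioner idea, assuming C++20 and illustrative names (partition, map_tiles), not the C&C language itself: the partitioner hands out disjoint tile views, and the map only ever passes one tile to the user function, so independence holds by construction.

```cpp
#include <algorithm>
#include <cstddef>
#include <span>      // C++20
#include <thread>
#include <vector>

// Partitioner: non-overlapping views of the underlying array.
std::vector<std::span<double>> partition(std::vector<double>& a, std::size_t tile) {
    std::vector<std::span<double>> tiles;
    std::span<double> whole(a);
    for (std::size_t i = 0; i < a.size(); i += tile)
        tiles.push_back(whole.subspan(i, std::min(tile, a.size() - i)));
    return tiles;
}

// Map over partitioned data: each task sees only its own tile.
template <typename F>
void map_tiles(const std::vector<std::span<double>>& tiles, F f) {
    std::vector<std::thread> pool;
    for (std::span<double> t : tiles)
        pool.emplace_back([t, f] { f(t); });
    for (auto& th : pool) th.join();
}

// Usage: scale every tile in place; no two tasks touch the same element.
// map_tiles(partition(data, 1024), [](std::span<double> t) { for (double& x : t) x *= 2.0; });
```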

7  How do we program the HW? What are the problems?
For parallelism to succeed, we must provide productivity, efficiency, and correctness simultaneously:
  Can’t make SW productivity even worse!
  Why do it in parallel if efficiency doesn’t matter?
  Correctness is usually considered an orthogonal problem
  Productivity slows if code is incorrect or inefficient
  Correctness and efficiency slow if programming is unproductive
Most programmers are not ready for parallel programming:
  IBM SP customer escalations: concurrency bugs are the worst, and can take months to fix
  How do we make ≈90% of today’s programmers productive on parallel computers?
  How do we make code written by ≈90% of programmers efficient?

8  Ensuring Correctness
Productivity Layer: enforce independence of tasks using decomposition (partitioning) and copying operators.
  Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
  E.g., the race-free program “atomic delete” + “atomic insert” does not compose to an “atomic replace”; we need higher-level properties, rather than just locks or transactions
Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, etc.).
  Mixture of verification and automated directed testing
  Error detection on frameworks and libraries; some techniques are applicable to third-party software
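A hedged illustration of the composition problem named above (the class and method names are mine): erase and insert are each atomic on their own, yet a replace built from them is not, because a concurrent reader can observe the key as missing in between.

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <string>

class ConcurrentMap {
    std::map<int, std::string> data_;
    std::mutex m_;
public:
    // Each of these is atomic on its own (one critical section each).
    void erase_atomic(int k)                 { std::lock_guard<std::mutex> g(m_); data_.erase(k); }
    void insert_atomic(int k, std::string v) { std::lock_guard<std::mutex> g(m_); data_[k] = std::move(v); }
    std::optional<std::string> find(int k) {
        std::lock_guard<std::mutex> g(m_);
        auto it = data_.find(k);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }

    // NOT atomic as a whole: between the two calls the key is absent, so a
    // concurrent find(k) can report "missing" even though only a replace is in flight.
    void replace_broken(int k, std::string v) { erase_atomic(k); insert_atomic(k, std::move(v)); }

    // The higher-level property (atomic replace) needs its own critical section.
    void replace_atomic(int k, std::string v) { std::lock_guard<std::mutex> g(m_); data_[k] = std::move(v); }
};
```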

9  Support Software: What are the problems?
Compilers and operating systems are large, complex, and resistant to innovation:
  It takes a decade for compiler innovations to show up in production compilers?
  How long for an idea in SOSP to appear in a production OS?
Traditional OSes are brittle, insecure, memory hogs:
  A traditional monolithic OS image uses lots of precious memory, replicated 100s–1000s of times (e.g., AIX uses GBs of DRAM / CPU)

10  21st Century Code Generation
[Figure: search space for matmul block sizes; axes are the block dimensions, color (temperature) indicates speed.]
Problem: generating optimal code is like searching for a needle in a haystack.
New approach: “auto-tuners” first run variations of the program on the computer to heuristically search for the best combinations of optimizations (blocking, padding, …) and data structures, then produce C code to be compiled for that computer.
  E.g., PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFTW
  Can achieve 10X over a conventional compiler
Example: Sparse Matrix * Vector (SpMV) for 3 multicores
  Fastest SpMV: 2X OSKI/PETSc on Clovertown, 4X on Opteron
  Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO vs. BCSR data structures, 16b vs. 32b indices, …
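A toy sketch of the auto-tuning loop, not PHiPAC/Atlas/OSKI code: time a blocked matmul for several candidate block sizes on the machine at hand and keep the fastest, which is the parameter a real auto-tuner would bake into its generated C. The matrix size, the candidate set, and the function names are assumptions of the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Blocked matmul: C += A*B for n x n matrices stored row-major, block size bs.
void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int n, int bs) {
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                for (int i = ii; i < std::min(ii + bs, n); ++i)
                    for (int k = kk; k < std::min(kk + bs, n); ++k)
                        for (int j = jj; j < std::min(jj + bs, n); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 256;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0);
    int best_bs = 0;
    double best_s = 1e30;
    for (int bs : {8, 16, 32, 64, 128}) {          // the (tiny) search space
        std::vector<double> C(n * n, 0.0);
        auto t0 = std::chrono::steady_clock::now();
        matmul_blocked(A, B, C, n, bs);
        double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        std::printf("block %3d: %.3f s\n", bs, s);
        if (s < best_s) { best_s = s; best_bs = bs; }
    }
    std::printf("best block size on this machine: %d\n", best_bs);
}
```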

11–13  Example: Sparse Matrix * Vector
(Slides 11–13 are successive builds of the same table; the final values are shown.)

                        Clovertown                 Opteron                    Cell
Chips*Cores             2*4 = 8                    2*2 = 4                    1*8 = 8
Architecture            4-/3-issue, 2-/1-SSE3, OOO, caches, prefetch          2-VLIW, SIMD, local store, DMA
Clock Rate              2.3 GHz                    2.2 GHz                    3.2 GHz
Peak MemBW              21.3 GB/s
Peak GFLOPS             74.6 GF                    17.6 GF                    14.6 GF (DP Fl. Pt.)
Naïve SpMV (median
 of many matrices)      1.0 GF                     0.6 GF                     --
Efficiency %            1%                         3%                         --
Autotuned SpMV          1.5 GF                     1.9 GF                     3.4 GF
Auto Speedup            1.5X                       3.2X                       ∞
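For reference, a hedged sketch of the kind of naïve kernel in the table: SpMV over a CSR matrix (y = A*x). The struct layout and names are illustrative; the autotuned versions replace this loop nest with register-blocked (BCSR) storage, cache/TLB blocking, prefetching, and so on.

```cpp
#include <cstddef>
#include <vector>

struct CSRMatrix {
    std::size_t nrows;
    std::vector<double> vals;       // nonzero values
    std::vector<int>    cols;       // column index of each nonzero
    std::vector<int>    row_ptr;    // start of each row in vals/cols (size nrows+1)
};

// y = A*x; y is assumed to be pre-sized to A.nrows.
void spmv_csr(const CSRMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.vals[k] * x[A.cols[k]];
        y[i] = sum;
    }
}
```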

14  Greater productivity and efficiency for SpMV?
Two stacks compared:
  Parallelizing compiler + multicore + caches + prefetching
  Autotuner + multicore + local store + DMA
Originally, caches were added to improve programmer productivity; that is not always the case for manycore + autotuner:
  It is easier to autotune a single local store + DMA than multilevel caches + HW and SW prefetching.

15  Deconstructing Operating Systems
Resurgence of interest in virtual machines:
  The VM monitor is a thin SW layer between the guest OS and the HW
Future OS: libraries where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources.
Partitioning support for very thin hypervisors, and to allow software full access to the hardware within a partition.