Teaching Parallelism Panel, SPAA'11. Uzi Vishkin, University of Maryland.

Dream opportunity
Limited interest in parallel computing evolved into a quest for general-purpose parallel computing in mainstream computers. Alas:
- Only heroic programmers can exploit the vast parallelism in today's mainstream computers.
- Rejection of their parallel programming by most programmers: all but certain.
- Widespread working assumption: programming models for larger-scale and mainstream systems are similar. Not so in serial days.
- Parallel computing is plagued with programming difficulties. [Build-first, figure-out-how-to-program-later → fitting parallel languages to these arbitrary architectures → standardization of the language fits → doomed later parallel architectures.]
- Working assumption → import parallel computing's ills to the mainstream.
Shock and awe example: inflict the first parallel-programming trauma ASAP. Start a parallel programming course with a tile-based parallel algorithm for matrix multiplication: how many tiles does it take to fit 1000x1000 matrices in the cache of a modern PC? Teach later: OK.
Missing: many-core understanding. Needed: comparison of many-core platforms for ease-of-programming and for achieving hard speedups.
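A minimal sketch (not from the slides) of the tiled matrix multiplication the example alludes to, in plain C. The 32 KB L1 cache size, the tile width T = 32, and the function name matmul_tiled are illustrative assumptions. Three T x T tiles of doubles must be cache-resident at once, so 3*T*T*8 <= 32768 gives T <= 36; at T = 32, a 1000x1000 matrix is covered by ceil(1000/32)^2 = 1024 tiles.

    /* Tiled (blocked) matrix multiply: C = A*B, N x N doubles.
       Tile size chosen so three T x T tiles fit in a 32 KB L1 cache:
       3 * T*T * sizeof(double) <= 32768  =>  T <= 36; use T = 32.
       For N = 1000, that is ceil(1000/32)^2 = 1024 tiles per matrix. */
    #include <stddef.h>

    #define N 1000
    #define T 32

    void matmul_tiled(const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < N * (size_t)N; i++)
            C[i] = 0.0;
        for (size_t ii = 0; ii < N; ii += T)            /* tile row    */
            for (size_t kk = 0; kk < N; kk += T)        /* tile inner  */
                for (size_t jj = 0; jj < N; jj += T)    /* tile column */
                    /* multiply one pair of tiles; the && i < N tests
                       handle the ragged edge since T does not divide N */
                    for (size_t i = ii; i < ii + T && i < N; i++)
                        for (size_t k = kk; k < kk + T && k < N; k++)
                            for (size_t j = jj; j < jj + T && j < N; j++)
                                C[i * N + j] += A[i * N + k] * B[k * N + j];
    }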

Summary of my thoughts
1. In class: the parallel PRAM algorithmic theory.
- Second in magnitude only to the serial algorithmic theory.
- Won the "battle of ideas" in the 1980s. Repeatedly challenged without success → no real alternative! Or is this another case of "the older we get, the better we were"?
2. Parallel programming experience for concreteness: in homework, extensive programming assignments. [The XMT HW/SW/algorithms solution makes programming for locality a 2nd-order consideration.] Must be trauma-free, providing hard speedups over the best serial code.
3. Tread carefully. Consider non-parallel-computing colleague instructors; our line of credit is limited. Future change is certain, and pushing now may backfire when their cooperation is needed later.

Parallel Random-Access Machine/Model
PRAM: n synchronous processors, all having unit-time access to a shared memory.
Reactions [important to convey plurality, plus coherent approach(es)]: you have got to be kidding, this is way:
- Too easy.
- Too difficult: why even mention processors? What to do with n processors? How to allocate processors to instructions?

Immediate Concurrent Execution (ICE)
The 'work-depth framework' [SV82], adopted in the parallel algorithms texts [J92, KKT01]. ICE is the basis for architecture specs: Vishkin, Using Simple Abstraction to Reinvent Computing for Parallelism, CACM 1/2011. [Similar to the role of the stored program and program counter in architecture specs for serial computing.]
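To make the abstraction concrete, here is the classic balanced-tree summation in the work-depth style; it is not from the slides, and the serial C below only simulates the parallel rounds. Each iteration of the outer loop is conceptually one "for all i pardo" step, so the algorithm has O(n) work and O(log n) depth, and no processor allocation appears anywhere.

    /* Work-depth view of summing n = 2^k numbers: log2(n) rounds,
       each round conceptually a single "for all i pardo" step.
       Work O(n), depth O(log n). */
    #include <stdio.h>

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int n = 8;
        for (int h = 1; h < n; h *= 2)              /* one round per h  */
            for (int i = 0; i + h < n; i += 2 * h)  /* "pardo" over i   */
                a[i] += a[i + h];                   /* pairwise combine */
        printf("sum = %g\n", a[0]);                 /* prints 36        */
        return 0;
    }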

Algorithms-aware many-core is feasible
Algorithms → programming, via a programmer's workflow and a rudimentary yet stable compiler.
PRAM-on-chip HW prototypes [SPAA98..CF08]:
- 64-core, 75 MHz FPGA prototype of XMT.
- Toolchain: compiler + simulator [HIPS].
- Core interconnection network, IBM 90nm: 9mm x 5mm [HotI07].
- FPGA design → ASIC, IBM 90nm: 10mm x 10mm, 150 MHz.
- The architecture scales to far larger on-chip core counts.
XMT homepage (or search: 'XMT'): considerable material and suggestions for teaching, including class notes, the tool chain, lecture videos, and programming assignments.

Elements in my education platform
- Identify 'thinking in parallel' with the basic abstraction behind the SV82b work-depth framework. Note: this is also the presentation framework in the PRAM texts [J92, KKT01].
- Teach as much PRAM algorithmics as the timing and developmental stage of the students permit. Extensive 'dry' theory homework: required from graduate students, little from high-school students.
- Students self-study programming in XMTC (standard C plus 2 commands, spawn and prefix-sum) and do demanding programming assignments; a sketch of the flavor follows this slide.
- Provide a programmer's workflow that links the simple PRAM abstraction with XMTC (even tuned) programming. The synchronous PRAM provides ease of algorithm design and of reasoning about correctness and complexity; multi-threaded programming relaxes this synchrony for implementation. Since reasoning directly about the soundness and performance of multi-threaded code is known to be error-prone, the workflow tasks the programmer only with establishing that the code's behavior matches the PRAM-like algorithm.
- Unlike the PRAM, XMTC incorporates locality, but as a 2nd-order consideration. Unlike many approaches, XMTC preempts the harm that locality concerns do to programmer productivity.
- If the XMT architecture is presented at all: only at the end of the course. Otherwise parallel programming appears more difficult than serial programming, which does not require teaching architecture.
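A minimal XMTC sketch (not from the slides) of the kind of exercise described: array compaction, modeled on published XMT teaching examples. Here spawn(0, n-1) launches one virtual thread per index, $ is the thread ID, and ps(e, x) is the atomic prefix-sum command; exact syntax can vary across toolchain versions, and A, B, n are assumed to be declared elsewhere.

    /* XMTC sketch: compact the nonzero elements of A[0..n-1] into B.
       spawn(0, n-1) starts n virtual threads; $ is this thread's ID.
       ps(e, x) atomically adds e to x and returns the old value of x
       in e, so each nonzero element gets a unique slot in B. */
    int x = 0;                 /* shared prefix-sum base        */
    spawn(0, n - 1) {
        int e = 1;             /* thread-local increment        */
        if (A[$] != 0) {
            ps(e, x);          /* e gets a unique index into B  */
            B[e] = A[$];
        }
    }
    /* After the spawn: x holds the number of nonzeros copied. */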

Anecdotal validation (?)
Breadth-first search (BFS) example; 42 students, joint UIUC/UMD course:
- Under 1x speedup using OpenMP on an 8-processor SMP.
- 7x-25x speedups on the 64-processor XMT FPGA prototype [built at UMD].
What is the big deal about 64 processors beating 8? The silicon area of 64 XMT processors is roughly that of 1-2 SMP processors.
Questionnaire, ranking approaches for achieving (hard) speedups: all students but one put XMTC ahead of OpenMP.
Order-of-magnitude teachability/learnability (MS, HS & up, SIGCSE'10).
SPAA'11: >100x speedup on max-flow, versus 2.5x on a GPU (IPDPS'10).
Fleck/Kuhn: research too esoteric to be reliable → seek exoteric validation!
Reward alert: try publishing a paper boasting easy-to-obtain results. Ease-of-programming: 1. badly needed, yet 2. a lose-lose proposition.

Where to find a machine that effectively supports such parallel algorithms?
Parallel algorithms researchers realized decades ago that the main reason parallel machines are difficult to program is that the bandwidth between processors and memories is limited. Lower bounds: [VW85, MNV94].
[BMM94]: 1. HW vendors see the cost benefit of lowering the performance of interconnects, but grossly underestimate the programming difficulties and the high software development costs implied. 2. Their exclusive focus on runtime benchmarks misses critical costs, including (i) the time to write the code, and (ii) the time to port the code to a different distribution of data, or to different machines that require a different distribution of data.
G. Blelloch, B. Maggs & G. Miller. The hidden cost of low bandwidth communication. In Developing a CS Agenda for HPC (Ed. U. Vishkin). ACM Press, 1994.
Patterson, CACM 2004: latency lags bandwidth. HP12: as latency improved by 30-80x, bandwidth improved by 10-25,000x → is this not great news: the cost benefit of low bandwidth is drastically decreasing.
Not so fast. X86Gen senior engineer, 1/2011: "Okay, you do have a 'convenient' way to do parallel programming; so what's the big deal?!" Commodity HW → decomposition-first programming doctrine → heroic programmers → sigh...
Has the 'bandwidth → ease-of-programming' opportunity been lost?

Sociologists of science
Debates between adherents of different thought styles consist almost entirely of misunderstandings. Members of both parties are talking of different things (though they are usually under the illusion that they are talking about the same thing). They are applying different methods and criteria of correctness (although they are usually under the illusion that their arguments are universally valid, and that if their opponents do not want to accept them, then they are either stupid or malicious).

Comment on the need for breadth of knowledge
Where are your specs? One example: what is your parallel-algorithms abstraction? 'First specs, then build' is "not uncommon"... for engineering.
Two options for architects with respect to the example:
A. 1. Learn parallel algorithms. 2. Develop an abstraction that meets ease-of-programming. 3. Develop specs. 4. Build.
B. Start from an abstraction with proven ease-of-programming.
It is similarly important for algorithms people to learn architecture and applications.