Algorithms-based extension of serial computing education to parallelism Uzi Vishkin - Using Simple Abstraction to Reinvent Computing for Parallelism, CACM, January 2011, pp
Commodity computer systems
If you want your program to run significantly faster … you’re going to have to parallelize it. Parallelism: only game in town.
But what about the programmer?
“The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle” (D. Patterson, IEEE Spectrum, 7/2010).
Only heroic programmers can exploit the vast parallelism in current machines (report by CSTB, U.S. National Academies, 12/2010).
A San Antonio spin: where would Mr. Maverick be on this issue? Conform with things that do not really work?!
Parallel Random-Access Machine/Model
PRAM: n synchronous processors, all having unit-time access to a shared memory. Each processor also has a local memory. At each time unit, a processor can:
1. write into the shared memory (i.e., copy one of its local memory registers into a shared memory cell),
2. read from the shared memory (i.e., copy a shared memory cell into one of its local memory registers), or
3. do some computation with respect to its local memory.
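The synchrony in the PRAM definition above can be illustrated with a small serial simulation. This is only a sketch: the function name `pram_step` and the chosen computation (each processor adds its left neighbour's cell to its own) are assumptions made for illustration, not part of the talk. The point is that all reads of a round complete before any write, so every processor sees the old contents of shared memory.

```python
def pram_step(shared, n):
    """Simulate one synchronous round of n PRAM processors.

    Processor i reads shared[i-1] and shared[i], adds them locally,
    and writes the result back to shared[i] (hypothetical example task).
    """
    # Read time units: every processor copies shared cells into local registers.
    local = [(shared[i - 1] if i > 0 else 0, shared[i]) for i in range(n)]
    # Compute time unit: each processor works only on its local memory.
    results = [left + own for left, own in local]
    # Write time unit: each processor writes its own cell (no write conflicts).
    for i in range(n):
        shared[i] = results[i]
    return shared


print(pram_step([1, 2, 3, 4], 4))  # [1, 3, 5, 7]
```

A serial in-place loop over the same array would instead produce running sums, because later iterations would read cells already overwritten; the synchronous model rules that out.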
So, an algorithm in the PRAM model is presented as a sequence of parallel time units (or “rounds”, or “pulses”). We allow p instructions to be performed at each time unit, one per processor; that is, a time unit consists of a sequence of exactly p instructions to be performed concurrently.
SV-MaxFlow-82: way too difficult. Contrast, e.g., TCPP 12/2010: simplest parallel model.
2 drawbacks to PRAM mode:
(i) Does not reveal how the algorithm will run on PRAMs with a different number of processors; e.g., to what extent will more processors speed up the computation, or fewer processors slow it down?
(ii) Fully specifying the allocation of instructions to processors requires a level of detail which might be unnecessary (e.g., a compiler may be able to extract it from lesser detail).
1st round of discounts..
Work-Depth presentation of algorithms
Work-Depth algorithms are also presented as a sequence of parallel time units (or “rounds”, or “pulses”); however, each time unit consists of a sequence of instructions to be performed concurrently, and that sequence may include any number of instructions (not necessarily p).
Why is this enough? See J-92, KKT01, or my classnotes.
SV-MaxFlow-82: still way too difficult.
Drawback to WD mode: fully specifying the serial number of each instruction requires a level of detail that may be added later.
2nd round of discounts..
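As a rough illustration of why a Work-Depth description suffices, here is a sketch using balanced-tree summation (an example chosen for this note, not taken from the slides). It records how many instructions each round contains, then estimates the number of time units on a p-processor PRAM by running each round in ceil(w_i / p) steps. This is the standard scheduling argument, which gives at most W/p + D time units for work W and depth D. The names `wd_sum` and `pram_time` are hypothetical.

```python
import math


def wd_sum(values):
    """Balanced-tree summation of a non-empty sequence.

    Returns (total, rounds), where rounds[i] is the number of additions
    that the Work-Depth description performs concurrently in round i.
    """
    level = list(values)
    rounds = []
    while len(level) > 1:
        pairs = len(level) // 2
        # All additions of a round are independent of one another.
        nxt = [level[2 * i] + level[2 * i + 1] for i in range(pairs)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next round
        rounds.append(pairs)
        level = nxt
    return level[0], rounds


def pram_time(rounds, p):
    """Time units on p processors when each round is run in ceil(w/p) steps."""
    return sum(math.ceil(w / p) for w in rounds)


total, rounds = wd_sum(range(1, 17))
work, depth = sum(rounds), len(rounds)
print(total, work, depth)  # 136 15 4
for p in (1, 4, 16):
    print(p, pram_time(rounds, p), work / p + depth)
```

The loop at the end shows the same description evaluated for different processor counts, which is the information a fixed-p PRAM presentation does not reveal.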
Informal Work-Depth (IWD) description
Similar to Work-Depth, the algorithm is presented as a sequence of parallel time units (or “rounds”); however, at each time unit there is a set containing a number of instructions to be performed concurrently. ‘ICE’
Descriptions of the set of concurrent instructions can come in many flavors, even implicit ones where the number of instructions is not obvious.
The main methodical issue addressed here is how to train CS&E professionals “to think in parallel”. Here is the informal answer: train yourself to provide IWD descriptions of parallel algorithms. The rest is detail (although important) that can be acquired as a skill, by training (perhaps with tools).
Why is this enough for PRAM? See J-92, KKT01, or my classnotes.
Example of Parallel ‘PRAM-like’ Algorithm
Input: (i) All world airports. (ii) For each, all its non-stop flights.
Find: smallest number of flights from DCA to every other airport.
Basic (actually parallel) algorithm
Step i: For all airports requiring i-1 flights
  For all its outgoing flights
    Mark (concurrently!) all “yet unvisited” airports as requiring i flights (note nesting)
Serial: forces queue; O(T) time; T = total # of flights.
Parallel: parallel data-structures. Inherent serialization: S. Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.
Note: (i) “Concurrently” as in natural BFS: only change to serial algorithm. (ii) No “decomposition”/“partition”.
Mental effort of PRAM-like programming: 1. sometimes easier than serial; 2. considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of other approaches.
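The airport example can be sketched as a level-synchronous breadth-first search. The code below runs serially, with comments marking the two nested loops that the slide treats as concurrent; the flight table and the function name `fewest_flights` are made up for illustration. Each pass of the outer `while` loop corresponds to one "Step i".

```python
def fewest_flights(flights, source):
    """Level-by-level BFS: dist[a] = smallest number of flights from source to a.

    flights maps an airport to the list of airports it has non-stop flights to.
    """
    dist = {source: 0}
    frontier = [source]  # airports requiring i-1 flights
    i = 0
    while frontier:
        i += 1
        reached = set()
        for airport in frontier:                # conceptually: all concurrently
            for dest in flights.get(airport, ()):   # nested: also concurrently
                if dest not in dist:            # "yet unvisited"
                    reached.add(dest)
        for dest in reached:                    # mark as requiring i flights
            dist[dest] = i
        frontier = list(reached)
    return dist


# Hypothetical flight table, for illustration only.
flights = {
    "DCA": ["ORD", "ATL"],
    "ORD": ["SFO"],
    "ATL": ["SFO", "DCA"],
    "SFO": [],
}
print(fewest_flights(flights, "DCA"))
```

Note that no queue and no partitioning of the graph appear here: the only structure is the set of airports reached in the current step, which matches the slide's remark that "concurrently" is the only change relative to the natural serial description.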
Elements in my education platform
Identify ‘thinking in parallel’ with the basic abstraction behind the SV82b work-depth framework. Note: adopted as the presentation framework in PRAM algorithms texts: J92, KKT01.
Teach as many PRAM algorithms as timing and the developmental stage of the students permit; extensive ‘dry’ theory homework is required from graduate students, but little from high-school students.
Students self-study programming in XMTC (standard C plus 2 commands, spawn and prefix-sum) and do demanding programming assignments.
Provide a programmer’s workflow that links the simple PRAM abstraction with XMTC (even tuned) programming. The synchronous PRAM provides ease of algorithm design and of reasoning about correctness and complexity. Multi-threaded programming relaxes this synchrony for implementation. Since reasoning directly about soundness and performance of multi-threaded code is known to be error prone, the workflow only tasks the programmer with establishing that the code behavior matches the PRAM-like algorithm.
Unlike PRAM, XMTC is far from ignoring locality. Unlike most approaches, XMTC preempts harm of locality on programmer’s productivity.
If XMT architecture is presented: only at the end of the course; parallel programming is more difficult than serial programming, which does not require architecture.
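To give a feel for how two commands like spawn and prefix-sum can carry a PRAM-like algorithm, here is a Python emulation of array compaction. It is not XMTC code and does not reproduce XMTC syntax. It assumes, as a labeled assumption, that prefix-sum atomically adds an increment to a shared base and hands the caller the base's previous value, and that spawn starts one virtual thread per index. The class `PsBase` and function `compact` are hypothetical names.

```python
class PsBase:
    """Emulates a shared base variable for a prefix-sum style command.

    Assumption: ps(increment) atomically adds increment to the base and
    returns the value the base held just before the addition.
    """

    def __init__(self, value=0):
        self.value = value

    def ps(self, increment):
        old = self.value
        self.value += increment
        return old


def compact(a):
    """Copy the non-zero elements of a into a dense output array."""
    b = [None] * len(a)
    base = PsBase(0)
    # Stands in for a spawn over thread IDs 0..n-1; emulated serially here.
    for tid in range(len(a)):
        if a[tid] != 0:
            slot = base.ps(1)  # each thread obtains a unique output slot
            b[slot] = a[tid]
    return b[:base.value]


print(sorted(compact([0, 5, 0, 7, 3])))  # [3, 5, 7]
```

In a truly concurrent run the threads could obtain their slots in any order, so only the set of output elements is determined, not their order; the serial emulation happens to preserve input order.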
Where to find a machine that effectively supports such parallel algorithms?
Parallel algorithms researchers realized decades ago that the main reason that parallel machines are difficult to program has been that bandwidth between processors/memories is limited. Lower bounds [VW85, MNV94].
[BMM94]:
1. HW vendors see the cost benefit of lowering performance of interconnects, but grossly underestimate the programming difficulties and the high software development costs implied.
2. Their exclusive focus on runtime benchmarks misses critical costs, including: (i) the time to write the code, and (ii) the time to port the code to a different distribution of data or to different machines that require a different distribution of data.
G. Blelloch, B. Maggs & G. Miller. The hidden cost of low bandwidth communication. In Developing a CS Agenda for HPC (Ed. U. Vishkin). ACM Press, 1994.
Patterson, CACM04: Latency Lags Bandwidth. HP12: as latency improved by 30-80X, bandwidth improved by 10-25KX.
Isn’t this great news: the cost benefit of low bandwidth is drastically decreasing?
Not so fast. Senior HW Eng, 1/2011: Okay, you do have a ‘convenient’ way to do parallel programming; so what’s the big deal?!
Commodity HW Decomposition-first programming doctrine heroic programmers sigh …
Has the ‘bw ease-of-programming opportunity’ been lost? Do we sugarcoat a salty cake instead of ‘return to baker/store’?
Suggested answers in this talk (soft, more like BMM)
1. Fault line. One side: commodity HW. Other side: this ‘convenient way’.
2. ‘Life’ across the fault line: so, what’s the point of heroic programmers?!
3. ‘Every CS major could program’: ‘no way’ vs promising evidence.
4. Sooner or later, system vendors will see the connection to their bottom line and abandon directions perceived today as hedging one’s bets.
The fault line: Is PRAM too easy or too difficult? BFS example
BFS in TCPP curriculum, 12/2010. But:
1. XMT/GPU speed-ups: same silicon area, highly parallel input: 5.4X! Small HW configuration, 20-way parallel input: 109X wrt same GPU. Note: BFS on GPUs: research papers; but PRAM version: too easy for a paper. Makes one wonder: why work so hard on a GPU?
2. BFS using OpenMP. Good news: easy coding (since no meaningful decomposition). Bad news: none of the 42 students in joint F2010 UIUC/UMD got any speedups (over serial) on an 8-processor SMP machine. So, not only was PRAM too easy: no speedups. Also BFS… Speedups on a 64-processor XMT, using <= 1/4 of the silicon area of the SMP machine, ranged between 7x and 25x. The ‘PRAM is too difficult’ approach worked.
Makes one wonder. BFS is unavoidable. Can we (professionals/instructors) really defend teaching/using OpenMP for it? Any other commercial approach?
Chronology around fault line
Too easy: ‘Paracomputer’ Schwartz80; BSP Valiant90; LogP UC-Berkeley93; Map-Reduce (success; not manycore); CLRS-09, 3rd edition; TCPP curriculum 2010; nearly all parallel machines to date (“.. machines that most programmers cannot handle”, “Only heroic programmers”).
Too difficult: SV-82 and V-Thesis81; PRAM theory (in effect); CLR-90, 1st edition; J-92; NESL; KKT-01; XMT97+ (supports the rich PRAM algorithms literature); V-11.
Just right: PRAM model FW77.
Nested parallelism: issue for both; e.g., Cilk.
Current interest: new “computing stacks”: programmer's model, programming languages, compilers, architectures, etc.
Merit of fault-line image: two pillars holding a building (the stack) must be on the same side of a fault line. Chipmakers cannot expect a wealth of algorithms and high programmer’s productivity with architectures for which PRAM is too easy (e.g., ones that force programming for decomposition).
Telling a fault line from the surface
Surface: PRAM too difficult (ICE, WD, PRAM) vs. PRAM too easy (PRAM “simplest model”*, BSP/Cilk *)
Fault line: Effective bandwidth vs. In(e/su)fficient bandwidth
*per TCPP
Old soft claim, e.g., [BMM94]: hidden cost of low bandwidth.
New soft claim: the surface (PRAM easy/difficult) reveals the side WRT the bandwidth fault line.
Ease of Teaching/Learning Benchmark: can any CS major program your manycore? Cannot really avoid it!
Teachability demonstrated so far for XMT [SIGCSE’10]:
- To freshman class with 11 non-CS students. Some prog. assignments: merge-sort*, integer-sort* & sample-sort.
Other teachers:
- Magnet HS teacher. Downloaded simulator, assignments, class notes, from XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, analyze: ability to anticipate performance (as in serial). Can do not just for embarrassingly parallel. Teaches also OpenMP, MPI, CUDA. See also, keynote at + interview with teacher.
- High school & middle school (some 10 year olds) students from underrepresented groups by HS Math teacher.
*Also in Nvidia’s Satish, Harris & Garland IPDPS09
Middle School Summer Camp Class Picture, July’09 (20 of 22 students)
From UIUC/UMD questionnaire
Split between UIUC and UMD students on: did PRAM algorithms help for XMT programming? UMD students: strong yes. Majority of Illinois students: no. Exposure of UIUC students to PRAM algorithms and XMT programming was more limited. This may demonstrate that students must be exposed to a minimal amount of parallel algorithms and their programming in order to internalize their merit.
If this conclusion is valid, it creates tension with:
1. The pressure on instructors of parallel computing courses to cover several programming paradigms along with their required architecture background;
2. The tendency to teach “Parallel computing” as a hodgepodge of topics, jumping from one to the other without teaching anything at any depth, contrary to many other CS courses.
Not just talking
Algorithms: PRAM parallel algorithmic theory. “Natural selection”. Latent, though not widespread, knowledgebase. “Work-depth”. SV82 conjectured: the rest (full PRAM algorithm) is just a matter of skill. Lots of evidence that “work-depth” works. Used as framework in main PRAM algorithms texts: JaJa92, KKT01. Later: programming & workflow.
PRAM-On-Chip HW Prototypes:
- 64-core, 75MHz FPGA of XMT (Explicit Multi-Threaded) architecture SPAA98..CF
- core intercon. network IBM 90nm: 9mmX5mm, 400 MHz [HotI07]. Fund work on asynch NOCS’10
- FPGA design ASIC IBM 90nm: 10mmX10mm, 150 MHz
Rudimentary yet stable compiler. Architecture scales to cores on-chip
But what is the performance penalty for easy programming? Surprise benefit! vs. GPU [HotPar10]
1024-TCU XMT simulations vs. code by others for GTX280. < 1 is slowdown. Sought: similar silicon area & same clock.
Postscript regarding BFS - 59X if average parallelism is X if XMT is … downscaled to 64 TCUs
Problem acronyms
- BFS: Breadth-first search on graphs
- Bprop: Back propagation machine learning alg.
- Conv: Image convolution kernel with separable filter
- Msort: Merge-sort algorithm
- NW: Needleman-Wunsch sequence alignment
- Reduct: Parallel reduction (sum)
- Spmv: Sparse matrix-vector multiplication
New work: Biconnectivity
Not aware of GPU work.
12-processor SMP: < 4X speedups. TarjanV log-time PRAM algorithm; practical version required significant modification. Their 1st try: 12-processor below serial.
XMT: >9X to <42X speedups. TarjanV practical version. More robust for all inputs than BFS, DFS etc.
Significance:
1. Log-time PRAM graph algorithms ahead on speedups.
2. Paper makes a similar case for Shiloach-V log-time connectivity. Beats also GPUs on both speed-up and ease (GPU paper versus grad course programming assignment; even a couple of 10th graders implemented SV).
Even newer result: PRAM max-flow (hybrid ShiloachV & GoldbergTarjan) provides unprecedented speedup