Fifty Years of Parallel Programming: Ieri, Oggi, Domani


1 Fifty Years of Parallel Programming: Ieri, Oggi, Domani (Yesterday, Today, Tomorrow)
Keshav Pingali, The University of Texas at Austin

2 Overview
Parallel programming research started in the mid-60's [Karp and Miller 1966]
Goal:
Productivity for Joe: abstractions that hide the complexity of parallel hardware
Performance from Stephanie: implement those abstractions efficiently
"Scalable" parallel programming: few Stephanies, many Joes
What should these abstractions be, and how are they implemented?
Yesterday: five lessons from the past
Today: an abstraction for parallelism and locality, and its implementation
Tomorrow: some research challenges

3 (1) It’s better to be wrong once in a while than to be right all the time.

4 Impossibility of exploiting ILP [c. 1972]
Flynn bottleneck:
"...Therefore, we must reject the possibility of bypassing conditional jumps as being of substantial help in speeding up execution of programs. In fact, our results seem to indicate that even very large amounts of hardware applied to programs at runtime do not generate hemibel (>3x) improvements in execution speed."
— Riseman and Foster, IEEE Trans. Computers, 1972

5 Exploiting ILP [Fisher c.1982]
Key idea: branch speculation
Back up and re-execute if the prediction is wrong
Dynamic branch prediction [Smith, Patt]
Infallibility is for popes, not parallel computing
Broader lesson:
Runtime parallelization is essential in spite of overhead and wasted work
Compile-time parallelization is only part of the solution to exploiting parallelism (see Itanium)

6 (2) Dependence graphs are not the right foundation for parallel programming.

7 Dependence graphs
Classical abstraction for parallelism [Karp/Miller 66, Dennis 68, Kuck 72]
Nodes: tasks; edges: ordering of tasks
Independent operations: execute in parallel
Dependence-graph-based parallelization:
Static analysis [Kuck 72, Feautrier 92]: stencils, FFT, dense linear algebra
Inspector-executor [Duff/Reid 77, Saltz 90]: sparse linear algebra (a small sketch follows below)
Thread-level speculation [Jefferson 81, Rauchwerger/Padua 95]: executor-inspector
Works well for HPC programs
Gold standard is a sequential program: dependences must be removed or respected by the parallel execution
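To make the inspector-executor idea concrete, here is a minimal sketch in C++. It is not code from the talk or from any particular library, and all names are illustrative: the inspector runs once the sparse index structure `idx` is known and greedily packs iterations that write different locations into rounds; the executor then runs each round in parallel, with a barrier between rounds.

```cpp
#include <cstddef>
#include <thread>
#include <unordered_set>
#include <vector>

// Inspector: runs after the input is known. It groups iterations
// 0..idx.size()-1 into rounds so that no two iterations in the same round
// write the same location idx[i] (greedy first-fit over rounds).
std::vector<std::vector<std::size_t>>
inspect(const std::vector<std::size_t>& idx) {
  std::vector<std::vector<std::size_t>> rounds;
  std::vector<std::unordered_set<std::size_t>> written;  // locations written per round
  for (std::size_t i = 0; i < idx.size(); ++i) {
    std::size_t r = 0;
    while (r < rounds.size() && written[r].count(idx[i])) ++r;
    if (r == rounds.size()) { rounds.emplace_back(); written.emplace_back(); }
    rounds[r].push_back(i);
    written[r].insert(idx[i]);
  }
  return rounds;
}

// Executor: iterations within a round touch distinct locations, so each round
// can run in parallel; a real system would chunk work rather than spawn one
// thread per iteration.
void execute(const std::vector<std::vector<std::size_t>>& rounds,
             const std::vector<std::size_t>& idx,
             const std::vector<double>& src, std::vector<double>& dst) {
  for (const auto& round : rounds) {
    std::vector<std::thread> workers;
    for (std::size_t i : round)
      workers.emplace_back([&, i] { dst[idx[i]] += src[i]; });
    for (auto& t : workers) t.join();  // barrier between rounds
  }
}
```

The key point is the binding time: the schedule is computed after the input is given but before execution, which works only when the access pattern does not change during execution.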

8 Delaunay mesh refinement
Irregular algorithms: more complex behavior than HPC programs
Tasks can generate and kill other tasks
Unordered: tasks can be executed serially in any order even though R/W sets may overlap
Output may be different for different orders, but all are acceptable
Don't-care non-determinism (Dijkstra): arises from under-specification of the execution order
Dependence graphs are not the right abstraction for such algorithms: there is no gold-standard serial order for tasks
What is the right abstraction?
[Figure: Delaunay mesh refinement. Red triangle: a badly shaped triangle; blue triangles: the cavity of the bad triangle]
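To make these properties concrete without building a mesh data structure, here is a small self-contained stand-in example of my own (not from the talk): greedy maximal independent set computed with a worklist. It has the same shape as mesh refinement: processing a node "kills" the tasks for its neighbors, elements may be picked in any order, and different orders can produce different but equally acceptable outputs.

```cpp
#include <cstddef>
#include <deque>
#include <iostream>
#include <vector>

int main() {
  // Undirected graph as adjacency lists: a path 0-1-2-3-4.
  std::vector<std::vector<std::size_t>> adj = {{1}, {0, 2}, {1, 3}, {2, 4}, {3}};
  std::vector<int> state(adj.size(), 0);   // 0 = undecided, 1 = in MIS, -1 = excluded

  std::deque<std::size_t> worklist;        // initially, every node is an active task
  for (std::size_t v = 0; v < adj.size(); ++v) worklist.push_back(v);

  while (!worklist.empty()) {
    std::size_t v = worklist.front();      // any order of picking tasks is acceptable
    worklist.pop_front();
    if (state[v] != 0) continue;           // this task was killed by a neighbor
    state[v] = 1;                          // put v in the independent set
    for (std::size_t u : adj[v]) state[u] = -1;   // exclude (kill) its neighbors
  }

  for (std::size_t v = 0; v < adj.size(); ++v)
    if (state[v] == 1) std::cout << v << ' ';     // prints "0 2 4" for this FIFO order
  std::cout << '\n';
}
```

Picking tasks in a different order (say, starting from node 1) yields a different independent set, yet every order produces a valid answer: don't-care non-determinism in miniature.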

9 (3) Express algorithms using data-centric abstractions.
Parallel program = Operator + Schedule + Parallel data structures

10 von Neumann model (control-centric)
[Figure: a chain of state updates from an initial state to a final state]
Algorithm:
State update (local view): assignment statement
Location (where?): program counter
Schedule (global view)
Ordering (when?): control-flow constructs
von Neumann bottleneck [Backus 79]

11 Operator formulation (data-centric)
[Figure: a graph with active nodes and their neighborhoods highlighted]
Algorithm:
State update (local view): operator
Location (where?): active nodes
Schedule (global view)
Ordering (when?): unordered or ordered
No distinction between sequential/parallel or regular/irregular algorithms
Unifies seemingly different algorithms for the same problem

12 Programming model (Joe)
Set iterator [Schwartz 70]:
    for each e in W: set do B(e)   // B is the operator
Don't-care non-determinism: the implementation is free to iterate over the set in any order
Optional soft priorities on elements (cf. OpenMP)
Captures the "freedom" in unordered algorithms
[Figure: applying the operator B(e) to an element e of the worklist W; consistency vs. isolation of the state]
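A minimal sketch of what the set iterator looks like as code. This is my own generic driver, not the Galois API: the operator may push new elements, and the driver (here sequential and FIFO) is free to pick elements in any order, which is exactly the freedom a parallel implementation can exploit.

```cpp
#include <deque>

// "for each e in W: set do B(e)": apply the operator to every element of the
// worklist; the operator may add new work via the push callback. Which element
// is taken next is unspecified (don't-care non-determinism).
template <typename T, typename Operator>
void for_each_unordered(std::deque<T> worklist, Operator op) {
  while (!worklist.empty()) {
    T e = worklist.front();   // any element of the set may be chosen here
    worklist.pop_front();
    op(e, [&worklist](const T& x) { worklist.push_back(x); });
  }
}
```

A parallel runtime can process several elements at once as long as their neighborhoods do not overlap, which is exactly what the next slide discusses.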

13 Parallel Implementation (Stephanie)
Active nodes with disjoint neighborhoods can be executed in parallel
Ordered algorithms: the algorithmic order must be respected
How should transactional semantics for operators be implemented?
One approach: speculative execution using HTM/STM [Herlihy/Moss]
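One hedged way to picture the unordered case, assuming per-node locks rather than HTM/STM (my sketch, not the Galois implementation): each activity try-locks every node in its neighborhood; if any lock is unavailable, it releases what it holds and retries later, so only activities with disjoint neighborhoods proceed concurrently.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Try to acquire every node of an activity's neighborhood. On conflict,
// release the locks already held and report failure so the activity can be
// retried; on success the activity applies its operator and then calls release().
bool try_acquire(std::vector<std::mutex>& locks,
                 const std::vector<std::size_t>& nbhd) {
  std::size_t held = 0;
  for (; held < nbhd.size(); ++held)
    if (!locks[nbhd[held]].try_lock()) break;
  if (held == nbhd.size()) return true;       // whole neighborhood acquired
  for (std::size_t i = 0; i < held; ++i)      // conflict: roll back the locks
    locks[nbhd[i]].unlock();
  return false;
}

void release(std::vector<std::mutex>& locks,
             const std::vector<std::size_t>& nbhd) {
  for (std::size_t n : nbhd) locks[n].unlock();
}
```

Ordered algorithms and livelock avoidance need more machinery than this, which is where the speculative HTM/STM approaches mentioned above come in.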

14 (4) One size does not fit all. Exploit context and structure for efficiency.
[Figure: mapping a construct to its implementation]

15 Exploiting context for efficiency
RISC vs. CISC [c. 80's-90's]
CISC philosophy: map high-level language (HLL) idioms directly to instructions and addressing modes; makes the compiler's job easier
RISC philosophy: minimalist ISA; a sophisticated compiler generates code for HLL constructs tailored to the program's context and structure
[Figure: a loop "for (int i=0; i<N; i++) { ... a[i] ... }" whose array accesses the compiler specializes to the loop context]

16 Transactional semantics: exploiting context
Binding time: when are active nodes and neighborhoods known?
Compile-time: dependence graphs (stencils, dense LA)
After the input is given: inspector-executor (sparse LA)
During program execution or only after the program is finished: interference graphs (unordered algorithms) and optimistic parallelization (ordered algorithms)

17 Transactional semantics: exploiting structure
Operators have structure
Cautious operators: read the entire neighborhood before any write, so there is no need to track writes
Detect conflicts at the ADT level, not the memory level
Exploit structure by generating customized code
A RISC-like approach to implementing transactional semantics
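As an illustration of what "cautious" buys you, here is the maximal-independent-set operator from the earlier example rewritten so that every read of the neighborhood happens before any write (again my own sketch, not Galois code): if the runtime acquires neighborhood elements during the read phase, a conflict can only surface before any state has been modified, so no undo log is needed.

```cpp
#include <cstddef>
#include <vector>

// A cautious operator: all reads precede all writes.
void misOperator(std::size_t v, std::vector<int>& state,
                 const std::vector<std::vector<std::size_t>>& adj) {
  // Read phase: touch every element of the neighborhood, performing no writes.
  if (state[v] != 0) return;                     // read v's state
  for (std::size_t u : adj[v]) (void)state[u];   // read (and, in a runtime, acquire) each neighbor
  // Write phase: the whole neighborhood is held, so this activity can no longer abort.
  state[v] = 1;                                  // put v in the independent set
  for (std::size_t u : adj[v]) state[u] = -1;    // exclude its neighbors
}
```

The conflict check happens at the level of graph nodes (the ADT), not at the level of individual memory words, which is what the slide means by detecting conflicts at the ADT level.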

18 (5) The difference between theory and practice is smaller in theory than in practice.
McKinsey & Co: "So what?"
[Figure: a dataflow machine (Japan)]

19 Galois: Performance on SGI Ultraviolet

20 Graph Analytics on Stampede (256 hosts, 272 threads on each host)
Lower is better in all graphs.
D-Galois is better than D-Ligra on clueweb12 mostly because D-Ligra takes more rounds to converge.

21 Domani (Tomorrow)

22 Become evangelists for parallelism
Many application areas need parallelism, e.g. circuit design (EDA) tools
EDA 3.0: designs are now complex enough that the EDA community wants to parallelize its tools
Circuits are high-diameter hyper-graphs: lots of graph algorithms (but no vertex programs)
DARPA IDEA/POSH program: open-source parallel EDA tool-chain
Let's make EDA 3.0 happen
[Figures: St. Luke the Evangelist; an EDA tool-chain]

23 Principled performance optimization
Parallel platforms are complex: heterogeneous devices, heterogeneous memory, memory hierarchies, multi-user systems
Optimizing performance involves very little science: computer systems are hard to model analytically
One approach [Kuck: codelets]:
Machine learning for building models of programs and platforms
Control theory (predictive control) for optimizing performance using these models

24 Parallel program synthesis
A dream since the 1960's: precise spec → program
Specs based on formal logic are usually more complex than the programs themselves
Informal specifications that can be interpreted precisely? That's how we interact with humans
We need to build synthesis systems that have "common sense" to resolve ambiguity

25 Patron saint of parallel programming
"Pessimism of the intellect, optimism of the will" — Antonio Gramsci (1891–1937)

26 Lessons
(1) It's better to be wrong once in a while than to be right all the time. Runtime parallelization is essential in spite of overheads.
(2) Dependence graphs are not the right abstraction for parallel programming. Embrace don't-care non-determinism.
(3) Algorithms should be structured using data-centric abstractions. Parallel program = Operator + Schedule + Parallel data structure.
(4) One size does not fit all. Exploit context and structure for efficiency.
(5) The difference between theory and practice is smaller in theory than in practice. Always ask yourself "So what?"

