Presentation is loading. Please wait.

Presentation is loading. Please wait.

Architecting Parallel Software with Patterns Kurt Keutzer, EECS, Berkeley Tim Mattson, Intel and the PALLAS team: Bryan Catanzaro, Jike Chong, Matt Moskewicz,

Similar presentations


Presentation on theme: "Architecting Parallel Software with Patterns Kurt Keutzer, EECS, Berkeley Tim Mattson, Intel and the PALLAS team: Bryan Catanzaro, Jike Chong, Matt Moskewicz,"— Presentation transcript:

1 Architecting Parallel Software with Patterns Kurt Keutzer, EECS, Berkeley Tim Mattson, Intel and the PALLAS team: Bryan Catanzaro, Jike Chong, Matt Moskewicz, Michael Murphy, NR Satish, Bor-Yiing Su, Naryanan Sundaram, Youngmin Yi

2 Executive Summary 1. Our challenge in parallelizing applications really reflects a deeper more pervasive problem about inability to develop software in general 1. Corollary: Any highly-impactful solution to parallel programming should have significant impact on programming as a whole 2. Software must be architected to achieve productivity, efficiency, and correctness 3. SW architecture >> programming environments 1. >> programming languages 2. >> compilers and debuggers 3. (>>hardware architecture) 4. Key to architecture (software or otherwise) is design patterns and a pattern language 5. The desired pattern language should span the full range of design from application conceptualization to detailed software implementation 6. Resulting software design then uses a hierarchy of software frameworks for implementation 1. Application frameworks for application (e.g. CAD) developers 2. Programming frameworks for those who build the application frameworks 2

3 3 Outline What doesn’t work Common approaches to approaching parallel programming will not work The scope and nature of the endeavor Challenges in parallel programming are symptoms, not root causes We need a full pattern language Patterns and frameworks Frameworks will be a (the?) primary medium of software development for application developers Detail on Structural Patterns Detail on Computational Patterns High-level examples of composing patterns

4 4 4 Assumption #1: How not to develop parallel code Initial Code Profiler Performance profile Re-code with more threads Not fast enough Fast enough Ship it Lots of failures N PE’s slower than 1

5 5 Steiner Tree Construction Time By Routing Each Net in Parallel BenchmarkSerial2 Threads3 Threads4 Threads5 Threads6 Threads adaptec newblue newblue adaptec adaptec adaptec adaptec newblue average

6 6 6 Assumption #2: This won’t help either Code in new cool language Profiler Performance profile Re-code with cool language Not fast enough Fast enough Ship it After 200 parallel languages where’s the light at the end of the tunnel?

7 7 Parallel Programming environments in the 90’s ABCPL ACE ACT++ Active messages Adl Adsmith ADDAP AFAPI ALWAN AM AMDC AppLeS Amoeba ARTS Athapascan-0b Aurora Automap bb_threads Blaze BSP BlockComm C*. "C* in C C** CarlOS Cashmere C4 CC++ Chu Charlotte Charm Charm++ Cid Cilk CM-Fortran Converse Code COOL CORRELATE CPS CRL CSP Cthreads CUMULVS DAGGER DAPPLE Data Parallel C DC++ DCE++ DDD DICE. DIPC DOLIB DOME DOSMOS. DRL DSM-Threads Ease. ECO Eiffel Eilean Emerald EPL Excalibur Express Falcon Filaments FM FLASH The FORCE Fork Fortran-M FX GA GAMMA Glenda GLU GUARD HAsL. Haskell HPC++ JAVAR. HORUS HPC IMPACT ISIS. JAVAR JADE Java RMI javaPG JavaSpace JIDL Joyce Khoros Karma KOAN/Fortran-S LAM Lilac Linda JADA WWWinda ISETL-Linda ParLin Eilean P4-Linda Glenda POSYBL Objective-Linda LiPS Locust Lparx Lucid Maisie Manifold Mentat Legion Meta Chaos Midway Millipede CparPar Mirage MpC MOSIX Modula-P Modula-2* Multipol MPI MPC++ Munin Nano-Threads NESL NetClasses++ Nexus Nimrod NOW Objective Linda Occam Omega OpenMP Orca OOF90 P++ P3L p4-Linda Pablo PADE PADRE Panda Papers AFAPI. Para++ Paradigm Parafrase2 Paralation Parallel-C++ Parallaxis ParC ParLib++ ParLin Parmacs Parti pC pC++ PCN PCP: PH PEACE PCU PET PETSc PENNY Phosphorus POET. Polaris POOMA POOL-T PRESTO P-RIO Prospero Proteus QPC++ PVM PSI PSDM Quake Quark Quick Threads Sage++ SCANDAL SAM pC++ SCHEDULE SciTL POET SDDA. SHMEM SIMPLE Sina SISAL. distributed smalltalk SMI. SONiC Split-C. SR Sthreads Strand. SUIF. Synergy Telegrphos SuperPascal TCGMSG. Threads.h++. TreadMarks TRAPPER uC++ UNITY UC V ViC* Visifold V-NUS VPE Win32 threads WinPar WWWinda XENOOPS XPC Zounds ZPL

8 8 8 Assumption #3: Nor this Initial Code Super-compiler Performance profile Tune compiler Not fast enough Fast enough Ship it 30 years of HPC research don’t offer much hope

9 9 Automatic parallelization? A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs, Zhao-Hui Du, Chu-Cheow Lim, Xiao-Feng Li, Chen Yang, Qingyu Zhao, Tin-Fook Ngai (Intel Corporation) in PLDI 2004 Aggressive techniques such as speculative multithreading help, but they are not enough. Ave SPECint speedup of 8% … will climb to ave. of 15% once their system is fully enabled. There are no indications auto par. will radically improve any time soon. Hence, I do not believe Auto-par will solve our problems. Results for a simulated dual core platform configured as a main core and a core for speculative execution.

10 10 Outline What doesn’t work Common approaches to parallel programming will not work The scope and nature of the endeavor Challenges in parallel programming are symptoms, not root causes We need a full pattern language Patterns and frameworks Frameworks will be a (the?) primary medium of software development for application developers Detail on Structural Patterns Detail on Computational Patterns High-level examples of composing patterns

11 11 Reinvention of design? In 1418 the Santa Maria del Fiore stood without a dome. In 1418 the Santa Maria del Fiore stood without a dome. Brunelleschi won the competition to finish the dome. Brunelleschi won the competition to finish the dome. Construction of the dome without the support of flying buttresses seemed unthinkable. Construction of the dome without the support of flying buttresses seemed unthinkable.

12 12 Innovation in architecture After studying earlier Roman and Greek architecture, Brunelleschi drew on diverse architectural styles to arrive at a dome design that could stand independently

13 13 Innovation in tools Scaffolding for cupola Mechanism for raising materials His construction of the dome design required the development of new tools for construction, as well as an early (the first?) use of architectural drawings (now lost).

14 14 Innovation in use of building materials Herringbone pattern bricks His construction of the dome design also required innovative use of building materials.

15 15 Resulting Dome Completed dome eng.htm

16 16 The point? of Santa Maria del Fiore Challenges to design and build the dome of Santa Maria del Fiore showed underlying weaknesses of architectural understanding, tools, and use of materials By analogy, parallelizing code should not have thrown us for such a loop. Our difficulties in facing the challenge of developing parallel software are a symptom of underlying weakness is in our abilities to:  Architect software  Develop robust tools and frameworks  Re-use implementation approaches Time for a serious rethink of all of software design Time for a serious rethink of all of software design

17 17 What I’ve learned (the hard way) Software must be architected to achieve productivity, efficiency, and correctness SW architecture >> programming environments  >> programming languages  >> compilers and debuggers  (>>hardware architecture) Discussions with superprogrammers taught me:  Give me the right program structure/architecture I can use any programming language  Give me the wrong architecture and I’ll never get there What I’ve learned when I had to teach this stuff at Berkeley: Key to architecture (software or otherwise) is design patterns and a pattern language Resulting software design then uses a hierarchy of software frameworks for implementation  Application frameworks for application (e.g. CAD) developers  Programming frameworks for those who build the application frameworks

18 18 Outline What doesn’t work Common approaches to parallel programming will not work The scope and nature of the endeavor Challenges in parallel programming are symptoms, not root causes We need a full pattern language Patterns and frameworks Frameworks will be a (the?) primary medium of software development for application developers Detail on Structural Patterns Detail on Computational Patterns High-level examples of composing patterns

19 19 Elements of a pattern language

20 20 Alexander’s Pattern Language Christopher Alexander’s approach to (civil) architecture:  "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.“ Page x, A Pattern Language, Christopher Alexander Alexander’s 253 (civil) architectural patterns range from the creation of cities (2. distribution of towns) to particular building problems (232. roof cap) A pattern language is an organized way of tackling an architectural problem using patterns Main limitation:  It’s about civil not software architecture!!!

21 21 Alexander’s Pattern Language (95-103) Layout the overall arrangement of a group of buildings: the height and number of these buildings, the entrances to the site, main parking areas, and lines of movement through the complex. 95. Building Complex 96. Number of Stories 97. Shielded Parking 98. Circulation Realms 99. Main Building 100. Pedestrian Street 101. Building Thoroughfare 102. Family of Entrances 103. Small Parking Lots

22 22 Family of Entrances (102) May be part of Circulation Realms (98). Conflict: When a person arrives in a complex of offices or services or workshops, or in a group of related houses, there is a good chance he will experience confusion unless the whole collection is laid out before him, so that he can see the entrance of the place where he is going. Resolution: Lay out the entrances to form a family. This means:  1) They form a group, are visible together, and each is visible from all the others.  2) They are all broadly similar, for instance all porches, or all gates in a wall, or all marked by a similar kind of doorway.  May contain Main Entrance (110), Entrance Transition (112), Entrance Room (130), Reception Welcomes You (149).

23 23 Family of Entrances

24 24 Computational Patterns The Dwarfs from “The Berkeley View” (Asanovic et al.) Dwarfs form our key computational patterns

25 25 Patterns for Parallel Programming PLPP is the first attempt to develop a complete pattern language for parallel software development. PLPP is a great model for a pattern language for parallel software PLPP mined scientific applications that utilize a monolithic application style PLPP doesn’t help us much with horizontal composition Much more useful to us than: Design Patterns: Elements of Reusable Object-Oriented Software, Gamma, Helm, Johnson & Vlissides, Addison-Wesley, 1995.

26 26 Structural programming patterns In order to create more complex software it is necessary to compose programming patterns For this purpose, it has been useful to induct a set of patterns known as “architectural styles” Examples:  pipe and filter  event based/event driven  layered  Agent and repository/blackboard  process control  Model-view-controller

27 27 Put it all together

28 28 Graph Algorithms Dynamic Programming Dense Linear Algebra Sparse Linear Algebra Unstructured Grids Structured Grids Model-view controller Iteration Map reduce Layered systems Arbitrary Static Task Graph Pipe-and-filter Agent and Repository Process Control Event based, implicit invocation Graphical models Finite state machines Backtrack Branch and Bound N-Body methods Circuits Spectral Methods Task Decomposition ↔ Data Decomposition Group Tasks Order groups data sharing data access Applications Pipeline Discrete Event Event Based Divide and Conquer Data Parallelism Geometric Decomposition Task Parallelism Graph algorithms Fork/Join CSP Master/worker Loop Parallelism BSP Distributed Array Shared Data Shared Queue Shared Hash Table SPMD Barriers Mutex Thread Creation/destruction Process Creation/destruction Message passing Collective communication Speculation Transactional memory Choose your high level structure – what is the structure of my application? Guided expansion Identify the key computational patterns – what are my key computations? Guided instantiation Implementation methods – what are the building blocks of parallel programming? Guided implementation Choose you high level architecture? Guided decomposition Refine the structure - what concurrent approach do I use? Guided re-organization Utilize Supporting Structures – how do I implement my concurrency? Guided mapping Productivity Layer Efficiency Layer Digital Circuits Semaphores Our Pattern Language 2.0

29 29 Graph Algorithms Dynamic Programming Dense Linear Algebra Sparse Linear Algebra Unstructured Grids Structured Grids Model-view controller Iteration Map reduce Layered systems Arbitrary Static Task Graph Pipe-and-filter Agent and Repository Process Control Event based, implicit invocation Graphical models Finite state machines Backtrack Branch and Bound N-Body methods Circuits Spectral Methods Task Decomposition ↔ Data Decomposition Group Tasks Order groups data sharing data access Applications Pipeline Discrete Event Event Based Divide and Conquer Data Parallelism Geometric Decomposition Task Parallelism Graph algorithms Fork/Join CSP Master/worker Loop Parallelism BSP Distributed Array Shared Data Shared Queue Shared Hash Table SPMD Barriers Mutex Thread Creation/destruction Process Creation/destruction Message passing Collective communication Speculation Transactional memory Choose your high level structure – what is the structure of my application? Guided expansion Identify the key computational patterns – what are my key computations? Guided instantiation Implementation methods – what are the building blocks of parallel programming? Guided implementation Choose you high level architecture? Guided decomposition Refine the structure - what concurrent approach do I use? Guided re-organization Utilize Supporting Structures – how do I implement my concurrency? Guided mapping Productivity Layer Efficiency Layer Digital Circuits Semaphores Our Pattern Language 2.0 Garlan and Shaw Architectural Styles Berkeley View 13 dwarfs

30 30 Outline What doesn’t work Common approaches to parallel programming will not work The scope and nature of the endeavor Challenges in parallel programming are symptoms, not root causes We need a full pattern language Patterns and frameworks Frameworks will be a (the?) primary medium of software development for application developers Detail on Structural Patterns Detail on Computational Patterns High-level examples of composing patterns

31 31 Assumption #4 The future of parallel programming is not training Joe Coder to learn to write parallel programs The future of parallel programming is getting domain experts to use parallel application frameworks (without them ever knowing it) Louis Pasteur Inventor of Pasteurization

32 32 People, Patterns, and Frameworks Design PatternsFrameworks Application DeveloperUses application design patterns (e.g. feature extraction) to architect the application Uses application frameworks (e.g. CBIR) to develop application Application-Framework Developer Uses programming design patterns (e.g. Map/Reduce) to architect the application framework Uses programming frameworks (e.g MapReduce) to develop the application framework

33 33 Frameworks Application Framework Patterns Programming Framework Programming Patterns Computation & Communication Framework Computation & Communication Patterns Target application End User Application Patterns HW target Hardware Architect Patterns and Frameworks Pattern Language SW Infrastructure Platform Application Developer Application Framework Developer Programming Framework Developer Today’s talk focuses only on the middle of the pattern language

34 34 Example: Content-Based Image Retrieval Application Large number of untagged photos Want quick retrieval based on image query

35 Frameworks CBIR Application Framework Patterns Application Framework Developer Map Reduce Programming Framework Map Reduce Programming Pattern Map Reduce Programming Framework Developer CUDA Computation & Communication Framework Barrier/Reduction Computation & Communication Patterns CUDA Framework Developer Face Search Developer Personal Search Cellphone User Feature Extraction & Classifier Application Patterns GeForce 8800 Hardware Architect Example: CBIR Layers Pattern Language SW Infrastructure Platform Application

36 36 Eventually Domain literate programming gurus (1% of the population). Application frameworks Parallel patterns & programming frameworks + Parallel programming gurus (1-10% of programmers) Parallel programming frameworks 3 2 Domain Experts End-user, application programs Application patterns & frameworks + 1 The hope is for Domain Experts to create parallel code with little or no understanding of parallel programming. Leave hardcore “bare metal” efficiency layer programming to the parallel programming experts

37 37 Today Domain literate programming gurus (1% of the population). Application frameworks Parallel patterns & programming frameworks + Parallel programming gurus (1-10% of programmers) Parallel programming frameworks 3 2 Domain Experts End-user, application programs Application patterns & frameworks + 1 For the foreseeable future, domain experts, application framework builders, and parallel programming gurus will all need to learn the entire stack. For the foreseeable future, domain experts, application framework builders, and parallel programming gurus will all need to learn the entire stack. That’s why you all need to be here today! That’s why you all need to be here today!

38 38 Definitions - 1 Design Patterns: “Each design pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.“ Page x, A Pattern Language, Christopher Alexander Structural patterns: design patterns that provide solutions to problems associated with the development of program structure Computational patterns: design patterns that provide solutions to recurrent computational problems

39 39 Definitions - 2 Library: The software implementation of a computational pattern (e.g. BLAS) or a particular sub-problem (e.g. matrix multiply) Software architecture: A composition of structural and compositional patterns Framework: An extensible software environment (e.g. Ruby on Rails) organized around a software architecture (e.g. model- view-controller) that allows for programmer customization only in harmony with the software architecture Domain specific language: A programming language (e.g. Matlab) that provides language constructs that particularly support a particular application domain. The language may also supply library support for common computations in that domain (e.g. BLAS). If the language is restricted to maintain fidelity to a one or more structural patterns and provides library support for common computations then it encompasses a framework (e.g. NPClick).

40 40 Decompose Tasks Group tasks Order Tasks Architecting Parallel Software Identify the Software Structure Identify the Key Computations Decompose Data Identify data sharing Identify data access

41 41 Pipe-and-Filter Agent-and-Repository Event-based coordination Iterator MapReduce Process Control Layered Systems Identify the SW Structure Structural Patterns These define the structure of our software but they do not describe what is computed

42 42 Analogy: Layout of Factory Plant

43 43 Identify Key Computations Computational patterns describe the key computations but not how they are implemented Computational Patterns

44 44 Analogy: Machinery of the Factory

45 45 Pipe-and-Filter Agent-and-Repository Event-based Bulk Synchronous MapReduce Layered Systems Arbitrary Task Graphs Decompose Tasks/Data Order tasksIdentify Data Sharing and Access Graph Algorithms Dynamic programming Dense/Spare Linear Algebra (Un)Structured Grids Graphical Models Finite State Machines Backtrack Branch-and-Bound N-Body Methods Circuits Spectral Methods Architecting Parallel Software Identify the Software Structure Identify the Key Computations

46 46 Analogy: Architected Factory Raises appropriate issues like scheduling, latency, throughput, workflow, resource management, capacity etc.

47 47 Outline What doesn’t work Common approaches to parallel programming will not work The scope and nature of the endeavor Challenges in parallel programming are symptoms, not root causes We need a full pattern language Patterns and frameworks Frameworks will be a (the?) primary medium of software development for application developers Detail on Structural Patterns Detail on Computational Patterns High-level examples of composing patterns

48 Logic Optimization

49 49 Decompose Tasks Group tasks Order Tasks Decompose Data Data sharing Data access Architecting Parallel Software Identify the Software Structure Identify the Key Computations

50 50 Structure of Logic Optimization Build Data model netlist Scan Netlist Optimize circuit Highest level structure is the pipe-and-filter pattern (just because it’s obvious doesn’t mean it’s not worth stating!)(just because it’s obvious doesn’t mean it’s not worth stating!)

51 51 Structure of optimization Circuit representation timing optimization area optimization power optimization  Agent and repository : Blackboard structural pattern Key elements:  Blackboard: repository of the resulting creation that is shared by all agents (circuit database)  Agents: intelligent agents that will act on blackboard (optimizations)  Controller: orchestrates agents access to the blackboard and creation of the aggregate results (scheduler)

52 52 Timing Optimization While (! user_timing_constraint_met && power_in_budget){ restructure_circuit(netlist); remap_gates(netlist); resize_gates(netlist); retime_gates(netlist); …. more optimizations … …. }

53 53 Structure of Timing Optimization Timing transformati ons controller user timing constraints Speed? Launching transformations Local netlist Power? Process control structural pattern

54 54 Architecture of Logic Optimization Graph algorithm pattern Group, order tasks DecomposeData

55 55 Parallelism in Logic Synthesis Logic synthesis offers lots of easy coarse-grain parallelism: Run n scripts/recipes and choose the best For per-instance parallelism: General program structure offers modest amounts amount of parallelism: We can pipeline (pipe-and-filter) scanning, parsing, database/datamodel building We can decouple agents (e.g. power and timing) acting on the repository We can decouple sensor (e.g. timing analysis) and actuator (e.g. timing optimization) We can use programming patterns like graph traversal and branch-and-bound But how do we keep 128 processors busy?

56 56 Here’s a hint …

57 Key to Parallelizing Logic Optimization? We must exploit the data parallelism inherent in a graph/netlist with >2,000,000 cells Partition graphs/netlists into highly/completely independent modules Even modest amount of synchronization (e.g. stitching together overlapped regions) will devastate performance due to Amdahl ’ s law

58 58 Data Parallelism in Respository Repository manager must partition underlying circuit to allow many agents (timing, power, area optimizers) to operation on different partitions simultaneously Chips with >2M cells easily enable opportunities for manycore parallelism timing Optimization 1 timing Optimization 2 timing Optimization 3 timing optimization 128 …….. CircuitDatabase

59 Moral of the story Architecting an application doesn’t automatically make it parallel Architecting an application brings to light where the parallelism most likely resides Humans must still analyze the architecture to identify opportunities for parallelism However, significantly more parallelism is identified in this way than if we worked bottom-up to identify local parallelism 59

60 60 Today’s take away Many approaches to parallelizing software are not working  Profile and improve  Swap in a new parallel programming language  Rely on a super parallelizing compiler My own experience has shown that a sound software architecture is the greatest single indicator of a software project’s success. Software must be architected to achieve productivity, efficiency, and correctness SW architecture >> programming environments  >> programming languages  >> compilers and debuggers  (>>hardware architecture) If we had understood how to architect sequential software, then parallelizing software would not have been such a challenge Key to architecture (software or otherwise) is design patterns and a pattern language At the highest level our pattern language has:  Eight structural patterns  Thirteen computational patterns Yes, we really believe arbitrarily complex parallel software can built just from these!

61 More examples 61

62 62 Architecting Speech Recognition Signal Processing Inference Engine Recognition Network Voice Input Most Likely Word Sequence Iterator Pipe-and-filter MapReduce Beam Search Iterations Active State Computation Steps Dynamic Programming Graphical Model Pipe-and-filter Large Vocabulary Continuous Speech Recognition Poster: Chong, Yi Work also to appear at Emerging Applications for Manycore Architecture

63 63 CBIR Application Framework Results Exercise Classifier Train Classifier Feature Extraction User Feedback Choose Examples New Images ? ? Catanzaro, Sundaram, Keutzer, “Fast SVM Training and Classification on Graphics Processors”, ICML 2008

64 64 Feature Extraction Image histograms are common to many feature extraction procedures, and are an important feature in their own right 64 Agent and Repository: Each agent computes a local transform of the image, plus a local histogram. Results are combined in the repository, which contains the global histogram  The data dependent access patterns found when constructing histograms make them a natural fit for the agent and repository pattern

65 65 Train Classifier: SVM Training 65 Update Optimality Conditions Select Working Set, Solve QP Select Working Set, Solve QP Train Classifier iterate Iterator MapReduce Gap not small enough?

66 66 Exercise Classifier : SVM Classification Compute dot products Compute Kernel values, sum & scale Output Test Data SV Exercise Classifier MapReduce Dense Linear Algebra

67 67 Key Elements of Kurt’s SW Education AT&T Bell Laboratories: CAD researcher and programmer  Algorithms: D. Johnson, R. Tarjan  Programming Pearls: S C Johnson, K. Thompson, (Jon Bentley)  Developed useful software tools: Plaid: programmable logic aid: used for developing 100’s of FPGA-based HW systems CONES/DAGON: used for designing >30 application-specific integrated circuits Synopsys: researcher  CTO (25 products, ~15 million lines of code, $750M annual revenue, top 20 SW companies)  Super programming: J-C Madre, Richard Rudell, Steve Tjiang  Software architecture: Randy Allen, Albert Wang  High-level Invariants: Randy Allen, Albert Wang Berkeley: teaching software engineering and Par Lab  Took the time to reflect on what I had learned:  Architectural styles: Garlan and Shaw Design patterns: Gamma et al (aka Gang of Four), Mattson’s PLPP A Pattern Language: Alexander, Mattson Dwarfs: Par Lab Team


Download ppt "Architecting Parallel Software with Patterns Kurt Keutzer, EECS, Berkeley Tim Mattson, Intel and the PALLAS team: Bryan Catanzaro, Jike Chong, Matt Moskewicz,"

Similar presentations


Ads by Google