Confronting Manycore: Parallel Programming Beyond Multicore


1 Confronting Manycore: Parallel Programming Beyond Multicore
Dr. Michael Wrinn, Intel Corporation, July 2, 2008 11/20/2018

2 Agenda
- The multicore/manycore landscape: why this switch to multiple cores, and who is making it?
- The software challenge: manycore is more than "more multicore"
- Confronting the challenge at a conceptual level: programming models, design patterns
- Confronting the challenge at the implementation level: extending current languages; explicitly parallel languages
- Call for action/help: please work with us; sign up for the community

3 The good news: Moore's Law isn't done yet
- 65 nm process (30 nm prototype): 2005
- 45 nm process (20 nm prototype): 2007
- 32 nm process (15 nm prototype): 2009
- 22 nm process (10 nm prototype): 2011
Combined with advanced packaging, we get the familiar transistor doubling with each generation.
Technology node (nm): 90, 65, 45, 32, 22
Integration capacity (BT): 2, 4, 8, 16
Speaker note: Keep in mind that when DNA is packaged into a chromosome, an intermediate step is to package it into a 5 nm wide ribbon. So life sciences people enjoy seeing that we're approaching molecular scales in our devices by the end of the decade.
These are projections only and may not be reflected in future products from Intel Corp. Source: Intel

4 Historic SPECint 2000 performance
The bad news: single-thread performance is falling off (ILP at the point of diminishing returns?)
[Chart: historic SPECint 2000 performance by year]
"The free lunch is over." (Herb Sutter, Microsoft)
Source: published SPECint data

5 Worse news: power trends (normalized to i486)
Growth in power is unsustainable.
Speaker note: In this graph, we plot the scalar performance and power of four generations of Intel microprocessors. To emphasize the effects of product design, we factor out effects due to process technology and normalize all data to the i486 microprocessor. Compared to the i486, the Pentium 4 is 6x faster (2x IPC at 3x frequency), 23x higher power, and spends 4 units of power for every 1 unit of performance: the Pentium 4 (Willamette) consumes 4 times the energy per instruction (EPI) of the i486. The historical relationship between power and scalar performance roughly follows a square law; Fred Pollack observed "We are on the wrong side of a square law" in 1999.
Source: Intel

6 Architecture optimized for power: a big step in the right direction
Current CPUs with shallow pipelines use less power. (Same graph as the previous slide: scalar performance and power of four processor generations, normalized to the i486.)
Source: Intel

7 Better power and thermal management
Solution: multi-core.
[Diagram: one large core with cache vs. four small cores (C1-C4) sharing a cache. Relative to the large core, each small core has Power = 1/4 and Performance = 1/2; multi-core is power efficient, with better power and thermal management.]
Data represents performance trends and is not a projection of expected performance in future products.

8 Next step, manycore: a fundamental design change, converging from many directions
- General-purpose GPU: Tesla (NVIDIA), Cell (IBM/Sony/Toshiba), FireStream (AMD)
- General-purpose manycore: Tilera (MIT), RAMP (Berkeley), Terascale R&D (Intel), Larrabee (Intel, 2009)
- Heterogeneous manycore: Fusion (AMD, announced)
- Multicore evolving to manycore: Sun UltraSPARC T2 (8 cores x 8 threads = 64), IBM z10 (20 CPU cores + 2 service cores), Intel Nehalem (8 cores x 2 threads = 16, 2008)
- Also: a support industry is appearing, e.g. RapidMind (ports to Tesla, Cell, x86)

9 Future manycore platform
Many Intel Architecture cores (conceptual schematic only). Combine Larrabee and a multi-core CPU: a heterogeneous manycore platform.

10 Software challenge for manycore: scaling, programming model
[Chart: typical speedup curve vs. ideal speedup curve]
US DOE, internal report: the ultimate destination is manycore, BUT the move is as significant as the migration from vector to MPP, and there is widespread panic regarding the programming model.
Burton Smith (Microsoft): "The many-core inflection point presents a new challenge for our industry, namely general-purpose parallel computing. Unless this challenge is met, the continued growth and importance of computing itself and of the businesses engaged in it are at risk."
EE Times, Feb 15, 2008, "Multicore puts screws to parallel-programming models" (quoting William Dally, Stanford): "It's a critical problem... The danger is we will not have a good model when we need it, and people will wind up creating a generation of difficult legacy code we will have to live with for a long time."

11 Software challenge for manycore: traditional CS thinking is deeply sequential
The Peanut Butter & Jelly example:
1. First, place a paper plate on the kitchen counter.
2. Then, get a loaf of white bread and a jar of peanut butter from the pantry, as well as a jar of grape jelly from the refrigerator, and place all of these things on the kitchen counter.
3. Next, take two slices of white bread out of the bread bag, and put them on the paper plate.
4. Take a butter knife out of the kitchen drawer, and place it on the paper plate as well.
5. Open the peanut butter jar, and use the butter knife to remove some peanut butter; proceed to butter one slice of bread.
6. Afterwards, open the jelly jar, take some jelly out with the knife, and smear some jelly onto the other slice.
7. Place one slice of bread onto the other, so that the two sides with condiments are facing each other.
8. Enjoy your peanut butter and jelly sandwich!
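The recipe is deliberately over-sequentialized: most of its steps are actually independent. As an illustration (my sketch, not from the slides, with hypothetical step names), the "gather" steps can run concurrently, the two spreading steps depend only on the gathered items, and only final assembly forces a join:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(item):
    # stands in for walking to the pantry/fridge/drawer (steps 1-4)
    return item

with ThreadPoolExecutor() as pool:
    # the gather steps are mutually independent, so they can run in parallel
    plate, bread, pb, jelly, knife = pool.map(
        fetch, ["plate", "bread", "peanut butter", "jelly", "knife"])
    # spreading peanut butter and spreading jelly (steps 5-6) touch
    # different slices, so they can also overlap
    side1 = pool.submit(lambda: f"{pb} on one slice")
    side2 = pool.submit(lambda: f"{jelly} on the other slice")
    # only assembly (step 7) must wait for both spreading tasks
    sandwich = (side1.result(), side2.result())

print(sandwich)
```

The point of the slide survives the rewrite: the sequential recipe hides a dependency graph, and it takes explicit analysis to recover the parallelism.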

12 Software challenge for manycore: traditional CS programs hard to change
Curriculum:
- What material gets dropped?
- New course, or a change to an existing one?
Material hard to find:
- Textbook? (Parallelism was dropped from CLRS 2nd ed.!)
- Lecture material, demos, labs?
What is the right model/approach/language to teach?

13 Software challenge for manycore: academics are excited... and sceptical
Dave Patterson, Berkeley, on the move to manycore: "Computer architecture is back... End of the La-Z-Boy era of programming."
Charles Leiserson, MIT: "The Age of Serial Computing is over."
Karsten Schwan, Georgia Tech: "Since processors will be parallel, like the multi-core chips from Intel, we have to start educating and thinking in terms of parallel. Cannot assume that current models, for multicore SMP, will scale to manycore. Need to find new programming models for manycore processors."
Edward Lee, Berkeley: "Intel, for example, has embarked on an active campaign to get leading computer science academic programs to put more emphasis on multi-threaded programming. If they are successful, and the next generation of programmers makes more intensive use of multithreading, then the next generation of computers will become nearly unusable."

14 Manycore programming models: active R&D (in contrast to SMP multicore)
Conceptual models:
- "View from Berkeley": 13 dwarfs
- Design patterns; RMS (Intel)
Alternatives to threads (everything old is new again?): Linda, MPI, functional languages, skeletons (templates, frameworks such as Cactus), actors...
Extensions to current languages:
- X10 (IBM/DARPA)
- Ct (Intel CTG)
- OpenMP, TBB: will they continue to scale?
Explicitly parallel programming languages: Erlang (Ericsson), TStreams (Intel/MIT), Cilk, Haskell, Charm++, Titanium, etc.
IEEE Computer, May 2006: "For concurrent programming to become mainstream, we must discard threads as a programming model." (i.e., move to a higher level of abstraction)

15 Conceptual models: Berkeley's 13 dwarfs
A dwarf is an algorithmic method that captures a pattern of computation and communication.
1. Dense linear algebra
2. Sparse linear algebra
3. Spectral methods
4. N-body methods
5. Structured grids
6. Unstructured grids
7. MapReduce
8. Combinational logic
9. Graph traversal
10. Dynamic programming
11. Backtrack / branch-and-bound
12. Graphical models
13. Finite state machines
Source: The Landscape of Parallel Computing Research: A View from Berkeley, 2006 (continuing)
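To make one dwarf concrete, here is a minimal sketch (mine, not from the talk) of the MapReduce dwarf: the map phase produces independent partial results per input chunk, which is the parallelizable part, and the reduce phase combines them with an associative operation.

```python
from collections import Counter
from functools import reduce

chunks = ["a b a", "b c", "a c c"]   # input split into independent chunks

# map phase: each chunk is processed with no shared state, so the
# per-chunk word counts could be computed on separate cores
partials = [Counter(chunk.split()) for chunk in chunks]

# reduce phase: combine the partial counts
counts = reduce(lambda x, y: x + y, partials, Counter())

print(dict(counts))
```

Because the combine step is associative, the reduction itself can also be done as a parallel tree rather than the left fold shown here.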

16 Conceptual models: Berkeley's 13 dwarfs
IEEE Computer, May 2006: "For concurrent programming to become mainstream, we must discard threads as a programming model." (higher level of abstraction)

17 Conceptual models: Berkeley's 13 dwarfs
[Chart: importance of each dwarf to application areas, from red (hot) to blue (cool)]
Speaker note: Some people might subdivide the dwarfs, some might combine them. When stretching a computer architecture you can define subcategories as well, e.g. Graph Traversal = Probabilistic Models, N-Body = Particle Methods, ...

18 Conceptual models: patterns
A design pattern language for parallel algorithm design, with examples in MPI, OpenMP and Java. It represents the authors' hypothesis for how programmers think about parallel programming.
NOTE: this is just a hypothesis... a starting point. It needs more peer review and experiments to validate the theories.

19 The Pattern Language's structure
A software design can be viewed as a series of refinements. Consider the process in terms of four design spaces, adding progressively lower-level elements to the design:
- Finding Concurrency: tasks, shared data, partial orders
- Algorithm Structure: thread/process structures, schedules
- Supporting Structures: source code organization, shared data
- Implementation Mechanisms: messages, synchronization, spawn

20 The Finding Concurrency design space
Finding the scope for parallelization:
- Begin with a sequential application that solves the original problem
- Decompose the application into tasks or data sets (Decomposition Analysis: task decomposition, domain decomposition)
- Analyze the dependencies among the decomposed tasks (Dependency Analysis: group tasks, order groups, data sharing)
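As an illustration of the "group tasks / order groups" steps, here is a toy dependency analysis (my sketch, with made-up task names, not from the pattern language itself) that sorts tasks into waves; tasks within a wave have no mutual dependencies and could run concurrently:

```python
# each task maps to the tasks it depends on (hypothetical pipeline)
deps = {
    "load":   [],
    "parse":  ["load"],
    "index":  ["parse"],
    "stats":  ["parse"],
    "report": ["index", "stats"],
}

def order_groups(deps):
    # repeatedly peel off every task whose prerequisites are all done;
    # each "wave" is a group of mutually independent tasks
    done, waves = set(), []
    while len(done) < len(deps):
        wave = sorted(t for t, ds in deps.items()
                      if t not in done and all(d in done for d in ds))
        if not wave:
            raise ValueError("cyclic dependency")
        waves.append(wave)
        done.update(wave)
    return waves

print(order_groups(deps))
```

Here "index" and "stats" land in the same wave: the dependency analysis, not the original program order, determines what may overlap.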

21 The Algorithm Structure design space
Structure used for organizing computations to support parallelization. How is the computation structured?
- Organized by tasks: linear → Task Parallelism; recursive → Divide and Conquer
- Organized by data: regular → Geometric Decomposition; irregular → Recursive Data
- Organized by flow of data: regular → Pipeline; irregular → Event-based Coordination
"Geometric Decomposition" is sometimes called "Data Decomposition".
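For instance, Geometric Decomposition in its simplest form: partition the data into contiguous chunks, process each chunk independently, then combine the partial results. A minimal sketch (mine, not from the talk):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(xs, nchunks=4):
    # geometric (data) decomposition: split the data into contiguous
    # chunks, sum each chunk independently, then combine the partial sums
    k = max(1, len(xs) // nchunks)
    chunks = [xs[i:i + k] for i in range(0, len(xs), k)]
    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        return sum(pool.map(sum, chunks))

total = chunked_sum(list(range(100)))
print(total)
```

The same shape generalizes to grids: each chunk owns a region of the data, and only the combine step (or boundary exchange) involves communication.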

22 The Supporting Structures design space
High-level constructs used to organize the source code, categorized into program structures and data structures.
- Program structures: SPMD, Boss/Worker, Loop Parallelism, Fork/Join
- Data structures: Shared Data, Shared Queue, Distributed Array
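Two of those structures combined in a short sketch (my example): a Boss/Worker program structure built on a Shared Queue, here using Python's thread-safe queue.Queue:

```python
import queue
import threading

tasks, results = queue.Queue(), queue.Queue()

def worker():
    # workers repeatedly pull from the shared queue; a None sentinel stops them
    while True:
        n = tasks.get()
        if n is None:
            break
        results.put(n * n)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):        # the boss hands out work items
    tasks.put(n)
for _ in workers:          # one sentinel per worker
    tasks.put(None)
for w in workers:
    w.join()

total = sum(results.get() for _ in range(10))
print(total)
```

The shared queue gives automatic load balancing: faster workers simply pull more items, which is why this structure suits irregular task sizes.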

23 The Implementation Mechanisms design space
Low-level constructs implementing the specific mechanisms used in parallel computing, with examples in Java, OpenMP and MPI. These are not proper design patterns; they are included to make the pattern language self-contained.
- UE* management: process control, thread control
- Synchronization: mutual exclusion, memory sync/fences, barriers
- Communications: collective communication, message passing, other communication
* UE = unit of execution
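Two of these mechanisms in a short sketch (my example, not from the pattern language text): a barrier separating two phases across units of execution, and a lock providing mutual exclusion on a shared list:

```python
import threading

NUM_UES = 4
barrier = threading.Barrier(NUM_UES)   # all UEs meet here between phases
lock = threading.Lock()                # mutual exclusion for the shared log
log = []

def unit_of_execution(i):
    with lock:                         # critical section
        log.append(("phase1", i))
    barrier.wait()                     # no UE starts phase 2 until all finish phase 1
    with lock:
        log.append(("phase2", i))

threads = [threading.Thread(target=unit_of_execution, args=(i,))
           for i in range(NUM_UES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# the barrier guarantees every phase-1 entry precedes every phase-2 entry
phases = [p for p, _ in log]
print(phases)
```

Within each phase the interleaving of UEs is nondeterministic; only the phase boundary is ordered, which is exactly the guarantee a barrier provides.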

24 Programming pattern language: how it might look...
Productivity layer (steps 1-3):
1. Guided expansion: choose your high-level structure (Agent and Repository; Arbitrary Static Task Graph; Bulk Synchronous; Event-based, Implicit Invocation; Layered Systems; Map Reduce; Model-View-Controller; Pipe-and-Filter; Process Control)
2. Guided instantiation: identify the key computational patterns (Backtrack / Branch and Bound; Circuits; Dense Linear Algebra; Dynamic Programming; Finite State Machines; Graph Algorithms; Graphical Models; N-Body Methods; Sparse Linear Algebra; Spectral Methods; Structured Grids; Unstructured Grids)
3. Guided reorganization: refine the structure (Data Parallelism; Digital Circuits; Discrete Event; Divide and Conquer; Event Based; Geometric Decomposition; Graph Algorithms; Pipeline; Task Parallelism)
Efficiency layer (steps 4-5):
4. Guided mapping: choose the concurrent approach; utilize supporting structures (CSP; Distributed Array; Fork/Join; Loop Parallelism; Master/Worker; Shared Data; Shared Hash Table; Shared Queue)
5. Guided implementation: choose the method and building blocks (Barriers; Collective Communication; Message Passing; Mutex; Process Creation/Destruction; Semaphores; Speculation; Thread Creation/Destruction; Transactional Memory)

25 Implementations: extending current languages. Parallel programming APIs today
- Thread libraries: Win32 API, POSIX threads, TBB; managed runtime extensions: java.util.concurrent, X10, Parallel Extensions for .NET
- Compiler directives: OpenMP (portable shared-memory parallelism)
- Message-passing libraries: MPI
Avoiding choice overload: a glut of options scares consumers (i.e. ISVs) away... less is more.

26 Extending existing APIs, example: OpenMP
OpenMP 2.5 cannot deal with a pointer-following loop:

    nodeptr list, p;
    for (p = list; p != NULL; p = p->next)
        process(p->data);

OpenMP 3.0 fixes this by adding a new task construct:

    nodeptr list, p;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (p = list; p != NULL; p = p->next)
                #pragma omp task firstprivate(p)
                process(p->data);
        }
    }

The name OpenMP is the property of the OpenMP Architecture Review Board.

27 Implementations: explicitly parallel languages. Be careful what you wish for...
ABCPL ACE ACT++ Active messages Adl Adsmith ADDAP AFAPI ALWAN AM AMDC AppLeS Amoeba ARTS Athapascan-0b Aurora Automap bb_threads Blaze BSP BlockComm C*. "C* in C C** CarlOS Cashmere C4 CC++ Chu Charlotte Charm Charm++ Cid Cilk CM-Fortran Converse Code COOL CORRELATE CPS CRL CSP Cthreads CUMULVS DAGGER DAPPLE Data Parallel C DC++ DCE++ DDD DICE. DIPC DOLIB DOME DOSMOS. DRL DSM-Threads Ease . ECO Eiffel Eilean Emerald EPL Excalibur Express Falcon Filaments FM FLASH The FORCE Fork Fortran-M FX GA GAMMA Glenda GLU GUARD HAsL. Haskell HPC++ JAVAR. HORUS HPC IMPACT ISIS. JAVAR JADE Java RMI javaPG JavaSpace JIDL Joyce Khoros Karma KOAN/Fortran-S LAM Lilac Linda JADA WWWinda ISETL-Linda ParLin P4-Linda POSYBL Objective-Linda LiPS Locust Lparx Lucid Maisie Manifold Mentat Legion Meta Chaos Midway Millipede CparPar Mirage MpC MOSIX Modula-P Modula-2* Multipol MPI MPC++ Munin Nano-Threads NESL NetClasses++ Nexus Nimrod NOW Objective Linda Occam Omega OpenMP Orca OOF90 P++ P3L p4-Linda Pablo PADE PADRE Panda Papers AFAPI. Para++ Paradigm Parafrase2 Paralation Parallel-C++ Parallaxis ParC ParLib++ Parmacs Parti pC pC++ PCN PCP: PH PEACE PCU PET PETSc PENNY Phosphorus POET. Polaris POOMA POOL-T PRESTO P-RIO Prospero Proteus QPC++ PVM PSI PSDM Quake Quark Quick Threads Sage++ SCANDAL SAM SCHEDULE SciTL POET SDDA. SHMEM SIMPLE Sina SISAL. distributed smalltalk SMI. SONiC Split-C. SR Sthreads Strand. SUIF. Synergy Telegrphos SuperPascal TCGMSG. Threads.h++. TreadMarks TRAPPER uC++ UNITY UC V ViC* Visifold V-NUS VPE Win32 threads WinPar XENOOPS XPC Zounds ZPL
Speaker note: Let's look at the history of supercomputing and see if we can get some clues about what we should do. Back in the late 80's and early 90's, we thought all the ISVs needed to enter parallel computing was a portable application programming interface. So the universities and a number of small companies (3 of them, 2 of which I used to work for) came up with all sorts of APIs. It is my firm belief that this backfired on us. The fact that there were so many APIs made parallel computing computer scientists look stupid. If we couldn't agree on an effective approach to parallel computing, how could we expect non-specialist ISVs to figure this out?
Third party names are the property of their owners.

28 Software challenge for manycore: what is Intel doing about this?
- Research funding: UPCRC at Berkeley & Illinois (jointly with Microsoft); PPL at Stanford (with several others)
- Tools: Intel Thread Checker, Thread Profiler; Threading Building Blocks (open source)
- University Program: multi-core content and training; curriculum development grants
Third party names are the property of their owners.

29 Intel Academic Community: collaborating worldwide
In 69 countries around the world, at more than 700 universities, 1,150 professors are teaching multi-core programming to new software professionals.

30 Intel University Program
- Courseware to drop into existing courses
- Continual technology updates
- Free Intel® Software Development Tools licenses for the classroom
- Wiki and blog for collaboration
- Forums for specific questions
academiccommunity.intel.com
Third party names are the property of their owners.

31 Call to action
Join the effort: academiccommunity.intel.com

