1 A Portable Runtime Interface For Multi-Level Memory Hierarchies
Mike Houston, Ji-Young Park, Manman Ren, Timothy Knight, Kayvon Fatahalian, Alex Aiken, William Dally, Pat Hanrahan
Stanford University

2 The Problem
Lots of different architectures
– Shared memory
– Distributed memory
– Exposed communication
– Disk systems
Each architecture has its own programming system
Composed machines difficult to program and manage
– Different mechanisms for each architecture
Previous runtimes and languages limited
– Designed for a single architecture
– Struggle with memory hierarchies

3 Shared Memory Machines
AMD Barcelona
SGI Altix

4 Distributed Memory Machines
MareNostrum: 2,282 nodes
ASC Blue Gene/L: 65,536 nodes

5 Exposed Communication Architectures
STI CELL processor: 90nm | ~220 mm² | ~100 W | ~200 GFLOPS

6 Complex machines
Cluster of SMPs?
– MPI?
Cluster of Cell processors?
– MPI + ALF/Cell SDK
Cluster of clusters of SMPs with Cell accelerators?
What about disk systems?
– Disk I/O often second class

7 Previous Work
APIs
– MPI/MPI-2
– Pthreads
– OpenMP (Dagum et al. 1998)
– GASNet (Bonachea et al. 2002)
– Charm (Kale et al. 1993)
– SVM (Labonte et al. 2004)
– Cell SDK (IBM 2006)
– CellSs (Bellens et al. 2006)
– HTA (Bikshandi et al. 2006)
– …
Languages
– Co-Array Fortran (Numrich et al. 1994)
– Titanium (Yelick et al. 1998)
– UPC (Carlson et al. 1999)
– ZPL (Deitz et al. 2004)
– Chapel (Callahan et al. 2004)
– X10 (Charles et al. 2005)
– Brook (Buck et al. 2004)
– Cilk (Blumofe et al. 1995)
– …

8 Contributions
Uniform scheme for explicitly describing memory hierarchies
– Capture common traits important for performance
– Allow composition of memory hierarchies
Simple, portable API for many parallel machines
– Mechanism independence for communication and management of parallel resources
– Few entry points
– Efficient execution
Efficient compiler target
– Compiler can concentrate on global optimizations
– Runtime manages mechanisms

9 The Sequoia System
Programming language for memory hierarchies
– Fatahalian et al. (Supercomputing 2006)
– Adapts the Parallel Memory Hierarchies programming abstraction
Compiler for exposed communication architectures
– Knight et al. (PPoPP 2007)
– Takes in a Sequoia program, machine file, and mapping file
– Generates task calls and data transfers
– Performs large-granularity optimizations
Portable runtime system
– Houston et al. (PPoPP 2008)
– Target for the Sequoia compiler
– Serves as the abstract machine
http://sequoia.stanford.edu

10 Abstract Machine Model

11 Parallel Memory Hierarchy Model (PMH)
Abstract machines as trees of memories, where each memory is an address space (Alpern et al. 1995)
(figure: a tree with a memory at every node and a CPU attached to each memory)

12 Parallel Memory Hierarchy Model
(figure: a single CPU and memory, one node of the tree)

13 Example Mappings
(figures: PMH trees for several machines)
– Cell processor: main memory at the root, with eight SPE local stores (LS) as children
– Cluster: an aggregate node memory (virtual level) over the per-node memories and CPUs
– Disk system: disk at the root, node memory below it, CPU at the leaf
– Shared memory multi-processor: main memory over per-CPU L2 caches and CPUs

14 Multi-Level Configurations
(figures: composed PMH trees)
– Disk + Playstation 3: disk over main memory over the SPE local stores
– Cluster of Playstation 3s: aggregate cluster memory over per-PS3 main memories over SPE local stores
– Cluster of SMPs: aggregate cluster memory over node memories over per-CPU L2s and CPUs

15 Abstraction Rules
Tree of nodes: 1 control thread per node, 1 memory per node
Threads can:
– Transfer in bulk from/to parent memory asynchronously
– Wait for transfers from/to parent to complete
– Allocate data
– Only access their memory directly
– Transfer control to child node(s)
– Non-leaf threads only operate to move data and control
– Synchronize with siblings
Similar to Space Limited Procedures (Alpern et al. 1995)
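To make these rules concrete, here is a minimal sketch (not from the slides) of one tree node and its allowed operations; the names and handle types below are placeholders, not the Sequoia runtime's actual classes, which appear on the interface slides later.

#include <cstddef>
#include <vector>

// Placeholder handle types for this sketch only.
using XferHandle_t = int;
using TaskHandle_t = int;
using TaskID_t     = int;

// One node of the Parallel Memory Hierarchy tree: one memory, one control thread.
struct PMHNode {
    void*                 memory;    // this node's address space
    PMHNode*              parent;    // null at the root of the tree
    std::vector<PMHNode*> children;  // empty at the leaves

    // Bulk asynchronous copies between this memory and the parent's memory;
    // the control thread can overlap them with other work and wait later.
    XferHandle_t transferFromParent(std::size_t parentOff, std::size_t localOff, std::size_t bytes);
    XferHandle_t transferToParent(std::size_t localOff, std::size_t parentOff, std::size_t bytes);
    void         waitTransfer(XferHandle_t h);

    void* allocate(std::size_t bytes);               // data is only directly accessible here

    // Hand control to a range of children; non-leaf threads only move data and control.
    TaskHandle_t callChildren(TaskID_t task, int firstChild, int lastChild);
    void         waitChildren(TaskHandle_t h);

    void siblingBarrier();                            // synchronize with sibling nodes
};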

16 Portable Runtime Interface

17 Requirements
Resource allocation
– Data allocation and naming
– Setup parallel resources
Explicit bulk asynchronous communication
– Transfer lists
– Transfer commands
Parallel execution
– Launch tasks on children
– Asynchronous
Synchronization
– Make sure tasks/transfers complete before continuing
Runtime isolation
– No direct knowledge of other runtimes

18 Graphical Runtime Representation
(figure: a runtime spans two adjacent levels of the tree: the memory and CPU at level i+1 above, and the level i memories and CPUs of children 1 through N below)

19 Top Interface

// create and free runtime
Runtime(TaskTable table, int numChildren);
virtual ~Runtime();

// allocate and deallocate arrays
virtual Array_t* AllocArray(Size_t elmtSize, int dimensions, Size_t* dim_sizes,
                            ArrayDesc_t descriptor, int alignment) = 0;
virtual void FreeArray(Array_t* array) = 0;

// array naming
virtual void AddArray(Array_t array);
virtual Array_t GetArray(ArrayDesc_t descriptor);
virtual void RemoveArray(ArrayDesc_t descriptor);

// launch and synchronize on tasks
virtual TaskHandle_t CallChildTask(TaskID_t taskid, ChildID_t start, ChildID_t end) = 0;
virtual void WaitTask(TaskHandle_t handle) = 0;
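A hedged usage sketch of this top interface as a parent-level control thread might drive it; the 1024x1024 shape, the 128-byte alignment value, and task id 0 are illustrative assumptions, and the array descriptor is passed in rather than invented.

// Sketch only: drive the top interface from a parent-level control thread.
void runAtThisLevel(Runtime* rt, int numChildren, ArrayDesc_t desc) {
    Size_t dims[2] = { 1024, 1024 };
    Array_t* A = rt->AllocArray(sizeof(float), /*dimensions=*/2, dims,
                                desc, /*alignment=*/128);
    rt->AddArray(*A);                          // make the array visible to child runtimes by name

    // Launch task 0 on children [0, numChildren-1]; the call itself is asynchronous.
    TaskHandle_t h = rt->CallChildTask(/*taskid=*/0, 0, numChildren - 1);
    rt->WaitTask(h);                           // block until every child has finished

    rt->RemoveArray(desc);
    rt->FreeArray(A);
}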

20 Bottom Interface

// array naming
virtual Array_t* GetArray(ArrayDesc_t descriptor);

// create, free, invoke, and synchronize on transfer lists
virtual XferList* CreateXferList(Array_t* dst, Array_t* src,
                                 Size_t* dst_idx, Size_t* src_idx,
                                 Size_t* lengths, int count) = 0;
virtual void FreeXferList(XferList* list) = 0;
virtual XferHandle_t Xfer(XferList* list) = 0;
virtual void WaitXfer(XferHandle_t handle) = 0;

// get number of children in bottom level, get local processor id, and barrier
int GetSiblingCount();
int GetID();
virtual void Barrier(ChildID_t start, ChildID_t stop) = 0;
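And a matching sketch from the child's side of the bottom interface: resolve the parent-level array, pull one strip into local memory, compute, and barrier with siblings. The tile size, the one-contiguous-strip-per-child layout, and the barrier arguments are illustrative assumptions.

// Sketch only: one child's view through the bottom interface.
void childTaskBody(Runtime* rt, ArrayDesc_t desc, Array_t* localTile) {
    Array_t* parentArray = rt->GetArray(desc); // resolve the parent-level array by name

    Size_t tile   = 4096;                      // elements per child (assumed)
    Size_t dstIdx = 0;
    Size_t srcIdx = rt->GetID() * tile;        // this child's strip of the parent array
    Size_t length = tile;

    // One contiguous strip; real transfer lists hold many (dst, src, length) triples.
    XferList* list = rt->CreateXferList(localTile, parentArray,
                                        &dstIdx, &srcIdx, &length, /*count=*/1);
    XferHandle_t h = rt->Xfer(list);           // asynchronous bulk copy into local memory
    rt->WaitXfer(h);                           // block until the data has arrived
    rt->FreeXferList(list);

    // ... leaf computation on localTile ...

    rt->Barrier(0, rt->GetSiblingCount() - 1); // synchronize with sibling children
}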

21 Compiler/Runtime Interaction
Compiler initializes a runtime for each pair of memories in the hierarchy
Initialize runtime for the root memory
– Machine description specifies the runtime to initialize (SMP, Cluster, Disk, Cell, etc.)
If more levels in the hierarchy
– Initialize runtimes for the child levels
Runtime cleanup is the inverse
– Call exit on children, wait, clean up local resources, return to parent
Control of the hierarchy via task calls
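A rough sketch of this bring-up/tear-down order, assuming the Runtime and TaskTable types from the interface slides; MachineDesc, createRuntimeFor, and the INIT/EXIT task ids are invented names standing in for compiler-generated code.

#include <vector>

// Invented for this sketch; the real machine description comes from the machine file.
struct MachineDesc {
    int                      kind;          // which runtime to build: SMP, Cluster, Disk, Cell, ...
    int                      numChildren;
    std::vector<MachineDesc> children;      // next level down, if any
};

Runtime* createRuntimeFor(int kind, const TaskTable& tasks, int numChildren); // e.g. new SMPRuntime(...)

const TaskID_t INIT_TASK_ID = 0;  // placeholder ids for the control tasks
const TaskID_t EXIT_TASK_ID = 1;  // the compiler would generate

Runtime* bringUpLevel(const MachineDesc& level, const TaskTable& tasks) {
    // The machine description names the concrete runtime for this memory pair.
    Runtime* rt = createRuntimeFor(level.kind, tasks, level.numChildren);

    // If the hierarchy has more levels, each child initializes the runtime for
    // its own memory pair from inside its control thread.
    if (!level.children.empty()) {
        TaskHandle_t h = rt->CallChildTask(INIT_TASK_ID, 0, level.numChildren - 1);
        rt->WaitTask(h);
    }
    return rt;
}

void tearDownLevel(Runtime* rt, int numChildren) {
    // Cleanup is the inverse: children exit and clean up first, then this
    // level frees its resources and control returns to the parent.
    TaskHandle_t h = rt->CallChildTask(EXIT_TASK_ID, 0, numChildren - 1);
    rt->WaitTask(h);
    delete rt;
}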

22 Runtime Implementations

23 (figure: the four runtime implementations, each managing one parent/child memory pair)
– SMP Runtime: CPUs under a shared main memory
– Disk Runtime: a CPU and node memory under a disk
– Cell Runtime: PowerPC main memory over the SPE local stores
– Cluster Runtime: per-node memories and CPUs under an aggregate cluster memory

24 SMP Implementation
Pthreads based
– Launch a thread per child
– CallChildTask enqueues work on a per-child task queue
Data transfers
– Memory copy from source to destination
– Optimizations: pass a reference to the parent array; feedback to the compiler to remove transfers
Machine file information
– No processor at parent level
– Processor 0 represents the parent node and a child node
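The per-child task queue can be sketched in a few lines of Pthreads code; this shows the enqueue/worker pattern described above, not the actual SMP runtime source.

#include <pthread.h>
#include <deque>

// Sketch of the per-child work queue in an SMP-style runtime.  The parent's
// CallChildTask enqueues a task id; the child's thread dequeues and runs it.
// A real implementation would also carry argument pointers and completion handles.
struct ChildQueue {
    pthread_mutex_t lock;
    pthread_cond_t  avail;
    std::deque<int> tasks;          // pending task ids
    bool            exiting = false;

    ChildQueue()  { pthread_mutex_init(&lock, nullptr); pthread_cond_init(&avail, nullptr); }
    ~ChildQueue() { pthread_cond_destroy(&avail); pthread_mutex_destroy(&lock); }

    void enqueue(int taskID) {      // called from the parent's control thread
        pthread_mutex_lock(&lock);
        tasks.push_back(taskID);
        pthread_cond_signal(&avail);
        pthread_mutex_unlock(&lock);
    }
};

void* childThread(void* arg) {      // one of these is launched per child at runtime creation
    ChildQueue* q = static_cast<ChildQueue*>(arg);
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->tasks.empty() && !q->exiting)
            pthread_cond_wait(&q->avail, &q->lock);
        if (q->tasks.empty() && q->exiting) { pthread_mutex_unlock(&q->lock); break; }
        int taskID = q->tasks.front();
        q->tasks.pop_front();
        pthread_mutex_unlock(&q->lock);
        // look up taskID in the task table and run it
    }
    return nullptr;
}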

25 Disk Implementation
No processor at top level
– Host CPU represents the parent node and a child node
Implementation
– Allocation: open a file on disk
– Data transfers: use the async I/O API to read/write data to disk
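A small sketch of how a disk-level transfer and its matching wait could map onto POSIX asynchronous I/O; treating one transfer as a single contiguous read, and the function names, are simplifying assumptions.

#include <aio.h>
#include <cerrno>
#include <cstring>
#include <signal.h>

// Sketch: an asynchronous read of `bytes` bytes at `fileOffset` from an array
// that lives as a file on disk into a buffer in main memory.  The caller owns
// `cb` and keeps it alive until waitDiskXfer returns; a real runtime would
// issue one request per entry of the transfer list.
void startDiskRead(aiocb* cb, int fd, void* dst, size_t bytes, off_t fileOffset) {
    std::memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;                        // file opened when the array was allocated
    cb->aio_buf    = dst;
    cb->aio_nbytes = bytes;
    cb->aio_offset = fileOffset;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE; // we poll; no completion signal
    aio_read(cb);                               // returns immediately, I/O runs in the background
}

void waitDiskXfer(aiocb* cb) {
    const aiocb* list[1] = { cb };
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, nullptr);          // sleep until the request completes
    aio_return(cb);                             // reap the completed request
}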

26 Cell Implementation
Overlay handling
– At runtime creation, load the overlay loader onto the SPEs
– On task call: the PowerPC notifies the SPE of the function to load; the SPE loads the overlay and executes it
Data alignment
– All data allocated on 128-byte boundaries
– Multi-dimensional arrays padded to maintain alignment across dimensions
Heavy use of mailboxes
– SPE synchronization
– PPE/SPE communication
Data transfers
– DMA lists
– DMA commands
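The 128-byte alignment and padding rule is easy to show in code; here is a small sketch of the padding arithmetic an allocator might use (names are illustrative, not the Cell runtime's code).

#include <cstddef>

// Round byte counts up to the Cell's 128-byte DMA alignment.
constexpr std::size_t kCellAlign = 128;

constexpr std::size_t alignUp(std::size_t bytes) {
    return (bytes + kCellAlign - 1) & ~(kCellAlign - 1);
}

// Pad each row of a rows x cols array so every row starts on a 128-byte
// boundary; DMA lists can then transfer whole rows without re-alignment.
std::size_t paddedRowPitch(std::size_t cols, std::size_t elemSize) {
    return alignUp(cols * elemSize);
}

std::size_t paddedArrayBytes(std::size_t rows, std::size_t cols, std::size_t elemSize) {
    return rows * paddedRowPitch(cols, elemSize);
}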

27 Cluster Implementation
No processor at top level
– Node 0 represents the parent node and a child node
Virtual level
– Distributed Shared Memory (DSM)? Many software and hardware solutions: IVY (Li et al. 1988), Mether (Minnich et al. 1991), …, Alewife (Agarwal et al. 1991), FLASH (Kuskin et al. 1994), …, with complex coherence and consistency mechanisms
– Overkill for our needs: no sharing between sibling processors means no coherence requirements, and Sequoia enforces consistency by disallowing aliasing on writes to parent memory

28 Cluster Implementation, cont.
Virtual level implementation
– Interval trees
  Represent an array as covering a range of 0 to N elements
  Each node can own any sub-portion (interval) of that range
  The tree structure allows fast lookup of all nodes that cover the interval of interest
  Allows complex data distributions
  See CLRS Section 14.3 for more detailed information
– Array allocation
  Define the distribution as an interval tree and broadcast it
  Allocate data for the local intervals and register the data pointers with MPI-2 (MPI_Win_create)
  Align data to page boundaries for fast network transfers
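A sketch of the allocation step on one cluster node, assuming the interval-tree lookup is handled elsewhere and this node already knows which sub-range it owns; the 4096-byte page size and the variable names are assumptions.

#include <mpi.h>
#include <cstdlib>

// Sketch: one cluster node allocates the sub-range (interval) of a distributed
// array that it owns and exposes it for one-sided access.  myLo/myHi stand in
// for this node's interval from the broadcast distribution.
MPI_Win exposeLocalInterval(float** localBase, long myLo, long myHi) {
    size_t bytes = (myHi - myLo) * sizeof(float);

    void* base = nullptr;
    posix_memalign(&base, 4096, bytes);   // page alignment helps fast RDMA transfers
    *localBase = static_cast<float*>(base);

    // Register the local piece; other nodes address it with MPI_Get / MPI_Put.
    MPI_Win win;
    MPI_Win_create(base, bytes, sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;
}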

29 Cluster Implementation, cont.
Virtual level implementation
– Data transfer
  Compare the requested data range against the interval tree
  Read from parent: issue MPI_Get to any nodes with matching intervals (MPI_LOCK_SHARED)
  Write to parent: issue MPI_Put to all nodes with matching intervals (MPI_LOCK_EXCLUSIVE)
– Optimizations
  If the requested range is local, return a reference to parent memory
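And a sketch of the two transfer directions with MPI-2 one-sided operations, using the lock modes named above; the owner rank and offset are assumed to come from the interval-tree lookup.

#include <mpi.h>

// Sketch: read one matching interval from the node that owns it.  A shared lock
// suffices because reads from parent memory never conflict with each other.
void readFromParent(float* dst, int count, int ownerRank, MPI_Aint ownerOffset, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_SHARED, ownerRank, 0, win);
    MPI_Get(dst, count, MPI_FLOAT, ownerRank, ownerOffset, count, MPI_FLOAT, win);
    MPI_Win_unlock(ownerRank, win);       // the Get is guaranteed complete after the unlock
}

// Write one interval back; an exclusive lock on the owner's window, with correctness
// otherwise resting on Sequoia's rule against aliased writes to parent memory.
void writeToParent(float* src, int count, int ownerRank, MPI_Aint ownerOffset, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, ownerRank, 0, win);
    MPI_Put(src, count, MPI_FLOAT, ownerRank, ownerOffset, count, MPI_FLOAT, win);
    MPI_Win_unlock(ownerRank, win);
}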

30 Runtime Composition
(figure: composed hierarchies and the runtime used at each level)
– Disk + Playstation 3: Disk Runtime over the Cell Runtime (disk, main memory, SPE local stores)
– Cluster of Playstation 3s: Cluster Runtime over Cell Runtimes (aggregate cluster memory, per-PS3 main memories, SPE local stores)
– Cluster of SMPs: Cluster Runtime over SMP Runtimes (aggregate cluster memory, node memories, CPUs)

31 Evaluation

32 Evaluation Metrics
Can we run unmodified Sequoia code on all these systems?
How much overhead do we have in our abstraction?
How well can we utilize machine resources?
– Computation
– Bandwidth
Is our application performance competitive with best available implementations?
Can we effectively compose runtimes for more complex systems?

33 Sequoia Benchmarks
Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM benchmarks
Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)
FFT3D: complex single-precision FFT
Gravity: 100 time steps of an N-body (N²) stellar dynamics simulation, single precision
HMMER: fuzzy protein string matching using HMM evaluation (Horn et al. SC2005)
Best available implementations used as leaf tasks

34 Single Runtime System Configurations
Scalar
– 2.4 GHz Intel Pentium4 Xeon, 1GB
8-way SMP
– 4 dual-core 2.66GHz Intel P4 Xeons, 8GB
Disk
– 2.4 GHz Intel P4, 160GB disk, ~50MB/s from disk
Cluster
– 16 Intel 2.4GHz P4 Xeons, 1GB/node, Infiniband interconnect (780MB/s)
Cell
– 3.2 GHz IBM Cell blade (1 Cell, 8 SPEs), 1GB
PS3
– 3.2 GHz Cell in Sony Playstation 3 (6 SPEs), 256MB (160MB usable)

35 System Utilization
(chart: percentage of runtime, 0–100%, for SAXPY, SGEMV, SGEMM, CONV2D, FFT3D, GRAVITY, and HMMER on the SMP, Disk, Cluster, Cell, and PS3 configurations)

36 Resource Utilization – IBM Cell
(chart: bandwidth utilization and compute utilization, 0–100%, per benchmark)

37 Single Runtime Configurations – GFlop/s
          Scalar   SMP   Disk    Cluster   Cell   PS3
SAXPY     0.3      0.7   0.007   4.9       3.5    3.1
SGEMV     1.1      1.7   0.04    12               10
SGEMM     6.9      45    5.5     91        119    94
CONV2D    1.9      7.8   0.6     24        85     62
FFT3D     0.7      3.9   0.05    5.5       54     31
GRAVITY   4.8      40    3.7     68        97     71
HMMER     0.9      11    0.9     12        12     7.1

38 SGEMM Performance
Cluster
– Intel Cluster MKL: 101 GFlop/s
– Sequoia: 91 GFlop/s
SMP
– Intel MKL: 44 GFlop/s
– Sequoia: 45 GFlop/s

39 FFT3D Performance
Cell
– Mercury: 58 GFlop/s
– FFTW 3.2 alpha 2: 35 GFlop/s
– Sequoia: 54 GFlop/s
Cluster
– FFTW 3.2 alpha 2: 5.3 GFlop/s
– Sequoia: 5.5 GFlop/s
SMP
– FFTW 3.2 alpha 2: 4.2 GFlop/s
– Sequoia: 3.9 GFlop/s

40 Best Known Implementations
HMMer
– ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
– Sequoia Cell: 12 GFlop/s
– Sequoia SMP: 11 GFlop/s
Gravity
– Grape-6A: 2 billion interactions/s (Fukushige et al. 2005)
– Sequoia Cell: 4 billion interactions/s
– Sequoia PS3: 3 billion interactions/s

41 Multi-Runtime System Configurations
Cluster of SMPs
– Four 2-way, 3.16GHz Intel Pentium 4 Xeons connected via GigE (80MB/s peak)
Disk + PS3
– Sony Playstation 3 bringing data from disk (~30MB/s)
Cluster of PS3s
– Two Sony Playstation 3s connected via GigE (60MB/s peak)

42 Multi-Runtime Utilization
(chart: percentage of runtime, 0–100%, per benchmark on the cluster of SMPs, Disk + PS3, and cluster of PS3s)

43 Cluster of PS3 Issues
(chart: the same per-benchmark utilization data as the previous slide, with the cluster of PS3s results called out)

44 Cluster of PS3 Issues
(chart: percentage of runtime, 0–100%, for SAXPY and SGEMV on the cluster of PS3s compared with a single PS3)

45 Multi-Runtime Configurations – GFlop/s
          Cluster-SMP   Disk+PS3   PS3 Cluster
SAXPY     1.9           0.004      5.3
SGEMV     4.4           0.014      15
SGEMM     48            3.7        30
CONV2D    4.8           0.48       19
FFT3D     1.1           0.05       0.36
GRAVITY   50            66         119
HMMER     14            8.3        13

46 Conclusion

47 Summary
Uniform runtime interface for multi-level memory hierarchies
– Horizontal portability: SMP, cluster, disk, Cell, PS3
– Complex machines supported through composition: cluster of SMPs, disk + PS3, cluster of PS3s
– Provides mechanism independence for communication and thread management
Efficient abstraction for multiple machines
– Maximize machine resource utilization
– Low overhead
– Competitive performance
Simple interface
– <20 entry points
Code portability
– Unmodified code running on 9 system configurations
Demonstrates viability of the Parallel Memory Hierarchies model

48 Future Work
Higher-level functions in the runtime?
Load balancing?
Running on more complex machines?
Combine with transactional memory?
What machines can be mapped as a tree?

49 Questions?
Acknowledgements: Intel Graduate Fellowship Program, DOE ASC, IBM, LANL

