Extreme Computing’05. Parallel Graph Algorithms: Architectural Demands of Pathological Applications. Bruce Hendrickson, Jonathan Berry, Keith Underwood, Sandia National Labs.

Presentation transcript:

Extreme Computing’05. Parallel Graph Algorithms: Architectural Demands of Pathological Applications. Bruce Hendrickson, Jonathan Berry, Keith Underwood (Sandia National Labs); Richard Murphy (Notre Dame University).

Extreme Computing’05. Confessions of a Message-Passing Snob. Bruce Hendrickson, Sandia National Labs.

Why Graphs?
• Exemplar of memory-intensive application
• Widely applicable and can be very large scale
  » Scientific computing
    – sparse direct solvers
    – preconditioning
    – radiation transport
    – mesh generation
    – computational biology, etc.
  » Informatics
    – data-centric computing
    – encode entities & relationships
    – look for patterns or subgraphs

Characteristics
• Data patterns
  » Moderately structured for scientific applications
    – Even unstructured grids make “nice” graphs
    – Good partitions, lots of locality on multiple scales
  » Highly unstructured for informatics
    – Similar to random, power-law networks
    – Can’t be effectively partitioned
• Algorithm characteristics
  » Typically, follow links of edges
    – Maybe many at once – high level of concurrency
    – Highly memory intensive
      · Random accesses to global memory – small fetches
      · Next access depends on current one
      · Minimal computation
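The access pattern described above is essentially pointer chasing over an adjacency structure. The sketch below is not from the talk; it is plain C++ over an assumed CSR (compressed sparse row) layout, showing why each step is a small, data-dependent fetch with almost no computation to overlap against it.

```cpp
// Illustrative sketch (not from the talk): a CSR graph traversal with the
// access pattern the slide describes. Each step reads a vertex's adjacency
// range, then jumps to arbitrary neighbor indices: small, data-dependent
// fetches with almost no arithmetic in between.
#include <cstdint>
#include <vector>

struct CsrGraph {
    std::vector<uint64_t> row_ptr;  // size |V|+1, offsets into col_idx
    std::vector<uint64_t> col_idx;  // size |E|, concatenated neighbor lists
};

// Count how many neighbors of the frontier are still unvisited.
// The load of visited[w] is the "random access to global memory": its
// address is only known after col_idx[e] itself has been fetched.
uint64_t scan_frontier(const CsrGraph& g,
                       const std::vector<uint64_t>& frontier,
                       const std::vector<uint8_t>& visited) {
    uint64_t unvisited = 0;
    for (uint64_t v : frontier) {
        for (uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
            uint64_t w = g.col_idx[e];       // data-dependent address
            unvisited += (visited[w] == 0);  // minimal computation per fetch
        }
    }
    return unvisited;
}
```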

Shortest Path Illustration [figure slide]

Architectural Challenges
• Runtime is dominated by latency
• Essentially no computation to hide memory costs
• Access pattern is data dependent
  » Prefetching unlikely to help
  » Often only want small part of cache line
• Potentially abysmal locality at all levels of memory hierarchy
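A rough illustration of the "small part of a cache line" point, with assumed numbers rather than measurements from the talk: if each random access needs one 8-byte word but the memory system moves whole 64-byte lines, only about one eighth of the transferred bytes are useful.

```cpp
// Back-of-the-envelope sketch with assumed numbers (not from the talk):
// when only 8 bytes of each 64-byte cache line are used, effective
// bandwidth is ~12.5% of whatever the memory system can actually deliver.
#include <cstdio>

int main() {
    const double line_bytes   = 64.0;  // assumed cache line size
    const double useful_bytes = 8.0;   // e.g. one vertex id per access
    const double peak_gbs     = 10.0;  // assumed peak memory bandwidth (GB/s)

    double utilization = useful_bytes / line_bytes;
    std::printf("cache line utilization: %.1f%%\n", utilization * 100.0);
    std::printf("effective bandwidth: %.2f GB/s of %.1f GB/s peak\n",
                utilization * peak_gbs, peak_gbs);
    return 0;
}
```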

Caching Futility [figure slide]

Larger Blocks are Expensive [figure slide]

Properties Needed for Good Graph Performance
• Low latency / high bandwidth
  » For small messages!
• Latency tolerant
• Light-weight synchronization mechanisms
• Global address space
  » No graph partitioning required
  » Avoid memory-consuming profusion of ghost-nodes
• These describe Burton Smith’s MTA!
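A sketch of what light-weight, fine-grained synchronization buys a graph code. The MTA provides this in hardware through full/empty bits on every memory word; the analogue below is my own illustration, not the authors' code, using a C++11 atomic compare-and-swap so that exactly one thread claims each vertex, with no locks, no partitioning, and no ghost copies.

```cpp
// Illustration (not the authors' code) of per-word synchronization in a
// shared address space: claim a vertex during traversal with a single
// compare-and-swap instead of locks or message exchanges.
#include <atomic>
#include <cstdint>
#include <vector>

// parent[w] == -1 means "not yet visited". Returns true for exactly one
// caller per vertex w, which then records its own vertex as the BFS parent.
bool claim_vertex(std::vector<std::atomic<int64_t>>& parent,
                  uint64_t w, int64_t my_vertex) {
    int64_t expected = -1;
    return parent[w].compare_exchange_strong(expected, my_vertex);
}
```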

MTA Introduction
• Latency tolerance via massive multi-threading
  » Each processor has hardware support for 128 threads
  » Context switch in a single tick
  » Global address space, hashed to reduce hot-spots
  » No cache; context switch on memory request
  » Multiple outstanding loads
• Good match for applications which:
  » Exhibit complex memory access patterns
  » Aren’t computationally intensive (slow clock)
  » Have lots of fine-grained parallelism
• Programming model
  » Serial code with parallelization directives
  » Code is cleaner than MPI, but quite subtle
  » Support for “future”-based parallelism
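A minimal sketch of the "serial code with parallelization directives" model. The directive spelling follows the commonly documented Cray MTA form (#pragma mta assert parallel); treat the pragma, the function, and the data layout as illustrative assumptions rather than a reproduction of the authors' code.

```cpp
// Sketch of MTA-style directive-based parallelization (assumed pragma
// spelling; on other compilers the unknown pragma is simply ignored).
#include <cstdint>

void mark_neighbors(const uint64_t* row_ptr, const uint64_t* col_idx,
                    uint64_t nverts, int* visited) {
    // The loop body is ordinary serial code; the directive asserts that the
    // iterations are independent, and the hardware hides memory latency by
    // interleaving the resulting threads. Writes to visited[] may collide,
    // but all colliding writes store the same value, so the race is benign.
    #pragma mta assert parallel
    for (uint64_t v = 0; v < nverts; ++v) {
        for (uint64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            visited[col_idx[e]] = 1;
        }
    }
}
```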

Case Study – Shortest Path
• Compare codes optimized for different architectures
• Option 1: Distributed Memory CompNets
  » Run on Linux cluster: 3 GHz Xeons, Myrinet network
  » LLNL/SNL collaboration – just for shortest-path finding
  » Finalist for Gordon Bell Prize on BlueGene/L
  » About 1100 lines of C code
• Option 2: MTA parallelization
  » Part of general-purpose graph infrastructure
  » About 400 lines of C++ code
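For concreteness, the kernel both options implement is a single-source shortest path, which on unweighted graphs reduces to breadth-first search. The sketch below is not the authors' code: it is a minimal serial BFS over an assumed CSR layout, showing the loop structure that the MPI and MTA versions each have to parallelize.

```cpp
// Minimal serial BFS computing hop-count shortest paths from one source
// over a CSR graph (illustrative only; not the benchmarked codes).
#include <cstdint>
#include <queue>
#include <vector>

std::vector<int64_t> bfs_distances(const std::vector<uint64_t>& row_ptr,
                                   const std::vector<uint64_t>& col_idx,
                                   uint64_t source) {
    std::vector<int64_t> dist(row_ptr.size() - 1, -1);  // -1 = unreached
    std::queue<uint64_t> frontier;
    dist[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        uint64_t v = frontier.front();
        frontier.pop();
        for (uint64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            uint64_t w = col_idx[e];
            if (dist[w] == -1) {        // first time w is reached
                dist[w] = dist[v] + 1;
                frontier.push(w);
            }
        }
    }
    return dist;
}
```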

Short Paths on Erdős-Rényi Random Graphs (V=32M, E=128M) [figure slide]

Connected Components on MTA-2, Power-Law Graph (V=34M, E=235M) [figure slide: processors vs. time]

Remarks
• Single-processor MTA competitive with current micros, despite 10x clock difference
• Excellent parallel scalability for MTA on range of graph problems
  » Identical to single-processor code
• Eldorado is coming next year
  » Hybrid of MTA & Red Storm
  » Less well balanced, but affordable

Broader Lessons
• Space of important apps is broader than PDE solvers
  » Data-centric applications may be quite different from traditional scientific simulations
• Architectural diversity is important
  » No single architecture can do everything well
• As the memory wall gets steeper, latency tolerance will be essential for more and more applications
• High level of concurrency requires
  » Latency tolerance
  » Fine-grained synchronization