June 11, 2002 SS-SQ-W: 1 Stanford Streaming Supercomputer (SSS) Spring Quarter Wrapup Meeting Bill Dally, Computer Systems Laboratory Stanford University.



SS-SQ-W: 2 Review – What is the SSS Project About?

Exploit streams to give a 100x improvement in performance/cost for scientific applications vs. 'cluster' supercomputers
– From 100 GFLOPS PCs to TFLOPS single-board computers to PFLOPS supercomputers
Use a layered programming system to simplify development and tuning of applications
– Application-specific frameworks and libraries
– Stream languages
– Streaming virtual machine
Demonstrate the feasibility of the above in year 1
– Run real applications on simulated hardware
– Identify bottlenecks
Build a prototype and demonstrate CITS applications in years 2-6

SS-SQ-W: 3 Architecture of SSS

SS-SQ-W: 4 A layered software system simplifies stream programming

SS-SQ-W: 5 The big picture

VLSI technology enables us to put TeraOPS on a chip
– Conventional general-purpose architectures cannot exploit this
– The problem is bandwidth
Streams expose locality and concurrency
– Perform operations in record order (not operation order, as with vectors)
– Enables compiler optimization at a larger scale than scalar processing
A stream architecture achieves high arithmetic intensity
– Intensity = arithmetic rate / bandwidth
– Bandwidth hierarchy, compound stream operations
A streaming supercomputer is feasible
– 100 GFLOPS (64-bit) on a chip, a 1 TFLOPS single-board computer, PFLOPS systems
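The intensity metric above can be made concrete. A minimal sketch, with illustrative numbers that are not SSS measurements, of why compound stream operations raise arithmetic intensity:

```python
# Arithmetic intensity = arithmetic rate / bandwidth, i.e. FLOPs performed
# per byte moved off-chip. All numbers below are illustrative examples,
# not figures from the SSS project.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of off-chip traffic for one record."""
    return flops / bytes_moved

# A scalar SAXPY (y = a*x + y) does 2 FLOPs per element but moves
# 12 bytes (read x, read y, write y; 4-byte floats).
saxpy = arithmetic_intensity(flops=2, bytes_moved=12)

# A compound stream operation keeps intermediates in the local register
# file, so many FLOPs execute per element fetched; the bottleneck shifts
# from memory bandwidth to arithmetic.
compound = arithmetic_intensity(flops=50, bytes_moved=12)

print(saxpy, compound)
```

The point of the bandwidth hierarchy is exactly this ratio: the more work done per byte fetched, the less a bandwidth-limited machine is starved.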

SS-SQ-W: 6 Outreach

TST meeting in May
– SSS project well received
Sierra visit April 30
Outreach plans for summer
– DOE headquarters
– Labs
Industrial partners
– Intel, IBM, Sun, HP, Cray, Nvidia
– Start visits this fall
Other application areas

SS-SQ-W: 7 EE482C – Streaming Architecture: 11 Class Projects
– Irregular streams (2) – caches and SRF indexing
– Aspect ratio
– Compiling Brook
– Streams on legacy architectures
– Mapping to multiple nodes
– Communication scheduling
– Stencils
– Vectors
– Cellular automata
– Viterbi decoding

SS-SQ-W: 8 Three Major Thrusts

Software
– Brook language
– Virtual machines
– 'Compilers' to map Brook to the VM, and the VM to streaming and other hardware
– OS/run-time system
Hardware
– Specification: stream caching and support for multi-dimensional streams; 'aspect ratio' (thread vs. data parallelism); global mechanisms and the memory system
– Simulation: SSS simulator; prototyping on Imagine
Applications
– Fluids: StreamFLO, model PDEs
– Molecular dynamics
– Microbenchmarks/stress tests
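The programming model these thrusts target can be sketched in a few lines. Below is a hypothetical Python stand-in for a Brook-style computation, not actual Brook syntax: a kernel is a pure function mapped record-by-record over whole streams, which is what exposes the parallelism and locality to the compiler.

```python
# Python stand-in for the stream programming model: a stream is an
# ordered sequence of records, and a kernel is a side-effect-free
# function applied to corresponding records of the input streams.
# This is an illustration, not real Brook code.

def stream_map(kernel, *streams):
    """Apply a kernel record-wise across parallel input streams."""
    return [kernel(*records) for records in zip(*streams)]

def saxpy_kernel(a, x, y):
    # One record's worth of work; no access to other records.
    return a * x + y

xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]
out = stream_map(lambda x, y: saxpy_kernel(2.0, x, y), xs, ys)
print(out)  # [12.0, 24.0, 36.0]
```

Because the kernel can touch only its own records, every record can in principle run in parallel, and the compiler is free to schedule the whole stream onto the hardware.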

SS-SQ-W: 9 Software Goals for SQ02

Accomplishments
– Metacompiler parses Brook
– Multidimensional features in Brook
– Apps coded in Brook
– Central source repository
– Mapping analyses enumerated; mapped the interaction kernel of StreamMD
Overall
– End-to-end (Brook -> SVM -> SSS) demonstration [all]
– Put a release process in place
Brook
– Feature lock: all features needed for two apps [Ian]
– Hints [Mattan]
Metacompilation
– Compile Brook to SVM [Ben C.]
SVM
– SVM specification, prototype C implementation; develop and run test suite [Francois]
– Instrumented version of SVM [Francois]
Mapping
– StreamMD running on Imagine [Mattan]
– Enumerate known algorithms and research problems [Mattan]
– Implement minimum mapping tool [Mattan]

SS-SQ-W: 10 Software Goals for Summer 02

SQ accomplishments
– Brook to StreamC (manual to KernelC) runs on Imagine (unoptimized, subset)
– Version 2 SVM specification
– Brook features much closer
– Metacompilation of Brook to BRT and StreamC
– First draft of the compilation document
Summer goals
– Brook: bug fixes and changes to facilitate compilation [Ian, Mattan]
– SVM: specification, simulator, and run-time [Francois]
– Compilation: identify a framework permitting analysis [Mattan]; translate to SVM [Mattan]; compile kernels [Jayanth]; see Mattan's
– Run-time: scalar-processor multi-node support [Mattan]
– Memory management, etc.
Issues
– Critical path: SVM implementation
– Build a long-term compiler framework
– Leverage Imagine compilation techniques
– Run-time system

SS-SQ-W: 11 Hardware Goals for SQ02

Accomplishments
– Completed strawman architecture
– Initial bandwidth analysis of StreamMD
Architecture
– Reconcile multidimensional language features with the architecture [Tim]
Simulator
– Define the simulation results needed for October [all]
– Single-node simulator [Ben S.]
– Multi-node definition and simulator [Jung Ho]
Apps on simulator
– Map StreamFLO and StreamMD to SSS and analyze bandwidth [Mattan]
Point studies
– Aspect ratio (TP vs. ILP vs. DP) [Ben], conditionals [Ujval], stream caching [Tim], global mechanisms [Mattan], ...

SS-SQ-W: 12 Hardware Goals for Summer 02

SQ accomplishments
– Revised strawman
– Ran key StreamMD kernels on Imagine
– Cache study, indexable SRF studies
Summer goals
– Architecture specification: fix bugs [all]; support for multi-node scalar architecture [Mattan]
– Simulator: modify the Imagine simulator to match the strawman [Jung Ho]
– Application studies: run StreamMD and StreamFLO on the strawman simulator [Mattan]
– Point studies: conditional study, aspect-ratio study
Issues
– Coherency
– Finalize cache/SRF architecture
– Finalize remote ops
– Support for reductions across nodes
– Scalar architecture – multi-node

SS-SQ-W: 13 Application Goals for SQ02

Accomplishments
– FFT microbenchmark
Solvers
– Incompressible fluid flow running (all 3 PDE types) [Eran]
Hints
– Application hints into Brook [all]
Microbenchmarks
– PCA [Ian, Anand]

SS-SQ-W: 14 Application Goals for Summer 02

SQ accomplishments
– 2 PDE types completed – smoke movie
– Ungridded StreamMD
– StreamFLO underway
Summer goals
– Finite-element "miniapp" [Tim]
– Investigate Sierra [Tim]
– Sparse and dense stress codes: M*V [Tim]
– Complete and run StreamFLO [Fatica, Ian]
– Complete and run gridded StreamMD [Eric, Ian]
– Run StreamFLO and gridded StreamMD on the simulators and collect numbers [Mattan]
Issues
– Follow up with Yates (LLNL)
– Sweep3D – Sn radiation transport
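To make the StreamMD mapping work concrete, here is a toy sketch of the kind of pairwise interaction kernel a molecular-dynamics code streams over its neighbor pairs. It is illustrative only; the function names, the Lennard-Jones form, and the constants are assumptions for this sketch, not taken from the actual StreamMD code.

```python
# Toy pairwise-interaction kernel of the sort an MD code streams over
# particle-pair records. Illustrative only; not StreamMD's real code.

def lj_force(r, epsilon=1.0, sigma=1.0):
    """Magnitude of the Lennard-Jones force at separation r
    (reduced units)."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r

def interaction_kernel(separations):
    """Stream of pair separations in, stream of force magnitudes out."""
    return [lj_force(r) for r in separations]

forces = interaction_kernel([1.0, 1.5, 2.0])
```

Each pair record is processed independently, which is what makes this kernel a natural fit for the stream model and a useful driver for the bandwidth analyses listed above.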

SS-SQ-W: 15 Summer SSS Meetings

We should meet at least every other week over the summer.
Every other Tuesday at 11?
Schedule on the web page soon.