www.vacet.org — Brad Whitlock, October 14, 2009 — Porting VisIt to BG/P


Porting VisIt to BG/P
Brad Whitlock, October 14, 2009

Overview
Objectives
Building 3rd party libraries
Building VisIt
Running VisIt on BG/P
Improvements
Impact
Future work

Objectives
Port VisIt to IBM's Blue Gene/P platform so VisIt can run on LLNL's Dawn and eventually Sequoia
–Dawn is a 500-teraflop, 36,864-node, 147,456-core IBM BG/P system
–4 × 850 MHz PowerPC cores/node, 4 GB memory/node
–Compute nodes run the CNK OS
–Code must be cross-compiled for CNK
Identify weaknesses in VisIt that prevent it from scaling to tens or hundreds of thousands of processors

Building 3rd party libraries
Built all libraries on the login nodes for a regular Linux PowerPC version of VisIt
–Ran into runtime problems with the xlC compiler, so reverted to g++ for the time being
Cross-compiled all libraries for CNK
–None of VisIt's 3rd party libraries supported this platform, so special builds were required
–Mesa was built unmangled and without X11
–VTK was tricky to build
–No OpenGL, so VTK was built with Mesa as its OpenGL implementation
–No X11, so a custom render window was created
–Used a CMake toolchain file
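A CMake toolchain file for this kind of cross-build might look like the sketch below. This is not the file used for VisIt; the platform name and compiler paths are illustrative assumptions for a BG/P-style installation.

```cmake
# Hypothetical BG/P cross-compiling toolchain file (all paths are illustrative).
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR ppc)

# Cross-compilers that target the CNK compute nodes, not the login nodes.
set(CMAKE_C_COMPILER   /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc)
set(CMAKE_CXX_COMPILER /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-g++)

# Resolve libraries and headers in the target sysroot only,
# but still run build-time helper programs from the host.
set(CMAKE_FIND_ROOT_PATH /bgsys/drivers/ppcfloor/gnu-linux/powerpc-bgp-linux)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
```

It would be passed to each library's configure step with `-DCMAKE_TOOLCHAIN_FILE=bgp-toolchain.cmake`, so the same source tree can produce both the login-node and CNK builds.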

Building VisIt
No X11, so graphical components can't be built for CNK (don't build the gui)
Added a new --enable-engine-only build mode to VisIt's build system that builds only the compute engine and its plugins
VisIt always used to require mangled Mesa
–This support had to become conditional on VTK having mangled Mesa support
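An engine-only cross-build invocation might look like the following command sketch. Only the --enable-engine-only flag comes from the slide; the compiler names and other flags are illustrative assumptions.

```shell
# Hypothetical engine-only cross-build for the CNK compute nodes.
env CC=powerpc-bgp-linux-gcc CXX=powerpc-bgp-linux-g++ \
    ./configure --enable-engine-only --enable-parallel --without-x
```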

Running VisIt on Dawn
Dawn uses mpirun to start VisIt on the compute nodes
–Minor differences required environment variables to be exported via the mpirun command, which could be handled via a host profile in VisIt
VisIt ran at 1K, 2K, 4K, 8K, and 16K nodes
VisIt ran with 1- and 4-trillion-zone datasets (June 2009)
Encountered scaling problems early
–Launch time was slow because each processor read the plugin directory to obtain plugin information
–VisIt commands were sent from rank 0 to the other ranks 1 KB at a time until the full message was sent
–The non-spinning bcast substitute used for sending commands relied on point-to-point communication that performed poorly at scale
–Certain metadata consumed too much memory (each processor has only ~700 MB available)
–The synchronization step for SR (scalable rendering) mode used slow point-to-point communication
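Exporting environment variables through mpirun, as stored in a VisIt host profile, might look like the launch sketch below. The install paths, variable names, and node count are illustrative assumptions; the -env flag syntax varies across BG/P mpirun versions.

```shell
# Hypothetical BG/P launch of the parallel compute engine.
mpirun -np 16384 \
       -env VISITHOME=/usr/gapps/visit \
       -env LD_LIBRARY_PATH=/usr/gapps/visit/lib \
       /usr/gapps/visit/bin/engine_par
```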

Improvements
Broadcast plugin information from rank 0 to the other ranks, improving plugin loading time 9x
Broadcast VisIt commands from rank 0 in a single chunk instead of 1 KB at a time
Use the standard bcast in the engine main loop instead of the poorly performing non-spin substitute geared toward shared nodes
Switched to an alternate metadata representation to free up most available memory for calculations
Mark Miller replaced the SR-mode synchronization step with a much faster version, reducing its time from 20 minutes to 2 seconds
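The single-chunk command broadcast can be sketched as follows. This is not VisIt's code: it only illustrates packing a command into one contiguous, length-prefixed buffer so that rank 0 can deliver it with a single collective MPI_Bcast instead of looping over 1 KB point-to-point sends; the function names are hypothetical and the MPI call appears only in comments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Pack a command string into one buffer: a fixed-size length header
 * followed by the payload.  On the real path, rank 0 would broadcast
 * the header size once at startup, then send the whole buffer with a
 * single MPI_Bcast(buf, len, MPI_BYTE, 0, comm). */
static unsigned char *pack_command(const char *cmd, size_t *out_len)
{
    unsigned int n = (unsigned int)strlen(cmd);
    unsigned char *buf = malloc(sizeof n + n);
    memcpy(buf, &n, sizeof n);              /* length header */
    memcpy(buf + sizeof n, cmd, n);         /* payload */
    *out_len = sizeof n + n;
    return buf;
}

/* Unpack on a receiving rank: read the header, copy out the payload. */
static char *unpack_command(const unsigned char *buf)
{
    unsigned int n;
    memcpy(&n, buf, sizeof n);
    char *cmd = malloc(n + 1);
    memcpy(cmd, buf + sizeof n, n);
    cmd[n] = '\0';
    return cmd;
}
```

One buffer means one collective operation per command, so the cost no longer grows with the number of 1 KB fragments.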

Impact
So far this project's impact has been small for customers
–They do not yet run on Dawn
–They might not notice small improvements at today's everyday processor counts (<2K)
At higher processor counts (>4K), the optimizations added by this work prevent bottlenecks in the compute engine, improving scalability

Future work
Resolve the load problems with the xlC compiler so we can use the best optimizations, including BG/P's dual FPUs
Improve the 3rd party library build process for BG/P by adding support to the build_visit script
Continue profiling plots and improving performance
Reduce memory usage where possible
Investigate I/O patterns and attempt optimizations