Presentation transcript:

Large Vector-Field Visualization, Theory and Practice: Large Data and Parallel Visualization
Hank Childs, Lawrence Berkeley National Laboratory / University of California at Davis
October 25, 2010

Outline: Motivation, Parallelization strategies, Master-slave parallelization, Hybrid parallelism.

Outline: Motivation, Parallelization strategies, Master-slave parallelization, Hybrid parallelism.

Supercomputers are generating large data sets that often require parallelized postprocessing. [Image: a 1-billion-element unstructured mesh from a 217-pin reactor cooling simulation; Nek5000 simulation on ¼ of Argonne's BG/P. Image credit: Paul Fischer, using VisIt.]

Communication between “channels” is a key factor in effective cooling.

Particle advection can be used to study communication properties.

This sort of analysis requires many particles to be statistically significant: place thousands of particles in one channel, observe which channels the particles pass through and where they come out (compare with experimental data), and repeat for the other channels. How can we parallelize this process?
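Before turning to parallelization, it helps to see the serial building block. Below is a minimal sketch, not VisIt's or Nek5000's code: the helical field, step size, step count, and seed placement are all illustrative assumptions. It seeds a line of particles in one "channel" and advects each with fourth-order Runge-Kutta.

// Minimal sketch: advect seed particles through a steady vector field with RK4
// and (in a real analysis) record where each one ends up. Field, step size, and
// seed placement below are illustrative assumptions, not values from the talk.
#include <array>
#include <functional>
#include <vector>
#include <cstdio>

using Vec3 = std::array<double, 3>;
using VectorField = std::function<Vec3(const Vec3&)>;

Vec3 add(const Vec3& a, const Vec3& b, double s) {
    return {a[0] + s * b[0], a[1] + s * b[1], a[2] + s * b[2]};
}

// One RK4 step of size h through the field v.
Vec3 rk4Step(const VectorField& v, const Vec3& p, double h) {
    Vec3 k1 = v(p);
    Vec3 k2 = v(add(p, k1, h / 2));
    Vec3 k3 = v(add(p, k2, h / 2));
    Vec3 k4 = v(add(p, k3, h));
    return {p[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            p[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
            p[2] + h / 6 * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2])};
}

int main() {
    // Toy helical field standing in for the reactor's velocity field.
    VectorField v = [](const Vec3& p) { return Vec3{-p[1], p[0], 1.0}; };

    // Seed a line of particles in "channel 0".
    std::vector<Vec3> seeds;
    for (int i = 0; i < 1000; ++i)
        seeds.push_back({0.1, 0.0, i * 0.001});

    for (Vec3 p : seeds) {
        for (int step = 0; step < 10000; ++step)
            p = rk4Step(v, p, 0.01);
        // Here the real analysis would record which channel p exited through.
    }
    std::printf("advected %zu particles\n", seeds.size());
    return 0;
}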

Outline Motivation Parallelization strategies Master-slave parallelization Hybrid parallelism

For embarrassingly parallel algorithms, the most common processing technique is pure parallelism. [Figure: a parallelized visualization data flow network. The parallel simulation code writes pieces of data P0-P9 to disk; processors 0, 1, and 2 each read their own pieces and run their own Read → Process → Render pipeline.] Particle advection is not embarrassingly parallel. So how to parallelize? A: it depends.
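As a point of reference, here is a minimal sketch of pure parallelism with MPI, assuming a simple round-robin assignment of pieces to tasks; the read/process/render helpers are placeholders, not VisIt functions.

// Minimal sketch of pure parallelism: each MPI task owns a fixed subset of the
// pieces on disk and runs the whole Read -> Process -> Render pipeline on them
// with no communication. Piece count and helper names are assumptions.
#include <mpi.h>
#include <cstdio>

void readPiece(int piece) { /* read one piece from disk (placeholder) */ }
void processPiece(int piece) { /* e.g., contour or slice it (placeholder) */ }
void renderPiece(int piece) { /* rasterize the local geometry (placeholder) */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nTasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

    const int nPieces = 10;  // P0..P9 in the slide's figure
    // Round-robin assignment of pieces to tasks; each task works independently.
    for (int piece = rank; piece < nPieces; piece += nTasks) {
        readPiece(piece);
        processPiece(piece);
        renderPiece(piece);
    }
    // A real pipeline would composite the per-task images here.
    std::printf("task %d of %d done\n", rank, nTasks);
    MPI_Finalize();
    return 0;
}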

Particle advection has four dimensions of complexity: data set size vs. seed set distribution vs. seed set size vs. vector field complexity.

Do we need parallel processing? When? How complex?
Data set size? Not enough!
Large #'s of particles?

Parallelization for small data and a large number of particles. This scheme is referred to as parallelizing-over-particles. [Figure: a parallelized visualization data flow network. The simulation code writes a single file; processors 0, 1, and 2 each read that file and run their own Read → Advect → Render pipeline.] GPU-accelerated approaches follow a variant of this model. The key is that the data is small enough that it can fit in memory.
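A minimal sketch of parallelizing-over-particles under that assumption: every task loads the whole (small) data set, and only the seed list is split. The helper names and seed counts are hypothetical.

// Minimal sketch of parallelizing-over-particles: the data set fits in memory,
// so every MPI task loads it in full and only the seeds are divided.
#include <mpi.h>
#include <algorithm>
#include <vector>
#include <cstdio>

struct Particle { double x, y, z; };

void loadWholeDataSet() { /* read the entire (small) vector field (placeholder) */ }
void advect(Particle& p) { /* integrate p through the loaded field (placeholder) */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nTasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

    loadWholeDataSet();  // every task holds the full field in memory

    // This task's contiguous slice of the global seed list.
    const int nSeeds = 100000;
    int perTask = (nSeeds + nTasks - 1) / nTasks;
    int begin = rank * perTask;
    int end = std::min(nSeeds, begin + perTask);

    std::vector<Particle> mine;
    for (int i = begin; i < end; ++i)
        mine.push_back({0.0, 0.0, i * 1e-5});  // illustrative seed placement
    for (Particle& p : mine)
        advect(p);  // no communication needed: all data is local

    std::printf("task %d advected %zu particles\n", rank, mine.size());
    MPI_Finalize();
    return 0;
}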

Do we need advanced parallelization techniques? When?
Data set size? Not enough!
Large #'s of particles? Need to parallelize, but embarrassingly parallel OK.
Large #'s of particles + large data set sizes?

Parallelization for large data with good “distribution”. This scheme is referred to as parallelizing-over-data. [Figure: a parallelized visualization data flow network. The parallel simulation code writes pieces of data P0-P9 to disk; processors 0, 1, and 2 each read their own subset of the pieces and run their own Read → Advect → Render pipeline.]
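A minimal sketch of parallelizing-over-data, with an assumed round-robin block-to-task mapping and placeholder advection logic; the receive side and the global termination detection that a real implementation needs are omitted.

// Minimal sketch of parallelizing-over-data: each MPI task loads a fixed set of
// blocks and advects whatever particles are currently inside them; a particle
// that exits the task's blocks is shipped to the task owning the block it entered.
#include <mpi.h>
#include <vector>
#include <cstdio>

struct Particle { double x, y, z; int block; };

const int nBlocks = 10;
int ownerOfBlock(int block, int nTasks) { return block % nTasks; }

void loadBlock(int block) { /* read one block of the field (placeholder) */ }
// Advance p until it terminates or leaves the local blocks; returns the block it
// entered, or -1 if it terminated (placeholder logic).
int advectUntilExit(Particle& p) { return -1; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nTasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

    for (int b = 0; b < nBlocks; ++b)
        if (ownerOfBlock(b, nTasks) == rank)
            loadBlock(b);

    std::vector<Particle> active = {};  // seeds that start in my blocks (placeholder)
    while (!active.empty()) {           // a real code also needs a global
        Particle p = active.back();     // termination check across all tasks
        active.pop_back();
        int nextBlock = advectUntilExit(p);
        if (nextBlock >= 0) {
            p.block = nextBlock;
            // Ship the particle to whichever task owns the block it entered.
            MPI_Send(&p, sizeof(Particle), MPI_BYTE,
                     ownerOfBlock(nextBlock, nTasks), 0, MPI_COMM_WORLD);
        }
    }
    // The matching MPI_Recv loop and termination detection are omitted for brevity.
    MPI_Finalize();
    return 0;
}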

Do we need advanced parallelization techniques? When?
Data set size? Not enough!
Large #'s of particles? Need to parallelize, but embarrassingly parallel OK.
Large #'s of particles + large data set sizes? Need to parallelize, simple schemes may be OK.
Large #'s of particles + large data set sizes + (bad distribution OR complex vector field)? Need a smart algorithm for parallelization.

Parallelization with big data & lots of seed points & bad distribution. Two extremes: partition data over processors and pass particles amongst processors (parallel inefficiency!), or partition seed points over processors and process the necessary data for advection (redundant I/O!). Hybrid algorithms sit between parallelize-over-particles and parallelize-over-data. [Table: parallelizing over data gives good I/O but bad efficiency; parallelizing over particles gives bad I/O but good efficiency. Figure: notional streamline example over blocks P0-P4.]

Outline: Motivation, Parallelization strategies, Master-slave parallelization, Hybrid parallelism.

The master-slave algorithm is an example of a hybrid technique (“Scalable Computation of Streamlines on Very Large Datasets”, Pugmire, Childs, Garth, Ahern, Weber, SC09; many of the following slides are courtesy of Dave Pugmire). The algorithm adapts during runtime to avoid the pitfalls of parallelize-over-data and parallelize-over-particles, a nice property for production visualization tools. It is implemented inside the VisIt visualization and analysis package.

Master-Slave Hybrid Algorithm: divide processors into groups of N and uniformly distribute seed points to each group. [Figure: processors P0-P15 divided into groups, each with one master and several slaves.] Master: monitors the workload and makes decisions to optimize resource utilization. Slaves: respond to commands from the master and report status when work is complete.
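A minimal sketch of how the groups might be set up with MPI, assuming groups of four and taking the lowest rank in each group as the master; the group size and rank-to-role mapping are assumptions, not details from the paper.

// Minimal sketch of the master/slave partitioning: tasks are split into groups
// of N, and the lowest rank in each group acts as that group's master.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nTasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

    const int groupSize = 4;           // "N" from the slide (assumed value)
    int group = rank / groupSize;      // which group this task belongs to
    bool isMaster = (rank % groupSize) == 0;

    // Communicator containing only the tasks in my group; the master uses it to
    // monitor workload and issue commands, the slaves to report status.
    MPI_Comm groupComm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &groupComm);

    std::printf("task %d: group %d, role %s\n", rank, group,
                isMaster ? "master" : "slave");

    MPI_Comm_free(&groupComm);
    MPI_Finalize();
    return 0;
}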

Master Process Pseudocode:

{
  while ( ! done )
    if ( NewStatusFromAnySlave() )
      commands = DetermineMostEfficientCommand()
      for cmd in commands
        SendCommandToSlaves( cmd )
}

What are the possible commands?

Commands that can be issued by the master (OOB = out of bounds): Assign / Loaded Block, Assign / Unloaded Block, Handle OOB / Load, Handle OOB / Send. Assign / Loaded Block: the slave is given a streamline contained in a block that is already loaded.

Assign / Unloaded Block: the slave is given a streamline and loads the block that contains it.

Handle OOB / Load: the slave is instructed to load a block; the streamline in that block can then be computed.

Handle OOB / Send: the slave is instructed to send a streamline to another slave (slave J) that has already loaded the block.

Master Process Pseudocode:

{
  while ( ! done )
    if ( NewStatusFromAnySlave() )
      commands = DetermineMostEfficientCommand()
      for cmd in commands
        SendCommandToSlaves( cmd )
}

* See the SC09 paper for details.
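A sketch of how that pseudocode might be fleshed out with MPI. The message formats and tags are assumptions, and DetermineMostEfficientCommand (the heart of the SC09 algorithm) is only stubbed out; the slave side merely reports completion so the program actually runs.

// Sketch of the master loop with MPI. Status/Command layouts and tags are
// assumed; the paper's decision logic (assign loaded/unloaded block, handle
// out-of-bounds by load or by send) is represented by a stub.
#include <mpi.h>
#include <vector>
#include <cstdio>

struct Status  { int slave; int particlesLeft; int blockWanted; };
struct Command { int type; int slave; int block; };

// Placeholder for the SC09 decision logic; here it issues no commands.
std::vector<Command> determineMostEfficientCommands(const Status&) { return {}; }

void masterLoop(int nSlaves) {
    int finished = 0;
    while (finished < nSlaves) {
        Status s;
        MPI_Status st;
        // Block until any slave reports status.
        MPI_Recv(&s, sizeof(Status), MPI_BYTE, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &st);
        if (s.particlesLeft == 0)
            ++finished;
        // Decide the cheapest way to make progress and tell the slaves.
        for (const Command& c : determineMostEfficientCommands(s))
            MPI_Send(&c, sizeof(Command), MPI_BYTE, c.slave, 2, MPI_COMM_WORLD);
    }
}

void slaveLoop(int rank) {
    // A real slave would advect particles and obey commands; here it reports done.
    Status s{rank, 0, -1};
    MPI_Send(&s, sizeof(Status), MPI_BYTE, 0, 1, MPI_COMM_WORLD);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nTasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);
    if (rank == 0) masterLoop(nTasks - 1);
    else           slaveLoop(rank);
    std::printf("rank %d exiting\n", rank);
    MPI_Finalize();
    return 0;
}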

Master-slave in action (notional streamline example, processors P0-P4). Iteration 0: P0 reads B0, P3 reads B1. Iteration 1: P1 passes points to P0, P4 passes points to P3, P2 reads B0. The open questions: when to pass and when to read? How to coordinate communication and status, and do it efficiently?

Algorithm test cases: core collapse supernova simulation, magnetic confinement fusion simulation, hydraulic flow simulation.

Workload distribution in the supernova simulation, parallelized by particles, by data, and by the hybrid scheme: an interesting way to look at workload distribution, with each streamline colored by the processor doing the integration.

Workload distribution in parallelize-over-particles: too much I/O. Each streamline is computed by a single processor, leading to lots of redundant I/O.

Workload distribution in parallelize-over-data: starvation, with 3 processors doing all the work!

Workload distribution in the hybrid algorithm: just right, with a good mix of processors working on different parts of the problem.

Comparison of workload distribution (particles, data, hybrid): 3 different pictures, 3 different computational resource requirements.

Astrophysics test case: total time to compute 20,000 streamlines (256M cells, 512 blocks). [Charts: seconds vs. number of procs for particles, data, and hybrid, with uniform and non-uniform seeding.] Comparison: the hybrid algorithm outperforms both traditional methods.

Astrophysics test case: number of blocks loaded. [Charts: blocks loaded vs. number of procs for particles, data, and hybrid, with uniform and non-uniform seeding.] Parallelize-over-seeds: lots of redundant I/O. Parallelize-over-blocks: very low I/O. The hybrid algorithm can adapt to do the right amount of I/O.

Outline: Motivation, Parallelization strategies, Master-slave parallelization, Hybrid parallelism.

Are today’s algorithms going to fit well on tomorrow’s machines? Traditional approach for parallel visualization – 1 core per MPI task – may not work well on future supercomputers, which will have 100-1000 cores per node. Exascale machines will likely have O(1M) nodes … and we anticipate in situ particle advection. Hybrid parallelism blends distributed- and shared-memory parallelism concepts.

The word “hybrid” is being used in two contexts. The master-slave algorithm is a hybrid algorithm, sharing concepts from both parallelization-over-data and parallelization-over-seeds. Hybrid parallelism involves using a mix of shared and distributed memory techniques, e.g., MPI + pthreads or MPI + CUDA. One could think about implementing a hybrid particle advection algorithm in a hybrid parallel setting.
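A minimal sketch of the second sense, hybrid parallelism in the MPI + threads style, using std::thread in place of the pthreads mentioned on the slide; the thread count and the division of work are assumptions.

// Minimal sketch of MPI + threads hybrid parallelism: one MPI task per node,
// several worker threads per task. Thread count and work split are assumed.
#include <mpi.h>
#include <thread>
#include <vector>
#include <cstdio>

void advectChunk(int rank, int thread) {
    // Each thread advects its share of the node's particles (placeholder).
}

int main(int argc, char** argv) {
    int provided;
    // Ask for a threading level that lets worker threads coexist with MPI.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int threadsPerTask = 4;  // e.g., 32 MPI tasks x 4 cores in the study
    std::vector<std::thread> workers;
    for (int t = 0; t < threadsPerTask; ++t)
        workers.emplace_back(advectChunk, rank, t);
    for (std::thread& w : workers)
        w.join();

    std::printf("task %d finished with %d threads\n", rank, threadsPerTask);
    MPI_Finalize();
    return 0;
}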

What can we learn from a hybrid parallel study? How do we implement parallel particle advection algorithms in a hybrid parallel setting? How do they perform? Which algorithms perform better? How much better? Why? (“Streamline Integration Using MPI-Hybrid Parallelism on a Large Multi-Core Architecture” by David Camp, Christoph Garth, Hank Childs, Dave Pugmire, and Ken Joy; accepted to TVCG.)

Streamline integration using MPI-hybrid parallelism on a large multi-core architecture: implement parallelize-over-data and parallelize-over-particles in a hybrid parallel setting (the master-slave algorithm was not studied). Run a series of tests on the NERSC Franklin machine (a Cray), comparing 128 MPI tasks (non-hybrid) vs. 32 MPI tasks with 4 cores per task (hybrid). 12 test cases: large vs. small number of seeds, uniform vs. non-uniform seed locations, 3 data sets.

Hybrid parallelism for parallelize-over-data. Expected benefits: less communication and fewer communicators; should be able to avoid starvation by sharing data within a group.

Measuring the benefits of hybrid parallelism for parallelize-over-data

Gantt chart for parallelize-over-data

Hybrid parallelism for parallelize-over-particles. Expected benefits: blocks only need to be read once per node, instead of once per core; a larger cache allows for reduced reads; “long” paths are automatically shared among the cores on a node.
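One way such sharing could look (not necessarily the paper's implementation): a mutex-protected particle queue per node that all cores drain, so a few long streamlines do not leave the other cores idle. The queue, counts, and advect stub are assumptions.

// Minimal sketch of sharing work among cores on a node: the node loads its
// blocks once, and the cores pull particles from one shared queue.
#include <mutex>
#include <optional>
#include <thread>
#include <vector>
#include <cstdio>

struct Particle { double x, y, z; };

class WorkQueue {
public:
    void push(const Particle& p) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(p);
    }
    std::optional<Particle> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return std::nullopt;
        Particle p = items_.back();
        items_.pop_back();
        return p;
    }
private:
    std::mutex m_;
    std::vector<Particle> items_;
};

void advect(const Particle&) { /* integrate through the node's blocks (placeholder) */ }

int main() {
    WorkQueue queue;                       // one queue per node, filled once
    for (int i = 0; i < 10000; ++i)
        queue.push({0.0, 0.0, i * 1e-4});

    auto worker = [&queue] {               // each core drains the shared queue
        while (auto p = queue.pop())
            advect(*p);
    };
    std::vector<std::thread> cores;
    for (int c = 0; c < 4; ++c)            // 4 cores per node, as in the study
        cores.emplace_back(worker);
    for (std::thread& c : cores)
        c.join();
    std::printf("all particles advected\n");
    return 0;
}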

Measuring the benefits of hybrid parallelism for parallelize-over-particles

Gantt chart for parallelize-over-particles

Summary of the hybrid parallelism study: hybrid parallelism appears to be extremely beneficial to particle advection. We believe the parallelize-over-data results are highly relevant to the in situ use case. Although we didn't implement the master-slave algorithm, we believe the benefits shown at the spectrum extremes provide good evidence that hybrid algorithms will also benefit. The code is implemented on a VisIt branch, with the goal of integrating it into VisIt proper in the coming months.

Summary for large data and parallelization: the type of parallelization required will vary based on data set size, number of seeds, seed locations, and vector field complexity. Parallelization may occur via parallelize-over-data, parallelize-over-particles, or somewhere in between (master-slave). Hybrid algorithms have the opportunity to mitigate the pitfalls of the traditional techniques. Hybrid parallelism appears to be very beneficial.

Thank you for your attention! Acknowledgments: DoE SciDAC program, VACET. Participants: Hank Childs (hchilds@lbl.gov), Dave Pugmire (ORNL), Dave Camp (LBNL/UCD), Christoph Garth (UCD), Sean Ahern (ORNL), Gunther Weber (LBNL), and Ken Joy (UCD). Questions?