
1 Astrophysics, Biology, Climate, Combustion, Fusion, HEP, Nanoscience — Simulation Scientists, DOE National Laboratories

2 Workflows (5/24/2004 Chicago Meeting, DOE Data Management)
Critical need: enable (and automate) scientific workflows:
–Data generation
–Data storage
–Data transfer
–Data analysis
–Visualization
An order of magnitude more effort can be spent on manually managing these workflows than on performing the simulation itself. Workflows are not static.
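The five workflow components above can be sketched as a chain of callables. Every function name and the toy data below are illustrative stand-ins, not any real DOE tool:

```python
# Illustrative chain of the five workflow components; all names are
# hypothetical stand-ins, not a real DOE tool.

def generate(params):
    # Stand-in for the batch simulation: produce raw samples.
    return [params["amplitude"] * i for i in range(params["steps"])]

def store(data, archive):
    # Stand-in for writing checkpoints to mass storage.
    archive["raw"] = list(data)
    return archive

def transfer(archive):
    # Stand-in for moving a subset to the analysis cluster.
    return archive["raw"][::2]   # e.g. every other time step

def analyze(subset):
    # Stand-in for a reduction such as a volume average.
    return sum(subset) / len(subset)

def visualize(result):
    # Stand-in for rendering; here just a report line.
    return f"mean field = {result:.2f}"

def run_workflow(params):
    archive = store(generate(params), {})
    return visualize(analyze(transfer(archive)))

print(run_workflow({"amplitude": 2.0, "steps": 10}))   # mean field = 8.00
```

In practice each stage is a separate batch job or script rather than an in-process call, which is exactly why the manual hand-offs dominate the effort.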

3 Simulations
Simulations run in batch mode; the remaining workflow is interactive or "on demand." Simulations and analyses are performed by distributed teams of research scientists:
–Need to access remote and distributed data and resources.
–Need for distributed collaborative environments.
We will not present solutions in this talk!
–Some solutions will be problem dependent. Example: remote viz. vs. local viz., Parallel HDF5 vs. parallel netCDF, …

4 How do we do simulation science (I)
Let's suppose that we have a verified HPC code.
–I will use the Gyrokinetic Toroidal Code (GTC) as an example.
We also suppose that we have a suite of analysis and visualization programs.
We eventually want to compare the output of this code to theoretical, experimental, and/or other simulation results.

5 A fast peek at the workflow
[Workflow diagram: thought → HPC → analysis (compute volume averages; tracer particle energy, position, momentum; 1d and 2d radial and velocity profiles; correlation functions; feature tracking of the heat potential) → thought → viz → global analysis tools → TB's of viz, features, metadata, movies → paper.]
Let's go through the scientific process requirements:
–Data generation: 1 TB/sim now (10 TB/year); 100 TB/sim in 5 years (0.5 PB/year).
–Data transfer: 58 Mb/s now, 1.6 Gb/s in 5 years.
–Plus data analysis, data visualization, and data storage.

6 Stage 1: Initial question + thought
The scientist thinks of a problem to answer a physical question. Example:
–What saturates transport driven by the ion temperature gradient?
Requirements:
–Possible changes in the code.
–New visualization routines to examine particles.
–New modifications in analysis tools.
Collaborate with O(5) people: face to face, phone.
[Timeline: question → thought.]

7 Stage 2: Change code, add analysis
If the code is mature, go to stage 4. Else:
–Scientists modify the HPC code to put in new routines for new physics, new capabilities.
–Scientists change the code to answer the question.
–If necessary, analysis/viz routines are added or modified.
–Where do the inputs come from? Experiments, other simulations, theory.
O(5) people modify the code over weeks.
Total output = 1 TB per full run of 40 hours = 58 Mb/s now; in 5 years, 0.1 PB per hero run of 150 hours = 1.6 Gb/s.
[Timeline: thought → question → code modifications; code input, computation, I/O; runtime, 1 TS.]
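The quoted rates follow from total output divided by wall-clock run time. A quick check (assuming decimal units, 1 TB = 10^12 bytes) lands close to the slide's numbers:

```python
# Sanity check of the quoted sustained output rates (decimal units assumed).
def sustained_mbits(total_bytes, hours):
    """Average output rate in megabits per second."""
    return total_bytes * 8 / (hours * 3600) / 1e6

now  = sustained_mbits(1e12, 40)       # 1 TB over a 40-hour run
hero = sustained_mbits(1e14, 150)      # 0.1 PB over a 150-hour hero run
print(f"now:  {now:.0f} Mb/s")         # ~56 Mb/s, near the quoted 58 Mb/s
print(f"hero: {hero / 1000:.1f} Gb/s") # ~1.5 Gb/s, near the quoted 1.6 Gb/s
```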

8 Stage 3: Debugging stage
Scientists modify the HPC code to put in new routines for new physics. Scientists generally run a parameter survey to answer the question(s), changing the code to answer the question. 1 to 2 people debug the code. Verify the code again; regression test.
Total output = 0.1 Mb/s. Viz results are thrown away.
[Timeline, over weeks: thought → code modifications → compute volume average → continue run sequence.]

9 Stage 4: Run production code
Now the scientist has confidence in the modifications. Scientists generally run a parameter survey and/or sensitivity analysis to answer the question(s). Scientists need good analysis and visualization routines.
O(3) people look at the raw data and run analysis programs:
–Filter the data.
–Look for features for the larger group.
O(10) people look at the end viz. and interpret the results.
Analyses: compute volume averages; tracer particle energy, position, momentum; 1d and 2d radial and velocity profiles.
Data rates: scalars 0.01 Mb/s; particles 50 Mb/s; viz scalars 60 Mb/s; 0.5% of 1000 time steps. Data can flow from RAM to RAM/disk/WAN/LAN.
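Two of the reductions named here, a volume average and a 1d radial profile, might look like the following sketch, assuming the output is a 3-D scalar field on a regular grid (the data below is synthetic):

```python
# Sketch of two analysis reductions on a synthetic 3-D scalar field.
import numpy as np

rng = np.random.default_rng(0)
field = rng.normal(size=(32, 32, 32))      # stand-in for one time step

# Volume average: mean over all grid cells.
vol_avg = field.mean()

# 1d radial profile: bin cells by distance from the box centre and
# average the field within each spherical shell.
x, y, z = np.indices(field.shape)
c = (np.array(field.shape) - 1) / 2.0
r = np.sqrt((x - c[0])**2 + (y - c[1])**2 + (z - c[2])**2)
edges = np.linspace(0.0, r.max(), 16)
shell = np.digitize(r.ravel(), edges)      # shell index per cell
profile = np.array([field.ravel()[shell == i].mean()
                    for i in range(1, len(edges))])
print(vol_avg, profile.shape)
```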

10 Stage 4a: Data management observations
We must understand:
1. Data generation from simulation and analysis routines.
2. The size of the data being generated:
–Latency issues for access patterns.
–Can we develop good compression techniques?
–Bandwidth/disk speed issues.
–Do we need non-volatile storage? RAM-RAM, RAM-disk-tape.
–"Plug and play" analysis routines need a common data model; it is non-trivial to transfer from N processors to M processors!
–Bottleneck: analysis is too slow.
Save scalar data for more post-processing. Save viz data. Toss particle data.
[Data rates: scalars 0.01 Mb/s; particles 50 Mb/s; viz scalars 60 Mb/s; 0.5% of 1000 time steps.]
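On the compression question: a quick lossless-zlib experiment on synthetic doubles illustrates why the answer is data-dependent. Rounded, smoothly periodic samples compress well; noise-like samples barely do. This is purely an illustration, not any code's actual I/O path:

```python
# Lossless-compression experiment: smooth periodic samples vs noise-like
# samples, packed as 8-byte doubles. Purely illustrative.
import math
import struct
import zlib

def ratio(values):
    raw = struct.pack(f"{len(values)}d", *values)
    return len(raw) / len(zlib.compress(raw, 9))

# Smooth field: rounded sine with an exact 100-sample period.
smooth = [round(math.sin(2 * math.pi * i / 100), 3) for i in range(10000)]
# Noise-like field: a pseudo-random fractional-part hash.
noisy = [math.sin(i * 12.9898) * 43758.5453 % 1 for i in range(10000)]

print(f"smooth: {ratio(smooth):.1f}x, noisy: {ratio(noisy):.2f}x")
```

Real simulation fields sit somewhere between these extremes, which is why the slide poses it as an open question rather than a given.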

11 Stage 5: Feedback stage
After the production run we interpret the results. We then ask a series of questions:
–Do I have adequate analysis routines?
–Was the original hypothesis correct?
–Should the model equations change? Do we need to modify them?
–If everything is OK, should we continue the parameter survey?
The workflow is changing! Comparison to other data: theory, simulations, experiments.
[Timeline: production run → interpret results → compute correlation function → thought → viz.]

12 Stage 5: Observations
To expedite this process:
–Need standard data model(s).
–Can we build analysis routines which can be used for multiple codes and/or multiple disciplines?
The data model must allow flexibility:
–Commonly we add/remove variables used in the simulations/analysis routines.
–Need for metadata, annotation, and provenance.
Nature of metadata:
–Code versions, compiler information, machine configuration.
–Simulation parameters, model parameters.
–Information on simulation inputs.
–Need for tools to record provenance in databases.
Additional provenance (beyond that provided by the above metadata) is needed to describe:
–Reliability of the data; how the data arrived in the form in which it was accessed; data ownership.
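The metadata fields listed here (code version, compiler, machine, parameters, inputs) map naturally onto a provenance table. A minimal SQLite sketch, with an illustrative schema and entirely hypothetical values:

```python
# Hypothetical provenance schema; every value below is made up.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE provenance (
        run_id       INTEGER PRIMARY KEY,
        code_version TEXT,
        compiler     TEXT,
        machine      TEXT,
        parameters   TEXT,  -- JSON blob of simulation/model parameters
        inputs       TEXT   -- description of simulation inputs
    )
""")
conn.execute(
    "INSERT INTO provenance"
    " (code_version, compiler, machine, parameters, inputs)"
    " VALUES (?, ?, ?, ?, ?)",
    ("gtc-example-2.1", "xlf 8.1", "seaborg",
     json.dumps({"particles": 1e8, "dt": 0.15}),
     "equilibrium profile from a prior run"))
machine, params = conn.execute(
    "SELECT machine, parameters FROM provenance WHERE run_id = 1").fetchone()
print(machine, json.loads(params)["dt"])
```

The "additional provenance" items (reliability, derivation history, ownership) would be further columns or linked tables keyed on `run_id`.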

13 Stage 5: Observations
Data analysis routines can include:
–Data transformation: format transformation, reduction, coordinate transformation, unit transformation, creation of derived data, …
–Feature detection, extraction, tracking: define metadata, find regions of interest, perform level-set analyses in spacetime, perform born analyses.
–Inverse feature tracking.
–Statistical analysis: PCA, comparative component analyses, data fitting, correlations.
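As one concrete instance of the statistical analyses listed, an autocorrelation function computed directly from its definition on a synthetic signal (a period-50 sine plus noise, standing in for a turbulence time trace):

```python
# Autocorrelation C(tau) = <s(t) s(t+tau)> / <s(t)^2>, computed directly
# from its definition on a synthetic signal.
import numpy as np

def autocorr(sig, max_lag):
    s = sig - sig.mean()
    var = np.dot(s, s) / len(s)
    return np.array([np.dot(s[:len(s) - k], s[k:]) / (len(s) - k)
                     for k in range(max_lag)]) / var

t = np.arange(2000)
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)
c = autocorr(signal, 100)
print(c[0], c[50])   # c[0] is 1 by construction; c[50] is near 1 for period 50
```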

14 Stage 5: Observations
Visualization needs:
–Local, remote, interactive, collaborative, quantitative, comparative.
–Platforms.
–Fusion of different data types: experimental, theoretical, computational, …
–New representations.

15 Stage 6: Complete parameter survey
Complete all of the runs for the parameter survey to answer the question. 1-3 people look at the results during the parameter survey.
[Timeline: production run → interpret results → production run → interpret results …; feature tracking.]

16 Stage 7: Run a "large" hero run
Now we can run a high-resolution case, which will run for a very long time. O(10) people look at the results.
[Timeline: large hero run → interpret results …]

17 Stage 8: Assimilate the results
Did I answer the question?
–Yes: publish a paper. O(10+) people look at the results. Compare to experiment. What do we need stored? Short-term storage, long-term storage.
–No: go back to stage 1 (question).
[Diagram: interpret results → data repository (TB's of viz, features, metadata, movies) → global analysis tools, viz, data mining tools → assimilate results.]

18 Stage 9: Other scientists use the information
Now other scientists can look at this information and use it for their analysis, or as input for their simulations.
What are the data access patterns?
–Global interactive viz: GB's of data per time slice, TB's in the future.
–Bulk data is accessed numerous times.
–Look at derived quantities: MB's to GB's of data.
How long do we keep the data? Generally less than 5 years.

19 Let thought be the bottleneck
Simulation scientists generally have scripts to semi-automate parts of the workflow. To expedite this process they need to:
–Automate the workflow as much as possible.
–Remove the bottlenecks.
Better visualization and better data analysis routines will allow users to decrease the interpretation time. Better routines to "find the needle in the haystack" (feature detection/tracking) will decrease the time spent in the thought process. Faster turnaround time for simulations will decrease the code runtimes:
–Better numerical algorithms, more scalable algorithms.
–Faster processors, faster networking, faster I/O.
–More HPC systems, more end stations.
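The semi-automating scripts mentioned here often amount to a tiny make-like runner: re-run a step only when its output is missing or older than its input. A minimal sketch, with toy stand-ins for the analysis step and its files:

```python
# Make-like runner: a step executes only when its output is missing or
# older than its input. Paths and the "analysis" are toy stand-ins.
import os
import tempfile

def needs_update(src, dst):
    return (not os.path.exists(dst)
            or os.path.getmtime(src) > os.path.getmtime(dst))

def step(src, dst, action):
    if needs_update(src, dst):
        action(src, dst)

runs = []  # record how many times the analysis actually executed

def analyze(src, dst):
    runs.append(dst)
    with open(src) as f, open(dst, "w") as g:
        vals = [float(x) for x in f.read().split()]
        g.write(str(sum(vals) / len(vals)))

with tempfile.TemporaryDirectory() as d:
    raw, avg = os.path.join(d, "raw.txt"), os.path.join(d, "avg.txt")
    with open(raw, "w") as f:
        f.write("1 2 3 4")
    step(raw, avg, analyze)   # runs: avg.txt does not exist yet
    step(raw, avg, analyze)   # skipped: avg.txt is already up to date
    result = open(avg).read()

print(result, len(runs))
```

Chaining such steps removes the babysitting for the routine parts of the pipeline while leaving the interpretation, the intended bottleneck, to the scientist.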

20 Summary
Biggest bottleneck: interpretation of results. This is the biggest bottleneck because of:
–Babysitting: scientists spend their "real time" babysitting computational experiments [trying to interpret results, move data, and orchestrate the computational pipeline], and deciding if the analysis routines are working properly with this "new" data.
–Non-scalable data analysis routines: looking for the "needle in the haystack." Better analysis routines could mean less time in the thought process and in the interpretation of the results.
The entire scientific process cannot be fully automated.

21 Workflows
No changes in these workflows.

22 Section 3: Astrophysical Simulation Workflow Cycle
–Application layer: run the simulation batch job on a capability system; the simulation generates checkpoint files; viz & analysis on a local Beowulf cluster; then continue the simulation or start a new one.
–Parallel I/O layer: Parallel HDF5.
–Storage layer (MSS, disks & OS): archive checkpoint files to HPSS; migrate a subset of checkpoint files to the local cluster (GPFS, PVFS, or Lustre).

23 Biomolecular Simulation
[Diagram components: workflow design; molecular system construction; structure database (e.g. PDB); parameterization; computer simulation (hardware, OS, math libraries, MSS/HPSS); molecular trajectories; archive trajectories; large-scale temporary storage; raw data visualization; molecular system analysis & visualization; statistical analysis; review/curation; trajectory database server (e.g. BioSimGrid); storage management, data movement and access.]

24 Combustion Workflow

25 GTC Workflow
GTC time-step loop:
–Deposit the charge of every particle on the grid.
–Solve the Poisson equation to get the potential on the grid.
–Calculate the electric field.
–Gather the forces from the grid to the particles and push them.
–Do process migration with the particles that have moved out of their current domain.
Analysis/viz: compute volume-averaged quantities; tracer particle energy, position, momentum; 1d and 2d radial and velocity profiles; correlation functions.
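The first step of the loop, charge deposition, can be illustrated in 1-D with cloud-in-cell weighting: each particle splits its charge between the two grid points that bracket it. Real GTC deposits gyro-averaged charge on a 3-D toroidal grid, so this is only the idea:

```python
# 1-D cloud-in-cell charge deposition on a periodic grid; a toy analogue
# of GTC's (3-D, gyro-averaged) deposition step.
import numpy as np

def deposit(positions, n_grid, charge=1.0):
    rho = np.zeros(n_grid)
    left = np.floor(positions).astype(int) % n_grid     # bracketing grid point
    frac = positions - np.floor(positions)              # offset within the cell
    np.add.at(rho, left, charge * (1 - frac))           # share to left point
    np.add.at(rho, (left + 1) % n_grid, charge * frac)  # share to right point
    return rho

rho = deposit(np.array([0.25, 2.5, 2.5]), n_grid=4)
print(rho)   # total deposited charge equals the number of particles
```

`np.add.at` is used instead of `rho[left] += …` so that particles sharing a grid point accumulate correctly rather than overwriting each other.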

26 NIMROD Workflow
[Diagram components: input files (nimrod.in, fluxgrid.in, nimhdf.in, nimfl.in, …); nimset → dump.00000; NIMROD (run-time config) → dump.* restart files (~100 files), nimhist data for every time step, discharge and energy outputs; post-processors nimhdf, nimfl, nimplot, … (run-time config) → Phi.h5, nimfl.bin; viewers Xdraw, AVS/Express, SCIRun, OpenDX → animations, images, screen.]

27 M3D Simulation Studies 2009 (rough estimate)
Initial run: VMEC, JSOLVER, EFIT, etc. → M3D → restart 1, restart 2, …, restart N → done.
Run M3D at NERSC on 10,000 processors for 20 hours per segment. Archive to HPSS (NERSC) and PPPL local project disks. Post-process locally on an upgraded PPPL cluster; requires 10 min per time slice to analyze; typically analyze 20 time slices. 1 TB files; transfer time 10 min, if parallel?

28 A Simplified VORPAL Workflow
Initial parameters and input data → VORPAL (run-time configurations) → data sets D1, D2, D3, …, Dn → data filtering/extraction → image generator (Xdraw) → time-slice images png1, png2, png3, …, pngn → per-simulation animations (Sim1, Sim2, …, SimX).
Currently, the workflow is handled by a set of scripts. Data movement is handled either by scripts or manually.

29 TRANSP Workflow
Experiments (C-Mod, DIII-D, JET, MAST, NSTX) with diagnostic hardware → preliminary data analysis and preparation (largely automated): 20-50 signals {f(t), f(x,t)} — plasma position, shape, temperatures, densities, field, current, RF and beam injected powers → TRANSP analysis: current diffusion, MHD equilibrium, fast ions, thermal plasma heating; power, particle and momentum balance; experiment simulation → output database: ~1000-2000 signals {f(t), f(x,t)} → visualization; load relational databases; detailed (3d) time-slice physics simulations (GS2, ORBIT, M3D, …). Pre- and post-processing at the experimental site. — D. McCune, 23 Apr 2004

30 Workflow for Pellet Injection Simulations
–Preliminary analysis (deciding run parameters) → input files.
–Run 1D pellet code → table of energy sink term as a function of flux surface and time.
–Run AMR production code (the majority of the time) → HDF5 data files.
–Run post-processing code to compute visualization variables and other diagnostic quantities (e.g. total energy) for plotting → HDF5 data files of plotting variables; ASCII files of diagnostic variables.
–Visualize field quantities in computational space using ChomboVis; create diagnostic plots.
–Interpolate the solution on the finest mesh; create data files for plotting field quantities in a torus → interpolated data files (binary).
–Visualize field quantities in a torus using AVS or EnSight.
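The "interpolate the solution on the finest mesh" step can be illustrated in 1-D with piecewise-linear resampling; the coarse and fine meshes below are stand-ins for AMR levels, not the real pellet code's grids:

```python
# Piecewise-linear resampling of a coarse solution onto a finer mesh;
# a 1-D stand-in for interpolating AMR output onto the finest mesh.
import numpy as np

coarse_x = np.linspace(0.0, 1.0, 9)    # stand-in coarse AMR level
coarse_u = coarse_x**2                 # field sampled on it
fine_x = np.linspace(0.0, 1.0, 65)     # target "finest mesh"
fine_u = np.interp(fine_x, coarse_x, coarse_u)

err = np.max(np.abs(fine_u - fine_x**2))
print(err)   # O(h^2) interpolation error, about 4e-3 for h = 0.125
```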

31 Degas2 Workflow

32 High-Energy Physics Workflow (typical of a major collaboration)
–Data acquisition — users: DAQ team; at 1 site.
–Simulation — users: simulation team; at 10s of sites.
–Reconstruction (feature extraction) — users: reconstruction team; at a few sites.
–Skimming/filtering — users: skim team; at a few sites.
–Analysis — users: all physicists; at 100+ sites.
Databases (< 1 terabyte): conditions, metadata, and workflow. Data volumes: 100s of terabytes today, 10s of petabytes in 2010.

33 Nuclear Physics Workflow (typical of a major collaboration)
–Data acquisition — users: DAQ team; at 1 site.
–Simulation — users: simulation team; at 10s of sites.
–Reconstruction (feature extraction) — users: reconstruction team; at a few sites.
–Skimming/filtering — users: skim team; at a few sites.
–Analysis — users: all physicists; at 100+ sites.
Databases (< 1 terabyte): conditions, metadata, and workflow. Data volumes: 100s of terabytes today, 10s of petabytes in 2010.

34 Comments from others

