MS698: Implementing an Ocean Model. Benchmark tests; dealing with big data files (specifically big NetCDF files). Signell paper. Activity: nctoolbox.

MS698: Implementing an Ocean Model. Benchmark tests; dealing with big data files (specifically big NetCDF files). Signell paper. Activity: nctoolbox in MATLAB. April 4, 2014

Benchmark Tests (grid is 50x70)

Name       NtileI,NtileJ   Processors   Nodes   PPN   Node type   CPU time (hr)   Wall time (hr)
Julia      1,1             1            1       1     score       1.8
Daniel     1,2              2            1       2     score
Danielle   4,1              4            1       4     qcore
Danny      1,8              8            1       8     qcore
Jiabi      4,2              8            1       8     qcore
Jiabi      2,4              8            1       8     qcore
Britt      2,1              2            1       2     score
Fei        2,2              4            1       4     dcore       2.7             0.7

Activity Today: Part I: Analyze Benchmark Tests
Plot the wall time vs. the number of processors. Make another figure of the speedup vs. the number of processors (a MATLAB sketch follows below). Then discuss the model performance when run in parallel:
– Do you see a difference in speedup depending on how the model tiles were configured?
– Does it seem worthwhile to run the model in parallel up to the number of processors tested (8)?
– Based on these benchmarks, how would you set up a parallel run if you wanted to represent a long period of time?
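A minimal sketch of this analysis in MATLAB, assuming you fill in the wall times you measured for each run (the benchmark table above leaves most of them blank); the processor counts come from the table.

    % Part I sketch: wall time and speedup vs. number of processors.
    % Replace the NaNs with the wall times (hr) you measured for each run:
    % Julia, Daniel, Britt, Danielle, Fei, Danny, Jiabi(4x2), Jiabi(2x4).
    nprocs   = [1 2 2 4 4 8 8 8];          % NtileI*NtileJ for each run
    walltime = nan(size(nprocs));          % measured wall times go here

    speedup = walltime(nprocs == 1) ./ walltime;   % relative to the serial run

    figure;
    subplot(2,1,1)
    plot(nprocs, walltime, 'o-')
    xlabel('Number of processors'); ylabel('Wall time (hr)')

    subplot(2,1,2)
    plot(nprocs, speedup, 'o-'); hold on
    plot(nprocs, nprocs, 'k--')            % ideal linear speedup for reference
    xlabel('Number of processors'); ylabel('Speedup')
    legend('measured', 'ideal', 'Location', 'northwest')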

Themes of the paper
Model output is big data, at roughly the terabyte scale, and data access is often limited by bandwidth. But you usually don't want or need the entire file or dataset; you just need part of it. This matters especially in collaborative settings where different models are being combined or compared, because comparing models can be a pain (different grids, different time steps, different variable names, different units...). And because the output data is "big," it is often spread across multiple files so that each file stays under 2 GB.

Section 2 gives 5 pieces of advice
1. Store data in a machine-independent, self-describing format (like NetCDF).
2. Use CF (Climate and Forecast) conventions. This makes it easier for processing scripts to figure out the model grid, variables, etc., and is especially important for users who do not know the details of the model.
3. Use and develop generic tools that work with CF-compliant data.
4. Use OPeNDAP to distribute data. This lets the data be served over the internet so that subsets can be accessed without downloading entire files (see the sketch after this list).
5. Use a THREDDS catalog. This lets you string many data files together into one dataset.
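A minimal sketch of points 1-4 using nctoolbox in MATLAB. The OPeNDAP URL and the variable name 'temp' are hypothetical; substitute a real THREDDS/OPeNDAP endpoint. The point is that only the requested slice crosses the network.

    setup_nctoolbox;                         % add nctoolbox to the MATLAB path (run once)
    url = 'http://example.com/thredds/dodsC/roms/his_agg';   % hypothetical OPeNDAP URL
    nc  = ncgeodataset(url);
    nc.variables                             % list the CF variable names the server exposes

    temp = nc{'temp'};                       % lazy handle; no data transferred yet
    sz   = size(temp);                       % ROMS order: [time, s_rho, eta, xi]

    % Only this subset is downloaded: last time step, surface layer
    surf_temp = squeeze(temp(sz(1), sz(2), :, :));
    g = temp.grid_interop(sz(1), sz(2), :, :);      % matching lon/lat from the CF metadata
    pcolor(g.lon, g.lat, double(surf_temp)); shading flat; colorbar
    title('Surface temperature, last time step')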

Now for our example: how can we deal with big datasets? Use the MATLAB script in /export/home/ckharris/MODELS/ROMS/RIVER PLUME2/MS698…

We are going to use nctoolbox in MATLAB to analyze big NetCDF data on the vlab computers.
Why are we going to use the vlab computers?
– They have a new enough version of MATLAB and Java and can access our model output.
– This avoids the step of logging onto the cluster or poverty. (You might be able to do this from poverty as well. You can use these tools on pacific, but running jobs interactively on the cluster requires some extra steps.)
Why do we want to use nctoolbox? (A short sketch follows below.)
– It gives us useful tools for concatenating across history files.
– It has useful tools for analyzing ocean model data.
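A minimal first-look sketch with nctoolbox, assuming a local ROMS history file named riverplume2_his_0001.nc and the standard ROMS variable names (both are assumptions; adjust them to your own run):

    setup_nctoolbox;                           % run once per MATLAB session
    nc = ncgeodataset('riverplume2_his_0001.nc');   % hypothetical file name

    nc.variables                               % list every variable in the file
    salt = nc{'salt'};                         % geovariable, dims [time, s_rho, eta, xi]
    size(salt)                                 % confirm the dimension lengths

    g = salt.grid_interop(1, :, :, :);         % CF-interpreted lon, lat, z at the first time step
    t = nc.time('ocean_time');                 % model times converted to MATLAB datenums
    datestr(t(1))                              % first output time, human readable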

Activity for today
– Plot the wall time vs. the number of processors.
– Make another figure of the speedup vs. the number of processors.
– Discuss the model performance when run in parallel:
  – Do you see a difference in speedup depending on how the model tiles were configured?
  – Does it seem worthwhile to run the model in parallel up to the number of processors tested (8)?
  – Based on these benchmarks, how would you set up a parallel run if you wanted to represent a long period of time?
– Use nctoolbox to plot a time series of the data from the RIVERPLUME2 test case (a sketch follows below). /export/home/ckharris/MODELS/ROMS/RIVERPLUME2/MS698…
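A sketch of the time-series part of the activity, assuming the history files follow a riverplume2_his_*.nc naming pattern, contain the ROMS free-surface variable zeta, and that the chosen interior grid point (indices 35, 25) is representative; all three are assumptions to adjust for your run.

    % Run this from the directory containing the history files.
    setup_nctoolbox;
    files = dir('riverplume2_his_*.nc');        % hypothetical naming pattern
    t = []; zeta_pt = [];
    for k = 1:numel(files)
        nc   = ncgeodataset(files(k).name);
        zeta = nc{'zeta'};                      % free surface, dims [time, eta, xi]
        tt   = nc.time('ocean_time');           % this file's times as datenums
        t    = [t; tt(:)];                      % append times as a column
        zz   = squeeze(double(zeta(:, 35, 25)));% all times at one grid point
        zeta_pt = [zeta_pt; zz(:)];             % append the point values
    end
    figure;
    plot(t, zeta_pt)
    datetick('x')
    ylabel('Free surface \zeta (m)')
    title('RIVERPLUME2: free-surface time series at one grid point')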