A User’s Perspective on Acquisition and Management of CMIP5 Data

A User’s Perspective on Acquisition and Management of CMIP5 Data
Jennifer Miletta Adams George Mason University / COLA ESGF2F, December 2014 [Abstract] The complexity, volume, and distributed nature of the CMIP5 data collection has left many users struggling to acquire the CMIP5 data they need. This presentation outlines strategies that were developed to overcome the challenges CMIP5 data users face: authentication, searching for published data that match a list of desired experiments and variables, acquisition of wget scripts, managing wget script execution and the high wget failure rate, retention of critical metadata not present in the data files, version control, local data management, and setting up the data for analysis and visualization using GrADS. All these strategies exist in an automated workflow that is completely independent of any browser interface.

COLA’s CMIP5 Data Collection
Automated record keeping of acquired datasets began in February Periodic checks on disk volume highlighted in red.

COLA’s CMIP5 Data Collection
Reconstructed history of CMIP5 acquisition

Workflow Requirements
No , , , , et al. Script-Based Flexible Automated Runs in a UNIX environment

Workflow Elements Create list of desired data: ”All available models and ensembles for a subset of experiments, realms, frequencies, and variables” Keep track of what has already been acquired Identify what data are available Get needed data Make data user-friendly This workflow was designed during a time when new CMIP5 data sets were constantly being published – the available data was always expanding.

Programmatic View of Workflow
while(1) { list(acquired); for(desired) { search(available); for(available) { if(!acquired) needed; } download(needed);

Keep Track of Acquired Data
11 keywords are required: cmip5 /data /Experiment /Realm /Frequency /MIP-Table /Variable /Institute.Model /Ensemble /Version /datafiles.nc Workflow Element #2 Local data management structure 11 keywords uniquely identify each data set (keyword product ‘output1/output2’ is not retained). Not all keywords are present in the data file name, or in the data file attributes, so this metadata is retained in the directory path. Version control: the version number is not present in a CMIP5 data file, but it is available during the wget script acquisition process, so it must be deliberately retained in local directory structure.

Discovery of Available Data
Build a Dataset search URL: &latest=true &replica=false &facets=id &limit=0 &project=CMIP5 &experiment=piControl &realm=atmos &time_frequency=mon &cmor_table=Amon &variable=clt&variable=hfls….&variable=vas Workflow Element #3 Using keywords from the ‘desired’ text file (format not shown here, keywords highlighted in pink), build a URL to search for datasets that match the experiment description. The results of this search are compared to the list of what has been acquired to determine what needs to be downloaded.

Download Needed Data Build a file search URL to determine number of files for each data set Build a wget URL to download wget scripts; then give them unique names Keep authentication certificates up-to-date Monitor execution of wget scripts in a staging area Put files in place under local directory structure Workflow Element #4 Downloading needed data has its own sub-workflow.

Make Data User-Friendly
Create GrADS descriptor files Aggregate files over time dimension Make use of ensemble dimension when appropriate Identify missing or overlapping time periods Assign non-standard dimensions (e.g. basin averages) Handle 365-day calendars Interpolate data on non-rectilinear grids For ocean and sea ice realms ESMF’s RegridWeightGen generates the interpolation weights Rotate vector fields from grid-relative to Earth-relative coordinates before interpolation Workflow Element #5 After the data files have been collected and placed in position, there is still more work to do to make them easy to use with GrADS.

Complications Solutions Version number not with data
Retained during wget script acquisition 1000 File limit per wget script Please minimize file granularity! User authentication Automated with MyProxyClient Errors from wget Never mind why, just keep trying. Failure is an option. Some data nodes are friendlier than others Data node blacklist Missing or overlapping data DO NOT hide missing data with a non-linear time axis! Rotation of grid-relative vectors Please publish gridspec files! Data on wacky grids ESMF’s RegridWeightGen Special thanks to: Luca Cinquini, Estani Gonzalez, Gavin Bell, Lawson Hanson, and the CMIP5 Helpdesk!

A User’s Perspective on Acquisition and Management of CMIP5 Data

Similar presentations

Presentation on theme: "A User’s Perspective on Acquisition and Management of CMIP5 Data"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A User’s Perspective on Acquisition and Management of CMIP5 Data

Similar presentations

Presentation on theme: "A User’s Perspective on Acquisition and Management of CMIP5 Data"— Presentation transcript:

Similar presentations

About project

Feedback