1 3 Dec 2009 Assembling Large, Multi-Sensor Climate Datasets Using CVO Brian Wilson, Gerald Manipon, Tom Yunck, and Zhangfan Xing Jet Propulsion Laboratory Do multi-instrument science by authoring a dataflow doc. for a reusable operator tree. Access scientific data by naming it.
2 3 Dec 2009 Large-Scale Data Fusion Find Level-2 datasets –Space/time granule query for multiple EOS (“A-Train”) instruments – AIRS, AMSR-E, AMSU, MODIS, Cloudsat, GPS Co-locate retrievals using space/time metadata –Instantaneous “matchups” in space & time Read the data –Temperature, water vapor, quality flags, radiances (HDF) Understand the data –Units, quality control (non-trivial !!), etc. Publish merged (matchup) products –Temperature & water vapor from GPS, AIRS, & AMSU –Determine instrument biases, understand by stratifying Publish multi-sensor “fused” products –Improve AIRS retrievals by introducing temperature information from highly accurate GPS refractivity
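The co-location step on this slide can be illustrated with a minimal sketch of an instantaneous space/time matchup. This is not the CVO implementation: the function names, the 100 km and 1 hour thresholds, and the sample points are all assumptions chosen for illustration.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_matchups(points_a, points_b, max_km=100.0, max_dt=timedelta(hours=1)):
    """Return (i, j) index pairs where points_a[i] and points_b[j] are close
    in both space and time. Each point is a (lat, lon, time) tuple.
    The thresholds are illustrative assumptions, not values from the slides."""
    pairs = []
    for i, (lat_a, lon_a, t_a) in enumerate(points_a):
        for j, (lat_b, lon_b, t_b) in enumerate(points_b):
            close_in_time = abs(t_a - t_b) <= max_dt
            if close_in_time and great_circle_km(lat_a, lon_a, lat_b, lon_b) <= max_km:
                pairs.append((i, j))
    return pairs


# Made-up sample locations: one GPS profile and three AIRS retrievals.
gps = [(10.0, -150.0, datetime(2003, 1, 2, 0, 30))]
airs = [
    (10.2, -150.1, datetime(2003, 1, 2, 0, 50)),  # near in space and time
    (10.2, -150.1, datetime(2003, 1, 2, 6, 0)),   # near in space, too late
    (40.0, -150.0, datetime(2003, 1, 2, 0, 40)),  # near in time, too far away
]
matchups = find_matchups(gps, airs)
print(matchups)
```

The brute-force double loop is only practical for small inputs; a production matchup would first narrow candidates using granule-level space/time metadata.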
3 3 Dec 2009 CVO Software Layers CVO web application (visual programming) –Guide user in configuring & executing a science analysis workflow –Client works in a browser, pre-configured for various users SciFlo workflow engine –Server side: executes “backing” workflow (XML) document –Metadata in XML, science data in HDF/netCDF (binary) Python & command-line glue –Wrap Fortran/C/C++ operators into workflow –Workflow execution available from python or Unix command line Analysis & visualization operators –Fortran, C, IDL, Matlab, NCAR_Graphics –Algorithms now available as workflow steps & callable services! Merged (“matchup”) datasets –Merged data in netCDF (simpler than HDF)
4 3 Dec 2009 CVO Software Layers [Diagram: CVO Builder and underlying services & operators. Components shown: CVO Builder GUI; VizFlow Authoring GUI; Unix Command Line; XML Workflows Executed on CVO SciFlo Node; Published Workflows; SciFlo Web Services; GeoRegionQuery; Webify To Slice Data Arrays; Matchup & Analysis Services; Plotting Services; Python Glue; Python Operators; Fortran, C/C++ Operators; IDL, Matlab Operators; Libraries: python, graphics; Metadata Database; Registered Science Data Files: Remote URL’s or Local Cache. Legend: browser GUI client app.; remotely callable service; local codes & data.]
5 3 Dec 2009 Configured for GPS-AIRS Matchup CVO Browser GUI
7 3 Dec 2009 “SciFlo” Workflow Engine –Automate large-scale, multi-instrument science processing by authoring a dataflow document that specifies a tree of executable operators or services. –VizFlow Visual Authoring Tool (AJAX GUI in browser) –Distributed Dataflow Execution Engine (in python) –Data Grid: Move data “granules” to the operators using FTP/HTTP, or slice variables using OpenDAP URLs. –Compute Grid: Move operators (executables) to the data. –Built-in reusable operators provided for many tasks such as subsetting, co-registration, regridding, data fusion, etc. –Custom operators easily plugged in by scientists. –Publish algorithms as remotely-callable Web Services and then orchestrate services in easily authored workflow. –CVO web app: Guide user in selecting and providing inputs to matchup & analysis workflows.
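To make the idea of a dataflow document that specifies a tree of operators concrete, here is a minimal sketch of a dataflow executor. It is not the SciFlo engine: the step table, the `run_dataflow` name, and the toy `subset` and `mean` operators are hypothetical, and a plain dictionary stands in for the XML workflow document.

```python
def run_dataflow(steps, inputs):
    """Run each step as soon as all of its named inputs are available.

    steps:  dict mapping step name -> (function, [names of its inputs])
    inputs: dict of initial named values
    Returns a dict holding the initial inputs plus every step's output.
    """
    values = dict(inputs)
    pending = dict(steps)
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in values for dep in deps)]
        if not ready:
            # Nothing can run: a dependency is missing or the graph has a cycle.
            raise ValueError("unsatisfied inputs for steps: %s" % sorted(pending))
        for name in ready:
            func, deps = pending.pop(name)
            values[name] = func(*[values[dep] for dep in deps])
    return values


# Toy two-step flow: subset a list of values by range, then average the subset.
steps = {
    "subset": (lambda data, bounds: [x for x in data if bounds[0] <= x <= bounds[1]],
               ["data", "bounds"]),
    "mean": (lambda xs: sum(xs) / len(xs), ["subset"]),
}
result = run_dataflow(steps, {"data": [1, 2, 3, 4, 5, 6], "bounds": (2, 5)})
print(result["mean"])
```

Steps are connected purely by name, which mirrors how a flow author wires output ports to input ports; the executor works out the order.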
8 3 Dec 2009 GPS-AIRS Matchup & Temp. Profile Comparison VizFlow Flowchart Connect a series of services and operators into a dataflow Drag services/operators from menu, and drop onto the canvas Lay out the flowchart by moving nodes Connect the input/output ports by drawing lines User guided by matching up port names and types
9 3 Dec 2009 Service/Operator Orchestration Each SciFlo processing step is one of: –Template for XML (or string) generation –REST (http GET) call: e.g. WMS/WCS, DAP URLs –SOAP service call: “have WSDL, will call” –XPath 2.0 transformation for XML mediation –XQuery 1.0 query/transformation –Command-line script or executable –Python function call –Scientist’s custom IDL or MATLAB script –Other (What do you need?)
10 3 Dec 2009 SciFlo As an Authoring Toolkit Assemble operators by writing XML document –Connect SOAP/REST data query/access services to custom, executable algorithms written by scientists –IDL, MATLAB, or python codes can become operators SciFlo Engine automatically: –Generates web (HTML) form to call the flow –Publishes custom flow as a new web service (if desired) Create many Web Services Automatically –No glue or SOAP code to write –Only write science algorithms, in language of choice Publish Analysis Flows –Exchange SciFlo documents –Generated products have lineage & user annotations
11 3 Dec 2009 CVO/SciFlo Network Compute & Data grid JPL GPS AIRS AIRS Science Team JPL CVO Nodes (2) Goddard DAAC NOAA AMSU UCAR GPS NCAR CVO Node
12 3 Dec 2009 Full SciFlo Network JPL GPS, AIRS, & MISR Science Areas U. Michigan [NASA Goddard DAAC] NASA Langley DAAC U. Alabama Huntsville (More universities coming). JPL DAAC UCLA Ohio State
13 3 Dec 2009 Open Data Access Protocol www.opendap.org –Use a one-line query URL to retrieve a slice of a variable grid from a netCDF or HDF file anywhere in the world –Binary wire protocol for fine-grained data transfer OpenDAP URL –http://gen-dev.jpl.nasa.gov/genesis/cgi-bin/dods/nph-dods/genesis/data/airs/L2/20030113/airx2ret/AIRS.2003.01.13.171.L2.RetStd.hdf?TAirStd(1:3, 3:6, 4:17) OpenDAP Servers –netCDF, HDF, GRIB, FreeForm, JGOFS, other file formats –Easy to implement another server OpenDAP clients –Matlab and IDL, any web browser –Python (pydap or SciFlo)
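A one-line slice query of this kind can be assembled with a few lines of Python. The sketch below only builds the URL string and does not contact a server. It writes the index ranges in the square-bracket hyperslab form used by DAP2 constraint expressions, whereas the slide writes them in parentheses. The helper name `dap_slice_url` and the `example.org` host are hypothetical.

```python
def dap_slice_url(base_url, variable, ranges):
    """Build a DAP2-style constraint URL of the form base?var[start:stop][start:stop]...

    ranges is a list of inclusive (start, stop) index pairs, one per dimension.
    """
    for start, stop in ranges:
        if start < 0 or stop < start:
            raise ValueError("invalid index range: %r" % ((start, stop),))
    hyperslab = "".join("[%d:%d]" % (start, stop) for start, stop in ranges)
    return "%s?%s%s" % (base_url, variable, hyperslab)


# Hypothetical server path; the file and variable names follow the slide's example.
url = dap_slice_url(
    "http://example.org/dods/AIRS.2003.01.13.171.L2.RetStd.hdf",
    "TAirStd",
    [(1, 3), (3, 6), (4, 17)],
)
print(url)
```

A client library such as pydap hides this URL construction behind array-style indexing, so hand-built URLs are mainly useful for quick checks in a browser or script.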
14 3 Dec 2009 Webify (Zhangfan Xing) Webification server: http://w10n.jpl.nasa.gov Drill down into a “deep web” of science data. –Simple URL’s to get metadata, slice variables, etc. –Lighter weight than OpenDAP, mostly in python –HDF group supporting fast HDF5 server in C –Support multiple file formats: HDF, netCDF, FITS, GRIB, etc. –Returns multiple formats: XML, HTML, JSON, netCDF NetCDF Example –http://w10n.jpl.nasa.gov/test/data/nc/coads_climatology.nc (download netCDF file) –http://w10n.jpl.nasa.gov/test/data/nc/coads_climatology.nc/ (file metadata) –http://w10n.jpl.nasa.gov/test/data/nc/coads_climatology.nc/SST/ (variable metadata) –http://w10n.jpl.nasa.gov/test/data/nc/coads_climatology.nc/SST[0:2,45:55,85:95] (slice variable)
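The four URL patterns in the netCDF example follow a simple rule, sketched below as a small helper. It reproduces only the patterns shown on the slide and makes no request to the server. The function name `webify_urls` and the dictionary keys are hypothetical, not part of Webify itself.

```python
def webify_urls(file_url, variable, ranges):
    """Return the four URL forms from the slide for one file and one variable.

    ranges is a list of (start, stop) index pairs, one per dimension.
    """
    index = ",".join("%d:%d" % (start, stop) for start, stop in ranges)
    return {
        "download": file_url,                                  # the file itself
        "file_metadata": file_url + "/",                       # trailing slash: file metadata
        "variable_metadata": "%s/%s/" % (file_url, variable),  # variable metadata
        "slice": "%s/%s[%s]" % (file_url, variable, index),    # slice of the variable
    }


urls = webify_urls(
    "http://w10n.jpl.nasa.gov/test/data/nc/coads_climatology.nc",
    "SST",
    [(0, 2), (45, 55), (85, 95)],
)
for kind, url in urls.items():
    print(kind, url)
```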
15 3 Dec 2009 Carbon Cycle AIRS/GPS Co-Registration: Point to Swath Matchup AIRS Level2 Swaths over Pacific GPS Level2 Profile Locations Multi-Instrument Atmospheric Science
16 3 Dec 2009 AIRS/GPS Temperature & Water Vapor Comparison Plots AIRS / GPS Matchups
17 3 Dec 2009 Space/Time Query in SciFlo A SciFlo Dataset is: –Specified as a space/time query over collections of data products (or retrieved physical variables) GeoRegionQuery(DataProduct, TimeRange, LatLonRegion) GeoRegionQuery(PhysicalVariable, TimeRange, LatLonRegion) –Realized as a list of object ID’s or URI’s (permanent names) GeoRegionQuery returns unique objectIds along with geolocation metadata –Accessed using a list of URL’s pointing to on-line replicas of the data objects (files). FindDataById(objectIds) -> URLs (ftp, http, or OpenDAP) Translate unique object ID’s into list of on-line locations in DataPools or any SciFlo node DataPools & SciFlo P2P network are “crawled” to update distributed translation tables Or query ECHO metadata repository –SciFlo network is a distributed cache for scientific datasets
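The two-stage pattern on this slide, a space/time query that yields permanent object IDs followed by translation of those IDs into replica URLs, can be sketched with in-memory tables. This is an illustration only: the real GeoRegionQuery and FindDataById are remote services whose signatures are not given in the slides, and the object IDs, bounding boxes, and `example.org` URLs below are invented. The box test also ignores regions that cross the date line.

```python
from datetime import datetime

# Hypothetical metadata table: object ID -> (start, end, (south, north, west, east)).
CATALOG = {
    "obj-001": (datetime(2003, 1, 2, 0, 0), datetime(2003, 1, 2, 0, 6),
                (-10.0, 15.0, -160.0, -135.0)),
    "obj-002": (datetime(2003, 1, 2, 12, 0), datetime(2003, 1, 2, 12, 6),
                (30.0, 55.0, 10.0, 40.0)),
}

# Hypothetical translation table: object ID -> on-line replica URLs.
REPLICAS = {
    "obj-001": ["http://example.org/data/obj-001.hdf",
                "ftp://example.org/pool/obj-001.hdf"],
    "obj-002": ["http://example.org/data/obj-002.hdf"],
}


def geo_region_query(catalog, time_range, region):
    """Return sorted IDs of objects that overlap the time range and lat/lon box."""
    t0, t1 = time_range
    south, north, west, east = region
    hits = []
    for object_id, (start, end, (s, n, w, e)) in catalog.items():
        overlaps_time = start <= t1 and end >= t0
        overlaps_space = s <= north and n >= south and w <= east and e >= west
        if overlaps_time and overlaps_space:
            hits.append(object_id)
    return sorted(hits)


def find_data_by_id(replicas, object_ids):
    """Translate object IDs into lists of replica URLs (empty list if unknown)."""
    return {object_id: replicas.get(object_id, []) for object_id in object_ids}


ids = geo_region_query(
    CATALOG,
    (datetime(2003, 1, 2, 0, 0), datetime(2003, 1, 2, 1, 0)),
    (0.0, 20.0, -155.0, -140.0),
)
locations = find_data_by_id(REPLICAS, ids)
print(ids, locations)
```

Keeping the ID list separate from the URL list is what lets a dataset stay valid when replicas move: only the translation table needs to be refreshed.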
18 3 Dec 2009 “Smart” Data Grid Register data collections –Crawl GPS & AIRS/AMSU datasets & extract spatial bbox –Recognize AIRS granules: AIRS.2003.01.02.004.L2.RetStd.hdf Space/time matchups: GPS point to AIRS/AMSU swath –Perform matchups by spatial lookup of AIRS granules –Save matchup indices Move & cache data files –Using three AIRS products: L2.RetStd, L2.Support, L1b radiances –Workflow uses cached file or auto-caches remote file Generate merged products –Desired GPS & AIRS variables in netCDF files Register merged products as new “recognized” dataset –Run statistics workflows for monthly/seasonal/yearly statistics –Publish merged products, aggregate statistics, plots, etc.
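Recognizing granules during a crawl comes down to parsing file names. The sketch below handles only the short name form shown on this slide (`AIRS.2003.01.02.004.L2.RetStd.hdf`); file names with additional fields would need a wider pattern. The function name and the returned field names are hypothetical.

```python
import re
from datetime import date

# Pattern for the short form on the slide: AIRS.YYYY.MM.DD.GGG.<level>.<product>.hdf
GRANULE_RE = re.compile(
    r"^AIRS\.(\d{4})\.(\d{2})\.(\d{2})\.(\d{3})\.(L[0-9A-Za-z]+)\.(\w+)\.hdf$"
)


def parse_airs_granule(filename):
    """Return the granule's date, number, level, and product, or None if no match."""
    match = GRANULE_RE.match(filename)
    if match is None:
        return None
    year, month, day, granule, level, product = match.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "granule": int(granule),
        "level": level,
        "product": product,
    }


info = parse_airs_granule("AIRS.2003.01.02.004.L2.RetStd.hdf")
print(info)
```

A crawler could call this on every file it finds and register the matches, along with their extracted bounding boxes, as members of a recognized dataset.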
19 3 Dec 2009 [Demo of CVO browser GUI, SciFlo, and Webify.]
20 3 Dec 2009 Multiple Service Interfaces Web Services Remote function calls: HTTP URL or XML messaging Machine-to-machine Slice variables out of data files using “Webify” server Workflows Call web services and custom analysis operators CVO GUI for users, layered on top of automated workflow Python Programming Call services & operators directly from python Operational scripts implementing custom workflow Command line Execute workflows from command line (sflExec.py) Incorporate workflows into larger operational scripts
21 3 Dec 2009 SciFlo is Multi-Purpose Each SciFlo client/server node is multi-functional: –Provides pre-configured SOAP services (e.g. GeoRegionQuery) –Serves data via an OpenDAP server, ftp, and soon Webify –Provides a Redirection server: translate objectID -> file URL’s –Contains metadata in a relational database (mysql) –Contains an XQuery-able XML document store (dbxml) –Executes SciFlo documents (dataflow execution engine) –Serves flow results on private & shared web pages (wiki) SciFlo Software Bundle –All Open Source, Push-Button Install on Linux & Windows –Installable by each user, root/admin privileges not required –One install provides pre-configured: SOAP services, OpenDAP server, redirection, ftp, mysql, dbxml, dataflow engine, & wiki. Personal Data Center for each scientist –Electronic scientific notebook (personal, configurable) –Collaborate by sharing wiki pages & exchanging SciFlo docs.
22 3 Dec 2009 Open Source in the SciFlo Bundle SOAPpy – SOAP client & server ElementTree – XML parsing, pseudo-XPath lxml – XML parsing, XPath 1.0 Twisted, openssl – secure web server pyldap, openldap – authorization, roles mysql – relational database Sleepycat dbxml – XML database w/ XQuery 1.0 and XPath 2.0 scipy, numpy, matplotlib, basemap Other scientific libraries with python bindings wiki dojo AJAX library – client dev. Google maps widget, Google Earth animations OpenDAP – fine-grained data access, “drill down” into files OpenID – simple user credentials Parts of Globus v4 – For Grid Virtualization –Globus Security Infrastructure (GSI) –GridFTP
23 3 Dec 2009 CVO 2nd Year Plans More features for CVO browser GUI –Choose variables by generic names or by product names –Populate more analysis & visualization operators AIRS/AMSU forward model, seasonal-to-yearly trend plots, etc. Extend bias analysis –GPS-AIRS comparisons over entire AIRS mission –GPS-AMSU comparisons for several NOAA satellites –Stratify bias trends by lat/lon, season, day/night, scene Publish merged products for use by community –GPS-AIRS & GPS-AMSU variable matchups –GPS-AMSU comparisons for several NOAA satellites Documentation and User Guides –CVO user guide, installation, security setup (OpenID) –Publish observed bias trends and operator algorithms –Dual publication: science papers refer to CVO tech. paper