Deb Agarwal (UCB and LBNL) Catharine van Ingen (MSFT) Berkeley Water Center Microsoft TCI IndoFlux Meeting, Chennai, India, July 13, 2006 Designing CyberInfrastructure to Support End Science
Project Motivation l Data is now being gathered into common data archives l Data archives provide an opportunity for cross-discipline and cross-site investigations l Data analysis techniques which worked well on small data sets often do not scale l Current CS tools have evolved in support of other disciplines – Investigate their ability to facilitate data analysis
Distributed Data Sets Building BWC Water Cyberinfrastructure to Connect Data, Resources, and People Science Portal Data Harvesting and Transformations Data Cleaning, Models, Analysis Tools Computational Resources
Web Service Interface to Data and Tools Data Providers: Host Ameriflux Climate Data Statsgo Soils Data MODIS products Web-based Workbench access Tools: Statistical Graphical LAI Temp Fpar Veg Index Surf Refl NPP Albedo Choose Ameriflux Area/Transect, Time Range, Data Type Gap Fill, A technique Gap Fill, B technique Design Workflow Statistical & graphical analysis Canoak Model Site 9 Data harvest Sites 1-16 Canoak Model Site 1 Version control Network display LAI Statistical & Graphical analysis Data Cleaning Tools Data Mining and Analysis Tools Modeling Tools Visualization Tools Ecology Toolbox Compute Resources Carbon Community Workbench Climate Statsgo MODIS Import other Datasets Knowledge Generation Tools
Approach l Work closely with the end scientists to define, prototype, and test the system l Provide a solution that leverages both server-based and local desktop/laptop environments l Leverage commercial tools to the extent possible
Some Critical Capabilities l Support for versioning of data sets l Work with multiple data sets l Advanced data selection and plotting capabilities m Select data relative to an event m Simple calculation across any specified date range m Statistical information available m Plots - scatter, diurnal, time series, probability density function, tiled, correlation l Ability to access capabilities from desktop
Data Pipeline ORNL Ameriflux Site CSV Files BWC SQL Server Database Data Cube Excel Pivot Table and Chart
Data Cleaning and Versioning BWC SQL Server Database Excel spreadsheet of current data Investigator updated spreadsheet
Analysis Services Data Cube l An organized view of the data l A multi-dimensional view into the data l Can integrate multiple data sources l Define measures and dimensions m Measure – a value you want to be able to plot m Dimension – An axis you want to be able to use to select data and as axis l Calculations – define new measures
Precipitation trends and totals Summer precipitation: Tonzi and Vaira ~ 2% of total Metolius ~ 24% of total Walker Branch ~ 40% of total *Plot created by Gretchen Miller of UC Berkeley
Other applications *Plot created by Gretchen Miller of UC Berkeley
Observations by latitude *Plot created by Gretchen Miller of UC Berkeley
Observations by ecosystem type *Plot created by Gretchen Miller of UC Berkeley
Some Lessons Learned so Far l Data naming and unit consistency is critical to easy ingest of large amounts of data l Commercial tools do not necessarily provide all the right analysis capabilities directly l Scaling capabilities of the tools not yet clear l We will need tools to aid in notification of PIs
Portal Deployment l Behind the portal are a collection of databases and data cubes l Distribution for ease of use m Only see the data of interest m Private data remains stable l Distribution for scaling m Smaller queries on smaller databases take less resources m Larger databases and cubes can be replicated across machines l Batch job like infrastructure for managing very long running queries
Acknowlegements l Science Team m Dennis Baldocchi m Bev Law m Gretchen Miller l Cyberinfrastructure m Matt Rodriguez m Monte Goode l Microsoft m Tony Hey m Nolan Li l Oak Ridge National Lab CDIAC personnel l Berkeley Water Center m Yoram Rubin m Susan Hubbard
URLs and Connection Coordinates l Web Site m l Blog m l m