Berkeley Water Center Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal, LBL Catharine van Ingen,

Presentation transcript:

Berkeley Water Center
Early Experience Prototyping a Science Data Server for Environmental Data
Deb Agarwal, LBL
Catharine van Ingen, MSFT
20 September 2006

Outline
- Landscape
  - Data archives and other sources
  - Typical small group collaboration needs
  - Examples using “Ameriflux”
- Science Data Server
  - Goals and ideal capabilities
  - Approach
  - Experiences with the current system
- Next generation
  - Next set of development efforts
  - Research issues
- Conclusion

Unprecedented Data Availability

Typical Data Flow Today
[Diagram labels: Large Data Archives; Local measurements; Models]

Ameriflux Collaboration Overview
- 149 sites across the Americas
- Each site reports a minimum of 22 common measurements.
- Communal science: each principal investigator acts independently to prepare and publish data.
- Data published to and archived at Oak Ridge.
- Total data reported to date on the order of 150M half-hourly measurements.

What A Tower Sees
1. Applications of eddy covariance measurements, Part 1: Lecture on Analyzing and Interpreting CO2 Flux Measurements, Dennis Baldocchi, CarboEurope Summer Course, 2006, Namur, Belgium

Example Carbon-Climate Investigations
- Net carbon exchange for the ecosystem
- Impact of climate change on the greening of ecosystems
  - Start of leaf growth
  - Duration of photosynthesis
  - Effects of early spring on carbon uptake
- Role of ecosystem and latitude on carbon flux
- Effect of various pollution sources on carbon in atmosphere and carbon balance

Measurements Are Not Simple or Complete
- Gaps in the data
  - Quiet nights
  - Bird poop
  - High winds
  - ….
- Difficult to make measurements
  - Leaf area index
  - Wood respiration
  - Soil respiration
  - …
- Localized measurements: tower footprint
- Local investigator knowledge important
- PIs’ science goals are not uniform across the towers

Examples of Carbon-Climate Datasets
[Figure labels: Soils; Climate; Remote Sensing; Observatory datasets; Spatially continuous datasets]

Scientific Data Server
[Diagram labels: Large Data Archives; Local measurements]

Scientific Data Server - Goals
- Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
- Simplify provenance by providing a common “safe deposit box” for assembled data
- Interact simply with existing and emerging Internet portals for data and metadata download, and, over time, upload
- Simplify data assembly by adding automation
- Simplify name space confusion by adding explicit decode translation
- Support basic analyses across the entire dataset for both data cleaning and science
- Simplify mundane data handling tasks
- Simplify quality checking and data selection by enabling data browsing

Scientific Data Server - Non-Goals
- Replace the large Internet data source sites
  - The technology developed may be applicable, but the focus is on the group collaboration scale and usability
  - Very large datasets require different operational practices
- Perform complex modeling and statistical analyses
  - There are a lot of existing tools with established trust based on long track records
  - Only part of a full LIMS (laboratory information management system)
- Develop a new standard schema or controlled vocabulary
  - Other work on these is progressing independently
  - Due to the heterogeneity of the data, more than one such standard seems likely to be relevant

Scientific Data Server - Workflows
- Staging: adding data or metadata
  - New downloaded or field measurements added
  - New derived measurements added
- Editing: changing data or metadata
  - Existing older measurements re-calibrated or re-derived
  - Data cleaning or other algorithm changes
  - Gap filling
- Sharing: making the latest acquired data available rapidly
  - Even before all the checks have been made
  - Browsing new data before more detailed analyses
- Private Analysis: Supporting individual researchers (MyDB)
  - Stable location for personal calibrations, derivations, and other data transformations
  - Import/Export to analysis tools and models
- Curating: data versioning and provenance
  - Simple parent:child versioning to track collections of data used for specific uses
[Diagram labels: Large Data Archives; Local measurements]
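The "simple parent:child versioning" mentioned under Curating could look roughly like the sketch below. This is only an illustration: the slides do not describe the server's actual versioning tables, so the `DatasetVersion` and `VersionCatalog` names, their fields, and the example version identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a data collection (hypothetical structure)."""
    version_id: str
    parent_id: Optional[str]  # None for data as originally staged
    note: str                 # why this version was derived


class VersionCatalog:
    """Keeps parent:child links so any collection can be traced to its origin."""

    def __init__(self) -> None:
        self._versions: Dict[str, DatasetVersion] = {}

    def add(self, version_id: str, parent_id: Optional[str], note: str) -> DatasetVersion:
        if version_id in self._versions:
            raise ValueError(f"version {version_id!r} already exists")
        if parent_id is not None and parent_id not in self._versions:
            raise KeyError(f"unknown parent {parent_id!r}")
        version = DatasetVersion(version_id, parent_id, note)
        self._versions[version_id] = version
        return version

    def lineage(self, version_id: str) -> List[str]:
        """Return the chain of version ids from the given version back to the root."""
        chain: List[str] = []
        current: Optional[str] = version_id
        while current is not None:
            chain.append(current)
            current = self._versions[current].parent_id
        return chain


# Assumed example: staged data, then a re-calibration, then gap filling.
catalog = VersionCatalog()
catalog.add("staged", None, "downloaded measurements as staged")
catalog.add("recalibrated", "staged", "older measurements re-calibrated")
catalog.add("gapfilled", "recalibrated", "gap filling applied")
print(catalog.lineage("gapfilled"))
```

Keeping each version immutable and recording only a parent pointer is one simple way to answer "which collection of data was used for this analysis" without copying provenance text into every derived dataset.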

Scientific Data Server - Logical Overview

Databases
- All descriptive metadata and data held in relational databases
  - Metadata is important too!
- While separate databases are shown, the datasets may actually reside in a single database
  - Mapping is transparent to the scientist
  - Separate databases used for performance
  - Unified databases used for simplicity
- New metadata and data are staged with a temporary database
  - Minimal quality checks applied
  - All name and unit conversions
- Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically, or ?
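The name and unit conversions applied during staging can be illustrated with a small decode step. This is a sketch under stated assumptions, not the server's actual pipeline: the reported names (`TA`, `Tair`, `AirTemp`), the canonical name `air_temperature`, the unit labels, and the `stage` function are all invented for the example.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class Reading(NamedTuple):
    site: str
    name: str
    unit: str
    value: float


# Hypothetical decode table: names as reported by sites -> one canonical name.
NAME_DECODE: Dict[str, str] = {
    "TA": "air_temperature",
    "Tair": "air_temperature",
    "AirTemp": "air_temperature",
}

# Canonical unit for each canonical name (assumed for this example).
CANONICAL_UNIT: Dict[str, str] = {"air_temperature": "degC"}

# Conversions keyed by (reported unit, canonical unit).
UNIT_CONVERT: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("degC", "degC"): lambda v: v,
    ("degF", "degC"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("K", "degC"): lambda v: v - 273.15,
}


def stage(reading: Reading) -> Reading:
    """Decode the reported name and convert the value to the canonical unit.

    Unknown names or units raise ValueError so they are caught at staging
    time rather than silently entering the database.
    """
    if reading.name not in NAME_DECODE:
        raise ValueError(f"no decode entry for name {reading.name!r}")
    name = NAME_DECODE[reading.name]
    target_unit = CANONICAL_UNIT[name]
    convert = UNIT_CONVERT.get((reading.unit, target_unit))
    if convert is None:
        raise ValueError(f"no conversion from {reading.unit!r} to {target_unit!r}")
    return Reading(reading.site, name, target_unit, convert(reading.value))


staged = stage(Reading("site-A", "Tair", "degF", 212.0))
print(staged)
```

Rejecting unknown names outright, instead of passing them through, keeps the decode table as the single place where name space confusion is resolved.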

Data Cubes
- A data cube is a database specifically for data mining (OLAP)
  - Initially developed for commercial needs like tracking sales of Oreos and milk
  - Simple aggregations (sum, min, or max) can be pre-computed for speed
  - Additional calculations (median) can be computed dynamically
  - Both operate along dimensions such as time, site, or datumtype
  - Constructed from a relational database
  - A specialized query language (MDX) is used
- Client tool integration is evolving
  - Excel PivotTables allow simple data viewing
  - More powerful charting with Tableau or ProClarity (commercial mining tools)
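To make the idea of pre-computed aggregations along dimensions concrete, here is a toy in-memory cube. It is not how an OLAP server or MDX works internally; it only shows the principle that sum, min, max, and count can be accumulated once per combination of dimension values and then looked up. The dimension names follow the slide (`year` standing in for time, `site`, `datumtype`); the sample rows are made up.

```python
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, Iterable, Tuple

DIMENSIONS = ("year", "site", "datumtype")


def _empty_cell() -> Dict[str, float]:
    return {"sum": 0.0, "min": float("inf"), "max": float("-inf"), "count": 0}


def build_cube(rows: Iterable[Dict[str, Any]]) -> Dict[Tuple, Dict[str, float]]:
    """Pre-compute sum/min/max/count for every subset of the dimensions."""
    cube: Dict[Tuple, Dict[str, float]] = defaultdict(_empty_cell)
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                cell = cube[tuple((d, row[d]) for d in dims)]
                cell["sum"] += row["value"]
                cell["min"] = min(cell["min"], row["value"])
                cell["max"] = max(cell["max"], row["value"])
                cell["count"] += 1
    return dict(cube)


def query(cube: Dict[Tuple, Dict[str, float]], **fixed: Any) -> Dict[str, float]:
    """Look up the pre-computed cell for the given dimension values."""
    key = tuple((d, fixed[d]) for d in DIMENSIONS if d in fixed)
    return cube[key]


# Made-up sample rows for illustration only.
rows = [
    {"year": 2004, "site": "A", "datumtype": "TA", "value": 10.0},
    {"year": 2004, "site": "B", "datumtype": "TA", "value": 14.0},
    {"year": 2005, "site": "A", "datumtype": "TA", "value": 12.0},
    {"year": 2005, "site": "A", "datumtype": "CO2", "value": 380.0},
]
cube = build_cube(rows)
print(query(cube, site="A", datumtype="TA"))
```

Sum, min, max, and count can be merged from partial results, which is why they pre-compute well. A median cannot be derived from these stored cells, so it has to be computed from the underlying rows at query time, which matches the slide's distinction between pre-computed and dynamic calculations.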

Browsing For Data Availability
[Chart: Sites Reporting Data, Colored by Year]

Browsing For Data Availability
[Chart: Total Data Availability by Type, Colored by Site]
Data type reporting is far from uniform across type

Browsing for Data Availability
[Chart: Total Data Availability by Site, Colored by Type]
Sites report more data either because of longevity or specific research interests

Browsing for Data Quality
- Real field data has both short term gaps and longer term outages
- The utility of the data depends on the nature of the science being performed
- Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
[Chart: Data Count. Annotations: "Data often missing in the winter!"; "Measurements charted on axes are gaps"; "What’s going on at higher latitudes? (It should be getting colder)"]
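Browsing data counts amounts to comparing how many measurements were reported with how many were expected. The sketch below does this per month for one measurement at one site, assuming half-hourly reporting (48 samples per day, as in the Ameriflux overview). The `monthly_coverage` function and the synthetic tower that reports only from March through October are illustrative assumptions, not data from the talk.

```python
import calendar
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable

SAMPLES_PER_DAY = 48  # assumes half-hourly measurements


def monthly_coverage(timestamps: Iterable[datetime], year: int) -> Dict[int, float]:
    """Return, for each month, the fraction of expected samples actually present."""
    # Duplicated timestamps are counted once.
    counts = Counter(ts.month for ts in set(timestamps) if ts.year == year)
    coverage: Dict[int, float] = {}
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        coverage[month] = counts[month] / (days_in_month * SAMPLES_PER_DAY)
    return coverage


# Synthetic example: a tower that only reported from March through October.
reported = []
t = datetime(2005, 3, 1)
while t < datetime(2005, 11, 1):
    reported.append(t)
    t += timedelta(minutes=30)

coverage = monthly_coverage(reported, 2005)
winter_gaps = [month for month, fraction in coverage.items() if fraction == 0.0]
print(winter_gaps)  # -> [1, 2, 11, 12]
```

A coverage table like this is cheap to compute and shows at a glance whether a site-year is usable for, say, an annual carbon balance before any heavier analysis is attempted.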

Browsing for Data Quality
- Real field data has unit and time scale conversion problems
  - Sometimes easy to spot in isolation
  - Sometimes easier to spot when comparing to other data
- Browsing data values can give rapid insight into how the data can be used before more complex analyses are performed
[Chart: Maximum Annual Air Temperature. Annotation: "Global Warming or Reporting in Fahrenheit?"]
[Chart: Average Air Temperature, Two Nearby Sites. Annotations: "Odd Microclimate Effects or Error in Time Reporting?"; "Local time or GMT time?"]
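The "Global Warming or Reporting in Fahrenheit?" question suggests a simple screening rule: flag any site whose annual maximum air temperature is implausible in Celsius and show what it would be if the value were really Fahrenheit. The following is a rough sketch of that idea only. The 60 °C threshold, the `flag_unit_suspects` name, and the site values are assumptions for illustration, and a flagged site still needs a human (ideally the site's investigator) to confirm the cause.

```python
from typing import Dict


def flag_unit_suspects(
    annual_max_by_site: Dict[str, float],
    plausible_max_c: float = 60.0,  # assumed upper bound for air temperature in degC
) -> Dict[str, float]:
    """Return sites whose maximum exceeds the bound, with the value re-read as degF -> degC."""
    suspects: Dict[str, float] = {}
    for site, max_value in annual_max_by_site.items():
        if max_value > plausible_max_c:
            suspects[site] = round((max_value - 32.0) * 5.0 / 9.0, 1)
    return suspects


# Made-up annual maxima, nominally all in degC.
annual_max = {"site-A": 31.2, "site-B": 95.0, "site-C": 28.4}
print(flag_unit_suspects(annual_max))  # -> {'site-B': 35.0}
```

A similar comparison between two nearby sites, for example of the hour at which the daily temperature peak occurs, could help separate local-time from GMT reporting, though that check is not shown here.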

Lessons Learned To Date
- Metadata is as important as data
  - Comparing sites of like vegetation, climate is as important as latitude or other physical quantity
  - Curate the two together
- Controlled vocabularies are hard
  - Humans like making up names and have a hard time remembering 100+ names
  - Assume a decode step in the staging pipeline
- There are at least three database schema families and two cube construction approaches
  - Everyone has a favorite
  - Each has advantages and disadvantages
  - Automate the maintenance and use the right one for the right job
- Visual programming tools are great for prototyping
  - But debugging and maintenance can hit a wall
  - It’s easy to overbuild; use when “good enough”
- Data analysis and data cleaning are intertwined
  - Data cleaning is always on-going
  - Share the simple tools and visualizations
The saga continues at and

Near Term Futures
- Improve current capabilities
  - Assemble gap-filled and non-gap-filled data sets
  - Implement incremental data staging to enable speedy and simple data editing by an actual scientist (rather than a programmer)
  - Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
  - Add basic reporting capabilities for server-side browsing of data availability to speed and simplify locating “interesting” data
- Apply Data Server capabilities to a different set of data with different (but related) science
  - Considering either Russian River or Yosemite Valley hydrological data
  - Will be automating download from multiple different national data sets
  - Spatial (GIS) analyses more important
  - Linkage with imagery data necessary for science

Longer Term Futures
- Handling imagery and other remote sensing information
  - Curating images is different from curating time series data
  - Using both together enables new science and new insights
- Graphical selection and display of data
- Support for user specified calculations within the database
- Support for direct connections to analysis and statistical packages
- Linkage with models
  - Additional (emerging) data standards such as NetCDF
  - Handling “just in time” data delivery and model result curation
- Data mining subscription services
- Handling of a broader array of data types
- Support for workflow tools

Conclusions
- Large data archives create the opportunity to
  - Do science at the regional and global scale
  - Combine data from multiple disciplines
  - Perform historical trend analysis
- Small scientific collaborations need help to
  - Perform analyses using more data than they can currently manage
  - Enable data handling and versioning
  - Store the currently needed data and metadata
  - Browse the data for science
- It’s the science, not the computer science
  - Computer science research can certainly help

URLs
- Berkeley Water Center (BWC)
- Microsoft Project at BWC
- Ameriflux Project