Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.

Presentation transcript:

Chad Berkley, NCEAS
National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara
Long Term Ecological Research Network Office, University of New Mexico
University of Kansas
San Diego Supercomputer Center
Kepler: A Workflow Tool for Heterogeneous Ecological Data Analysis
December 4, 2003, Edinburgh, Scotland

Outline
- Quick history
- SEEK overview
- Ecological Metadata Language
- Using workflows in ecology
- Workflow editing with Kepler
- Future visions

History
- Late 1990s: patterns noticed in the problems surrounding data synthesis at NCEAS
  - Michener et al. paper on ecological metadata
- 2000: Knowledge Network for Biocomplexity (KNB)
  - Morpho, Metacat, Ecological Metadata Language
  - Some footholds into workflow creation and execution
- 2003: Science Environment for Ecological Knowledge (SEEK) grant
  - Continues the work done on the KNB grant
  - Emphasis on using metadata for advanced data processing

SEEK approach
- General approach to specific ecological problems
- Data described with adequate metadata in a grid-accessible repository
- Reasoning engine (ontology based) to locate and extract data and processes
- Modeling system to put it all together and control execution flow

SEEK Components
- EcoGrid
  - Analysis library
  - Metadata and data repository
- Semantic Mediation System
  - Controlled semantic vocabulary
  - Ontological discovery system
- Analysis and Modeling System (Kepler)
  - Workflow control system
  - Utilizes resources from other components

SEEK Architecture

Ecological Metadata Language
- Common language for archiving and transport of datasets
- XML based
- Designed for/by the ecological community
- Describes physical and logical structure of data
- Also includes project, literature and software information
- SEEK will add semantic information

Workflows in SEEK
- In the SEEK model, data ingestion/cleaning is metadata driven (specifically with EML)
- Output generation includes creating appropriate metadata
- The analysis pipeline itself becomes metadata

Metadata-driven data ingestion
- Key information needed to read and machine-process a data file is in the metadata
  - File descriptors (CSV, Excel, RDBMS, etc.)
  - Entity (table) and attribute (column) descriptions
    - Name
    - Type (integer, float, string, etc.)
    - Codes (missing values, nulls, etc.)
- In the future, this will include semantic typing
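As a minimal sketch of the idea on this slide, the Python below reads a delimited file using only a hand-written description of its attributes. The dictionary layout, field names, and sample data are hypothetical and far simpler than real EML; the point is only that names, types, and missing-value codes come from the metadata rather than from the reading code.

```python
import csv
import io

# Hypothetical, simplified stand-in for EML entity/attribute metadata.
METADATA = {
    "format": {"delimiter": ",", "header_lines": 1},
    "attributes": [
        {"name": "site", "type": "string", "missing": ["NA"]},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["-999"]},
    ],
}

# Map declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed rows, driven only by the metadata."""
    fmt = metadata["format"]
    reader = csv.reader(io.StringIO(text), delimiter=fmt["delimiter"])
    rows = list(reader)[fmt["header_lines"]:]
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            raw = raw.strip()
            if raw in attr["missing"]:
                record[attr["name"]] = None  # declared missing-value code
            else:
                record[attr["name"]] = CASTS[attr["type"]](raw)
        records.append(record)
    return records


SAMPLE = "site,count,area_m2\nA,12,4.0\nB,-999,2.5\n"
records = ingest(SAMPLE, METADATA)
```

Changing the file's delimiter, column order, or missing-value code would then mean editing the metadata, not the ingestion function.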

Metadata revision
- Metadata is revised following any transformation
- Versioning of metadata and data is very important
- This process results in a lineage of the data file as it has been transformed
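One way to picture this lineage is as a chain of metadata revisions, each pointing at the revision it was derived from. The sketch below is an illustration only: the `MetadataVersion` class, its fields, and the dataset name are hypothetical, not part of EML or SEEK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataVersion:
    """One revision of a dataset's metadata (hypothetical structure)."""
    dataset_id: str
    revision: int
    description: str
    parent: Optional["MetadataVersion"] = None


def revise(previous, description):
    """Return a new revision that records a transformation and links to its parent."""
    return MetadataVersion(previous.dataset_id, previous.revision + 1,
                           description, previous)


def lineage(version):
    """Walk parent links from the newest revision back to the original."""
    steps = []
    while version is not None:
        steps.append((version.revision, version.description))
        version = version.parent
    return list(reversed(steps))


raw = MetadataVersion("plot-survey", 1, "raw field data")
cleaned = revise(raw, "removed rows with missing counts")
scaled = revise(cleaned, "converted area to square metres")
history = lineage(scaled)
```

Because each revision is immutable and keeps its parent, earlier versions stay available after every transformation.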

Typical ecological workflow example
- Workflows can automate the integration process if data is described with adequate structured metadata

Homogeneous data integration
- Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

Heterogeneous data integration
- Integration of heterogeneous data requires much more advanced metadata and processing
  - Attributes must be semantically typed
  - Collection protocols must be known
  - Units and measurement scale must be known
  - Measurement mechanics must be known (e.g. that Density = Count/Area)
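The Density = Count/Area relationship can be made concrete with a small sketch. It assumes two hypothetical sites, one reporting raw counts with a sampled area and one already reporting density per hectare; the unit names and numbers are invented for illustration.

```python
# Conversion factors to square metres; the unit names are illustrative.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}


def density_per_m2(count, area, area_unit):
    """Derive density (individuals per square metre) from a count and a sampled area."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2


# Site A reports a count and the plot area it was taken from.
site_a = density_per_m2(count=50, area=25.0, area_unit="m2")

# Site B already reports density, but per hectare; rescale it to per m2.
site_b = 30_000.0 / AREA_TO_M2["ha"]
```

Only once both values share a quantity (density) and a unit (per square metre) can they be compared or merged, which is why the metadata has to carry units and measurement mechanics.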

Semantic typing and ontologies
- Label data with semantic types
- Label inputs and outputs of analytical components with semantic types
- Use the Semantic Mediation System (SMS) to generate transformation steps
  - Beware analytical constraints
- Use SMS to discover relevant components
- Ontology: specification of a conceptualization (a knowledge map)
(Diagram labels: Data, Ontology, Workflow Components)
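Component discovery by semantic type can be sketched with a toy ontology. Everything here is hypothetical (the concept names, the component names, and the lookup itself); it is not the SMS, only an illustration of matching a dataset's semantic label against the input types that components declare.

```python
# Toy "is-a" ontology: child concept -> parent concept. All names are hypothetical.
IS_A = {
    "PlantDensity": "Density",
    "Density": "Measurement",
    "Count": "Measurement",
}

# Analytical components labelled with the semantic types they accept and produce.
COMPONENTS = {
    "diversity_index": {"input": "Count", "output": "Measurement"},
    "density_regression": {"input": "Density", "output": "Measurement"},
    "summary_stats": {"input": "Measurement", "output": "Measurement"},
}


def is_subtype(concept, ancestor):
    """True if concept equals ancestor or reaches it by following is-a links."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


def discover(data_type):
    """Return names of components whose declared input type accepts data_type."""
    return sorted(name for name, spec in COMPONENTS.items()
                  if is_subtype(data_type, spec["input"]))


matches = discover("PlantDensity")
```

Data labelled `PlantDensity` matches components that accept `Density` or the more general `Measurement`, but not one that requires `Count`.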

Measurement Ontology
- Density is part of a larger measurement ontology
- SEEK’s intent is to create one or more community-created ecological ontologies
- Creates a controlled vocabulary for ecological metadata
- More about this in Bertram’s talk

About Kepler
- Kepler is the name of the SEEK/SDM additions to the Ptolemy modeling system
- Ptolemy was designed by the UC Berkeley EECS department
  - Primary use is modeling EE circuits
  - Free, open source, pure Java
  - Flexible design
  - GUI for building workflows

Kepler
- A Kepler model consists of linked “actors” (which correspond to workflow steps)
- Timing is controlled by a “director”
- All actors are written in Java but can call other applications (such as SAS and MATLAB, or native-language code via JNI)
- Actors can call arbitrary Web (or Grid) Services
- Ptolemy already has a very large inventory of actors
- Easy-to-use, drag ‘n drop interface
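The actor/director split can be illustrated with a deliberately small sketch. It is written in Python for brevity and does not use the Ptolemy or Kepler Java API; the `Actor` and `SequentialDirector` classes are hypothetical and model only the idea that actors do the work while a director decides when each one fires.

```python
class Actor:
    """A workflow step: consumes one input token and produces one output token."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, token):
        return self.func(token)


class SequentialDirector:
    """Decides when each actor fires; here, once each, in link order."""

    def __init__(self, actors):
        self.actors = actors

    def run(self, token):
        trace = []
        for actor in self.actors:
            token = actor.fire(token)
            trace.append((actor.name, token))
        return token, trace


pipeline = SequentialDirector([
    Actor("drop_missing", lambda xs: [x for x in xs if x is not None]),
    Actor("mean", lambda xs: sum(xs) / len(xs)),
])
result, trace = pipeline.run([2.0, None, 4.0])
```

Swapping in a different director (for example, one that fires actors repeatedly or in parallel) would change the execution timing without touching the actors themselves.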

SEEK Contributions to Kepler (so far)
- EML data ingestion actor
- Actor design tool

EML data ingestion actor
- Ingests any data format described by EML metadata
- Converts raw data to Kepler format
- Data can then be operated on with other actors
- Produces one output port for each attribute in the dataset
- Individual attributes can then be mapped to other actors
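The "one output port per attribute" behaviour amounts to turning row-oriented records into one stream of values per column. The sketch below shows that reshaping in plain Python; the function name, record layout, and sample values are hypothetical and do not reflect the actor's actual implementation.

```python
def to_ports(records, attribute_names):
    """Turn row-oriented records into one value stream ("port") per attribute."""
    ports = {name: [] for name in attribute_names}
    for record in records:
        for name in attribute_names:
            ports[name].append(record.get(name))
    return ports


records = [
    {"site": "A", "count": 12},
    {"site": "B", "count": 7},
]
ports = to_ports(records, ["site", "count"])

# A downstream step can be wired to a single port, ignoring the others.
total = sum(ports["count"])
```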

Ptolemy model with EML ingestion actor

Actor design tool
- Allows “place-holder” actors to be defined on the fly by non-programmers during workflow creation
- Domain scientists can thereby create workflows without programming knowledge
- Workflows created with these actors can be executed once their functionality is implemented by a programmer
- Allows quick prototyping of workflows by domain scientists
- “Place-holder” actors can still be linked to other working actors
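A place-holder actor can be thought of as a named step with declared inputs and outputs whose behaviour is attached later. The sketch below illustrates that idea in Python; the `PlaceholderActor` class and its methods are hypothetical and are not the actor design tool's interface.

```python
class PlaceholderActor:
    """A named step with declared ports but no behaviour yet."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = inputs
        self.outputs = outputs
        self._impl = None

    def implement(self, func):
        """Attach the real behaviour, supplied later by a programmer."""
        self._impl = func

    @property
    def executable(self):
        return self._impl is not None

    def fire(self, **kwargs):
        if self._impl is None:
            raise NotImplementedError(f"{self.name} has no implementation yet")
        return self._impl(**kwargs)


# A domain scientist declares the step while sketching the workflow.
step = PlaceholderActor("compute_density", inputs=["count", "area"], outputs=["density"])
ready_before = step.executable

# A programmer fills in the functionality afterwards.
step.implement(lambda count, area: {"density": count / area})
ready_after = step.executable
out = step.fire(count=10, area=4.0)
```

Because the ports are declared up front, the place-holder can be linked into a workflow before it is executable.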

Ptolemy and dynamically created actor

How domain scientists will benefit
- More fully automated integration systems
- A library of pre-defined analytical processes which can be executed on heterogeneous data
- Semantic data discovery and processing
- Automated unit and measurement scale conversions
- A fuller understanding of cross-site research implications

Acknowledgements
This material is based upon work supported by:
- The National Science Foundation under Grant Numbers , , and to NCEAS and its collaborators.
- The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number ), the University of California, and the UC Santa Barbara campus.
Primary Collaborators: University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
More info:
Questions?
IRC: irc.ecoinformatics.org #seek