Three Flavors of Data

Presentation transcript:

Three Flavors of Data
  - Science Data: simulations and sensor readings.
  - Catalog Data: metadata; descriptors of datasets, data products, and other processing artifacts.
  - Active Data: data associated with logging, monitoring, and scheduling compute tasks.

Three Flavors of Data (1): Science Data
  - Simulation data: solutions to the partial differential equations governing the physics of the Columbia River Estuary.
  - Sensor data: measurements of the physical characteristics, used to guide and validate simulations.
  Wanted: a simple means of specifying new data products from these raw data and computing them efficiently.
  Approach: a data manipulation language based on a GridField data model.

Three Flavors of Data (2): Catalog Data
  Explicit metadata to describe system artifacts.
  Wanted:
  - Tools to locate artifacts given descriptors (query).
  - A metadata collection facility that tolerates change:
      - The metadata we wish to collect may change (e.g., new product 'lines' are developed).
      - The source of the metadata may change (e.g., file naming conventions or directory structures evolve).
  Approach: generic database; custom collection scripts.

Three Flavors of Data (3): Active Data
  Data describing past, current, and future compute tasks.
  Wanted: tools for scheduling, monitoring, and managing...
  - individual tasks (e.g., a single data product derivation)
  - groups of interdependent tasks (e.g., a daily forecast run)
  - campaigns (e.g., a series of calibration runs followed by a re-computation of the runs of 2002 with a different implicitness)
  Approach: undecided.

Simulation Data: GridFields
  The data product suite exhibits recurring processing idioms:
  - Larger grids reduced to smaller grids. Ex: 'estuary' data products vs. 'far' data products.
  - Grids mapped to other grids. Ex: 3D grid mapped to a 2D slice.
  - Grids combined. Ex: 1D depth grid 'crossed' with a 2D horizontal grid.
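The 'grids combined' idiom can be illustrated with a small Python sketch. It is not the GridField implementation: the grid representation (a dict from cell dimension to a list of cells, each cell a tuple of node ids) and the function name `cross` are assumptions made for illustration only.

```python
from itertools import product

def cross(grid_a, grid_b):
    """Cross product of two grids (illustrative representation only).

    A grid maps cell dimension -> list of cells; a cell is a tuple of node ids.
    Each pair of cells (a, b) yields one product cell of dimension dim(a) + dim(b)
    whose nodes are all pairs of a's nodes with b's nodes.
    """
    result = {}
    for (dim_a, cells_a), (dim_b, cells_b) in product(grid_a.items(), grid_b.items()):
        for cell_a, cell_b in product(cells_a, cells_b):
            nodes = tuple(product(cell_a, cell_b))
            result.setdefault(dim_a + dim_b, []).append(nodes)
    return result

# 1D depth grid: two levels joined by one segment.
depth = {0: [("z0",), ("z1",)], 1: [("z0", "z1")]}
# 2D horizontal grid: one triangle (its edges are omitted for brevity).
horizontal = {0: [("a",), ("b",), ("c",)], 2: [("a", "b", "c")]}

volume = cross(depth, horizontal)
```

Here the single segment crossed with the single triangle gives one 3D prism cell with six nodes, which is the kind of result the 'crossed' example on the slide describes.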

Simulation Data: GridFields (2)
  We're expressing these idioms as operators over a grid-based data model.
  Advantages:
  - Simpler recipes
      - 5 ops for all the data products (plus helper functions)
  - Flexible model; fewer maintenance troubles
      - N dimensions: uniform handling of space and time (maybe more...)
      - Any cell type: segments, triangles, quadrangles, arbitrary polytopes
  - Optimization opportunities
      - Operators prescribe semantics, but not implementation
      - Topological equivalences exposed and exploited

Simulation Data: GridFields (3)
  Status:
  - Core operators functional
  - Simple examples hooked to XMVIS for viewing
  Todo:
  - Examples hooked to VTK
  - Write/test examples from the current product suite
  - Support GridFields too large for memory
  - Expose a nice syntax for writing recipes

Catalog Data: Collection
  Where is the metadata?
  - File name: 1_salt.63
  - File path: /forecasts/ /run/images/isosal_estuary7/anim-sal_estuary_7.gif
  - File content: Version: 1.04, Variable: salt, ...
  - Other files?

Collection scripts
  For each file type, the metadata collection mechanism is different:
  - gifs
  - binary output
  - Param.in
  Use one script per file type that emits the metadata for files of that type.
  Only these simple scripts need to change as the system evolves.

Example: gif animation
  /forecasts/ /.../isosal_estuary7/anim-sal_estuary_7.gif
  - CorieDate = “ ”
  - Region = “Estuary”
  - Lat = xxxx
  - Long = xxxx
  - Variable = “Salinity”
  - Type = “Animation”
  - Depth = “7”
  - product line = “isoline”
  Here, a script can just parse the path and file name.
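A path-parsing collection script of this kind might look like the following Python sketch. The naming pattern, the lookup tables, and the function name `gif_metadata` are assumptions inferred from the single example path on the slide; the real scripts (written in Perl) and the real abbreviations may differ. Only the last two path components are used, so the elided run directory is not needed.

```python
import re
from pathlib import PurePosixPath

# Hypothetical lookup tables; the actual abbreviations may differ.
VARIABLES = {"sal": "Salinity"}
PRODUCT_LINE_PREFIXES = {"iso": "isoline"}

# Assumed file naming convention: anim-<variable>_<region>_<depth>.gif
FILE_RE = re.compile(r"^anim-(?P<var>[a-z]+)_(?P<region>[a-z]+)_(?P<depth>\d+)\.gif$")

def gif_metadata(path):
    """Derive descriptors for a gif animation from its path and file name."""
    p = PurePosixPath(path)
    m = FILE_RE.match(p.name)
    if m is None:
        raise ValueError(f"unrecognized animation file name: {p.name}")
    meta = {
        "Variable": VARIABLES.get(m["var"], m["var"]),
        "Type": "Animation",
        "Region": m["region"].capitalize(),
        "Depth": m["depth"],
    }
    # The parent directory name is assumed to start with the product-line prefix.
    for prefix, line in PRODUCT_LINE_PREFIXES.items():
        if p.parent.name.startswith(prefix):
            meta["product line"] = line
            break
    return meta

meta = gif_metadata("isosal_estuary7/anim-sal_estuary_7.gif")
```

Descriptors such as CorieDate, Lat, and Long are left out because the slide does not show how they are encoded in the path.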

Example: Binary output
  Binary output needs a different mechanism than gif animations; it might be convenient to implement it in a different script.
  /forecasts/ /run/1_salt.gif
  - Variable = “Salinity”
  What about the number of nodes? Mean sea level? For these we need to access the file's content:
  1_salt.63
    nodes:
    msl: 4285
    :

Architecture
  - The reflector creates an XML file containing the metadata for each file and also stores the metadata in the database.
  - The reflector determines the file type (based on regular expressions) and calls the appropriate collection script.
  - The collection script uses an “AddItem” Perl function to return the metadata to the reflector.
  [Diagram: Reflector invokes Collection Script; Reflector writes to the Meta-data DB and to XML]
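The dispatch step of the reflector can be sketched in Python as follows. This is a stand-in for the Perl system, not its code: the patterns, the collector functions, and the keyword names in `add_item` are hypothetical, and writing the XML file and the database row is omitted.

```python
import re

def collect_gif(path, add_item):
    # A real script would parse the path and file name here.
    add_item("Type", "Animation")

def collect_binary(path, add_item):
    # A real script would open the file and read its header here.
    add_item("Type", "Binary output")

# (pattern, collection script) pairs; the patterns are illustrative guesses.
COLLECTORS = [
    (re.compile(r"\.gif$"), collect_gif),
    (re.compile(r"\.\d+$"), collect_binary),
]

def reflect(path):
    """Pick a collection script by regular expression and gather its items."""
    items = []

    def add_item(name, value, notes="", type_="string"):
        # Mirrors the AddItem(Name, Value, Notes, Type) argument order.
        items.append({"name": name, "value": value, "notes": notes, "type": type_})

    for pattern, collector in COLLECTORS:
        if pattern.search(path):
            collector(path, add_item)
            break
    return items

gif_items = reflect("anim-sal_estuary_7.gif")
binary_items = reflect("1_salt.63")
```

Passing `add_item` into each script keeps the scripts unaware of where the metadata ends up, so only the reflector would need to know about the XML file and the database.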

Metadata in XML and DB?
  These XML files give you filesystem-based access to the metadata for an artifact.
  Use “info” to present the XML in a readable form:
    /../run> info 1_salt.63
    variable: salt
    version: 1.04
    msl: 4285
    nodes:
  Also useful if the DB is inaccessible.
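A tool like “info” could be approximated by the Python sketch below. The XML layout (a `metadata` root with `item` elements carrying a `name` attribute) is an assumption, since the slides do not show the actual file format; the sample values are the ones visible on the slide.

```python
import xml.etree.ElementTree as ET

def format_info(xml_text):
    """Render metadata items as 'name: value' lines, one per item."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{item.get('name')}: {item.text}" for item in root.iter("item")
    )

# Hypothetical XML layout for the metadata of 1_salt.63.
SAMPLE = """<metadata>
  <item name="variable">salt</item>
  <item name="version">1.04</item>
  <item name="msl">4285</item>
</metadata>"""

print(format_info(SAMPLE))
```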

Minor Technical Change
  Previously we had suggested that the collection scripts should emit metadata on standard output.
  Instead, we have provided a Perl function for them to call: AddItem(Name,Value,Notes,Type)

How does this help?
  - Find artifacts via descriptors (query)
      - 'find animations showing the estuary where we used a constant bottom friction coefficient'
      - where region = “estuary” and type = “animation” and ntau = “0”
  - Write robust metadata-driven programs
      - Chris' low bandwidth zoom web app
      - Stay-Fresh PowerPoint slides
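The descriptor query above can be tried against a toy database. The Python sketch below uses SQLite with a hypothetical single-table schema (`artifact` with one column per descriptor) and made-up sample rows; the actual schema of the generic metadata database is not described in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per artifact, one column per descriptor.
conn.execute("CREATE TABLE artifact (path TEXT, region TEXT, type TEXT, ntau TEXT)")
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO artifact VALUES (?, ?, ?, ?)",
    [
        ("a.gif", "estuary", "animation", "0"),
        ("b.gif", "estuary", "animation", "1"),
        ("c.gif", "plume", "animation", "0"),
    ],
)

def find_artifacts(conn, region, type_, ntau):
    """Return the paths of artifacts matching all three descriptors."""
    cur = conn.execute(
        "SELECT path FROM artifact "
        "WHERE region = ? AND type = ? AND ntau = ? ORDER BY path",
        (region, type_, ntau),
    )
    return [row[0] for row in cur]

matches = find_artifacts(conn, "estuary", "animation", "0")
```

A metadata-driven program would consume the returned paths instead of hard-coding directory layouts, which is what makes it robust to changes in file naming.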