
7 +/- 2 Maybe Good Ideas John Caron June 2011

(1) NetCDF-Java (aka CDM) has lots of functionality, but it is only available in Java:
– NcML Aggregation
– Access to lots of other file formats
– Feature types (e.g. collections of point data)
– Ironically, some functionality (e.g. aggregation) is already available for remote datasets through OPeNDAP, but not for local datasets
How can we get the CDM into other languages?
– Replicate it in C and maintain two software stacks
– Use reverse JNI (call Java from C)
– Or …

CdmRemote Server (aka TDS Lite)
Lightweight server for CDM datasets:
– Zero configuration – use queries to configure
– Local filesystem
– Cache expensive objects
– Allow non-Java applications access to the CDM stack
– Create virtual datasets: aggregations, logical views
– Coordinate space queries
– Feature Type subsetting
– New API (!)
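
To make the "zero configuration – use queries to configure" idea concrete, here is a small Python sketch of how a non-Java client might build cdmremote-style request URLs. The server address, dataset path, and exact query-parameter names below are illustrative assumptions, not a definitive description of the protocol.

```python
from urllib.parse import urlencode

def cdmremote_request(server, dataset_path, req, **params):
    """Build a cdmremote-style request URL.

    `req` selects what to fetch: e.g. "header" for dataset structure
    and attributes, or "data" plus a variable/section spec for subsets.
    """
    query = {"req": req}
    query.update(params)
    return "%s/%s?%s" % (server.rstrip("/"), dataset_path, urlencode(query))

# Ask for the dataset header (structure and attributes):
header_url = cdmremote_request(
    "http://localhost:8080/cdmremote", "data/model.nc", "header")

# Ask for a subset of one variable, expressed as an index section:
data_url = cdmremote_request(
    "http://localhost:8080/cdmremote", "data/model.nc", "data",
    var="Temperature(0:10,40:50,100:120)")
```

A client in C or Python would then fetch these URLs and decode the response, without any server-side configuration files.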

CdmRemote Server (aka TDS Lite)
[Architecture diagram: a C client application, and Python or other clients, use the cdmRemote protocol to reach the cdmRemote server, which provides data access, coordinate systems, and the CDM point feature API over the underlying data.]

(2) Ncstream as a netCDF file format
– Write-optimized, append only
– Encodes the full CDM object model
– Uses Google’s protobuf for serialization
– Java and C libraries can read it and provide access through the standard netCDF API
– Tools to convert to netCDF-3 and netCDF-4 formats
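
Protobuf serialization rests on base-128 varints, which are also the natural way to length-prefix messages in an append-only stream. The sketch below shows varint encode/decode and a length-prefixed frame in pure Python; the framing layout is illustrative, not the exact ncstream wire format.

```python
def encode_varint(n):
    """Encode a non-negative int as a protobuf base-128 varint:
    7 payload bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # final byte
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode a varint from buf starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7

# A length-prefixed message frame, as a streaming format can use
# to append messages without rewriting earlier bytes:
payload = b"serialized CDM message bytes"
frame = encode_varint(len(payload)) + payload
length, offset = decode_varint(frame)
assert frame[offset:offset + length] == payload
```

Because each frame carries its own length, a reader can skip or scan messages sequentially, which matches the append-only, sequential-read design.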

(3) BUFR/GRIB table registration
Unidata-sponsored web service:
– Registered users can upload BUFR/GRIB tables
– A unique id is assigned (MD5 16-byte checksum?)
– Convince producers to include the id in the data, so it is unambiguous which table was used
– Anyone can download
GRIB and BUFR decoding:
– Using the CDM – find bugs!
– Might become an (ad-hoc) reference library
– Might spur objections from “the experts”
– Turn it over to WMO if they want it
Survival of the human race is at stake here
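
If the unique id is an MD5 checksum as suggested, computing it from a table file's bytes is a one-liner; here is a hedged sketch (the function name and return shape are my own, and the sample table line is made up for illustration):

```python
import hashlib

def table_id(table_bytes):
    """Return the 16-byte MD5 digest identifying an uploaded BUFR/GRIB
    table, plus its hex form for embedding in data or download URLs."""
    digest = hashlib.md5(table_bytes).digest()
    return digest, digest.hex()

raw, hex_id = table_id(b"0-0-1 | originating centre | ...\n")
assert len(raw) == 16      # MD5 digests are always 16 bytes
assert len(hex_id) == 32   # 32 hex characters
```

Any two producers uploading byte-identical tables get the same id, which is what makes "include the id in the data" unambiguous.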

(4) Streaming data / standing queries
The proposal Dennis and I submitted last year:
– “As soon as it arrives on the IDD, send me PrecipTotal from the NCEP RUC2 model, subsetted by lat/lon bounding box, in netCDF-4/CF format”
– “As it arrives, send me GTS BUFR data in a lat/lon bounding box, in CSV”
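
The core of a standing query is a persistent predicate evaluated against each arriving record. A minimal Python sketch of the bounding-box case (the record fields and feed names here are hypothetical, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class StandingQuery:
    """A persistent request evaluated against every arriving record:
    keep records from one feed that fall inside a lat/lon bounding box."""
    feed: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def matches(self, record):
        return (record["feed"] == self.feed
                and self.lat_min <= record["lat"] <= self.lat_max
                and self.lon_min <= record["lon"] <= self.lon_max)

# Subscribe to GTS BUFR reports over a rough continental-US box:
query = StandingQuery("GTS-BUFR", 24.0, 50.0, -125.0, -66.0)

incoming = [
    {"feed": "GTS-BUFR", "lat": 40.0, "lon": -105.3},  # Boulder: inside
    {"feed": "GTS-BUFR", "lat": 51.5, "lon": -0.1},    # London: outside
]
hits = [r for r in incoming if query.matches(r)]
```

A real service would apply such predicates as data arrives and then re-encode matching records (netCDF-4/CF, CSV) before delivery.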

TDS: current IDD data access
[Diagram: the LDM pushes datasets (headers) from the IDD to files on disk; the TDS, through the CDM library, serves pull requests for that data.]

Content-based filtering (standing requests)
[Diagram: the LDM pushes IDD data content through a pipe to a content filter service, which evaluates standing requests, filters on content, and changes the encoding before handing results to a message service. Protocol?]

(5) Python
– Unidata should choose a scripting language to support, and give scientists full access to all of our tools in it
– Python wants to be the open-source Matlab
– DOE and BADC have bought into Python
– Python is a safe choice

(6) NetCDF management tools
Develop a consistent set of tools for managing collections of netCDF files:
– Use existing tools (ncgen, nccopy, ncdump, nco, etc.) under the covers, but don’t be constrained by their interfaces
– Look at RDBMS management languages
– Use a scripting language like Python
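
One way to use existing tools "under the covers" without inheriting their one-file-at-a-time interfaces is a Python layer that plans invocations over a whole collection. A minimal sketch (the function name and collection handling are my own; `nccopy -d n` is the real deflate-level option):

```python
def plan_compression(filenames, target_dir, deflate_level=4):
    """Plan one nccopy invocation per netCDF file in a collection,
    compressing each with zlib deflation level `deflate_level`.
    Returns argument vectors; a driver would pass them to subprocess.run."""
    commands = []
    for name in sorted(filenames):
        if not name.endswith(".nc"):
            continue  # skip non-netCDF files in the collection
        dest = target_dir.rstrip("/") + "/" + name.split("/")[-1]
        commands.append(["nccopy", "-d", str(deflate_level), name, dest])
    return commands

cmds = plan_compression(["grids/b.nc", "grids/a.nc", "grids/README"],
                        "compressed")
```

Separating planning from execution also makes the collection-level operation easy to preview, log, or parallelize, which single-file tools cannot do on their own.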

(7) Hadoop
– Open source, started by Doug Cutting (Lucene) and Yahoo
– Based on Google’s MapReduce for parallel processing
– Lots of industry use; part of new data ecosystems
– Objects stored in a distributed, replicated file system
– Commodity, shared-nothing hardware nodes
– Simple key-value store
– Append-only, sequential reading
– Scales to arbitrarily large amounts of data (batch)
– Gather many queries and run them over the data
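
The MapReduce model behind Hadoop can be shown in a few lines of single-process Python: map each record to key-value pairs, group ("shuffle") by key, then reduce each group. This is a toy sketch of the programming model, not of Hadoop's distributed machinery; the station records are invented for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce: map each record to (key, value)
    pairs, shuffle by key, then reduce each key's list of values."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # the "shuffle": group by key
    return {key: reducer(key, values) for key, values in groups.items()}

# Records: (station id, temperature). Find each station's maximum.
obs = [("KDEN", 21.0), ("KBOS", 17.5), ("KDEN", 25.5), ("KBOS", 16.0)]
maxima = map_reduce(obs,
                    mapper=lambda rec: [(rec[0], rec[1])],
                    reducer=lambda key, vals: max(vals))
# maxima == {"KDEN": 25.5, "KBOS": 17.5}
```

In real Hadoop the records live in the replicated file system, mappers run on the nodes holding the data, and the shuffle moves pairs across the network; the append-only, sequential-read storage design exists to make exactly this batch pattern fast.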

(8) SciDB
Michael Stonebraker, David DeWitt:
– “SciDB will be optimized for data management of big data and for big analytics.”
– “The scientists that are participating in our open source project believe that the SciDB database — when completed — will dramatically impact their ability to conduct their experiments faster and more efficiently and further improve the quality of life on our planet by enabling them to run experiments that were previously impossible due to the limitations of existing database systems and infrastructure.”
Getting involved:
1. Load netCDF/HDF5 into SciDB
2. “Native mode” – leave data in netCDF/HDF5
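
Loading netCDF arrays into an array database means repartitioning them into regular chunks that can be distributed and scanned in parallel. The sketch below illustrates that chunking idea in pure Python for a 2-D grid; it is a conceptual illustration, not SciDB's actual storage format.

```python
def chunk_array(shape, chunk, values):
    """Split a 2-D array (flat row-major list) into fixed-size chunks
    keyed by chunk coordinates, the way array databases partition
    large arrays for distribution and parallel scans."""
    nrows, ncols = shape
    crows, ccols = chunk
    chunks = {}
    for i in range(nrows):
        for j in range(ncols):
            key = (i // crows, j // ccols)   # which chunk owns cell (i, j)
            chunks.setdefault(key, []).append(values[i * ncols + j])
    return chunks

# A 4x4 grid split into 2x2 chunks -> four chunks of four cells each:
grid = list(range(16))
parts = chunk_array((4, 4), (2, 2), grid)
```

The "native mode" alternative would keep the bytes in netCDF/HDF5 files and have the database map its chunk coordinates onto the files' own chunking, avoiding a bulk load entirely.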