PI: Prof. Yelena Yesha and Prof. Milton Halem. Sponsored by NASA.


A Scalable Workflow Scheduling and Gridding of CALIPSO Lidar/Infrared Data
PI: Prof. Yelena Yesha and Prof. Milton Halem; Sponsored by NASA
Presented by: Phuong Nguyen and Frank Harris
IAB Meeting Research Report, Dec 18, 2012

Project objectives
- Apply the scalable workflow scheduling system developed by CHMPR/MC2 (implemented on top of Hadoop) to a real Big Data scientific use case: analysis of global climate change from decadal satellite infrared radiance records stored in two distinct archives, obtained from the AIRS and MODIS instruments
- Perform gridding and subsequent monthly, seasonal, and annual trend inter-comparisons with surface temperatures from ground-station records, and compare with model reanalysis output
- Grid other satellite data, such as CALIPSO Lidar aerosols, and deliver gridded data products

A Scalable Workflow Scheduling System

A Scalable Workflow Scheduling System
We have developed a scalable workflow scheduling system, implemented on top of Apache Hadoop, that:
- Expresses and dynamically schedules parallel data-intensive workflow computations: data flows in a Directed Acyclic Graph (DAG) rather than a control flow
- Optimizes the level of concurrency
- Shares cluster resources using fine-grained scheduling (HybS)
- Supports scientific data formats (e.g., HDF) and computation on float arrays
- Provides a predictive performance model
Available Java APIs: DagJob, DagBuilder, Graphs, etc., plus libraries: gridding routines, statistics routines, and statistical models
Available HybS Hadoop plug-in scheduler, configurable to run as the Hadoop scheduler in the current Hadoop 1.0.1 distribution
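The DAG-style data-flow scheduling can be illustrated with a minimal sketch. This is pure Python for illustration only; the project's actual APIs (DagJob, DagBuilder) are Java, and the task names below are hypothetical.

```python
from collections import defaultdict, deque

def schedule_waves(deps):
    """Group DAG tasks into waves: every task in a wave has all of its
    dependencies satisfied, so a wave can run fully in parallel. This is
    the concurrency a DAG data-flow scheduler tries to exploit."""
    indeg = {t: len(d) for t, d in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for p in parents:
            children[p].append(task)
    ready = deque(t for t, n in indeg.items() if n == 0)
    waves = []
    while ready:
        wave = sorted(ready)          # all currently runnable tasks
        ready.clear()
        waves.append(wave)
        for t in wave:                # "completing" a wave releases children
            for c in children[t]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    ready.append(c)
    return waves

# Hypothetical gridding workflow: upload -> grid -> {stats, anomalies} -> report
deps = {"upload": [], "grid": ["upload"],
        "stats": ["grid"], "anomalies": ["grid"],
        "report": ["stats", "anomalies"]}
waves = schedule_waves(deps)
```

Here `stats` and `anomalies` land in the same wave, so a scheduler aware of the data-flow graph can run them concurrently instead of serializing them as a control-flow script would.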

Use case: global climate change from AIRS and MODIS
Atmospheric Infrared Sounder (AIRS)
- 14-40 km footprint
- 2378 IR spectral channels
- 5.5 TB / year (L1B); 55 TB total; 876,000 HDF files, each 135x90x2378 (28,892,700 elements), ~60 MB
Moderate Resolution Imaging Spectroradiometer (MODIS)
- 1-4 km footprint (infrared)
- 16 IR spectral channels
- 17 TB / year (L1B); 170 TB total; 1,051,200 HDF files
Data product: 10-year AIRS FCDR anomalies at 0.5°x1.0° lon-lat (~100 km), 2002-2012

Global climate change from AIRS and MODIS

AIRS gridding using MapReduce approaches
Step 1: Parallel upload of AIRS/MODIS HDF files from NFS/PVFS into Hadoop HDFS
Step 2: Run AIRS/MODIS gridding as MapReduce jobs; output is written to HBase tables
Step 3: Analyze the gridded data from the HBase tables, or load data out of HBase/HDFS to store HDF files in NFS/PVFS for other analyses
Gridding with MapReduce:
- Map function: input is an HDF file; output is (key, value) pairs, where the key is a grid cell (lat x lon) and the value is an array of sums and counts of radiances for all spectral channels
- Reduce function: averages all values with the same key and writes the output into HBase tables
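The map and reduce steps above can be sketched in-memory. This is an illustrative Python sketch, not the project's Hadoop code, and it handles one radiance value per observation rather than the full spectral-channel array; the observation tuples are made up.

```python
from collections import defaultdict

def map_obs(obs, res=1.0):
    """Map: emit (grid cell, (sum, count)) for one radiance observation.
    obs = (lat, lon, radiance); the key is the res-degree cell
    containing the footprint."""
    lat, lon, rad = obs
    key = (int(lat // res), int(lon // res))
    return key, (rad, 1)

def reduce_cells(pairs):
    """Reduce: average all radiances that fall into the same grid cell."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, (s, c) in pairs:
        acc[key][0] += s
        acc[key][1] += c
    return {key: s / c for key, (s, c) in acc.items()}

obs = [(10.2, 40.7, 250.0), (10.8, 40.1, 260.0), (35.5, -70.3, 240.0)]
gridded = reduce_cells(map_obs(o) for o in obs)
```

Emitting (sum, count) rather than a running mean is what makes the reduce step (and a map-side combiner) associative, so partial results can be merged in any order.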

Spatial data locality
- A bounding box is implemented (image source: David Chapman)
- Local reduce (combine) before the shuffle
- Output is stored in HBase tables for queries, e.g., monthly, seasonal, and annual trend inter-comparisons
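The "reduce local before shuffle" step is essentially a combiner. A sketch of the saving, assuming each map task holds (sum, count) pairs per grid cell (illustrative Python, not the project's code):

```python
def combine(pairs):
    """Combiner: merge (sum, count) pairs per grid cell on the map side,
    so only one record per cell crosses the shuffle instead of one
    record per observation."""
    acc = {}
    for key, (s, c) in pairs:
        ps, pc = acc.get(key, (0.0, 0))
        acc[key] = (ps + s, pc + c)
    return list(acc.items())

# One map task's output: 4 records covering 2 cells shrink to 2 records.
map_output = [((10, 40), (250.0, 1)), ((10, 40), (260.0, 1)),
              ((11, 40), (240.0, 1)), ((11, 40), (244.0, 1))]
combined = combine(map_output)
```

Because satellite footprints arrive in swath order, a bounding box keeps spatially adjacent observations in the same map task, so the combiner collapses many records per cell and shuffle traffic drops sharply.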

Improvement of AIRS gridding
- Estimated based on daily gridding
- Used a bounding box for spatial data locality
- Gridding: compared with the regular, embarrassingly parallel method, a 35% improvement in total processing time
- Benefits: scaling, failure handling, gridding at high resolutions, and queries via random data access on HBase tables

Gridding CALIPSO Lidar aerosols
Background
- Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO) is an infrared/lidar satellite, a joint project between NASA and CNES (France)
- Fourth satellite in the A-Train formation; follows CloudSat by 15 s and Aqua by 165 s
- Launched in 2006

Instruments
- Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP): detects the return of 20 ns laser pulses at 1064 nm (IR) and 532 nm (visible); 333 m footprints at full spatial resolution
- Imaging Infrared Radiometer (IIR): provides a 3-channel infrared product at 8.65, 10.6, and 12.05 μm at 1 km spatial resolution
- Wide Field Camera (WFC): a 1-channel visible product at 1 km resolution

Progress
- Developed a serial gridder in C; tested on a subset of IIR data
- Acquired 14 months of IIR data: 333 days, averaging 1.5 GB per day, ~500 GB total
- In addition, downloaded 2 months of CALIOP data, bringing the total to 625 GB (~3.7 TB/year at full rate)

Gridded product
- Full 360°x180° image; at full resolution the image is 36000x18000 pixels and 2.4 GB in size
- Shows the expected swath path for a sun-synchronous satellite and the limited coverage of nadir imaging
- A subset of the gridded image shows high detail within individual swaths, but also significant moiré interference as a result of the current gridding algorithm
- Plan to improve the gridding via inverse-distance-weighting interpolation in the near future
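The planned inverse-distance-weighting interpolation could look like this sketch. The talk only states the intent, so this is an assumed formulation (standard Shepard weighting), not the project's implementation:

```python
def idw(samples, x, y, power=2.0, eps=1e-12):
    """Inverse distance weighting: estimate the value at (x, y) from
    (xi, yi, vi) samples, weighting each sample by 1/d**power.
    Nearby swath pixels dominate the estimate, which fills gaps
    between sparse nadir tracks and smooths moiré artifacts."""
    num = den = 0.0
    for xi, yi, vi in samples:
        d2 = (x - xi) ** 2 + (y - yi) ** 2
        if d2 < eps:                 # query point sits on a sample
            return vi
        w = d2 ** (-power / 2.0)     # 1 / d**power, computed from d^2
        num += w * vi
        den += w
    return num / den

samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
```

A point equidistant from two samples gets their mean, and the estimate approaches a sample's value as the query point approaches it, so the interpolated field stays consistent with the original 333 m footprints.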

What's next
- Acquire the rest of the dataset (3 TB IIR, 22 TB CALIOP)
- Once the naïve sequential approach is done, process the data using MapReduce
- Interference, sparse coverage, and file-size problems can be addressed by significantly lowering the product resolution to 1°x1°
- Use the NCAR Graphics library instead of reusing built-from-scratch internal code
- Produce gridded products, plus monthly and yearly averages
- Possible scientific applications: solar reflectances to generate cloud maps; using altimetry data from CALIOP as a correction for existing datasets

Project status
- Developed AIRS and MODIS gridding and analysis using MapReduce approaches (making use of the workflow system)
- Demonstrated gridding of CALIPSO data using a serial approach
Future work
- Grid CALIPSO using the MapReduce approach
- Test, evaluate, and produce data products
- Phuong Nguyen: working on the open-source workflow system and the HybS Hadoop plug-in scheduler

Publications
- Phuong Nguyen, Tyler Simon, Milton Halem, David Chapman, and Quang Le, "A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment", 5th IEEE/ACM International Conference on Utility and Cloud Computing, 2012.
- Phuong Nguyen, David Chapman, Jeff Avery, and Milton Halem, "A Near Fundamental Decadal Data Record of AIRS Infrared Brightness Temperatures", IEEE Geoscience and Remote Sensing Symposium, 2012.
- Phuong Nguyen, PhD dissertation, "Data Intensive Scientific Compute Model for Multiple Core Clusters", submitted to UMBC, Dec 3, 2012.
- Phuong Nguyen and Milton Halem, "A MapReduce Workflow System for Architecting Scientific Data Intensive Applications", ACM International Workshop on Software Engineering for Cloud Computing, in the ICSE 2011 proceedings.

Questions?

Backup slides

Difference between scientific workflows and business workflows
(Source: Service-Oriented Computing: Semantics, Processes, Agents, UCL Department of Computer Science, August 2004)

Scientific workflows                                      | Business workflows
Thousands of service instances (partners)                 | Far fewer service instances
Thousands of basic service invocations; tens of thousands | Far fewer invocations and
of SOAP messages                                          | SOAP messages
Large numbers of sub-workflows for parallel execution     | Far fewer opportunities for
                                                          | parallel execution
Very large amounts of data to be analysed routinely       | Far smaller amounts of data
                                                          | to be analysed

BPEL is primarily targeted at business workflows. Scientific workflows differ in a number of ways; the main difference is one of scale along several dimensions.

Partners, invocations, messages: Compared to scientific workflows, business workflows usually define a relatively small number of BPEL partners with whom to interact. Scientific workflows may involve thousands of service instances that need to be modelled as partners; they often execute thousands of basic service invocations and consequently exchange tens of thousands of SOAP messages among service partners. Business workflows, in the majority of instances, operate on a smaller scale.

Parallel execution: Scientific workflows apply complex computational models that generate large amounts of data and then analyse it. Such workflows therefore contain large numbers of similar, independent sub-workflows that may be executed concurrently, for example to run models concurrently and to filter and extract data resulting from an experiment. Business workflows do not usually display such massively parallel execution of similar sub-workflows on such a scale.

Amount of data: Scientific workflows need powerful data manipulation primitives; think of a data analysis pipeline.

Experiment vs. process: A scientific workflow represents an experiment that is likely to be run only a limited number of times before new ideas and insights need to be incorporated, so frequent changes and re-deployment must be supported and kept simple. A business workflow captures a set of activities and their relationships in order to describe a business process; the overall aim is to automate this process and execute it repeatedly, over possibly long periods of time. © Singh & Huhns

Background: MapReduce/Hadoop
- Distributed computation on a large cluster
- Each job consists of Map and Reduce tasks
- Job stages: Map tasks run computations in parallel; the shuffle combines intermediate Map outputs; Reduce tasks run computations in parallel
(Source slide: Brian Cho, UIUC)

Background: MapReduce/Hadoop (continued)
- Map input and Reduce output are stored in a distributed file system (e.g., HDFS)
- Scheduling decides which task to run on empty resources (slots) across concurrent jobs
(Source slide: Brian Cho, UIUC)
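The map → shuffle → reduce stages summarized above can be simulated in a few lines. This is an illustrative sketch, not Hadoop code, using word count as the canonical example:

```python
from itertools import groupby
from operator import itemgetter

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map every record to (key, value)
    pairs, shuffle (sort and group by key), then reduce each group."""
    intermediate = [kv for rec in records for kv in mapper(rec)]
    intermediate.sort(key=itemgetter(0))          # the shuffle/sort phase
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))}

counts = run_job(["a b a", "b c"],
                 mapper=lambda line: [(w, 1) for w in line.split()],
                 reducer=lambda k, vs: sum(vs))
```

In real Hadoop the three stages run on different machines and the shuffle moves data over the network, which is why schedulers (such as HybS here) and combiners matter for data-intensive jobs.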

Why a new workflow scheduling system?
Characteristics of data-intensive scientific applications:
- Experiments are repeated on different data
- Computations on high-dimensional arrays: spatial, temporal, spectral
- Variety of data formats; math libraries are needed
- Complex components, e.g., model prediction
- Lack of a scientific workflow system that deals with scale: scalability, reliability, scheduling, data management, provenance, low overhead
Current limitations:
- Hadoop is a scalable system built to run at large scale (e.g., on 8000-core commodity clusters)
- Key performance metrics still need improvement
- Limited support for scientific applications

HBase design for multiple satellites: gridded data

Row key: rowKeyId, e.g., AIRS_20050101_ch528_20N20S10E10W
Column families: Resolution-Statistics (100km_AvgBT, 100km_Stdev, 1km_AvgBT, 250m_AvgBT, ...) and GeolocationData (Lat, ...)

- HBase data model: (Table, RowKey, Family, Column, Timestamp) → Value; HBase indexes on the row key value
- Row key design for multiple satellite instruments: <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>
- Column families, e.g., Resolution-Statistics: column_100km, column_1km
- Spatial index: lat/lon bounding box
- Indexed by instrument, date/time, spatial index, and spectral channel
- Scan rows (and selected columns) into MapReduce computations
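The row-key scheme can be sketched as follows. The helper is hypothetical; only the key layout and the example key come from the slide:

```python
def make_row_key(instrument, date, channel, spatial):
    """Compose <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>.
    HBase stores rows sorted lexicographically by key, so rows sharing a
    prefix (same instrument, then same date) are physically adjacent and
    a prefix scan reads them as one contiguous range."""
    return f"{instrument}_{date}_ch{channel}_{spatial}"

key = make_row_key("AIRS", "20050101", 528, "20N20S10E10W")

# Keys for consecutive days sort in time order under the same instrument
# prefix, which is what makes time-range scans efficient.
k1 = make_row_key("AIRS", "20050101", 528, "20N20S10E10W")
k2 = make_row_key("AIRS", "20050102", 528, "20N20S10E10W")
```

Putting the instrument first means a monthly or seasonal query for one instrument touches one key range instead of scattering reads across the table; the trade-off is that a single instrument's writes concentrate on one region at a time.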