Tera/Petabyte data distribution architectures Chris A. Mattmann USC-CSE Annual Research Review Monday, June 15, 2015

Presentation transcript:

15-Jun-15 MATTMANN-ARR CAM-2
Outline
Research Problem and Importance
Background and Related Work
Problem Statement
Approach
Evaluation Strategy
Conclusions

Research Problem and Importance
Volume of data returned from scientific experiments and media content providers growing rapidly
– Planetary Data System
  • Current: 20 terabytes for all NASA missions
  • Growing to: over 200 terabytes from a single mission!
– Orbiting Carbon Observatory
  • Current: hundreds of gigabytes to a single terabyte
  • Growing to: over 150 terabytes!
* Projected as of 1/11/04
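
To put these volumes in perspective, here is a rough back-of-the-envelope sketch of how long one full copy would take over a single link. The volumes are the ones quoted on the slide; the link speeds (100 Mbit/s and 1 Gbit/s), the use of decimal terabytes, and the function name `transfer_days` are assumptions made only for illustration, and the estimate ignores protocol overhead and contention.

```python
def transfer_days(volume_terabytes: float, link_megabits_per_s: float) -> float:
    """Days needed to move a volume over a link at its full, sustained rate."""
    total_bits = volume_terabytes * 1e12 * 8          # decimal terabytes to bits
    seconds = total_bits / (link_megabits_per_s * 1e6)
    return seconds / 86_400                           # 86,400 seconds per day


# Volumes come from the slide; link speeds are illustrative assumptions.
volumes_tb = {
    "Planetary Data System (single mission)": 200,
    "Orbiting Carbon Observatory": 150,
}
for name, terabytes in volumes_tb.items():
    for mbps in (100, 1000):
        days = transfer_days(terabytes, mbps)
        print(f"{name}: {terabytes} TB at {mbps} Mbit/s takes about {days:.1f} days")
```

Under these assumptions, 200 terabytes needs roughly 185 days at a sustained 100 Mbit/s, which is why the choice of distribution mechanism matters.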

Research Problem and Importance
National Cancer Institute’s Early Detection Research Network (EDRN)
– Current: tens of gigabytes to hundreds of gigabytes
– Growing to: hundreds of gigabytes to terabytes
Question: how to distribute these voluminous data sets?

Distributing Large Volumes of Data
Use existing infrastructure?
HTTP/REST?
Issues:
– Scalability?
– Single entry point?
– Limited bandwidth?
– What about other distribution mechanisms?
RMI, SOAP, GridFTP

Distributing Large Volumes of Data
Few data movement mechanisms in place for scientists, students, educators, etc. to get their data
– EDRN: HTTP/REST
– National Space Science Data Archive: FTP
– Physical Oceanography Data Active Archive Center: FTP, and Aspera commercial UDP technology
– Even Google: HTTP/REST, SOAP
Even when there are many mechanisms in place, how do we select the correct one?
Sometimes, we may even need to use them in concert
– Certain users may only be able to get data from GridFTP, while others may require HTTP/REST
– HTTP combined with a UDP based mechanism may speed up the transfer

Distributing Large Volumes of Data
Understanding the Tradeoffs
– HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard
  • It’s good in many situations, but not all situations
– Same goes for many of the other distribution mechanisms
  • RMI is scalable, but ties you to Java; Peer-to-Peer is highly scalable and efficient, but may neglect dependability and consistency
Understanding how many different data movement technologies there are:
– GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, BitTorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW
– …and that’s just off the top of my head!
Understanding the classes of data movement technologies

Software Architecture
The definition of a system in the form of its canonical building blocks
– Software Components: the computational units in the system
– Software Connectors: the communications and interactions between software components
– Software Configurations: arrangements of components and connectors and the rules that guide their composition
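
The three building blocks can be sketched as a small data model. This is only an illustration of the definitions above, not code from the talk: the class names, the single composition rule (a connector may only join components that already exist), and the component names `archive` and `science-client` are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """An interaction between two components, referenced by name."""
    kind: str      # e.g. "HTTP/REST" or "GridFTP"
    source: str
    target: str


@dataclass
class Configuration:
    """An arrangement of components and connectors plus one composition rule."""
    components: dict = field(default_factory=dict)
    connectors: list = field(default_factory=list)

    def add_component(self, component: Component) -> None:
        self.components[component.name] = component

    def attach(self, connector: Connector) -> None:
        # Composition rule (assumed for illustration): both ends must exist.
        for end in (connector.source, connector.target):
            if end not in self.components:
                raise ValueError(f"unknown component: {end}")
        self.connectors.append(connector)


# Hypothetical two-component system joined by one connector.
config = Configuration()
config.add_component(Component("archive"))
config.add_component(Component("science-client"))
config.attach(Connector("HTTP/REST", source="archive", target="science-client"))
print(len(config.components), "components,", len(config.connectors), "connector")
```

Keeping the connector as a first-class object, rather than burying it inside a component, is what lets a distribution mechanism be swapped without touching the components it joins.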

A Software Architectural View of the Data Distribution Problem
…Understanding the architectures of existing data systems

A Software Architectural View of the Data Distribution Problem
…Deciding the appropriate software connectors for data distribution (and their combinations) to use

A Software Architectural View of the Data Distribution Problem
…Satisfying specified user scenarios for data distribution

A Software Architectural View of the Data Distribution Problem
…Making these people happy!

Research Question
What types of software connectors are best suited for delivering these huge amounts of data to users, satisfying their particular scenarios, in a manner that is performant and scalable, in these hugely distributed data systems?

Problem Statement
Identifying and selecting suitable software connectors for data distribution* that satisfy user specified constraints
Use eight key dimensions of data distribution
– Literature review
– Our own experience in the context of planetary science and cancer research at JPL
– User specified constraints on eight dimensions are data distribution scenarios
Identification of four basic distribution connector classes
– RPC, P2P, Grid, Event-based
– What classes are appropriate for which distribution scenarios?
* Referred to as “distribution connectors” or “data distribution connectors”

Eight Dimensions of Data Distribution

Eight Dimensions of Data Distribution
Total Volume - the total amount of data that needs to be transferred from providers of data to consumers of data.
Number of Delivery Intervals - the number, size and frequency (timing) of intervals that the volume of data should be delivered within.
Performance Requirements - any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario.
Number of Users - the number of unique users that the data volume needs to be delivered to.
Number of User Types - the number of unique user types, such as scientists or students, that the data volume needs to be delivered to.
Data Types - the number of different data types that are part of the total volume to be delivered.
Geographic Distribution - the geographic distribution of the data providers and consumers.
Access Policies - the number and types of access policies in place at each producer and consumer of data.
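
Since a data distribution scenario is defined as a set of user-specified constraints on these eight dimensions, one way to picture it is as a record with one field per dimension. The sketch below is an assumption-laden illustration: the field names, types, units, validation rules, and the example values are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, field

# The four performance aspects named in the Performance Requirements dimension.
PERFORMANCE_ASPECTS = {"scalability", "efficiency", "consistency", "dependability"}


@dataclass
class DistributionScenario:
    """One user-specified scenario: a value for each of the eight dimensions."""
    total_volume_bytes: int
    delivery_intervals: int
    performance_requirements: set
    number_of_users: int
    number_of_user_types: int
    number_of_data_types: int
    geographic_distribution: str
    access_policies: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject requirement names outside the four aspects listed above.
        unknown = set(self.performance_requirements) - PERFORMANCE_ASPECTS
        if unknown:
            raise ValueError(f"unknown performance requirements: {sorted(unknown)}")
        if self.total_volume_bytes < 0 or self.delivery_intervals < 1:
            raise ValueError("volume must be >= 0 and intervals must be >= 1")


# Hypothetical scenario: values are invented for illustration only.
scenario = DistributionScenario(
    total_volume_bytes=150 * 10**12,
    delivery_intervals=12,
    performance_requirements={"consistency", "scalability"},
    number_of_users=500,
    number_of_user_types=2,
    number_of_data_types=3,
    geographic_distribution="WAN",
    access_policies=["authenticated download"],
)
print(scenario.total_volume_bytes // 10**12, "TB in", scenario.delivery_intervals, "intervals")
```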

Approach
Classification
Categorization
Integration
Testing/Evaluation

Evaluation Strategy
Empirical evaluation using real world systems
– NASA Planetary Data System
– NASA Orbiting Carbon Observatory Mission
– National Cancer Institute’s Early Detection Research Network
Quantifiably measure
– consistency (data delivered is data sent)
– efficiency (memory footprint and data throughput)
– scalability (data volume and number of hosts)
– dependability (uptime, number of faults)
Compare to off-the-shelf connector solutions
– OODT, GridFTP, Aspera, UFTP, BitTorrent, possibly more
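
Two of these measures, consistency and the throughput part of efficiency, can be sketched with the standard library alone. This is a minimal illustration and not the evaluation harness used in the study: a local file copy via `shutil.copyfile` stands in for a real connector, SHA-256 is an assumed choice of checksum, and the names `sha256_of` and `measure_transfer` are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile
import time


def sha256_of(path: str) -> str:
    """Checksum a file in 1 MiB chunks so large products fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def measure_transfer(transfer, source: str, destination: str) -> dict:
    """Run transfer(source, destination) and report consistency and throughput."""
    started = time.perf_counter()
    transfer(source, destination)
    elapsed = max(time.perf_counter() - started, 1e-9)  # guard against zero
    size = os.path.getsize(source)
    return {
        # Consistency: data delivered is data sent.
        "consistent": sha256_of(source) == sha256_of(destination),
        # Throughput in decimal megabytes per second.
        "megabytes_per_second": size / 1e6 / elapsed,
    }


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "product.dat")
    destination = os.path.join(workdir, "delivered.dat")
    with open(source, "wb") as handle:
        handle.write(os.urandom(4 * 1024 * 1024))   # 4 MiB of test data
    # Local copy is a stand-in for a real distribution connector.
    result = measure_transfer(shutil.copyfile, source, destination)
    print(result["consistent"], round(result["megabytes_per_second"], 1))
```

Because `measure_transfer` takes the transfer as a callable, the same measurement could wrap any connector that moves a file from one path to another.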

Current Progress
Preliminary Study with NASA’s Planetary Data System
Classified and Compared Data Movement Technologies
– Parallel TCP/IP technologies
  • GridFTP, bbFTP
– UDP bursting technologies
  • Aspera, UFTP
– Baseline technologies
  • SCP, FTP, HTTP
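
The classification on this slide amounts to a small lookup table. The sketch below simply encodes the three classes and their members as listed; the dictionary layout and the `classify` helper are illustrative assumptions.

```python
# Classes and members exactly as listed on the slide.
TECHNOLOGY_CLASSES = {
    "parallel TCP/IP": ["GridFTP", "bbFTP"],
    "UDP bursting": ["Aspera", "UFTP"],
    "baseline": ["SCP", "FTP", "HTTP"],
}

# Reverse index keyed by lowercase name for case-insensitive lookup.
CLASS_BY_TECHNOLOGY = {
    technology.lower(): class_name
    for class_name, technologies in TECHNOLOGY_CLASSES.items()
    for technology in technologies
}


def classify(technology: str) -> str:
    """Return the class of a technology, or raise KeyError if it is not listed."""
    try:
        return CLASS_BY_TECHNOLOGY[technology.lower()]
    except KeyError:
        raise KeyError(f"unclassified technology: {technology}") from None


for name in ("GridFTP", "UFTP", "SCP"):
    print(name, "is a", classify(name), "technology")
```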

Experimental Results
Classified and Evaluated each technology against data distribution dimensions
Measured transfer rate
– LAN-based
– WAN-based
– Varied dataset sizes from 10s of MBs to 10s of GBs
Ease to operate, ease to install
UDP technologies not testable on WAN (firewall, security, ease to configure)
* GridFTP (blue), bbFTP (red), FTP (green)

Conclusions
Proposed approach for classifying, selecting and evaluating different software connectors for data distribution
Preliminary results suggest parallel TCP/IP technologies beneficial in real world system (PDS)
Currently formalizing connector metadata and developing connector XML profiles

Questions?
Thanks for your attention!

Backup

Refereed Papers
C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May.
C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th.
N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September.
C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp LNCS 3489, St. Louis, Missouri, May 14th-15th.
C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.

Refereed Papers
J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7.
C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp Oslo, Norway, June 12th-15th.
C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.