1 Tera/Petabyte data distribution architectures
Chris A. Mattmann
USC-CSE Annual Research Review
Monday, June 15, 2015

2 Outline
Research Problem and Importance
Background and Related Work
Problem Statement
Approach
Evaluation Strategy
Conclusions

3 Research Problem and Importance
Volume of data returned from scientific experiments and media content providers growing rapidly
  – Planetary Data System
      – Current: 20 terabytes for all NASA missions
      – Growing to: over 200 terabytes from a single mission!
  – Orbiting Carbon Observatory
      – Current: hundreds of gigabytes to a single terabyte
      – Growing to: over 150 terabytes!
* Projected as of 1/11/04

4 Research Problem and Importance
National Cancer Institute’s Early Detection Research Network (EDRN)
  – Current: tens of gigabytes to hundreds of gigabytes
  – Growing to: hundreds of gigabytes to terabytes
Question: how to distribute these voluminous data sets?

5 Distributing Large Volumes of Data
Use existing infrastructure?
HTTP/REST?
Issues:
  – Scalability?
  – Single entry point?
  – Limited bandwidth?
  – What about other distribution mechanisms?
RMI, SOAP, GridFTP

6 Distributing Large Volumes of Data
Few data movement mechanisms in place for scientists, students, educators, etc. to get their data
  – EDRN: HTTP/REST
  – National Space Science Data Archive: FTP
  – Physical Oceanography Distributed Active Archive Center: FTP, and Aspera commercial UDP technology
  – Even Google: HTTP/REST, SOAP
Even when there are many mechanisms in place, how do we select the correct one?
Sometimes, we may even need to use them in concert
  – Certain users may only be able to get data from GridFTP, while others may require HTTP/REST
  – HTTP combined with a UDP-based mechanism may speed up the transfer

7 Distributing Large Volumes of Data
Understanding the Tradeoffs
  – HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard
      – It’s good in many situations, but not all situations
  – Same goes for many of the other distribution mechanisms
      – RMI is scalable, but ties you to Java; Peer-to-Peer is highly scalable and efficient, but may neglect dependability and consistency
Understanding how many different data movement technologies there are:
  – GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, BitTorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW
  – …and that’s just off the top of my head!
Understanding the classes of data movement technologies

8 Software Architecture
The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition
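
The three building blocks on this slide can be sketched as plain data types. This is only an illustrative sketch, not code from the talk: the class names mirror the slide's terms, while the `attach` method, its composition rule, and the example names `archive`, `science_client`, and `http_rest` are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """A communication or interaction path between two components."""
    name: str
    source: Component
    target: Component


@dataclass
class Configuration:
    """An arrangement of components and connectors plus a composition rule."""
    components: set = field(default_factory=set)
    connectors: list = field(default_factory=list)

    def attach(self, connector: Connector) -> None:
        # Composition rule (assumed for illustration): a connector may only
        # link components that already belong to this configuration.
        for end in (connector.source, connector.target):
            if end not in self.components:
                raise ValueError(f"{end.name} is not in the configuration")
        self.connectors.append(connector)


# Hypothetical example: an archive serving a client over an HTTP/REST connector.
archive = Component("archive")
client = Component("science_client")
config = Configuration(components={archive, client})
config.attach(Connector("http_rest", archive, client))
```

Modelling the connector as its own type, rather than as a method call buried inside a component, is what makes it possible to swap one distribution mechanism for another without touching the components.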

9 A Software Architectural View of the Data Distribution Problem
…Understanding the architectures of existing data systems

10 A Software Architectural View of the Data Distribution Problem
…Deciding the appropriate software connectors for data distribution (and their combinations) to use

11 A Software Architectural View of the Data Distribution Problem
…Satisfying specified user scenarios for data distribution

12 A Software Architectural View of the Data Distribution Problem
…Making these people happy!

13 Research Question
What types of software connectors are best suited for delivering these huge amounts of data to the users, that satisfy their particular scenarios, in a manner that is performant and scalable, in these hugely distributed data systems?

14 Problem Statement
Identifying and selecting suitable software connectors for data distribution* that satisfy user-specified constraints
Use eight key dimensions of data distribution
  – Literature review
  – Our own experience in the context of planetary science and cancer research at JPL
  – User-specified constraints on eight dimensions are data distribution scenarios
Identification of four basic distribution connector classes
  – RPC, P2P, Grid, Event-based
  – What classes are appropriate for which distribution scenarios?
* Referred to as “distribution connectors” or “data distribution connectors”

15 Eight Dimensions of Data Distribution

16 Eight Dimensions of Data Distribution
Total Volume - the total amount of data that needs to be transferred from providers of data to consumers of data.
Number of Delivery Intervals - the number, size and frequency (timing) of intervals that the volume of data should be delivered within.
Performance Requirements - any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario.
Number of Users - the number of unique users that the data volume needs to be delivered to.
Number of User Types - the number of unique user types, such as scientists or students, that the data volume needs to be delivered to.
Data Types - the number of different data types that are part of the total volume to be delivered.
Geographic Distribution - the geographic distribution of the data providers and consumers.
Access Policies - the number and types of access policies in place at each producer and consumer of data.
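
One way to make the eight dimensions concrete is to treat a distribution scenario as a record with one field per dimension. The sketch below is an assumption for illustration only: the field names, types, and units are invented here, and every value in the example scenario is a made-up placeholder rather than a measurement from the talk.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DistributionScenario:
    """User-specified constraints on the eight dimensions (hypothetical schema)."""
    total_volume_bytes: int              # Total Volume
    delivery_intervals: int              # Number of Delivery Intervals
    performance_requirements: frozenset  # Performance Requirements
    num_users: int                       # Number of Users
    num_user_types: int                  # Number of User Types
    num_data_types: int                  # Data Types
    geographic_distribution: str         # Geographic Distribution
    access_policies: tuple               # Access Policies


TB = 10**12  # decimal terabyte, an assumed unit convention

# Placeholder scenario: all values below are invented for the example.
scenario = DistributionScenario(
    total_volume_bytes=150 * TB,
    delivery_intervals=12,
    performance_requirements=frozenset({"scalability", "consistency"}),
    num_users=500,
    num_user_types=2,
    num_data_types=3,
    geographic_distribution="WAN",
    access_policies=("authenticated-download",),
)

# One field per dimension listed on the slide.
dimension_names = [f.name for f in fields(DistributionScenario)]
```

Keeping the scenario immutable (`frozen=True`) means the same record can be compared against several candidate connectors without risk of one evaluation altering it.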

17 Approach
Classification
Categorization
Integration
Testing/Evaluation

18 Evaluation Strategy
Empirical evaluation using real-world systems
  – NASA Planetary Data System
  – NASA Orbiting Carbon Observatory Mission
  – National Cancer Institute’s Early Detection Research Network
Quantifiably measure
  – consistency (data delivered is data sent)
  – efficiency (memory footprint and data throughput)
  – scalability (data volume and number of hosts)
  – dependability (uptime, number of faults)
Compare to off-the-shelf connector solutions
  – OODT, GridFTP, Aspera, UFTP, BitTorrent, possibly more
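
Two of the measures above lend themselves to a short sketch: consistency can be checked by comparing a digest of the data sent with a digest of the data delivered, and throughput is bytes delivered divided by elapsed time. The code below is a minimal sketch under stated assumptions, not the evaluation harness used in the study: `measure_transfer` and `sha256_of` are hypothetical helper names, and a local `shutil.copyfile` stands in for a real connector such as GridFTP.

```python
import hashlib
import os
import shutil
import tempfile
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def measure_transfer(transfer, src, dst):
    """Run transfer(src, dst); return (consistent, throughput in bytes/second)."""
    start = time.perf_counter()
    transfer(src, dst)
    elapsed = time.perf_counter() - start
    # Consistency: the data delivered is the data sent.
    consistent = sha256_of(src) == sha256_of(dst)
    # Throughput: delivered bytes over wall-clock transfer time.
    throughput = os.path.getsize(dst) / elapsed if elapsed > 0 else float("inf")
    return consistent, throughput


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sent.bin")
    dst = os.path.join(tmp, "delivered.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(1 << 20))  # 1 MiB of random test data
    # A local copy is only a stand-in for a real distribution connector.
    consistent, throughput = measure_transfer(shutil.copyfile, src, dst)
```

Because `measure_transfer` takes the transfer as a callable, the same measurement could in principle wrap any connector that moves a file from one path to another.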

19 Current Progress
Preliminary Study with NASA’s Planetary Data System
Classified and Compared Data Movement Technologies
  – Parallel TCP/IP technologies
      – GridFTP, bbFTP
  – UDP bursting technologies
      – Aspera, UFTP
  – Baseline technologies
      – SCP, FTP, HTTP
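
As a rough illustration of the idea behind the parallel TCP/IP class, the sketch below divides a file of a given size into contiguous byte ranges, one per stream, so that each range could be sent over its own TCP connection. It is not the GridFTP or bbFTP wire protocol; `split_ranges` and its parameters are hypothetical names chosen for this example.

```python
def split_ranges(total_size, streams):
    """Split [0, total_size) into at most `streams` contiguous (start, end) ranges.

    Each range is half-open: it covers bytes start .. end - 1.
    """
    if total_size < 0 or streams < 1:
        raise ValueError("total_size must be >= 0 and streams must be >= 1")
    base, extra = divmod(total_size, streams)
    ranges = []
    start = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue  # fewer bytes than streams: skip the empty ranges
        ranges.append((start, start + length))
        start += length
    return ranges


# Example: a 10-byte payload over 4 streams.
example = split_ranges(10, 4)
```

The ranges are non-overlapping and cover the whole file, so a receiver that writes each range at its starting offset reassembles the original data regardless of the order in which the streams finish.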

20 Experimental Results
Classified and evaluated each technology against data distribution dimensions
Measured transfer rate
  – LAN-based
  – WAN-based
  – Varied dataset sizes from 10s of MBs to 10s of GBs
Ease of operation, ease of installation
UDP technologies not testable on WAN (firewall, security, ease of configuration)
* GridFTP (blue), bbFTP (red), FTP (green)

21 Conclusions
Proposed approach for classifying, selecting and evaluating different software connectors for data distribution
Preliminary results suggest parallel TCP/IP technologies are beneficial in a real-world system (PDS)
Currently formalizing connector metadata and developing connector XML profiles

22 Questions?
Thanks for your attention!

23 Backup

24 Refereed Papers
C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May 2006.
C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th, 2006.
N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September, 2005.
C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp. 322-336. LNCS 3489, St. Louis, Missouri, May 14th-15th, 2005.
C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp. 68-77. LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.

25 Refereed Papers
J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7, 2004.
C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp. 255-264. Oslo, Norway, June 12th-15th, 2004.
C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.

