
Big Data: Movement, Crunching, and Sharing Guy Almes, Academy for Advanced Telecommunications 13 February 2015.


1 Big Data: Movement, Crunching, and Sharing Guy Almes, Academy for Advanced Telecommunications 13 February 2015

2 Overarching theme Understanding the interplay among data movement, crunching, and sharing is key.

3 This is a persistent theme
- Mid-1980s: NSF launched two closely related programs
  - The NSF Supercomputer Centers brought HPC and the emergent computational science to the mainstream of NSF-funded research
  - The NSFnet program, needed to connect science users to those supercomputers, resulted in connecting all our research universities to the Internet
- File transfer of huge (e.g., one megabyte!) files was a key issue
- Thus, A&M connected to NSFnet in August 1987

4 An ongoing theme
- The Internet “outgrew” the narrow mission of connecting universities to supercomputers
- But, in its broad missions, it often neglects the big-data needs of university researchers
- Thus, having spawned the commercial Internet in the early 1990s, the universities created Internet2 in 1996
- Again, a dramatic improvement in our ability to move huge (e.g., one gigabyte) files
- Note the TeraGrid network as a false step

5 To the present
- First, note A&M innovation in the ScienceDMZ, so that key data-intensive resources, e.g., the gridftp servers of the Brazos high-throughput cluster, have direct access to the wide-area networks (LEARN, Internet2, ESnet, etc.)
- Recently, that wide-area infrastructure has been upgraded to 100 Gb/s
- We’ll look at these in turn

6 ScienceDMZ
- You can achieve high-speed wide-area flows only if packet loss is very, very small and the MTU is not small
- This fails if you try to extend these flows into the general-purpose campus LAN
- Beginning in 2009, we designed the Data Intensive Network to place key resources adjacent to the wide-area network
- This idea, called “ScienceDMZ” and popularized by ESnet, is now widely adopted across the country
- If both source and destination of a high-speed wide-area flow are on ScienceDMZs, very good performance can be achieved
- Example: gridftp servers supporting flows to/from the 240 TByte file system for the Brazos high-throughput cluster
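The dependence on packet loss and MTU can be illustrated with the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≤ (MSS / RTT) · (C / √p), with C ≈ √(3/2). The sketch below is not from the talk: the model applies to classic Reno-style congestion control, and the RTT, loss rates, and MSS values are assumptions chosen only for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate upper bound on single-flow TCP throughput in bits/s.

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)).
    All inputs here are illustrative assumptions, not measurements.
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    rtt = 0.050  # assumed 50 ms wide-area round-trip time
    # MSS for a 1500-byte MTU vs. a 9000-byte (jumbo frame) MTU,
    # after subtracting 40 bytes of IPv4 + TCP headers.
    for mss in (1460, 8960):
        for loss in (1e-3, 1e-5, 1e-7):
            mbps = mathis_throughput_bps(mss, rtt, loss) / 1e6
            print(f"MSS={mss:5d} B  loss={loss:.0e}  ~{mbps:10.1f} Mb/s")
```

Under this model, throughput scales linearly with MSS and with 1/√p, which is why both a large MTU and a very low loss rate matter for long-RTT flows.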

7 100 Gb/s Upgrade
- The Internet2 backbone is built around 100-Gb/s circuits (with up to 80 such circuits/lambdas per fiber)
- With a combination of NSF and local funding, LEARN is evolving to 100 Gb/s:
  - Now: 100 Gb/s College Station to Houston
  - Now: 100 Gb/s Houston to Internet2 backbone at Greenspoint
  - Now: 100 Gb/s Austin to Dallas
  - Now: 100 Gb/s Dallas to Internet2 backbone at Kansas City
  - Future: 100 Gb/s College Station to Dallas, and Austin to San Antonio to Houston
- This would then result in a consistent 100-Gb/s wide-area infrastructure
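As a quick worked figure for the backbone numbers above, the following sketch multiplies the per-lambda rate by the maximum lambda count. The function name is hypothetical, and the result is a theoretical per-fiber ceiling rather than provisioned capacity.

```python
def fiber_capacity_tbps(lambdas=80, rate_gbps=100):
    """Theoretical aggregate capacity of one fiber in Tb/s.

    Defaults come from the slide: up to 80 lambdas at 100 Gb/s each.
    """
    return lambdas * rate_gbps / 1000


if __name__ == "__main__":
    print(f"Up to {fiber_capacity_tbps():.0f} Tb/s per fiber")
```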

8 Summary of the current good situation
- ScienceDMZ and the emerging 100-Gb/s infrastructure permit very good end-to-end performance to resources on the ScienceDMZ
- Software tools such as gridftp and Globus Online, and discipline-specific tools such as Phedex, permit wide-area flows in excess of 1 TByte/hour to be sustained from other high-end sites
- Emerging “Advanced Layer-2 Services”, based on software-defined network techniques, may be very important
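To put the 1 TByte/hour figure in network terms, the sketch below converts it to Gb/s and estimates how long the 240 TByte Brazos file system from slide 6 would take to move at a given sustained rate. It assumes decimal units (1 TByte = 10^12 bytes) and ignores protocol overhead; the function names are hypothetical.

```python
BYTES_PER_TBYTE = 1e12  # assumption: decimal terabytes
SECONDS_PER_HOUR = 3600


def tbyte_per_hour_to_gbps(tbytes_per_hour):
    """Convert a sustained rate in TByte/hour to Gb/s."""
    bits_per_hour = tbytes_per_hour * BYTES_PER_TBYTE * 8
    return bits_per_hour / SECONDS_PER_HOUR / 1e9


def transfer_hours(size_tbytes, rate_gbps):
    """Hours needed to move size_tbytes at a sustained rate_gbps."""
    bits = size_tbytes * BYTES_PER_TBYTE * 8
    return bits / (rate_gbps * 1e9) / SECONDS_PER_HOUR


if __name__ == "__main__":
    rate = tbyte_per_hour_to_gbps(1)
    print(f"1 TByte/hour is about {rate:.2f} Gb/s")
    print(f"240 TByte at that rate: {transfer_hours(240, rate):.0f} hours")
    print(f"240 TByte at 100 Gb/s: {transfer_hours(240, 100):.1f} hours")
```

So 1 TByte/hour corresponds to a little over 2 Gb/s sustained, a small fraction of a 100-Gb/s path.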

9 Crunching
- Several key computing resources are already on A&M’s ScienceDMZ
  - Parallel file system of Brazos
  - Similarly for Eos
  - Similarly for Ada, a new very large x86 cluster
  - Emerging: Power7 cluster and eventually (?) the BlueGene cluster
- Data-moving servers are attached to the parallel file systems of these resources
- And, using tools such as Globus Online, large data flows can be achieved to the computing resources of NSF/XSEDE and the DoE

10 Sharing
- Things are more primitive here. One can only point to:
  - A few discipline-specific examples, e.g., the Phedex system of the Large Hadron Collider’s CMS collaboration
  - Some key tools:
    - InCommon / Shibboleth provide federated identity/authentication
    - Globus Online provides some support for controlled sharing
- But, generally, this situation does not meet our need to share data among key scientific collaborations

11 An important work in progress Controlled high-performance sharing of data is key to effective scientific collaboration in the big-data world

