PSC Blacklight, a Large Hardware-Coherent Shared Memory Resource In TeraGrid Production Since 1/18/2011
SG-WG Update | Sanielevici | March 18, 2011 | © 2010 Pittsburgh Supercomputing Center

2 Why Shared Memory?
Enable memory-intensive computation.
Enable data exploration: statistics, machine learning, visualization, graph-based informatics.
Increase users' productivity: algorithm expression, interactivity, rapid prototyping, ISV apps, high-productivity languages, …
Change the way we look at data. Boost scientific output. Broaden participation.

3 (figure-only slide; no transcribed text)

4 PSC's Blacklight (SGI Altix® UV 1000)
Programmability + Hardware Acceleration → Productivity
2×16 TB of cache-coherent shared memory
– hardware coherency unit: 1 cache line (64 B)
– 16 TB exploits the processor's full 44-bit physical address space (see the short calculation below)
– ideal for fine-grained shared memory applications, e.g. graph algorithms, sparse matrices
32 TB addressable with PGAS languages, MPI, and hybrid approaches
– low latency and a high injection rate support one-sided messaging
– also ideal for fine-grained shared memory applications
NUMAlink® 5 interconnect
– fat-tree topology spanning the full UV system; low latency, high bisection bandwidth
– transparent hardware support for cache-coherent shared memory, message pipelining and transmission, collectives, barriers, and optimization of fine-grained, one-sided communications
– hardware acceleration for PGAS, MPI, gather/scatter, remote atomic memory operations, etc.
Intel Nehalem-EX processors: 4096 cores (2048 cores per SSI)
– 8 cores per socket, 2 hardware threads per core, 4 flops/clock, 24 MB L3, Turbo Boost, QPI
– 4 memory channels per socket → strong memory bandwidth
– x86 instruction set with SSE 4.2 → excellent portability and ease of use
SUSE Linux operating system
– supports OpenMP, p-threads, MPI, and PGAS models → high programmer productivity
– supports a huge number of ISV applications → high end-user productivity
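As a quick check of the 44-bit figure above (this calculation is added here; it is not part of the original slide):

    2^44 bytes = 2^4 × 2^40 bytes = 16 TiB ≈ 17.6 × 10^12 bytes

so a 16 TB single system image uses essentially the entire physical address range the Nehalem-EX can express.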

5 Programming Models & Languages
UV supports an extremely broad range of programming models and languages for science, engineering, and computer science.
– Parallelism
  Coherent shared memory: OpenMP, POSIX threads ("p-threads"), OpenMPI, q-threads (see the minimal OpenMP sketch below)
  Distributed shared memory: UPC, Co-Array Fortran*
  Distributed memory: MPI, Charm++
  Linux OS and standard languages enable users' domain-specific languages, e.g. NESL
– Languages
  C, C++, Java, UPC, Fortran, Co-Array Fortran*
  R, R-MPI
  Python, Perl, …
→ Rapidly express algorithms that defy distributed-memory implementation.
→ To existing codes, offer TB memory and high concurrency.
* pending F2008-compliant compilers
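As a concrete illustration of the coherent-shared-memory row above, here is a minimal OpenMP sketch in C (written for this summary, not taken from the slides; the array size is arbitrary). Every thread reads and writes one large array that lives in a single address space, with no explicit communication or data partitioning:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* One large array in a single shared address space; on a machine like
           Blacklight this could grow far beyond any single node's DRAM. */
        size_t n = (size_t)1 << 28;              /* ~268M doubles ≈ 2 GB, as an example */
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        double sum = 0.0;

        /* All threads see the same array; the hardware keeps their caches coherent. */
        #pragma omp parallel for reduction(+:sum)
        for (size_t i = 0; i < n; i++) {
            a[i] = (double)i;
            sum += a[i];
        }

        printf("threads=%d sum=%e\n", omp_get_max_threads(), sum);
        free(a);
        return 0;
    }

Built with a standard flag such as cc -fopenmp, the same program runs unchanged whether OMP_NUM_THREADS is 4 or 2048; on Blacklight the only practical difference is how much of the 16 TB the array is allowed to occupy.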

6 ccNUMA memory (a brief review; 1)
ccNUMA: cache-coherent non-uniform memory access.
Memory is organized into a non-uniform hierarchy, where each level takes longer to access:
– registers: 1 clock
– L1 cache, ~32 kB per core: ~4 clocks
– L2 cache, ~ kB per core: ~11 clocks
– L3 cache, ~1-3 MB per core, shared between cores: ~40 clocks
– DRAM attached to a processor ("socket"): O(200) clocks (1 socket)
– DRAM attached to a neighboring processor on the node: O(200) clocks (~2-4 sockets)
– DRAM attached to processors on other nodes: O(1500) clocks (many sockets)
Cache coherency protocols ensure that all data is maintained consistently in all levels of the memory hierarchy. The unit of consistency should match the processor, i.e. one cache line. Hardware support is required to maintain this memory consistency at acceptable speeds (see the sketch below for why the cache-line granularity matters in practice).
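Because the coherence unit is a full 64-byte cache line, per-thread data that happens to share a line is kept coherent in hardware whether or not that was intended. The following hedged C/OpenMP sketch (illustrative only, not from the slides; thread and iteration counts are arbitrary) shows the resulting "false sharing" effect and the usual padding fix:

    #include <omp.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    100000000L

    /* Unpadded: all counters fit in one or two cache lines, so every increment
       triggers coherence traffic between cores and sockets. */
    volatile long counters[NTHREADS];

    /* Padded: one counter per 64-byte cache line, so each thread owns its line. */
    struct { volatile long v; char pad[64 - sizeof(long)]; } padded[NTHREADS];

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                counters[id]++;                /* contended cache line(s) */
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                padded[id].v++;                /* private cache line */
        }
        double t2 = omp_get_wtime();

        printf("false sharing: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
        return 0;
    }

On a multi-socket ccNUMA machine the first loop is typically several times slower than the second, even though the arithmetic is identical; that is the cache-line coherence the slide describes, working against an unfavorable data layout.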

7 Blacklight Architecture: Blade ("node pair")
Each blade is a node pair connected by NUMAlink-5 (NL5); each node comprises a UV Hub, an Intel Nehalem EX-8 socket (QPI), and 64 GB of RAM.
Topology: fat tree, spanning all 4096 cores.
Per SSI: 128 sockets, 2048 cores, 16 TB hardware-enabled coherent shared memory.
Full system: 256 sockets, 4096 cores, 32 TB (PGAS, MPI, or hybrid parallelism).

8 I/O and Grid
/bessemer: PSC's center-wide Lustre filesystem.
$SCRATCH: Zest-enabled
– high-efficiency scalability (designed for O(10^6) cores), low-cost commodity components, lightweight software layers, end-to-end parallelism, client-side caching and software parity, and a unique model of load-balancing outgoing I/O onto high-speed intermediate storage followed by asynchronous reconstruction to a 3rd-party parallel file system.
Gateway ready: GRAM5, GridFTP, comshell, Lustre-WAN, …
P. Nowoczynski, N. T. B. Stone, J. Yanovich, and J. Sommerfield, "Zest: Checkpoint Storage System for Large Supercomputers," Petascale Data Storage Workshop '08. papers/Nowoczynski_Zest_paper_PDSW08.pdf

9 Memory-Intensive Analysis Use Cases
Algorithm Expression
– Implement algorithms and analyses, e.g. graph-theoretical, for which distributed-memory implementations have been elusive or impractical.
– Enable rapid, innovative analyses of complex networks.
Interactive Analysis of Large Datasets
– Example: fit the whole ClueWeb09 corpus into RAM to enable development of rapid machine-learning algorithms for inferring relationships.
– Foster totally new ways of exploring large datasets: interactive queries and deeper analyses limited only by the community's imagination.

10 User Productivity Use Cases
Rapid Prototyping
– Rapid development of algorithms for large-scale data analysis
– Rapid development of "one-off" analyses
– Enable creativity and exploration of ideas
Familiar Programming Languages
– Java, R, Octave, etc.
ISV Applications
– ADINA, Gaussian, VASP, …
Vast memory accessible from even a modest number of cores.
Leverage tools that scientists, engineers, and computer scientists already know and use. Lower the barrier to using HPC.

11 Data crisis: genomics
– DNA sequencing machine throughput increasing at a rate of 5x per year.
– Hundreds of petabytes of data will be produced in the next few years.
– Moving and analyzing these data will be the major bottleneck in this field.

12 Genomics analysis: two basic flavors
Loosely-coupled problems
– Sequence alignment: read many short DNA sequences from disk and map them to a reference genome.
– Lots of disk I/O.
– Fits well with the MapReduce framework.
Tightly-coupled problems
– De novo assembly: assemble a complete genome from the short fragments generated by sequencers.
– Primarily a large graph problem (a toy sketch of the underlying k-mer table follows below).
– Works best with a lot of shared memory.
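To make "primarily a large graph problem" concrete: de Bruijn-graph assemblers (ABySS, used on Blacklight, is one) start by building an in-memory table of all distinct k-mers in the reads, roughly one entry per position in the genome, and that table is what consumes the memory. The toy C sketch below (written for this summary; it is not ABySS code, and the read, k, and table size are invented for illustration) shows the core operation of packing k-mers into integers and counting the distinct ones:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define K 21
    #define TABLE_BITS 20                       /* ~1M slots for the toy example */
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint64_t table[TABLE_SIZE];          /* 0 = empty; otherwise kmer code + 1 */

    static int base2bits(char c) {
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; case 'T': return 3; default: return -1; }
    }

    /* Insert a 2-bit-encoded k-mer with linear probing; returns 1 if it was new.
       (The toy table is never close to full, so probing always terminates.) */
    static int insert_kmer(uint64_t kmer) {
        uint64_t h = (kmer * 0x9E3779B97F4A7C15ULL) >> (64 - TABLE_BITS);
        while (table[h] != 0) {
            if (table[h] == kmer + 1) return 0; /* already present */
            h = (h + 1) & (TABLE_SIZE - 1);
        }
        table[h] = kmer + 1;
        return 1;
    }

    int main(void)
    {
        const char *read = "ACGTACGTGGATCCAATTGGCCAACGTACGT";   /* toy read */
        size_t len = strlen(read), distinct = 0, filled = 0;
        uint64_t kmer = 0, mask = (1ULL << (2 * K)) - 1;

        for (size_t i = 0; i < len; i++) {
            int b = base2bits(read[i]);
            if (b < 0) { filled = 0; kmer = 0; continue; }    /* skip ambiguous bases */
            kmer = ((kmer << 2) | (uint64_t)b) & mask;
            if (++filled >= K)
                distinct += insert_kmer(kmer);
        }
        printf("distinct %d-mers in read: %zu\n", K, distinct);
        return 0;
    }

For a real genome this table holds billions of entries plus edge and coverage information, which is why the memory figures two slides later scale with genome size, and why a single large coherent memory is such a comfortable fit for de novo assembly.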

13 Sequence Assembly of Sorghum
Sarah Young and Steve Rounsley (University of Arizona)
PSC Blacklight: EARLY illumination
– Tested various genomes, assembly codes, and parameters to determine the best options for plant genome assemblies.
– Performed assembly of a 600+ Mbase genome of a member of the Sorghum genus on Blacklight using ABySS.
– Sequence assemblies of this type will be key to the iPlant Collaborative; larger plant assemblies are planned in the future.

14 What can a machine with 16 TB of shared memory do for genomics?
Exploring efficient solutions to both loosely- and tightly-coupled problems:
Sequence alignment
– Experimenting with the use of ramdisk to alleviate I/O bottlenecks and increase performance.
– Configuring Hadoop to work on a large shared-memory system.
– Increasing productivity by allowing researchers to use the simple, familiar MapReduce framework.
De novo assembly of huge genomes
– A human genome, with 3 gigabases (Gb) of DNA, typically requires GB of RAM to assemble.
– Cancer research requires hundreds of these assemblies.
– Certain important species, e.g. Loblolly pine, have genomes ~10x larger than the human genome, requiring terabytes of RAM to assemble.
– Metagenomics (sampling unknown microbial populations): no theoretical limit to how many base pairs one might assemble together (100x more than a human assembly!).
Pinus taeda (Loblolly Pine)

15 Thermodynamic Stability of Quasicrystals
Max Hutchinson and Mike Widom (Carnegie Mellon University)
PSC Blacklight: EARLY illumination
– A leading proposal for the thermodynamic stability of quasicrystals depends on the configurational entropy associated with tile flips ("phason flips").
– Exploring the entropy of symmetry-broken structures whose perimeter is an irregular octagon will allow an approximate theory of quasicrystal entropy to be developed, replacing the actual discrete tilings with a continuum system modeled as a dilute gas of interacting tiles.
– Quasicrystals are modeled by rhombic/octagonal tilings, for which enumeration exposes thermodynamic properties.
– The enumeration is a breadth-first search over a graph that grows super-exponentially with system size, with very little locality; nodes must carry arbitrary-precision integers (a generic sketch of such a search appears below).
(Figure: T(1) = 8; T(7) = graph for the 3,3,3,3 quasicrystal)
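The enumeration described above is, at its core, a breadth-first traversal of an enormous graph held entirely in memory. Below is a minimal, generic BFS skeleton in C (an illustration only; it is not the authors' code, and the CSR graph representation, the tiny example graph, and all names are invented for this sketch):

    #include <stdio.h>
    #include <stdlib.h>

    /* Compressed sparse row (CSR) adjacency: the neighbors of vertex v are
       adj[off[v] .. off[v+1]-1].  In the quasicrystal application each vertex
       would also carry arbitrary-precision bookkeeping, omitted here. */
    typedef struct {
        long  nverts;
        long *off;      /* length nverts + 1 */
        long *adj;      /* length off[nverts] */
    } Graph;

    /* Level-synchronous BFS from `root`; returns each vertex's BFS level (-1 = unreached). */
    long *bfs(const Graph *g, long root)
    {
        long *level = malloc(g->nverts * sizeof *level);
        long *queue = malloc(g->nverts * sizeof *queue);
        for (long v = 0; v < g->nverts; v++) level[v] = -1;

        long head = 0, tail = 0;
        level[root] = 0;
        queue[tail++] = root;

        while (head < tail) {
            long v = queue[head++];
            for (long e = g->off[v]; e < g->off[v + 1]; e++) {
                long w = g->adj[e];
                if (level[w] == -1) {            /* first visit: essentially random access */
                    level[w] = level[v] + 1;
                    queue[tail++] = w;
                }
            }
        }
        free(queue);
        return level;
    }

    int main(void)
    {
        /* Tiny 4-vertex example: a square 0-1-3-2-0. */
        long off[] = {0, 2, 4, 6, 8};
        long adj[] = {1, 2, 0, 3, 0, 3, 1, 2};
        Graph g = { 4, off, adj };

        long *level = bfs(&g, 0);
        for (long v = 0; v < g.nverts; v++)
            printf("vertex %ld: level %ld\n", v, level[v]);
        free(level);
        return 0;
    }

The level array and the frontier queue are touched in an essentially random order, which is the "very little locality" noted above; on a distributed-memory machine they would have to be partitioned and exchanged at every level, whereas in a 16 TB coherent address space they are simply two large arrays.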

16 Performance Profiling of Million-core Runs
Sameer Shende (ParaTools and University of Oregon)
PSC Blacklight: EARLY illumination
– ~500 GB of shared memory successfully applied to the visual analysis of very large-scale performance profiles, using TAU.
– Profile data: a synthetic million-core dataset assembled from 32k-core LS3DF runs on ANL's BG/P.
(Figures: TAU ParaProf Manager window showing metadata for the million-core profile dataset; execution-time breakdown of LS3DF subroutines over all MPI ranks; LS3DF routine profiling data on rank 1,048,575; histogram of MPI_Barrier showing the distribution of calls over the execution time.)

17 Summary
On PSC's Blacklight resource, hardware-supported cache-coherent shared memory is enabling new data-intensive and memory-intensive analytics and simulations. In particular, Blacklight is:
– enabling new kinds of analyses on large data,
– bringing new communities into HPC, and
– increasing the productivity of both "traditional HPC" and new users.
PSC is actively working with the research community to bring this new analysis capability to diverse fields of research. This will entail development of data-intensive workflows, new algorithms, scaling and performance engineering, and software infrastructure.
Interested? Contact