JLab Status & 2016 Planning
April 2015 All Hands Meeting
Chip Watson, Jefferson Lab

Outline
– Operations Status
– FY15 File System Upgrade
– 2016 Planning for Next USQCD Resource

JLab Resources Overview
3 IB clusters, 8,800 cores, shrinking to 6,200 on July 1
3 GPU clusters, 512 GPUs
– 48 nodes with quad gaming GPUs, going to 36 quads
– 36 nodes with quad C2050s, will shrink as cards fail
– 42 nodes with quad K20s
Xeon Phi (KNC) test cluster, 64 accelerators -> 48
– Will convert 4 nodes into interactive and R&D nodes
1.3 PB Lustre file system
– shared with Experimental Physics, 70% LQCD
– 32 servers (soon to be 23)
– 8.5 GB/s aggregate bandwidth
10 PB tape library, shared, 10% LQCD
– LQCD growing at about 40 TB / month

Operations & Utilization
LQCD is running well.
[Utilization chart: colors are different USQCD projects/users; note that the peak is above the 8,800 cores owned by USQCD.]
JLab load balances with Experimental Physics, which can consume nodes during our slow months (no penalties this past year).
LQCD is now consuming unused "farm" cycles (debt shown in the chart at left).

Lustre File System
1.3 PB across 32 servers, shared with Experimental Physics
– Aggregates bandwidth, helping both programs hit higher peaks
– Allows more flexibility in adjusting allocations quickly
– As the 12 GeV program ramps up, the split will move to 50% each
Now upgrading to version 2.5
– OpenZFS RAID-z2, full RAID check on every read
– Will rsync across IB, project by project, starting in May
– Will drain and move servers; as the 1.8 system shrinks, the 2.5 system grows
– 3 new servers (1 for LQCD) will allow decommissioning the 2009 hardware (the 12 oldest, smallest servers)
– Soon to procure newer, higher-performance system(s) to replace the 2010 hardware and increase total bandwidth to GB/s

Computer Room Upgrades
To meet the DOE goal of a PUE of 1.4 (see the note below), power and cooling are being refurbished in 2015:
– New 800 kW UPS
– 3 new 200 kW air handlers (plus a refurbished 180 kW unit)
– All file servers, interactive nodes, etc. will move to dual-fed power, one side of which will be generator backed (99.99% uptime)
Transitions
– Chilled water outage later this month (1-2 days)
– Rolling cluster outages to relocate and re-rack at a higher power density (kW/rack) than today
– Anticipate 2 days of outage per rack (3-4 racks at a time) plus 4 days of full-system outage over the next 7 months, so <2% downtime for the year; JLab will augment x86 capacity by 2% to compensate
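For reference, PUE (power usage effectiveness) is the ratio of total facility power to the power delivered to the IT equipment,

\[
\mathrm{PUE} \;=\; \frac{P_{\text{total facility}}}{P_{\text{IT}}},
\]

so at the 1.4 target roughly 0.4 kW of cooling and power-conversion overhead accompanies each 1 kW of compute load. As a purely illustrative number (not a facility measurement), an 800 kW IT load would then draw about 1,120 kW at the facility level.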

2016 Planning

2016 LQCD Machine
The 5-year plan has leaner budgets (40% less hardware) and no hardware funds in 2015, so the project plans to combine funds into 2 procurements (current plan of record):
– FY16 & FY17 into a 2-phase procurement of ~$1.96M
– FY18 & FY19 into a 2-phase procurement of ~$2.65M
Process: the timeline & process are the same as in previous years. The goal is also the same: optimize the portfolio of machines to get the most science out of the portfolio of applications.

x86… GPU… Xeon Phi… or a combination?
The probable contenders
– Latest conventional x86, Pascal GPU, Xeon Phi / Knights Landing, …
Likely configurations for each
– Dual-socket, 16-core Xeon (64 threads), 1:1 QDR or 2:1 FDR InfiniBand
– Quad GPU + dual socket (thousands of threads per GPU, on-package high-bandwidth memory); quad GPU to amortize the cost of the host, OR dual GPU to minimize the Amdahl's Law penalty (see the note below); either way this is a fatter node and therefore needs faster InfiniBand per node, FDR or better
– Single-socket, 64+ core Xeon Phi (256+ threads, 512-bit SIMD, on-package high-bandwidth memory)
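As a reminder of the trade-off mentioned above, Amdahl's law gives the node-level speedup when a fraction p of the work is offloaded to accelerators that run it N times faster than the host:

\[
S(N) \;=\; \frac{1}{(1-p) + p/N}.
\]

With p = 0.95 (an illustrative fraction, not a measured application profile), even arbitrarily fast GPUs cap the node at S ≤ 1/0.05 = 20, so the un-accelerated 5% left on the host dominates; that is the argument for fewer GPUs per node, or for a faster host, rather than simply piling on accelerators.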

KNL (Knights Landing) many-core
– Not an accelerator. Not a heterogeneous architecture. A single-socket x86 node.
– Better core than KNC: out-of-order execution, advanced branch prediction, scatter/gather
– 8 on-package MCDRAM devices, "up to 16 GB"
– 6 DDR4 channels, "up to 384 GB"
– 1 MB L2 cache per 2-core tile (the figure shows up to 72 cores, if all are real & operational)

Time to Consider a New Architecture
Xeon Phi software maturity is growing:
– 2013 saw LQCD running at TACC / Stampede (KNC); the optimized Dirac operator matched the performance of contemporary GPUs
– Additional developments are under way on multiple codes, driven by large future resources:
– NERSC's Cori (KNL) in 2016, with 9,300+ chips
– followed by ANL's Theta (KNL) in 2016, 2,500 chips
– and ANL's Aurora (KNH – Knights Hill) in 2018, with "50,000 nodes"

Other Significant Changes
– Both Pascal and Knights Landing will have on-package memory – high bandwidth, memory mapped (or usable as a cache, but probably better managed directly). What is happening now to enable use of this feature in 15 months? (One possible path is sketched below.)
– Pascal will have the new NVLink. Details are still under NDA.
– Intel will have an on-chip network that can replace InfiniBand, but the timeline is still under NDA (certainly in time for Aurora).
– GPU–POWER processor coupling with NVLink: will this significantly reduce the Amdahl's Law hits? What will we need to do to exploit it?
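One possible path for directly managed on-package memory on Knights Landing, offered purely as an assumption rather than a project decision, is the memkind library's hbwmalloc interface: the code places selected hot arrays in MCDRAM and falls back to DDR when high-bandwidth memory is absent. A minimal sketch, with a placeholder array size and kernel (build with -lmemkind):

```c
/*
 * Minimal sketch (an assumption, not a project plan): explicitly managed
 * on-package memory via the memkind library's hbwmalloc interface.
 * Placeholder problem size and kernel; build with -lmemkind.
 */
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>

int main(void)
{
    const size_t n = (size_t)1 << 24;              /* placeholder array length */
    int have_hbw = (hbw_check_available() == 0);   /* 0 => MCDRAM is usable    */

    /* Put the bandwidth-critical array in MCDRAM if present, else in DDR. */
    double *x = have_hbw ? hbw_malloc(n * sizeof *x)
                         : malloc(n * sizeof *x);
    if (!x) return 1;

    for (size_t i = 0; i < n; i++)                 /* stand-in for a bandwidth- */
        x[i] = 2.0 * (double)i;                    /* bound kernel              */

    printf("have_hbw=%d  x[n-1]=%g\n", have_hbw, x[n - 1]);

    if (have_hbw) hbw_free(x);
    else          free(x);
    return 0;
}
```

On a node whose MCDRAM is configured in flat mode, an alternative requiring no source changes is to bind the whole job to the MCDRAM NUMA node (e.g. with numactl), provided the footprint fits within the 16 GB.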

Community Participation – Very Important!
This next machine will replace all of the ARRA hardware (which will be gone by then) while also increasing total USQCD project resources. When it turns on, it might represent as much as 50% of the combined JLab + FNAL + BNL LQCD resources.
Questions:
1) What are the best representative applications to characterize a large fraction of our activities on USQCD-owned resources? (Inverters are a part, but more is needed; a toy inverter kernel is sketched below.)
2) For what applications does CUDA code exist? For which will it exist?
3) Who is working to prepare Xeon Phi code? What is anticipated to exist by early 2016? What is the minimum we should have to fairly evaluate Pascal vs. Knights Landing? …
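To make question 1 concrete, the benchmark kernel everyone has in mind is a Krylov solver ("inverter"). The sketch below is a plain conjugate gradient in C acting on a stand-in positive-definite operator (a 1-D periodic Laplacian plus a mass term); it only illustrates the shape of such a kernel and is not a USQCD code or a Dirac operator (build with -lm):

```c
/*
 * Toy conjugate-gradient "inverter" illustrating the kind of benchmark
 * kernel discussed above. The operator A = m^2*I + (1-D periodic Laplacian)
 * is a placeholder, not a Dirac operator; it is positive definite, so
 * plain CG converges.
 */
#include <stdio.h>
#include <math.h>

#define N 1024

static void apply_A(const double *in, double *out, double m2)
{
    for (int i = 0; i < N; i++) {
        int ip = (i + 1) % N, im = (i + N - 1) % N;
        out[i] = (m2 + 2.0) * in[i] - in[ip] - in[im];
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void)
{
    static double b[N], x[N], r[N], p[N], Ap[N];
    const double m2 = 0.01;                         /* placeholder "mass" term */

    for (int i = 0; i < N; i++) { b[i] = sin(0.1 * i); x[i] = 0.0; }

    /* r = b - A x; with x = 0 this is r = b, and p = r. */
    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int k = 0; k < 10000 && rr > 1e-20; k++) {
        apply_A(p, Ap, m2);
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    printf("final |r|^2 = %g\n", rr);
    return 0;
}
```

The real question, of course, is which production applications and solvers the community wants represented in the benchmark suite.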

Additional Questions
4) How much memory for each of the three architectures? For GPUs, how much host memory is needed compared to GPU memory? (We'll need to understand the % gain that comes from doubling memory, to compare against the cost of that upgrade.)
5) What will it take to exploit on-package memory? (It can be treated much as on QCDOC: fast and memory mapped.)
6) What applications are significantly disk-I/O bound? Is anyone dependent upon the performance of random I/O? (i.e., is it time for SSDs, or just better servers? A minimal random-read probe is sketched below.)
Please help the project make the best selection by providing your input!
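As a way to start answering question 6, the sketch below is a tiny random-read probe in C: it reads 4 KiB blocks at random offsets from an existing file (the path is a placeholder) and reports the achieved IOPS. It is only an illustration of how a project could quantify its dependence on random I/O, not a substitute for a real benchmark; in particular it ignores page-cache effects, so a fair measurement would need O_DIRECT or a file much larger than RAM.

```c
/*
 * Tiny random-read probe (illustrative only, not a production benchmark):
 * reads 4 KiB blocks at random offsets from an existing file whose path
 * is a placeholder, and reports achieved IOPS. Page-cache effects are
 * deliberately ignored in this sketch.
 */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/lustre/testfile";  /* placeholder path */
    const size_t blk = 4096;
    const int nreads = 10000;
    char buf[4096];

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    off_t fsize = lseek(fd, 0, SEEK_END);
    if (fsize < (off_t)blk) { fprintf(stderr, "file too small\n"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nreads; i++) {
        /* Pick a random block-aligned offset within the file. */
        off_t off = ((off_t)rand() % (fsize / (off_t)blk)) * (off_t)blk;
        if (pread(fd, buf, blk, off) != (ssize_t)blk) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("%d random 4 KiB reads in %.2f s  (~%.0f IOPS)\n",
           nreads, secs, nreads / secs);
    close(fd);
    return 0;
}
```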