DAS 1-4: 14 years of experience with the Distributed ASCI Supercomputer Henri Bal Vrije Universiteit Amsterdam

Introduction ● DAS: shared distributed infrastructure for experimental computer science research ● Controlled experiments for CS, not production ● 14 years of experience with funding & research ● Huge impact on Dutch CS ● DAS-1, DAS-2, DAS-3, DAS-4

Overview ● Historical overview of the DAS systems ● How to organize a national CS testbed ● Main research results at the VU

Outline ● DAS (pre-)history ● DAS organization ● DAS-1 – DAS-4 systems ● DAS-1 research ● DAS-2 research ● DAS-3 research ● DAS-4 research (early results) ● DAS conclusions

Historical perspective (timeline spanning the pre-DAS and DAS eras): mainframes, minicomputers, collections of workstations (NOWs), networks of workstations (COWs), processor pools, clusters, Beowulf clusters, metacomputing paper, flocking Condors, grid blueprint book, grids, e-Science, optical networks, clouds, heterogeneous computing, green IT

VU (pre-)history ● Andy Tanenbaum already built a cluster around 1984 ● Pronet token ring network ● 8086 CPUs ● (no pictures available) ● He built several Amoeba processor pools ● MC68000, MC68020, MicroSparc ● VME bus, 10 Mb/s Ethernet, Myrinet, ATM

Amoeba processor pool (Zoo, 1994)

DAS-1 background: ASCI ● Research schools (a Dutch product from the 1990s) ● Stimulate top research & collaboration ● Organize Ph.D. education ● ASCI (1995): ● Advanced School for Computing and Imaging ● About 100 staff & 100 Ph.D. students from TU Delft, Vrije Universiteit, Amsterdam, Leiden, Utrecht, TU Eindhoven, TU Twente, …

Motivation for DAS-1 ● CS/ASCI needs its own infrastructure for ● Systems research and experimentation ● Distributed experiments ● Doing many small, interactive experiments ● Need distributed experimental system, rather than centralized production supercomputer

Funding ● DAS proposals written by ASCI committees ● Chaired by Tanenbaum (DAS-1), Bal (DAS 2-4) ● NWO (national science foundation) funding for all 4 systems (100% success rate) ● About 900 K€ funding per system, 300 K€ matching by participants, extra funding from VU ● ASCI committee also acts as steering group ● ASCI is formal owner of DAS

Goals of DAS-1 ● Goals of DAS systems: ● Ease collaboration within ASCI ● Ease software exchange ● Ease systems management ● Ease experimentation ● Want a clean, laboratory-like system ● Keep DAS simple and homogeneous ● Same OS, local network, CPU type everywhere ● Single (replicated) user account file → [ACM SIGOPS 2000] (paper with 50 authors)

Behind the screens …. Source: Tanenbaum (ASCI’97 conference)

DAS-1 (1997-2002): A homogeneous wide-area system
● Sites: VU (128), Amsterdam (24), Leiden (24), Delft (24)
● 6 Mb/s ATM wide-area network
● Nodes: 200 MHz Pentium Pro, 128 MB memory, Myrinet interconnect
● BSDI → Redhat Linux
● Built by Parsytec

BSDI → Linux

DAS-2 (2002-2006): a Computer Science Grid
● Sites: VU (72), Amsterdam (32), Leiden (32), Delft (32), Utrecht (32)
● SURFnet, 1 Gb/s wide-area network
● Nodes: two 1 GHz Pentium-3s, ≥1 GB memory, local disk, Myrinet interconnect
● Redhat Enterprise Linux, Globus 3.2, PBS → Sun Grid Engine
● Built by IBM

DAS-3 (2006-2010): An optical grid
● Sites: VU (85), TU Delft (68), Leiden (32), UvA/MultimediaN (40/46)
● SURFnet6, 10 Gb/s lambdas
● Nodes: dual AMD Opterons, 4 GB memory, local disks
● More heterogeneous: varying clock speeds, single/dual-core nodes
● Myrinet-10G (except Delft), Gigabit Ethernet
● Scientific Linux 4, Globus, SGE
● Built by ClusterVision

DAS-4 (2011): Testbed for clouds, diversity & Green IT
● Sites: VU (74), TU Delft (32), Leiden (16), UvA/MultimediaN (16/36), ASTRON (23)
● SURFnet6, 10 Gb/s lambdas
● Nodes: dual quad-core Xeons, 1-10 TB disk, InfiniBand + 1 Gb/s Ethernet
● Various accelerators (GPUs, multicores, …)
● Scientific Linux, Bright Cluster Manager
● Built by ClusterVision

Performance

Impact of DAS ● Major incentive for VL-e → 20 M€ funding ● Virtual Laboratory for e-Science ● Collaboration with SURFnet on DAS-3 & DAS-4 ● SURFnet provides dedicated 10 Gb/s light paths ● About 10 Ph.D. theses per year use DAS ● Many other research grants

Major VU grants
● DAS-1: NWO Albatross; VU USF (0.4M)
● DAS-2: NWO Ibis, Networks; FP5 GridLab (0.4M)
● DAS-3: NWO StarPlane, JADE-MM, AstroStream; STW SCALP; BSIK VL-e (1.6M); VU-ERC (1.1M); FP6 XtreemOS (1.2M)
● DAS-4: NWO GreenClouds (0.6M); FP7 CONTRAIL (1.4M); BSIK COMMIT (0.9M)
● Total: ~9M in grants, <2M in NWO/VU investments

Outline ● DAS (pre-)history ● DAS organization ● DAS-1 – DAS-4 systems ● DAS-1 research ● DAS-2 research ● DAS-3 research ● DAS-4 research (early results) ● DAS conclusions

DAS research agenda Pre-DAS: Cluster computing DAS-1: Wide-area computing DAS-2: Grids & P2P computing DAS-3: e-Science & optical Grids DAS-4: Clouds, diversity & green IT

Overview VU research
● DAS-1: Albatross (algorithms & applications); Manta, MagPIe (programming systems)
● DAS-2: Search algorithms, Awari (algorithms & applications); Satin (programming systems)
● DAS-3: StarPlane, Model checking (algorithms & applications); Ibis (programming systems)
● DAS-4: Multimedia analysis, Semantic web (algorithms & applications); Ibis (programming systems)

DAS-1 research ● Albatross: optimize wide-area algorithms ● Manta: fast parallel Java ● MagPIe: fast wide-area collective operations

Albatross project ● Study algorithms and applications for wide-area parallel systems ● Basic assumption: wide-area system is hierarchical ● Connect clusters, not individual workstations ● General approach ● Optimize applications to exploit hierarchical structure → most communication is local → [HPCA 1999]

Successive Overrelaxation ● Problem: nodes at cluster boundaries ● Optimizations (see the sketch below): ● Overlap wide-area communication with computation ● Send boundary row once every X iterations (slower convergence, less communication) (Figure: CPUs 1-6 split over two clusters, with link latencies in µs)
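Purely as an illustration of the second optimization, here is a minimal, runnable two-thread sketch (all names are illustrative; this is not Albatross code): two workers own horizontal strips of a grid, run plain SOR sweeps locally, and exchange boundary rows only once every X iterations. The queues stand in for the slow wide-area link, and the asynchronous put approximates overlapping communication with computation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Two-strip SOR where boundary rows cross the "wide-area link" only every X iterations. */
public class WideAreaSorSketch {
    static final int N = 64, ITER = 200, X = 5;      // X = iterations between boundary exchanges
    static final double OMEGA = 1.9;                 // overrelaxation factor

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<double[]> toTop = new ArrayBlockingQueue<>(4);     // simulated WAN, one queue per direction
        BlockingQueue<double[]> toBottom = new ArrayBlockingQueue<>(4);
        Thread top = new Thread(() -> worker(true, toBottom, toTop));
        Thread bottom = new Thread(() -> worker(false, toTop, toBottom));
        top.start(); bottom.start(); top.join(); bottom.join();
        System.out.println("finished " + ITER + " iterations, boundary exchange every " + X);
    }

    static void worker(boolean isTop, BlockingQueue<double[]> out, BlockingQueue<double[]> in) {
        double[][] g = new double[N / 2 + 2][N];     // own rows 1..N/2, plus one fixed and one ghost row
        for (int j = 0; j < N; j++) g[isTop ? 0 : N / 2 + 1][j] = 100.0;   // fixed outer boundary
        try {
            for (int it = 1; it <= ITER; it++) {
                for (int i = 1; i <= N / 2; i++)     // ordinary SOR sweep over the local strip
                    for (int j = 1; j < N - 1; j++)
                        g[i][j] += OMEGA * ((g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4 - g[i][j]);
                if (it % X == 0) {                   // only every X iterations: wide-area exchange
                    out.put(g[isTop ? N / 2 : 1].clone());   // send own boundary row (asynchronously)
                    g[isTop ? N / 2 + 1 : 0] = in.take();    // refresh the ghost row from the other cluster
                }
            }
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```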

Wide-area algorithms ● Discovered numerous optimizations that reduce wide-area overhead ● Caching, load balancing, message combining … ● Performance comparison between ● 1 small (15-node) cluster, 1 big (60-node) cluster, and a wide-area (4×15 nodes) system

Manta: high-performance Java ● Native compilation (Java → executable) ● Fast RMI protocol ● Compiler-generated serialization routines ● Factor 35 lower latency than JDK RMI ● Used for writing wide-area applications → [ACM TOPLAS 2001]

MagPIe: wide-area collective communication ● Collective communication among many processors ● e.g., multicast, all-to-all, scatter, gather, reduction ● MagPIe: MPI’s collective operations optimized for hierarchical wide-area systems ● Transparent to the application programmer → [PPoPP’99]

Spanning-tree broadcast (Figure: broadcast trees over Clusters 1-4) ● MPICH (WAN-unaware): wide-area latency is chained; data is sent multiple times over the same WAN link ● MagPIe (WAN-optimized): each sender-receiver path contains ≤1 WAN link; no data item travels multiple times to the same cluster
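A minimal, runnable simulation of the WAN-optimized tree (illustrative only; this is neither MagPIe nor MPICH code, and the flat local forwarding stands in for the LAN-level broadcast trees): the root sends one message to a coordinator node in every remote cluster, and coordinators forward inside their own cluster, so every root-to-receiver path crosses at most one WAN link.

```java
import java.util.ArrayList;
import java.util.List;

/** Counts WAN vs. LAN messages for a cluster-aware spanning-tree broadcast. */
public class WanAwareBroadcast {
    public static void main(String[] args) {
        int clusters = 4, perCluster = 4, root = 0;        // assumed layout: nodes 0..15, node i is in cluster i/4
        List<int[]> sends = new ArrayList<>();             // (from, to) edges of the broadcast tree

        // WAN phase: root -> first node ("coordinator") of every other cluster.
        for (int c = 0; c < clusters; c++)
            if (c != root / perCluster) sends.add(new int[]{root, c * perCluster});

        // LAN phase: each coordinator (or the root) forwards to the rest of its own cluster.
        for (int c = 0; c < clusters; c++) {
            int coord = (c == root / perCluster) ? root : c * perCluster;
            for (int n = c * perCluster; n < (c + 1) * perCluster; n++)
                if (n != coord) sends.add(new int[]{coord, n});
        }

        long wan = sends.stream().filter(s -> s[0] / perCluster != s[1] / perCluster).count();
        System.out.println("total messages: " + sends.size() + ", WAN messages: " + wan);
        // Every receiver's path from the root crosses at most one WAN link, and no data
        // item travels to the same cluster more than once, unlike a WAN-unaware binomial tree.
    }
}
```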

DAS-2 research ● Satin: wide-area divide-and-conquer ● Search algorithms ● Solving Awari

Satin: parallel divide-and-conquer ● Divide-and-conquer is inherently hierarchical ● More general than master/worker ● Satin: Cilk-like primitives (spawn/sync) in Java
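Satin's spawn/sync model can be illustrated with the JDK's ForkJoin framework as a stand-in (this is not Satin's own API): fork() plays the role of spawn, join() of sync. Satin additionally distributes such spawned jobs across grid clusters with grid-aware load balancing, so the same divide-and-conquer code runs unchanged on one cluster or on the wide-area DAS.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Divide-and-conquer Fibonacci in the spawn/sync style (ForkJoin used as a stand-in for Satin). */
public class SpawnSyncSketch extends RecursiveTask<Long> {
    private final int n;
    SpawnSyncSketch(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n < 2) return (long) n;                          // leaf of the divide-and-conquer tree
        SpawnSyncSketch left = new SpawnSyncSketch(n - 1);
        left.fork();                                         // "spawn": may be stolen by an idle processor
        long right = new SpawnSyncSketch(n - 2).compute();   // evaluate one branch locally
        return left.join() + right;                          // "sync": wait for all spawned work
    }

    public static void main(String[] args) {
        System.out.println("fib(30) = " + new ForkJoinPool().invoke(new SpawnSyncSketch(30)));  // 832040
    }
}
```

The point of the model is that parallelism is expressed only at spawn/sync points, which is what lets the runtime decide, per job, whether to execute locally or ship it to another cluster.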

Satin ● Grid-aware load-balancing [PPoPP’01] ● Supports malleability (nodes joining/leaving) and fault tolerance (nodes crashing) [IPDPS’05] ● Self-adaptive [PPoPP’07] ● Range of applications (SAT solver, N-body simulation, raytracing, grammar learning, …) [ACM TOPLAS 2010] → Ph.D. theses: van Nieuwpoort (2003), Wrzesinska (2007)

Satin on wide-area DAS-2

Search algorithms ● Parallel search: partition search space ● Many applications have transpositions ● Reach same position through different moves ● Transposition table: ● Hash table storing positions that have been evaluated before ● Shared among all processors

Distributed transposition tables ● Partitioned tables ● 10,000s synchronous messages per second ● Poor performance even on a cluster ● Replicated tables ● Broadcasting doesn’t scale (many updates)

Transposition Driven Scheduling ● Send job asynchronously to the owner of the table entry ● Can be overlapped with computation ● Random (= good) load balancing ● Delayed & combined into fewer large messages → bulk transfers are far more efficient on most networks (see the sketch below)
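The core of the scheme can be sketched in a few lines (a single-process toy, not the actual TDS implementation): a hash of the position determines its owner, new work is buffered per destination instead of being evaluated locally, and buffers are flushed as bulk messages once they reach a chosen batch size.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hash-based work placement with message combining, as in transposition-driven scheduling. */
public class TdsSketch {
    static final int PROCS = 4, BATCH = 64;                           // BATCH models message combining
    static final Map<Integer, List<Long>> outbox = new HashMap<>();   // per-destination send buffers
    static final Map<Integer, Integer> bulkSent = new HashMap<>();    // bulk messages sent per destination

    static int owner(long position) {                                 // home processor of a position
        return Math.floorMod(Long.hashCode(position * 0x9E3779B97F4A7C15L), PROCS);
    }

    static void sendToOwner(long position) {                          // asynchronous send = append to buffer
        int dest = owner(position);
        List<Long> buf = outbox.computeIfAbsent(dest, k -> new ArrayList<>());
        buf.add(position);
        if (buf.size() == BATCH) flush(dest);                         // combine into one large message
    }

    static void flush(int dest) {
        bulkSent.merge(dest, 1, Integer::sum);                        // "transmit" the combined message
        outbox.get(dest).clear();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int i = 0; i < 100_000; i++) sendToOwner(rnd.nextLong()); // expand 100,000 positions
        for (int d = 0; d < PROCS; d++) if (outbox.containsKey(d)) flush(d);
        System.out.println("bulk messages per destination: " + bulkSent);
        // The hash spreads work evenly (random load balancing), and combining turns
        // 100,000 small sends into roughly 1,600 bulk transfers.
    }
}
```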

Speedups for Rubik’s cube (single Myrinet cluster vs. TDS on wide-area DAS-2) ● Latency-insensitive algorithm works well even on a grid, despite a huge amount of communication → [IEEE TPDS 2002]

Solving Awari ● Solved by John Romein [IEEE Computer, Oct. 2003] ● Computed on the VU DAS-2 cluster, using similar ideas as TDS ● Determined the score for 889,063,398,406 positions ● Game is a draw ● Andy Tanenbaum: “You just ruined a perfectly fine 3500-year-old game”

DAS-3 research ● Ibis – see tutorial & [IEEE Computer 2010] ● StarPlane ● Wide-area Awari ● Distributed Model Checking

StarPlane

● Multiple dedicated 10G light paths between sites ● Idea: dynamically change wide-area topology

Wide-area Awari ● Based on retrograde analysis ● Backwards analysis of the search space (database) ● Partitions the database, like transposition tables ● Random distribution → good load balance ● Repeatedly send results to parent nodes ● Asynchronous, combined into bulk transfers ● Extremely communication intensive: ● 1 Pbit of data in 51 hours (on 1 DAS-2 cluster)

Awari on DAS-3 grid ● Implementation on single big cluster ● 144 cores ● Myrinet (MPI) ● Naïve implementation on 3 small clusters ● 144 cores ● Myrinet + 10G light paths (OpenMPI)

Initial insights ● Single-cluster version has high performance, despite the high communication rate ● Up to 28 Gb/s cumulative network throughput ● Naïve grid version has flow-control problems ● Faster CPUs overwhelm slower CPUs with work ● Unrestricted job queue growth → Add regular global synchronizations (barriers)

Optimizations ● Scalable barrier synchronization algorithm ● Ring algorithm has too much latency on a grid ● Tree algorithm for barrier & termination detection ● Reduce host overhead ● CPU overhead for MPI message handling/polling ● Optimize grain size per network (LAN vs. WAN) ● Large messages (much combining) have lower host overhead but higher load imbalance → [CCGrid 2008]
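The tree-shaped barrier can be mimicked with the JDK's tiered Phasers (a single-JVM stand-in, not the CCGrid 2008 implementation; the one-child-phaser-per-cluster layout is an assumption for illustration): arrivals are aggregated per cluster before they reach the root. Within one JVM this only reduces contention, but the same tree shape is what limits a distributed barrier to one wide-area hop per cluster instead of a full ring traversal.

```java
import java.util.concurrent.Phaser;

/** A tree of Phasers: one child per "cluster" under a single root, used as a global barrier. */
public class TreeBarrierSketch {
    public static void main(String[] args) throws InterruptedException {
        int clusters = 3, nodesPerCluster = 4, rounds = 5;
        Phaser root = new Phaser();                                // top of the barrier tree
        Thread[] threads = new Thread[clusters * nodesPerCluster];
        for (int c = 0; c < clusters; c++) {
            Phaser cluster = new Phaser(root);                     // child phaser, tiered under the root
            for (int n = 0; n < nodesPerCluster; n++) {
                cluster.register();                                // one party per node in this cluster
                int id = c * nodesPerCluster + n;
                threads[id] = new Thread(() -> {
                    for (int r = 0; r < rounds; r++) {
                        // ... one round of local work would go here ...
                        cluster.arriveAndAwaitAdvance();           // global barrier, aggregated via the tree
                    }
                    cluster.arriveAndDeregister();                 // leave the barrier when done
                });
            }
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println("completed " + rounds + " tree-barrier rounds with " + threads.length + " nodes");
    }
}
```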

Performance ● Optimizations improved grid performance by 50% ● Grid version only 15% slower than 1 big cluster ● Despite huge amount of communication (14.8 billion messages for 48-stone database)

From Games to Model Checking ● Distributed model checking has a very similar communication pattern to Awari ● Search huge state spaces, random work distribution, bulk asynchronous transfers ● Can efficiently run the DiVinE model checker on wide-area DAS-3, using up to 1 TB memory [IPDPS’09]

Required wide-area bandwidth

DAS-4 research (early results) ● Distributed reasoning ● (Jacopo Urbani) ● Multimedia analysis on GPUs ● (Ben van Werkhoven)

WebPIE ● The Semantic Web is a set of technologies to enrich the current Web ● Machines can reason over Semantic Web data to find the best results for queries ● WebPIE is a MapReduce reasoner with linear scalability ● It significantly outperforms other approaches, by one to two orders of magnitude
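As a toy illustration of reasoning expressed as MapReduce (a single-machine sketch, not WebPIE's code): one RDFS rule, (x rdf:type C) and (C rdfs:subClassOf D) imply (x rdf:type D), becomes a map phase that keys triples by the class they mention and a reduce phase that joins instances with superclasses. WebPIE runs jobs of this shape at scale over MapReduce, with many optimizations, iterating until no new triples are derived.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** One RDFS rule (rdfs9) applied in map/reduce style over an in-memory list of triples. */
public class MapReduceReasoningSketch {
    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        List<Triple> input = List.of(
            new Triple("alice", "rdf:type", "Student"),
            new Triple("Student", "rdfs:subClassOf", "Person"),
            new Triple("Person", "rdfs:subClassOf", "Agent"));

        // "Map": key every rdf:type / rdfs:subClassOf triple by the class it mentions.
        Map<String, List<Triple>> byClass = input.stream()
            .filter(t -> t.p().equals("rdf:type") || t.p().equals("rdfs:subClassOf"))
            .collect(Collectors.groupingBy(t -> t.p().equals("rdf:type") ? t.o() : t.s()));

        // "Reduce": per class, join its instances with its superclasses and emit new triples.
        List<Triple> derived = byClass.values().stream().flatMap(group -> {
            List<String> instances = group.stream()
                .filter(t -> t.p().equals("rdf:type")).map(Triple::s).toList();
            List<String> supers = group.stream()
                .filter(t -> t.p().equals("rdfs:subClassOf")).map(Triple::o).toList();
            return instances.stream()
                .flatMap(x -> supers.stream().map(d -> new Triple(x, "rdf:type", d)));
        }).toList();

        derived.forEach(System.out::println);   // derives (alice rdf:type Person) in this pass
    }
}
```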

Performance of the previous state of the art

Performance of WebPIE (chart: our performance at CCGrid 2010 on DAS-3, which won the SCALE Award, versus where we are now on DAS-4)

Parallel-Horus: User Transparent Parallel Multimedia Computing on GPU Clusters

Execution times (secs) of a line-detection application on two different GPU clusters (chart: number of Tesla T10 GPUs vs. number of GTX 480 GPUs; some runs were unable to execute because of limited GTX 480 memory)

Naive 2D Convolution Kernel on DAS-4 GPU
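For reference (plain Java, not the GPU code from the slides): a naive 2D convolution computes every output pixel as a weighted sum over an input neighbourhood. On a GPU, one thread per output pixel executes exactly this inner loop nest; tuned kernels typically improve on it by keeping neighbouring pixels in on-chip memory instead of re-reading them for every output.

```java
/** Naive "valid-region" 2D convolution: the computation a one-thread-per-pixel GPU kernel performs. */
public class Convolution2D {
    static float[][] convolve(float[][] img, float[][] filter) {
        int h = img.length, w = img[0].length, fh = filter.length, fw = filter[0].length;
        float[][] out = new float[h - fh + 1][w - fw + 1];        // no padding: valid region only
        for (int y = 0; y < out.length; y++)
            for (int x = 0; x < out[0].length; x++) {
                float sum = 0f;
                for (int i = 0; i < fh; i++)                      // naive: every neighbour re-read from memory
                    for (int j = 0; j < fw; j++)
                        sum += filter[i][j] * img[y + i][x + j];
                out[y][x] = sum;
            }
        return out;
    }

    public static void main(String[] args) {
        float[][] img = new float[512][512];
        img[256][256] = 1f;                                       // single impulse in the centre
        float[][] blur = {{1/9f, 1/9f, 1/9f}, {1/9f, 1/9f, 1/9f}, {1/9f, 1/9f, 1/9f}};
        System.out.println("out[255][255] = " + convolve(img, blur)[255][255]);   // ~0.111
    }
}
```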

Our best performing 2D Convolution Kernel

Conclusions ● Having a dedicated distributed infrastructure for CS enables experiments that are impossible on production systems ● Need long-term organization (like ASCI) ● DAS has had a huge impact on Dutch CS, at moderate cost ● Collaboration around common infrastructure led to large joint projects (VL-e, SURFnet) & grants

Acknowledgements
● VU Group: Niels Drost, Ceriel Jacobs, Roelof Kemp, Timo van Kessel, Thilo Kielmann, Jason Maassen, Rob van Nieuwpoort, Nick Palmer, Kees van Reeuwijk, Frank Seinstra, Jacopo Urbani, Kees Verstoep, Ben van Werkhoven & many others
● DAS Steering Group: Lex Wolters, Dick Epema, Cees de Laat, Frank Seinstra, John Romein, Rob van Nieuwpoort
● DAS management: Kees Verstoep
● DAS grandfathers: Andy Tanenbaum, Bob Hertzberger, Henk Sips
● Funding: NWO, NCF, SURFnet, VL-e, MultimediaN