
Open TS: An Outline of Dynamic Parallelization Approach




1 Open TS: An Outline of Dynamic Parallelization Approach
Program Systems Institute RAS, Moscow State University, Institute of Mechanics. Sergey Abramov, Alexei Adamovich, Alexander Inyukhin, Alexander Moskovsky, Vladimir Roganov, Elena Shevchuk, Yuri Shevchuk, Alexander Vodomerov. 06/09/05 (PaCT 2005 Conference, Krasnoyarsk)

2 Presentation Outline
Short self-introduction
Open TS outline
MPI vs Open TS case study
Applications
Future work

3 1. Short Self-Introduction

4 PSI RAS, Pereslavl-Zalesski

5 SKIF Supercomputing Project
Joint project of organizations from the Russian Federation and the Republic of Belarus. PSI RAS is the lead organization from the Russian Federation. Covers both hardware and software.

6 Flagship “SKIF K-1000”
Peak performance: 2.5 Tflops
Linpack performance: 2.0 Tflops
Efficiency ratio: 80.1%
November 2004: the most powerful supercomputer in the ex-USSR, rank 98 in the Top500

7 Moscow State University
MSU 250

8 Open TS Overview

9 T-System History
Mid-1980s: basic ideas of the T-System
1990s: first implementation of the T-System
“SKIF”: GRACE — Graph Reduction Applied to Cluster Environment
2003–current, “SKIF”: Open TS — the Open T-System

10 Comparison: T-System and MPI
High-level (a few keywords): C/Fortran (sequential), T-System (parallel)
Low-level (hundreds of primitives): Assembler (sequential), MPI (parallel)

11 Related work Parallel Programming Using C++ (Scientific and Engineering Computation) by Gregory V. Wilson (Editor), Paul Lu (Editor) ABC++, Amelia, CC++, CHAOS++, COOL, C++//, ICC++, Mentat, MPC++, MPI++, pC++, POOMA, TAU, UC++

12 T-System in Comparison
Related work vs Open TS differentiator:
Charm++ — FP-based approach
UPC, mpC++ — implicit parallelism
Glasgow Parallel Haskell — allows C/C++-based low-level optimization
OMPC++ — provides both a language and a C++ template library
Cilk — supports SMP, MPI, PVM, and GRID platforms

13 Open TS: an Outline High-performance computing
“Automatic dynamic parallelization”
Combining functional and imperative approaches: high-level parallel programming
T++ language: a “parallel dialect” of C++ — an approach popular in the 1990s

14 T-Approach
“Pure” function (t-function) invocations produce grains of parallelism
A T-program is functional on the higher level and imperative on the lower level (for optimization)
C-compatible execution model
Non-ready variables, multiple assignment
“Seamless” C extension (or Fortran extension)
Optimization through usability: it should be simple to do things in the language that are low-cost for a parallel computer, and difficult to do things that are costly

15 T++ Keywords
tfun — T-function
tval — T-variable
tptr — T-pointer
tout — output parameter (like &)
tdrop — make ready
twait — wait for readiness
tct — T-context
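By way of analogy only: the non-ready-variable behaviour behind tval, tdrop and twait resembles what standard C++ offers with promises and futures. A minimal sketch (not Open TS code; wait_for_value is an illustrative name):

```cpp
#include <future>
#include <thread>

// A promise/future pair plays the role of a T-variable:
// the consumer blocks until a producer makes the value ready.
int wait_for_value() {
    std::promise<int> p;                 // ~ tval int
    std::future<int> f = p.get_future();

    // Producer thread: setting the value is roughly "tdrop".
    std::thread producer([&p] { p.set_value(42); });

    int v = f.get();                     // ~ twait: block until ready
    producer.join();
    return v;
}
```

In T++ the readiness handling is implicit in every use of a tval; the futures here only mimic the blocking behaviour, not the distributed implementation.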

16 Sample Program

#include <stdio.h>
#include <stdlib.h>

tfun int fib (int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

tfun int main (int argc, char **argv) {
    if (argc != 2) { printf("Usage: fib <n>\n"); return 1; }
    int n = atoi(argv[1]);
    printf("fib(%d) = %d\n", n, (int)fib(n));
    return 0;
}

Not computationally intensive, but numerical integration has the same structure.
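For comparison, the same fork/join structure can be sketched in plain C++ with std::async; this mimics how each t-function call yields a grain of parallelism, with an explicit depth cutoff added here to bound the thread count (an illustration, not output of the T++ converter):

```cpp
#include <future>

// Sequential leaf computation.
long fib_seq(int n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel version: each call spawns one branch asynchronously,
// like a t-function invocation producing a grain of parallelism.
// The depth cutoff keeps the number of threads bounded.
long fib_par(int n, int depth) {
    if (depth == 0 || n < 2) return fib_seq(n);
    auto left = std::async(std::launch::async, fib_par, n - 1, depth - 1);
    long right = fib_par(n - 2, depth - 1);
    return left.get() + right;
}
```

Unlike T++, where the runtime decides grain placement dynamically, the cutoff here is a manual substitute for that scheduling.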

17 Open TS: Environment

18 Open TS Runtime
Three-tiered architecture (T, M, S)
Design: microkernel, currently 10 extensions
“Supermemory”
Lightweight threads
DMPI: Dynamic MPI — automatic selection of MPI implementation; dynamic loading and linking

19 Supermemory
Object-oriented distributed shared memory (OO DSM)
Global address space
Cell versioning

20 Multithreading & Communications
Lightweight threads: PIXELS
Asynchronous communications: a thread “A” asks for a non-ready value (or a new job); an asynchronous request is sent; active messages and signal delivery over the network stimulate data transfer to thread “A”
Context switches (including a quantum for communications)
Latency hiding for node-node exchanges
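The latency-hiding idea can be illustrated with a small C++ sketch: issue the communication asynchronously, keep computing, and block only when the value is actually needed (fetch_remote and its 50 ms delay are made-up stand-ins for a network request):

```cpp
#include <future>
#include <thread>
#include <chrono>

// Stand-in for a remote request with network latency.
int fetch_remote() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 7;
}

int compute_with_overlap() {
    // Issue the request first (asynchronous communication)...
    auto pending = std::async(std::launch::async, fetch_remote);

    // ...then do local work while the transfer is in flight.
    int local = 0;
    for (int i = 1; i <= 1000; ++i) local += i % 10;

    // Switch back only when the non-ready value is needed.
    return local + pending.get();
}
```

Open TS does this switching transparently across its lightweight threads; the sketch shows only the overlap principle on one node.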

21 DMPI — Dynamic MPI
Automatic selection of MPI implementation; dynamic loading and linking
Seven implementations of MPI are supported now: LAM, MPICH, SCALI MPI, MVAPICH, IMPI, MPICH-G2, PACX-MPI
Even PVM can be used instead of MPI

22 Debugging: WAD, LTDB

23 Statistics Gathering

24 Message Tracing

25 Open TS: Applying to Distributed Computing
Meta-cluster messaging support (MPICH-G2, IMPI, PACX-MPI) Customizable scheduling strategies (network topology information used)

26 NPB Test EP Rewritten in Open TS
EP — Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of the theoretical maximum (on 10 nodes)
(Charts: efficiency, % of theoretical; time, % of sequential)

27 Open TS vs MPI case study

28 Applications
Popular and widely used, developed by independent teams (MPI experts):
PovRay — Persistence of Vision Raytracer, enabled for parallel runs by a patch
ALCMD/MP_Lite — molecular dynamics package (Ames Lab)

29 T-PovRay vs MPI PovRay: code complexity
Program / source code volume:
MPI modules for PovRay 3.10g: 1,500 lines
MPI patch for PovRay 3.50c: 3,000 lines
T++ modules (for both versions 3.10g & 3.50c): 200 lines

30 T-PovRay vs MPI PovRay: performance
Efficiency of the parallel implementations is comparable. The ratio timeMPI(N)/timeT(N) varies from 87% (T-PovRay slightly slower than MPI PovRay) to 173% (T-PovRay considerably faster than MPI PovRay) on the “SKIF FIRST-BORN M” cluster.
16 dual AMD Athlon MP 1800 nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6

31 T-PovRay vs MPI PovRay: performance
Efficiency of the parallel implementations is comparable. The ratio timeMPI(N)/timeT(N) varies from 101% (T-PovRay slightly faster) to 201% (T-PovRay considerably faster) on the “SKIF K-1000” cluster.
Dual AMD Opteron CPUs per node, 4 GB RAM, GigE, LAM 7.1.1

32 ALCMD/MPI vs ALCMD/OpenTS
MP_Lite component of ALCMD rewritten in T++ Fortran code is left intact

33 ALCMD/MPI vs ALCMD/OpenTS : code complexity
Program / source code volume:
MP_Lite total / MPI: ~20,000 lines
MP_Lite, ALCMD-related / MPI: ~3,500 lines
MP_Lite, ALCMD-related / OpenTS: 500 lines

34 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
16 dual AMD Athlon MP 1800 nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6; Lennard-Jones MD, atoms

35 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Dual AMD Opteron CPUs per node, 4 GB RAM, GigE, LAM 7.1.1; Lennard-Jones MD, atoms

36 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Dual AMD Opteron CPUs per node, 4 GB RAM, InfiniBand, MVAPICH 0.9.4; Lennard-Jones MD, atoms

37 Open TS applications

38 T-Applications
MultiGen — biological activity estimation
Remote sensing applications
Plasma modeling
Protein simulation
Aeromechanics
Query engine for XML
AI applications
etc.

39 MultiGen Chelyabinsk State University
Multi-conformation model: K0 (level 0); K11, K12 (level 1); K21, K22 (level 2)

40 MultiGen: Speedup

Substance | Atoms | Rotations | Conformers | Execution time (min:s): 1 node / 4 nodes / 16 nodes
NCI (National Cancer Institute, USA; AIDS drug lead) | 28 | 4 | 13 | 9:33 / 3:21 / 1:22
TOSLAB A2-0261 (TOSLAB company, Russia–Belgium; antiphlogistic drug lead) | 82 | 18 | 49 | 115:27 / 39:23 / 16:09
NCI (National Cancer Institute, USA; AIDS drug lead) | 126 | 25 | 74 | 266:19 / 95:57 / 34:48

41 Aeromechanics Institute of Mechanics, MSU

42 Aeromechanics, Institute of Mechanics, MSU

43 Creating space-born radar image from hologram

44 Simulating broadband radar signal
Graphical user interface; developed by a non-PSI RAS team (Space Research Institute of Khrunichev Corp.)

45 Landsat Image Classification
Computational “web-service”

46 Future Work
Multi-core CPU support
Distributed computing: schedulers, transport, fault tolerance
Interface to web services — an integration point
Optimizing for modern CPUs
Algorithmic skeletons, patterns, and high-level parallel libraries

47 Out of Presentation Scope
Other T-languages: T-Refal, T-Fortran
Memoization
Automatic choice between call-style and fork-style function invocation
Checkpointing
Heartbeat mechanism
Flavours of data references: “normal”, “glue” and “magnetic” — lazy, eager and ultra-eager (speculative) data transfer

48 ACKNOWLEDGEMENTS
“SKIF” supercomputing project
Russian Academy of Sciences grants:
Program “High-performance computing systems based on new principles of computational process organization”
Program of the Presidium of the Russian Academy of Sciences “Development of basics for implementation of a distributed scientific informational-computational environment based on GRID technologies”
Russian Foundation for Basic Research, grant “офи_а”
Microsoft — contract for the “Open TS vs MPI” case study

49 THANKS! ANY QUESTIONS?

50 Open TS benchmarks

51 Tests: NAS CG, NAS EP, FIB

52 EP @ OpenTS Benchmark
Embarrassingly parallel; recursive implementation
Two parameters:
size — number of operations in the task ~ 2^size
depth — number of grains (t-function calls) = 2^depth
Number of operations per grain ~ 2^(size-depth)
Allows stressing the runtime
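The size/depth split can be sketched as a recursion that fans out into 2^depth leaves, each doing ~2^(size-depth) unit operations (an illustrative C++ model of the benchmark's shape, not the actual EP kernel; ep_task is a made-up name):

```cpp
#include <cstdint>

// A task of 2^size operations split into 2^depth grains.
// Each leaf performs 2^(size - depth) unit operations.
uint64_t ep_task(int size, int depth) {
    if (depth == 0) {
        uint64_t ops = uint64_t(1) << size;  // work in this grain
        uint64_t acc = 0;
        for (uint64_t i = 0; i < ops; ++i) acc += 1;
        return acc;
    }
    // Split into two half-size subtasks; in T++ each call
    // would be a t-function invocation, i.e. a grain.
    return ep_task(size - 1, depth - 1) + ep_task(size - 1, depth - 1);
}
```

Varying depth at fixed size changes only the grain count, which is exactly what lets the benchmark stress the runtime's scheduling overhead.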

53 NPB Test EP Rewritten in Open TS
EP — Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of the theoretical maximum (on 10 nodes)
(Charts: efficiency, % of theoretical; time, % of sequential)

54 Additional EPs
The same T++ source code linked with different RTL extensions:
EP — standard, with dynamic load balancing
EP_ASYNC — “asynchronous”: data exchange interrupts calculation
EP_GS — “grid scheduler”: minimizes load deviation when assigning a task
EP_GS_ASYNC — “grid scheduler” with “asynchronous” data exchange

55 EP Metric
M is calculated as 2^size / time / (number of CPUs), taken as a percentage of the best value over all experiments
A good metric: it is approximately the same on a single CPU for depths between 6 and 12
Cluster: 16 dual Athlon 1800MP+ nodes, Fast Ethernet
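The metric can be written down directly; raw_metric and normalize below are illustrative helpers (not from the benchmark code), assuming time is measured in seconds:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Raw throughput per CPU: 2^size / time / number of CPUs.
double raw_metric(int size, double seconds, int ncpu) {
    return std::ldexp(1.0, size) / seconds / ncpu;
}

// M: each experiment's raw metric as a percentage of the best one.
std::vector<double> normalize(const std::vector<double>& raw) {
    double best = 0.0;
    for (double r : raw) best = std::max(best, r);
    std::vector<double> m;
    for (double r : raw) m.push_back(100.0 * r / best);
    return m;
}
```

Normalizing to the best run is what makes M comparable across cluster sizes: perfect scaling keeps M near 100% as CPUs are added.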

56 EP Results
For all size in [28,32] and depth in [6,12], M = 99.9% if NCPU = 1
M drops below 90% if NCPU > 8 for depth = 6
On 32 CPUs, EP_GS_ASYNC is the best, with M = 88.2% at depth = 12, size = 32




