
Open TS: An Outline of Dynamic Parallelization Approach




1 Open TS: An Outline of Dynamic Parallelization Approach
Program Systems Institute RAS, Moscow State University, Institute of Mechanics. Sergey Abramov, Alexei Adamovich, Alexander Inyukhin, Alexander Moskovsky, Vladimir Roganov, Elena Shevchuk, Yuri Shevchuk, Alexander Vodomerov. 06/09/05 (PaCT 2005 Conference, Krasnoyarsk)

2 Presentation Outline
Short self-introduction
Open TS outline
MPI vs Open TS case study
Applications
Future work

3 1. Short Self-Introduction

4 PSI RAS, Pereslavl-Zalesski

5 SKIF Supercomputing Project
Joint project of organizations from the Russian Federation and the Republic of Belarus. PSI RAS is the lead organization from the Russian Federation. Covers both hardware and software.

6 Flagship “SKIF K-1000”
Peak performance: 2.5 Tflops
Linpack performance: 2.0 Tflops
Efficiency ratio: 80.1%
November 2004: the most powerful supercomputer in the ex-USSR, rank 98 in the Top500

7 Moscow State University
MSU 250

8 Open TS Overview

9 T-System History
Mid-1980s: basic ideas of the T-System
1990s: first implementation of the T-System
“SKIF”: GRACE — Graph Reduction Applied to Cluster Environment
2003–current, “SKIF”: Open TS — the Open T-System

10 Comparison: T-System and MPI
High-level (a few keywords): C/Fortran (sequential), T-System (parallel)
Low-level (hundreds of primitives): Assembler (sequential), MPI (parallel)

11 Related work Parallel Programming Using C++ (Scientific and Engineering Computation) by Gregory V. Wilson (Editor), Paul Lu (Editor) ABC++, Amelia, CC++, CHAOS++, COOL, C++//, ICC++, Mentat, MPC++, MPI++, pC++, POOMA, TAU, UC++

12 T-System in Comparison
Related work vs Open TS differentiator:
Charm++ — FP-based approach
UPC, mpC++ — implicit parallelism
Glasgow Parallel Haskell — allows C/C++-based low-level optimization
OMPC++ — provides both a language and a C++ template library
Cilk — supports SMP, MPI, PVM, and GRID platforms

13 Open TS: an Outline High-performance computing
“Automatic dynamic parallelization”
Combining functional and imperative approaches: high-level parallel programming
T++ language: a “parallel dialect” of C++ — an approach popular in the 1990s

14 T-Approach
“Pure” function (t-function) invocations produce grains of parallelism
A T-program is functional on the higher level and imperative on the lower level (for optimization)
C-compatible execution model
Non-ready variables, multiple assignment
“Seamless” C extension (or Fortran extension)
Optimization through usability: it should be simple to do things in the language that are low-cost for a parallel computer, and difficult to do things that are costly

15 T++ Keywords
tfun — T-function
tval — T-variable
tptr — T-pointer
tout — output parameter (like &)
tdrop — make ready
twait — wait for readiness
tct — T-context
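By way of analogy only: the non-ready-variable behaviour behind tval, tdrop and twait resembles what standard C++ offers with promises and futures. A minimal sketch (not Open TS code; wait_for_value is an illustrative name):

```cpp
#include <future>
#include <thread>

// A promise/future pair plays the role of a T-variable:
// the consumer blocks until a producer makes the value ready.
int wait_for_value() {
    std::promise<int> p;                 // ~ tval int
    std::future<int> f = p.get_future();

    // Producer thread: setting the value is roughly "tdrop".
    std::thread producer([&p] { p.set_value(42); });

    int v = f.get();                     // ~ twait: block until ready
    producer.join();
    return v;
}
```

In T++ the readiness handling is implicit in every use of a tval; the futures here only mimic the blocking behaviour, not the distributed implementation.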

16 Sample Program

#include <stdio.h>
#include <stdlib.h>

tfun int fib (int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

tfun int main (int argc, char **argv) {
    if (argc != 2) { printf("Usage: fib <n>\n"); return 1; }
    int n = atoi(argv[1]);
    printf("fib(%d) = %d\n", n, (int)fib(n));
    return 0;
}

Not computationally intensive, but numerical integration has the same structure.
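For comparison, the same fork/join structure can be sketched in plain C++ with std::async; this mimics how each t-function call yields a grain of parallelism, with an explicit depth cutoff added here to bound the thread count (an illustration, not output of the T++ converter):

```cpp
#include <future>

// Sequential leaf computation.
long fib_seq(int n) {
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel version: each call spawns one branch asynchronously,
// like a t-function invocation producing a grain of parallelism.
// The depth cutoff keeps the number of threads bounded.
long fib_par(int n, int depth) {
    if (depth == 0 || n < 2) return fib_seq(n);
    auto left = std::async(std::launch::async, fib_par, n - 1, depth - 1);
    long right = fib_par(n - 2, depth - 1);
    return left.get() + right;
}
```

Unlike T++, where the runtime decides grain placement dynamically, the cutoff here is a manual substitute for that scheduling.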

17 Open TS: Environment

18 Open TS Runtime
Three-tiered architecture (T, M, S)
Design: microkernel, currently 10 extensions
“Supermemory”
Lightweight threads
DMPI: Dynamic MPI — automatic selection of MPI implementation; dynamic loading and linking

19 Supermemory
Object-oriented distributed shared memory (OO DSM)
Global address space
Cell versioning

20 Multithreading & Communications
Lightweight threads: PIXELS
Asynchronous communications: a thread “A” asks for a non-ready value (or a new job); an asynchronous request is sent; active messages and signal delivery over the network stimulate data transfer to thread “A”
Context switches (including a quantum for communications)
Latency hiding for node-node exchanges
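The latency-hiding idea can be illustrated with a small C++ sketch: issue the communication asynchronously, keep computing, and block only when the value is actually needed (fetch_remote and its 50 ms delay are made-up stand-ins for a network request):

```cpp
#include <future>
#include <thread>
#include <chrono>

// Stand-in for a remote request with network latency.
int fetch_remote() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 7;
}

int compute_with_overlap() {
    // Issue the request first (asynchronous communication)...
    auto pending = std::async(std::launch::async, fetch_remote);

    // ...then do local work while the transfer is in flight.
    int local = 0;
    for (int i = 1; i <= 1000; ++i) local += i % 10;

    // Switch back only when the non-ready value is needed.
    return local + pending.get();
}
```

Open TS does this switching transparently across its lightweight threads; the sketch shows only the overlap principle on one node.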

21 DMPI — Dynamic MPI
Automatic selection of MPI implementation; dynamic loading and linking
Seven implementations of MPI are supported now: LAM, MPICH, SCALI MPI, MVAPICH, IMPI, MPICH-G2, PACX-MPI
Even PVM can be used instead of MPI

22 Debugging: WAD, LTDB

23 Statistics Gathering

24 Message Tracing

25 Open TS: Applying to Distributed Computing
Meta-cluster messaging support (MPICH-G2, IMPI, PACX-MPI) Customizable scheduling strategies (network topology information used)

26 NPB Test EP Rewritten in Open TS
EP — Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of the theoretical maximum (on 10 nodes)
(Charts: efficiency, % of theoretical; time, % of sequential)

27 Open TS vs MPI case study

28 Applications
Popular and widely used, developed by independent teams (MPI experts):
PovRay — Persistence of Vision Raytracer, enabled for parallel runs by a patch
ALCMD/MP_Lite — molecular dynamics package (Ames Lab)

29 T-PovRay vs MPI PovRay: code complexity
Program / source code volume:
MPI modules for PovRay 3.10g: 1,500 lines
MPI patch for PovRay 3.50c: 3,000 lines
T++ modules (for both versions 3.10g & 3.50c): 200 lines

30 T-PovRay vs MPI PovRay: performance
Efficiency of the parallel implementations is comparable. The ratio timeMPI(N)/timeT(N) varies from 87% (T-PovRay slightly slower than MPI PovRay) to 173% (T-PovRay considerably faster than MPI PovRay) on the “SKIF FIRST-BORN M” cluster.
16 dual AMD Athlon MP 1800 nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6

31 T-PovRay vs MPI PovRay: performance
Efficiency of the parallel implementations is comparable. The ratio timeMPI(N)/timeT(N) varies from 101% (T-PovRay slightly faster) to 201% (T-PovRay considerably faster) on the “SKIF K-1000” cluster.
Dual AMD Opteron CPUs per node, 4 GB RAM, GigE, LAM 7.1.1

32 ALCMD/MPI vs ALCMD/OpenTS
MP_Lite component of ALCMD rewritten in T++ Fortran code is left intact

33 ALCMD/MPI vs ALCMD/OpenTS : code complexity
Program / source code volume:
MP_Lite total / MPI: ~20,000 lines
MP_Lite, ALCMD-related / MPI: ~3,500 lines
MP_Lite, ALCMD-related / OpenTS: 500 lines

34 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
16 dual AMD Athlon MP 1800 nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6; Lennard-Jones MD, atoms

35 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Dual AMD Opteron CPUs per node, 4 GB RAM, GigE, LAM 7.1.1; Lennard-Jones MD, atoms

36 ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Dual AMD Opteron CPUs per node, 4 GB RAM, InfiniBand, MVAPICH 0.9.4; Lennard-Jones MD, atoms

37 Open TS applications

38 T-Applications
MultiGen — biological activity estimation
Remote sensing applications
Plasma modeling
Protein simulation
Aeromechanics
Query engine for XML
AI applications
etc.

39 MultiGen Chelyabinsk State University
Multi-conformation model: K0 (level 0); K11, K12 (level 1); K21, K22 (level 2)

40 MultiGen: Speedup

Substance | Atoms | Rotations | Conformers | Execution time (min:s): 1 node / 4 nodes / 16 nodes
NCI (National Cancer Institute, USA; AIDS drug lead) | 28 | 4 | 13 | 9:33 / 3:21 / 1:22
TOSLAB A2-0261 (TOSLAB company, Russia–Belgium; antiphlogistic drug lead) | 82 | 18 | 49 | 115:27 / 39:23 / 16:09
NCI (National Cancer Institute, USA; AIDS drug lead) | 126 | 25 | 74 | 266:19 / 95:57 / 34:48

41 Aeromechanics Institute of Mechanics, MSU

42 Aeromechanics, Institute of Mechanics, MSU

43 Creating space-born radar image from hologram

44 Simulating broadband radar signal
Graphical user interface; developed by a non-PSI RAS team (Space Research Institute of Khrunichev Corp.)

45 Landsat Image Classification
Computational “web-service”

46 Future Work
Multi-core CPU support
Distributed computing: schedulers, transport, fault tolerance
Interface to web services — an integration point
Optimizing for modern CPUs
Algorithmic skeletons, patterns, and high-level parallel libraries

47 Out of Presentation Scope
Other T-languages: T-Refal, T-Fortran
Memoization
Automatic choice between call-style and fork-style function invocation
Checkpointing
Heartbeat mechanism
Flavours of data references: “normal”, “glue” and “magnetic” — lazy, eager and ultra-eager (speculative) data transfer

48 ACKNOWLEDGEMENTS
“SKIF” supercomputing project
Russian Academy of Sciences grants:
Program “High-performance computing systems based on new principles of computational process organization”
Program of the Presidium of the Russian Academy of Sciences “Development of basics for implementation of a distributed scientific informational-computational environment based on GRID technologies”
Russian Foundation for Basic Research, grant “офи_а”
Microsoft — contract for the “Open TS vs MPI” case study

49 THANKS! ANY QUESTIONS?

50 Open TS benchmarks

51 Tests: NAS CG, NAS EP, FIB

52 EP @ OpenTS Benchmark
Embarrassingly parallel; recursive implementation
Two parameters:
size — number of operations in the task ~ 2^size
depth — number of grains (t-function calls) = 2^depth
Number of operations per grain ~ 2^(size-depth)
Allows stressing the runtime
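The size/depth split can be sketched as a recursion that fans out into 2^depth leaves, each doing ~2^(size-depth) unit operations (an illustrative C++ model of the benchmark's shape, not the actual EP kernel; ep_task is a made-up name):

```cpp
#include <cstdint>

// A task of 2^size operations split into 2^depth grains.
// Each leaf performs 2^(size - depth) unit operations.
uint64_t ep_task(int size, int depth) {
    if (depth == 0) {
        uint64_t ops = uint64_t(1) << size;  // work in this grain
        uint64_t acc = 0;
        for (uint64_t i = 0; i < ops; ++i) acc += 1;
        return acc;
    }
    // Split into two half-size subtasks; in T++ each call
    // would be a t-function invocation, i.e. a grain.
    return ep_task(size - 1, depth - 1) + ep_task(size - 1, depth - 1);
}
```

Varying depth at fixed size changes only the grain count, which is exactly what lets the benchmark stress the runtime's scheduling overhead.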

53 NPB Test EP Rewritten in Open TS
EP — Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of the theoretical maximum (on 10 nodes)
(Charts: efficiency, % of theoretical; time, % of sequential)

54 Additional EPs
The same T++ source code linked with different RTL extensions:
EP — standard, with dynamic load balancing
EP_ASYNC — “asynchronous”: data exchange interrupts calculation
EP_GS — “grid scheduler”: minimizes load deviation when assigning a task
EP_GS_ASYNC — “grid scheduler” with “asynchronous” data exchange

55 EP Metric
M is calculated as 2^size / time / (number of CPUs), taken as a percentage of the best value over all experiments
A good metric: it is approximately the same on a single CPU for depths between 6 and 12
Cluster: 16 dual Athlon 1800MP+ nodes, Fast Ethernet
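The metric can be written down directly; raw_metric and normalize below are illustrative helpers (not from the benchmark code), assuming time is measured in seconds:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Raw throughput per CPU: 2^size / time / number of CPUs.
double raw_metric(int size, double seconds, int ncpu) {
    return std::ldexp(1.0, size) / seconds / ncpu;
}

// M: each experiment's raw metric as a percentage of the best one.
std::vector<double> normalize(const std::vector<double>& raw) {
    double best = 0.0;
    for (double r : raw) best = std::max(best, r);
    std::vector<double> m;
    for (double r : raw) m.push_back(100.0 * r / best);
    return m;
}
```

Normalizing to the best run is what makes M comparable across cluster sizes: perfect scaling keeps M near 100% as CPUs are added.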

56 EP Results
For all size in [28,32] and depth in [6,12], M = 99.9% if NCPU = 1
M drops below 90% if NCPU > 8 for depth = 6
On 32 CPUs, EP_GS_ASYNC is the best, with M = 88.2% at depth = 12, size = 32




