1
Open TS: An Outline of Dynamic Parallelization Approach
Program Systems Institute RAS, Moscow State University, Institute of Mechanics
Sergey Abramov, Alexei Adamovich, Alexander Inyukhin, Alexander Moskovsky, Vladimir Roganov, Elena Shevchuk, Yuri Shevchuk, Alexander Vodomerov
06/09/05 (PaCT 2005 Conference, Krasnoyarsk)
2
Presentation Outline
Short self-introduction
Open TS outline
MPI vs Open TS case study
Applications
Future work
3
1. Short Self-Introduction
4
PSI RAS, Pereslavl-Zalesski
5
SKIF Supercomputing Project
A joint project of organizations from the Russian Federation and the Republic of Belarus. PSI RAS is the lead organization from the Russian Federation. The project covers both hardware and software.
6
Flagship “SKIF K-1000”: peak performance 2.5 Tflops, Linpack performance 2.0 Tflops, efficiency ratio 80.1%. November 2004: the most powerful supercomputer in the ex-USSR, rank 98 in the Top500.
7
Moscow State University
MSU 250
8
Open TS Overview
9
T-System History
Mid-1980s: basic ideas of T-System
1990s: first implementation of T-System; “SKIF” GRACE — Graph Reduction Applied to Cluster Environment
2003–current, “SKIF”: Open TS — Open T-System
10
Comparison: T-System and MPI
High-level (a few keywords): C/Fortran for sequential programming, T-System for parallel programming
Low-level (hundreds of primitives): Assembler for sequential programming, MPI for parallel programming
11
Related work: Parallel Programming Using C++ (Scientific and Engineering Computation), Gregory V. Wilson and Paul Lu (editors): ABC++, Amelia, CC++, CHAOS++, COOL, C++//, ICC++, Mentat, MPC++, MPI++, pC++, POOMA, TAU, UC++
12
T-System in Comparison
How Open TS differs from related work:
vs Charm++: FP-based approach
vs UPC, mpC++: implicit parallelism
vs Glasgow Parallel Haskell: allows C/C++-based low-level optimization
vs OMPC++: provides both a language and a C++ template library
vs Cilk: supports SMP, MPI, PVM, and GRID platforms
13
Open TS: an Outline
High-performance computing
“Automatic dynamic parallelization”
Combining functional and imperative approaches; high-level parallel programming
T++ language: a “parallel dialect” of C++, an approach popular in the 1990s
14
T-Approach
“Pure” function (t-function) invocations produce grains of parallelism
A T-program is functional at the higher level and imperative at the lower level (for optimization)
C-compatible execution model
Non-ready variables, multiple assignment
“Seamless” extension of C (or Fortran)
Optimization through usability: things that are cheap for a parallel computer should be simple to express in the language, and things that are costly should be difficult to express
15
T++ Keywords
tfun — T-function
tval — T-variable
tptr — T-pointer
tout — output parameter (like &)
tdrop — make ready
twait — wait for readiness
tct — T-context
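The fragment below is a small sketch of how these keywords can be combined; it is not taken from the slides. It assumes that a tval holds the non-ready result of a tfun call and that reading it as a plain int (as in the fib sample on the next slide) waits for readiness; the exact syntax accepted by the Open TS compiler may differ.

    #include <stdio.h>

    tfun int square (int x) {            /* a pure T-function: each call is a grain of parallelism */
        return x * x;
    }

    tfun int main (int argc, char **argv) {
        tval int a = square(3);          /* non-ready T-variable; the grain may run on another node */
        tval int b = square(4);          /* a second grain can run concurrently with the first */
        printf("%d\n", (int)a + (int)b); /* the casts wait for readiness, like twait would */
        return 0;
    }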
16
Sample Program

    #include <stdio.h>
    #include <stdlib.h>                   /* for atoi */

    tfun int fib (int n) {
        return n < 2 ? n : fib(n-1) + fib(n-2);
    }

    tfun int main (int argc, char **argv) {
        if (argc != 2) {
            printf("Usage: fib <n>\n");
            return 1;
        }
        int n = atoi(argv[1]);
        printf("fib(%d) = %d\n", n, (int)fib(n));
        return 0;
    }

Not computationally intensive, but numerical integration has the same structure.
17
Open TS: Environment
18
Open TS Runtime
Three-tiered architecture (T, M, S)
Design: microkernel, currently 10 extensions
“Supermemory”
Lightweight threads
DMPI: Dynamic MPI (auto-selection of MPI implementation; dynamic loading and linking)
19
Supermemory Object-Oriented Distributed shared memory (OO DSM)
Global address space Cell versioning
20
Multithreading & Communications
Lightweight threads: PIXELS (… threads)
Asynchronous communications: a thread “A” asks for a non-ready value (or a new job); an asynchronous request is sent; active messages and signals delivered over the network stimulate data transfer to thread “A”
Context switches (including a quantum for communications)
Latency hiding for node-to-node exchange
21
DMPI: Dynamic MPI
Auto-selection of MPI implementation; dynamic loading and linking
Seven implementations of MPI supported now: LAM, MPICH, SCALI MPI, MVAPICH, IMPI, MPICH-G2, PACX-MPI
Even PVM can be used instead of MPI
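The slides do not show how the dynamic loading is implemented; the C fragment below only illustrates the generic dlopen/dlsym technique such a runtime could rely on. The library name and the entry point chosen here are assumptions for the example, not part of DMPI.

    #include <dlfcn.h>
    #include <stdio.h>

    typedef int (*mpi_init_fn)(int *, char ***);

    int main (int argc, char **argv) {
        /* Hypothetical library name; a real runtime would pick the MPI
           implementation that matches the cluster it is running on. */
        void *handle = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
        if (!handle) {
            fprintf(stderr, "cannot load MPI: %s\n", dlerror());
            return 1;
        }
        /* Resolve an entry point by name instead of linking against it. */
        mpi_init_fn mpi_init = (mpi_init_fn) dlsym(handle, "MPI_Init");
        if (!mpi_init) {
            fprintf(stderr, "MPI_Init not found: %s\n", dlerror());
            return 1;
        }
        mpi_init(&argc, &argv);
        /* ... resolve and use the rest of the MPI API the same way,
           then MPI_Finalize, before unloading the library ... */
        dlclose(handle);
        return 0;
    }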
22
Debugging: WAD, LTDB
23
Statistics Gathering
24
Message Tracing
25
Open TS: Application to Distributed Computing
Meta-cluster messaging support (MPICH-G2, IMPI, PACX-MPI) Customizable scheduling strategies (network topology information used)
26
NPB Test EP, Rewritten in Open TS
EP – Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of theoretical maximum (on 10 nodes)
(Chart: efficiency as % of theoretical; time as % of sequential)
27
Open TS vs MPI case study
28
Applications: popular and widely used, developed by independent teams (MPI experts)
PovRay – Persistence of Vision Ray tracer, enabled for parallel runs by a patch
ALCMD/MP_Lite – molecular dynamics package (Ames Lab)
29
T-PovRay vs MPI PovRay: code complexity
Program: source code volume
MPI modules for PovRay 3.10g: 1,500 lines
MPI patch for PovRay 3.50c: 3,000 lines
T++ modules (for both versions 3.10g and 3.50c): 200 lines
30
T-PovRay vs MPI PovRay: performance
Efficiency of the parallel implementations is comparable. The ratio time_MPI(N) / time_T(N) varies from 87% (T-PovRay slightly slower than MPI PovRay) to 173% (T-PovRay considerably faster than MPI PovRay) on the “SKIF FIRST-BORN M” cluster, and from 101% (T-PovRay slightly faster) to 201% (T-PovRay considerably faster) on the “SKIF K-1000” cluster.
Cluster: 16 dual AMD Athlon MP 1800+ nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6
31
T-PovRay vs MPI PovRay: performance
Same comparison as on the previous slide (ratios from 87% to 173% on “SKIF FIRST-BORN M” and from 101% to 201% on “SKIF K-1000”).
Cluster: 2 × AMD Opteron (… GHz) per node, 4 GB RAM, GigE, LAM 7.1.1
32
ALCMD/MPI vs ALCMD/OpenTS
The MP_Lite component of ALCMD was rewritten in T++; the Fortran code is left intact
33
ALCMD/MPI vs ALCMD/OpenTS : code complexity
Program: source code volume
MP_Lite total / MPI: ~20,000 lines
MP_Lite, ALCMD-related / MPI: ~3,500 lines
MP_Lite, ALCMD-related / OpenTS: 500 lines
34
ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Cluster: 16 dual AMD Athlon MP 1800+ nodes, 1 GB RAM, Fast Ethernet, LAM 7.0.6; Lennard-Jones MD, … atoms
35
ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Cluster: 2 × AMD Opteron (… GHz) per node, 4 GB RAM, GigE, LAM 7.1.1; Lennard-Jones MD, … atoms
36
ALCMD/MPI vs ALCMD/OpenTS: performance
Efficiency of the parallel implementations is comparable.
Cluster: 2 × AMD Opteron (… GHz) per node, 4 GB RAM, InfiniBand, MVAPICH 0.9.4; Lennard-Jones MD, … atoms
37
Open TS applications
38
T-Applications
MultiGen – biological activity estimation
Remote sensing applications
Plasma modeling
Protein simulation
Aeromechanics
Query engine for XML
AI applications
etc.
39
MultiGen Chelyabinsk State University
(Diagram of the multi-conformation model: K0 at level 0; K11, K12 at level 1; K21, K22 at level 2)
40
MultiGen: Speedup
Substances: National Cancer Institute USA, Reg. No. NCI (AIDS drug leads); TOSLAB company (Russia-Belgium), Reg. No. TOSLAB A2-0261 (antiphlogistic drug lead)

Substance        Atoms  Rotations  Conformers  Time, 1 node  4 nodes  16 nodes
NCI              28     4          13          9:33          3:21     1:22
TOSLAB A2-0261   82     18         49          115:27        39:23    16:09
NCI              126    25         74          266:19        95:57    34:48

Execution time in min:s.
41
Aeromechanics Institute of Mechanics, MSU
42
AEROMECHANICS Institute of Mechanics, MSU
43
Creating a space-borne radar image from a hologram
44
Simulating broadband radar signal
Graphical user interface developed by a non-PSI RAS team (Space Research Institute of Khrunichev Corp.)
45
Landsat Image Classification
Computational “web-service”
46
Future Work
Multi-core CPU support
Distributed computing: schedulers, transport, interface to web services
Fault tolerance
Optimizing for modern CPUs
Algorithmic skeletons, patterns and high-level parallel libraries
Interface to web services as an integration point
47
Out of Presentation Scope
Other T-languages: T-Refal, T-Fortran
Memoization
Automatically choosing between call-style and fork-style function invocation
Checkpointing
Heartbeat mechanism
Flavours of data references: “normal”, “glue” and “magnetic” (lazy, eager and ultra-eager/speculative data transfer)
48
ACKNOWLEDGEMENTS
“SKIF” supercomputing project
Russian Academy of Sciences grants:
Program “High-performance computing systems on new principles of computational process organization”
Program of the Presidium of the Russian Academy of Sciences “Development of basics for implementation of distributed scientific informational-computational environment on GRID technologies”
Russian Foundation for Basic Research, “офи_а” (ofi_a) grant
Microsoft: contract for the “Open TS vs MPI” case study
49
THANKS … ANY QUESTIONS? …
50
Open TS benchmarks
51
Tests: NAS CG, NAS EP, FIB
52
EP @ OpenTS benchmark
Embarrassingly parallel, recursive implementation
Two parameters:
size – number of operations in the task ~ 2^size
depth – number of grains (t-function calls) = 2^depth
Number of operations per grain ~ 2^(size-depth)
This makes it possible to stress the runtime
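For example (not a figure from the slides, just the definitions above applied): a run with size = 32 and depth = 12 creates 2^12 = 4096 t-function calls, each performing roughly 2^(32-12) = 2^20, i.e. about a million, operations.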
53
NPB Test EP, Rewritten in Open TS
EP – Embarrassingly Parallel, from the NAS Parallel Benchmarks suite
Speedup = 96% of theoretical maximum (on 10 nodes)
(Chart: efficiency as % of theoretical; time as % of sequential)
54
Additional EPs: the same T++ source code linked with different RTL extensions
EP – standard, with dynamic load balancing
EP_ASYNC – “asynchronous”: data exchange interrupts calculation
EP_GS – “grid scheduler”: minimizes load deviation when assigning a task
EP_GS_ASYNC – “grid scheduler” with “asynchronous” data exchange
55
EP metric M
Calculated as 2^size / (time × number of CPUs), taken as a percentage of the best value over all experiments
A good metric: it is approximately the same on a single CPU for depths between 6 and 12
Cluster: 16 dual Athlon MP 1800+ nodes, Fast Ethernet
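As an illustration of the definition (no numbers from the slides are assumed): a run with size = 30 that takes t seconds on p CPUs gets the raw score 2^30 / (t × p); dividing by the best raw score observed in any experiment and multiplying by 100 gives M as a percentage.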
56
EP results
For all size in [28, 32] and depth in [6, 12], M = 99.9% when Ncpu = 1
M drops below 90% when Ncpu > 8 for depth = 6
On 32 CPUs, EP_GS_ASYNC is the best, with M = 88.2% at depth = 12, size = 32