Windows-NT based Distributed Virtual Parallel Machine The MILLIPEDE Project Technion, Israel.

1 Windows-NT based Distributed Virtual Parallel Machine The MILLIPEDE Project Technion, Israel

2 What is Millipede? A strong Virtual Parallel Machine that employs non-dedicated distributed environments. Layered view, top to bottom: programs; implementations of parallel programming languages; the distributed environment.

3 The Millipede stack (top to bottom): programming paradigms (ParC, Java, SPLASH, ParFortran90, CParPar, Cilk/Calipso, CC++, "Bare Millipede", and others); the Millipede layer (Distributed Shared Memory (DSM), Events Mechanism (MJEC), Migration Services (MGS), user-mode threads, software packages); communication packages (U-Net, Transis, Horus, ...); operating system services (communication, threads, page protection, I/O).

4 So, what's in a VPM? Checklist: uses a non-dedicated cluster of PCs (and SMPs); multi-threaded; shared memory; user-mode; strong support for weak memory; dynamic page- and job-migration; load sharing for maximal locality of reference; convergence to the optimal level of parallelism. Millipede inside.

5 Using a non-dedicated cluster Dynamically identify idle machines Move work to idle machines Evacuate busy machines Do everything transparently to the native user Allow co-existence of several parallel applications

6 Multi-Threaded Environments Well known benefits: better utilization of resources; an intuitive, high level of abstraction; latency hiding by overlapping computation and communication. Natural for parallel programming paradigms and environments: the programmer defines the maximal level of parallelism, and the actual level is set dynamically, so applications scale up and down; nested parallelism; SMPs.

7 Convergence to Optimal Speedup The tradeoff: a higher level of parallelism vs. better locality of memory reference. The optimal speedup is not necessarily reached with the maximal number of computers; the achieved level of parallelism depends on the program's needs and on the capabilities of the system.

8 PVM
/* Receive data from master */
msgtype = 0;
pvm_recv(-1, msgtype);
pvm_upkint(&nproc, 1, 1);
pvm_upkint(tids, nproc, 1);
pvm_upkint(&n, 1, 1);
pvm_upkfloat(data, n, 1);
/* Determine which slave I am (0..nproc-1) */
for (i = 0; i < nproc; i++)
    if (mytid == tids[i]) { me = i; break; }

9 Relaxed Consistency (avoiding false sharing and ping-pong) Protocols: Sequential, CRUW, Sync(var), Arbitrary-CW Sync Multiple relaxations for different shared variables within the same program No broadcast and no central address servers (so it can work efficiently over interconnected LANs) New protocols welcome (user-defined?!) Step-by-step optimization towards maximal parallelism

10 LU Decomposition A 1024x1024 matrix, written in SPLASH: advantages gained when reducing the consistency of a single variable (the Global structure).

11 MJEC - Millipede Job Event Control A job has a unique system-wide id Jobs communicate and synchronize by sending events Although a job is mobile, its events follow it and reach its event queue wherever it goes Event handlers are context-sensitive An open mechanism with which various synchronization methods can be implemented

12 MJEC (cont'd) Modes: in Execution Mode, arriving events are enqueued; in Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine.

13 MJEC Interface Registration and entering dispatch mode: milEnterDispatchingMode((FUNC)foo, void *context) Posting an event: milPostEvent(id target, int event, int data) Dispatcher routine syntax: int foo(id origin, int event, int data, void *context) Dispatch loop: on milEnterDispatchingMode(func, context) the dispatcher calls func(INIT, context); it then waits for pending events and calls func(event, context) for each one, until a call returns EXIT, at which point func(EXIT, context) is invoked and the job returns to Execution Mode.

14 Experience with MJEC ParC: ~ 250 lines SPLASH: ~ 120 lines Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers Implementation of location-dependent services (e.g., graphical display)

15 Example - Barriers with MJEC A job arriving at the barrier posts an ARR event to the barrier-server job and enters dispatching mode until it is released:

Barrier() {
    milPostEvent(BARSERV, ARR, 0);
    milEnterDispatchingMode(wait_in_barrier, 0);
}

wait_in_barrier(src, event, data, context) {
    if (event == DEP) return EXIT_DISPATCHER;
    else return STAY_IN_DISPATCHER;
}

16 Example - Barriers with MJEC (cont'd) The barrier server queues arriving jobs and, when all have arrived, releases them with DEP events:

BarrierServer() {
    milEnterDispatchingMode(barrier_server, info);
}

barrier_server(src, event, data, context) {
    if (event == ARR) enqueue(context.queue, src);
    if (should_release(context))
        while (context.cnt > 0) {
            milPostEvent(dequeue(context.queue), DEP, 0);
            context.cnt--;
        }
    return STAY_IN_DISPATCHER;
}

17 Dynamic Page- and Job-Migration Migration may occur in case of: –Remote memory access –Load imbalance –User comes back from lunch –Improving locality by location rearrangement Sometimes migration should be disabled –by system: ping-pong, critical section –by programmer: control system

18 Locality of memory reference is THE dominant efficiency factor Migration can help locality. [Illustrations: only job migration; only page migration; page & job migration.]

19 Load Sharing + Maximal Locality = Minimum-Weight Multiway Cut

20 Problems with the multiway cut model NP-hard for #cuts > 2, and we have n > X,000,000 vertices Polynomial 2-approximations are known, but they are not optimized for load balancing Page replicas The graph changes dynamically Only external accesses are recorded, so only partial information is available

21 Our Approach Record the history of remote accesses Use this information when making decisions concerning load balancing/load sharing Save old information to avoid repeating bad decisions (learn from mistakes) Detect and resolve ping-pong situations Do everything by piggybacking on communication that is taking place anyway

22 Ping-Pong Detection (local): 1. Local threads attempt to use the page a short time after it leaves the local host 2. The page leaves the host shortly after arrival Treatment (by the ping-pong server): Collect information regarding all participating hosts and threads Try to locate an underloaded target host Stabilize the system by locking pages/threads in place

23 TSP - Effect of Locality 15 cities, Bare Millipede. [Plot: execution time (sec, 0-4000) vs. number of hosts (1-6) for NO-FS, OPTIMIZED-FS, and FS.] In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the other two cases each page is used by 2 threads: in FS no optimizations are used, and in OPTIMIZED-FS the history mechanism is enabled.

24 TSP on 6 hosts (k = number of threads falsely sharing a page):

k | optimized? | #DSM-related msgs | #ping-pong treatment msgs | #thread migrations | execution time (sec)
2 | Yes |   5100 | 290 |  68 |  645
2 | No  | 176120 |   0 |  23 | 1020
3 | Yes |   4080 | 279 |  87 |  620
3 | No  | 160460 |   0 |  32 | 1514
4 | Yes |   5060 | 343 |  99 |  690
4 | No  | 155540 |   0 |  44 | 1515
5 | Yes |   6160 | 443 | 139 |  700
5 | No  | 162505 |   0 |  55 | 1442

25 Ping-Pong Detection Sensitivity [Two plots of execution time (sec) vs. detection sensitivity.] TSP-1: best results are achieved at maximal sensitivity, since all pages are accessed frequently. TSP-2: since some of the pages are accessed frequently and others only occasionally, maximal sensitivity causes unnecessary ping-pong treatment and significantly increases execution time.

26 Applications Numerical computations: Multigrid Model checking: BDDs Compute-intensive graphics: Ray-Tracing, Radiosity Games, Search trees, Pruning, Tracking, CFD...

27 Performance Evaluation (open parameters) L: underloaded; H: overloaded Delta(ms): lock-in time t/o delta: polling (MGS, DSM) msg delta: system pages delta T_epoch: max history time ???: remove old histories vs. refresh old histories L_epoch: history length Page histories vs. job histories Migration heuristic: which function? Ping-pong: what is the initial noise? at what frequency is it ping-pong?

28 LU Decomposition A 1024x1024 matrix, written in SPLASH: performance improvements when there are a few threads on each host.

29 LU Decomposition A 2048x2048 matrix, written in SPLASH: super-linear speedups due to the caching effect.

30 Jacobi Relaxation 512x512 matrix (using 2 matrices, no false sharing) written in ParC

31 Overhead of ParC/Millipede on a single host, tested with the Tracking algorithm:

32 Info... Release available at the Millipede site !
