
1. “Towards an SSI for HP Java”
Francis Lau, The University of Hong Kong
With contributions from C.L. Wang, Ricky Ma, and W.Z. Zhu

2. Cluster Coming of Age
HPC
– Cluster: the de facto standard equipment
– Grid?
Clusters
– Fortran or C + MPI the norm
– 99% run on top of bare-bones Linux or the like
– OK if the application is embarrassingly parallel and regular

3. Cluster for the Mass
Two modes:
– For number crunching in Grande-type applications (superman)
– As a CPU farm to support high-throughput computing (poor man)
Commercial: Data Mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation
Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling
Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction

4. Cluster Programming
Auto-parallelization tools have had limited success
Parallelization is a chore, but “we have to do it” (or let’s hire someone)
Optimization for performance is not many users’ cup of tea
– Partitioning and parallelization
– Mapping
– Remapping (experts?)

5. Amateur Parallel Programming
Common problems
– Poor parallelization: a few large chunks or many small chunks
– Load imbalance: large and small chunks
Meeting the amateurs half-way
– They do crude parallelization
– The system does the rest: mapping/remapping (automatic optimization)
– And I/O?

6. Automatic Optimization
“Feed the fat boy with two spoons, and a few slim ones with one spoon”
But load information can be elusive
Needs smart runtime support
Goal: achieve high performance with good resource utilization and load balancing
Large chunks that are single-threaded are a problem

7. The Good “Fat Boys”
Large chunks that span multiple nodes
– Must be a program with multiple execution “threads”
– Threads can be in different nodes: the program expands and shrinks
– Threads/programs can roam around: dynamic migration
This encourages fine-grain programming
(Figure: an “amoeba”-like program spanning cluster nodes)

8. Mechanism and Policy
Mechanism for migration
– Traditional process migration
– Thread migration
– Redirection of I/O and messages
– Object sharing between nodes for threads
Policy for good dynamic load balancing
– Message traffic a crucial parameter
– Predictive
Towards the “single system image” ideal

9. Single System Image
If the user does only crude parallelization and the system does the rest…
If processes/threads can roam, and processes expand/shrink…
If I/O (including sockets) can be at any node anytime…
…then we achieve at least 50% of SSI
– The rest is difficult: Single Entry Point, File System, Virtual Networking, I/O and Memory Space, Process Space, Management / Programming View, …

10. Bon Java!
Java (for HPC) is in good hands
– JGF Numerics Working Group, IBM Ninja, …
– JGF Concurrency/Applications Working Group (benchmarking, MPI, …)
– The workshops
Java has many advantages (vs. Fortran and C/C++)
Performance is not an issue any more
Threads as first-class citizens!
The JVM can be modified
“Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer” – The Java Grande Forum Charter

11. Process vs. Thread Migration
Process migration is easier than thread migration
– Threads are tightly coupled
– They share objects
Two styles to explore
– Process + MPI (“distributed computing”)
– Thread + shared objects (“parallel computing”)
– Or combined
Boils down to messages vs. distributed shared objects

12. Two Projects @ HKU
M-JavaMPI: “M” for “Migration”
– Process migration
– I/O redirection
– Extension to grid
– No modification of JVM and MPI
JESSICA: “Java-Enabled Single System Image Computing Architecture”
– By modifying the JVM
– Thread migration, amoeba mode
– Global object space, I/O redirection
– JIT mode (version 2)

13. Design Choices
Bytecode instrumentation
– Insert code into programs, manually or via a pre-processor
JVM extension
– Make thread state accessible from the Java program
– Non-transparent
– Modification of the JVM is required
Checkpointing the whole JVM process
– Powerful but heavy penalty
Modification of the JVM
– Runtime support
– Totally transparent to the applications
– Efficient but very difficult to implement

14. M-JavaMPI
Supports transparent Java process migration and provides communication redirection services
Communication using MPI
Implemented as middleware on top of a standard JVM
No modifications of JVM or MPI
Checkpointing the Java process + code insertion by a preprocessor

15. System Architecture (figure)

16. Preprocessing
Bytecode is modified before being passed to the JVM for execution
“Restoration functions” are inserted as exception handlers, in the form of encapsulated “try-catch” statements
Re-arrangement of bytecode, and addition of local variables

17. The Layers
Java-MPI API layer
Restorable MPI layer
– Provides restorable MPI communications
– No modification of the MPI library
Migration layer
– Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node
– Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application

18. State Capturing and Restoring
Program code: re-used in the destination node
Data: captured and restored using the object serialization mechanism
Execution context: captured using JVMDI and restored by inserted exception handlers
Eager (all) strategy: for each frame, the local variables, the referenced objects, the names of the class and class method, and the program counter are saved using object serialization (a sketch follows below)
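
As a rough illustration of the eager strategy, the per-frame record could look like the following minimal sketch. All names here are hypothetical, chosen for illustration rather than taken from M-JavaMPI's actual code:

    import java.io.Serializable;

    // One saved frame: everything needed to rebuild it on the destination node.
    class FrameState implements Serializable {
        String className;     // declaring class of the executing method
        String methodName;    // method of this frame
        int pc;               // bytecode program counter at suspension
        Object[] locals;      // local variables, including referenced objects
    }

    // The whole execution context is then a serializable list of frames.
    class ThreadState implements Serializable {
        java.util.List<FrameState> frames = new java.util.ArrayList<>();
    }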

19. State Capturing using JVMDI
Before preprocessing:

    public class A {
        int a;
        char b;
        ...
    }

After preprocessing (pseudocode; the inserted handler restores the saved state and resumes execution):

    public class A {
        try {
            ...
        } catch (RestorationException e) {
            a = /* saved value of local variable a */;
            b = /* saved value of local variable b */;
            pc = /* saved value of the program counter at suspension */;
            // jump to the location where the program was suspended
        }
    }

20. Message Redirection Model
An MPI daemon in each node supports message passing between distributed Java processes
IPC between the Java program and the MPI daemon in the same node goes through shared memory and semaphores (a client-server arrangement; a sketch follows below)
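
A minimal sketch of the client-side view of that daemon channel, with hypothetical names; the real transport is shared memory plus semaphores reached through native code, not shown here:

    // Hypothetical interface between a Java process and its local MPI daemon.
    interface MpiDaemonChannel {
        void send(int destRank, byte[] payload);   // hand off to the daemon, which calls MPI
        byte[] recv(int srcRank);                  // block until the daemon delivers a message
    }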

21. Process Migration Steps (figure: source node and destination node)

22. Experiments
PC cluster
– 16-node cluster
– 300 MHz Pentium II with 128 MB of memory
– Linux 2.2.14 with Sun JDK 1.3.0
– 100 Mb/s Fast Ethernet
All Java programs executed in interpreted mode

23. Bandwidth: PingPong Test
Native MPI: 10.5 MB/s
Direct Java-MPI binding: 9.2 MB/s
Restorable MPI layer: 7.6 MB/s

24. Latency: PingPong Test
Native MPI: 0.2 ms
Direct Java-MPI binding: 0.23 ms
Restorable MPI layer: 0.26 ms

25. Migration Cost: capturing and restoring objects (figure)

26. Migration Cost: capturing and restoring frames (figure)

27. Application Performance
PI calculation
Recursive ray-tracing
NAS integer sort
Parallel SOR

28. Time spent in calculating PI and ray-tracing, with and without the migration layer (figure)

29. Execution time of the NAS program with different problem sizes (16 nodes)

    Problem size          Without M-JavaMPI (sec)   With M-JavaMPI (sec)      Overhead (%)
    (no. of integers)     Total   Comp    Comm      Total   Comp    Comm      Total   Comm
    Class S: 65536        0.023   0.009   0.014     0.026   0.009   0.017     13%     21%
    Class W: 1048576      0.393   0.182   0.212     0.424   0.182   0.242     7.8%    14%
    Class A: 8388608      3.206   1.545   1.660     3.387   1.546   1.840     5.6%    11%

No noticeable overhead is introduced in the computation part; the communication part incurs an overhead of about 10–20%.

30. Time spent in executing SOR using different numbers of nodes, with and without the migration layer (figure)

31. Cost of Migration
Time spent in executing the SOR program on an array of size 256x256, without and with one migration during the execution (figure)

32. Cost of Migration
Time spent in migration (in seconds) for different applications:

    Application    Average migration time (s)
    PI             2
    Ray-tracing    3
    NAS            2
    SOR            3

33. Dynamic Load Balancing
A simple test
– The SOR program was executed using six nodes in an unevenly loaded environment, with one of the nodes executing a computationally intensive program
Without migration: 319 s
With migration: 180 s

34. In Progress
– M-JavaMPI in JIT mode
– Develop system modules for automatic dynamic load balancing
– Develop system modules for effective fault-tolerant support

35. Java Virtual Machine
Class Loader: loads class files
Interpreter: executes bytecode
Runtime Compiler: converts bytecode to native code
(Figure: application class files and Java API class files enter the class loader; bytecode flows to the interpreter or to the runtime compiler, which emits native code)

36. Threads in JVM
(Figure: per-thread PC and stack frames over a shared heap of objects, with the class loader, the method area (code), and the execution engine)

A multithreaded Java program:

    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }

37. Java Memory Model (how to maintain memory consistency between threads)
Threads T1 and T2 each have a per-thread working memory; the master copy of each variable lives in main memory (the heap area). See the example below.
1. A variable is loaded from main memory to working memory before use.
2. The variable is modified in T1’s working memory.
3. When T1 performs an unlock, the variable is written back to main memory.
4. When T2 performs a lock, the variable in its working memory is flushed.
5. When T2 next uses the variable, it is loaded from main memory.
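
To make the lock/unlock rules concrete, here is a small, ordinary Java example (not from the slides) in which the synchronized block supplies exactly the lock and unlock actions described above:

    class Counter {
        private int value;            // master copy lives in main memory

        void increment() {
            synchronized (this) {     // lock: flush the working copy, re-read from main memory
                value++;              // modified in this thread's working memory
            }                         // unlock: write the variable back to main memory
        }
    }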

38. Problems in Existing DJVMs
Mostly based on interpreters
– Simple but slow
Layered design using a distributed shared memory system (DSM), which cannot be tightly coupled with the JVM
– JVM runtime information cannot be channeled to the DSM
– False sharing if a page-based DSM is employed
– Page faults block the whole JVM
Programmer must specify the thread distribution: lack of transparency
– Need to rewrite multithreaded Java applications
– No dynamic thread distribution (preemptive thread migration) for load balancing

39. Related Work
Method shipping: IBM cJVM
– Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located
– Executed in interpreter mode
– Load balancing problem: affected by the object distribution
Page shipping: Rice U. Java/DSM, HKU JESSICA
– Simple; the GOS is supported by a page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia)
– JVM runtime information can’t be channeled to the DSM
– Executed in interpreter mode
Object shipping: Hyperion, Jackal
– Leverage object-based DSMs
– Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code

40. Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java virtual machine (DJVM) spanning multiple cluster nodes that provides a true parallel execution environment for multithreaded Java applications, with a single-system-image illusion to Java threads
(Figure: Java threads created in a program run over a global object space across cluster nodes connected by a high-speed network)

41. JESSICA2 Main Features
Transparent Java thread migration
– Runtime capturing and restoring of thread execution context
– No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
– Enables dynamic load balancing on clusters
Operates in Just-In-Time (JIT) compilation mode
Global Object Space
– A shared global heap spanning all cluster nodes
– Adaptive object home migration protocol
– I/O redirection

42. Transparent Thread Migration in JIT Mode
Simple for interpreters (e.g., JESSICA)
– The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag
– The full state of a thread is available in the interpreter’s data structures
– No register allocation
JIT-mode execution makes things complex (JESSICA2)
– Native code has no clear bytecode boundary
– How to deal with machine registers?
– How to organize the stack frames (all are in native form now)?
– How to make extracted thread states portable and recognizable by the remote JVM?
– How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
Need to modify the JIT compiler to instrument native code

43. Approaches
Using JVMDI (e.g., M-JavaMPI)?
– Only newer JDKs (Aug. 2002) provide full-speed debugging to support the capturing of thread status
– Portable, but too heavy: large data structures are needed to keep debug information
– JVMDI alone cannot support the full function of a DJVM: how to access remote objects? Put a DSM under it? But you can’t control Sun JVM’s memory allocation unless you get the latest JDK source code
Our lightweight approach
– Provide the minimum functions required to capture and restore Java threads, to support Java thread migration

44. An Overview of JESSICA2 Java Thread Migration
(Figure: on the source node, the thread scheduler, migration manager, and load monitor drive the steps: (1) alert the thread, (2) stack analysis and stack capturing of its frames; on the destination node, (3) frame parsing and restoring execution, with (4a) object access through the GOS heap and (4b) loading methods from NFS into the method area.)

45. Essential Functions
Migration point selection
– At the start of a loop, basic block, or method (see the sketch below)
Register context handler
– Spill dirty registers at the migration point without invalidation, so that native code can continue to use the registers
– Use a register-recovering stub in the restoring phase
Variable type deduction
– Spill types in stacks using compression
Java frame linking
– Discover consecutive Java frames
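
As a purely conceptual sketch of the migration-point check: JESSICA2 actually instruments JIT-generated native code, but expressed in Java source, with hypothetical names, the inserted check behaves roughly like this:

    class MigratableWorker {
        static volatile boolean migrationRequested;   // set by the load monitor

        void run(int n) {
            for (int i = 0; i < n; i++) {             // loop head: a migration point
                if (migrationRequested) {
                    captureAndMigrate();              // spill registers, capture frames
                }
                work(i);
            }
        }

        void captureAndMigrate() { /* capture state, ship it to the destination node */ }
        void work(int i)         { /* the application's real work */ }
    }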

46. Dynamic Thread State Capturing and Restoring in JESSICA2
(Figure: the bytecode verifier feeds bytecode translation; migration-point selection and code generation produce intermediate code, where the JIT (1) adds migration checking, (2) adds object checking, and (3) adds type and register spilling; register allocation then emits native code with checks such as "cmp mflag,0 / jz". Capturing scans the native thread stack (Java frames and C frames, global object access); restoring performs linking and constant resolution and recovers registers from the spilled slots.)

47. How to Maintain Memory Consistency in a Distributed Environment?
(Figure: threads T1–T8 spread across four cluster nodes over a high-speed network, all sharing one logical heap.)

48. Embedded Global Object Space (GOS)
Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
Uses a threaded I/O interface inside the JVM for communication to hide latency: non-blocking GOS access
Object-based, to reduce false sharing
Home-based, compliant with the Java Memory Model (“lazy release consistency”)
Master heap (home objects) and cache heap (local and cached objects) reduce object access latency (see the sketch below)
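
A minimal sketch of the home-based read path, assuming hypothetical names for the two heaps and the global reference type (not JESSICA2's actual internals):

    import java.util.HashMap;
    import java.util.Map;

    interface GlobalRef { int homeNode(); }

    class GosHeap {
        final int localNodeId;
        final Map<GlobalRef, Object> masterHeap = new HashMap<>();  // home objects
        final Map<GlobalRef, Object> cacheHeap  = new HashMap<>();  // cached copies

        GosHeap(int localNodeId) { this.localNodeId = localNodeId; }

        Object read(GlobalRef ref) {
            if (ref.homeNode() == localNodeId) {
                return masterHeap.get(ref);    // we are the home: use the master copy
            }
            Object copy = cacheHeap.get(ref);
            if (copy == null) {
                copy = fetchFromHome(ref);     // remote fetch on a cache miss
                cacheHeap.put(ref, copy);
            }
            return copy;
        }

        Object fetchFromHome(GlobalRef ref) { /* network request to the home node */ return null; }
    }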

49. Object Cache (figure)

50. Adaptive Object Home Migration
Definition
– The “home” of an object is the JVM that holds the master copy of the object
Problem
– Cached objects need to be flushed and re-fetched from the home whenever synchronization happens
Adaptive object home migration
– If the number of accesses from one thread dominates the total number of accesses to an object, the object’s home is migrated to the node where that thread is running (a sketch follows below)
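
The heuristic can be pictured with a small sketch; the counters, the 2/3 threshold, and the method names are illustrative assumptions, not JESSICA2's actual protocol:

    import java.util.HashMap;
    import java.util.Map;

    // Per-object access statistics driving adaptive home migration.
    class HomeStats {
        private final Map<Integer, Integer> accessesByNode = new HashMap<>();
        private int totalAccesses;
        int home;                                 // node holding the master copy

        HomeStats(int initialHome) { this.home = initialHome; }

        void recordAccess(int node) {
            int n = accessesByNode.merge(node, 1, Integer::sum);
            totalAccesses++;
            // If one node clearly dominates (here, > 2/3 of all accesses), move the home there.
            if (node != home && n * 3 > totalAccesses * 2) {
                home = node;                      // migrate the master copy
            }
        }
    }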

51. I/O Redirection
Timer
– Use the time in the master node as the standard time
– Calibrate the time in each worker node when it registers with the master node
File I/O
– Use a half-word of the “fd” as the node number (see the sketch below)
– Open: for read, check the local node first, then the master node; for write, go to the master node
– Read/write: go to the node specified by the node number in the fd
Network I/O
– Connectionless send: do it locally
– Others: go to the master
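
For illustration only, one way to pack a node number into a file descriptor, as the slide suggests (the exact layout in JESSICA2 may differ):

    // Hypothetical fd layout: high half-word = node number, low half-word = local fd.
    final class NodeFd {
        static int encode(int node, int localFd) { return (node << 16) | (localFd & 0xFFFF); }
        static int node(int fd)                  { return fd >>> 16; }
        static int localFd(int fd)               { return fd & 0xFFFF; }
    }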

52. Experimental Setting
Modified Kaffe open JVM, version 1.0.6
Linux PC clusters:
1. Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet
2. HKU Gideon 300 cluster (for the ray-tracing demo)

53. Parallel Ray Tracing on JESSICA2
(Using 64 nodes of the Gideon 300 cluster; Linux 2.4.18-3 kernel, Red Hat 7.3)
64 nodes: 108 seconds
1 node: 4402 seconds (about 73 minutes)
Speedup = 4402/108 = 40.75

54. Micro-Benchmarks (PI Calculation) (figure)

55. Java Grande Benchmark (figure)

56. SPECjvm98 Benchmark (figure)
“M-”: migration mechanism disabled; “M+”: migration enabled
“I+”: pseudo-inlining enabled; “I-”: pseudo-inlining disabled

57. JESSICA2 vs. JESSICA (CPI) (figure)

58. Application Performance (figure)

59. Effect of Adaptive Object Home Migration (SOR) (figure)

60. Work in Progress
New optimization techniques for the GOS
Incremental distributed GC
Load balancing module
Enhanced single I/O space to benefit more real-life applications
Parallel I/O support

61. Conclusion
Effective HPC for the mass
– They supply the (parallel) program; the system does the rest
– Let’s hope for parallelizing compilers
– Small- to medium-grain programming
– SSI the ideal
– Java the choice
– Poor-man mode too
Thread distribution and migration are feasible
Overhead reduction
– Advances in low-latency networking
– Migration as an intrinsic function (JVM, OS, hardware)
Grid and pervasive computing

62. Some Publications
W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers,” ICPP 2003, Taiwan, October 2003.
W.J. Fang, C.L. Wang, and F.C.M. Lau, “On the Design of Global Object Space for Efficient Multi-threading Java Computing on Clusters,” Parallel Computing, to appear.
W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support,” CLUSTER 2002, Chicago, September 2002, 381-388.
R. Ma, C.L. Wang, and F.C.M. Lau, “M-JavaMPI: A Java-MPI Binding with Process Migration Support,” CCGrid 2002, Berlin, May 2002.
M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture,” Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, 1194-1222.

63. THE END. And Thanks!

