
1 JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support
Wenzhang Zhu, Cho-Li Wang, Francis Lau
The Systems Research Group, Department of Computer Science and Information Systems, The University of Hong Kong

2 HKU JESSICA Project
- JESSICA: "Java-Enabled Single-System-Image Computing Architecture"; project started in 1996, first version (JESSICA1) completed in 1999.
- A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multithreaded Java applications on a cluster of computers.
- JESSICA hides the physical boundaries between machines and makes the cluster appear as a single computer to applications -- a single-system image (SSI).
- Special feature: preemptive thread migration, which allows a thread to move freely between machines.
- Part of the RGC's Area of Excellence project, 1999-2002.

3 JESSICA Team Members
Supervisors: Dr. Francis C.M. Lau, Dr. Cho-Li Wang
Research students:
- Ph.D.: Wenzhang Zhu (Thread Migration)
- Ph.D.: WeiJian Fang (Global Heap)
- M.Phil.: Zoe Ching Han Yu (Distributed Garbage Collection)
- Ph.D.: Benny W. L. Cheung (Software Distributed Shared Memory)
Graduated: Matchy Ma (JESSICA1)

4 Outline
- Introduction to cluster computing
- Motivations
- Related work
- JESSICA2 features
- Performance analysis
- Conclusion & future work

5 What's a cluster?
- A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource (IEEE TFCC).
- My definition: an HPC system that integrates mainstream commodity components to process large-scale problems -- low cost, self-made, yet powerful.

6 Cluster Computer Architecture (layered view)
- Cluster applications (web, storage, computing, rendering, financing, ...)
- Programming environment (Java, C, MPI, HPF, DSM), with management, monitoring, and job scheduling
- Single system image infrastructure and availability infrastructure
- OS on each node
- High-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet)

7 Single System Image (SSI)?
- JESSICA Project: Java-Enabled Single-System-Image Computing Architecture.
- A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.
- Ultimate goal of SSI: make the cluster appear like a single machine to the user, to applications, and to the network.
- Single entry point, single file system, single virtual networking, single I/O and memory space, single process space, single management/programming view, ...

8 Top 500 computers by "classification" (June 2002; source: http://www.top500.org/)

Class           Count  Share   Rmax [GF/s]  Rpeak [GF/s]  Processors
MPP             224    44.8 %  104899.21    168829.00     111104
Constellations  187    37.4 %  40246.50     59038.00      33828
Cluster         80     16.0 %  37596.16     69774.00      50181
SMP             9      1.8 %   39208.10     44875.00      6056
Total           500    100 %   221949.97    342516.00     201169

MPP = Massively Parallel Processor; Constellation = e.g., a cluster of HPCs; Cluster = cluster of PCs; SMP = Symmetric Multiprocessor.

About the TOP500 list:
1. The 500 most powerful computer systems installed in the world.
2. Compiled twice a year since June 1993.
3. Ranked by their performance on the LINPACK benchmark.

9 #1 Supercomputer: NEC's Earth Simulator
- Built by NEC: 640 processor nodes, each consisting of 8 vector processors, for a total of 5120 processors; 40 TFlop/s peak and 10 TB memory.
- Linpack: 35.86 TFlop/s (Tera FLOPS = 10^12 floating-point operations per second, roughly 450 x Pentium 4 PCs).
- Interconnect: single-stage crossbar (1800 miles of cable), 83,000 copper cables, 16 GB/s cross-section bandwidth.
- Area of computer = 4 tennis courts, 3 floors. (Source: NEC)

10 Other Supercomputers in the TOP500
#2, #3: ASCI Q
- 7.7 TF/s Linpack performance; Los Alamos National Laboratory, U.S.
- HP AlphaServer SC (375 x 32-way multiprocessors, 11,968 processors in total), 12 terabytes of memory and 600 terabytes of disk storage.
#4: IBM ASCI White (U.S.)
- 8,192 copper microprocessors (IBM SP POWER3), 160 trillion bytes (TB) of memory, and more than 160 TB of IBM disk storage capacity; Linpack: 7.22 TFlop/s. Located at Lawrence Livermore National Laboratory.
- 512-node, 16-way symmetric multiprocessor; covers an area the size of two basketball courts, weighs 106 tons, uses 2,000 miles of copper wiring. Cost: US$110 million.

11 TOP500 Nov. 2002 List
- 2 new PC clusters made the top 10: #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory; #8 is an HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA.
- A total of 55 Intel-based and 8 AMD-based PC clusters are in the TOP500.
- The number of clusters in the TOP500 grew again, to a total of 93 systems.

12 Poor Man's Cluster: HKU Ostrich Cluster
- 32 x 733 MHz Pentium III PCs, 384 MB memory each.
- Hierarchical Ethernet-based network: four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch.

13 Rich Man's Cluster: Computational Plant (C-Plant cluster)
- 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM).
- Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gb/s) connected to a 64-port Mesh64 switch.
- 48 cabinets, each of which contains 32 nodes (48 x 32 = 1536).

14 The HKU Gideon 300 Cluster (operating since mid-Oct. 2002)
- 300 PCs (2.0 GHz Pentium 4, 512 MB DDR memory, 40 GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch.
- Linpack performance: 355 GFlop/s -- #175 in the TOP500 (Nov. 2002 list).

15 Building Gideon 300

16 JESSICA2: Introduction
Research goal: high-performance Java computing using clusters.
Why Java?
- The dominant language for server-side programming: more than 2 million Java developers [CNETAsia, 06/2002].
- Platform independent: "compile once, run anywhere."
- Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization); see the sketch after this slide.
- Built-in multithreading support at the language level (parallel programming using MPI, PVM, RMI, RPC, HPF, or DSM is difficult).
Why a cluster?
- Large-scale server-side applications need high-performance multithreaded programming support.
- A cluster provides a scalable hardware platform for true parallel execution.
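The "code mobility" point above refers to Java's ability to load classes by name at run time. A minimal, self-contained illustration using the standard reflection API (the class name "Worker" is hypothetical; any class implementing Runnable would do):

    // Load a class by name at run time and start it as a thread.
    // "Worker" is a hypothetical Runnable class used for illustration.
    public class LoadDemo {
        public static void main(String[] args) throws Exception {
            Class<?> cls = Class.forName("Worker");   // dynamic class loading
            Runnable task = (Runnable) cls.getDeclaredConstructor().newInstance();
            new Thread(task).start();                 // language-level multithreading
        }
    }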

17 Java Virtual Machine
- Class loader: loads class files.
- Interpreter: executes bytecode.
- Runtime compiler: converts bytecode to native code.
(Diagram: application and Java API class files enter through the class loader; the execution engine either interprets the bytecode or runs native code produced by the runtime compiler.)

18 Threads in a JVM
(Diagram: threads 1-3, each with its own PC and stack of frames, execute code from the Java method area and share objects in the heap; the class loader brings in class files; an execution engine drives each thread.)

A multithreaded Java program:

    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }
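The CubbyHole class used above is not shown on the slide. A minimal sketch of the one-slot synchronized buffer it implies (Producer and Consumer, also not shown, would call put and get in a loop):

    // One-slot buffer guarded by the object's monitor: get() blocks until
    // a value is available, put() blocks until the slot is empty.
    class CubbyHole {
        private int contents;
        private boolean available = false;

        public synchronized int get() {
            while (!available) {
                try { wait(); } catch (InterruptedException ignored) { }
            }
            available = false;
            notifyAll();            // wake a producer waiting to put
            return contents;
        }

        public synchronized void put(int value) {
            while (available) {
                try { wait(); } catch (InterruptedException ignored) { }
            }
            contents = value;
            available = true;
            notifyAll();            // wake a consumer waiting to get
        }
    }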

19 Java Memory Model
- Defines the memory consistency semantics of multithreaded Java programs: when values must be transferred between the main memory and each thread's working memory.
- There is a "lock" associated with each object, used to protect critical sections and to maintain memory consistency between threads.
- Basic rules: releasing a lock forces a flush of all writes from the working memory employed by the thread, and acquiring a lock forces a (re)load of the values of accessible fields.

20 Java Memory Model (how memory consistency is maintained between threads)
Threads T1 and T2 share an object whose master copy lives in the heap area of main memory; each thread has its own working memory.
1. T1 loads the variable from main memory into its working memory before use.
2. The variable is modified in T1's working memory.
3. When T1 performs unlock, the variable is written back to main memory.
4. When T2 performs lock, the copy of the variable in its working memory is flushed.
5. When T2 next uses the variable, it is loaded afresh from main memory.
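In Java source, these rules surface through synchronized methods and blocks. A small self-contained example of the T1/T2 interaction described above (the class name is illustrative):

    // T1 calling increment() flushes its write to main memory at unlock;
    // T2 calling read() reloads the field at lock, so it sees T1's update.
    class SharedCounter {
        private int value;                      // master copy lives in main memory

        public synchronized void increment() {  // lock: (re)load fields; unlock: flush writes
            value = value + 1;
        }

        public synchronized int read() {
            return value;
        }
    }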

21 Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java virtual machine (DJVM) spanning multiple cluster nodes, providing a true parallel execution environment for multithreaded Java applications with a single-system-image illusion to Java threads.
(Diagram: Java threads created in a program are spread across cluster nodes, each running its own OS, and share a Global Object Space over a high-speed network.)

22 Problems in Existing DJVMs
- Mostly based on interpreters: simple but slow.
- Layered design using a distributed shared memory (DSM) system, which cannot be tightly coupled with the JVM: JVM runtime information cannot be channeled to the DSM; false sharing arises if a page-based DSM is employed; page faults block the whole JVM.
- The programmer specifies the thread distribution, which lacks transparency: multithreaded Java applications must be rewritten, and there is no dynamic thread distribution (preemptive thread migration) for load balancing.

23 Related Work
Method shipping: IBM cJVM
- Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
- Executes in interpreter mode.
- Load-balancing problem: affected by the object distribution.
Page shipping: Rice U. Java/DSM, HKU JESSICA
- Simple; the GOS is supported by a page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
- JVM runtime information cannot be channeled to the DSM.
- Executes in interpreter mode.
Object shipping: Hyperion, Jackal
- Leverages an object-based DSM.
- Executes in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.

24 Related Work (Summary)

System    Approach               Mode  Object access
cJVM      Method shipping        Intr  Proxy
Java/DSM  Manual migration       Intr  Page-based DSM
JESSICA   Transparent migration  Intr  Page-based DSM

Intr = interpreter mode.

25 JESSICA2 Main Features
Transparent Java thread migration
- Runtime capturing and restoring of thread execution context.
- No source code modification, no bytecode instrumenting (preprocessing), no new API introduced.
- Enables dynamic load balancing on clusters.
Operates in Just-In-Time (JIT) compilation mode.
Global Object Space (GOS)
- A shared global heap spanning all cluster nodes.
- Adaptive object home migration protocol.
- I/O redirection.
(JESSICA2 = transparent migration + JIT + GOS.)

26 JESSICA2 Architecture
(Diagram: a multithreaded Java program -- bytecode or source code, such as the ProducerConsumerTest example above -- enters the JESSICA2 architecture.)

27 Transparent Thread Migration in JIT Mode
Simple for interpreters (e.g., JESSICA):
- The interpreter sits in the bytecode-decoding loop, which can be stopped upon checking a migration flag.
- The full state of a thread is available in the interpreter's data structures.
- No register allocation.
JIT-mode execution makes things complex (JESSICA2):
- Native code has no clear bytecode boundary.
- How to deal with machine registers?
- How to organize the stack frames (all are in native form now)?
- How to make the extracted thread states portable and recognizable by the remote JVM?
- How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
- The JIT compiler must be modified to instrument native code.

28 Approaches
Using JVMDI (e.g., HKU M-JavaMPI)?
- Only the recent JDK 1.4.1 (Aug. 2002) provides full-speed debugging to support the capturing of thread status.
- Portable but too heavy: needs large data structures to keep debug information.
- JVMDI alone cannot support the full function of a DJVM: how to access a remote object? Put a DSM under it? But you cannot control Sun's JVM memory allocation unless you get the latest JDK source code.
Our lightweight approach:
- Provide the minimum functions required to capture and restore Java threads, to support Java thread migration.

29 An Overview of JESSICA2 Java Thread Migration
(Diagram: a migration manager and load monitor in the thread scheduler coordinate migration from a source node to a destination node.)
1. The source JVM alerts the thread; stack analysis and stack capturing extract its frames.
2. The captured frames are shipped to the destination node.
3. The destination JVM parses the frames and restores the execution context, including the PC.
4a. Object accesses go through the GOS (heap).
4b. Method code is loaded from NFS into the method area.

30 What Are Those Functions?
- Migration-point selection: delayed to the head of a loop basic block or a method; see the sketch after this list.
- Register context handler: spill dirty registers at the migration point without invalidation, so native code can continue to use the registers; use a register-recovering stub in the restoring phase.
- Variable type deduction: spill types in stacks using compression.
- Java frame linking: discover consecutive Java frames.
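A conceptual Java-level view of what the instrumented code does at a migration point. In JESSICA2 the check is emitted directly as native code at the head of a loop basic block; all names below (migrationRequested, captureAndMigrate, and the stub methods) are hypothetical, for illustration only:

    // Sketch of a migration point planted at a loop head. The real check
    // lives in JIT-generated native code, not in Java source.
    class MigratableLoop {
        static volatile boolean migrationRequested;  // set by the load monitor

        void run() {
            while (hasWork()) {
                if (migrationRequested) {   // migration flag checking
                    spillDirtyRegisters();  // spill without invalidation: registers stay usable
                    captureAndMigrate();    // walk the Java frames and ship the state
                }
                doUnitOfWork();
            }
        }

        // Stubs standing in for application work and runtime internals.
        boolean hasWork() { return false; }
        void doUnitOfWork() { }
        void spillDirtyRegisters() { }
        void captureAndMigrate() { }
    }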

31 Dynamic Thread State Capturing and Restoring in JESSICA2
(Diagram: the compilation pipeline -- bytecode verifier, bytecode translation, and migration-point selection produce intermediate code; code generation then (1) adds migration checking, (2) adds object checking, and (3) adds type and register spilling; register allocation yields the final native code. On the capturing side, the native thread stack (Java frames and C frames) is scanned at a migration point. On the restoring side, frames are rebuilt via native stack scanning, linking and constant resolution, and register recovering; global object access goes through the GOS.)

32 How to Maintain Memory Consistency in a Distributed Environment?
(Diagram: threads T1-T8 run on four PCs connected by a high-speed network, all sharing a single logical heap.)

33 Embedded Global Object Space (GOS)
Main features:
- Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads).
- Uses a threaded I/O interface inside the JVM for communication, hiding latency: non-blocking GOS access.
- Object-based, to reduce false sharing.
- Home-based, compliant with the JVM memory model ("lazy release consistency").
- Master heap (home objects) and cache heap (local and cached objects) reduce object access latency; see the sketch below.
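A rough sketch of the master-heap/cache-heap split, assuming an invented long-valued global object identifier and a fetchFromHome stub (JESSICA2's actual internal interfaces are not shown on the slides):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // An access tries the local cache heap first and falls back to
    // fetching the master copy from the object's home node on a miss.
    class CacheHeap {
        private final Map<Long, Object> cached = new ConcurrentHashMap<>();

        Object access(long globalRef) {
            Object copy = cached.get(globalRef);
            if (copy == null) {                   // miss: go to the home node
                copy = fetchFromHome(globalRef);  // network fetch of the master copy
                cached.put(globalRef, copy);
            }
            return copy;
        }

        // Per the memory model above, cached copies are discarded at
        // synchronization points and re-fetched from their homes.
        void invalidateOnSync() { cached.clear(); }

        private Object fetchFromHome(long ref) { return new Object(); }  // stub
    }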

34 HKJU, Dec. 18, 2002 JESSICA2, CSIS, HKU 34 Object Cache

35 Adaptive Object Home Migration
- Definition: the "home" of an object is the JVM that holds the object's master copy.
- Problem: cache objects need to be flushed and re-fetched from the home whenever synchronization happens.
- Adaptive object home migration: if the number of accesses from one thread dominates the total number of accesses to an object, the object's home is migrated to the node where that thread is running (sketched below).
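A sketch of the home-migration heuristic, with a per-object, per-node access counter. The dominance threshold is an assumption; the slides do not give the actual value:

    // Decide when an object's home should move: if one node's accesses
    // dominate the total accesses to the object, migrate the home there.
    class HomeMigrationPolicy {
        private final int[] accessCount;              // one counter per node
        private int total;
        private static final double DOMINANCE = 0.9;  // assumed threshold

        HomeMigrationPolicy(int nodes) { accessCount = new int[nodes]; }

        // Record an access from 'node'; return the new home node,
        // or -1 if the home should stay where it is.
        int recordAccess(int node) {
            accessCount[node]++;
            total++;
            return (accessCount[node] > total * DOMINANCE) ? node : -1;
        }
    }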

36 I/O Redirection
Timer
- Use the time on the master node as the standard time.
- Calibrate the time on each worker node when it registers with the master node.
File I/O
- Use half a word of the fd as the node number.
- Open: for read, check locally first, then the master node; for write, go to the master node.
- Read/write: go to the node specified by the node number in the fd (see the sketch below).
Network I/O
- Connectionless send: do it locally.
- Everything else goes to the master.
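A sketch of the fd encoding described above, assuming a 32-bit descriptor whose upper half word carries the node number and whose lower half word carries the node-local fd (the exact widths are not stated on the slide):

    // Pack and unpack a global file descriptor: node number in the high
    // half word, node-local fd in the low half word.
    final class GlobalFd {
        static int encode(int nodeId, int localFd) {
            return (nodeId << 16) | (localFd & 0xFFFF);
        }
        static int nodeOf(int globalFd)  { return globalFd >>> 16; }
        static int localOf(int globalFd) { return globalFd & 0xFFFF; }
    }

A read or write on such a descriptor is then forwarded to nodeOf(fd) and performed there against localOf(fd).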

37 Experimental Setting
- Modified open-source Kaffe JVM, version 1.0.6.
- Linux PC clusters:
  1. Pentium II PCs at 540 MHz (Linux 2.2.1 kernel), connected by Fast Ethernet.
  2. HKU Gideon 300 cluster (ray tracing).

38 Parallel Ray Tracing on JESSICA2 (running on a 64-node Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Red Hat 7.3).
- 64 nodes: 108 seconds; 1 node: 3430 seconds (~1 hour).
- Speedup = 4402/108 = 40.75.

39 Micro Benchmarks (PI Calculation)

40 Java Grande Benchmark

41 SPECjvm98 Benchmark
Legend: "M-" = migration mechanism disabled; "M+" = migration mechanism enabled; "I+" = pseudo-inlining enabled; "I-" = pseudo-inlining disabled.

42 JESSICA2 vs. JESSICA (CPI)

43 Application Benchmark

44 Effect of Adaptive Object Home Migration (SOR)

45 Conclusions
- Transparent Java thread migration in a JIT compiler enables high-performance execution of multithreaded Java applications on clusters while keeping the merits of Java.
- The JVM approach gives dynamic class loading; Just-in-Time compilation gives speed.
- An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead.

46 Thanks
HKU SRG: http://www.srg.csis.hku.hk/
JESSICA2 web page: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html

