1 Grid Computing in Hong Kong Dr. Cho-Li Wang Systems Research Group Department of Computer Science and Information Systems The University of Hong Kong
2 Agenda Grid computing – a simple picture The Hong Kong Grid SRG Projects SLIM, ODGPC G-JavaMPI JESSICA2 LOTS DSM for Grid Summary and Conclusion
3 Grid Computing : A Simple Picture Grid Computing Access to remote resources via standard protocols for cross-domain collaboration CPU power, Memory, Network, Storage… Data.. Services.. Resource providers End users Much like “utilities” in our daily lives – electricity, water, etc. Advantages: Cost-effectiveness Platform extensibility Convenience (P&P)
4 Grid Computing in Hong Kong -- The Hong Kong Grid The experimental grid in HK Supported under HKU Foundation Seed Grant http://www.hkgrid.org/
5 The Hong Kong Grid (HKGrid) Goals: (1) to construct and make available a grid test bed that facilitates the development of grid middleware and applications by local industry and institutions in Hong Kong and their partners in the region; (2) to demonstrate the benefits of adopting grid technologies and to showcase outstanding results of development or application. HKGrid provides a platform for its members to experiment with various research prototypes and pilot applications.
6 HKGrid - Current constituents
Institution: Computing facilities
City University of HK: Service gateway (2-way Xeon SMP)
HK Baptist University: 2-way Xeon SMP x 64 (#300 in TOP500, 6/2003)
HK University of Science and Technology: 4-way SMP cluster
The HK Polytechnic University: Service gateway (2-way Xeon SMP)
The HK Institute of HPC: Service gateway (2-way Xeon SMP)
HKU – Computer Centre: 2-way Xeon SMP x 128 (#240 in TOP500, 11/2003)
HKU – Department of CSIS: Pentium 4 x 300 (#340 in TOP500, 6/2003)
Total: a 4 Tflop/s theoretical maximum computing power
7 Grid Point Monitoring with Ganglia URL: http://gideon.csis.hku.hk/status/
9 Main Computing Facilities: HKU-CSIS Gideon 300 Cluster
10 Research Projects in HKGrid
HKBU: Knowledge Grid (autonomous grid service composition)
HKPU: Peer-to-peer (P2P) grid, meta scheduler, fault tolerance
HKUST: Development of sensor Grid infrastructure
HKU ETI: Modelling of Air Quality in Hong Kong (E-Business Technology Institute with the Environmental Protection Department, HKSAR)
Computer Centre: HKU campus grid; scientific applications running across the ApGrid
CSIS: Robust Speech Recognition (J. Wu and Dr. Q. Huo)
CSIS: Simulation for the DNA Shuffling Experiment (W.H. Hon and Dr. T.W. Lam)
CSIS: Approximate String Matching on DNA Sequences (L.L. Cheng)
CSIS: Whole Genome Alignment via Mutation-Sensitive Sequence Similarity (H.L. Chan, N. Lu, and Dr. T.W. Lam)
ME: Parallel Simulation of Turbulent Flow Model (Dr. C.H. Liu, Dept. of Mechanical Engineering)
CSIS: HKU Grid Point (863 Project: China National Grid)
CSIS: Asia-Pacific Grid …..
11 HKGrid – Connections Links to China National Grid (CNGrid) and Asia-Pacific Grid (ApGrid) via CERNET and APAN Internet2 connection to the Abilene backbone at Chicago, USA Plays the role of a gateway for the other bigger grids
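12 China National Grid (CNGrid) : 863 Project China National Grid Participants: Shanghai Supercomputer Center; Institute of Computing Technology, Chinese Academy of Sciences; The University of Hong Kong (CSIS); Xi'an Jiaotong University; University of Science and Technology of China; National University of Defense Technology; Institute of Applied Physics, Chinese Academy of Sciences; Tsinghua University. Supporting software: VEGA (织女星) grid management system: dynamic service deployment, single sign-on, data replication, and performance monitoring. Developed by the Institute of Computing Technology, Chinese Academy of Sciences. V.1.0 released 8 The grid system software developed by the Institute of Computing Technology, Chinese Academy of Sciences, has connected the grid nodes of the Institute, Huazhong University of Science and Technology, and the University of Hong Kong through VEGA_GOS …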
14 ApGrid Demo at the HKU School Open Day (Oct. 2003)
15 Grid Research at HKU-CSIS SRG Projects SLIM + ODGPC G-JavaMPI JESSICA2 LOTS DSM
16 Our Goal Utility computing: to aggregate and make use of distributed computing resources transparently. Traditional means: to utilize the dedicated HPC facilities distributed across institutions; performance and reliability are key. Pervasive means: any user can be a resource provider (e.g., idle PCs), a consumer, or both; convenience and security are key. Goal: to construct an advanced grid computing platform that accommodates utility-like computing via traditional and “pervasive” means.
17 Research at HKU – An Advanced Grid Computing Platform [Diagram: the AGP and its components G-JavaMPI, JESSICA, LOTS, ODGPC, and SLIM. Research issues: load balancing; single-system image; on-demand grid point construction (ODGPC). Objectives: performance and reliability; user's convenience; grid point construction; convenient system administration. Additional label: Programming Environment.]
19 SLIM Utility computing decouples computing platforms (resources) from computing logic (applications), i.e., a single platform can run completely different applications. Problem: different applications demand different execution environments (OS, shared libraries, supporting apps, etc.). The hassles of managing execution environments (EEs) on the resource provider side offset the benefits of resource sharing. SLIM is a network service for managing and constructing EEs and disseminating them to remote computing platforms.
20 SLIM – System design How does it work?
1. A node sends an EE specification across the network to find the Boot server.
2. The Boot server delivers the requested Linux kernel.
3. The Image server constructs an EE by collecting shared libraries, user data, etc.
4. The Linux kernel boots and contacts the Image server to “mount” the EE via a file synchronization protocol such as NFS.
Aggressive caching techniques are deployed to optimize performance.
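The four steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class names `EESpec`, `BootServer`, and `ImageServer`, the dictionary-based "images", and the cache keyed by specification are all hypothetical and are not taken from the SLIM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EESpec:
    """Hypothetical execution-environment specification sent by a node."""
    kernel: str
    libraries: tuple  # names of shared libraries the application needs


class BootServer:
    """Hypothetical boot server: maps a kernel name to a kernel image."""

    def __init__(self, kernels):
        self.kernels = kernels

    def deliver_kernel(self, spec):
        return self.kernels[spec.kernel]


class ImageServer:
    """Hypothetical image server: assembles an EE and caches the result."""

    def __init__(self, components):
        self.components = components
        self.cache = {}
        self.builds = 0  # counts how often an EE was actually constructed

    def construct_ee(self, spec):
        # Caching: an EE is assembled once per distinct specification.
        if spec not in self.cache:
            self.builds += 1
            self.cache[spec] = {name: self.components[name]
                                for name in spec.libraries}
        return self.cache[spec]


def boot_node(spec, boot_server, image_server):
    """Steps 1-4: send spec, receive kernel, construct EE, mount it."""
    kernel = boot_server.deliver_kernel(spec)
    ee = image_server.construct_ee(spec)
    return {"kernel": kernel, "mounted": ee}


boot = BootServer({"linux-2.4": "kernel-image-2.4"})
images = ImageServer({"libc": "libc-files", "mpi": "mpi-files"})
spec = EESpec(kernel="linux-2.4", libraries=("libc", "mpi"))

node_a = boot_node(spec, boot, images)
node_b = boot_node(spec, boot, images)  # second node reuses the cached EE
```

In this sketch two nodes that request the same specification trigger only one EE construction, which is the kind of saving that caching is meant to provide.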
21 SLIM – Ongoing and future work SLIM has been managing: the HKU-CSIS grid point (350 nodes) for various grid research projects; an additional 300+ lab machines for teaching purposes (different courses have different requirements). Future work: to overcome the challenges in deploying SLIM over broadband links; to realize “pervasive utility computing”.
22 On-Demand Grid Point Construction (ODGPC)
1. Software installation at SLIM server
2. Client boots and obtains kernel
3. OS image/App disseminated
4. Process to generate certificates
23 SLIM and ODGPC Performance Evaluation Boot up 100 machines (Linux + GT3): 6 minutes. Generate certificates for 100 machines (Step 4): 30 minutes. Total time: 6 + 30 = 36 minutes. Boot up 256 PCs (OS only): < 5 minutes.
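The 100-machine figures can be broken down with a few lines of arithmetic. The variable names are illustrative, and the per-machine value is only an average over the reported 30 minutes; the slides do not say whether certificates are generated sequentially or in parallel.

```python
machines = 100
boot_minutes = 6    # boot 100 machines (Linux + GT3)
cert_minutes = 30   # generate certificates for 100 machines (Step 4)

total_minutes = boot_minutes + cert_minutes
# Average certificate time per machine, in seconds.
cert_seconds_per_machine = cert_minutes * 60 / machines
# Fraction of the total construction time spent on certificates.
cert_share = cert_minutes / total_minutes

print(total_minutes, cert_seconds_per_machine, round(cert_share, 2))
```

Under these numbers, certificate generation accounts for most of the grid point construction time.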
25 G-JavaMPI A grid-enabled Java-MPI system with dynamic load-balancing via process migration
26 G-JavaMPI A grid-enabled implementation of the Java binding of MPI, supporting efficient MPI communication among distributed Java processes. Supports transparent Java process migration (through JVMDI) within and across grid points to balance CPU and network loads. Communication-aware process migration policies are based on: the application’s communication pattern; the available network bandwidth on grid overlays.
27 G-JavaMPI – System design [Diagram: grid points, each with a Gatekeeper and LS, connected over a WAN and carrying Java-MPI communication; labels (1)-(3) and (1*)-(3*) appear on the diagram.] Migrating: a new process is restarted through a Globus remote job request with delegated user credentials and Java-MPI job credentials. Some legacy messages are redirected during migration. A migration module resides in each JVM.
28 G-JavaMPI – Ongoing and future work The migration mechanism has been implemented. Future work targets process migration policies. Goal: to offset performance pitfalls caused by heterogeneity through dynamic process migration. Sources of heterogeneity in grids: CPU, network, runtime environments, etc. CPU and network heterogeneity causes long “blocking” periods in cooperative processes, which limits system throughput. G-JavaMPI aims to detect and eliminate “blocking” through process migration (e.g., by migrating a “bottleneck” process to a faster node).
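One way to picture the idea of moving a "bottleneck" process is the toy policy below. It is a sketch under simplifying assumptions, not G-JavaMPI's policy: processes run in lockstep, only CPU speed is modelled (network bandwidth is ignored), and all names (`iteration_time`, `propose_migration`, `migration_cost`) are hypothetical.

```python
def iteration_time(work, placement, speed):
    """Lockstep model: each iteration takes as long as the slowest process."""
    return max(work[p] / speed[placement[p]] for p in work)


def find_bottleneck(work, placement, speed):
    """Return the process that the others end up blocking on."""
    return max(work, key=lambda p: work[p] / speed[placement[p]])


def propose_migration(work, placement, speed, idle_nodes,
                      migration_cost, remaining_iterations):
    """Suggest (process, target) if the expected gain exceeds the cost."""
    if not idle_nodes:
        return None
    proc = find_bottleneck(work, placement, speed)
    target = max(idle_nodes, key=lambda n: speed[n])  # fastest idle node
    before = iteration_time(work, placement, speed)
    trial = dict(placement)
    trial[proc] = target
    after = iteration_time(work, trial, speed)
    gain = (before - after) * remaining_iterations
    return (proc, target) if gain > migration_cost else None


work = {"p0": 100, "p1": 100, "p2": 100}        # work units per iteration
speed = {"a": 10, "b": 10, "c": 2, "d": 8}      # work units per second
placement = {"p0": "a", "p1": "b", "p2": "c"}   # p2 sits on the slow node

long_run = propose_migration(work, placement, speed, ["d"],
                             migration_cost=60, remaining_iterations=10)
short_run = propose_migration(work, placement, speed, ["d"],
                              migration_cost=60, remaining_iterations=1)
```

With ten iterations left, moving `p2` to node `d` saves 37.5 seconds per iteration and outweighs the assumed 60-second migration cost; with one iteration left it does not, so no migration is proposed.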
29 G-JavaMPI – Key references L. Chen, C.L. Wang, and F.C.M. Lau, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” Journal of Computer Science and Technology (China), Vol. 18, No. 4, July 2003, pp. 505-514. L. Chen, C.L. Wang, F.C.M. Lau, and R.K.K. Ma, “A Grid Middleware for Distributed Java Computing with MPI Binding and Process Migration Supports,” International Workshop on Grid and Cooperative Computing (GCC-2002), December 26-28, 2002, Hainan, China, pp. 640-652.
30 JESSICA2 : A Java-Enabled Single-System Image Computing Architecture JESSICA2 is a distributed Java Virtual Machine (DJVM) consisting of a group of extended JVMs that run in a distributed environment to support true parallel execution of a multithreaded Java application. Java threads can move freely across node boundaries and execute in parallel, giving more scalable high-performance computing on clusters. The JESSICA2 DJVM provides standard JVM services that are compliant with the Java language specification, as if running on a single machine: a Single System Image (SSI).
31 JESSICA2 Architecture [Diagram: a multithreaded Java program running on a group of JESSICA2 JVMs. Labels: Thread Migration; Global Object Space; Master and Worker JESSICA2 JVMs; JIT Compiler Mode; Portable Java Frame.]
32 JESSICA2 Main Features
Transparent Java thread migration: runtime capturing and restoring of thread execution context; no source code modification, no bytecode instrumentation (preprocessing), no new API introduced; enables dynamic load balancing on clusters.
Full-speed computation: JITEE, a cluster-aware bytecode execution engine; operates in Just-In-Time (JIT) compilation mode; zero cost if no migration.
Transparent remote object access: Global Object Space, a shared global heap spanning all cluster nodes; adaptive migrating-home protocol for memory consistency, plus various optimizing schemes; I/O redirection.
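The general idea behind a migrating-home protocol can be illustrated with a toy model in which an object's home moves to a node that keeps writing to it remotely. This is not the JESSICA2 protocol: the `SharedObject` class, the consecutive-write counter, and the fixed threshold are assumptions chosen only to make the idea concrete.

```python
class SharedObject:
    """Toy shared object whose home node can migrate (hypothetical model)."""

    def __init__(self, home, threshold=3):
        self.home = home
        self.threshold = threshold  # consecutive remote writes before moving
        self.last_writer = None
        self.streak = 0
        self.remote_writes = 0      # writes that had to travel to the home

    def write(self, node):
        if node == self.home:
            # Local write: no network traffic, and any remote streak ends.
            self.last_writer = None
            self.streak = 0
            return
        self.remote_writes += 1
        if node == self.last_writer:
            self.streak += 1
        else:
            self.last_writer = node
            self.streak = 1
        if self.streak >= self.threshold:
            # The same remote node keeps writing: move the home there.
            self.home = node
            self.last_writer = None
            self.streak = 0


obj = SharedObject(home="n0", threshold=3)
for _ in range(3):
    obj.write("n1")   # three remote writes in a row from n1
obj.write("n1")       # now a local write, because the home has moved
```

After the home moves to `n1`, later writes from that node no longer count as remote, which is the traffic reduction that home migration aims for.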
33 Ray Tracing on JESSICA2 (64 PCs) Linux 2.4.18-3 kernel (Red Hat 7.3) 64 nodes: 108 seconds 1 node: 4402 seconds (about 1.2 hours) Speedup = 4402/108 ≈ 40.76
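The speedup can be recomputed from the two timings, and dividing by the node count gives the parallel efficiency, a derived figure that the slide itself does not state. Variable names are illustrative.

```python
t_one_node = 4402   # seconds on 1 node
t_64_nodes = 108    # seconds on 64 nodes
nodes = 64

speedup = t_one_node / t_64_nodes
efficiency = speedup / nodes  # fraction of ideal linear speedup

print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.1%}")
```

This gives a speedup of about 40.76 and an efficiency of roughly 64% of ideal linear scaling on 64 nodes.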
34 JESSICA – Key references W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers,” The 2003 International Conference on Parallel Processing (ICPP-2003), Taiwan, Oct. 6-10, 2003, pp. 465-472. W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support,” IEEE Fourth International Conference on Cluster Computing (CLUSTER 2002), Chicago, USA, September 23-26, 2002, pp. 381-388. M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture,” Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, pp. 1194-1222.
35 LOTS: Large Object Space on Grid [Diagram: several nodes, each a LOTS / OS / H/W stack, joined into a Large Global Object Space across the Grid.] A large software distributed shared memory (DSM) system for the Grid. Provides a global object space larger than the process space (4 GB on a 32-bit CPU). Uses the local hard disk to store objects that have not been used recently. Scope Consistency + Home Migration to reduce redundant data traffic.
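The idea of keeping recently used objects in memory and moving the rest to local disk can be sketched with a least-recently-used store. This is an assumption-laden illustration, not the LOTS design: the `SpillingObjectStore` class, the use of `pickle`, and the object-count capacity are hypothetical, and keys are assumed to be filename-safe strings.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillingObjectStore:
    """Keeps at most `capacity` objects in memory; spills the rest to disk."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity
        self.spill_dir = spill_dir
        self.in_memory = OrderedDict()  # ordered from least to most recent

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def _evict_if_needed(self):
        while len(self.in_memory) > self.capacity:
            # Write the least recently used object to the local disk.
            key, value = self.in_memory.popitem(last=False)
            with open(self._path(key), "wb") as f:
                pickle.dump(value, f)

    def put(self, key, value):
        self.in_memory[key] = value
        self.in_memory.move_to_end(key)
        if os.path.exists(self._path(key)):
            os.remove(self._path(key))  # drop a stale on-disk copy
        self._evict_if_needed()

    def get(self, key):
        if key in self.in_memory:
            self.in_memory.move_to_end(key)
            return self.in_memory[key]
        # Not in memory: load it back from disk (raises if it never existed).
        with open(self._path(key), "rb") as f:
            value = pickle.load(f)
        os.remove(self._path(key))
        self.in_memory[key] = value
        self._evict_if_needed()
        return value


spill_dir = tempfile.mkdtemp()
store = SpillingObjectStore(capacity=2, spill_dir=spill_dir)
store.put("a", [1])
store.put("b", [2])
store.put("c", [3])          # "a" is least recently used and goes to disk
a_on_disk = os.path.exists(os.path.join(spill_dir, "a.pkl"))
value_a = store.get("a")     # "a" returns to memory, "b" is spilled instead
```

The store can hold more objects than fit in its in-memory budget, at the cost of a disk read when a spilled object is accessed again.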
36 Summary
Performance: G-JavaMPI and JESSICA establish extensible grid platforms (good for computation-intensive applications). Process/thread migration enables performance optimization and load balancing. LOTS supports a shared memory programming environment on a large object space (easier to develop data grid applications).
Reliability: G-JavaMPI migrates processes from failed machines. SLIM helps construct platforms for failover.
Convenience: G-JavaMPI, JESSICA, and LOTS enable users to harness distributed resources via traditional means. SLIM and ODGPC simplify Grid point management.
37 Conclusion Grid and utility computing are relatively new paradigms that deserve further investigation. We address the performance, reliability, and user convenience issues in grid/utility computing. Our advanced grid computing platform (consisting of G-JavaMPI, JESSICA2, LOTS, and SLIM/ODGPC) is geared for deployment in the HKGrid for easy adoption of Grid technologies.
39 Reference Hong Kong Grid http://www.hkgrid.org/ Grid Computing Research Portal http://grid.csis.hku.hk/ The HKU Systems Research Group http://www.srg.csis.hku.hk VEGA Project http://vega.ict.ac.cn/ The HK Supercomputing Directory http://www.hkhpc.org/~SuperDir/