
AgentTeamwork: Mobile-Agent-Based Middleware for Distributed Job Coordination
Munehiro Fukuda
Computing & Software Systems, University of Washington, Bothell
12/20/2005
Funded by: NSF Middleware Initiative

Outline
1. Introduction
2. Execution Model
3. System Design
4. Performance Evaluation
5. Related Work
6. Conclusions

1. Introduction
- Why Grid Computing
- Background
- Objective
- Project Overview

Why Grid Computing
- Textbooks say:
  - Only 30% CPU utilization
  - Only episodic job requirements
  - Anyone and anywhere, like a power grid
- Many research prototypes and commercial products:
  - Globus, Condor, Legion (Avaki), NetSolve, Ninf, Entropia PCGrid, Sun Grid Engine, etc.
- Then, have you ever used them?
  - Probably not many of you.
- What is the big hurdle?
  - You don't need it anyway. Or, what?

Background: Most Grid Systems
- Functional viewpoints:
  - Centralized resource/job management
  - Two drawbacks:
    - A powerful central server is essential to manage all slave computing nodes.
    - Applications are based on the master-slave or parameter-sweep model.
  - Our motivation:
    - Decentralized job distribution, coordination, and fault tolerance
    - Applications based on a variety of communication models
- Practical viewpoints:
  - Systems dedicated to large institutions/companies
  - Two drawbacks:
    - A lot of installation work is required under the root account.
    - Groups of individual computer owners are not targeted.
  - Our motivation:
    - Easy participation in grid computing and easy installation

Background: How to Pursue Our Motivation
- Use of mobile agents
  - We are experts in mobile agents.
- Most mobile agents
  - An execution model previously highlighted as a prospective infrastructure of distributed systems.
  - No more than an alternative approach to centralized grid middleware implementation.
- Our initial goal
  - Decentralized middleware design with mobile agents

Objective
- A mobile agent execution platform fitted to grid computing
  - Allowing an agent to identify which MPI rank to handle and which agent to send a job snapshot to.
- Fault-tolerant inter-process communication
  - Recovering lost messages.
  - Allowing over-gateway connections.
- Agent-collaborative algorithms for job coordination
  - Allocating computing nodes in a distributed manner.
  - Implementing decentralized snapshot maintenance and job recovery.

Project Overview
- Funded by: NSF Middleware Initiative
- Sponsored by: University of Washington
- In collaboration with: Ehime University
- In a team of: UWB undergraduates

2. Execution Model
- System Overview
- Execution Layer
- Programming Interface

System Overview
[Figure: system diagram. Labels: User A, User B, FTP server, commander agent, resource agents, sentinel agents, bookkeeper agent, user processes (each a user program wrapper with snapshot methods and GridTcp), TCP communication, snapshots, results.]

Execution Layer
Layers, from lowest to highest:
- Operating systems
- UWAgents mobile agent execution platform
- Commander, resource, sentinel, and bookkeeper agents
- User program wrapper
- GridTcp | Java socket
- mpiJava-A | mpiJava-S
- mpiJava API
- Java user applications

Programming Interface

    public class MyApplication {
        public GridIpEntry ipEntry[];   // used by the GridTcp socket library
        public int funcId;              // used by the user program wrapper
        public GridTcp tcp;             // the GridTcp error-recoverable socket
        public int nprocess;            // #processors
        public int myRank;              // processor id (or MPI rank)

        public int func_0( String args[] ) {   // constructor
            MPJ.Init( args, ipEntry, tcp );    // invoke mpiJava-A
            .....;                             // more statements to be inserted
            return 1;                          // calls func_1( )
        }

        public int func_1( ) {                 // called from func_0
            if ( MPJ.COMM_WORLD.Rank( ) == 0 )
                MPJ.COMM_WORLD.Send( ... );
            else
                MPJ.COMM_WORLD.Recv( ... );
            .....;                             // more statements to be inserted
            return 2;                          // calls func_2( )
        }

        public int func_2( ) {                 // called from func_1, the last function
            .....;                             // more statements to be inserted
            MPJ.finalize( );                   // stops mpiJava-A
            return -2;                         // application terminated
        }
    }

3. System Design
- Mobile Agents
- Job Coordination
  - Distribution
  - Monitoring
  - Resumption and migration
- Programming Support
  - Language preprocessing
  - Communication check-pointing

UWAgents – Concept of Agent Domain
- An agent domain is created for each submission from the Unix shell.
- The number of children each agent can spawn is given upon the initial submission.
- No name server
- Messages are forwarded through an agent tree.
- A user job is scheduled as a thread, using suspend/resume.
[Figure: two agent domains, each started with UWInject (submits a new agent from the shell) and running on UWPlace. Domain (time = 3:31pm, 8/25/05, ip = perseus.uwb.edu, name = fukuda), submitted with -m 4: id 0; ids 1-3; ids 4-7; ids 8-11; id 12. Domain (time = 3:30pm, 8/25/05, ip = medusa.uwb.edu, name = fukuda), submitted with -m 3: id 0; ids 1-2. Other labels: User, a user job.]
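The id numbers in the figures hint at how a tree can work without a name server. The sketch below assumes, as an inference from those numbers and not something the slides state, that agent i's children receive ids i*m through i*m+m-1 (the root, id 0, gets only ids 1 through m-1), where m is the value given with -m. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the id scheme is inferred from the slide figures
// (-m 4: ids 1-3 under id 0, ids 4-7 under id 1, ids 8-11 under id 2).
final class AgentIdTree {
    // Ids of the children that agent `id` may spawn when every agent
    // is allowed at most `m` children.
    static List<Integer> childrenOf(int id, int m) {
        List<Integer> children = new ArrayList<>();
        for (int k = 0; k < m; k++) {
            int child = id * m + k;
            if (child != 0) {          // id 0 is the root itself
                children.add(child);
            }
        }
        return children;
    }

    // Parent id, or -1 for the root agent.
    static int parentOf(int id, int m) {
        return id == 0 ? -1 : id / m;
    }

    // Agents a message passes through when it is forwarded along tree edges.
    static List<Integer> route(int from, int to, int m) {
        List<Integer> up = new ArrayList<>();    // from `from` up to the common ancestor
        List<Integer> down = new ArrayList<>();  // from below the ancestor down to `to`
        int a = from;
        int b = to;
        while (a != b) {
            if (a > b) {               // the larger id is never shallower in this scheme
                up.add(a);
                a = parentOf(a, m);
            } else {
                down.add(0, b);
                b = parentOf(b, m);
            }
        }
        up.add(a);
        up.addAll(down);
        return up;
    }

    public static void main(String[] args) {
        System.out.println(childrenOf(2, 4));    // [8, 9, 10, 11]
        System.out.println(route(9, 12, 4));     // [9, 2, 0, 3, 12]
    }
}
```

Under this assumption any agent can compute its parent and children from its own id alone, which is one way messages could be forwarded through the tree without a directory lookup.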

[Figure: UWAgents – Over-Gateway Migration]

Job Distribution
[Figure: agent tree spawned for a job. Legend: id = agent id, rank = MPI rank. Edge labels: job submission, XML query, spawn, snapshot.]
- User submits a job to Commander id 0
- Resource id 1 (XML query to eXist); Sensors id 4 and id 5
- Sentinel id 2 (rank 0); Sentinels id 8, 9, 10, 11 (ranks 1-4); Sentinels id 32, 33, 34 (ranks 5-7)
- Bookkeeper id 3 (rank 0); Bookkeepers id 12, 13, 14, 15 (ranks 1-4); Bookkeepers id 48, 49, 50 (ranks 5-7)

Resource Allocation
- The user submits a job to Commander id 0.
- Resource id 1 sends an XML query to eXist with: CPU architecture, OS, memory, disk, total nodes, multiplier.
- The result is a list of available nodes (total nodes x multiplier), onto which sentinels and bookkeepers are spawned.
- Case 1: total nodes = 2, multiplier = 1.5 (Nodes 0-2)
- Case 2: total nodes = 2, multiplier = 3 (Nodes 0-5)
[Figure labels: Sentinel id 2 rank 0, Sentinel id 8 rank 1, Bookkeeper id 2 rank 0, Bookkeeper id 12 rank 5, Future use.]
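The two cases on this slide can be checked with a line of arithmetic. The sketch below is illustrative only: the method name is hypothetical, and rounding up a fractional product is an assumption, since both cases on the slide give whole numbers.

```java
// Hypothetical helper for the "total nodes x multiplier" rule on the slide.
final class NodeRequest {
    // Number of nodes to ask the resource agent for.
    // Rounding up is an assumption; the slide shows only exact products.
    static int nodesToRequest(int totalNodes, double multiplier) {
        return (int) Math.ceil(totalNodes * multiplier);
    }

    public static void main(String[] args) {
        System.out.println(nodesToRequest(2, 1.5));  // Case 1: 3 nodes (Nodes 0-2)
        System.out.println(nodesToRequest(2, 3.0));  // Case 2: 6 nodes (Nodes 0-5)
    }
}
```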

Resource Monitoring
- Current restrictions
  - Minimum interval: 3 secs
  - Static distribution of sensor agents
- Future extensions
  - Sensor migration
  - Use of NWS at each site
[Figure: Commander id 0 sends a resource request to Resource id 1, which sends an XML query to eXist and returns a list of available nodes. Resource id 1 spawns Sensors id 4 and id 5; Sensors id 16-19 and id 20-23 appear below them. Sensors measure with ttcp and report performance data.]

Job Resumption by a Parent Sentinel
(0) Send a new snapshot periodically
(1) Detect a ping error
(2) Search for the latest snapshot
(3) Retrieve the snapshot
(4) Send a new agent
(5) Restart a user program
[Figure labels: Sentinel id 2 rank 0; Sentinels id 8, 9, 10, 11 (ranks 1-4); Bookkeeper id 15 rank 4; MPI connections; new Sentinel id 11 rank 4.]

Job Resumption by a Child Sentinel
(1) No pings for 8 * 5 (= 40 sec); no pings for 12 * 5 (= 60 sec)
(2) Search for the latest snapshot
(3) Search for the latest snapshot
(4) Retrieve the snapshot
(5) Send a new agent
(6) No pings for 2 * 5 (= 10 sec)
(7) Search for the latest snapshot
(8) Search for the latest snapshot
(9) Retrieve the snapshot
(10) Send a new agent
(11) Detect a ping error
(12) Restart a new resource agent from its beginning
(13) Detect a ping error and follow the same child resumption procedure as in "Job Resumption by a Parent Sentinel"
[Figure labels: Commander id 0; Resource id 1; Sentinel id 2 rank 0; Bookkeeper id 3 rank 0; Sentinel id 8 rank 1; Bookkeeper id 12 rank 1; new Sentinel id 2 rank 0; new Commander id 0; new Resource id 1.]
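The three timeouts on this slide (8 * 5, 12 * 5, and 2 * 5 seconds) are consistent with each agent waiting for its own id times 5 seconds. The sketch below assumes that rule, which is inferred from the examples and not stated as a general formula; the names are hypothetical.

```java
// Hypothetical helper; the "agent id x 5 seconds" rule is inferred from
// the three examples on the slide.
final class PingTimeout {
    static final int SECONDS_PER_ID = 5;

    // Seconds without pings after which the agent with this id reacts.
    static int timeoutSeconds(int agentId) {
        return agentId * SECONDS_PER_ID;
    }

    // Among the given agents, the id whose timeout expires first.
    static int firstToReact(int... agentIds) {
        int first = agentIds[0];
        for (int id : agentIds) {
            if (timeoutSeconds(id) < timeoutSeconds(first)) {
                first = id;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(timeoutSeconds(8));    // 40
        System.out.println(timeoutSeconds(12));   // 60
        System.out.println(firstToReact(12, 8));  // 8
    }
}
```

If the rule holds, the timeouts are staggered by id, so agents that miss the same pings do not all start a resumption at the same moment.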

User Program Wrapper

Source code:

    statement_1; statement_2; statement_3;
    check_point( );
    statement_4; statement_5; statement_6;
    check_point( );
    statement_7; statement_8; statement_9;

Preprocessed:

    func_0( ) { statement_1; statement_2; statement_3; return 1; }
    func_1( ) { statement_4; statement_5; statement_6; return 2; }
    func_2( ) { statement_7; statement_8; statement_9; return -2; }

User program wrapper:

    int func_id = 0;
    while ( func_id != -2 ) {
        switch ( func_id ) {
            case 0: func_id = func_0( ); break;
            case 1: func_id = func_1( ); break;
            case 2: func_id = func_2( ); break;
        }
        check_point( );
    }

    check_point( ) {
        // save this object, including func_id, into a file
    }
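The dispatch loop on this slide can be tried out with plain Java serialization. This is a minimal sketch under stated assumptions, not the AgentTeamwork wrapper itself: DemoJob, WrapperSketch, and their methods are hypothetical, the snapshot is kept in a byte array instead of a file, and a "crash" is imitated by stopping after a fixed number of functions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a preprocessed user program.
class DemoJob implements Serializable {
    private static final long serialVersionUID = 1L;
    int funcId = 0;   // next function to run; -2 means terminated
    int sum = 0;      // example application state

    int func_0() { sum += 1;   return 1; }
    int func_1() { sum += 10;  return 2; }
    int func_2() { sum += 100; return -2; }
}

final class WrapperSketch {
    // Serializes the whole job object, including funcId.
    static byte[] checkPoint(DemoJob job) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(job);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    static DemoJob restore(byte[] snapshot) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (DemoJob) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs at most maxSteps functions and returns the latest snapshot.
    static byte[] run(DemoJob job, int maxSteps) {
        byte[] snapshot = checkPoint(job);
        for (int step = 0; step < maxSteps && job.funcId != -2; step++) {
            switch (job.funcId) {
                case 0: job.funcId = job.func_0(); break;
                case 1: job.funcId = job.func_1(); break;
                case 2: job.funcId = job.func_2(); break;
                default: throw new IllegalStateException("unknown funcId " + job.funcId);
            }
            snapshot = checkPoint(job);   // snapshot after every function
        }
        return snapshot;
    }

    public static void main(String[] args) {
        byte[] snapshot = run(new DemoJob(), 2);   // stop after func_1
        DemoJob resumed = restore(snapshot);       // resume elsewhere
        run(resumed, Integer.MAX_VALUE);           // continues with func_2
        System.out.println(resumed.funcId + " " + resumed.sum);  // -2 111
    }
}
```

Because funcId is part of the saved object, the resumed copy re-enters the loop at func_2 and does not repeat func_0 or func_1.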

Preprocessor and Drawbacks
- No recursion
- Useless source line numbers are reported upon errors
- Explicit snapshot points are still needed

Source code:

    statement_1; statement_2; statement_3;
    check_point( );
    while (…) {
        statement_4;
        if (…) {
            statement_5;
            check_point( );
            statement_6;
        }
        else
            statement_7;
        statement_8;
    }
    check_point( );

Preprocessed code:

    int func_0( ) {
        statement_1; statement_2; statement_3;
        return 1;
    }

    int func_1( ) {    // before check_point( ) in if-clause
        while (…) {
            statement_4;
            if (…) { statement_5; return 2; }
            else statement_7;
            statement_8;
        }
    }

    int func_2( ) {    // after check_point( ) in if-clause
        statement_6; statement_8;
        while (…) {
            statement_4;
            if (…) { statement_5; return 2; }
            else statement_7;
            statement_8;
        }
    }

GridTcp – Check-Pointed Connection
- Outgoing packets are saved in a backup queue.
- All packets are serialized into a backup file at every check-pointing.
- Upon a migration:
  - Packets are de-serialized from a backup file.
  - Backup packets are restored in the outgoing queue.
  - The IP table is updated.
[Figure: user programs on n1.uwb.edu, n2.uwb.edu, and n3.uwb.edu, each inside a user program wrapper with incoming, outgoing, and backup queues over TCP, plus snapshot maintenance. Rank/ip tables shown: (1: n1.uwb.edu, 2: n2.uwb.edu) and (1: n1.uwb.edu, 2: n3.uwb.edu).]
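The queue handling listed on this slide can be sketched as follows. This is an illustration under assumptions, not GridTcp's code: the class and method names are hypothetical, packets are plain strings, the backup "file" is a byte array, and the sketch does not model acknowledgements or when backup packets may be discarded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.HashMap;

// Hypothetical model of one process's send-side state.
final class CheckPointedQueues implements Serializable {
    private static final long serialVersionUID = 1L;
    final ArrayDeque<String> outgoing = new ArrayDeque<>();
    final ArrayDeque<String> backup = new ArrayDeque<>();
    final HashMap<Integer, String> ipTable = new HashMap<>();  // rank -> host

    void enqueue(String packet) {
        outgoing.add(packet);
    }

    // Hands the next packet to the network and keeps a copy in the backup queue.
    String transmit() {
        String packet = outgoing.poll();
        if (packet != null) {
            backup.add(packet);
        }
        return packet;
    }

    // Serializes both queues and the IP table (the "backup file").
    byte[] checkPoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // After a migration: de-serialize, put backup packets back at the head of
    // the outgoing queue, and record the new host of the rank that moved.
    static CheckPointedQueues resume(byte[] snapshot, int movedRank, String newHost) {
        CheckPointedQueues q;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            q = (CheckPointedQueues) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        ArrayDeque<String> resend = new ArrayDeque<>(q.backup);
        resend.addAll(q.outgoing);
        q.outgoing.clear();
        q.outgoing.addAll(resend);
        q.backup.clear();
        q.ipTable.put(movedRank, newHost);
        return q;
    }

    public static void main(String[] args) {
        CheckPointedQueues q = new CheckPointedQueues();
        q.ipTable.put(1, "n1.uwb.edu");
        q.ipTable.put(2, "n2.uwb.edu");
        q.enqueue("p1");
        q.enqueue("p2");
        q.transmit();                                  // p1 sent, copy kept in backup
        CheckPointedQueues r = resume(q.checkPoint(), 2, "n3.uwb.edu");
        System.out.println(r.outgoing + " " + r.ipTable.get(2));
    }
}
```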

GridTcp – Over-Gateway Connection
- RIP-like connection
- Restriction: each node name must be unique.
[Figure: four nodes, mnode0 (rank 0), medusa.uwb.edu (rank 1), uw1-320.uwb.edu (rank 2), and uw (rank 3), each running a user program inside a user program wrapper with its own rank/dest/gateway table. First table as shown:]

    rank  dest     gateway
    0     mnode0   -
    1     medusa   -
    2     uw1-320  medusa
    3     uw       medusa
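A rank/dest/gateway table like the first one in the figure can be read as a next-hop lookup: send directly when the gateway column is "-", otherwise send to the gateway. The sketch below shows only that lookup; the class is hypothetical, and rank 3 is left out because its node name is cut off in the transcript.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one node's rank/dest/gateway table.
final class GatewayTable {
    private static final class Route {
        final String dest;
        final String gateway;   // null stands for "-" (directly reachable)

        Route(String dest, String gateway) {
            this.dest = dest;
            this.gateway = gateway;
        }
    }

    private final Map<Integer, Route> routes = new HashMap<>();

    void add(int rank, String dest, String gateway) {
        routes.put(rank, new Route(dest, gateway));
    }

    // Host to which a message for `rank` is handed next.
    String nextHop(int rank) {
        Route route = routes.get(rank);
        if (route == null) {
            throw new IllegalArgumentException("no route to rank " + rank);
        }
        return route.gateway != null ? route.gateway : route.dest;
    }

    public static void main(String[] args) {
        GatewayTable table = new GatewayTable();
        table.add(0, "mnode0", null);
        table.add(1, "medusa", null);
        table.add(2, "uw1-320", "medusa");
        System.out.println(table.nextHop(1));   // medusa (direct)
        System.out.println(table.nextHop(2));   // medusa (gateway for uw1-320)
    }
}
```

A lookup keyed by node name is also why the slide's restriction matters: two nodes with the same name could not be told apart in such a table.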

MPJ Package
- MPJ: Init( ), Rank( ), Size( ), and Finalize( )
- Communicator: all communication functions: Send( ), Recv( ), Gather( ), Reduce( ), etc.
  - JavaComm: mpiJava-S, uses Java sockets and server sockets.
  - GridComm: mpiJava-A, uses GridTcp sockets.
- DataType: MPJ.INT, MPJ.LONG, etc.
- MPJMessage: getStatus( ), getMessage( ), etc.
- Op: Operate( )
- etc.: other utilities
Implementation notes:
- InputStream for each rank
- OutputStream for each rank
- Use a permanent 64K buffer for serialization
- Emulate collective communication by sending the same data to each OutputStream, which deteriorates performance
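The last note, emulating a collective call by writing the same data to each rank's OutputStream, can be illustrated with standard Java streams. This is a sketch of the idea only: the class and method are hypothetical, in-memory streams stand in for sockets, and the permanent 64K buffer is not modeled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical broadcast built from one point-to-point write per rank.
final class BroadcastEmulation {
    // Writes the same serialized payload to every rank's stream in turn
    // and returns the total number of bytes written.
    static int broadcast(double[] data, OutputStream[] perRank) {
        ByteBuffer buffer = ByteBuffer.allocate(data.length * Double.BYTES);
        for (double value : data) {
            buffer.putDouble(value);
        }
        byte[] bytes = buffer.array();
        int total = 0;
        try {
            for (OutputStream out : perRank) {
                out.write(bytes);
                out.flush();
                total += bytes.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        OutputStream[] streams = {
            new ByteArrayOutputStream(),
            new ByteArrayOutputStream(),
            new ByteArrayOutputStream()
        };
        // 4 doubles = 32 bytes, written 3 times
        System.out.println(broadcast(new double[] {1.0, 2.0, 3.0, 4.0}, streams));  // 96
    }
}
```

The sender's traffic grows linearly with the number of ranks, which matches the slide's remark that this emulation deteriorates performance.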

[Figure: MPI job execution on UWPlace (UWAgent execution platform). Labels: sentinel agent with a main thread, SendSnapshot thread, TCPError thread, and ReceiveMsg thread; user program wrapper with the user program, a rank/ip table (1: n1.uwb.edu, 2: n2.uwb.edu), outgoing, backup, and incoming queues, and TCP; MPI connection; snapshots; bookkeeper agent; resumed sentinel agent; restart message (a new rank/ip pair); n3.uwb.edu.]

4. Performance Evaluation
- Evaluation environment:
  - An 8-node Myrinet-2000 cluster: 2.8GHz Pentium4-Xeon w/ 512MB
  - A 24-node Giga-Ethernet cluster: 3.4GHz Pentium4-Xeon w/ 512MB
- Computation Granularity
- Java Grande MPJ Benchmark
- Process Resumption Overhead

[Figure: MPJ.Send and Recv Performance]

[Figure: Computational Granularity 1]

[Figure: Computational Granularity 2]

[Figure: Computational Granularity 3]

[Figure: Performance Evaluation - Series]

[Figure: Performance Evaluation - RayTracer]

[Figure: Performance Evaluation – MolDyn]

[Figure: Overhead of Job Resumption]

5. Related Work
From the viewpoints of:
- System Architecture
- Mobile Agents
- Fault Tolerance

System Architecture

    Systems                            Architectural basis
    Globus                             A toolkit
    Condor                             Process migration
    Ninf, NetSolve                     RPC
    Legion (Avaki)                     OO
    Catalina, J-SEAL2, AgentTeamwork   Mobile agents

Difference from Catalina/J-SEAL2:
- They are not fully implemented.
- They are based on a master-slave model.

Mobile Agents
Compared by: naming, cascading termination, job scheduling, security.
- IBM Aglets
  - Naming: AgletFinder traces all agents
  - Cascading termination: needs to retract one by one
  - Job scheduling: schedules jobs with Baglets
  - Security: Java byte-code verification
- Voyager
  - Naming: RPC-based system-unique agent IDs
  - Cascading termination: needs to be implemented at a user level
  - Job scheduling: launches an independent user process
  - Security: CORBA security service
- D'Agent
  - Naming: unpredictable agent IDs
  - Cascading termination: needs to be implemented at a user level
  - Job scheduling: launches an independent user process
  - Security: a currency-based model
- Ara (obsolete)
  - Naming: unpredictable agent IDs
  - Cascading termination: calls ara_kill to kill all agents
  - Job scheduling: launches an independent user process
  - Security: an allowance model
- UWAgent
  - Naming: agent domain
  - Cascading termination: waits for all descendants' termination
  - Job scheduling: schedules jobs with Java thread functions
  - Security: agent-to-agent security w/ agent domain

Fault Tolerance

    Systems          Libraries    Data recovery                         Communication recovery
    Legion (Avaki)   FT-MPI       Variables passed to MPI_FT_save( )    N/A
    Condor           MW Library   All master data                       Master-worker communication
    Dome             Dome_env     Objects declared as dXXX              N/A
    AgentTeamwork    GridTcp      All serializable class data           All in-transit messages

6. Conclusions
- Project Summary
- Next Two Years

Project Summary
- Our focus
  - A decentralized job execution and fault-tolerant environment
  - Applications not restricted to the master-slave or parameter-sweeping model
- Applications
  - 40,000 doubles x 10,000 floating-point operations
  - Moderate data transfer combined with massive/collective communication
  - At least three times larger than its computational granularity
- Current status
  - UWAgent: completed
  - Agent behavioral design: basic job deployment/resumption implemented
  - User program wrapper: completed except security feature
  - GridTcp/mpiJava: in testing
  - Preprocessor: in design

Next Two Years
- Application support
  - Preprocessor implementation
  - Efficient input/output file transfer
  - Security enhancement in remote execution
  - GUI improvement
- Agent algorithms
  - Over-gateway application deployment
  - Dynamic resource monitoring
  - Priority-based agent migration
- Performance evaluation
- Dissemination

Questions?