AgentTeamwork: Mobile-Agent-Based Middleware for Distributed Job Coordination
Munehiro Fukuda, Computing & Software Systems, University of Washington, Bothell (12/20/2005)

1 AgentTeamwork: Mobile-Agent-Based Middleware for Distributed Job Coordination
  Munehiro Fukuda
  Computing & Software Systems, University of Washington, Bothell
  12/20/2005

2 Outline
  1. Introduction
  2. Execution Model
  3. System Design
  4. Performance Evaluation
  5. Related Work
  6. Conclusions

3 1. Introduction
  - Why Grid Computing
  - Background
  - Objective
  - Project Overview

4 Why Grid Computing
  Textbooks say:
    - Only 30% CPU utilization
    - Only episodic job requirements
    - Anyone and anywhere, like a power grid
  Many research prototypes and commercial products:
    - Globus, Condor, Legion (Avaki), NetSolve, Ninf, Entropia PCGrid, Sun Grid Engine, etc.
  Then, have you ever used them?
    - Probably not so many of you.
  What is a big hurdle?
    - You don't need it anyway. Or, what?

5 Background: Most Grid Systems
  Functional viewpoints:
    - Centralized resource/job management
    - Two drawbacks
        A powerful central server is essential to manage all slave computing nodes.
        Applications are based on the master-slave or parameter-sweep model.
    - Our motivation
        Decentralized job distribution, coordination, and fault tolerance
        Applications based on a variety of communication models
  Practical viewpoints:
    - Systems dedicated to large institutions/companies
    - Two drawbacks
        A lot of installation work is required under the root account.
        Groups of individual computer owners are not targeted.
    - Our motivation
        Easy participation in grid computing and easy installation

6 Background: How to Pursue Our Motivation
  Use of mobile agents
    - We are experts in mobile agents.
  Most mobile agents
    - An execution model previously highlighted as a prospective infrastructure of distributed systems.
    - No more than an alternative approach to centralized grid middleware implementation.
  Our initial goal
    - Decentralized middleware design with mobile agents

7 Objective
  A mobile agent execution platform fitted to grid computing
    - Allowing an agent to identify which MPI rank to handle and which agent to send a job snapshot to.
  A fault-tolerant inter-process communication
    - Recovering lost messages.
    - Allowing over-gateway connections.
  Agent-collaborative algorithms for job coordination
    - Allocating computing nodes in a distributed manner.
    - Implementing decentralized snapshot maintenance and job recovery.

8 Project Overview
  Funded by: NSF Middleware Initiative
  Sponsored by: University of Washington
  In collaboration with: Ehime University
  In a team of: UWB undergraduates

9 2. Execution Model
  - System Overview
  - Execution Layer
  - Programming Interface

10 System Overview
  [Figure. Labels: User A, User B, FTP server, commander agent, resource agent, sentinel agents, bookkeeper agent, snapshots, results. User A's processes and User B's process each run inside a user program wrapper (snapshot methods, GridTcp) and are linked by TCP communication.]

11 Execution Layer
  Layers, in the order listed on the slide:
    - Operating systems
    - UWAgents mobile agent execution platform
    - Commander, resource, sentinel, and bookkeeper agents
    - User program wrapper
    - GridTcp | Java socket
    - mpiJava-A | mpiJava-S
    - mpiJava API
    - Java user applications

12 Programming Interface
    public class MyApplication {
        public GridIpEntry ipEntry[];  // used by the GridTcp socket library
        public int funcId;             // used by the user program wrapper
        public GridTcp tcp;            // the GridTcp error-recoverable socket
        public int nprocess;           // #processors
        public int myRank;             // processor id (or MPI rank)

        public int func_0( String args[] ) {   // constructor
            MPJ.Init( args, ipEntry, tcp );    // invoke mpiJava-A
            .....;                             // more statements to be inserted
            return 1;                          // calls func_1( )
        }

        public int func_1( ) {                 // called from func_0
            if ( MPJ.COMM_WORLD.Rank( ) == 0 )
                MPJ.COMM_WORLD.Send( ... );
            else
                MPJ.COMM_WORLD.Recv( ... );
            .....;                             // more statements to be inserted
            return 2;                          // calls func_2( )
        }

        public int func_2( ) {                 // called from func_1, the last function
            .....;                             // more statements to be inserted
            MPJ.Finalize( );                   // stops mpiJava-A
            return -2;                         // application terminated
        }
    }

13 3. System Design
  Mobile Agents
  Job Coordination
    - Distribution
    - Monitoring
    - Resumption and migration
  Programming Support
    - Language preprocessing
    - Communication check-pointing

14 UWAgents – Concept of Agent Domain
  - An agent domain is created for each submission from the Unix shell.
  - The number of children each agent can spawn is given upon the initial submission.
  - No name server
  - Messages are forwarded through an agent tree.
  - A user job is scheduled as a thread, using suspend/resume.
  [Figure: a user runs UWInject, which submits a new agent from the shell to a UWPlace. Two agent domains are shown: (time = 3:31pm, 8/25/05; ip = perseus.uwb.edu; name = fukuda) and (time = 3:30pm, 8/25/05; ip = medusa.uwb.edu; name = fukuda). One agent tree is labeled -m 4 and holds ids 0 to 12; the other is labeled -m 3 and holds ids 0 to 2. A user job runs inside an agent.]
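
The agent ids drawn on this slide, and on the later job-distribution and monitoring slides, are consistent with a simple numbering rule: with a fan-out of m (the -m option), agent i spawns children i*m through i*m + m - 1, and id 0 skips the slot that would collide with itself. The slides do not state this rule; it is inferred from the figures, and the class and method names below are hypothetical. A minimal Java sketch under that assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the numbering rule is inferred from the figures,
// not taken from the UWAgents source.
final class AgentIds {
    // Ids of the children that agent `id` may spawn when the fan-out is `m`.
    static List<Integer> childIds(int id, int m) {
        List<Integer> ids = new ArrayList<>();
        for (int k = 0; k < m; k++) {
            int child = id * m + k;
            if (child != id) {   // only id 0 would otherwise list itself
                ids.add(child);
            }
        }
        return ids;
    }

    // Parent of agent `id`; the root (id 0) has no parent, reported as -1.
    static int parentId(int id, int m) {
        return id == 0 ? -1 : id / m;
    }
}
```

With m = 4 this reproduces the ids shown in the figures: agent 0 has children 1 to 3, agent 2 has children 8 to 11, and agent 12 starts its children at 48. Because a parent id can be computed from a child id, an agent could forward a message up or down the tree without a name server, which matches the "no name server" bullet.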

15 UWAgents – Over Gateway Migration

16 Job Distribution
  [Figure: agent tree for job distribution. Legend: id = agent id, rank = MPI rank. Arrow labels: job submission, XML query, spawn, snapshot.]
    User
    Commander   id 0
    Resource    id 1 (eXist)
    Sensor      id 4, id 5
    Sentinel    id 2 (rank 0); id 8 to 11 (ranks 1 to 4); id 32 to 34 (ranks 5 to 7)
    Bookkeeper  id 3 (rank 0); id 12 to 15 (ranks 1 to 4); id 48 to 50 (ranks 5 to 7)

17 Resource Allocation
  [Figure. Labels: User, job submission, Commander id 0, Resource id 1, eXist, an XML query, a list of available nodes, spawn, future use.]
  XML query fields: CPU architecture, OS, memory, disk, total nodes, multiplier
  Size of the node list: total nodes x multiplier
  Case 1: total nodes = 2, multiplier = 1.5 (Node 0 to Node 2 shown)
  Case 2: total nodes = 2, multiplier = 3 (Node 0 to Node 5 shown)
  Agents drawn on the nodes: Sentinel id 2 rank 0, Sentinel id 8 rank 1, Bookkeeper id 2 rank 0, Bookkeeper id 12 rank 5
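
The two cases on this slide are plain arithmetic: the size of the requested node list is total nodes times the multiplier. The sketch below only restates that product; the class name is hypothetical, and rounding up for fractional results is an assumption, since both cases on the slide come out as whole numbers.

```java
// Hypothetical helper restating "total nodes x multiplier" from the slide.
final class NodeRequest {
    // Rounding up is an assumption; the slide only shows whole-number results.
    static int nodesToRequest(int totalNodes, double multiplier) {
        return (int) Math.ceil(totalNodes * multiplier);
    }
}
```

Case 1 (2 nodes, multiplier 1.5) yields 3 nodes and Case 2 (2 nodes, multiplier 3) yields 6, which matches the three and six nodes drawn in the figure.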

18 Resource Monitoring
  [Figure. Labels: Commander id 0, Resource id 1, eXist, a resource request, an XML query, a list of available nodes, spawn, Sensor id 4 and id 5, Sensor id 16 to 23, ttcp, performance data.]
  Current restrictions
    - Minimum interval: 3 secs
    - Static distribution of sensor agents
  Future extensions
    - Sensor migration
    - Use of NWS at each site

19 Job Resumption by a Parent Sentinel
  [Figure. Labels: Sentinel id 2 rank 0; Sentinel id 8 rank 1; Sentinel id 9 rank 2; Sentinel id 10 rank 3; Sentinel id 11 rank 4; Bookkeeper id 15 rank 4; new Sentinel id 11 rank 4; MPI connections.]
  (0) Send a new snapshot periodically
  (1) Detect a ping error
  (2) Search for the latest snapshot
  (3) Retrieve the snapshot
  (4) Send a new agent
  (5) Restart a user program

20 Job Resumption by a Child Sentinel
  [Figure. Labels: Commander id 0; Resource id 1; Sentinel id 2 rank 0; Bookkeeper id 3 rank 0; Sentinel id 8 rank 1; Bookkeeper id 12 rank 1; new Sentinel id 2 rank 0; new Commander id 0; new Resource id 1.]
  (1) No pings for 8 * 5 (= 40 sec); no pings for 12 * 5 (= 60 sec)
  (2) Search for the latest snapshot
  (3) Search for the latest snapshot
  (4) Retrieve the snapshot
  (5) Send a new agent
  (6) No pings for 2 * 5 (= 10 sec)
  (7) Search for the latest snapshot
  (8) Search for the latest snapshot
  (9) Retrieve the snapshot
  (10) Send a new agent
  (11) Detect a ping error
  (12) Restart a new resource agent from its beginning
  (13) Detect a ping error and follow the same child resumption procedure as in p9.
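
The waiting times in steps (1) and (6) all follow the pattern "agent id x 5 seconds": 40 seconds for id 8, 60 seconds for id 12, and 10 seconds for id 2. Reading the first factor as the waiting agent's own id is an interpretation of the figure rather than something the slide states, and the names below are hypothetical. Under that assumption, a minimal sketch of the timeout check:

```java
// Hypothetical sketch: the "agent id x 5 sec" reading is inferred from the
// numbers on the slide, not from the AgentTeamwork source.
final class PingTimeout {
    static final int SECONDS_PER_ID = 5;   // factor shown on the slide

    // How long an agent waits without pings before it starts resuming its parent.
    static int timeoutSeconds(int agentId) {
        return agentId * SECONDS_PER_ID;
    }

    // True when the silence since the last ping has reached the agent's timeout.
    static boolean parentPresumedDead(int agentId, long lastPingMillis, long nowMillis) {
        return nowMillis - lastPingMillis >= timeoutSeconds(agentId) * 1000L;
    }
}
```

One consequence of an id-dependent timeout is that siblings do not all react at the same moment: the agent with the smaller id times out first, so it gets a head start on searching for the snapshot and sending a replacement agent.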

21 User Program Wrapper
  Source code:
      statement_1; statement_2; statement_3;
      statement_4; statement_5; statement_6;
      statement_7; statement_8; statement_9;

  Preprocessed:
      func_0( ) { statement_1; statement_2; statement_3; return 1; }
      func_1( ) { statement_4; statement_5; statement_6; return 2; }
      func_2( ) { statement_7; statement_8; statement_9; return -2; }

  User program wrapper:
      int func_id = 0;
      while ( func_id != -2 ) {            // -2: application terminated
          switch ( func_id ) {
              case 0: func_id = func_0( ); break;
              case 1: func_id = func_1( ); break;
              case 2: func_id = func_2( ); break;
          }
          check_point( );
      }

      check_point( ) {
          // save this object, including func_id, into a file
      }
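
To make the dispatch-and-snapshot loop on this slide concrete, here is a small self-contained Java sketch. It is not AgentTeamwork code: the class WrappedJob, its fields, and the maxSteps parameter are hypothetical, the three functions only add to a counter, and the snapshot is kept in a byte array instead of the file mentioned on the slide.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a preprocessed user program.
class WrappedJob implements Serializable {
    private static final long serialVersionUID = 1L;

    int funcId = 0;   // next function to run; saved with every snapshot
    int sum = 0;      // sample user state that must survive a restart

    int func_0() { sum += 1;   return 1; }
    int func_1() { sum += 10;  return 2; }
    int func_2() { sum += 100; return -2; }   // -2 means "application terminated"

    // Save this object, including funcId (in memory here, a file on the slide).
    byte[] checkPoint() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild a job from a snapshot, for example on another node.
    static WrappedJob restore(byte[] snapshot) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (WrappedJob) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("unreadable snapshot", e);
        }
    }

    // Run from funcId until termination or until maxSteps functions have run
    // (maxSteps simulates a crash). Returns the latest snapshot.
    byte[] run(int maxSteps) {
        byte[] snapshot = checkPoint();
        for (int step = 0; funcId != -2 && step < maxSteps; step++) {
            switch (funcId) {
                case 0: funcId = func_0(); break;
                case 1: funcId = func_1(); break;
                case 2: funcId = func_2(); break;
                default: throw new IllegalStateException("unknown funcId " + funcId);
            }
            snapshot = checkPoint();   // one snapshot per completed function
        }
        return snapshot;
    }
}
```

Calling run(2), restoring the returned snapshot with restore, and calling run again on the restored object continues at func_2 rather than starting over, because funcId travels inside the snapshot.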

22 Preprocessor and Drawbacks
  - No recursions
  - Useless source line numbers indicated upon errors
  - Explicit snapshot points are still needed.

  Source code:
      statement_1; statement_2; statement_3;
      check_point( );
      while (…) {
          statement_4;
          if (…) {
              statement_5;
              check_point( );
              statement_6;
          } else
              statement_7;
          statement_8;
      }
      check_point( );

  Preprocessed code:
      int func_0( ) {
          statement_1; statement_2; statement_3;
          return 1;
      }

      int func_1( ) {                      // before check_point( ) in if-clause
          while (…) {
              statement_4;
              if (…) { statement_5; return 2; }
              else statement_7;
              statement_8;
          }
      }

      int func_2( ) {                      // after check_point( ) in if-clause
          statement_6; statement_8;
          while (…) {
              statement_4;
              if (…) { statement_5; return 2; }
              else statement_7;
              statement_8;
          }
      }

23 GridTcp – Check-Pointed Connection
  - Outgoing packets are saved in a backup queue.
  - All packets are serialized in a backup file at every checkpoint.
  - Upon a migration:
      Packets are de-serialized from a backup file.
      Backup packets are restored in the outgoing queue.
      The IP table is updated.
  [Figure: user programs on n1.uwb.edu, n2.uwb.edu, and n3.uwb.edu, connected by TCP. Each user program wrapper holds a rank/ip table plus incoming, outgoing, and backup queues, under snapshot maintenance. The tables shown map rank 1 to n1.uwb.edu and rank 2 to n2.uwb.edu or n3.uwb.edu.]
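
The three migration steps on this slide can be sketched as follows. This is a hypothetical model, not the GridTcp API: packets are plain strings, "transmission" just moves a packet into the backup queue, and the snapshot is a byte array instead of a backup file. The slide also does not say when backup packets are discarded, so this sketch never discards them before a checkpoint.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical model of one endpoint of a check-pointed connection.
class CheckPointedLink implements Serializable {
    private static final long serialVersionUID = 1L;

    final Map<Integer, String> ipTable = new HashMap<>();    // rank -> host name
    final ArrayDeque<String> outgoing = new ArrayDeque<>();  // not yet transmitted
    final ArrayDeque<String> backup = new ArrayDeque<>();    // transmitted, kept for replay

    void enqueue(String packet) { outgoing.addLast(packet); }

    // Stand-in for a TCP write: hand out the next packet and keep a backup copy.
    String transmitNext() {
        String packet = outgoing.pollFirst();
        if (packet != null) backup.addLast(packet);
        return packet;
    }

    // Serialize the table and both queues.
    byte[] checkPoint() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // After a migration: de-serialize, replay the backup, update the IP table.
    static CheckPointedLink resume(byte[] snapshot, int movedRank, String newHost) {
        CheckPointedLink link;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            link = (CheckPointedLink) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("unreadable snapshot", e);
        }
        // Put backup packets back in front of the outgoing queue, oldest first.
        Iterator<String> newestFirst = link.backup.descendingIterator();
        while (newestFirst.hasNext()) link.outgoing.addFirst(newestFirst.next());
        link.backup.clear();
        link.ipTable.put(movedRank, newHost);
        return link;
    }
}
```

Replaying from the front of the outgoing queue keeps the original send order, so a receiver that already saw some of the packets would need to drop duplicates; how GridTcp handles that is not described on the slide.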

24 GridTcp – Over-Gateway Connection
  Hosts: mnode0 (rank 0), medusa.uwb.edu (rank 1), uw1-320.uwb.edu (rank 2), uw1-320-00 (rank 3)
  Each user program wrapper holds a routing table (tables in slide order):

  Table 1
    rank  dest        gateway
    0     mnode0      -
    1     medusa      -
    2     uw1-320     medusa
    3     uw1-320-00  medusa

  Table 2
    rank  dest        gateway
    0     mnode0      -
    1     medusa      -
    2     uw1-320     -
    3     uw1-320-00  uw1-320

  Table 3
    rank  dest        gateway
    0     mnode0      medusa
    1     -
    2     uw1-320     -
    3     uw1-320-00  -

  Table 4
    rank  dest        gateway
    0     mnode0      uw1-320
    1     medusa      uw1-320
    2     -
    3     uw1-320-00  -

  - RIP-like connection
  - Restriction: each node name must be unique.
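
A hop-by-hop lookup over these tables can be sketched as below. The sketch assumes that the four tables belong to mnode0, medusa, uw1-320, and uw1-320-00 in that order, and that "-" in the gateway column means the destination is directly reachable; neither assumption is spelled out on the slide. The class and method names are hypothetical and are not part of GridTcp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical routing sketch built from the gateway columns on the slide.
final class GatewayRoutes {
    // node -> (destination -> gateway); no entry means "directly reachable"
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    void addGateway(String node, String dest, String gateway) {
        tables.computeIfAbsent(node, k -> new HashMap<>()).put(dest, gateway);
    }

    // Next node to hand a packet to when `node` sends toward `dest`.
    String nextHop(String node, String dest) {
        Map<String, String> table = tables.get(node);
        String gateway = (table == null) ? null : table.get(dest);
        return (gateway == null) ? dest : gateway;
    }

    // Full path from `from` to `to`, following one table lookup per hop.
    List<String> path(String from, String to) {
        List<String> hops = new ArrayList<>();
        hops.add(from);
        String current = from;
        while (!current.equals(to)) {
            current = nextHop(current, to);
            if (hops.contains(current)) {
                throw new IllegalStateException("routing loop at " + current);
            }
            hops.add(current);
        }
        return hops;
    }

    // Gateway entries as read from the four tables (table-to-node mapping assumed).
    static GatewayRoutes slideExample() {
        GatewayRoutes r = new GatewayRoutes();
        r.addGateway("mnode0", "uw1-320", "medusa");
        r.addGateway("mnode0", "uw1-320-00", "medusa");
        r.addGateway("medusa", "uw1-320-00", "uw1-320");
        r.addGateway("uw1-320", "mnode0", "medusa");
        r.addGateway("uw1-320-00", "mnode0", "uw1-320");
        r.addGateway("uw1-320-00", "medusa", "uw1-320");
        return r;
    }
}
```

Under these assumptions, a packet from mnode0 to uw1-320-00 passes through medusa and then uw1-320. Because the lookup is keyed by node name alone, two nodes with the same name would be indistinguishable, which is one way to read the slide's uniqueness restriction.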

25 MPJ Package
  Class         Contents
  MPJ           Init( ), Rank( ), Size( ), and Finalize( )
  Communicator  All communication functions: Send( ), Recv( ), Gather( ), Reduce( ), etc.
  JavaComm      mpiJava-S: uses Java sockets and server sockets.
  GridComm      mpiJava-A: uses GridTcp sockets.
  DataType      MPJ.INT, MPJ.LONG, etc.
  MPJMessage    getStatus( ), getMessage( ), etc.
  Op            Operate( )
  etc.          Other utilities

  - InputStream for each rank
  - OutputStream for each rank
  - Use a permanent 64K buffer for serialization
  - Emulate collective communication by sending the same data to each OutputStream, which deteriorates performance
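
The last bullet describes collective communication emulated with point-to-point writes. A minimal sketch of that pattern, assuming one OutputStream per peer rank as the slide says; the class and method names are hypothetical and do not come from the MPJ package:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical illustration of an emulated broadcast.
final class NaiveBroadcast {
    // Writes the same payload to every peer stream, one after another,
    // and returns the total number of bytes written.
    static long broadcast(byte[] payload, List<? extends OutputStream> peers) {
        long written = 0;
        try {
            for (OutputStream peer : peers) {
                peer.write(payload);
                peer.flush();
                written += payload.length;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

The sender pushes one full copy of the payload per peer, sequentially, so the cost at the root grows linearly with the number of ranks. That is the performance penalty the slide points out.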

26 MPI Job Execution
  [Figure. Labels: UWPlace (UWAgent execution platform); sentinel agent with main thread, SendSnapshot thread, TCPError thread, and ReceiveMsg thread; user program wrapper with a rank/ip table (rank 1 = n1.uwb.edu, rank 2 = n2.uwb.edu) and outgoing, backup, and incoming queues; TCP; MPI connection; snapshot sent to the bookkeeper agent; snapshot sent to the resumed sentinel agent; restart message (a new rank/ip pair); n3.uwb.edu.]

27 4. Performance Evaluation
  Evaluation environment:
    - An 8-node Myrinet-2000 cluster: 2.8GHz Pentium4-Xeon w/ 512MB
    - A 24-node Giga-Ethernet cluster: 3.4GHz Pentium4-Xeon w/ 512MB
  Computation Granularity
  Java Grande MPJ Benchmark
  Process Resumption Overhead

28 MPJ.Send and Recv Performance

29 Computational Granularity 1

30 Computational Granularity 2

31 Computational Granularity 3

32 Performance Evaluation - Series

33 Performance Evaluation - RayTracer

34 Performance Evaluation – MolDyn

35 Overhead of Job Resumption

36 5. Related Work
  From the viewpoints of:
    - System Architecture
    - Mobile Agents
    - Fault Tolerance

37 System Architecture
  Systems                            Architectural basis
  Globus                             A toolkit
  Condor                             Process migration
  Ninf, NetSolve                     RPC
  Legion (Avaki)                     OO
  Catalina, J-SEAL2, AgentTeamwork   Mobile agents

  Difference from Catalina/J-SEAL2
    - They are not fully implemented.
    - They are based on a master-slave model.

38 Mobile Agents
  IBM Aglets
    Naming: AgletFinder traces all agents
    Cascading termination: Needs to retract one by one
    Job scheduling: Schedules jobs with Baglets.
    Security: Java byte-code verification
  Voyager
    Naming: RPC-based system-unique agent IDs
    Cascading termination: Needs to be implemented at a user level
    Job scheduling: Launches an independent user process.
    Security: CORBA security service
  D'Agent
    Naming: Unpredictable agent IDs
    Cascading termination: Needs to be implemented at a user level
    Job scheduling: Launches an independent user process.
    Security: A currency-based model
  Ara (Obsolete)
    Naming: Unpredictable agent IDs
    Cascading termination: Calls ara_kill to kill all agents
    Job scheduling: Launches an independent user process.
    Security: An allowance model
  UWAgent
    Naming: Agent domain
    Cascading termination: Waits for all descendants' termination
    Job scheduling: Schedules jobs with Java thread functions.
    Security: Agent-to-agent security w/ agent domain

39 Fault Tolerance
  Systems         Libraries   Data recovery                          Communication recovery
  Legion (Avaki)  FT-MPI      Variables passed to MPI_FT_save( )     N/A
  Condor          MW Library  All master data                        Master-worker communication
  Dome            Dome_env    Objects declared as dXXX               N/A
  AgentTeamwork   GridTcp     All serializable class data            All in-transit messages

40 6. Conclusions
  - Project Summary
  - Next Two Years

41 Project Summary
  Our focus
    - A decentralized job execution and fault-tolerant environment
    - Applications not restricted to the master-slave or parameter-sweeping model.
  Applications
    - 40,000 doubles x 10,000 floating-point operations
    - Moderate data transfer combined with massive/collective communication
    - At least three times larger than its computational granularity
  Current status
    - UWAgent: completed
    - Agent behavioral design: basic job deployment/resumption implemented
    - User program wrapper: completed except security feature
    - GridTcp/mpiJava: in testing
    - Preprocessor: in design

42 Next Two Years
  Application support
    - Preprocessor implementation
    - Efficient input/output file transfer
    - Security enhancement in remote execution
    - GUI improvement
  Agent algorithms
    - Over-gateway application deployment
    - Dynamic resource monitoring
    - Priority-based agent migration
  Performance evaluation
  Dissemination

43 Questions?

