
1 Dr. David Wallom Experience of Setting up and Running a Production Grid on a University Campus July 2004

2 Outline
The Centre for e-Research Bristol & its place in national efforts
The University of Bristol Grid
Available tool choices
Support models for a distributed system
Problems encountered
Summary

3 Centre for e-Research Bristol
Established as a Centre of Excellence in visualisation. Currently has one full-time member of staff with several shared resources. Intended to lead the University e-Research effort, including as many departments and non-traditional computational users as possible.

4 NGS (www.ngs.ac.uk)
UK National Grid Service: 'free' dedicated resources accessible only through Grid interfaces, i.e. GSI-SSH and the Globus Toolkit.
Compute clusters (York & Oxford)
–64 dual-CPU Intel 3.06 GHz nodes, 2GB RAM
–Gigabit & Myrinet networking
Data clusters (Manchester & RAL)
–20 dual-CPU Intel 3.06 GHz nodes, 4GB RAM
–Gigabit & Myrinet networking
–18TB fibre SAN
Also national HPC resources: HPC(x), CSAR
Affiliates: Bristol, Cardiff, …
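
Access to these resources is via Grid credentials rather than passwords. As a hedged illustration only (the host name below is a placeholder, not a real NGS node), a session might look like:

    grid-proxy-init                      # create a short-lived proxy from the user's X.509 certificate
    gsissh head-node.ngs.example.ac.uk   # log in to an NGS head node using GSI authentication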

5 The University of Bristol Grid
Established as a way of leveraging extra use from existing resources. Planned to consist of ~400 CPUs, from 1.2 to 3.2 GHz, arranged in 6 clusters; currently about 100 CPUs in 3 clusters. Initially running legacy operating systems, though all are now moving to Red Hat Enterprise Linux 3. Based in and maintained by several different departments.

6 The University of Bristol Grid
Decided to construct a campus grid to gain experience with middleware & system management before formally joining the NGS.
Central services all run on Viglen servers:
–Resource Broker
–Monitoring and Discovery Service & systems monitoring
–Virtual Organisation Management
–Storage Resource Broker Vault
–MyProxy server
The choice of software to provide these was led by personal experience & other UK efforts to standardise.
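
For illustration, the MyProxy server lets a user park a medium-term credential centrally so that short-lived proxies can be retrieved later, e.g. by a portal or the resource broker. A sketch using the standard MyProxy client tools (server name and username are assumptions):

    myproxy-init -s myproxy.example.bris.ac.uk                       # delegate a credential to the MyProxy server
    myproxy-get-delegation -s myproxy.example.bris.ac.uk -l jbloggs  # later, retrieve a short-lived proxy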

7 The University of Bristol Grid, 2
Based in and maintained by several different departments, each system with a different system manager! Different operating systems, initially just Linux & Windows, though others will come. Linux versions were initially legacy releases, though all are now moving to Red Hat Enterprise Linux.

8 The System Layout

9 System Installation Model
Draw it on the board!

10 Middleware
Virtual Data Toolkit:
–Chosen for stability and support structure.
–Widely used in other European production grid systems.
Contains the standard Globus Toolkit version 2.4 with several enhancements.

11 Resource Brokering
Uses the Condor-G job distribution mechanism, with a custom script to determine resource priority. Integrates the Condor job submission system with the Globus Monitoring and Discovery Service.

12 How Condor-G Interacts with Globus
Condor-G uses the following Globus protocols:
GSI –The Globus Toolkit's Grid Security Infrastructure (GSI) provides essential building blocks for other Grid protocols and for Condor-G. This authentication and authorization system makes it possible to authenticate a user just once, using public key infrastructure (PKI).
GRAM –The Grid Resource Allocation and Management (GRAM) protocol supports remote submission of a computational request (for example, to run program P) to a remote computational resource. GRAM is the Globus protocol that Condor-G uses to talk to remote Globus jobmanagers.
GASS –The Globus Toolkit's Global Access to Secondary Storage (GASS) service provides mechanisms for transferring data to and from a remote HTTP, FTP, or GASS server. Condor-G uses GASS to transfer the executable, stdin, stdout, and stderr between the machine where a job is submitted and the remote resource.
RSL –The Resource Specification Language (RSL) is the language GRAM accepts to specify job information.
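
As a small illustration of RSL, the same kind of job description that GRAM accepts can also be submitted directly with the GT2 globusrun command (the gatekeeper contact string is an assumption):

    globusrun -o -r gatekeeper.example.bris.ac.uk/jobmanager-pbs \
      '&(executable=/bin/hostname)(count=1)'     # run /bin/hostname on the remote cluster, stream output back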

13 Accessing the Grid with Condor-G
Condor-G allows the user to treat the Grid as a local resource, and the same command-line tools perform basic job management, such as:
–Submitting a job, indicating an executable, input and output files, and arguments
–Querying a job's status
–Cancelling a job
–Being informed when events happen, such as normal job termination or errors
–Obtaining access to detailed logs that provide a complete history of a job
Condor-G extends basic Condor functionality to the grid, providing resource management while retaining fault tolerance and exactly-once execution semantics.

14 How to submit a job to the system
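
This slide presumably walked through a live submission; a minimal sketch of what a Condor-G submission might look like (the gatekeeper contact string and file names are assumptions):

    # job.submit – a minimal Condor-G submit description file
    universe        = globus
    globusscheduler = gatekeeper.example.bris.ac.uk/jobmanager-pbs
    executable      = my_simulation
    arguments       = -input data.in
    output          = job.out
    error           = job.err
    log             = job.log
    queue

    condor_submit job.submit   # submit the job
    condor_q                   # query its status
    condor_rm <job id>         # cancel it if necessary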

15 Limitations of Condor-G
Submitting jobs to run under Globus has not yet been perfected. Known limitations:
–No checkpointing.
–No job exit codes: exit codes are not available after a job completes.
–Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX. HP-UX support will hopefully be available later.

16 Resource Broker Operation

17 Load Management
Load information covers only the raw numbers of jobs running, idle & with problems; there is little measure of the relative performance of nodes within the grid. Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.

18 Provision of a Shared Filesystem
Providing a grid makes it beneficial to provide a shared file system. The newest machines come with a minimum of 80GB hard drives, of which only a small part is needed for the OS & user scratch space. The system will have a 1TB Storage Resource Broker Vault as one of the core services.
–Take this one step further by partitioning the system drives on the core servers.
–Create a virtual disk of ~400GB using the spare space on them all!
–Install the SRB client on all machines so that they can directly access the shared storage.
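
For illustration, the SRB client "Scommands" let any machine on the grid read and write the shared vault directly (the collection paths below are assumptions):

    Sinit                                              # authenticate to the SRB server using ~/.srb settings
    Sput results.dat /UoBGrid/home/user/results.dat    # copy a local file into the shared collection
    Sls  /UoBGrid/home/user                            # list the collection
    Sget /UoBGrid/home/user/results.dat                # fetch a file back from shared storage
    Sexit                                              # end the session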

19 Automation of Processes for Maintenance
Installation
Grid state monitoring
System maintenance
User control
Grid testing

20 Individual System Installation
Simple shell scripts for overall control:
–Ensure middleware, monitoring and user software are all installed in a consistent place.
–Ensure ease of system upgrades.
–Ensure system managers have a chance to review the installation method beforehand.
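
A minimal sketch of the kind of wrapper script implied here; the prefix, bundle name and log location are assumptions, not the actual Bristol scripts:

    #!/bin/sh
    # install middleware, monitoring and user software into one agreed location
    set -e
    GRID_HOME=/opt/grid                              # consistent install prefix on every cluster
    mkdir -p "$GRID_HOME"
    tar xzf /tmp/vdt-bundle.tar.gz -C "$GRID_HOME"   # unpack the pre-staged middleware bundle
    date >> "$GRID_HOME/install.log"                 # record the installation for later upgrades
    echo "installed vdt-bundle.tar.gz" >> "$GRID_HOME/install.log"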

21 Overall System Status and Status of the Grid

22 Ensuring System Availability
Uses the Big Brother™ system:
–Monitoring occurs through a server-client model.
–The server maintains limit settings and pings the listed resources.
–Clients record system information and report it to the server over a secure port.
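
As a rough illustration only (hostnames, addresses and test tags are assumptions), the Big Brother server's inventory of monitored machines is a simple text file listing each resource and the tests to run against it:

    # bb-hosts – illustrative entries for the central grid servers
    10.0.0.10  rb.grid.bris.ac.uk    # conn ssh http
    10.0.0.11  mds.grid.bris.ac.uk   # conn ssh
    10.0.0.12  srb.grid.bris.ac.uk   # conn ssh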

23 Big Brother™ Monitoring

24 Grid Middleware Testing
Uses the Grid Interface Test Script (GITS) developed for the ETF. It tests the following:
–Globus gatekeeper running and available.
–Globus job submission system.
–Presence of the machine within the Monitoring & Discovery Service.
–Ability to retrieve and distribute files through GridFTP.
Run within the UoB grid every 3 hours, with the latest results available on the service webpage. The only downside is that it needs to run as a standard user rather than a system account.
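
The three-hourly schedule itself can be as simple as a cron entry under the test user's account; the GITS path below is a placeholder, and the manual equivalents shown are standard Globus Toolkit 2 commands (hostnames are assumptions):

    0 */3 * * * /opt/grid/tests/run-gits.sh >> $HOME/gits.log 2>&1   # illustrative crontab entry

    globusrun -a -r gatekeeper.clusterA.bris.ac.uk            # gatekeeper up and authenticating
    globus-job-run gatekeeper.clusterA.bris.ac.uk /bin/date   # job submission works
    globus-url-copy file:///tmp/test.dat \
      gsiftp://gatekeeper.clusterA.bris.ac.uk/tmp/test.dat    # GridFTP transfer works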

25 Grid Middleware Testing

26 What is currently running and how do I find out?

27 Authorisation and Authentication on the University of Bristol Grid
Makes use of the standard UK e-Science Certification Authority; Bristol is an authorised Registration Authority for this CA. Uses X.509 certificates and proxies for user AAA. May be replaced at a later date, dependent on how well the current system scales.
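
In day-to-day use this means each user creates a short-lived proxy from their UK e-Science certificate before touching the grid, using the standard Globus commands:

    grid-cert-info -subject   # show the certificate's Distinguished Name
    grid-proxy-init           # create a time-limited proxy credential (12-hour default)
    grid-proxy-info           # check the remaining proxy lifetime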

28 User Management
Globus uses a mapping from the Distinguished Name (DN) defined in a user's digital certificate to local usernames on each resource, located in controlled disk space. It is important that, for each resource a user expects to use, their DN is mapped locally. Distributing this mapping is the role of Virtual Organisation Management.
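
The mapping lives in the Globus grid-mapfile, conventionally /etc/grid-security/grid-mapfile, with one quoted DN and local account per line; the entry below is illustrative:

    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=jane bloggs" jbloggs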

29 Virtual Organisation Management and Resource Usage Monitoring/Accounting

30 Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2
The server (previous slide) runs as a grid service using the ICENI framework, with clients located on the machines that form part of the Virtual Organisation. The current drawback is that this service must run using a personal certificate instead of the machine certificate that would be ideal; a fix is coming in new versions from OMII.

31 Locally Supporting a Distributed System
Within the university, the first point of contact is always the Information Services helpdesk:
–Given a preset list of questions to ask and log files to check, where available.
–Not expected to do any actual debugging.
–Passes problems on to the Grid experts, who then hand them, on a system-by-system basis, to the relevant maintenance staff.
As one of the UK e-Science Centres, we also have access to the Grid Operations and Support Centre.

32 Supporting a Distributed System
Having a well-defined system simplifies the support model. We are trying to define a Service Level Description from each department to the UoBGrid, as well as an overall UoBGrid Service Level Agreement to users:
–Defines hardware support levels and availability.
–Defines, at a basic level, the software support that will also be available.

33 Problems Encountered
Some of the middleware we have been trying to use has not been as reliable as we would have hoped:
–MDS is a prime example, where the need for reliability has shaped our usage model.
–More software than we would like still has to run as a user with an individual DN. This must change for a production system.
Getting time and effort from some already overworked system managers has been tricky, with sociological barriers:
–"Won't letting other people use my system just mean I will have less available for me?"

34 Notes to think about!
Choose your test application carefully. Choose your first test users even more carefully! One user with a bad experience outweighs ten with good experiences.
–Grid has been heavily over-hyped, so people expect it all to work first time, every time!

35 Future Directions within Bristol
Make sure the rest of the University's clusters are installed and running on the UoBGrid as quickly as possible. Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible; this will give ~800 CPUs. Start accepting users from outside the University as part of our commitment to the National Grid Service. Run the Bristol systems as part of the WUNGrid.

36 Further Information
Centre for e-Research Bristol: http://escience.bristol.ac.uk
Email: david.wallom@bristol.ac.uk
Telephone: +44 (0)117 928 8769

