Infrastructure Provision for Users at CamGrid
Mark Calleja, Cambridge eScience Centre (www.escience.cam.ac.uk)

Background: CamGrid
Based around the Condor middleware from the University of Wisconsin.
Consists of eleven groups, 13 pools, ~1,000 processors, “all” Linux.
CamGrid uses a set of RFC 1918 (“CUDN-only”) IP addresses, so each machine needs to be given an (extra) address in this space.
Each group sets up and runs its own pool(s), and flocks to/from other pools: a decentralised, federated model.
Strengths:
– No single point of failure
– Sysadmin tasks shared out
Weaknesses:
– Debugging can be complicated, especially networking issues
– No overall administrative control/body

Actually, CamGrid currently has 13 pools.

Participating departments/groups
– Cambridge eScience Centre
– Dept. of Earth Science (2)
– High Energy Physics
– School of Biological Sciences
– National Institute for Environmental eScience (2)
– Chemical Informatics
– Semiconductors
– Astrophysics
– Dept. of Oncology
– Dept. of Materials Science and Metallurgy
– Biological and Soft Systems

How does a user monitor job progress?
“Easy” for a standard universe job (as long as you can get to the submit node), but what about other universes, e.g. vanilla and parallel?
A shared file system goes a long way, but isn’t always feasible, e.g. across CamGrid’s multiple administrative domains.
Both approaches also require direct access to the submit host, which may not always be desirable. Furthermore, users like web/browser access.
Our solution: put an extra daemon on each execute node to serve requests from a web-server front end.

CamGrid’s vanilla-universe file viewer
– Sessions use cookies; authentication is via HTTPS
– Raw HTTP transfer (no SOAP)
– master_listener does resource discovery
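As a toy illustration of the idea only (CamGrid’s actual daemon, its protocol and master_listener are not shown here), a per-node process that serves the current tail of a job’s output file over HTTP could be as simple as the loop below. It assumes GNU/traditional netcat, and the file name and port are made up:

# Toy sketch: repeatedly accept one HTTP connection and return the last
# lines of the job's output file. No authentication, sessions or resource
# discovery; the real CamGrid daemon is considerably more involved.
while true; do
    { printf 'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n'
      tail -n 100 job.out; } | nc -l -p 8649 -q 1
done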

Process Checkpointing
Condor’s process checkpointing via the standard universe saves the entire state of a process into a checkpoint file: memory, CPU, I/O, etc.
Checkpoints are saved on the submit host unless a dedicated checkpoint server is nominated.
The process can then be restarted from where it left off.
Typically no changes to the job’s source code are needed; however, the job must be relinked with Condor’s standard universe support library.
Limitations: no forking, kernel threads, or some forms of IPC.
Not all OS/compiler combinations are supported (none for Windows), and support is getting harder.
The VM universe is meant to be the successor, but users don’t seem too keen.
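As a concrete example of that relinking step (not taken from the slides; the program and file names are hypothetical), a C code would be built with condor_compile and submitted to the standard universe roughly as follows:

# Relink the application against Condor's standard universe support library
condor_compile gcc -o my_program my_program.c

# Submit description for the relinked binary
Universe = standard
Executable = my_program
Output = std.out
Error = std.error
Log = std.log
Queue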

Checkpointing (linux) vanilla universe jobs
Many/most applications can’t link with Condor’s checkpointing libraries. To checkpoint arbitrary code we need:
1) An API that checkpoints running jobs.
2) A user-space file system to save the images.
For 1) we use the BLCR kernel modules: unlike Condor’s user-space libraries these run with root privilege, so there are fewer limitations on the codes one can use.
For 2) we use Parrot, which came out of the Condor project. It is used on CamGrid in its own right, but combined with BLCR it allows any code to be checkpointed.
I’ve provided a bash implementation, blcr_wrapper.sh, to accomplish this (it uses the chirp protocol with Parrot).
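For orientation (these commands are not on the slide, and the program and file names are illustrative), BLCR’s user-level tools are used along these lines:

cr_run ./my_application A B &      # run the code with BLCR's library preloaded
cr_checkpoint -f image.blcr $!     # dump the running process to an image file
cr_restart image.blcr              # later, resume the process from that image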

Checkpointing linux jobs using BLCR kernel modules and Parrot
1. Start a chirp server to receive the checkpoint images.
2. Condor job starts: blcr_wrapper.sh uses three processes (Parrot I/O, Job, Parent).
3. Start by checking for an image from a previous run.
4. Start the job.
5. The parent sleeps, waking periodically to checkpoint the job and save the images.
6. Job ends: tell the parent to clean up.
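A heavily simplified bash sketch of steps 4 and 5 is given below. This is not the actual blcr_wrapper.sh: it omits restarting from a previous image (cr_restart) and shipping the images through Parrot/chirp, and the checkpoint interval and file names are invented.

#!/bin/bash
# Simplified sketch only; see the caveats above for what is omitted.
IMAGE=checkpoint.blcr
INTERVAL=3600                          # seconds between checkpoints (illustrative)

cr_run "$@" &                          # step 4: start the application under BLCR
JOB_PID=$!

# Step 5: the parent sleeps, waking periodically to checkpoint the job
while sleep "$INTERVAL" && kill -0 "$JOB_PID" 2>/dev/null; do
    cr_checkpoint -f "$IMAGE" "$JOB_PID"
done

wait "$JOB_PID"                        # step 6: job has ended; clean up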

Example of submit script
The application is “my_application”, which takes arguments “A” and “B” and needs files “X” and “Y”. There’s a chirp server at woolly--escience.grid.private.cam.ac.uk:9096.

Universe = vanilla
Executable = blcr_wrapper.sh
arguments = woolly--escience.grid.private.cam.ac.uk $$([GlobalJobId]) \
            my_application A B
transfer_input_files = parrot, my_application, X, Y
transfer_files = ALWAYS
Requirements = OpSys == "LINUX" && Arch == "X86_64" && HAS_BLCR == TRUE
Output = test.out
Log = test.log
Error = test.error
Queue
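Assuming the above is saved as, say, my_job.sub (a hypothetical file name), the job is then handled with the usual Condor tools:

condor_submit my_job.sub      # queue the job
condor_q                      # check its status in the local queue
condor_rm <cluster>.<proc>    # remove it if necessary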

GPUs, CUDA and CamGrid
An increasing number of users are showing interest in general-purpose GPU programming, especially using NVIDIA’s CUDA.
Users report speed-ups ranging from a few-fold to more than 100x, depending on the code being ported.
Recently we’ve put a GeForce 9600 GT on CamGrid for testing. Only single precision, but for £90 we got 64 cores and 0.5 GB of memory.
Access via Condor is not ideal, but OK. Wisconsin are aware of the situation and are in a requirements-capture process for GPUs and multi-core architectures in general.
Newer cards (Tesla, GTX 260/280) have double precision.
GPUs will only be applicable to a subset of the applications currently seen on CamGrid, but we predict a bright future; the stumbling block is the learning curve for developers.
We’ve had positive feedback from NVIDIA when applying for support from their Professor Partnership Program ($25k awards).
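By way of illustration only (the attribute name is hypothetical, the exact configuration knob depends on the Condor version, and this is not necessarily how CamGrid advertises its card), a GPU machine can be exposed through Condor with a custom ClassAd attribute, in the same spirit as the HAS_BLCR attribute used in the submit script above:

# In the execute node's local Condor configuration
HAS_CUDA = True
STARTD_ATTRS = $(STARTD_ATTRS), HAS_CUDA

# In the submit description of a CUDA job
Requirements = OpSys == "LINUX" && Arch == "X86_64" && HAS_CUDA == TRUE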

Links
CamGrid:
Condor:
Questions?