Research Issues in Cooperative Computing Douglas Thain


Sharing is Hard!
Despite decades of research in distributed systems and operating systems, sharing computing resources is still very difficult. Problems get worse as scale increases:
– Office
– Server Room
– Distributed System
– Computational Grid

Designers Go To Extremes
– Peer to Peer
– Central Control
– Cooperative Computing

How Do We Share Data?
– Central Storage Archive (NFS, UDC, StorageTank)
– P2P File Sharing (WWW, Napster)

Things I Can’t Do Today
– Let members of my project team store and retrieve documents from this disk in my office. (Where my boss defines “project team.”)
– I must have 1 TB of space for one whole week, but it must be stored by someone I know. (Where I give a list of trusted people.)
– Allow a visitor in my office to use my machine. (But I want her workspace isolated from mine.)
– This bioinformatics repository can be written by my grad students, read by all ND faculty, and read by anyone approved by the NSF. (Where each list comes from a different source.)

What is Cooperative Computing?
CC means putting owners in charge.
– I control who uses my resources.
– Need tools for expressing trust.
CC means respect for social structures.
– Trust is rarely symmetric.
– Hierarchy and centralization can be important.
– Motivation is usually external to the system.
CC means ease of use.
– Resource owners need simple and effective tools.
– Resource users need to be insulated from failures.

Every User Should be a Super-User
[Diagram: today, allocation, accounting, quality of service, security, and debugging are controls reserved for the super-user; every user should get those same controls, along with control over their own consumption.]

Vision of Cooperative Storage
Make it easy to deploy systems that:
– Allow sharing of storage space.
– Respect existing human structures.
– Provide reasonable space/performance promises.
– Work easily and transparently without root.
– Make the non-ideal properties manageable:
  – Limited allocation (select, renew, migrate)
  – Unreliable networks (useful fallback modes)
  – Changing configuration (automatic discovery/configuration)
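The “limited allocation” point can be made concrete with a small sketch of a storage lease that expires unless renewed. The class name, lease length, and renewal policy here are illustrative, not part of any real system:

```python
import time

class Allocation:
    """A storage lease that must be renewed before it expires."""

    def __init__(self, size_gb, lifetime_s):
        self.size_gb = size_gb
        self.lifetime_s = lifetime_s
        self.expires = time.time() + lifetime_s

    def renew(self):
        # Extend the lease for another lifetime; a real server
        # could refuse renewal when space has become scarce.
        self.expires = time.time() + self.lifetime_s

    def valid(self):
        return time.time() < self.expires

# Reserve 100 GB for 24 hours, then renew it before it lapses.
alloc = Allocation(size_gb=100, lifetime_s=24 * 3600)
alloc.renew()
```

The lease model is what makes eviction and migration tractable: a server never promises space forever, only until the next renewal deadline.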

[Diagram: a client asks the storage catalog “Where can I find 100 GB for 24 hours?”, then makes a reservation and accesses data on a storage server built over a basic filesystem. The server’s resource policy (“Members of the CSE dept can borrow 200 GB for one week”) is enforced by asking an access control server “Is this a member of the CSE dept?”. Storage servers send status updates to the catalog, which answers “Who is here?”, and the owner retains the power to evict a user.]
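The first step of the exchange in this figure is a catalog lookup: find servers currently advertising enough free space for the requested lease. The catalog record fields and function name below are hypothetical:

```python
def find_space(catalog, size_gb, hours):
    """Return servers advertising enough free space for the requested lease."""
    return [s for s in catalog
            if s["free_gb"] >= size_gb and s["max_lease_h"] >= hours]

# A toy catalog built from the status updates servers send in.
catalog = [
    {"host": "disk01.cse.nd.edu", "free_gb": 250, "max_lease_h": 168},
    {"host": "disk02.cse.nd.edu", "free_gb": 40,  "max_lease_h": 168},
]

# "Where can I find 100 GB for 24 hours?"
candidates = find_space(catalog, size_gb=100, hours=24)
```

Only after picking a candidate does the client contact the server itself, which applies its owner’s policy before granting the reservation.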

Cooperative Storage Pool
[Diagram: six disk storage servers pooled together, serving as the shared substrate for a distributed file system, a backup system, and distributed computation.]
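One simple way a pool like this can decide which server holds a given file is hash-based placement. This is only an illustrative sketch, not the placement scheme of any particular system:

```python
import hashlib

# A pool of six cooperating storage servers (hypothetical hostnames).
pool = ["disk%02d.cse.nd.edu" % i for i in range(6)]

def place(path, pool):
    """Pick a server for a file by hashing its path (illustrative placement)."""
    h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
    return pool[h % len(pool)]

server = place("/experiment/run42.dat", pool)
```

Deterministic placement means any client can locate a file without a central lookup, though a real pool would also need to handle servers joining and leaving.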

Cooperative Computing is useful in the office… but it is badly needed on the Grid!

On the Grid
[Diagram: a work queue dispatches jobs through gatekeepers to three CPU clusters, one running the PBS batch system, one the Condor batch system, and one the Maui scheduler.]

Grid Computing Experience
Ian Foster et al. (102 authors), “The Grid2003 Production Grid: Principles and Practice,” IEEE HPDC 2004.
“The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by… ATLAS, CMS, SDSS, LIGO…”

Grid Computing Experience
The good news:
– 27 sites with 2800 CPUs.
– 40985 CPU-days provided over 6 months.
– 10 applications with 1300 simultaneous jobs.
The bad news:
– 40-70 percent utilization.
– 30 percent of jobs would fail.
– 90 percent of failures were local problems.
The lessons:
– Most site failures were due to disk space.
– Debugging most problems was impossible.

Coop Computing and the Grid
The Grid is a boundary case of CC:
– Large scale, high performance.
– Resources are allocated to partially trusted visitors.
– Everyone wants to exhaust resources.
Can CC scale from the office to the grid? If it is easy for one person to deploy in an office, then it will be usable enough to work on the grid.

More Cooperative Computing
– Nested Principals and Authentication. Simple question: how do we allow a visitor?
– Distributed Access Control. Can we find something more usable than PKI?
– Storage Abstractions. Can we do better than files and directories?
– Data-Intensive Grid Computing. How do I use storage and CPU together?
– Distributed Debugging. Consider it a distributed query problem.

Cooperative Computing Credo: Make computer structures model social structures... Not the other way around!

For more information… The Cooperative Computing Lab Prof. Douglas Thain

Two Related Problems
Users don’t have direct control:
– I need 50 GB of storage for one week.
– Allow my collaborators to use my space.
(Usually considered administrative tasks.)
Users don’t have direct information:
– Why was I denied this allocation?
– What series of steps was used to run my job?
(Usually considered implementation details.)

The Current Situation
[Diagram: five storage servers, each backed by a local filesystem and guarded by a simple ACL over hostname, Kerberos, and GSI identities. Clients reach them through libchirp: the chirp tool issues GET/PUT, applications call open/close/read/write, and parrot lets unmodified programs (% cp, % emacs, % vi) use the same interface. Servers send status updates to a catalog server.]
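The “simple ACL” in this picture pairs an identity mechanism (hostname, Kerberos, GSI) with a rights string per subject pattern. The entry format and rights letters below are illustrative, not the exact on-disk syntax of any real server:

```python
import fnmatch

# Each entry: (subject type, subject pattern, rights granted).
# Rights letters here: r=read, w=write, l=list, d=delete, a=admin.
acl = [
    ("hostname", "*.cse.nd.edu",       "rwl"),
    ("kerberos", "dthain@ND.EDU",      "rwlda"),
    ("gsi",      "/O=NotreDame/CN=*",  "rl"),
]

def allowed(acl, subj_type, subj_name, right):
    """True if any matching ACL entry grants the requested right."""
    return any(t == subj_type
               and fnmatch.fnmatch(subj_name, pattern)
               and right in rights
               for t, pattern, rights in acl)

# A host in the CSE domain may read; a GSI visitor may not write.
allowed(acl, "hostname", "client.cse.nd.edu", "r")   # True
allowed(acl, "gsi", "/O=NotreDame/CN=visitor", "w")  # False
```

The point of keeping the ACL this simple is that a resource owner can read and edit it directly, without an administrator mediating.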

Distributed Debugging
[Diagram: a job passes through a workload manager, a batch system, an auth gateway, storage servers, a Kerberos server, a license manager, a CPU, and an archival host; each component keeps its own log file, and the debugger must gather and correlate all of them.]

Distributed Debugging
Big challenges!
– Language issues: storing and combining logs.
– Ordering: how do we reassemble events?
– Completeness: gaps, losses, detail.
– Systems: distributed data collection.
But it could be a big win:
– “A crashes whenever X gets its credentials from Y.”
– “Please try again: I have turned up the detail on host B.”
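The ordering challenge above is, at its simplest, a k-way merge of per-host logs on timestamps. This sketch assumes each host’s log is already time-sorted and ignores clock skew between hosts, which is the genuinely hard part in practice:

```python
import heapq

# Per-host logs as (timestamp, message) pairs, each sorted locally.
host_a = [(1.0, "A: open connection to B"), (3.0, "A: crash")]
host_b = [(2.0, "B: issue credentials to A")]

# heapq.merge lazily combines pre-sorted iterables into one
# timestamp-ordered stream: a single timeline of the failure.
timeline = list(heapq.merge(host_a, host_b))
```

With one merged timeline, a debugger can at least ask ordered queries such as “what happened on any host just before A crashed?”.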

Grid Computing
– The Vision: make large-scale computing resources as reliable and as simple as the electric power grid or the water utility.
– The Reality: tie together existing computing clusters and archival storage around the country into systems that are (almost) usable by experts.

Storage Allocation
– Give me 50 GB for 24 hours.
– Technical problem: building allocation.
Distributed Debugging
– Correlation
– Hypothesis Proposal
– Reasoning
– System Building
– Adaptation

[Diagram: workstation owners negotiating directly with one another: “If I can backup to you, you can backup to me.” “CSE grads can compute here, but only when I’m not.” “I need ten more CPUs in order to finish my paper by Friday!” “May I use your CPUs?” “My friends in Italy need to access this data.” “I’m not root!” “PBs of workstation storage! Can I use this as a cache?” An auth server answers “Is this person a CSE grad?”, and I/O between sites is secured.]

Cooperative Computing Credo
Put users in charge of their resources:
– Share resources as they see fit.
– Expose information for debugging.
Mode of operation:
– Make tools that are foolproof enough for casual use by one or two people in the office.
– If they really are foolproof, then they will also be suitable for deployment in large-scale systems such as computational grids.