Research Issues in Cooperative Computing Douglas Thain



Sharing is Hard!
Despite decades of research in distributed systems and operating systems, sharing computing resources is still technically and socially difficult!
Most existing systems for sharing require:
– Kernel-level software.
– A privileged login.
– Centralized trust.
– Loss of control over resources that you own.

Cooperative Computing Credo
Let’s create tools and systems that make it easy for users to cooperate (or be selfish) as they see fit.
Modus operandi:
– Make tools that are foolproof enough for casual use by one or two people in the office.
– If they really are foolproof, then they will also be suitable for deployment in large-scale systems such as computational grids.

[Diagram: resources and the requests their owners and users make]
– disk: “If I can back up to you, you can back up to me.”
– CPU: “CSE grads can compute here, but only when I’m not.”
– CPU: “I need ten more CPUs in order to finish my paper by Friday!”
– CPU: “May I use your CPUs?”
– auth server: “Is this person a CSE grad?”
– secure I/O: “My friends in Italy need to access this data.”
– disk: “I’m not root!” “PBs of workstation storage! Can I use this as a cache?”

Storage is a Funny Resource
Rebuttal: “Storage is large and practically free!”
– TB -> PB is *not* free to install or manage.
– But, it comes almost accidentally with CPUs.
– Aggressive replication (caching) can fill it quickly.
Storage has unusual properties:
– Locality: Space needs to be near computation.
– Non-locality: Redundant copies must be separated.
– Transfer is very expensive compared to reservation, i.e., don’t waste an hour transferring unless it will succeed!
– Managing storage is different than managing data.
All of this gets worse on the grid.

On the Grid
Quick intro to grid computing:
– The vision: Let us make large-scale computing resources as reliable and as accessible as the electric power grid or the public water utility.
– The audience: Scientists with grand challenge problems that require unlimited amounts of computing power. More computation == Better results.
– The reality today: Tie together computing clusters and archival storage around the country into systems that are (almost) usable by experts.

On the Grid
[Diagram labels: a job in a work queue; three gatekeepers; CPUs behind a PBS batch system; CPUs behind a Condor batch system; SMP CPUs behind the Maui Scheduler]

Grid Computing Experience
Ian Foster, et al. (102 authors), “The Grid2003 Production Grid: Principles and Practice,” IEEE HPDC 2004.
“The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by… ATLAS, CMS, SDSS, LIGO…”

Grid Computing Experience
The good news:
– 27 sites with 2800 CPUs
– 40985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
The bad news:
– 40-70 percent utilization
– 30 percent of jobs would fail
– 90 percent of failures were site problems
– Most site failures were due to disk space.
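
Two of the "bad news" figures can be combined to see how much of the total workload was lost to site problems. This is a rough sketch only: it assumes the 30 percent and 90 percent figures were measured over the same set of jobs and can simply be multiplied, which the slide does not state.

```python
# Figures taken from the slide. Multiplying them is an assumption.
job_failure_rate = 0.30        # "30 percent of jobs would fail"
site_share_of_failures = 0.90  # "90 percent of failures were site problems"

# Fraction of all submitted jobs that failed because of a site problem.
site_failure_rate = job_failure_rate * site_share_of_failures

print(f"{site_failure_rate:.0%} of all jobs failed because of site problems")
```

Under that assumption, roughly one job in four was lost to a site problem rather than to the application itself.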

Storage Matters
All of these environments: Office – Server Room – Grid Computing
require storage to be an allocable, shareable, accountable resource.
We need new tools to accomplish this.

What are the Common Problems?
– Local Autonomy
– Resource Heterogeneity
– Complex Access Control
– Multiple Users
– Competition for Resources
– Low Reliability
– Complex Debugging

Vision of Cooperative Storage
Make it easy to deploy systems that:
– Allow sharing of storage space.
– Respect existing human structures.
– Provide reasonable space/perf promises.
– Work easily and transparently without root.
– Make the non-ideal properties manageable:
    Limited allocation. (select, renew, migrate)
    Unreliable networks. (useful fallback modes)
    Changing configuration. (auto. discovery/config)

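[Diagram: components of a cooperative storage system and the messages between them]
– storage server (on a basic filesystem), with a Resource Policy: “Members of the CSE dept can borrow 200 GB for one week.”
– storage catalog, fed by status updates: “Where can I find 100 GB for 24 hours?”
– access control server: “Is this a member of the CSE dept?”
– Client: “Make reservation and access data.”
– Other messages: “Evict user!”, “Who is here?”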
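
The exchange in the diagram (find space through a catalog, then reserve it subject to the owner's policy) can be sketched in a few lines. This is only an illustration: the names `Policy`, `StorageServer`, and `find_space` are hypothetical, and group membership is passed in as a plain set rather than being fetched from an access control server.

```python
from dataclasses import dataclass

GB = 10**9
HOUR = 3600
WEEK = 7 * 24 * HOUR

@dataclass
class Policy:
    group: str        # who the rule applies to
    max_bytes: int    # largest reservation allowed
    max_seconds: int  # longest reservation allowed

@dataclass
class StorageServer:
    name: str
    free_bytes: int
    policy: Policy

    def reserve(self, user_groups, nbytes, seconds):
        """Grant a reservation only if the policy and free space allow it."""
        p = self.policy
        if p.group not in user_groups:
            return False
        if nbytes > p.max_bytes or seconds > p.max_seconds:
            return False
        if nbytes > self.free_bytes:
            return False
        self.free_bytes -= nbytes
        return True

def find_space(catalog, user_groups, nbytes, seconds):
    """Ask each server listed in the catalog; return the first that accepts."""
    for server in catalog:
        if server.reserve(user_groups, nbytes, seconds):
            return server.name
    return None

# "Members of the CSE dept can borrow 200 GB for one week."
cse_rule = Policy(group="cse", max_bytes=200 * GB, max_seconds=WEEK)
catalog = [
    StorageServer("small", free_bytes=50 * GB, policy=cse_rule),
    StorageServer("large", free_bytes=500 * GB, policy=cse_rule),
]

# "Where can I find 100 GB for 24 hours?"
granted = find_space(catalog, {"cse"}, 100 * GB, 24 * HOUR)
print(granted)
```

Expiry and eviction of reservations are left out of the sketch; a real server would also have to track when each reservation ends.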

The Current Situation
[Diagram: several chirp servers, each exporting a filesystem protected by a simple ACL (hostname, kerberos, GSI) and sending status updates to a catalog server. Clients reach the servers through libchirp (open, close, read, write), through the chirp tool (GET, PUT), and through parrot, which lets ordinary programs (% cp, % emacs, % vi) use them.]

Demo Time!

Research Issues
Levels: Single Resource Management, Operating Systems Design, Collective Resource Management
Topics: Coordinated CPU-I/O, Distributed Debugging, Space Allocation, Dist Access Control, Visiting Principals, Allocation in FS
[Diagram: storage servers, a storage device, and an operating system]

Space Allocation
Simple implementation:
– Like quotas, keep a flat lookaside database.
– Update db on each write, or just periodically.
– To recover, re-scan entire filesystem.
– Not scalable to large FS or many allocations.
Better implementation:
– Keep alloc info hierarchically in the FS.
– To recover, re-scan only the dirty subtrees.
– A combination of a FS and hierarchical DB.
User representation?
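
The "better implementation" can be illustrated with a toy in-memory tree. This is a sketch under stated assumptions, not the system's actual design: the `Dir` class, the `dirty` flag, and the `crash_before_update` switch are all hypothetical, and a real filesystem would have to persist the flag before touching the data.

```python
class Dir:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.file_bytes = 0    # bytes of files stored directly here (ground truth)
        self.cached_total = 0  # allocation info kept alongside the directory
        self.dirty = False     # True while cached_total may be stale

    def mkdir(self, name):
        child = Dir(name, self)
        self.children[name] = child
        return child

def write(d, nbytes, crash_before_update=False):
    """Add nbytes of file data to d and update the totals up to the root."""
    node = d
    while node is not None:  # mark the whole path dirty before changing data
        node.dirty = True
        node = node.parent
    d.file_bytes += nbytes
    if crash_before_update:
        return               # simulated crash: totals are stale, flags stay set
    node = d
    while node is not None:
        node.cached_total += nbytes
        node.dirty = False
        node = node.parent

def recover(d, visited):
    """Recompute totals, descending only into dirty subtrees."""
    if not d.dirty:
        return d.cached_total  # clean subtree: trust it without scanning
    visited.append(d.name)
    d.cached_total = d.file_bytes + sum(
        recover(child, visited) for child in d.children.values()
    )
    d.dirty = False
    return d.cached_total

root = Dir("/")
a = root.mkdir("a")
b = root.mkdir("b")
write(a, 100)
write(b, 50)
write(b, 25, crash_before_update=True)

visited = []
total = recover(root, visited)
print(total, visited)
```

Recovery here touches only the root and `b`; the subtree under `a` is never scanned, which is the point of keeping the allocation information hierarchically.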

Distributed Access Control
Things I can’t do today:
– Give access rights to any CSE grad student on my local (non-AFS) filesystems. (Where Dr. Madey makes the list each semester.)
– Allow members of my conference committee to share my storage space in AFS. (Where I maintain the membership list.)
– Give read access to a valuable data repository to all faculty at Notre Dame and all members of a DHS Biometrics analysis program. (Where each list is kept elsewhere in the country.)

Distributed Access Control
What will this require?
– Separation of ACL services from filesystems.
– Simple administrative tools.
– Semantics for dealing with failure.
– Issues of security and privacy of access lists.
Isn’t this a solved problem?
– Not for multiple large-scale organizations.
– Not for varying degrees of trust and timeliness.
– (ACLs were still a research issue in SOSP 2003.)
The end result:
– A highly-specialized distributed database. (DNS)
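
One way to picture "separation of ACL services from filesystems" together with "semantics for dealing with failure" is an access check that consults a remote membership list and falls back on a recent cached answer when the list is unreachable. The sketch below is an assumption-laden illustration: `GroupService` and `AclChecker` are hypothetical names, and "fail closed after the cache expires" is one possible policy, not one the slides prescribe.

```python
import time

class GroupService:
    """Stand-in for a remote service that owns a membership list."""
    def __init__(self, members):
        self.members = set(members)
        self.reachable = True

    def lookup(self, user):
        if not self.reachable:
            raise ConnectionError("group service unreachable")
        return user in self.members

class AclChecker:
    """Filesystem-side check that delegates membership to a GroupService."""
    def __init__(self, service, max_age_seconds, clock=time.time):
        self.service = service
        self.max_age = max_age_seconds
        self.clock = clock
        self.cache = {}  # user -> (answer, time the answer was fetched)

    def allowed(self, user):
        try:
            answer = self.service.lookup(user)
            self.cache[user] = (answer, self.clock())
            return answer
        except ConnectionError:
            cached = self.cache.get(user)
            if cached and self.clock() - cached[1] <= self.max_age:
                return cached[0]  # recent enough to trust
            return False          # no fresh answer: fail closed

# A fake clock keeps the example deterministic.
now = [0.0]
service = GroupService({"alice"})
acl = AclChecker(service, max_age_seconds=3600, clock=lambda: now[0])

online = acl.allowed("alice")         # answered by the service
service.reachable = False
now[0] = 600.0
recent_outage = acl.allowed("alice")  # answered from the cache
now[0] = 7200.0
long_outage = acl.allowed("alice")    # cache too old, access denied
print(online, recent_outage, long_outage)
```

The `max_age_seconds` knob is where "varying degrees of trust and timeliness" would show up: a list that changes each semester can tolerate a much longer outage than one that changes daily.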

Nested Principals
How do we represent visiting users?
– Let visitors use my uid.
– Let visitors use “nobody” (root)
– Create a new temporary uid. (root)
– Sandbox user and audit every action. (complex)
Simple Idea: Let users create sub-principals.
– root -> root:dthain
– root:dthain -> root:dthain:afriend
The devil is in the details:
– Semantic issues: superiority, equivalence…
– Implementation issues: AAA, filesystem, persistence
– Philosophical issues: capabilities vs ACLs
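
The naming scheme above lends itself to a very small sketch. The functions `sub_principal` and `is_superior` are hypothetical, and defining superiority as "is a strict ancestor in the name" is an assumption made for illustration; the slide explicitly leaves the semantics open.

```python
def sub_principal(parent, name):
    """Create a child principal name beneath an existing principal."""
    if not name or ":" in name:
        raise ValueError("sub-principal names must be non-empty and contain no ':'")
    return f"{parent}:{name}"

def is_superior(a, b):
    """True if a is a strict ancestor of b in the naming hierarchy."""
    return b.startswith(a + ":")

dthain = sub_principal("root", "dthain")
afriend = sub_principal(dthain, "afriend")
print(dthain, afriend)
print(is_superior("root", afriend), is_superior(afriend, dthain))
```

Comparing against `a + ":"` rather than `a` alone keeps a principal such as `root:dt` from being treated as superior to `root:dthain`.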

Coordinated CPU and I/O
We usually think of a cluster as:
– N CPUs + disks to install the OS on.
– Use local disks as cache for primary server.
– Not smart for data-bound applications.
– (As CPUs get faster, everyone is data bound!)
Alternate conception:
– Cluster = Storage device with inline CPUs.
– Molasses System: Minimize movement of jobs and/or the data they consume.
– Large-scale PIM!
– Perfect for data exploration.
[Diagram: three CPUs, each running a job, alongside three storage servers, each holding data]
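
The "minimize movement" idea can be sketched as a placement rule: run each job on the node that already holds its data, and only fall back to another node when that one has no free CPU. This is a minimal sketch with hypothetical names (`place_jobs` and its arguments); it ignores data size, replication, and transfer cost.

```python
def place_jobs(jobs, data_location, free_cpus):
    """Assign jobs to nodes, preferring the node that stores each job's data.

    jobs: {job: dataset}, data_location: {dataset: node}, free_cpus: {node: count}
    """
    free = dict(free_cpus)
    placement = {}
    for job, dataset in jobs.items():
        home = data_location.get(dataset)
        if home is not None and free.get(home, 0) > 0:
            node = home  # run where the data already is: nothing moves
        else:
            candidates = [n for n, count in free.items() if count > 0]
            if not candidates:
                placement[job] = None  # no CPU available anywhere
                continue
            # Fallback: the data will have to move to the least-loaded node.
            node = max(candidates, key=lambda n: free[n])
        free[node] -= 1
        placement[job] = node
    return placement

jobs = {"j1": "d1", "j2": "d1", "j3": "d2"}
data_location = {"d1": "n1", "d2": "n2"}
free_cpus = {"n1": 1, "n2": 2}

placement = place_jobs(jobs, data_location, free_cpus)
print(placement)
```

In this example only `j2` runs away from its data, because `n1` has a single free CPU and `j1` takes it first.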


Distributed Debugging
[Diagram: a job passes from a workload manager through many components (batch system, auth gateway, kerberos, license manager, cpu, storage servers, archival host), each with its own log file; a debugger draws on all of the log files.]

Distributed Debugging
Big challenges!
– Language issues: storing and combining logs.
– Ordering: How to reassemble events?
– Completeness: Gaps, losses, detail.
– Systems: Distributed data collection.
But, could be a big win:
– “A crashes whenever X gets its creds from Y.”
– “Please try again: I have turned up the detail on host B.”
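
The ordering problem has a simple baseline: merge the per-host logs by timestamp. The sketch below does exactly that with the standard library's `heapq.merge`. It assumes every host's log is already sorted and that all clocks are synchronized, which is precisely what a real grid cannot guarantee; the log contents are invented for illustration.

```python
import heapq

def merge_logs(logs):
    """Merge per-host logs into one timeline ordered by timestamp.

    logs: {host: [(timestamp, message), ...]}, each list sorted by timestamp.
    Returns a list of (timestamp, host, message) tuples.
    """
    streams = (
        [(ts, host, msg) for ts, msg in entries]
        for host, entries in logs.items()
    )
    return list(heapq.merge(*streams))

# Hypothetical log entries in the spirit of the slide's example.
logs = {
    "A": [(1.0, "start"), (4.0, "crash")],
    "Y": [(2.0, "issue creds to X")],
    "X": [(3.0, "pass creds to A")],
}

merged = merge_logs(logs)
for ts, host, msg in merged:
    print(ts, host, msg)
```

With clock skew between hosts, a timestamp merge can reorder causally related events, so a practical tool would need some other ordering signal in addition to wall-clock time.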


Motto: You must build it and use it in order to understand it!

For more information…
Software, systems, papers, etc…
– The Cooperative Computing Lab
Or stop by to chat…
– Douglas Thain
– 356-D Fitzpatrick