Satisfying the Ever Growing Appetite of HEP for HTC


Satisfying the Ever Growing Appetite of HEP for HTC
Miron Livny, John P. Morgridge Professor of Computer Science, Center for High Throughput Computing, University of Wisconsin-Madison

Higher luminosity and more complex events drive the LHC appetite for HTC capacity

Exascale (and commercial clouds) is a "phenotype" of the research computing ecosystem that is the result of the "expression" of changes in the "genetic code" of funding for HEP computing.
- Computing constrains science
- Less control (ownership) of the computing environment

“Recommendation 2.2. NSF should (a) … and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as high-performance workflows to them.”

“… many fields today rely on high-throughput computing for discovery.” “Many fields increasingly rely on high-throughput computing”

The UW-Madison Center for High Throughput Computing (CHTC) was established in 2006 to bring the power of Distributed High Throughput Computing (HTC) to all fields of study, allowing the future of Distributed HTC to be shaped by insight from other disciplines

Top 15 projects from the latest 90-day report from the CHTC

My partnership with HEP computing started in 1990 with an email, sent after its author read a paper we published in 1988. He basically said: "we need what you did, and rather than re-doing it ourselves, let us collaborate and build on your software!"

CERN 92

1994 Worldwide Flock of Condors: flocked Condor pools in Madison, Delft, Amsterdam, Warsaw, Geneva, and Dubna/Berlin (map of pool sizes).
D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters," Journal on Future Generations of Computer Systems, Volume 12, 1996.

Claims for "benefits" provided by Distributed Processing Systems (P. H. Enslow, "What is a Distributed Data Processing System?", Computer, January 1978):
- High Availability and Reliability
- High System Performance
- Ease of Modular and Incremental Growth
- Automatic Load and Resource Sharing
- Good Response to Temporary Overloads
- Easy Expansion in Capacity and/or Function

Definitional Criteria for a Distributed Processing System (P. H. Enslow and T. G. Saponas, "Distributed and Decentralized Control in Fully Distributed Processing Systems," Technical Report, 1981):
- Multiplicity of resources
- Component interconnection
- Unity of control
- System transparency
- Component autonomy

Unity of Control: All the components of the system should be unified in their desire to achieve a common goal. This goal determines the rules according to which each of these elements is controlled.

Component Autonomy: The components of the system, both logical and physical, should be autonomous and thus afforded the ability to refuse a request for service made by another element. However, in order to achieve the system's goals they must interact cooperatively and thus adhere to a common set of policies. These policies should be carried out by the control scheme of each element.

On the second page of the paper …

Mechanisms hold the key.
The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, July 1998, 701 pages.

Resource Sharing, Resource Scavenging and the problem of Load Balancing in Distributed Systems

- The value of Job Control Languages (JCL)
- The power of preemptive-resume mechanisms
- Data shipping vs. function shipping
- Protection across job lines
- Expressing, implementing, debugging, monitoring and understanding policies

Miron Livny, John P. Morgridge Professor of Computer Science, Director, Center for High Throughput Computing, University of Wisconsin-Madison
It is all about Storage Management, stupid! (Allocate B bytes for T time units)
- We do not have tools to manage storage allocations
- We do not have tools to schedule storage allocations
- We do not have protocols to request allocation of storage
- We do not have means to deal with "sorry, no storage available now"
- We do not know how to manage, use and reclaim opportunistic storage
- We do not know how to budget for storage space
- We do not know how to schedule access to storage devices
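As a thought experiment only, here is a minimal, purely hypothetical Python sketch of what a lease-style "allocate B bytes for T time units" interface could look like; none of the names below correspond to an existing tool or protocol.

```python
# Hypothetical lease-style storage allocation: grant B bytes until an expiry
# time, refuse when the pool is full, and reclaim expired leases on demand.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from uuid import uuid4

@dataclass
class StorageLease:
    bytes_granted: int
    expires_at: datetime
    lease_id: str = field(default_factory=lambda: uuid4().hex)

class StorageAllocator:
    """Hypothetical manager of a fixed pool of scratch space."""
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.leases: dict[str, StorageLease] = {}

    def request(self, n_bytes: int, duration: timedelta) -> StorageLease | None:
        self._reclaim_expired()
        in_use = sum(l.bytes_granted for l in self.leases.values())
        if in_use + n_bytes > self.capacity:
            return None                      # "sorry, no storage available now"
        lease = StorageLease(n_bytes, datetime.utcnow() + duration)
        self.leases[lease.lease_id] = lease
        return lease

    def _reclaim_expired(self) -> None:
        now = datetime.utcnow()
        self.leases = {k: l for k, l in self.leases.items() if l.expires_at > now}

allocator = StorageAllocator(capacity_bytes=10 * 2**40)        # a 10 TiB scratch pool
lease = allocator.request(500 * 2**30, timedelta(hours=12))    # 500 GiB for 12 hours
print("granted" if lease else "denied")
```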

On The Fly Capacity Planning

Expanding our GPU capacity:
- UW grid
- Cooley
- condor_annex

Software dependencies:
- Chemical screening software
- System libraries
- NVIDIA driver XXX.YY

Python Notebooks, the return of the interactive workload

How to bring HTC to Jupyter Notebooks*?
- Researchers want to "live" in a notebook
- Interactive nature of the notebook vs. the batch nature of HTC
- Shared-everything notebook environment vs. shared-nothing approach of (distributed) HTC
- Is the Python MAP function the right abstraction for bags of HTC tasks?
* NOT the same as a Python binding
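A minimal sketch of the "map over a bag of tasks" idea, using Python's standard process pool as a local stand-in for an HTC pool; the analyze function and task counts are illustrative, not from the slides. Whether this map-style abstraction scales to HTC-sized bags of jobs is exactly the open question above.

```python
# Map a function over a bag of independent tasks while the researcher stays
# "in the notebook"; a process pool stands in for remote HTC resources.
from concurrent.futures import ProcessPoolExecutor

def analyze(event_id: int) -> float:
    # stand-in for a per-event (or per-file) HTC task
    return sum(i * i for i in range(event_id % 1000)) / 1e6

if __name__ == "__main__":
    event_ids = range(10_000)                    # the "bag" of independent tasks
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(analyze, event_ids, chunksize=256))
    print(len(results), max(results))
```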

1992

Submit locally (queue and manage your jobs/tasks locally; leverage your local resources) and run globally (acquire any resource that is capable and willing to run your job/task)
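A hedged sketch of submit-locally/run-globally using the HTCondor Python bindings; the binding calls and submit attributes below reflect recent HTCondor versions but should be treated as assumptions, and the executable and file names are hypothetical.

```python
# Queue 100 jobs at the *local* schedd; where they run is decided later by
# matchmaking with whatever resources are capable and willing to run them.
import htcondor

description = htcondor.Submit({
    "executable": "analyze.sh",            # hypothetical user executable
    "arguments": "$(ProcId)",
    "output": "analyze.$(ProcId).out",
    "error": "analyze.$(ProcId).err",
    "log": "analyze.log",                  # local log used to manage and monitor the jobs
    "request_cpus": "1",
    "request_memory": "2GB",
})

schedd = htcondor.Schedd()                 # the local queue
result = schedd.submit(description, count=100)
print("Submitted cluster", result.cluster())
```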

- Job owner identity is local: owner identity should never "travel" with the job to the execution site; namely, trust should be established at the level of service providers.
- Owner attributes are local.
- Name spaces are local: file names are locally defined; they can be part of a distributed federation.
- Resource acquisition is local: the submission site (local) is responsible for the acquisition of all resources, local and remote.

"External" forces moved us away from this "pure" local-centric view of the distributed computing environment. With the help of capabilities (short-lived tokens) and reassignment of responsibilities we are committed to regaining full local control. Handling users with money (real or funny) to acquire computing resources helps us move (push) in this positive direction.

Master-Worker Applications

- Clean separation between resource allocation (deployment of workers) and work delegation (assignment of tasks)
- Management and deployment of workers is (typically) external to the application
- Assumes on-the-fly deployment of workers (DevOps) and remote access to data
- Built-in fault recovery
- Built-in elasticity
- The master can follow a scheduling policy when selecting work for execution
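A minimal Python sketch of this separation, assuming a shared task queue: the workers are generic and simply pull tasks, while how many of them are started (the "deployment" decision) is made outside the task logic. Names and numbers are illustrative.

```python
# Master-worker with a shared queue: elasticity is just starting more workers
# on the same queue; fault recovery amounts to re-queuing a lost worker's tasks.
import multiprocessing as mp

def worker(task_queue, result_queue):
    for task in iter(task_queue.get, None):      # None is the shutdown signal
        result_queue.put((task, task * task))    # stand-in for real work

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    n_workers = 4                                # deployment decision, external to the tasks
    workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
    for w in workers:
        w.start()

    work = list(range(20))                       # the bag of tasks the master delegates
    for t in work:
        tasks.put(t)
    for _ in workers:                            # one shutdown signal per worker
        tasks.put(None)

    collected = [results.get() for _ in work]    # master gathers results as they arrive
    for w in workers:
        w.join()
    print(sorted(collected)[:5])
```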

And what if the Worker is a (little) Master?

- Addresses connectivity challenges between workers and the (top) Master
- Helps with storage allocation and data placement management
- Supports scale-out of Master duties
- Promotes groups of jobs/tasks into higher-level structures

- Some HTC applications may have job interdependencies that form (complex and/or very large) workflows
- Directed Acyclic Graphs (DAGs) effectively capture job interdependencies
- Workflows are restartable and recoverable
- Workflows can be recursive
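The sketch below (plain Python, not DAGMan) illustrates how a DAG of tasks can be run in dependency order and made restartable by checkpointing completed task names; the task names and commands are hypothetical.

```python
# Run a tiny DAG in dependency order; completed tasks are recorded in a file
# so a rerun after a failure skips work that already finished (recovery).
import json, subprocess
from pathlib import Path

dag = {                       # task -> parent tasks that must finish first
    "generate": [],
    "simulate": ["generate"],
    "reconstruct": ["simulate"],
    "analyze": ["reconstruct"],
}
commands = {t: ["echo", f"running {t}"] for t in dag}   # stand-in job commands

state = Path("dag.done.json")
done = set(json.loads(state.read_text())) if state.exists() else set()

while len(done) < len(dag):
    ready = [t for t, parents in dag.items()
             if t not in done and all(p in done for p in parents)]
    if not ready:
        raise RuntimeError("cycle detected or unsatisfiable dependencies")
    for task in ready:                    # a real system would run these in parallel
        subprocess.run(commands[task], check=True)
        done.add(task)
        state.write_text(json.dumps(sorted(done)))      # checkpoint for recovery
```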

Example of a LIGO Inspiral DAG (Workflow)

The advantages and challenges of "remote I/O" in a distributed environment:
- Instrumentation/configuration of worker nodes to support remote access to data/storage
- Management of the I/O capacity of shared data servers
- How to use/leverage data placement and data caching?
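One illustrative answer to the caching question, offered as a sketch rather than a description of any existing HTCondor mechanism: a worker-side helper that checks node-local scratch space before falling back to remote I/O, so repeated tasks on the same node do not hammer the shared data servers. The cache directory and URL are assumptions.

```python
# Cache-aware remote read: hit the node-local cache when possible, otherwise
# fetch from the remote data server and store the result for later tasks.
import hashlib, shutil, urllib.request
from pathlib import Path

CACHE_DIR = Path("/tmp/htc-cache")        # assumed node-local scratch space

def fetch(url: str) -> Path:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cached.exists():                   # cache hit: no load on the data server
        return cached
    tmp = cached.with_suffix(".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out) # remote I/O happens only on a miss
    tmp.rename(cached)
    return cached

# path = fetch("https://data.example.org/dataset/file-001.root")  # hypothetical URL
```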

In 1996 I introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center, and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

1997 Obstacles to HTC
- Ownership Distribution (Sociology)
- Size and Uncertainties (Robustness)
- Technology Evolution (Portability)
- Physical Distribution (Technology)
1997 High Throughput Computing

High Throughput Computing requires automation, as it is a 24-7-365 activity that involves large numbers of jobs and computing resources.
FLOPY ≠ (60*60*24*7*52)*FLOPS
100K Hours * 1 Job ≠ 1 Hour * 100K Jobs
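A small worked example of the arithmetic behind both inequalities, with an assumed peak rate and utilization; the numbers are illustrative, not from the slides.

```python
# Throughput (FLOPY) vs. speed (FLOPS): peak rate times seconds in a year is
# an upper bound that sustained, year-long throughput rarely approaches.
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52        # 31,449,600 s in a 52-week year

peak_flops = 1e15                               # hypothetical 1 PFLOPS machine
ideal_flopy = peak_flops * SECONDS_PER_YEAR     # FLOPY at 100% sustained use

utilization = 0.35                              # assumed average sustained utilization
delivered_flopy = ideal_flopy * utilization     # what a year actually delivers

print(f"Ideal FLOPY:     {ideal_flopy:.3e}")
print(f"Delivered FLOPY: {delivered_flopy:.3e}")

# Likewise, 100K hours for 1 job and 1 hour for 100K jobs cost the same
# core-hours, but only the second shape of work benefits from an HTC system
# that runs many independent jobs concurrently.
assert 100_000 * 1 == 1 * 100_000               # same core-hours, very different science
```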