Implementing a Central Quill Database in a Large Condor Installation
Preston Smith
Condor Week 2008 - April 30, 2008

Overview
– Background: BoilerGrid
– Motivation
– What works well
– What has been challenging
– What just doesn't work
– Future directions

BoilerGrid
– Purdue Condor Grid (BoilerGrid)
  – Comprised of Linux HPC clusters, student labs, machines from academic departments, and Purdue regional campuses
  – 8,900 batch slots today; 14,000 batch slots in a few weeks
  – Has delivered over 10 million CPU-hours of high-throughput science to Purdue and the national community through the Open Science Grid and TeraGrid

BoilerGrid - Growth

BoilerGrid - Results

A Central Quill Database
– Condor 6.9.4:
  – Quill can store information about all the execute machines and daemons in a pool
  – Quill is now able to store job history and queue contents in a single, central database
– Since December 2007, we've been working to store the state of BoilerGrid in a Quill installation
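In condor_config terms, pointing a pool at one central database looks roughly like the sketch below. The parameter names follow the Quill section of the Condor manual of that era; the host and database names are illustrative, not our production values.

# condor_config (sketch): send Quill data to one central Postgres server
QUILL_ENABLED = TRUE
QUILL_NAME = quill@$(FULL_HOSTNAME)
QUILL_DB_TYPE = PGSQL
QUILL_DB_NAME = quill
QUILL_DB_IP_ADDR = quill-db.rcac.purdue.edu:5432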

Motivation
– Why would we want to do such a thing??
– Research into the state of a large distributed system
  – Several at Purdue, collaborators at Notre Dame
  – Failure analysis/prediction, smart scheduling, interesting reporting for machine owners
  – "events" table useful for user troubleshooting?
– And one of our familiar gripes: usage reporting
  – Structural biologists (see earlier today) like to submit jobs from their desks, too
  – How can we access that job history to complete the picture of BoilerGrid's usage?

The Quill Server
– Dell 2850
  – 2x 2.8 GHz Xeons (hyperthreaded)
  – Postgres on 4-disk Ultra320 SCSI RAID-0
  – 5GB RAM

What works well
– Getting at usage data!

quill=> select distinct scheddname,owner,cluster_id,proc_id,remotewallclocktime
quill->   from jobs_horizontal_history
quill->   where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;

       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+---------------------
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |            |       0 |                 539
(10 rows)

What works, but is painful
– Thousands of hosts pounding a Postgres database is non-trivial
– Be sure to turn down QUILL_POLLING_PERIOD
  – Default is 10s; we went down to 1 hour on execute machines
– At some level, this is an exercise in tuning your Postgres server
– Quick diversion into Postgres tuning

top - 13:45:30 up 23 days, 19:59, 2 users, load average: , , 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem: k total, k used, 36916k free, 10820k buffers
Swap: k total, 68292k used, k free, k cached
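As a sketch, the polling change above is a one-line condor_config setting on the execute machines; the 3600-second value matches the one-hour period we settled on.

# condor_config on execute machines: have Quill poll hourly instead of every 10s
QUILL_POLLING_PERIOD = 3600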

Postgres
– Assuming that there's enough disk bandwidth...
– In order to support 2,500 simultaneous connections, one must turn up max_connections
– If you turn up max_connections, you need ~400 bytes of shared memory per slot
  – Currently we have 2GB of shared memory allocated
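In configuration terms that looks roughly like the following; the kernel must also allow Postgres a shared memory segment that large. Values mirror the numbers on this slide.

# postgresql.conf
max_connections = 2500      # one slot per connecting Quill daemon

# /etc/sysctl.conf: permit a 2 GB shared memory segment
kernel.shmmax = 2147483648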

Postgres
– Then you'll need to turn up shared_buffers
  – 1GB currently
– Don't forget fsm_pages...

WARNING: relation "public.machines_vertical_history" contains more than "max_fsm_pages" pages with useful free space
HINT: Consider compacting this relation or increasing the configuration parameter "max_fsm_pages".
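The corresponding postgresql.conf lines would look something like this (Postgres 8.x-era parameters; the max_fsm_pages value is illustrative, not ours - size it to the HINT above):

# postgresql.conf
shared_buffers = 1GB        # current setting on our server
max_fsm_pages = 2000000     # illustrative; raise until the WARNING goes away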

What works, but is painful
– So by now we can withstand the worker nodes reasonably well
– Add schedds
  – condor_history returns history from ALL schedds
    – Bug fixed in
  – The execute machines create enough load that condor_q is sluggish
  – Added a 2nd Quill database server just for job information (sketch below)
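Splitting the load across two database servers is, in sketch form, just pointing different daemons' QUILL_DB_IP_ADDR at different hosts; the host names here are hypothetical, and this is one plausible way to do the split rather than a recipe.

# condor_config on submit machines: job queue/history goes to its own server
QUILL_DB_IP_ADDR = quill-jobs.rcac.purdue.edu:5432

# condor_config on execute machines: machine/daemon state stays on the original server
QUILL_DB_IP_ADDR = quill-machines.rcac.purdue.edu:5432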

What works, but is painful
– If your daemons log a lot to sql.log files, but are not writing to the database...
  – Database down, etc.
  – ...your database is in a world of hurt while it tries to catch up

What Hasn't Worked
– Many Postgres tuning guides recommend a connection pooler if you need scads of connections
  – pgpool-II
  – PgBouncer
– Tried both; Quill doesn't seem to like it
  – It *did* reduce load...
  – But it often locked up the database (idle in transaction), and didn't get anywhere
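For reference, the sort of minimal PgBouncer setup we tried looks roughly like this; host, auth details, and pool sizes are illustrative, not our production values.

; pgbouncer.ini (sketch)
[databases]
quill = host=quill-db.rcac.purdue.edu port=5432 dbname=quill

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session          ; we saw "idle in transaction" lockups regardless
max_client_conn = 2500       ; every Quill daemon in the pool
default_pool_size = 50       ; actual server connections per database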

What can we do about it?
– Throw hardware at the database!
  – Spindle count seems OK: not I/O bound (any more)
  – More memory = more connections: 16GB? More?
  – More, faster CPUs: we appear to be CPU-bound now, so get the latest multi-cores

What can we do about it?
– Contact Wisconsin and call for rescue
  – "Hey guys... this is really hard on the old database."
  – "Hmm. Let's take a look."

What can Wisconsin do about it?
– Todd, Greg, and probably others take a look:
  – Quill always hits the database, even for unchanged ads
  – The Postgres backend does not prepare SQL queries before submitting them
    – Being fixed; Todd is optimistic
– We'll report the results as soon as we have them
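To illustrate what preparing queries buys: instead of parsing and planning the same statement for every ad, the backend can plan once and execute many times. A hypothetical example (statement name is invented) against the history table from the earlier slide:

-- Plan once...
PREPARE job_wallclock (text) AS
    SELECT owner, cluster_id, proc_id, remotewallclocktime
    FROM jobs_horizontal_history
    WHERE scheddname = $1;

-- ...then execute repeatedly without re-parsing or re-planning
EXECUTE job_wallclock('epsilon.bio.purdue.edu');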

Future Directions
– Reporting for users
  – Easy access to statistics about who ran on "my" machines
  – Mashups, web portals
– Diagnostic tools to help users
  – Troubleshooting, etc.

The End
Questions?

Backup slides

BoilerGrid - Results

Year | Pool Size | Jobs     | Hours Delivered | Unique Users
     |           | ,551     | 346,            |
     |           | ,717     | 1,695,          |
     |           | ,251,981 | 5,527,          |
     |           | ,611,813 | 9,524,          |
     |           | ??       |                 | 63 so far..