Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

1 Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

2 Background: Why (In a Slide!)
Grid Software: Important to Science and Industry
Quality of Grid Software: Not So Much
Testing: Key to Quality
Testing Distributed Software: Hard
Testing Distributed Software Stacks: Harder
Distributed Software Testing Tools: Nonexistent (before)
We Needed Help, We Built Something to Help Ourselves and Our Friends, We Think It Can Help Others

3 Background: What (In a Slide!)
A Framework and Tool: Metronome
  – Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools
  – Portable, open source
  – Language/harness independent
  – Assumes >1 user, >1 project, >1 environment needing resources at >1 site
  – Encourages explicit, well-controlled build/test environments for reproducibility
  – Central results repository
  – Fault-tolerant
  – Encourages build/test separation
A Facility: The NMI Lab
  – 200+ cores, 50+ platforms @ UW (Noah’s Ark; the Anti-Cluster)
  – Built to use distributed resources at other sites, grids, etc.
  – 200 users, dozens of registered projects (most of them “real”)
  – 84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB
A Team
  – Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself. (More coming.)
A Community
  – Working with TeraGrid, OSG, ETICS, others towards a common international build/test infrastructure.

4 Metronome Architecture (In a Slide!)
[Architecture diagram. Labels: INPUT, Customer Source Code, Customer Build/Test Scripts, Spec File, Metronome, DAGMan, DAG, Condor Queue, build/test jobs, Distributed Build/Test Pool, results, OUTPUT, MySQL Results DB, Web Status Pages, Finished Binaries.]
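
The diagram's central idea, a DAG of build/test jobs handed to a scheduler, can be sketched roughly as below. This is an illustration only and not Metronome's spec-file format or API: the `make_dag` helper, the task names, and the platform names are all made up for the example. It uses Python's standard `graphlib` module to order tasks so that each platform's tests run only after that platform's build.

```python
from graphlib import TopologicalSorter


def make_dag(platforms):
    """Map each task to the set of tasks it depends on (hypothetical layout)."""
    dag = {"fetch_source": set()}
    for platform in platforms:
        # Each platform's build needs the source; its test needs its build.
        dag[f"build:{platform}"] = {"fetch_source"}
        dag[f"test:{platform}"] = {f"build:{platform}"}
    # The final report waits for every platform's tests.
    dag["report"] = {f"test:{platform}" for platform in platforms}
    return dag


def execution_order(dag):
    """Return one valid order in which the tasks could be run."""
    return list(TopologicalSorter(dag).static_order())


# Placeholder platform names, not taken from the NMI Lab's platform list.
platforms = ["x86_64_linux", "ppc_aix"]
order = execution_order(make_dag(platforms))
```

A real workflow manager would also run independent branches in parallel and retry failed nodes; the sketch only shows the dependency ordering.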

5 Why Is This Architecture Powerful?
Fault tolerance, resource management.
Real scheduler, not a toy or afterthought.
Flexible workflow tools.
Nothing to deploy in advance on worker nodes except Condor – can harness “unprepared” resources.
Advanced job migration capabilities – critical for goal of a common build/test infrastructure across projects, sites, countries.

6 Example: NMI Lab / ETICS Site Federation with Condor-C

7 10k Foot View
Past:
  – humble beginnings, ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...)
Present:
  – now we have tax money and users should have higher expectations
  – good news: six months into a new 3-year funding cycle, our "professionalism" has improved from our humble beginnings -- better hardware, better processes, better staffing
  – bad news: we’re still a bit ragtag -- inconsistent support/development request tracking, inconsistent info on resource/lab improvements, issues, and resolution, generally reactive to problems
  – we're clearly contributing to the build & test capabilities of the community, but we’d like to deliver much more, especially with respect to testing.

8 10k Foot View: Future
Maintain Metronome and the NMI Lab
  – continue to professionalize lab infrastructure, improve availability, stability, uptime
  – Better monitoring -> more proactive response to issues
  – Better scheduling of jobs, better use of VMs to respond to uneven x86 platform demand
Enhance Metronome and the NMI Lab
  – New features, new capabilities
  – but might be less important than clarity, usability, fit & finish of existing features.

9 10k Foot View: Future
Support Metronome and the NMI Lab
  – more systematic support operation (ticketing, etc.)
  – more utilization of basic testing capabilities by new users
  – more utilization of advanced testing capabilities by existing users
  – more & better information for users, admins, and pointy-haired bosses
  – better reporting on users, resources, usage, operations, etc.
Nurture Distributed Software Testing Community
  – to identify common B&T needs to improve software quality.
  – to challenge and help us to provide software & services to help meet B&T needs.
  – Tuesday’s meeting was a good start, I hope…

10 Maslow’s Pyramid of Testing Needs

11 Testing Opportunities
more resources == more possibilities (just like science)
  – don’t just test under normal conditions, test the not-so-edge cases too (e.g., with CPU load!)
  – test everywhere your users run, not just where you develop
  – old/exotic/unique resources you don’t own (NMI Lab, TeraGrid)
“black box”
  – run your existing test harness (tinderbox, etc.) inside Metronome
decoupled builds & tests
  – run new tests on old builds
  – cross-platform binary compatibility testing
  – run quick smoke tests continuously, heavy tests nightly, performance/scalability tests before release
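
Decoupling builds from tests means a test run is just a pairing of a stored build with a test suite, chosen by schedule. The sketch below illustrates that idea under stated assumptions: the `Build` and `Suite` types, the trigger names, and the `plan_runs` helper are hypothetical and do not reflect how Metronome models builds or tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    build_id: str
    platform: str


@dataclass(frozen=True)
class Suite:
    name: str
    tier: str  # "smoke", "heavy", or "performance"


# Which tier runs on which trigger, following the slide's three cadences.
TIER_BY_TRIGGER = {
    "continuous": "smoke",
    "nightly": "heavy",
    "pre_release": "performance",
}


def plan_runs(builds, suites, trigger):
    """Pair every stored build (old or new) with the suites for this trigger."""
    tier = TIER_BY_TRIGGER[trigger]
    return [(b.build_id, s.name) for b in builds for s in suites if s.tier == tier]


# Made-up data: an older build is retested alongside the newest one.
builds = [Build("build-old", "linux"), Build("build-new", "linux")]
suites = [Suite("quick-check", "smoke"), Suite("full-regression", "heavy")]
continuous_plan = plan_runs(builds, suites, "continuous")
```

Because builds are inputs rather than side effects of a test run, adding a new suite later lets it be applied to builds that already exist.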

12 Testing Opportunities
managed (static) vs. “unmanaged” (auto-updating) platforms
  – isolate your changes from the OS vendors’ changes
  – test your changes against a fixed target
  – test your working code against a moving target
root-level testing
automated reports from testing tools
  – Valgrind, Purify, Coverity, etc.
cross-platform binary testing (build on A, test on B)
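
One way to read the managed vs. unmanaged comparison is as a small decision table: running the same code on a pinned platform and on an auto-updating one hints at whether a failure came from your change or from the platform. The function below is a rough heuristic written for illustration, not something taken from the slides or from Metronome; the name `diagnose` and the returned labels are assumptions.

```python
def diagnose(passed_on_static, passed_on_updating):
    """Suggest where to look first, given results on the two platform kinds."""
    if passed_on_static and passed_on_updating:
        return "ok"
    if passed_on_static:
        # Same code, fixed target passes: the moving target likely changed.
        return "suspect platform update"
    if passed_on_updating:
        # Only the pinned platform fails: code may rely on something newer.
        return "suspect dependence on newer platform"
    # Fails everywhere: the code change itself is the first suspect.
    return "suspect code change"
```

The labels are only starting points for investigation; a flaky test or an infrastructure problem can produce any of these patterns.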

13 Testing Opportunities
Parameterized dependencies
  – build with multiple library versions, compilers, etc.
  – test against every Java VM, Maven, Ant version around
  – test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), batch systems
  – make sure new versions of Condor, Globus, etc. don’t break your code
Parallel scheduled testbeds
  – cross-platform testing (A to B)
  – deploy software stack across many hosts, test whole stack
  – multi-site testing (US to Europe)
  – network testing (cross-firewall, low-bandwidth, etc.)
  – scalability testing
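
Parameterized dependencies amount to running the same build/test over the cross product of several axes. A minimal sketch of generating such a matrix with Python's `itertools.product` follows; the axis names and version strings are placeholders chosen for the example, and `build_matrix` is a hypothetical helper, not part of Metronome.

```python
from itertools import product


def build_matrix(axes):
    """Expand {axis: [values]} into one dict per combination of values."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Placeholder axes and versions, for illustration only.
axes = {
    "compiler": ["gcc-3.4", "gcc-4.1"],
    "jvm": ["1.5", "1.6"],
    "database": ["mysql", "postgres"],
}
matrix = build_matrix(axes)
```

The matrix grows multiplicatively with each axis (2 x 2 x 2 = 8 configurations here), which is why a large shared pool of machines matters for this style of testing.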

14 Upshot
This is all work we’d like to help this community do.
Start small -- automated builds are an excellent start.
Think big -- what kinds of testing would pay dividends?
Let us know what we can do to help make it happen.

