Scaling Up PVSS: Showstopper Tests – Paul Burkimsher, IT-CO-BE, July 2004

Aim of the Scaling Up Project (WYSIWYAF) Investigate the functionality and performance of large PVSS systems. Reassure ourselves that PVSS scales to support large systems. Provide detail rather than bland reassurances.

What has been achieved? Over 18 months PVSS has gone through many pre-release versions – “2.13” – 3.0 Alpha – 3.0 Pre-Beta – 3.0 Beta – 3.0 RC1 – 3.0 RC1.5. Lots of feedback to ETM, and ETM have incorporated design fixes & bug fixes.

Progress of the project has closely followed the different versions, with some going over the same ground, repeating tests as bugs were fixed. Good news: the V3.0 official release is now here (even 3.0.1). Aim of this talk: – summarise where we’ve got to today – show that the list of potential “showstoppers” has been addressed.

What were the potential showstoppers? Basic functionality –Synchronised types in V2 ! Sheer number of systems –Can the implementation cope? Sheer number of displays Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable?

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable? } Skip

Sheer number of systems 130 systems simulated on 5 machines, 40,000 DPEs each (~5 million DPEs in total), interconnected successfully.

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable? } Skip

Alert Avalanche Configuration 2 WXP machines, each machine = 1 system. Each system has 5 crates declared × 256 channels × 2 alerts in each channel (“voltage” and “current”); 40,000 DPEs in total in each system. Each system showed alerts from both systems. (Diagram: the two systems, 94 and 91, each with a UI.)

Traffic & Alert Generation Simple UI script Repeat –Delay D mS –Change N DPEs Traffic rate D \ N –Bursts. –Not changes/sec. Option provoke alerts

Alert Avalanche Test Results - I You can select which systems’ alerts you wish to view. The UI caches ALL alerts from ALL selected systems and needs sufficient RAM (5,000 CAME + 5,000 WENT alerts needed 80 MB). Screen update is CPU hungry and an avalanche takes time(!) – 30 sec for 10,000 lines.
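
As a back-of-the-envelope figure derived from the measurement above (10,000 cached alerts needing about 80 MB), one can estimate the extra RAM an alert-screen UI would need for a larger avalanche. This is only arithmetic on that single data point, not an ETM specification:

```python
MB_MEASURED, ALERTS_MEASURED = 80, 5_000 + 5_000              # the figure quoted on the slide
KB_PER_CACHED_ALERT = MB_MEASURED * 1024 / ALERTS_MEASURED    # ~8 KB per cached alert

def ui_cache_estimate_mb(came, went):
    """Rough extra RAM (MB) the alert-screen UI needs to cache an avalanche."""
    return (came + went) * KB_PER_CACHED_ALERT / 1024

print(ui_cache_estimate_mb(50_000, 50_000))   # a 10x larger avalanche -> roughly 800 MB
```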

Alert Avalanche Test Results - II Too many alerts -> progressive degradation: 1) Screen update suspended – a message is shown. 2) Evasive action – the Event Manager eventually cuts the connection to the UI and the UI suicides. The EM correctly processed ALL alerts and LOST NO DATA.

Alert Avalanche Test Results - III Alert screen update is CPU intensive. Scattered alert screens behave the same as local ones (TCP). “Went” alerts that are acknowledged on one alert screen disappear from the other alert screens, as expected. The bugs we reported have now been fixed.

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable?

Agreed Realistic Configuration A 3-level hierarchy of machines. Only ancestral connections, no peer links; only direct connections allowed. 40,000 DPEs in each system, 1 system per machine. Mixed platform (W = Windows, L = Linux). (Diagram: three Windows machines at the top two levels, with Linux machines as the leaves.)
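
To make the “ancestral connections only” rule concrete, here is a small Python sketch of the agreed topology. Only PC91 (top) and PC93/PC94 (middle) are clearly named in these slides; which leaf hangs off which middle node is an assumption made purely for illustration:

```python
# Three-level hierarchy: a Windows top node, Windows middle nodes, Linux leaves.
# Each entry lists a system's direct children; these parent/child links are the
# only Dist Manager connections configured -- no peer links between siblings.
hierarchy = {
    "PC91": ["PC93", "PC94"],            # top node (Windows)
    "PC93": ["PC95", "PC03", "PC92"],    # middle node -> Linux leaves (assignment assumed)
    "PC94": ["PC04", "PC05", "PC06"],    # middle node -> Linux leaves (assignment assumed)
}

def direct_connections(tree):
    """The set of allowed direct connections implied by the tree (ancestral only)."""
    return {(parent, child) for parent, children in tree.items() for child in children}

# A peer link such as PC93 <-> PC94 is deliberately not part of the configuration:
assert ("PC93", "PC94") not in direct_connections(hierarchy)
```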

Viewing Alerts coming from leaf systems 1,000 “came” alerts generated on PC94 took 15 sec to be absorbed by PC91; all 4 (2) CPUs in PC91 shouldered the load. Additional alerts were then fed from PC93 to the top node – the same graceful degradation and evasive action was seen as before: PC91’s EM killed PC91’s Alert Screen. The display is again the bottleneck.

Rate supportable from 2 systems Set up a high but supportable rate of traffic (10,000 \ 1,000) on each of PC93 and PC94, feeding PC91. PC93 itself was almost saturated, but PC91 coped (~200 alerts/sec average, dual CPU). (A quick check of this figure is sketched below.)
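
The “~200 alerts/sec average” figure follows directly from the D \ N settings, assuming each DPE change provoked one alert; a quick check:

```python
def average_changes_per_sec(delay_ms, n_per_burst):
    """Average rate implied by a 'delay_ms \\ n_per_burst' traffic setting."""
    return n_per_burst / (delay_ms / 1000.0)

per_system = average_changes_per_sec(10_000, 1_000)   # 100 changes/sec from each feeding system
print(2 * per_system)                                  # PC93 + PC94 together: ~200 alerts/sec at PC91
```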

Surprise Overload (manual) Manually stop PC93 – PC91 pops up a message. Manually restart PC93 – the rush of traffic towards PC91 caused PC93 to overload; PC93’s EM killed PC93’s DistM and PC91 popped up a message.

PVSS Self-healing property PVSS self-healing algorithm –Pmon on PC93 restarts PC93’s DistM

Remarks The evasive action taken by the EM, cutting the connection, is very good: it localises problems, keeping the overall system intact. The self-healing action is very good too: automatic restart of dead managers. BUT…

Evasive action and Self-healing Manually stop PC93 – PC91 pops up a message. Manually restart PC93 – the rush of traffic towards PC91 causes PC93 to overload; PC93’s EM kills PC93’s DistM and PC91 pops up a message. Pmon then restarts PC93’s DistM – and the cycle starts again.

Self-healing Improvement To avoid the infinite loop, ETM’s Pmon eventually gives up; how soon is configurable – still not ideal! ETM are currently considering my suggestion for improvement: Pmon should issue the restart, but not immediately. (A sketch of the idea follows below.)
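
The suggestion amounts to “do restart the dead manager, but only after a delay, so it does not immediately re-trigger the same overload”. Below is a minimal Python sketch of that idea using an exponential back-off; it illustrates the proposal only and is not how ETM’s Pmon actually works (the command, delays and give-up limit are all invented for the example):

```python
import subprocess
import time

def supervise_with_backoff(cmd, first_delay_s=5, factor=2, max_delay_s=300, max_attempts=10):
    """Restart a manager each time it dies, waiting progressively longer before each
    restart so a restarted DistM does not immediately flood the still-overloaded
    Event Manager with the same rush of traffic that got it killed."""
    delay = first_delay_s
    for attempt in range(max_attempts):
        proc = subprocess.Popen(cmd)   # e.g. the DistM start command that Pmon would issue
        proc.wait()                    # returns when the manager dies (or the EM cuts it off)
        time.sleep(delay)              # back off before restarting, instead of restarting at once
        delay = min(delay * factor, max_delay_s)
    # after max_attempts the supervisor gives up, as ETM's Pmon eventually does today
```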

(Old) Alert Screen We fed back many problems with the Alert Screen during the pre-release trials – e.g. it leaves stale information on-screen when systems leave and come back.

New Alert/Event Screen in V3.0 The official release now has a completely new Alert/Event Screen which fixes most of the problems. It’s new and still has some bugs, but the ones we have seen are neither design problems nor showstoppers.

More work for ETM: when the DistM is killed by the EM taking evasive action, the only indication is in the log. But the Log Viewer, like the Alert Viewer, is heavy on CPU and shouldn’t be left running when it’s not needed.

Reconnection Behaviour No gaps in the Alert archive of the machine that isolated itself by taking evasive action. No data was lost. It takes about 20 sec for 2 newly restarted Distribution Managers to get back in contact. Existing (new-style!) alert screens are updated with the alerts of new systems that join (or re-join) the cluster.

Is the load of many Alerts reasonable? ~200 alerts/sec sustained would itself be rather worrying in a production system, yet PVSS coped – so I believe “Yes”. The response to an overload is very good, though it can still be tweaked. Data integrity is preserved throughout.

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable? 

Can you see the baby?

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable?

Is the load of many Trends reasonable? Same configuration: Trend windows were opened on PC91 displaying data from more and more systems. Mixed platform.

Is Memory Usage Reasonable?

  Action                                              RAM (MB)
  Steady state, no trends open on PC91                   593
  Open plot ctrl panel on PC91                            —
  On PC91, open a 1-channel trend window from PC03        658
  On PC91, open a 1-channel trend window from PC04        657
  On PC91, open a 1-channel trend window from PC05        657
  On PC91, open a 1-channel trend window from PC06        658
  On PC91, open a 1-channel trend window from PC07        658

Yes.

Is Memory Usage Reasonable?

  Action                                                                   RAM (MB)
  Steady state, no trends open on PC91                                        602
  On PC91, open 16 single-channel trend windows from PC95 Crate1 Board1       604
  On PC91, open 16 single-channel trend windows from PC03 Crate1 Board1       607
  On PC91, open 16 single-channel trend windows from PC04 Crate1 Board1       610

Yes.

Test 34 looked at the top node plotting data from the leaf machines’ archives. It performed excellently. The test ceased when we ran out of screen real estate to show even the iconised trends (48 of them).

Bland result? No! Did the tests go smoothly? No! –But there was good news at the end

Observed gaps in the trend!! Investigation showed the gap was genuine – Remote Desktop start-up caused CPU load, and no data changes were generated during that time. Zzzzzzz

Proof with a Scattered Generator Steady traffic generation; no gaps in the recorded archive – even when deliberately soaking up CPU. Gaps were seen in the display – we need a “Trend Refresh” button (ETM). (Diagram: a scattered UI on PC93 generates the traffic; the EM and a trend UI run on PC94.) Zzzzzzz

Would sustained overload give trend problems? High traffic (400 ms delay \ 1000 changes) on PC93, as a scattered member of PC94’s system. PC94’s own trend plot could not keep up; PC91’s trend plot could not keep up. “Not keep up” means… Zzzzzzz

“Display can’t keep up” means… (Diagram: a time axis running up to “now”; the trend screen’s values have only been updated to a point some way behind it.) Zzzzzzz

Evasive action (Diagram: at a point on the time axis the EM took evasive action and disconnected the traffic generator; the trend screen’s values finally caught up to that point. The last 65 sec of changes were still queued in the traffic generator and were lost when it suicided.) Zzzzzzz

Summary of Multiple Trending PVSS can cope. PVSS is very resilient to overload. Successful tests. Wakey!

Test 31: DP change rates Measured saturation rates on different platform configurations. No surprises: faster machines with more memory are better, and Linux is better than Windows. Numbers on the Web.

Test 32: DP changes with alerts Measured saturation rates; no surprises again. A dual CPU can help with the processing when there are a lot of alert screen (user interface) updates.

What were the potential showstoppers? Basic functionality –Synchronised types in V2! Sheer number of systems –Can the implementation cope? Alert Avalanches –How does PVSS degrade? Is load of many Alerts reasonable? Is load of many Trends reasonable ? Conclusions

No showstoppers. We have seen nothing to suggest that PVSS cannot be used to build a very big system.

Further work - I Further “informational” tests will be conducted to assist in making configuration recommendations, e.g. understanding the configurability of the message queuing and evasive action mechanism. Follow up issues such as “the AES needed more CPU when scattered”. Traffic overload from a SIM driver rather than a UI. Collaborate with Peter C. to perform network overload tests.

Further work – II Request a Use Case from the experiments for a non-stressed configuration: – realistic sustained alert rates – realistic peak alert rate + realistic duration (i.e. not a sustained avalanche) – how many users connected to the control room machine? – % viewing alerts; % viewing trends; % viewing numbers (e.g. CAEN voltages) – Terminal Server UI connections – how many UIs can the control room cope with? What recommendations do you want?

In greater detail… The numbers behind these slides will soon be available on the Web at …jects/ScalingUpPVSS/welcome.html Any questions?

Can you see the baby?

Example Numbers

  Machine   OS      CPU        Traffic rate (D \ N)
  PC92      Linux   2.2 x …    … \ 1000
  PC93      W…      …          … \ 500
  PC94      WXP     …          … \ 1000
  PC95      Linux   …          … \ 1000
  PC03      Linux   …          … \ 1000

Table showing the traffic rates on different machine configurations that gave rise to 70% CPU usage on those machines. See the Web links for the original table and details on how to interpret the figures.