December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO.



Aim of the Scaling Up Project Investigate functionality and performance of large PVSS systems. In Phase 1 we reassured ourselves that PVSS scales to support large systems. Provided detail rather than bland reassurances.

Phase 2: WYSIWYAF Began with a questionnaire to you to establish your concerns. Eclectic list of “hot topics of the moment”: –Oracle Archiving –Alerts –Regular reconfiguration of channels (alerts and setpoints) –Backup and restore –Configuring all channels at startup

Your requests (cont.) –OPC performance –Local DB cache –Central Panel Repository –Windows/Linux lurking limits –System startup time (DPT distribution) –Task Allocation

Menu From these requests, we initially picked out four for investigation: –Task Allocation –Backup of a running system –Alerts –Panel Repository

Task Allocation Recall that PVSS is manager based and any manager can be scattered to another machine (not just UIs). [Figure: PVSS managers: Event Manager (EV), Database Manager (DB), Control Manager (CTRL), API Manager (API), Drivers (D), User Interface (UI) Editor and UI Runtime]

Task Allocation More than 20 different tests were conducted to investigate the effect of moving managers around. Results have been available on the web for some time (URLs at the end). Results were surprising and went against our (& ETM’s!) assumptions of what would be “better”…

What we measured… A task allocation was deemed “better” if it supported a higher number of datapoint changes per second (“throughput”) than a system running entirely on a single processor. We observed the number of changes per second that the system could support before one of the following became overloaded : –CPU usage –Memory usage –Network traffic –Disk traffic
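The measurement rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the test harness used in the project: the `Sample` class, the function names and the 90% limits are all hypothetical, and the figures you feed in would have to come from your own monitoring.

```python
from dataclasses import dataclass

# Hypothetical saturation limits in percent; NOT values from the SUP tests.
LIMITS = {"cpu": 90.0, "memory": 90.0, "network": 90.0, "disk": 90.0}


@dataclass
class Sample:
    """One measurement: offered rate and the resource usage observed at it."""
    changes_per_sec: int
    usage: dict  # resource name -> percent utilisation


def overloaded_resources(sample):
    """Return the sorted names of resources at or above their limit."""
    return sorted(r for r, pct in sample.usage.items() if pct >= LIMITS[r])


def max_supported_throughput(samples):
    """Return (highest rate reached before any overload, resources that overloaded)."""
    best = 0
    for s in sorted(samples, key=lambda s: s.changes_per_sec):
        culprits = overloaded_resources(s)
        if culprits:
            return best, culprits
        best = s.changes_per_sec
    return best, []
```

Run the same function over the samples from two task allocations; the allocation with the higher returned rate is the “better” one in the sense used here, and the second return value tells you which resource gave out first.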

What we saw… As throughput increases on a typical PVSS system, the machine first becomes CPU bound. The Event Manager (EM) is the task most in need of CPU. We expected that scattering the EM away from the Data Manager (DM) would cause slow-down because of the high traffic between these tasks. WRONG!

Scattering the EM Despite the overhead of sending EM ↔ DM traffic over the external network, scattering the EM increased throughput significantly (+75%).

AES The Alert-Event Screen (AES) is CPU-hungry. Runs in a UI task which can be scattered. Beware: Each additional AES not only increases the load on its own machine, but also increases the load on the EM to which it is connected.

Recommendation Execute as few AESs as possible outside the main control room. When you are not actually looking at the AES, leave it in “stopped” mode. (Screen is not updated.)

Scattering other managers Can improve throughput, but not as spectacularly as when scattering the EM. Moving the DM is useful, but more delicate (i.e. many Value Archive (VA) connections?)

Absolute Performance The average number of “changes per second” that can be supported depends on the nature of the traffic. A steady data flow is easier to cope with. Irregular bursts of rapid traffic tend to overflow the queues between the managers. (Queue lengths are configurable.)
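A toy simulation shows why the same average rate can be harmless when steady and harmful when bursty. This is a sketch under simplifying assumptions (one bounded queue, a fixed service rate per tick, messages beyond the limit are lost); the function name and all numbers are made up for illustration and say nothing about how PVSS queues are actually implemented.

```python
def dropped_messages(arrivals, service_rate, queue_limit):
    """Simulate a bounded queue tick by tick and return how many messages overflow.

    arrivals     -- messages arriving in each tick
    service_rate -- messages the consumer can process per tick
    queue_limit  -- maximum number of messages the queue can hold
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        queued += arriving
        if queued > queue_limit:
            # Anything beyond the limit overflows and is lost.
            dropped += queued - queue_limit
            queued = queue_limit
        # The consumer drains up to service_rate messages this tick.
        queued = max(0, queued - service_rate)
    return dropped


steady = [100] * 10        # 1000 messages spread evenly over 10 ticks
bursty = [1000] + [0] * 9  # the same 1000 messages in a single burst

print(dropped_messages(steady, service_rate=100, queue_limit=500))
print(dropped_messages(bursty, service_rate=100, queue_limit=500))
print(dropped_messages(bursty, service_rate=100, queue_limit=1000))
```

Both traffic patterns average 100 messages per tick, but only the burst overflows the shorter queue. Lengthening the queue absorbs the burst in this toy model, which is the point of the queue lengths being configurable.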

Load Management PVSS implements several Load Management schemes, e.g. –Alert screen update pauses during a brief avalanche –Alert screen switches into Stopped mode if the sustained number of alerts arriving is crazy

Load Management - II Load Shedding, where EM will cut the umbilical to rogue managers rather than be brought down itself. I recommend that shift operators be taught to recognise the symptoms when they occur

Multiple CPUs An alternative to scattering: Buy a dual processor! 2 CPUs are generally enough to satisfy even the hungry Event Manager Our dual-CPUs became disk bound when we pushed them. ---Tribute to the well balanced design of modern PCs!

RAM Look at how much memory you are using. Buy enough of it. If you are worried about performance, paging is wasted effort!!

Task summary Give plenty of CPU capacity to the EM by: –Buying a fast machine –Scattering the EM –Buying a dual CPU machine

Menu –Task Allocation –Backup of a running system –Alerts –Panel Repository

Backup In the development systems nobody did backup. PVSS backup is somewhat intricate. Need for a set of recipes of backup instructions

18-page Report –What needs backing up –What this means in PVSS –How to back it up –How to restore (rather important!) Handout

Four Parts 1) Executive Summary 2) Recipes 3) Detailed Background Description 4) Frequently Asked Questions about Backup. (I’m not going to go through them, just let you know that they exist.)

Menu –Task Allocation –Backup of a running system –Alerts –Panel Repository

Alerts PVSS 3.5 (due in 200x) will contain new functionality for summary alerts and alert provocation during ramping. I did not do in-depth performance measurements on the existing system, beyond those I described to you in Phase 1 of S.U.P.

At the request of one experiment though, we did investigate “What is the load of an alert definition on a PVSS system?” Results on the web (Test 38). 

Loads of Alert Definitions We showed that it is safe to declare any number of alerts and even to activate them provided that the data values stay in range. It is provocation of the warnings and alerts that incurs a significant CPU load.

Memory load Test 39 looked at memory usage of Alerts. Requirement of 2.5KB per DPE alert.
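For sizing, the 2.5KB figure can be turned into a quick estimate. The sketch below assumes that the cost scales linearly with the number of DPE alerts and that 1 KB means 1024 bytes; both are assumptions of this example rather than statements from Test 39, and the function name is made up.

```python
KB_PER_DPE_ALERT = 2.5  # figure quoted from Test 39


def alert_memory_mb(n_alerts):
    """Estimate the memory (in MB) needed for n_alerts DPE alerts.

    Assumes linear scaling and 1 MB = 1024 KB (assumptions of this sketch).
    """
    return n_alerts * KB_PER_DPE_ALERT / 1024


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} alerts: about {alert_memory_mb(n):.0f} MB")
```

On these assumptions 100,000 alerts need roughly 244 MB, so the alert count is worth including when you decide how much RAM to buy.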

Menu –Task Allocation –Backup of a running system –Alerts –Panel Repository

Panel Repository Owing to staffing changes in the section, it was not possible to address this topic 

On the subject of panels… During the tests I would have found it helpful to have a ready display of the interconnection status of the distributed systems. I recommend that there is something showing this on the top- level display panel. (Even just a grid of red/green pixels showing connection status.) Lost connections should raise an alert.
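As a rough illustration of the kind of display meant here, the Python sketch below renders a grid with one cell per remote system and lists the connections that should raise an alert. It is not PVSS panel code: the function names and system names are hypothetical, and where the connection states come from is left open.

```python
def render_grid(status, columns=4):
    """Render connection states as rows of 'G' (connected) and 'R' (lost).

    status maps a system name to True (connected) or False (lost);
    cells are ordered by system name.
    """
    cells = ["G" if ok else "R" for _, ok in sorted(status.items())]
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    return "\n".join(" ".join(row) for row in rows)


def lost_connections(status):
    """Return the sorted names of systems whose connection is lost."""
    return sorted(name for name, ok in status.items() if not ok)


# Hypothetical system names, for illustration only.
status = {"sys1": True, "sys2": False, "sys3": True, "sys4": True, "sys5": False}
print(render_grid(status, columns=3))
for name in lost_connections(status):
    print(f"ALERT: connection to {name} lost")
```

Even a display this crude answers the question “is everything still connected?” at a glance, which is what was missing during the tests.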

Other questions During the tests, I was approached by different experiments with other issues! We agreed to investigate the following…

PVSS Disturbance Together with ALICE we looked at the effect of heavy external (unrelated) network traffic on PVSS. Results written up as Tests 28 & 29. Use 100Mbit with switches, not hubs. Conclusion was that external traffic is not a problem.

Traffic Pattern For ATLAS we compared the CPU load demanded by: –Changing 1 item N times vs –Changing N items once each. Result: the same.

Long Term Test (LTT) With CMS’ machines (for the use of which we are very grateful!) we ran a long term test: –Generated random data –Recorded it and displayed it continuously on a trend –Distributed system Results 

LTT Results The electricity supply at CERN is unreliable. You really do need a UPS. The CERN campus AFS servers are relatively unreliable and should never be used in a production system! The CERN network infrastructure is very reliable, but can break.

Network Problem One network break revealed that the CERN default Linux O/S settings actually prevent PVSS’s automatic recovery feature from accomplishing its goal. Caching problem. Written up in 2 pages of background, symptoms, explanation, how to fix it if it does happen to you and how to avoid it happening in the first place.

“Side Effects” of SUP Project Accumulated a large body of practical experience wrestling with PVSS. Systematically recorded for your benefit. Where? 

FAQs FAQ pages on Not restricted to today’s frequent questions but ones that we foresee will become frequent in the near future, e.g. –My disk is nearly full! What can I do? –My archive file is corrupt. What can I do? Please spread the word, tell your friends…

FAQ Categories Framework PVSS - Installation PVSS - Project Creation PVSS - Alerts (Alarms) PVSS - Import/Export PVSS - Archiving PVSS - Access Control PVSS - Backup-Restore PVSS - Cross Platform PVSS - Distributed Systems PVSS - Drivers PVSS - Excel Report PVSS - Folklore PVSS - Graphics PVSS - Linux specific PVSS - Messages PVSS - Miscellaneous PVSS - Printing PVSS - Programming PVSS - Production Systems PVSS - Run-time problems PVSS - Scattered Systems General Support Issues

Folklore What the FAQs don’t really address is the folklore that is built up in a close-knit team. Often this information is unknown (or inaccessible) to outsiders.

Folklore Enter the Wiki… –Web pages editable from inside a browser. –Controls Wiki. –Only CERN users can add (or change existing) content. –Readable worldwide. (Is already used as a reference by non-HEP organisations!) Folklore often embodies recommended ways of doing things. Do read it, and keep reading it… …and edit it. It belongs to you!

Example Recommendations in the Folklore Assume one PVSS system per machine (Service restriction in Windows). Place EM/DM on a different CPU to OPC client/servers (Protect EM against CPU overload from OPC; Freedom to move EM to Linux). In a Summary (Group) alert, use a CHAR type (not a STRING type) DPE upon which to hang the summary alert. It's more efficient.

Support Issues Final Remark: SUP has generated a fair number of support issues that have been followed up with ETM. “Bugs you didn’t know you nearly had”. Significant contribution to the robustness of the PVSS systems.

Summary I do not claim to have answered all questions about building large systems. –New questions come up frequently anyway. We have shown that PVSS will scale to build large systems. We have investigated the “hot topics of the moment” as defined by you.

To read a summary of the salient points of the most recent tests, including a discussion of the observed “Emergent Behaviour” in large systems, see my ICALEPCS paper, “Scaling Up PVSS”. We are now bringing this project to a close. Thank you! Any (more) questions?

Reference Links Scaling Up Home Page: elcome.html IT-CO-BE FAQs: FAQ/ (T)Wiki: VSSFolkLore#PVSS_Folklore ICALEPCS paper “Scaling Up PVSS”: