Wojciech Sliwinski for BE-CO group Special thanks to: E.Hatziangeli, K.Sigerud, P.Charrue, V.Baggiolini, M.Sobczak, M.Arruat, F.Ehm LHC Beam Commissioning.

Slides:



Advertisements
Similar presentations
BE/CO Changes in LS1 to the Software Development Infrastructure and Widely Used Libraries Chris Roderick, Greg Kruk, Katarina Sigerud, Luigi Gallerani,
Advertisements

Controls Configuration Service Overview GSI Antonio on behalf of the Controls Configuration team Beams Department Controls Group Data & Applications.
Supervision of Production Computers in ALICE Peter Chochula for the ALICE DCS team.
BE-CO work for the TS Nov 8 Nov 11P.Charrue - BE/CO - LBOC1.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 2: Managing Hardware Devices.
From Entrepreneurial to Enterprise IT Grows Up Nate Baxley – ATLAS Rami Dass – ATLAS
Accelerator Complex Controls Renovation, LHC Excluded Purpose and Scope M.Vanden Eynden on behalf of the AB/CO Group.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 12: Managing and Implementing Backups and Disaster Recovery.
controls Middleware – OVERVIEW & architecture 26th June 2013
E. Hatziangeli – LHC Beam Commissioning meeting - 17th March 2009.
A. Dworak BE-CO-IN, CERN. Agenda 228th June 2012  Sum up of the previous report  Middleware prototyping  Transport  Serialization  Design concepts.
Pierre Charrue – BE/CO.  Preamble  The LHC Controls Infrastructure  External Dependencies  Redundancies  Control Room Power Loss  Conclusion 6 March.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 2: Managing Hardware Devices.
W. Sliwinski – eLTC – 7March08 1 LSA & Safety – Integration of RBAC and MCS in the LHC control system.
Module 7: Fundamentals of Administering Windows Server 2008.
Microsoft FrontPage 2003 Illustrated Complete Finalizing a Web Site.
Controls Issues Injection beam2 test meeting 28 th Aug 2008 Eugenia Hatziangeli Input from J. Lewis, M. Sobzak, JJ Gras, C. Roderick, M.Pace, N. Stapley,
ACET Accelerator Controls Exploitation Tools Progress and plans, December 2012.
Operational tools Laurette Ponce BE-OP 1. 2 Powering tests and Safety 23 July 2009  After the 19 th September, a re-enforcement of access control during.
What’s New in WatchGuard XCS v9.1 Update 1. WatchGuard XCS v9.1 Update 1  Enhancements that improve ease of use New Dashboard items  Mail Summary >
Chapter 10 Chapter 10: Managing the Distributed File System, Disk Quotas, and Software Installation.
Wojciech Sliwinski for the BE-CO Middleware team: Wojciech Buczak, Joel Lauener Radoslaw Orecki, Ilia Yastrebov, Vitaliy Rapp (GSI)
T HE BE/CO T ESTBED AND ITS USE FOR TIMING AND SOFTWARE VALIDATION 22 June BE-CO-HT Jean-Claude BAU.
LHC BLM Software revue June BLM Software components Handled by BI Software section –Expert GUIs  Not discussed today –Real-Time software  Topic.
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.
Wojciech Sliwinski BE/CO for the RBAC team 25/04/2013.
VMware vSphere Configuration and Management v6
The DIAMON Project Monitoring and Diagnostics for the CERN Controls Infrastructure Pierre Charrue, Mark Buttner, Joel Lauener, Katarina Sigerud, Maciej.
K.Hanke – PS/SPS Days – 19/01/06 K.Hanke - PS/SPS Days 19/01/06 Recommissioning Linac2/PSB/ISOLDE from CCC  remote operation from CCC  upgrades & changes.
RBAC Content: LHC Operational Mode Piquet Roles RBAC Strict LHC Operational mode and CMW Acknowledgements: Pierre C., Wojtek S., Stephen P., Lars J., Verena.
Report from the WLCG Operations and Tools TEG Maria Girone / CERN & Jeff Templon / NIKHEF WLCG Workshop, 19 th May 2012.
BE-CO-DO - Development tools (Eclipse, CBNG, Artifactory, …) - Atlassian (Jira, Wikis, Bamboo, Crucible), CO Testbed - DIAMON/LASER - JMS (Java messaging.
Architectural issues M.Jonker. Things to do MD was a success. Basic architecture is satisfactory. This is not the end: Understanding of actual technical.
Oracle for Physics Services and Support Levels Maria Girone, IT-ADC 24 January 2005.
Strategy to achieve smooth upgrades during operations Vito Baggiolini BE/CO 1.
Post ACCOR until LS2: End of Life for CMW products CO3 meeting, 25th June 2015 Wojciech Sliwinski for the BE-CO Middleware team.
LS1 Review P.Charrue. Audio/Video infrastructure LS1 saw the replacement of BI and RF analog to digital video transport Was organised in close collaboration.
Feedbacks from EN/STI A. Masi On behalf of EN-STI Mathieu Donze` Odd Oyvind Andreassen Adriaan Rijllart Paul Peronnard Salvatore Danzeca Mario Di Castro.
Technical Stop feed-down P.Charrue on behalf of the BE Controls Group 5th September 2011P.Charrue - 8h30 meeting1.
Issues concerning Device Access (JAPC / CMW / FESA) With input from: A.Butterworth, E.Carlier, A. Guerrero, JJ. Gras, St. Page, S. Deghaye, R. Gorbonosov,
FESA S. Deghaye for the FESA team BE/CO. What happened since April? followed by “Our plans”
AB/CO Review, Interlock team, 20 th September Interlock team – the AB/CO point of view M.Zerlauth, R.Harrison Powering Interlocks A common task.
DIAMON Project Project Definition and Specifications Based on input from the AB/CO Section leaders.
V. Kain – eLTC – 7March08 1 V.Kain, S. Gysin, G. Kruk, M. Lamont, J. Netzel, A. Rey, W. Sliwinski, M. Sobczak, J. Wenninger LSA & Safety - RBAC, MCS Roled.
Conclusions on UPS powering test and procedure I. Romera Acknowledgements: V. Chareyre, M. Zerlauth 86 th MPP meeting –
LS1 – View from Applications BE-CO LS1 review – 1 December 2015 Greg Kruk on behalf of the Applications section.
Linac2 and Linac3 D. Küchler for the linac team. Planning first preparative meeting for the start-up of Linac2 in June 2013 –this early kick-off useful.
Marine Pace Technical Committee -12 Dec DRY RUNS COMMISSIONING & EARLY BEAM OPERATION STABLE OPERATION.
JIRA in BE-CO for Exploitation Marine BI Seminar 20 November
E. Hatziangeli – LHC Beam Commissioning meeting - 3 rd March 2009.
BE-CO work for the TS Outcome of the actions 23 – 28 Apr May 12P.Charrue - BE/CO - LBOC1.
Feedback on Controls from 2015 Operation Marine Pace, on behalf of BE-CO. Evian Workshop Dec 2015 Marine Pace, BE-CO -Evian Workshop 2015.
MPE Workshop 14/12/2010 Post Mortem Project Status and Plans Arkadiusz Gorzawski (on behalf of the PMA team)
AB-CO Exploitation 2006 & Beyond Presented at AB/CO Review 20Sept05 C.H.Sicard (based on the work of Exploitation WG)
H2LC The Hitchhiker's guide to LSA Core Rule #1 Don’t panic.
LS1 Review BE-CO-SRC Section Contributions from: A.Radeva, J.C Bau, J.Betz, S.Deghaye, A.Dworak, F.Hoguin, S.Jensen, I.Koszar, J.Lauener, F.Locci, W.Sliwinski,
Operations Coordination Team Maria Girone, CERN IT-ES GDB, 11 July 2012.
Introduction to CAST Technical Support
Technical Services: Unavailability Root Causes, Strategy and Limitations Data and presentation in collaboration with Ronan LEDRU and Luigi SERIO.
Introduction to RBAC Wojciech Sliwinski BE/CO for the CMW/RBAC team
Status and Plans for InCA
Database Services at CERN Status Update
Injectors BLM system: PS Ring installation at EYETS
Computing infrastructure for accelerator controls and security-related aspects BE/CO Day – 22.June.2010 The first part of this talk gives an overview of.
Middleware – ls1 progress and planning BE-CO Tc, 30th september 2013
Michael Mast Senior Architect
4 different solutions used in BI
Introduction to CAST Technical Support
CMW infrastructure Status report
Presentation transcript:

Wojciech Sliwinski for BE-CO group Special thanks to: E.Hatziangeli, K.Sigerud, P.Charrue, V.Baggiolini, M.Sobczak, M.Arruat, F.Ehm LHC Beam Commissioning workshop, Evian January 2010

 During 2009 the LHC control system became operational through several dry runs and injection tests and contributed to the successful LHC start-up  Today is a good opportunity to assess the quality and performance of the various components of the controls infrastructure both software and hardware  The presentation will give a summary of the controls issues experienced during the LHC start-up and will outline applied and planned actions needed to resolve them LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski2

 Infrastructure – disk space, consoles, …  CMW – proxies, subscriptions, …  FESA – front-end instabilities  RBAC – introduction of STRICT policy  JMS – data publishing  BE–CO Development Process LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski3

 Problem description  Data is duplicated three times: source, redundant, rsync copy  Reaching the physical limits of the CCR (space, cooling, electricity, budget)  The tape backups take too much time and for some servers do not finish during the night!  Planned action  CO-IN analyses the disk space usage and will propose a long-term solution (February 2010) based on new storage technology from HP LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski4  Massive increase of the total amount of controls operational data  2005 => ~400 GB, 2010 => 4Tb  The last 2Tb were filled in only 2 months by users and application data

 High load on consoles  Problem: Consoles became slow and not responsive, even with only few apps. running  Performed actions ▪ Introduced visual indicator in CCM ▪ Identified an issue (Vistar zombie processes) and fixed it promptly before the start-up ▪ Memory of all LHC consoles was doubled (4GB) ▪ EIC consoles were upgraded to the latest HW configuration Since then … green indicators and …. happy people  Java cache for applications on consoles  Problem: Applications started via CCM may use older jar versions from the local cache  Performed action: Provided context menu to completely clean the local Java cache  Planned action: Validate with the latest Java version installed in January 2010 LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski5

 Problem description  Under high load Proxy did not respond promptly to a calling client  Client used to interpret it as Proxy being down or unreachable or timeout  Observed particulary when setting up new subscriptions ▪ E.g. in Fixed Displays applications (e.g. LHC Page1)  Performed actions  Performed several iterations to solve the problem (with OP & eqp. groups)  Improved significantly level of concurrent processing in Proxy  Increased the client timeout for RDA calls  Added dedicated diagnostics to monitor current client connections  Validated new Proxy during the dry-run of 13th Jan. on selected BI front-ends (BTV, BCT)  Planned actions  Provide statistics of traffic in order to detect early the overloaded Proxies  Integration of statistics/diagnostics info with the DIAMON facilities LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski6

 Problem description  It was not possible to mix direct & Proxy connections from the same client application  Proxy redirection was "all or nothing" ▪ Either all via the same Proxy or all connect directly  Performed actions  New centralized architecture for Proxy resolution ▪ New CMW-RDA (C & Java 2.8.5) released on  Now possible to mix direct & Proxy connections from the same client application  Redirection granularity set on front-end server level  Proxy configuration is database driven (Configuration DB) ▪ If proxy defined for a given front-end server, by default all client connections go via this Proxy  Experts can force old behavior via system properties  Planned actions  New expert feature: disable Proxy handling for selected front-end servers directly from the expert application (requested by BI)  Contact eqp. groups and configure Proxy redirection for their front-end servers LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski7

 Problem description  Disconnected subscriptions caused loss of important data  Observed in several critical systems: BLM concentrators, BCT, RF  Problem was also impacted by some faulty network switches  Performed actions  Significant troubleshooting and investigation to locate the cause of the problem  Identified faulty network switches were replaced  Identified CMW timeout was too short for the front-end servers to process the notifications ▪ Increased CMW timeout to 4 seconds to give more time to server to push the notifications ▪ Reduced the subscription disconnections & drastically improved the situation  Identified the configuration parameters (CORBA) which were impacting the performance ▪ New set of CORBA configuration parameters were deployed on the front-end side  The major clients (BLM,BCT,...) highly appreciated the efforts & final outcome LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski8

 Problem description  Front-end server was occasionally crashing in several situations ▪ Many clients were directly subscribed to a front-end publishing data with a high rate ▪ At the front-end start-up time when many clients were re-establishing the subscriptions  Observed several times for RF FESA servers  Performed actions  FESA team investigated the issue and managed to locate the cause of the problem  The bug was fixed and released as a patch to FESA 2.10 on  The new fixed FESA version was validated together with the RF team ▪ RF front-end server was stressed with many clients connected and publishing rate 10 times higher than in operational mode ▪ Front-end was restarted to test the behavior at the start-up time  Stability of the FESA servers was improved LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski9

 From March 2009 ongoing campaign to introduce STRICT policy in LHC  Several dry runs and tests in collaboration with OP & equipment groups  End July 2009: default RBAC policy for LHC => STRICT ▪ RBAC token mandatory ▪ Protection of SET operations on properties  Main goal for the start-up => STRICT mode everywhere  All LHC applications and majority of equipment became RBAC enabled  Already in STRICT: PO, BI (BLM, BPM, BCT, BTV, …), BT, Collimators, BIS, QPS, most of RF  TODO: Rest of RF (~55 FECs), BI(2 FECs), PVSS-Cryo (~69 FECs), PVSS-Survey (10 FECs), IPOC (8 FECs) LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski10 moreover …  New RBAC piquet ROLE with Expiry Time  Access maps dependent on the Operational Mode  New Central CMW Configuration server  Distribution of the Operational Mode and RBAC policy  NEW! Proxy extension: RBAC token for get/set comes from a client but for monitor Proxy token is used

 Problem description  Overloaded brokers stopped publishing the data ▪ It was seen by BLM and Logging  Both brokers CO (TN) & Public (GPN) were on the same machine  Too many client connections => high CPU & memory usage  Machine (hardware) was not able to process all requests in a timely manner  Performed actions  Physical separation => CO broker moved to a new 16 core machine, Public broker stayed on old machine  Brokers are constantly monitored for performance  Possible improvements  Need to evaluate different broker configurations (Load Balancing & Hot Backup)  Usage of dedicated brokers for „special” clients e.g. for Concentrators LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski11

 Identified issues  Poorly coordinated and communicated upgrades  Upgrades break existing controls software (side-effects, bugs)  We cannot freeze the controls system and stop all our developments  Operations request new /enhanced functionality  Bug fixes  Better development and deployment procedures  Coordinated development, testing and releases for the controls system core (FESA, CMW, RBAC, JAPC, LSA)  Changes must be motivated, prioritized and traced in JIRA ▪ Cooperation of the clients (OP, eqp. specialists) is essential for this  Testbed to validate Control System core  Rules and tools to manage dependencies and guarantee backward compatibility  To be presented soon in the BE-CO Technical Committee LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski12

 Overall, the Controls system was stable and performed reliably during the LHC start-up and commissioning  During operations certain shortcomings were identified in the Controls infrastructure which were thoroughly followed- up and solutions were rapidly applied  In 2010 the emphasis will be on:  Consolidate the communications infrastructure to withstand reliably the load  Push for the remaining equipment to go to the RBAC STRICT policy  Develop and deploy a solid development and release process across the whole Controls infrastructure LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski13

LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski14

 Problem description  State of a host displayed as correct (green) but sometimes one of the servers (running on that host) is in an error state (red)  Planned actions  Known bug: after searching for a host, host state does not change even if one of the server processes goes down (host status should become also red) LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski15

 Problem description  Flooding of the Alarm screen with hundreads of alarms (many not relevant to Operations)  OP are not looking to the Alarm screen when there is a problem  Planned Actions  A new LASER admin tool to aid OP to manage the configuration of alarm categories (PC alarms) is under development by CO for 2010  Clean-up unnecessary alarms LHC Beam Commissioning workshop, Evian 20/01/ W.Sliwinski16