SLAC High Availability Electronics Architectures & Standards for ILC The Case for Five 9’s Ray Larsen SLAC ILC Program.

Presentation transcript:

SLAC High Availability Electronics Architectures & Standards for ILC The Case for Five 9’s Ray Larsen SLAC ILC Program

August 18, 2005, HA Electronics, Snowmass 2005, R. S. Larsen

Outline
• I. High Availability Electronics Design
• II. New Industry Open Architecture
• III. Application to ILC
• IV. Need for Standards
• V. Recommendations
• VI. Conclusions

I. HA Design Motivation
• Overall Machine Availability
  - ILC High Availability design is critical due to the unprecedented size, complexity and high cost of an idle machine. (Opportunity cost of ILC ~$100K/hr.)
  - The T. Himel Availability Collaboration¹ study has strongly demonstrated that ILC cannot be built in the same manner as old machines or it will never work.
  - The goal of this presentation is to show the feasibility of subsystems with A approaching the ideal A=1.
  - Basic design tenets can be applied to all machine subsystems.
¹ T. Himel & Collaboration Availability studies group

High Availability Primer
• Availability A = MTBF/(MTBF+MTTR)
  - MTBF = Mean Time Between Failures
  - MTTR = Mean Time To Repair
  - If MTBF approaches infinity, A approaches 1
  - If MTTR approaches zero, A approaches 1
  - Both are impossible on a unit basis
  - Both are possible on a system basis
• Key features for HA, i.e. A approaching 1:
  - Modular design
  - Built-in 1/n redundancy
  - Hot-swap capable at subsystem unit or subunit level
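The availability formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function names and the example MTBF/MTTR values are assumptions chosen only to show how shrinking MTTR moves A toward 1.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(a: float) -> float:
    """Express availability as a count of 9's, e.g. 0.999 is about 3."""
    if a >= 1.0:
        return math.inf
    return -math.log10(1.0 - a)


# Hypothetical unit with a 10,000 hour MTBF (illustrative value only).
a_slow_repair = availability(10_000, 8.0)   # one shift to repair
a_hot_swap = availability(10_000, 0.1)      # a few minutes to swap

print(f"8 h repair:  A = {a_slow_repair:.6f} ({nines(a_slow_repair):.1f} nines)")
print(f"6 min swap:  A = {a_hot_swap:.6f} ({nines(a_hot_swap):.1f} nines)")
```

With the same MTBF, cutting the repair time from a shift to a few minutes gains roughly two 9's, which is the argument for hot-swap capability.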

Historic HA Examples
• SLAC Modulators & Klystrons
  - 2 sectors out of 30 (16 out of 240 stations), about 6.7%, were “hot spares” in the original design.
  - Beam energy was held constant by hot-swapping in a standby station on a pulse-by-pulse basis to replace a station that tripped off on either a klystron or modulator fault.
  - The M-K system operated very close to A=1.
  - At absolute maximum energy, where all stations are used and no hot spares are available, A dropped with every station trip.
• NIM-CAMAC-FASTBUS-VME-VXI
  - Quickly replaceable instrument modules minimize MTTR to just repair access time plus a few minutes to swap.
• Whole RF stations are hot-swap capable; current standard module designs are not.
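The effect of the 16 hot-spare stations can be sketched with a simple k-out-of-n calculation. This is an illustration under stated assumptions, not a model from the talk: stations are treated as independent and identical, and the per-station availability of 0.97 is a hypothetical value, not a SLAC measurement.

```python
from math import comb


def k_of_n_availability(n: int, k: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    p = unit_availability
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


STATIONS = 240      # from the slide
HOT_SPARES = 16     # from the slide
P_STATION = 0.97    # hypothetical per-station availability (assumption)

# Design energy needs 224 stations; 16 are standing by.
with_spares = k_of_n_availability(STATIONS, STATIONS - HOT_SPARES, P_STATION)

# Absolute maximum energy needs all 240 stations; any trip costs beam.
no_spares = k_of_n_availability(STATIONS, STATIONS, P_STATION)

print(f"with 16 hot spares: {with_spares:.4f}")
print(f"with no spares:     {no_spares:.6f}")
```

Under these assumptions the system with spares is up well over 99% of the time, while the same hardware run with no spares is almost never fully up, which matches the qualitative behavior described on the slide.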

NLC Flashback
• NLC proposed an overall target of A=0.85 (ZDR 1996)
• ZDR defined 16 machine segments (systems) for both linacs
  - Assume each has 10 subsystems, or 160 total
• Example subsystem: DR magnet power supplies
• Asub = (0.85)**(1/160) ≈ 0.999 (~3-9’s), i.e. 8.9 h/yr of downtime
  - Each subsystem on average is allowed a total downtime of ~1 shift/year.
  - If a subsystem has 100 units, each unit needs 5-9’s, implying a 100K hour MTBF (for an MTTR of about 1 hour).
• Feasible for very small units but not for multi-kW units.
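The arithmetic on this slide can be reproduced with a short script. It assumes that all elements are in series (any one down stops the machine), that each element gets an equal share of the budget, and that a year has 8760 operating hours. The 1 hour MTTR is an assumption inferred from the slide's 100K hour figure; the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def per_element_availability(a_total: float, n_elements: int) -> float:
    """Equal availability each of n series elements needs to reach a_total."""
    return a_total ** (1.0 / n_elements)


def downtime_hours_per_year(a: float) -> float:
    """Expected downtime per year for availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


def required_mtbf_hours(a: float, mttr_hours: float) -> float:
    """Solve A = MTBF / (MTBF + MTTR) for MTBF."""
    return mttr_hours * a / (1.0 - a)


a_sub = per_element_availability(0.85, 160)    # 16 systems x 10 subsystems
a_unit = per_element_availability(a_sub, 100)  # 100 units per subsystem
mtbf_unit = required_mtbf_hours(a_unit, mttr_hours=1.0)  # MTTR is an assumption

print(f"A_sub  = {a_sub:.6f}, downtime = {downtime_hours_per_year(a_sub):.1f} h/yr")
print(f"A_unit = {a_unit:.8f}, required MTBF = {mtbf_unit:,.0f} h")
```

The subsystem downtime comes out at about 8.9 h/yr and the per-unit MTBF at a little under 100,000 hours, consistent with the rounded values on the slide.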

How High Availability?
• As high as possible! An idle machine costs $100K/hr.
• A=0.85 gives away ~$130M/yr.
• Consider a goal of 99% up-time, Atotal=0.99
• If electronics were the only issue, is this remotely thinkable?
• How well would various subsystems have to perform?
• With other systems such as water, power and controls limiting us, is it reasonable to aim so high?
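The ~$130M/yr figure follows directly from the $100K/hr idle cost. The sketch below assumes continuous scheduled operation (8760 hours per year), which is the assumption that reproduces the slide's number; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760
IDLE_COST_PER_HOUR = 100_000  # USD per hour, figure quoted in the talk


def annual_idle_cost(a_total: float) -> float:
    """Yearly opportunity cost (USD) of downtime at overall availability a_total."""
    return (1.0 - a_total) * HOURS_PER_YEAR * IDLE_COST_PER_HOUR


cost_at_085 = annual_idle_cost(0.85)
cost_at_099 = annual_idle_cost(0.99)

print(f"A = 0.85: ${cost_at_085 / 1e6:.1f}M per year")
print(f"A = 0.99: ${cost_at_099 / 1e6:.2f}M per year")
```

Moving from A=0.85 to A=0.99 cuts the idle cost from about $131M to under $9M per year under these assumptions.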

Consider A=0.99 Machine Goal
• Pretend all systems that keep the machine off are electronic, if that makes you feel better.
• For 0.99 overall NLC, each subsystem must attain Asub = (0.99)**(1/160) ≈ 0.99994 (0.55 hr/yr of downtime).
• If a subsystem has 100 power supplies, each supply would need A = 6-9’s, implying a 1.6 million hour MTBF.
• Obviously impossible if we have to depend on MTBF alone.
• Industry shows: “Man shall not live by MTBF alone.”
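One way to read "not by MTBF alone" is to turn the formula around and ask how short the interruption per failure must be if MTBF stays fixed. The sketch below does this, borrowing the 100K hour MTBF from the NLC flashback slide as the fixed value. The equal-share series budget, the 1 hour MTTR and the function names are assumptions for illustration.

```python
def per_element_availability(a_total: float, n_elements: int) -> float:
    """Equal availability each of n series elements needs to reach a_total."""
    return a_total ** (1.0 / n_elements)


def required_mtbf_hours(a: float, mttr_hours: float) -> float:
    """Solve A = MTBF / (MTBF + MTTR) for MTBF."""
    return mttr_hours * a / (1.0 - a)


def max_mttr_hours(a: float, mtbf_hours: float) -> float:
    """Solve A = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - a) / a


# 160 subsystems x 100 supplies each, overall goal A = 0.99.
a_unit = per_element_availability(0.99, 160 * 100)

# Route 1: keep a 1 hour interruption per failure (assumption), raise MTBF.
mtbf_needed = required_mtbf_hours(a_unit, mttr_hours=1.0)

# Route 2: keep the 100K hour MTBF, shrink the interruption per failure.
mttr_minutes = 60.0 * max_mttr_hours(a_unit, mtbf_hours=100_000)

print(f"A_unit = {a_unit:.9f}")
print(f"MTBF needed at 1 h MTTR:     {mtbf_needed:,.0f} h")
print(f"MTTR allowed at 100K h MTBF: {mttr_minutes:.1f} min")
```

Route 1 reproduces the roughly 1.6 million hour MTBF on the slide. Route 2 shows that the same goal is met with a 100K hour MTBF if each failure interrupts the machine for only a few minutes, which is what redundancy plus hot swap aims to deliver.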

II. ATCA Telecom System: A = Five 9’s
• 2 Control & 12 Applications slots
• Up to 200 W/module at 45ºC ambient, 2.8 kW Shelf
• Redundant speed-controlled DC fans
• Mezzanine Card Option: 3x7 inch, hot swappable, up to 8 per motherboard (Mbrd)

Advanced Telecom Computing Architecture
• How it gets A = Five 9’s
  - Standard hardware, shelf manager, software
  - Redundant Controls Backbone of Controllers, Serial links
    - Any controller or link can fail without failing modules in the Shelf.
  - Dual independent 48 VDC power conditioners to each slot
    - All modules keep operating if one feed fails.
  - All-serial: multiple dual Gigabit serial links, TR chip sets
    - Dual star or mesh crate networks, customer’s choice
  - On-Board Smarts: Dual Standard Shelf Manager
    - Controls temperature by fan speed
    - Compensates for a failed fan
    - Sees the standard chip monitoring circuits in every module
    - Detects a failed module and powers it down for technician “hot swap”
    - Returns the module to service after replacement
    - Ramps power smoothly to eliminate inrush transients
    - Informs processors to reroute data around a failed channel
    - Communicates with the central management system

Systems That Never Shut Down
• Any large telecom system will have a few redundant Shelves, so loss of a whole unit does not bring down the system, like the RF system in the Linac.
  - Load is auto-rerouted to a hot spare, again like the Linac.
• Key: All equipment always accessible for hot swap.
• Other Features:
  - Open System, Non-Proprietary: very important for non-Telecom customers like ILC.
  - Developed by an industry consortium¹ of major companies sharing in a $100B market.
  - A 20X larger market than any of the old standards, including VME, leads to competitive prices.
¹ PICMG: PCI Industrial Computer Manufacturers Group

III. How Applicable to ILC Systems?
• An ATCA shelf has 14 modules, 2 dedicated to the backbone, leaving 12 application modules about the size of large VME but dual wide (1.2 in). What might this do?
  - Each of the 12 modules can carry up to 8 hot-swappable 25 W mezzanine cards, or 96 in a shelf.
  - Each mezzanine card could carry 1 or 2 RF or BPM channels.
  - A double-height unit could hold a corrector power supply, tuner motor driver or vacuum pump driver, perhaps an SC Quad supply.
  - A redundant power and communications backbone with 2-level hot-swap capability is an excellent match to many critical front end instruments.
  - Finally, a single-card system is possible where a standard shelf would result in overly long cable runs to critical sensors.
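The shelf capacity numbers can be cross-checked with a small model. This is only a bookkeeping sketch: the class and field names are hypothetical, the 200 W per slot figure is taken from the earlier ATCA slide, and treating every mezzanine card as drawing its full 25 W is an upper-bound assumption that leaves no power for the carrier board itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShelfConfig:
    """Capacity bookkeeping for one ATCA shelf, using figures from the slides."""

    total_slots: int = 14
    backbone_slots: int = 2
    mezzanines_per_module: int = 8
    watts_per_mezzanine: float = 25.0
    watts_per_slot: float = 200.0  # "up to 200 W/module" from the ATCA slide

    @property
    def application_slots(self) -> int:
        return self.total_slots - self.backbone_slots

    @property
    def mezzanine_cards(self) -> int:
        return self.application_slots * self.mezzanines_per_module

    def channels(self, per_card: int) -> int:
        """RF or BPM channels per shelf at 1 or 2 channels per mezzanine card."""
        return self.mezzanine_cards * per_card

    @property
    def mezzanine_watts_per_module(self) -> float:
        return self.mezzanines_per_module * self.watts_per_mezzanine

    @property
    def shelf_watts(self) -> float:
        return self.total_slots * self.watts_per_slot


shelf = ShelfConfig()
print(shelf.application_slots, "application slots")
print(shelf.mezzanine_cards, "mezzanine cards per shelf")
print(shelf.channels(1), "to", shelf.channels(2), "channels per shelf")
print(shelf.mezzanine_watts_per_module, "W of mezzanine load per module")
print(shelf.shelf_watts, "W per shelf")
```

The totals agree with the slides: 96 mezzanine cards per shelf, 8 cards at 25 W matching the 200 W module limit, and 14 slots at 200 W matching the 2.8 kW shelf rating.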

Controls Cluster
FEATURES
◊ Dual Star 1/N Redundant Backplanes
◊ Redundant Fabric Switches
◊ Dual Star/Loop/Mesh Serial Links
◊ Dual Star Serial Links To/From Level 2 Sector Nodes
[Figure labels: Dual Star To/From Sector Nodes; Dual Fabric Switches; Applications Modules; Dual Star/Loop/Mesh]

Sector Nodes
Level 2
◊ Dual Nodes
◊ Dual Star
FEATURES
◊ Standalone Front End (FE) Modules
◊ Dual Star Links To/From Level 2 Sector N
◊ 48V DC Bus
◊ Water/Air Cooled
[Figure labels: Sector N; Sector N+1; Sector N+2; ~1 km]

Beam Instrumentation
• ATCA Style Modules
• Stand-Alone or Cluster
• Dual Redundant Serial I/O, Power
• Robotic Replaceable

Front End Instrumentation
[Figure labels: ~1 m; ~40 m]

ATCA Application in ILC
• ATCA Directly Applicable
  1. Controls, Networks & Timing
     - Central Control Cluster, IOCs, Sector nodes, Dual Star Networks
  2. Beam Instrumentation
     - LLRF, Cavity Tuners, BPMs, Movers, Temp., Vacuum
• ATCA Principles Applicable
  3. Power Electronics
     - Modulators, Large Bulk Power Supplies, Modular Power Supplies, Kickers
  4. Detectors
     - Power systems, data communications chipsets, protocols, modular packaging, front end interfacing

Review of Design Principles
• HA design principles are simple:
  - Must not allow a single-point failure of one electronic element to bring down the machine.
  - To achieve this, must include a modest level of redundancy at one or more levels:
    1. Subsystem level, e.g. a few extra controls crates (units) in the main computer farm
    2. Subunit level, e.g. a few extra processor modules within the main computer farm crates
    3. Sub-unit level, e.g. hot-swappable daughter cards on a critical PSC module in the beam instrumentation complex
  - Must have access for hot swap, by worker or machine.
  - The degree of redundancy and the hot-swap implementation depend on how critical the element is to keeping the machine running.

Example Applications of Principles to Other ILC Subsystems

HA Concept Modulator
• 3-Level n/N Redundancy
  - 1/5 IGBT subassemblies
  - 1/8 Motherboards
  - +2% Units in overall system
• Intelligent Diagnostics
  - Embedded wireless in every motherboard (MBrd)
  - Networked by dual fiber to Main Control

HA Concept DC Power
• n/N parallel modules for DC magnet supply (1/5 shown)
• Overall load current feedback
• Independent dual Diagnostics Controller Mezzanine Card on each module
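The load-sharing consequence of running parallel modules under overall current feedback can be sketched as follows. This is an illustration only: it reads "1/5" as five installed modules of which any four can carry the load, assumes ideal equal current sharing, and uses a hypothetical 100 A load. None of these values come from the talk.

```python
def per_module_current(load_current_a: float, modules_on: int) -> float:
    """Current each module carries when the load is shared equally."""
    if modules_on <= 0:
        raise ValueError("at least one module must be operating")
    return load_current_a / modules_on


def min_module_rating(load_current_a: float, installed: int, tolerated_failures: int) -> float:
    """Smallest per-module rating that still carries the load after failures."""
    return per_module_current(load_current_a, installed - tolerated_failures)


LOAD_A = 100.0        # hypothetical magnet current (assumption)
INSTALLED = 5         # "1/5 shown", read here as 5 modules with 1 redundant

normal_share = per_module_current(LOAD_A, INSTALLED)
share_after_trip = per_module_current(LOAD_A, INSTALLED - 1)
rating = min_module_rating(LOAD_A, INSTALLED, tolerated_failures=1)

print(f"all 5 modules on:  {normal_share:.1f} A each")
print(f"one module failed: {share_after_trip:.1f} A each")
print(f"required rating:   {rating:.1f} A per module")
```

With the overall feedback loop holding the load current constant, the surviving modules pick up the failed module's share, so each module must be rated for the post-failure share rather than the normal one.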

HA Concept DC PS Module
• Features
  - Motherboard
  - Dual Serial Control IO
  - Independent Carriers
  - Hot Swappable
  - Optional: Redundant n/N w/ Switchover
[Figure labels: Dual Bulk 48V DC In; DC Out]

HA Concept DR Kicker Systems
• Approx. 50 unit drivers
• n/N redundancy at system level (extra kickers)
• n/N redundancy at unit level (extra cards)
• Diagnostics on each card, networked, local wireless

HA Remote Servicing Concept
• Module replacement while operating
• Depot maintenance system
• Instrument & Power Modules
[Figure label: Module Service System]

IV. Need for Standards
• Our old, noble instrument standards are far too obsolete to support the powerful new IC technologies of today.
  - Parallel backplane crates are obsolescent; serial wire, fiber and wireless have taken over.
  - The old ways will not work for ILC!
• ILC needs to adopt packaging standards within which all the custom creative design work will fit.
  - ATCA is the best and newest instrument system available, the only open HA system; it will last for the life of the project.
• Custom adaptations must be made in the power and detector fields where form factors must be flexible, but design principles and Engineering Best Practices will remain firm.
• It is urgent that ILC evaluate and adapt standards in these early years so platforms are firmly in place when engineering design begins in earnest.

V. Recommendations
• ILC electronics controls and instrumentation systems should modernize on an ATCA-type High Availability platform, readily adaptable to emerging technologies.
• ILC custom systems design should include ATCA features, including the equivalent of a Shelf Manager.
• In Power Systems such as Marx and LGPS, design a new diagnostic layer to pinpoint problem areas in order to:
  - Avert impending failures
  - Call attention to failures so they are treated promptly without machine interruption
• Apply to other large units, e.g. Detectors, that don’t fit the ATCA form factor by adopting concepts and scaling design features.

VI. Conclusions
• Electronic subsystems can attain amazingly high Availability with conscious design of units and subunits to avoid single-failure interruption.
• Typical 1/n redundancy in power units, n~5, plus either or both of extra units at the subsystem level and extra subunits at the unit level. The new Marx design is an example.
• With these measures and access to swap, most system failures can be averted.
• Higher MTBF is always desirable but is not a substitute for HA design.
• MTTR can never be zero, but failures that would normally interrupt the machine can be avoided by hot-swapping in a 1/n subsystem unit design.
• We should challenge every subsystem, starting with electronics, to strive for HA design. Our goal should be a machine that never (well, hardly ever) breaks.