Fault Management for the SKA - Niruj Mohan Ramanujam, NCRA

1 Fault Management for the SKA - Niruj Mohan Ramanujam, NCRA

2 Overview
Faults are anomalous behaviour of a sub-system which could lead to less than optimal functioning. Given the scale of the SKA, we know that
- several components will be unavailable at any given time, and
- multiple faults are likely to occur during an observation.
Hence, the occurrence and management of faults shall be treated as a routine part of the functioning of the SKA. Operator intervention will not be possible for every fault, so automated fault management is essential. All faults, including the action taken, will need to be logged and archived, and also included in the metadata if needed.

3 Overview
We will briefly discuss the following issues:
- Fault detection
- Alarm handling
- Operator notification and response
- Troubleshooting and remote management
- Archiving and logging
- Standards, best practices and current capabilities

4 Fault Detection
Faults are detected within Entity hardware or Entity software at any level of the SKA hierarchy, by the relevant M&C system or by other entities connected to it. Fault detection mechanisms include thresholds, timeouts, anomalies, correlations, external sources, downstream analysis, safety monitoring and reports by users.

5 Alarm Handling
Handling of alarms (raising alarms and fixing the fault) is hierarchical: an alarm is handled by its Local M&C and/or by the parent M&C. Determining factors include severity and actionability.
Alarm handling includes multiple methods: ignoring, retrying, resetting or restarting, disabling, replacing, running diagnostic tests, repairing, modifying parameters, protecting other components, etc.
Alarm handling needs to differentiate between (1) basic alarms, (2) aggregated alarms, (3) model alarms and (4) key alarms.
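As an illustration of how severity and actionability might decide which level of the hierarchy handles an alarm, here is a minimal Python sketch. The `Severity` scale, the `Alarm` fields and the `route_alarm` rule are all hypothetical; the slides do not specify them.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical three-level scale; the slides do not define one.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Alarm:
    source: str          # entity that raised the alarm
    severity: Severity
    actionable: bool     # can the Local M&C act on it by itself?


def route_alarm(alarm: Alarm, local_limit: Severity = Severity.MAJOR) -> str:
    """Return which level handles the alarm: 'local' or 'parent'.

    Assumed rule: the Local M&C keeps alarms it can act on, up to
    local_limit; everything else is escalated to the parent M&C.
    """
    if alarm.actionable and alarm.severity <= local_limit:
        return "local"
    return "parent"
```

In this sketch a non-actionable alarm is always escalated, whatever its severity, so that some level of the hierarchy takes responsibility for it.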

6 Alarm Handling - problems
The alarm handler design typically needs to deal with the following problems:
- Alarm floods: a cascade of alarms of increasing severity, triggered by a lower-level alarm, which overwhelms the operator and leads to unnecessary shutdown of the (sub)system (for want of a nail ...).
- Nuisance alarms: alarms which need to be ignored or disabled, e.g. because of a wrong range of values, incorrect filtering, irrelevant contexts, oscillatory behaviour around one end of the allowed range, etc.
- Stale alarms: alarms which are obsolete, refuse to be cleared, etc.
- Unclear alarms: alarms which are ill-defined, undocumented or redundant.
- Progressive alarms: a series of alarms of increasing severity raised by a progressively degrading component.
- Issues of latency and time-tagging of alarms are also involved.
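The oscillatory nuisance alarm mentioned above (a value hovering around one end of its allowed range) is commonly tamed with a deadband, also called hysteresis. The sketch below is one possible way to do it, not something taken from the slides; the class name, the `high` limit and the `deadband` parameter are assumptions.

```python
class HysteresisAlarm:
    """Upper-limit alarm that clears only after the value has dropped
    a full deadband below the limit (assumed design, for illustration)."""

    def __init__(self, high: float, deadband: float):
        self.high = high
        self.deadband = deadband
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one sample; return True if the alarm state changed."""
        if not self.active and value > self.high:
            self.active = True
            return True
        if self.active and value < self.high - self.deadband:
            self.active = False
            return True
        return False


def count_transitions(samples, high, deadband):
    """Count raise/clear events produced by a sequence of samples."""
    alarm = HysteresisAlarm(high, deadband)
    return sum(alarm.update(v) for v in samples)
```

With samples that hover around a limit of 50, such as `[49.9, 50.1, 49.9, 50.1, 49.9, 50.2]`, a zero deadband produces a raise or clear event on almost every sample, whereas a deadband of 1.0 produces a single raise.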

7 Alarm Components
Automation in
- acquiring these parameters from various sources
- handling the alarm based on these parameters.

8 Alarm Handling - solutions
Method: Description. Effect.
- Aggregation: Similar alarms (by source, nature, time etc) are grouped and presented. Effect: minimises flood.
- Abstraction: Several pending alarms for an Element or Component are abstracted into a single alarm with drill-down.
- Contextualisation: Alarm handler behaviour needs to be different in different contexts (maintenance, switch-off, testing etc). Effect: minimises nuisance and unclear alarms.
- Suppression: Alarms raised as a result of another can be suppressed (unless a key alarm), till the original alarm is handled.
- First-out: The first alarm which triggers multiple alarms, each propagating upwards at different rates, is tagged as first-out. Effect: minimises flood, progressive alarms.
- Filtering: Alarm generation is preceded by filtering (e.g. for oscillatory behaviour around one end of the allowable range). Effect: minimises flood and nuisance.
- Override, Shelving: Operator can move an alarm to an auxiliary shelf temporarily or redirect alarms to other roles. Effect: minimises nuisance.
- Asynchronous notification: Some alarms are automatically passed to other domains (maintenance engineer, e.g.), bypassing the operator. Effect: minimises unclear and nuisance alarms.
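Three of these methods (aggregation, first-out tagging and suppression) can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the `Alarm` fields, in particular `caused_by` and `key`, and the function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    source: str
    timestamp: float
    caused_by: Optional[str] = None  # id of the alarm that triggered this one
    key: bool = False                # key alarms are never suppressed


def aggregate_by_source(alarms):
    """Aggregation: group alarms that share a source."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm.source].append(alarm)
    return dict(groups)


def first_out(alarms):
    """First-out: the earliest alarm of a cascade, by timestamp."""
    return min(alarms, key=lambda a: a.timestamp)


def visible_alarms(alarms, handled_ids=()):
    """Suppression: hide an alarm while the alarm that caused it is
    still unhandled, unless it is a key alarm. Handled alarms are
    dropped from the view altogether."""
    pending = {a.alarm_id for a in alarms} - set(handled_ids)
    return [
        a for a in alarms
        if a.alarm_id in pending
        and (a.key or a.caused_by is None or a.caused_by not in pending)
    ]
```

Once the original alarm is marked as handled, the alarms it caused become visible again, which matches the "till the original alarm is handled" condition in the list above.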

9 Operator Notification and Response
The guiding principle is to minimise intrusiveness for the operator, limiting interrupt-style notifications to only the highest priority alarms.
- Each subsystem needs to define the severity of generated alarms.
- Key alarms: actionable, high severity alarms.
- Key alarms are presented to the operator for acknowledgement and response.
- Other alarms are abstracted to Element level and shown as decorations on schematic diagrams only.
- Operators need drill-down mechanisms to probe (key, aggregated etc) alarms further.
- All operator responses will be archived, and added to metadata if needed.
- Multiple people with different roles can also respond.
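A minimal sketch of this presentation rule follows: key alarms interrupt the operator, everything else is reduced to one decoration per Element. The numeric severity scale, the `KEY_SEVERITY` threshold and the field names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    element: str
    severity: int      # larger is more severe; the scale is hypothetical
    actionable: bool


KEY_SEVERITY = 3  # assumed threshold for "high severity"


def is_key(alarm: Alarm) -> bool:
    """Key alarms are actionable and of high severity."""
    return alarm.actionable and alarm.severity >= KEY_SEVERITY


def present(alarms):
    """Split alarms into interrupt-style notifications and per-Element
    decorations (here, the worst non-key severity for each Element)."""
    interrupts = [a for a in alarms if is_key(a)]
    decorations = {}
    for alarm in alarms:
        if not is_key(alarm):
            worst = decorations.get(alarm.element, 0)
            decorations[alarm.element] = max(worst, alarm.severity)
    return interrupts, decorations
```

The decoration dictionary is what a schematic display would colour; drill-down would then list the individual alarms behind each Element.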

10 Troubleshooting and remote management
The goal is to minimise the need to physically go out to (remote) equipment. M&C needs to ensure that all commands and system parameters down to the LRU level are available through remote interfaces, which can bypass intermediate M&C nodes if needed. Hence, the M&C has to
- ensure remote operations (reset, power on/off, software upgrade, self tests etc)
- ensure back-up communication
- support video cameras, etc
- log state changes of entities, commands, etc
- integrate with maintenance management for physical maintenance (alarms asynchronously notify, entities are put in maintenance mode, personnel are given physical and M&C access, changes and updates are tracked and logged, etc).

11 Engineering Data Archive
Archiving and logging are essential functions of the M&C in order to
- store M&C data with linkages
- aid troubleshooting as well as analyse the performance of the system
The archive needs to be able to
- store M&C data, metadata, alarms, logs and links to science data
- acquire from archiver applications for each data stream
- provide links between data, multiple views for analysis, and deletion/modification
- enable drill-down or traceability
- conform to VO and industry standards

12 Engineering Data Archive

13 Engineering Data Archive
- What is archived: all acquired data subsequent to data reduction.
- Who decides: each application.
- Drill-down or traceability: All M&C data, metadata, alarms etc associated with a given observation need to be linked to each other and with the science data in the Archiver. Tagging based on time, component, location etc needs to be done (e.g. AstroWise). These links need to be carefully resolved in the case of deletion/modification.
- Current projects (LOFAR, ASKAP, EVLA and ALMA) have consistent archiving tools.
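The tagging and link-resolution requirements above can be illustrated with a small in-memory index. This is a toy sketch, not the design of any of the archives named here; the `EngineeringArchive` class, its methods and the tag names (`observation`, `component`) are all hypothetical.

```python
from collections import defaultdict


class EngineeringArchive:
    """Toy archive: records carry tags, and an index per (tag, value)
    pair gives drill-down from an observation to its records."""

    def __init__(self):
        self._records = {}
        self._by_tag = defaultdict(set)   # (tag name, value) -> record ids
        self._next_id = 0

    def store(self, kind, payload, **tags):
        """Store one record (alarm, log entry, ...) with its tags."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = {"kind": kind, "payload": payload, "tags": tags}
        for item in tags.items():
            self._by_tag[item].add(record_id)
        return record_id

    def find(self, **tags):
        """Return records matching all given tags, in insertion order."""
        ids = None
        for item in tags.items():
            matches = self._by_tag.get(item, set())
            ids = set(matches) if ids is None else ids & matches
        return [self._records[i] for i in sorted(ids or ())]

    def delete(self, record_id):
        """Delete a record and resolve every index link pointing to it."""
        record = self._records.pop(record_id)
        for item in record["tags"].items():
            self._by_tag[item].discard(record_id)
```

For example, `archive.find(observation="obs-1")` would return every alarm and log entry tagged with that observation, and `delete` removes the record from each tag index so that no dangling link remains.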

14 Standards, best practices and current capabilities
Industry standard for alarm handling: ISA SP 18. It consists of a life-cycle model and a number of recommendations, e.g. the higher the severity of an alarm type, the lower its frequency of occurrence; alarm floods should not halt the entire system if aggregation etc do not work; etc.
Current capabilities:
- PVSS II (LOFAR) & EPICS (ASKAP): integrated alarm system; configuration, automatic trigger, displays, APIs; handle alarms locally (PVSS); aggregates alarms (EPICS); device id, fault value, status, severity, timestamp etc.
- LASER (ALMA): adopted the LHC system; does not include fault detection; service based; distribution layer; app gathers faults, groups, analyses, archives; filters, aggregates, suppresses alarms; hierarchical system; timestamp, device id, type, consequence, cause, action to take etc.
- EVLA: alarm filtering, suppression, first-out; use IP multicast as messaging bus.

15 See section 5.8 of M&C Design Concept Descriptions, “04_WP2-005.065
See section 5.8 of M&C Design Concept Descriptions, “04_WP TD-002v0.2_dcd”

16 Fault Detection - extra
Faults are detected within Component hardware or software at any level of the SKA hierarchy, by the relevant M&C system or by other entities connected to it. Fault detection mechanisms include
- Threshold: outside pre-determined range of allowed values
- Timeout: operation not completing in expected time
- Anomalies: value or behaviour deviating from what is expected
- Correlations: discrepancies of values across multiple sources
- Patterns analysis: faulty trends over time, components etc
- External sources: external monitoring systems
- Downstream analysis: feedback from data processing etc
- Reported a posteriori: by users, analysts etc much later
- Safety & Security monitoring
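Three of the mechanisms listed above (threshold, timeout and correlation) are simple enough to sketch directly. The function names, argument conventions and the median-based correlation rule below are assumptions chosen for illustration, not part of the SKA design.

```python
import statistics


def threshold_fault(value, low, high):
    """Threshold: True if the value lies outside the allowed range."""
    return not (low <= value <= high)


def timeout_fault(started_at, now, expected_duration):
    """Timeout: True if an operation has run longer than expected.
    All three arguments are in the same time unit (e.g. seconds)."""
    return (now - started_at) > expected_duration


def correlation_fault(readings, tolerance):
    """Correlation: given {source: value} for sources that should agree,
    return the sources deviating from the median by more than tolerance."""
    median = statistics.median(readings.values())
    return sorted(
        source for source, value in readings.items()
        if abs(value - median) > tolerance
    )
```

For instance, three temperature sensors reading 20.0, 20.5 and 35.0 with a tolerance of 2.0 would flag only the third sensor, since the median is 20.5.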

