Presentation is loading. Please wait.

Presentation is loading. Please wait.

Log analysis in the accelerator sector Steen Jensen, BE-CO-DO.

Similar presentations


Presentation on theme: "Log analysis in the accelerator sector Steen Jensen, BE-CO-DO."— Presentation transcript:

1 Log analysis in the accelerator sector Steen Jensen, BE-CO-DO

2 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 2

3 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 3

4 CERN’s accelerator complex Openlab workshop on data analytics, CERN, November 16th 2012 4

5 Cern’s Control Centre, CCC Openlab workshop on data analytics, CERN, November 16th 2012 5

6 Control system infrastructure Openlab workshop on data analytics, CERN, November 16th 2012 GPN T T T T T Operator Consoles Fixed Displays File Servers Appl Servers PVSS Servers Real-time Linux LynxOS Computers WorldFIP Systems PLCs Cryo, Vacuum, … Magnet Control, RF Quench Protection, … Beam instrum. Kickers Machine Protection, … Other operational Consoles Oracle Technet Timing 6

7 Hardware  425 consoles  300 servers  1300 front ends  600 module types  ~85.000 device instances Openlab workshop on data analytics, CERN, November 16th 2012 7

8 Software  Java – operational code  400 GUIs, 200 servers  ~8MLOC, > 1000 jars  Up to 1000 processes  80 people, 10 groups  C/C++  > 1300 Front End servers  Real-time processes  Drivers (Linux/LynxOS)  80 people, 8 groups Openlab workshop on data analytics, CERN, November 16th 2012 8

9 Software environment  Operating systems  Linux, LynxOS, Windows  Languages  Java, C++, C, Scripts  Developers  Staff, Fellows, Students – as primary or secondary activity  CO, OP and Equipment Groups  Legacy  10+ years Openlab workshop on data analytics, CERN, November 16th 2012 9

10 Connections in middleware layer Openlab workshop on data analytics, CERN, November 16th 2012 10

11 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 11

12 Motivation  Further improve quality and availability of the controls system  New responsibility model => non-experts must be able to quickly identify who to call  Increase diagnostic efficiency  Widen and deepen diagnostic capabilities  Proactive & preventive maintenance & evolution  We already have many log files that can be exploited better Openlab workshop on data analytics, CERN, November 16th 2012 12

13 Log data characteristics  Continuous stream of text messages  ~2Gb per day, bursts > 10Gb per day  Multiple transport mechanisms  syslog, log4J, JMS  Scattered, system-specific log files  Diversity in format and length  Ambiguity in log level interpretation  Variations in frequency  Differences in relevance Openlab workshop on data analytics, CERN, November 16th 2012 13

14 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 14

15 Use cases  Allow people to diagnose systems other than their own  Facilitate correlation of messages across systems  Detect what has changed if something does not work  Observe trends over time  Easily obtain customized views of one or more systems  Receive notifications based on custom criteria Openlab workshop on data analytics, CERN, November 16th 2012 15

16 Non-functional requirements  Easy to use and access by (non-)specialists  Overviews and graphical visualisations  Efficient data mining  Allow for knowledge capture, sharing and re-use  grep/awk/sed does not provide this  Looking at Splunk, it is unrealistic to develop own system Openlab workshop on data analytics, CERN, November 16th 2012 16

17 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 17

18 Log data pre-processing  Centralization  Filtering of irrelevant messages  Throttling and burst protection  Down from 2+ Gb / day to ~100 Mb / day, not counting Java, i.e. likely to increase considerably Openlab workshop on data analytics, CERN, November 16th 2012 18

19 Splunk setup  Central instance receiving pre-filtered messages via syslog and JMS  Automated alerts  Dashboards and saved searches  Manual monitoring and follow-up Openlab workshop on data analytics, CERN, November 16th 2012 19

20 Splunk dashboard Openlab workshop on data analytics, CERN, November 16th 2012 20

21 RBAC denials on SET Openlab workshop on data analytics, CERN, November 16th 2012 21

22 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 22

23 Detailed case: Leap second  On July 1 st, 01:59:59 a 60 th leap second was added globally  Using Splunk, a system administrator discovered that  Some computers failed to add the leap second on time  Certain types of computers added the leap second at a later moment  There seems to be no pattern over time  No computer added a leap second more than once Openlab workshop on data analytics, CERN, November 16th 2012 23

24 Use cases  RBAC security  Tokens missing, malformed or expired  CMW Client issues  Slow consumer, zombies, incorrect accesses, error handling  Timing system  telegram layout issues  Performance testing of new system  CO Frameworks  Improvement: Token check prior to access  Bug: applying wrong token in certain cases  Bug: Loop on timing error Openlab workshop on data analytics, CERN, November 16th 2012 24

25 Use cases, continued  Design considerations  Architectural decisions based on usage info  Driver error reporting, missing module  System issues  yum updates misbehaving due to race condition  Separation between operational and test environments Openlab workshop on data analytics, CERN, November 16th 2012 25

26 Outline  Introduction to the controls system  Our motivation for log analysis  Requirements  The current setup  Use cases  Conclusions Openlab workshop on data analytics, CERN, November 16th 2012 26

27 Conclusions  Log analysis is very valuable – we learned a lot using Splunk  Knowledge about control system usage  Pro-active detection, trimming, improving  Re-active diagnostics  It is a continuous discover-learn-improve process  It involves a mix of  automated alerts  manual monitoring  ad hoc data mining  Experience allows improving log data quality  Culture: Developer usage  Structure: Key-value pairs  Splunk is a very promising tool in CO’s context, and projects show strong interest Openlab workshop on data analytics, CERN, November 16th 2012 27


Download ppt "Log analysis in the accelerator sector Steen Jensen, BE-CO-DO."

Similar presentations


Ads by Google