DAQ Monitoring and Auto Recovery at DØ B. Angstadt, G. Brooijmans, D. Chapin, D. Charak, M. Clements, S. Fuess, A. Haas, R. Hauser, D. Leichtman, S. Mattingly,

Slides:



Advertisements
Similar presentations
12 October 2011 Andrew Brown IMu Technology EMu Global Users Group 12 October 2011 IMu Technology.
Advertisements

TCP Monitor and Auto Tuner. Need Analysis Enable monitoring of TCP Connections Enable maximum bandwidth utilization No such utility available in MONALISA.
Categories of I/O Devices
COM vs. CORBA.
Database Architectures and the Web
Gnutella 2 GNUTELLA A Summary Of The Protocol and it’s Purpose By
Distributed components
Technical Brief v1.0. Communication tools that broadcast visual content directly onto the screens of computers, using multiple channels and formats Easy.
Mi-Joung choi, Hong-Taek Ju, Hyun-Jun Cha, Sook-Hyang Kim and J
Software Frameworks for Acquisition and Control European PhD – 2009 Horácio Fernandes.
Lesson 11-Virtual Private Networks. Overview Define Virtual Private Networks (VPNs). Deploy User VPNs. Deploy Site VPNs. Understand standard VPN techniques.
The DØ Experiment and Run II at the FNAL Tevatron The DØ Experiment is one of two Colliding beam detectors located at Fermi National Accelerator Laboratory.
Circuit & Application Level Gateways CS-431 Dick Steflik.
WNT Client/Server SDK Tony Vaccaro CS699 Project Presentation.
Architecture & Performance Community Place case study Presented by u Jin Hyung, SEO.
12-1 © Prentice Hall, 2004 Chapter 12: Design Elements Object-Oriented Systems Analysis and Design Joey F. George, Dinesh Batra, Joseph S. Valacich, Jeffrey.
Control of large scale distributed DAQ/trigger systems in the networked PC era Toby Burnett Kareem Kazkaz Gordon Watts DAQ2000 Workshop Nuclear Science.
Firewall and Proxy Server Director: Dr. Mort Anvari Name: Anan Chen Date: Summer 2000.
NETWORK FILE SYSTEM (NFS) By Ameeta.Jakate. NFS NFS was introduced in 1985 as a means of providing transparent access to remote file systems. NFS Architecture.
FIREWALL TECHNOLOGIES Tahani al jehani. Firewall benefits  A firewall functions as a choke point – all traffic in and out must pass through this single.
Presented by: Alvaro Llanos E.  Motivation and Overview  Frangipani Architecture overview  Similar DFS  PETAL: Distributed virtual disks ◦ Overview.
CLIENT A client is an application or system that accesses a service made available by a server. applicationserver.
.NET, and Service Gateways Group members: Andre Tran, Priyanka Gangishetty, Irena Mao, Wileen Chiu.
IT 210 The Internet & World Wide Web introduction.
Cloud Computing for the Enterprise November 18th, This work is licensed under a Creative Commons.
DONE-10: Adminserver Survival Tips Brian Bowman Product Manager, Data Management Group.
ASP.NET The.NET Framework. The.NET Framework is Microsoft’s distributed run-time environment for creating, deploying, and using applications over the.
Networked File System CS Introduction to Operating Systems.
1 CMPT 471 Networking II DHCP Failover and multiple servers © Janice Regan,
Enabling Embedded Systems to access Internet Resources.
Distributed File Systems
Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.
Run II DZERO DAQ / Level 3 Trigger System Ron Rechenmacher, Fermilab The DØ DAQ Group (Brown/FNAL-CD/U.of Washington) CHEP03.
Scalable Web Server on Heterogeneous Cluster CHEN Ge.
Syzygy Design overview Distributed Scene Graph Master/slave application framework I/O Device Integration using Syzygy Scaling down: simulators and other.
HERTS Paul Larpenteur Lee Murphy CSE 403 – Sp 2003 Hearts Experimental Remote Transportable System.
XML Web Services Architecture Siddharth Ruchandani CS 6362 – SW Architecture & Design Summer /11/05.
Database Architectures Database System Architectures Considerations – Data storage: Where do the data and DBMS reside? – Processing: Where.
© 2006 IBM Corporation Agile Planning Web UI. © 2006 IBM Corporation Agenda  Overview of APT Web UI  Current Issues  Required Infrastructure  API.
CE Operating Systems Lecture 3 Overview of OS functions and structure.
DZERO Data Acquisition Monitoring and History Gathering G. Watts U. Of Washington, Seattle CHEP04 - #449.
1 Implementing Monitoring and Reporting. 2 Why Should Implement Monitoring? One of the biggest complaints we hear about firewall products from almost.
CMS pixel data quality monitoring Petra Merkel, Purdue University For the CMS Pixel DQM Group Vertex 2008, Sweden.
A machine that acts as the central relay between computers on a network Low cost, low function machine usually operating at Layer 1 Ties together the.
API Crash Course CWU Startup Club. OUTLINE What is an API? Why are API’s useful? What is HTTP? JSON? XML? What is a RESTful API? How do we consume an.
The Process Manager in the ATLAS DAQ System G. Avolio, M. Dobson, G. Lehmann Miotto, M. Wiesmann (CERN)
S O A P ‘the protocol formerly known as Simple Object Access Protocol’ Team Pluto Bonnie, Brandon, George, Hojun.
ITGS Network Architecture. ITGS Network architecture –The way computers are logically organized on a network, and the role each takes. Client/server network.
JS (Java Servlets). Internet evolution [1] The internet Internet started of as a static content dispersal and delivery mechanism, where files residing.
June 17th, 2002Gustaaf Brooijmans - All Experimenter's Meeting 1 DØ DAQ Status June 17th, 2002 S. Snyder (BNL), D. Chapin, M. Clements, D. Cutts, S. Mattingly.
Web Services Using Visual.NET By Kevin Tse. Agenda What are Web Services and Why are they Useful ? SOAP vs CORBA Goals of the Web Service Project Proposed.
Copyright © 2002 Pearson Education, Inc. Slide 3-1 Internet II A consortium of more than 180 universities, government agencies, and private businesses.
02/02/20001/14 Managing Commands & Processes through CORBA CHEP 2000 PaduaPadua Sending Commands and Managing Processes across the BaBar OPR Unix Farm.
By Nitin Bahadur Gokul Nadathur Department of Computer Sciences University of Wisconsin-Madison Spring 2000.
Multimedia Retrieval Architecture Electrical Communication Engineering, Indian Institute of Science, Bangalore – , India Multimedia Retrieval Architecture.
ADO .NET from. ADO .NET from “ADO .Net” Evolution/History of ADO.NET MICROSOFT .NET “ADO .Net” Evolution/History of ADO.NET History: Most applications.
Service-Oriented Architecture for Mobile Applications.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
By: Brett Belin. Used to be only tackled by highly trained professionals As the internet grew, more and more people became familiar with securing a network.
The purpose of a CPU is to process data Custom written software is created for a user to meet exact purpose Off the shelf software is developed by a software.
Introducing the Microsoft® .NET Framework
KID - KLOE Integrated Dataflow
Web Development Web Servers.
The Application Layer RIS 251 Dr. ir. S.S. Msanjila.
Software Design and Architecture
Distribution and components
Introduction to J2EE Architecture
Internet Protocols IP: Internet Protocol
Outline System architecture Current work Experiments Next Steps
15. Proxy SE2811 Software Component Design
Presentation transcript:

DAQ Monitoring and Auto Recovery at DØ B. Angstadt, G. Brooijmans, D. Chapin, D. Charak, M. Clements, S. Fuess, A. Haas, R. Hauser, D. Leichtman, S. Mattingly, A. Kulyavtsev, M. Mulders, P. Padley, D. Petravick, R. Rechenmacher, G. Watts, D. Zhang

2 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Monitor Design The simple monitor system is easy to write –Perhaps not very interesting –Often gets left to last –Crucial for long term debugging. Only a few basic design questions –Push/Pull –Direct connect to data sources or server –Request Granularity –Data Format Data Source L3 Monitor Server Data Source Monitor Displays Monitor Displays Monitor Displays Monitor Displays Monitor Displays Monitor Displays

3 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Ours Was No Different The DØ DAQ Monitor –Wasn’t written until it was clear we needed it General design principles were based on Run 1 experience –Too many displays affected system performance –Display maintained connections to every source –Data requested was only a large binary block. Run 2 Environment Many more data sources Of order 200 Possibly many more displays Online System Firewall Due to DOE security requirements. Primary debugging tool for shifters and remote experts

4 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Design Principles Use a Monitor Server Architecture –Only connection to the data sources –Protocol is based on the pull Data Format is XML –Standard format that is easily extensible –Easily readable by any programming language. Bother monitor sources of data as infrequently as possible –Fine grained monitor data request Some data may almost never be asked, but take significant resources to generate –Cache data in the monitor server for all data items. Be robust against crashes and other spurious effects –Auto reconnect built into clients and data sources Be able to handle a large number of data sources and displays –Heavily multithreaded Serve Real-time data –Cached data has a time stamp –Display data requests have an associated maximum age Minimum time is one second

5 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Monitor Server Implementation Written in C++ Communication is TCP/IP, based on the ACE RT framework Data Sources connect to it Displays connect to it Caches data –One for every data source Highly multithreaded –Prevents blocking due to long latency displays Takes 10% of a 2 CPU machine under full Monitoring Load –PIII Xenon, 1 GHz. 4% of ½ gig memory Serves about ½ meg per second of monitor data –Cache Hit Rate is about 50% –Could setup second layer of a caching server if so desired Typical time to satisfy query is 50ms Relay to serve requests from outside online system –An access point through security firewall. 162 Data Sources, 35 displays Performance

6 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Wire Protocol All component connections are TCP/IP All run a common format –Message length word followed by the message text. XML format is simple! <luminosity <luminosity Example Request To Client This simplicity is perhaps the single biggest reason for this system’s success.

7 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Complex Monitor Data What is sent between monitor item tags is up to data source –Can be arbitrarily complex. –Matches best with rest of system if in XML But this is not a requirement!

8 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Smorgasbord… Monitor Server SBC’s RM Super Online Monitor Server DL Rates COOR NSStore # Nodes Node Collator Channel 13 TFW L1L2 Scraper Muon FE Monitor Translate from other monitor systems Reformats data from the MS

9 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Node Collator Current Displays… L3 Monitor Server daqAI l3history l3xQT Web Pages Windows Systray fuMon uMon

10 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Coding… C++ Data Source Code Python Display Code Code frameworks in Python and C++ Code frameworks in Python, C++, Java, and C# l3_monitor_util_reporter_ACE monitor_callback ("tester"); L3_monitor_connector ms_connection ("d0l3mon.fnal.gov", DEFAULT_L3MONITOR_CLIENT_PORT, &monitor_callback); ms_connection.connect_to_server (true); l3_monitor_object_op counter ("count"); while (1) { counter++; std::cout << "Doing iteration " << counter << std::endl; ACE_OS::sleep(1); } return 0; } import time import l3xmonitor_util_module disp = l3xmonitor_util_module.monitor_display_relay() test_item = disp.get_item("tester", "count") while 1: disp.query_monitor_server() print "Count value is %s" % test_item[0] time.sleep(1)

11 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Complex Displays 120 kb every 5 seconds Strip Charts Main Shifter Display -Maximal amount of information -Prevent 4am phone calls Connections between FEC & Node

12 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 History Logger Keep track of data over long periods of time –Data Storage in Root –Data Storage in Oracle XML Format –Automatically parsed into ntuple like forms Change in data automatically reflected in data stored in DB –Web interface to change what is logged Make Plots in Real time using web interface Written in.NET Monitor Server Web Interface Web Plotter DB

13 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Lessons and Out Look Problems: –Didn’t understand how to build strings efficiently in C++ –Was generating a monolithic block of data in a single monitor item –Multi-threaded code that stays up for weeks at a time is hard Text vs Binary –Only once have we had to encode information in Text for compression reasons The code is stand alone –We rely on ACE for the monitor server –Display code frequently relies on Xerces and also ACE. –Data Source code relies on ACE and Xerces Wire protocol is very simple –Could write directly to it and skip dependencies. Simple To UseEveryone Uses It

14 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Complex Systems… Experiment has many 1000’s of computer controlled components. Problems fixable –Long term Visa problem! –Work around –Decrease data taking efficiency. Can this be addressed? Monitor Data Problem Recognizer Effect Changes Simple Design… Called “daqAI” Bad Name!

15 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 The Components Monitor Data Problem Recognizer Effect Changes New problems often require the addition of new data to monitor system All of detector is computer controlled Control is decentralized in DØ Safety/Political allowing computer to send commands Hard!

16 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Recognizing The Problem Must be dynamic –Each new problem should not require major update to code –Problems change –What problems symptoms indicate change Must be safe –Feedback loops are bad! –Should not cause harm or increase dead time! Rule Based Recognizer Monitor Data CLIPS Detector Commands

17 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 CLIPS Sample (defrule rate_is_low "See if the daq rate is less than 30 Hz" (l3xetg-rate_events ?num&:(< ?num 30)) => (assert (b_daq_rate_is_low ?num))) (defrule rate_low “Rate is too low. Fix with reset" (b_daq_rate_is_low ?r) (coormon-store_number ?s&:(> ?s 0)) => (log_reason “Rate is low") (talk “Rate is low. SCL Init issued”) (issue_coor_request “scl_init”)) (l3xetg-rate_events 55) (coormon-store_number 2326) Monitor Data Converted to facts Rules… If less than 30 Hz… Down Time Log Tell Shifters What Is Up Attempt to Fix It

18 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Produces Error Reports Had to produce a report of what it had done. All problem detection is associated with a particular problem Shift Problem Reports –Can see how often something goes wrong and how much downtime it causes

19 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Lessons… Successful –In the sense it dramatically decreased the downtime for common problems Not Successful –Requires constant up keep by myself –Must convince people this is the right thing to do –Adding new conditions is not as easy as it should be Rule based is easier, but there must be a better way! Have to be careful of feedback and unintended consequences

20 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Conclusions The Monitor Server has been wildly successful –XML format along with simple wire protocol were key to this –A easily usable and unified monitoring system allows projects like the auto detection to be easily written. The daqAI Project was successful –Big help reducing and cataloging DØ’s down time. Future –Monitor Server is stable and not under active development. –Looking for a smarter way to implement the problem recognizer for daqAI.

21 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 The Outside World Online ACL Firewall MS Relay Display d0l3mon2 www-d0l3mon d0l3mon2 only hole No command information is passed back and forth, just request and query Only displays can run outside…

22 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Adding your data… 1 Choose a name for your data source “sbc” “daqAI”“luminosity” 2 Choose the language you’ll source the data C/C++ Python 3 Write Code! 4 Run behind online ACL No configuration changes required in MS Simplified road map…

23 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 C++ Client #include "l3monitor_utils/client_objects/l3_monitor_object_op.hpp" #include "l3monitor_utils/client_objects/l3_monitor_util_reporter_ACE.hpp" int main() { l3_monitor_util_reporter_ACE monitor_callback ("tester"); L3_monitor_connector ms_connection ("d0l3mon.fnal.gov", DEFAULT_L3MONITOR_CLIENT_PORT, &monitor_callback); ms_connection.connect_to_server (true); l3_monitor_object_op counter ("count"); while (1) { counter++; std::cout << "Doing iteration " << counter << std::endl; ACE_OS::sleep(1); } return 0; } l3monitor_utils/test/test_monitor_obj_client.cpp Comments removed!! Connect to MS Declare Integer to be monitored Change it once per second for fun using usual C++ syntax Client Name

24 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Python Client l3xmonitor_util_module/test/test_monitor_client.py * import time import l3xmonitor_util_module ms = l3xmonitor_util_module.monitor_client("tester") count=0 while 1: ms.set(“count”, count) time.sleep(1) count += 1 Does the same thing Connect to MS Set monitor item count Must be run inside online ACLs!!

25 Gordon Watts (UW Seattle) CHEP2003 3/27/2003 Python Display import time import l3xmonitor_util_module disp = l3xmonitor_util_module.monitor_display_relay() test_item = disp.get_item("tester", "count") while 1: disp.query_monitor_server() if len(test_item) == 0: print "No data from the MS" else: print "Count value is %s" % test_item[0] time.sleep(1) Get connection to MS via ACL relay Tell system you want the count monitor variable Load latest data from MS Make sure data came back from MS & Print! (MS down, Client Down…) l3xmonitor_util_module/test/test_monitor_client_readback.py