P33: DØ Online Monitoring and Automatic DAQ Recovery
B. Angstadt, G. Brooijmans, D. Chapin, D. Charak, M. Clements, S. Fuess, A. Haas, R. Hauser, D. Leichtman, S. Mattingly, A. Kulyavtsev, M. Mulders, P. Padley, D. Petravick, R. Rechenmacher, G. Watts, D. Zhang
DØ and L3 home page: www/groups/l3daq/
daqAI home page: Projects/daqAI/

The DØ Experiment and Run II at the FNAL Tevatron
The DØ experiment is one of two colliding-beam detectors located at the Fermi National Accelerator Laboratory. It sits at one of four collision points of the 1.96 TeV Tevatron proton-antiproton accelerator. The DØ detector underwent numerous upgrades to extend its physics reach before the start of Run II in March 2001. One of the major upgrades was a DAQ system rebuilt from scratch, out of which the projects described here were spawned.

The Data Format: XML
After some debate over binary versus text, we settled on text, and then on XML. The main reason was the ease with which new monitor items can be added to an already existing structure. The Monitor Item Request and Monitor Server Reply sections below describe the simple XML request and reply formats.
[Diagram: Display ↔ Monitor Server ↔ Data Source]

What is a Monitor Item?
A monitor item is supplied by a data source and requested by a display. Each monitor item is uniquely addressed by the tuple (computer-name, client-type, item-name); for example, the luminosity for the DØ experiment is represented by (d0l3mon2.fnal.gov, luminosity, d0_luminosity). Some clients run on many machines: the l3xnode client runs as part of every Level 3 Trigger node, and so runs on almost 100 different nodes. The query format therefore allows a display to request a single monitor item from all clients at once. Monitor data can be of arbitrary complexity, formatted in XML or straight text; binary data is not supported. If XML control characters need to be included, standard XML escaping (for instance a CDATA section) can be used. We encourage the use of structured XML data, since automatic parsing tools can then extract values without specialized encoding.

Online Monitoring
It was realized early on that monitoring the health of the new DAQ system would require a subsystem of its own. Once developed, it slowly expanded to encompass other online subsystems. The design requirements:
- Extensibility: the system had to accommodate new monitor items and new data sources throughout its life without code changes.
- Crash Recovery: any component (data source, display, or server) should be able to crash and restart without anyone having to touch any other component in the system. This was a big aid during development!
- Uniform Access: the wire protocol had to be uniform and simple so that it could slot into preexisting programs.
- Scalability: large numbers of displays should not put undue load on the data sources. This is accomplished with caching in the server.
- Offsite Access: the ability to run all displays both locally at Fermilab and remotely, which means dealing with both latency and security issues.

Monitor System Data Flow
1. The display forms an XML request containing a list of the monitor items and data sources it wants data from.
2. The Monitor Server looks in its internal cache for monitor data recent enough to satisfy the request. Only data not in the cache is requested in step 3.
3. The Monitor Server assembles an XML request for each data source involved; a request may ask for several data items. The requests are sent out in parallel.
4. Each data source generates an XML reply that contains all the requested data.
5. The Monitor Server caches all data in the replies and builds a complete reply for the display.
6. The Monitor Server sends the XML data back to the display, which parses the data and displays or otherwise processes it.

Monitor Item Request
The request is sent in XML. The original panel annotated an example request: an XML wrapper; a Client-Type (monitor_server); a directive to return data from any node that has this client on it; the monitor item we are interested in; and a second Client-Type block.

Monitor Server Reply
The reply XML is similar to the request. Many standard packages can be used by a display to parse the monitor data; we have used Xerces (open source), MSXML, and roll-your-own parsers.
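The XML panels themselves did not survive transcription, so the following is only a minimal sketch of what such a request and reply could look like; every tag and attribute name here is hypothetical, not the actual DØ wire format:

    <!-- Hypothetical request: all tag and attribute names are illustrative -->
    <monitor_request>
      <client type="luminosity" node="d0l3mon2.fnal.gov">
        <item name="d0_luminosity"/>
      </client>
      <client type="l3xnode" node="any">    <!-- query every node running this client -->
        <item name="event_rate"/>           <!-- hypothetical item name -->
      </client>
    </monitor_request>

    <!-- Hypothetical reply: one block per responding (node, client) pair -->
    <monitor_reply>
      <client type="luminosity" node="d0l3mon2.fnal.gov">
        <item name="d0_luminosity" timestamp="...">42.0</item>  <!-- placeholder value -->
      </client>
    </monitor_reply>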
Performance
[Performance plots: "Many Copies of one Data Source" and "Complex Data"]
- 160 data sources, 30 displays
- Distributes 1 MB/second of monitor information
- 50% cache hit rate
- Uses 10% of a dual-CPU Linux node and 4% of its 512 MB of RAM

Getting Data Into the System
The monitor system is only as good as the data put into it. How hard it is to add a new data source or data item depends upon how much work one wants to do. There are C++ template container objects that allow one to monitor an item with a single declarative line. The following C++ creates a new monitor item called count and sets it to 5; any query from the monitor server for count will then return 5:

    l3_monitor_util_reporter_ACE monitor_callback("tester");
    l3_monitor_object_op counter("count");
    counter = 5;

We have similar code for Python as well.

Getting Data Out of the System
We also have code to get data out of the system, with frameworks for C++, Python, Java, HTTP, and C#. It is also possible to program directly against the TCP/IP protocol. The following Python code requests the count item created above (error checking removed):

    disp = l3xmonitor_util_module.monitor_display()
    test_item = disp.get_item("tester", "count")
    disp.query_monitor_server()
    print "Count value is %s" % test_item[0]

Data Flow in the Monitor Server
[Diagram: each display talks to its own display-handler thread; requests and replies pass through a single-threaded query processor backed by the data cache, which communicates with each data source through a dedicated source-handler thread.]

Security
The DØ online system is a critical system and is declared inaccessible from the outside. We petitioned for a single hole through the firewall and run a repeater on the other side. This allows us to view monitoring data offsite and has been crucial for debugging the L3/DAQ system.

Automatic DAQ Recovery
There are moments in the control room when shifters, addressing the same malfunction for the 10th time in a row, swear: "Why can't the computer do this?" The daqAI project attempts to address these sorts of problems.

daqAI: Monitor Data → Pattern Matching → Effect Changes
daqAI is a display of the monitor system. It uses a fact- and rule-based inference engine to detect problems, scanning the monitor data once a second. Based on what it sees and on the actions of the rules that fire, daqAI can issue simple commands to the run control system or speak to the shifters using a synthesized voice (see the CLIPS section below).

daqAI cycles once per second through a simple set of linear tasks:
Gather Monitor Data → Convert to Facts → Add Persistent and Timer Facts → Process All Rules until no more fire → Process new log messages, commands, and persistent facts.
The monitor data is converted to facts and the facts are passed to the expert system. Some rules, when they fire, call embedded procedures to request log entries, run timers, set persistent facts, or send a command to the DAQ system. After all the rules have run, these requests are acted upon. The expert system retains no memory of any previous run; it starts fresh each time. This was a design decision (it is also difficult to retract a fact). If a problem persists, the rule system will therefore try to fix it every second, even if the fix requires 4 or 5 seconds, so only new requests for commands or log messages are acted upon.
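A minimal sketch of this once-per-second cycle, in Python. The real daqAI embeds the CLIPS engine; every function and variable below is a hypothetical stand-in used only to illustrate the control flow:

    import time

    def gather_monitor_facts():
        # Stand-in for querying the monitor server and converting replies to
        # facts such as (l3xetg-n_runs 1) or (muon_ro_crate 20 "running OK").
        return [("l3xetg-n_runs", 1), ("muon_ro_crate", 20, "running OK")]

    def run_rules(facts):
        # Stand-in for the CLIPS engine: fires rules until none remain and
        # returns the requested actions (log entries, commands, persistent facts).
        actions = []
        if ("l3xetg-n_runs", 1) in facts:
            actions.append("log: DAQ is active")
        return actions

    acted_on = set()  # simplified: daqAI acts only on *new* requests each cycle
    while True:
        facts = gather_monitor_facts()   # gather monitor data, convert to facts
        for action in run_rules(facts):  # the engine starts fresh every cycle, so an
            if action not in acted_on:   # unfixed problem re-fires its rule each second
                acted_on.add(action)
                print(action)            # stand-in for logging / speech / DAQ command
        time.sleep(1.0)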
The daqAI rule script currently contains 77 rules. The rules assign DAQ dead time to approximately 8 different sources, and try to fix 5 of them. A single iteration takes about half a second of real time.

CLIPS: An Embeddable Expert System
Monitor data is used to define facts:

    (l3xetg-n_runs 1)
    (l3xetg-rate_events )
    (muon_ro_crate 20 "running OK")
    (muon_ro_crate 22 "running OK")

These facts then fire rules:

    (defrule daq_is_active "DAQ Configured?"
      (l3xetg-n_runs ?num&:(!= ?num 0))
      =>
      (assert (b_daq_is_active ?num)))

If a fact l3xetg-n_runs n, where n is not zero, is present, this rule fires and asserts a new fact, b_daq_is_active n, which can then be used in other rules.

A more complex rule:

    (defrule log_l2scli_missing "Missing Muon SLIC"
      (s_slic_missing_input ?ct ?channel)
      (not (problem_reason ?w))
      =>
      (log_reason "L2 SLIC Input Missing (crate 0x" (tohex ?ct) ", channel 0x" (tohex ?channel) ")")
      (talk "slick input missing. Muon Crate " (tohex ?ct) ".")
      (issue_coor_request "scl_init"))

If the s_slic_missing_input fact is present and the problem_reason fact is not, downtime is assigned to the muon SLIC, the control room is notified, and the DAQ system is sent an init command to resynchronize its timing.

daqAI Architecture
[Figure: daqAI architecture diagram]

Adding a New Problem to daqAI
Once a problem for daqAI to handle has been identified, the steps to implement it are fairly simple (a sketch of step 3 follows the list):
1. Make sure the monitor data is available to uniquely differentiate this problem. In our current system this often means adding new data sources or monitor items.
2. If the problem is to be fixed by daqAI, make sure the commands to effect the change are accessible. This is sometimes a hard pill for a detector group to swallow: giving up control to an automated system.
3. Write the CLIPS rules to identify the problem, assign a downtime reason, and, perhaps, issue commands to make the fix. Also communicate with the shifters; this is especially crucial when issuing commands.
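A hypothetical skeleton for step 3, following the pattern of the real rules shown above; the fact name s_my_new_problem, the downtime string, and the command are invented placeholders:

    ; Hypothetical skeleton: s_my_new_problem and its fix are placeholders.
    (defrule fix_my_new_problem "My New Problem"
      (s_my_new_problem ?crate)        ; fact derived from the new monitor data (step 1)
      (not (problem_reason ?w))        ; do nothing if downtime is already assigned
      =>
      (log_reason "My new problem (crate 0x" (tohex ?crate) ")")  ; assign a downtime reason
      (talk "my new problem. Crate " (tohex ?crate) ".")          ; tell the shifters
      (issue_coor_request "some_fix_command"))                    ; issue the fix (step 2)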
Assigning Down Time
daqAI was originally designed to fix the problems it found. Once it was running, however, it was realized that it could also classify downtime even when it did nothing to fix the cause. The result is shift reports like the one below:

    daqAI Shift report covering the period :00:00 to :00:00
    5 times 'Bad A/O BOT Rate in term(193): muon' took 00:13:33 (162.6 secs on avg)
    1 times 'Crate 0x67 is FEB' took 00:00:20 (20 secs on avg)
    1 times 'L2 Crate 0x21 lost sync' took 00:00:07 (7 secs on avg)
    2 times 'Muon Crate 0x30 has fatal error' took 00:00:25 (12.5 secs on avg)
    7 times 'Other' took 00:06:18 (54 secs on avg)

    Timer Status:
    Timer in_store:             06:36:34 (off)
    Timer l3_configured:        06:30:28 (off)  % of in_store
    Timer daq_configured:       06:22:43 (off)  % of in_store  % of l3_configured
    Timer good_data_flow_to_l3: 06:16:21 (off)  % of in_store  % of l3_configured  % of daq_configured

How Successful is daqAI?
daqAI helped increase the DØ DAQ live time from 75% to about 85%, mostly because it could detect and respond to frequent problems much faster than a human shifter. That is not to say there were no difficulties:
- Problem symptoms can change, which requires code updates.
- Each new problem requires significant changes to the rules code.
- There can be feedback loops, especially if a monitoring source is faulty.

Probably the most important lesson for an upcoming experiment is to build a centralized control and monitoring system from the start; that greatly aids the development of systems like this one. Problem identification may also benefit from an automatic data-classification algorithm. All the data is maintained, long term, in an Oracle DB… How long to the next CHEP?