The DØ Experiment and Run II at the FNAL Tevatron

The DØ experiment is one of two colliding-beam detectors located at the Fermi National Accelerator Laboratory. It sits at one of four collision points of the 1.96 TeV Tevatron proton-antiproton accelerator. The DØ detector underwent numerous upgrades to extend its physics reach before the start of Run II in March 2001. One of the major upgrades was the DAQ system, which was rebuilt from scratch; the projects described on this poster were spawned from that effort.

P33: DØ Online Monitoring and Automatic DAQ Recovery
B. Angstadt, G. Brooijmans, D. Chapin, D. Charak, M. Clements, S. Fuess, A. Haas, R. Hauser, D. Leichtman, S. Mattingly, A. Kulyavtsev, M. Mulders, P. Padley, D. Petravick, R. Rechenmacher, G. Watts, D. Zhang
DØ home page: http://www-d0.fnal.gov
L3 home page: http://www-d0online.fnal.gov/www/groups/l3daq/
daqAI home page: http://d0.epe.phys.washington.edu/Projects/daqAI/

The Data Format: XML
After some debate over binary versus text, we settled on text, and then on XML. The main reason was the ease with which new monitor items can be added to an already existing structure. The Monitor Item Request and Reply section below shows the simple XML request and reply formats. The system consists of three kinds of components: data sources, the monitor server, and displays.

What is a Monitor Item?
A monitor item is supplied by a data source and requested by a display. Each monitor item is uniquely addressed by the tuple (computer-name, client-type, item-name); the luminosity for the DØ experiment, for example, is (d0l3mon2.fnal.gov, luminosity, d0_luminosity). A client may run in many places: the l3xnode client runs as part of every Level 3 Trigger node, and so runs on almost 100 different nodes. The query format allows a display to request a single monitor item from all such clients at once. Monitor data can be of arbitrary complexity, formatted in XML or straight text; binary data is not supported. If XML control characters need to be included, the standard XML escapes can be used. We encourage the use of structured XML data, as automatic parsing tools can then extract the data without specialized encoding.

Online Monitoring
It was realized early on that monitoring the health of the new DAQ system would require a subsystem of its own. Once developed, it has slowly expanded to encompass other online subsystems. The design requirements were:
Extensibility: the system had to accommodate new monitor items and new data sources throughout its life without code changes.
Crash Recovery: any component (data source, display, or server) should be able to crash and restart without touching any other component in the system. This was a big aid during development!
Uniform Access: the wire protocol had to be uniform and simple so that it could slot into preexisting programs.
Scalability: large numbers of displays should not put undue load on the data sources. This is accomplished with caching in the server.
Offsite Access: all displays must be able to run both locally at Fermilab and remotely, which means dealing with both latency and security issues.

Automatic DAQ Recovery
There are moments in the control room when shifters, addressing the same malfunction for the tenth time in a row, swear, "Why can't the computer do this?" The daqAI project attempts to address these sorts of problems.

Monitor System Data Flow
1. The display forms an XML request containing a list of the monitor items and data sources it wants data from.
2. The Monitor Server looks in its internal cache for monitor data that is recent enough to satisfy the request. Only data not in the cache is requested in step 3.
3. The Monitor Server assembles an XML request for each data source involved. A request may ask for several data items; the requests are sent out in parallel.
4. Each Data Source generates an XML reply that contains all the requested data.
5. The Monitor Server caches all data in the replies and builds a complete reply for the display.
6. The Monitor Server sends the XML data back to the display, which parses the data and displays or otherwise processes it.

Monitor Item Request and Reply
The request is sent in XML. An outer XML wrapper contains one block per client-type (for example, monitor_server); each block can ask for data from any node that runs that client and lists the monitor items of interest. The reply XML is similar in structure to the request. Many standard packages can be used by a display to parse the monitor data; we have used Xerces (open source), MSXML, and roll-your-own parsers.
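To make the request/reply shape concrete, the Python sketch below builds a request for one monitor item and parses a reply with the standard library. The element and attribute names (monitor_request, client, item, value) and the sample luminosity number are invented for illustration; the actual DØ wire format is only sketched on the poster.

# Hypothetical request: ask every node running the "luminosity" client
# for the item "d0_luminosity".  All tag names here are illustrative.
request = (
    '<monitor_request>'
    '<client type="luminosity" node="*">'
    '<item name="d0_luminosity"/>'
    '</client>'
    '</monitor_request>'
)

# Parse a made-up reply with the standard library and pull out the values.
from xml.dom.minidom import parseString

reply = (
    '<monitor_reply>'
    '<client type="luminosity" node="d0l3mon2.fnal.gov">'
    '<item name="d0_luminosity"><value>23.7</value></item>'
    '</client>'
    '</monitor_reply>'
)

doc = parseString(reply)
for item in doc.getElementsByTagName("item"):
    name = item.getAttribute("name")
    value = item.getElementsByTagName("value")[0].firstChild.data
    print "%s = %s" % (name, value)   # Python 2 print, matching the poster's own example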
Performance
The poster's performance plots cover two tests: many copies of one data source, and complex data. In production the system serves 160 data sources and 30 displays, distributes 1 MB/second of monitor information with a 50% cache hit rate, and uses about 10% of a dual-CPU Linux node and 4% of its 512 MB of RAM.

Getting Data Into the System
The monitor system is only as powerful as the data in it is good. How hard it is to add a new data source or data item depends on how much work one wants to do. C++ template container objects allow one to monitor an item with a single declarative line. The following C++ creates a new monitor item called count and sets it to 5; any query from the monitor server for count will then return 5:

l3_monitor_util_reporter_ACE monitor_callback("tester");
l3_monitor_object_op counter("count");
counter = 5;

We have similar code for Python as well.

Getting Data Out of the System
We also have code to get data out of the system, with frameworks for C++, Python, Java, HTTP, and C#. It is also possible to program directly against the TCP/IP protocol. The following Python code requests the item created above (error checking removed):

disp = l3xmonitor_util_module.monitor_display()
test_item = disp.get_item("tester", "count")
disp.query_monitor_server()
print "Count value is %s" % test_item[0]

Data Flow in the Monitor Server
The poster's data-flow diagram shows the internal structure of the Monitor Server: each display is served by its own display handler thread and each data source by its own source handler thread, with a single-threaded query processor, backed by the data cache, routing requests and replies between them.
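The cache referred to in steps 2 and 5 of the data flow is what lets many displays share a few data sources. The class below is a minimal Python sketch of that idea, not the actual server code; fetch_from_sources and the two-second freshness window are assumptions made for illustration.

import time

class MonitorCache:
    """Sketch of the server-side cache (illustrative, not the DØ code).

    fetch_from_sources is a caller-supplied function taking a list of
    (client_type, item_name) keys and returning {key: value}; in the real
    server this would be the parallel XML queries to the data sources."""

    def __init__(self, fetch_from_sources, max_age=2.0):
        self.fetch = fetch_from_sources
        self.max_age = max_age        # seconds a cached value counts as fresh
        self.cache = {}               # key -> (timestamp, value)

    def query(self, keys):
        now = time.time()
        reply, stale = {}, []
        for key in keys:
            entry = self.cache.get(key)
            if entry is not None and now - entry[0] < self.max_age:
                reply[key] = entry[1]       # step 2: satisfied from the cache
            else:
                stale.append(key)           # step 3: must go to the data source
        if stale:
            for key, value in self.fetch(stale).items():
                self.cache[key] = (now, value)   # step 5: cache the new data
                reply[key] = value
        return reply                        # step 6: complete reply for the display

# Example with a stand-in for the parallel XML queries:
def fake_fetch(keys):
    return dict([(key, 42) for key in keys])

cache = MonitorCache(fake_fetch)
print cache.query([("luminosity", "d0_luminosity")])   # miss: goes to the source
print cache.query([("luminosity", "d0_luminosity")])   # hit: served from the cache

With many displays asking for the same popular items every few seconds, most requests never reach the data sources, which is consistent with the 50% cache hit rate quoted under Performance.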
Security
The DØ online system is a critical system, declared inaccessible from the outside. We petitioned for a single hole through the firewall and run a repeater on the other side. This allows us to view monitoring data offsite and has been crucial for debugging the L3/DAQ system.

daqAI Architecture
daqAI is simply another display of the monitor system. Its basic pattern is: gather monitor data, pattern-match against it, and effect changes. It uses a facts-and-rules inference engine to detect problems (see the CLIPS section below), scanning the monitor data once a second. Based on what it sees and the actions of the rules that fire, daqAI can issue simple commands to the run control system or speak to the shifters using a synthesized voice.

daqAI cycles once per second through a simple set of linear tasks: gather the monitor data, convert it to facts, add persistent and timer facts, process all rules until no more fire, then process the new log messages, commands, and persistent facts. The monitor data is converted to facts and the facts are passed to the expert system. Some rules, when they fire, call embedded procedures to request log entries, run timers, set persistent facts, or send a command to the DAQ system. After all the rules have been run, these requests are acted upon. The expert system retains no memory of any previous run; it starts fresh each time. This was a design decision (it is also difficult to retract a fact). As a result, if a problem occurs the rule system will try to fix it every second, even if the fix takes 4 or 5 seconds, so only new requests for commands or log messages are acted upon. The daqAI rule script currently contains 77 rules. The rules assign DAQ dead time to approximately 8 different sources and try to fix 5 of them. A single iteration takes about half a second of real time.

CLIPS: An Embeddable Expert System
daqAI's rules are written in CLIPS (http://www.ghg.net/clips/CLIPS.html). Monitor data is used to define facts, for example:

(l3xetg-n_runs 1)
(l3xetg-rate_events 82.677)
(muon_ro_crate 20 "running OK")
(muon_ro_crate 22 "running OK")
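Where the cycle above says the monitor data is converted to facts, the target is exactly this kind of fact string. The following Python sketch shows one way such a conversion could look, assuming the monitor data arrives as simple (name, values) pairs; the real daqAI conversion code is not shown on the poster.

def to_clips_facts(monitor_data):
    # Turn (name, list-of-values) pairs into CLIPS fact strings:
    # numbers are written bare, everything else is quoted, so
    # ("muon_ro_crate", [20, "running OK"]) becomes (muon_ro_crate 20 "running OK").
    facts = []
    for name, values in monitor_data:
        fields = []
        for v in values:
            if isinstance(v, (int, float)):
                fields.append(str(v))
            else:
                fields.append('"%s"' % v)
        facts.append("(%s %s)" % (name, " ".join(fields)))
    return facts

# The facts shown in the CLIPS panel, reproduced from (made-up) monitor input:
data = [("l3xetg-n_runs", [1]),
        ("l3xetg-rate_events", [82.677]),
        ("muon_ro_crate", [20, "running OK"]),
        ("muon_ro_crate", [22, "running OK"])]
for fact in to_clips_facts(data):
    print fact    # each string would then be asserted into the CLIPS engine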
These facts then fire rules:

(defrule daq_is_active "DAQ Configured?"
  (l3xetg-n_runs ?num&:(!= ?num 0))
  =>
  (assert (b_daq_is_active ?num)))

If a fact l3xetg-n_runs n is present with n not equal to zero, this rule fires and asserts a new fact, b_daq_is_active n, which can then be used in other rules.

(defrule log_l2scli_missing "Missing Muon SLIC"
  (s_slic_missing_input ?ct ?channel)
  (not (problem_reason ?w))
  =>
  (log_reason "L2 SLIC Input Missing (crate 0x" (tohex ?ct) ", channel 0x" (tohex ?channel) ")")
  (talk "slick input missing. Muon Crate " (tohex ?ct) ".")
  (issue_coor_request "scl_init"))

This is a more complex rule. If the s_slic_missing_input fact is present and the problem_reason fact is not, then downtime is assigned to the muon SLIC, the control room is notified, and the DAQ system is sent an init command to resync its timing.

Adding a New Problem to daqAI
Once a problem for daqAI to handle has been identified, the steps to implementing it are fairly simple:
1. Make sure the monitor data needed to uniquely identify this problem is available. In our current system this often means adding new data sources or monitor items.
2. If the problem is to be fixed by daqAI, make sure the commands to effect the change are accessible. This is sometimes a hard pill for a detector group to swallow: giving up control to an automated system.
3. Write the CLIPS rules to identify the problem, assign a downtime reason, and, perhaps, issue commands to make the fix. Also communicate with the shifters; this is especially crucial when issuing commands.

Assigning Down Time
daqAI was originally designed to fix the problems it found. Once it was running, however, it was realized that it could also classify downtime even when it did nothing to fix the cause. The result is shift reports like the one below:

daqAI Shift report covering the period 2003-03-19-16:00:00 to 2003-03-20-00:00:00
5 times 'Bad A/O BOT Rate in term(193): muon' took 00:13:33 (162.6 secs on avg)
1 times 'Crate 0x67 is FEB' took 00:00:20 (20 secs on avg)
1 times 'L2 Crate 0x21 lost sync' took 00:00:07 (7 secs on avg)
2 times 'Muon Crate 0x30 has fatal error' took 00:00:25 (12.5 secs on avg)
7 times 'Other' took 00:06:18 (54 secs on avg)
Timer Status:
Timer in_store: 06:36:34 (off)
Timer l3_configured: 06:30:28 (off), 98.4618% of in_store
Timer daq_configured: 06:22:43 (off), 96.5075% of in_store, 98.0152% of l3_configured
Timer good_data_flow_to_l3: 06:16:21 (off), 94.9021% of in_store, 96.3847% of l3_configured, 98.3365% of daq_configured

How Successful is daqAI?
daqAI helped increase the DØ DAQ live time from 75% to about 85%, mostly because it can detect and respond to frequent problems much faster than a human shifter. That is not to say there were no difficulties:
Problem symptoms can change, which requires code updates.
Each new problem requires significant changes to the rules code.
There can be feedback loops, especially if a monitoring source is faulty.

Probably the most important lesson for an upcoming experiment is to create a centralized control and monitoring system; this will greatly aid in developing systems like this one. Problem identification may also benefit from an automatic data classification algorithm. All the data is maintained, long term, in an Oracle DB... How long to the next CHEP?

