ALICE Week 17.11.99 Technical Board TPC Intelligent Readout Architecture Volker Lindenstruth Universität Heidelberg.


ALICE Week Technical Board TPC Intelligent Readout Architecture Volker Lindenstruth Universität Heidelberg

Volker Lindenstruth, November 1999
What's new?
- TPC occupancy is much higher than originally assumed
- New trigger detector: TRD
- TPC selective readout becomes relevant for the first time
- New readout/L3 architecture
- No intermediate buses and buffer memories: use PCI and local memory instead
- New dead-time / throttling architecture

Volker Lindenstruth, November 1999
TRD/TPC Overall Timeline
[Timeline figure, time axis in µs: event; TRD pretrigger; trigger at TPC (gate opens); TEC drift with data sampling and linear fit; end of TEC drift; track segment processing; track matching; TRD trigger at L1; data shipping off detector.]

Volker Lindenstruth, November 1999
TPC L3 Trigger and Processing
[Dataflow diagram: the global trigger (other trigger detectors, TRD) issues L0pre, L0, L1 and L2; the TPC is triggered and read out sector-parallel at ~2 kHz, or the event is rejected. The TRD trigger ships TRD e+/e- tracks at L2; these tracks plus RoI seeds drive the tracking of e+/e- candidates inside the TPC, the verification of the e+/e- hypothesis at L2, and the selection of regions of interest. The front-end/trigger ships zero-suppressed TPC data (144 links, ~60 MB/event) to the TPC intelligent readout, with conical zero-suppressed readout around the RoIs. The intelligent readout performs on-line data reduction (tracking, reconstruction, partial readout, data compression) and passes track segments and space points to the DAQ.]

Volker Lindenstruth, November 1999
Architecture from the Technical Proposal
[Block diagram: detector front-end electronics (TPC, ITS, PHOS, PID, TRIG) send data over DDL to FEDCs/LDCs, which connect through a switch to GDCs and permanent data storage (PDS); an EDM steers event building; L0, L1 and L2 triggers with BUSY and trigger-data paths. Quoted figures: event rates of 50 Hz central + 1 kHz dimuon for Pb-Pb and 550 Hz for p-p; data rates of 2500 MB/s (Pb-Pb) after L1 and 1250 MB/s (Pb-Pb) after L2.]

Volker Lindenstruth, November 1999
Some Technology Trends
DRAM generations (year / size / cycle time):
  …   … Kb   250 ns
  …   … Kb   220 ns
  …   … Mb   190 ns
  …   … Mb   165 ns
  …   … Mb   145 ns
  …   … Mb   120 ns
  …   …      …
Capacity vs. speed (latency):
  Logic: 2x in 3 years (capacity), 2x in 3 years (speed)
  DRAM:  4x in 3 years (capacity), 2x in 15 years (speed)
  Disk:  4x in 3 years (capacity), 2x in 10 years (speed)
  That is roughly 1000:1 in capacity against 2:1 in speed!
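The 1000:1 versus 2:1 contrast follows from compounding these rates; a short check (the 15-year horizon is an assumption chosen to match the DRAM speed figure above):

\[
\underbrace{4^{15/3} = 4^{5} = 1024 \approx 1000}_{\text{DRAM capacity growth}}
\qquad\text{vs.}\qquad
\underbrace{2^{15/15} = 2}_{\text{DRAM speed growth}}
\]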

Volker Lindenstruth, November 1999
Processor-DRAM Memory Gap
[Plot (Dave Patterson, UC Berkeley): performance vs. time since 1982. Processor performance ("Moore's Law") grows ~60%/yr (2x every 1.5 years), DRAM performance ~6%/yr (2x every 15 years); the processor-memory performance gap grows by ~50% per year.]

Volker Lindenstruth, November 1999
Testing the uniformity of memory

#include <stdio.h>
#include <sys/time.h>

#define SIZE_MIN   1024             /* illustrative values; the actual       */
#define SIZE_MAX   (16*1024*1024)   /* constants are not given on the slide  */
#define ITERATIONS 4

static double get_seconds(void)     /* wall-clock time in seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static int array[SIZE_MAX];

int main(void)
{
    int size, stride, index, limit, i, iterations, iter;
    double sec, sec0;

    /* Vary the size of the array, to determine the size of the cache or the
       amount of memory covered by the TLB entries. */
    for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
        /* Vary the stride at which we access elements,
           to determine the line size and the associativity. */
        for (stride = 1; stride <= size; stride *= 2) {
            /* Do the following test multiple times so that the granularity of
               the timer is better and the start-up effects are reduced. */
            sec = 0; iter = 0;
            limit = size - stride + 1;
            iterations = ITERATIONS;
            do {
                sec0 = get_seconds();
                for (i = iterations; i; i--)
                    /* The main loop: does a read and a write
                       at various memory locations. */
                    for (index = 0; index < limit; index += stride)
                        *(array + index) += 1;
                sec += (get_seconds() - sec0);
                iter += iterations;
                iterations *= 2;
            } while (sec < 1);
            printf("size %d stride %d: %g ns/access\n", size, stride,
                   1e9 * sec / ((double)iter * ((limit + stride - 1) / stride)));
        }
    }
    return 0;
}

[Sketch: the inner loop steps through the address range 0 … size in units of the stride.]

Volker Lindenstruth, November 1999
… MHz Pentium MMX
[Plot of the memory test on a Pentium MMX; characteristic values read off the curves: access times of 2.7 ns, 95 ns and 190 ns, characteristic strides of 32 bytes and 4094 bytes.]
L1 instruction cache: 16 kB; L1 data cache: 16 kB (4-way associative, 16-byte line); L2 cache: 512 kB (unified); MMU: 32 I / 64 D TLB entries (4-way associative).

Volker Lindenstruth, November 1999
… MHz Pentium MMX, L2 cache off / all caches off
[Plots of the same memory test with the L2 cache disabled and with all caches disabled.]

Volker Lindenstruth, November 1999
Comparison of two supercomputers: HP V-Class (PA-8x00) vs. SUN E10k (UltraSparc II)
SUN E10k (UltraSparc II): L1 instruction cache 16 kB; L1 data cache 16 kB (write-through, non-allocating, direct-mapped, 32-byte line); L2 cache 512 kB (unified); MMU: 2x64-entry fully associative TLB.
HP V-Class (PA-8x00): L1 instruction cache 512 kB; L1 data cache 1024 kB (4-way associative, 16-byte line); MMU: 160-entry fully associative TLB.

Volker Lindenstruth, November 1999
The LogP Model
[Diagram: P processor/memory nodes with NICs (network interface cards) connected by an interconnection network; each send and receive costs the overhead o, consecutive messages are separated by the gap g, and the network adds the latency L.]
- L: time a packet travels in the network from sender to receiver (latency)
- o: CPU overhead to send or receive a message
- g: shortest time between two sent (or received) messages
- P: number of processors
The number of messages in flight, and thus the aggregate throughput, is limited by L/g.
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPoPP, May 1993.
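As an illustration of how the LogP parameters combine, a minimal sketch (not from the slides; the parameter values in main() are made-up examples, not measurements):

#include <stdio.h>

/* LogP parameters, all times in microseconds. */
typedef struct { double L, o, g; int P; } logp_t;

/* Time for one short message: send overhead + network latency + receive overhead. */
static double one_message(logp_t m) { return m.o + m.L + m.o; }

/* Time for a pipelined burst of n messages between two nodes:
   after the first message, a new one can be injected every max(o, g). */
static double burst(logp_t m, int n)
{
    double step = (m.o > m.g) ? m.o : m.g;
    return one_message(m) + (n - 1) * step;
}

int main(void)
{
    /* Illustrative numbers only. */
    logp_t m = { .L = 30.0, .o = 10.0, .g = 12.0, .P = 2 };

    printf("single message latency : %.1f us\n", one_message(m));
    printf("1000-message burst     : %.1f us\n", burst(m, 1000));
    /* At most ~L/g messages can be in flight per node, bounding the throughput. */
    printf("messages in flight     : %.1f\n", m.L / m.g);
    return 0;
}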

Volker Lindenstruth, November 1999
2-Node Ethernet Cluster
[Figure (source: Intel): Fast Ethernet (100 Mb/s), Gigabit Ethernet, and Gigabit Ethernet with carrier extension.]
Test setup: 2 SUN Ultra 450 servers with 1 CPU each, SUN Gigabit Ethernet PCI cards (IP 2.0).
- The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away.
- Processor utilization: sender 40%, receiver 60%!
- Throughput approx. 160 Mbit/s!
- The net throughput increases if the receiver is a dual-processor machine.
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.
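A minimal sketch of the receiver ("throw the data away") side of such a TCP throughput test, assuming a POSIX sockets environment; the port number and buffer size are arbitrary choices, not values from the slide:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define PORT 5001            /* arbitrary test port       */
#define BUF  (256 * 1024)    /* large receive buffer      */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    static char buf[BUF];
    struct sockaddr_in addr;
    int srv = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);

    int conn = accept(srv, NULL, NULL);   /* wait for the sender */
    double t0 = now(), total = 0.0;
    ssize_t n;

    /* Receive and discard everything until the sender closes the connection. */
    while ((n = read(conn, buf, sizeof buf)) > 0)
        total += n;

    double dt = now() - t0;
    printf("%.1f MB in %.1f s = %.1f Mbit/s\n",
           total / 1e6, dt, 8.0 * total / dt / 1e6);
    close(conn);
    close(srv);
    return 0;
}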

Volker Lindenstruth, November 1999
First Conclusions / Outlook
- Memory bandwidth is the limiting and determining factor; moving data requires significant memory bandwidth.
- The number of TPC data links dropped from 528 to 180.
- Aggregate data rate per link: ~ … Hz.
- The TPC has the highest processing requirements; the majority of the TPC computation can be done on a per-sector basis.
- Keep the number of CPUs that process one sector in parallel to a minimum. Today this number is 5 due to the TPC granularity, so try to get the sector data directly into one processor.
- Selective readout of TPC sectors can reduce the data rate requirement by a factor of at least 2-5.
- The overall complexity of the L3 processor can be reduced by using PCI-based receiver modules that deliver the data straight into host memory, eliminating the need for VME crates to combine the data from multiple TPC links.
- DATE already uses a GSM paradigm as memory pool, so no software changes are needed.

Volker Lindenstruth, November 1999
PCI Receiver Card Architecture
[Block diagram: an optical receiver feeds a data FIFO and a multi-event buffer; an FPGA implements a push readout, writing data and pointers over 64-bit/66 MHz PCI through the PCI host bridge directly into host memory.]
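To illustrate the push-readout idea from the host side, a sketch under stated assumptions: the structure layout, field names and ring depth below are hypothetical and not taken from the actual RORC design.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of one entry the card pushes into host memory:
   the FPGA DMAs the event fragment into a pre-allocated buffer and then
   writes a descriptor with its location and size. */
typedef struct {
    uint64_t buffer_offset;   /* offset of the fragment in the multi-event buffer */
    uint32_t length;          /* fragment size in bytes                            */
    uint32_t event_id;        /* event number assigned by the front-end            */
    volatile uint32_t ready;  /* set to 1 by the card after the DMA has completed  */
} descriptor_t;

#define N_DESC 128            /* depth of the descriptor ring (assumption) */

/* The host polls the descriptor ring in its own memory; no read across the
   PCI bus is needed, which is the point of the push architecture. */
void poll_ring(descriptor_t ring[N_DESC], uint8_t *event_buffer)
{
    unsigned tail = 0;
    for (;;) {
        descriptor_t *d = &ring[tail];
        if (!d->ready)
            continue;                      /* nothing new pushed yet */
        uint8_t *fragment = event_buffer + d->buffer_offset;
        printf("event %u: %u bytes at %p\n",
               (unsigned)d->event_id, (unsigned)d->length, (void *)fragment);
        /* ... hand the fragment to L3/DAQ processing here ... */
        d->ready = 0;                      /* return the slot to the card */
        tail = (tail + 1) % N_DESC;
    }
}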

Volker Lindenstruth, November 1999
PCI Readout of one TPC Sector
Each TPC sector is read out via four optical links, which are fed by a small derandomizing buffer in the TPC front-end. The optical PCI receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house. In the case of L3 processing, the COTS receiver processor performs any necessary hit-level functionality on the data. The receiver processor can also perform lossless compression and simply forward the data to the DAQ, implementing the TP baseline functionality. The receiver processor is much less expensive than any crate-based solution.

Volker Lindenstruth, November 1999
Overall TPC Intelligent Readout Architecture
[Architecture diagram: the 36 TPC sectors, the inner tracking system, the photon spectrometer, the particle identification detectors and the muon tracking chambers send front-end data over DDL into RORC receiver cards sitting in LDC/L3CPU and LDC/FEDC nodes (each a PCI host with memory, CPU and NIC); the trigger detectors (micro channel plate, zero-degree calorimeter, muon trigger chambers, transition radiation detector) drive the L0/L1/L2 triggers, with trigger-data, trigger-decision and detector-busy paths and an EDM; an L3 matrix/network connects the sector nodes to a farm of GDC/L3CPU nodes and permanent data storage (PDS) in the computer center.]
- Each TPC sector forms an independent sector cluster.
- The sector clusters merge through a cluster interconnect/network into a global processing cluster.
- The aggregate throughput of this network can be scaled up to beyond 5 GB/s at any point in time, allowing a fall-back to simple lossless binary readout.
- All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time.
- All processing elements can be replaced and upgraded at any point in time.
- The network is commercial.
- The resulting multiprocessor cluster is generic and can also be used as an off-line farm.

Volker Lindenstruth, November 1999
Dead Time / Flow Control
[Diagram: the TPC FEE buffer (8 black events) feeds over the optical link into the receiver board (RcvBd, PCI, NIC); the TPC receiver buffer holds > 100 events; an event-receipt daisy chain runs through the receiver nodes.]
Scenario I
- The TPC dead time is determined centrally.
- For every TPC trigger a counter is incremented.
- For every completely received event the last receiver module produces a message (a single-bit pulse), which is forwarded through all nodes once they too have received the event.
- The event-receipt pulse decrements the counter.
- The counter reaching a count of 7 asserts the TPC dead time (there could be another event already in the queue).
Scenario II
- The TPC dead time is determined centrally from rates, assuming worst-case event sizes.
- Overflow protection for the FEE buffers: assert TPC BUSY if 7 events arrive within 50 ms (assuming 120 MB/event, 1 Gbit links).
- Overflow protection for the receiver buffers: ~100 events in 1 second, or a high-water mark in any receiver buffer (preferred way): high-water mark reached - send XOFF; low-water mark reached - send XON (see the sketch after this list).
- No need for reverse flow control on the optical link.
- No need for dead-time signalling at the TPC front-end.
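A minimal sketch of the two protection mechanisms (central dead-time counter and watermark-based XON/XOFF). The buffer depth and the count-of-7 threshold come from the slide; the watermark positions and all function names are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* ---- Scenario I: central dead-time counter ------------------------------ */
#define FEE_BUFFER_DEPTH 8          /* FEE buffer holds 8 black events        */
#define BUSY_THRESHOLD   7          /* leave one slot for an event in flight  */

static int events_in_flight = 0;    /* triggers sent minus event receipts     */

static bool tpc_busy(void) { return events_in_flight >= BUSY_THRESHOLD; }

static void on_tpc_trigger(void)         { events_in_flight++; } /* per trigger    */
static void on_event_receipt_pulse(void) { events_in_flight--; } /* daisy chain    */

/* ---- Scenario II: receiver-buffer watermarks ---------------------------- */
#define HIGH_WATER 90               /* assumed watermark positions            */
#define LOW_WATER  50               /* (receiver buffer > 100 events)         */

static bool xoff_sent = false;

static void on_receiver_occupancy(int buffered_events)
{
    if (!xoff_sent && buffered_events >= HIGH_WATER) {
        puts("send XOFF");          /* stop the data source */
        xoff_sent = true;
    } else if (xoff_sent && buffered_events <= LOW_WATER) {
        puts("send XON");           /* resume */
        xoff_sent = false;
    }
}

int main(void)                      /* tiny demonstration of both mechanisms */
{
    for (int i = 0; i < FEE_BUFFER_DEPTH; i++) {
        on_tpc_trigger();
        printf("trigger %d: busy=%d\n", i + 1, tpc_busy());
    }
    on_event_receipt_pulse();
    printf("after one receipt: busy=%d\n", tpc_busy());

    on_receiver_occupancy(95);      /* crosses the high-water mark -> XOFF    */
    on_receiver_occupancy(40);      /* falls below the low-water mark -> XON  */
    return 0;
}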

Volker Lindenstruth, November 1999
Summary
- Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail.
- Do not move data unless required: moving data costs money (except for some granularity effects).
- The overall complexity can be reduced by using PCI-based receiver modules that deliver the data straight into host memory, eliminating the need for VME.
- General-purpose COTS processors are less expensive than any crate-based solution.
- An FPGA-based PCI receiver card prototype has been built; the NT driver is completed and the Linux driver is almost completed.
- The DDL is already planned as a PCI version.
- No reverse flow control is required for the DDL.
- The DDL URD is to be revised by the collaboration as soon as possible.
- No dead time or throttling needs to be implemented at the front-end.
- Two scenarios exist for implementing it for the TPC at the back-end without additional cost.