1
ALICE Week 17.11.99, Technical Board
TPC Intelligent Readout Architecture
Volker Lindenstruth, Universität Heidelberg
2
Volker Lindenstruth, November 1999
What's new?
- TPC occupancy is much higher than originally assumed
- New trigger detector: TRD
- TPC selective readout becomes relevant for the first time
- New readout/L3 architecture
- No intermediate buses and buffer memories - use PCI and local memory instead
- New dead-time/throttling architecture
3
Volker Lindenstruth, November 1999
TRD/TPC Overall Timeline
[Timeline diagram, time axis 0-5 µs: event; TRD pretrigger; trigger at TPC (gate opens); TEC drift with data sampling and linear fit; end of TEC drift; track segment processing; track matching; TRD trigger at L1; data shipping off detector]
4
Volker Lindenstruth, November 1999
TPC L3 Trigger and Processing
[Data-flow diagram: front-end/trigger, TPC intelligent readout, and DAQ]
- Global trigger (L0pre, L0, L1, L2) driven by the TRD and the other trigger detectors; TPC triggered and read out at ~2 kHz.
- L0: reject event. L1: TRD trigger, ship TRD e+/e- tracks. L2: verify the e+/e- hypothesis, select regions of interest.
- TPC front-end ships zero-suppressed data, sector-parallel, over 144 links at ~60 MB/event; conical zero-suppressed readout of the regions of interest.
- TPC intelligent readout: tracking of e+/e- candidates inside the TPC, seeded by the TRD e+/e- tracks plus the RoIs; output are track segments and space points.
- DAQ: on-line data reduction (tracking, reconstruction, partial readout, data compression).
5
Volker Lindenstruth, November 1999
Architecture from the TP
[Block diagram: detectors (TPC, ITS, PHOS, PID, TRIG) -> FEE -> DDL -> FEDC/LDC -> switch -> GDC -> PDS; STL, EDM; trigger levels L0/L1/L2 with trigger data, BUSY]
- Event rate: 10^4 Hz Pb-Pb / 10^5 Hz p-p
- L1 trigger: 10^3 Hz Pb-Pb / 10^4 Hz p-p; BUSY after 1.5-2 µs
- L2 trigger: 50 Hz central + 1 kHz dimuon Pb-Pb / 550 Hz p-p; 10-100 µs
- Data rates: 2500 MB/s Pb-Pb / 20 MB/s p-p into the LDCs; 1250 MB/s Pb-Pb / 20 MB/s p-p to PDS
6
Volker Lindenstruth, November 1999
Some Technology Trends

DRAM generations:
  Year   Size     Cycle time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

Capacity vs. speed (latency):
  Logic:  2x in 3 years   2x in 3 years
  DRAM:   4x in 3 years   2x in 15 years
  Disk:   4x in 3 years   2x in 10 years
  -> 1000:1 in capacity vs. only 2:1 in speed
7
Volker Lindenstruth, November 1999
Processor-DRAM Memory Gap
[Plot (Dave Patterson, UC Berkeley): relative performance vs. year, 1980-2000, log scale. CPU performance ("Moore's Law") grows ~60%/yr (2x/1.5 yr), DRAM performance ~6%/yr (2x/15 yrs); the processor-memory performance gap grows ~50% per year.]
8
Volker Lindenstruth, November 1999
Testing the uniformity of memory

// Vary the size of the array, to determine the size of the cache or the
// amount of memory covered by TLB entries.
for (size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
    // Vary the stride at which we access elements,
    // to determine the line size and the associativity.
    for (stride = 1; stride <= size; stride *= 2) {
        // Do the following test multiple times so that the granularity of the
        // timer is better and the start-up effects are reduced.
        sec = 0;
        iter = 0;
        limit = size - stride + 1;
        iterations = ITERATIONS;
        do {
            sec0 = get_seconds();
            for (i = iterations; i; i--)
                // The main loop.
                // Does a read and a write from various memory locations.
                for (index = 0; index < limit; index += stride)
                    *(array + index) += 1;
            sec += (get_seconds() - sec0);
            iter += iterations;
            iterations *= 2;
        } while (sec < 1);
    }
}

[Diagram: an array of length size is traversed at address increments of stride in each iteration]
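For reference, a self-contained harness around this loop might look as follows. This is a minimal sketch: the get_seconds() helper based on POSIX gettimeofday(), the array bounds (in elements), and the per-access reporting are assumptions added here, not part of the slide.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE_MIN   1024             /* assumed lower bound, in array elements */
#define SIZE_MAX   (8*1024*1024)    /* assumed upper bound, in array elements */
#define ITERATIONS 16               /* assumed initial repeat count */

/* Wall-clock time in seconds, using POSIX gettimeofday(). */
static double get_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    int *array = calloc(SIZE_MAX, sizeof *array);
    if (!array) return 1;

    for (long size = SIZE_MIN; size <= SIZE_MAX; size *= 2) {
        for (long stride = 1; stride <= size; stride *= 2) {
            double sec = 0.0;
            long iter = 0, iterations = ITERATIONS;
            long limit = size - stride + 1;
            do {
                double sec0 = get_seconds();
                for (long i = iterations; i; i--)
                    for (long index = 0; index < limit; index += stride)
                        array[index] += 1;        /* one read + one write */
                sec += get_seconds() - sec0;
                iter += iterations;
                iterations *= 2;
            } while (sec < 1.0);

            /* Average time per access, in nanoseconds. */
            double accesses = (double)iter * ((limit + stride - 1) / stride);
            printf("size %8ld stride %8ld : %6.1f ns/access\n",
                   size, stride, 1e9 * sec / accesses);
        }
    }
    free(array);
    return 0;
}

Plotting the ns/access figure against size and stride reveals the cache sizes, line sizes and TLB reach, as in the measurements on the following slides.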
9
Volker Lindenstruth, November 1999
360 MHz Pentium MMX
[Memory-access measurement: latencies of 2.7 ns, 95 ns and 190 ns; characteristic sizes of 32 bytes and 4094 bytes]
L1 instruction cache: 16 kB
L1 data cache: 16 kB (4-way associative, 16-byte line)
L2 cache: 512 kB (unified)
MMU: 32 I / 64 D TLB entries (4-way associative)
10
Volker Lindenstruth, November 1999
360 MHz Pentium MMX
[Same memory-access measurement repeated with the L2 cache off and with all caches off]
11
Volker Lindenstruth, November 1999
Comparison of Two Supercomputers

SUN E10k (UltraSparc II):
  L1 instruction cache: 16 kB
  L1 data cache: 16 kB (write-through, non-allocating, direct-mapped, 32-byte line)
  L2 cache: 512 kB (unified)
  MMU: 2x64-entry fully associative TLB

HP V-Class (PA-8x00):
  L1 instruction cache: 512 kB
  L1 data cache: 1024 kB (4-way associative, 16-byte line)
  MMU: 160-entry fully associative TLB
12
Volker Lindenstruth, November 1999
LogP
[Model diagram: P processors with memory, connected through network interface cards (NICs) to the interconnection network]
- L: latency - time a packet needs in the network from sender to receiver
- o: overhead - CPU time to send or receive a message
- g: gap - shortest time between two consecutive sends or receives
- P: number of processors
Capacity limit: at most L/g messages can be in transit at a time (bounds the aggregate throughput).
Culler et al., "LogP: Towards a Realistic Model of Parallel Computation", PPoPP, May 1993
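To illustrate how the four parameters combine (not on the slide), here is a small LogP cost estimate in C for n back-to-back messages; the parameter values are made-up examples, not measurements.

#include <stdio.h>

/* LogP cost sketch: the sender is busy o seconds per message and can
 * inject a new message at most every g seconds; the last message still
 * needs L seconds of network time plus o seconds of receive overhead.
 * Textbook-style estimate, not a measurement. */
static double logp_send_time(int n, double L, double o, double g)
{
    double inject_interval = (g > o) ? g : o;   /* pipelining limited by max(o, g) */
    return (n - 1) * inject_interval + o + L + o;
}

int main(void)
{
    /* Hypothetical parameters for a Gigabit-class network. */
    double L = 10e-6, o = 5e-6, g = 8e-6;
    for (int n = 1; n <= 1000; n *= 10)
        printf("%4d messages: %8.1f us\n", n, 1e6 * logp_send_time(n, L, o, g));
    return 0;
}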
13
Volker Lindenstruth, November 1999
2-Node Ethernet Cluster (source: Intel)
[Plot: throughput of Fast Ethernet (100 Mb/s), Gigabit Ethernet, and Gigabit Ethernet with carrier extension]
Test setup:
- 2 SUN Ultra 450 servers, 1 CPU each, SUN Gigabit Ethernet PCI card, IP 2.0
- The sender produces a TCP data stream with large data buffers; the receiver simply throws the data away
Results:
- Processor utilization: sender 40%, receiver 60%
- Throughput about 160 Mbit/s
- Net throughput increases if the receiver is a dual-processor machine
Why is the TCP/IP Gigabit Ethernet performance so much worse than what is theoretically possible?
Note: CMS implemented their own proprietary network API for Gigabit Ethernet and Myrinet.
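The test described above is essentially a memory-to-memory TCP benchmark. A minimal sketch of such a sender/sink pair using the POSIX sockets API is shown below; the port number, buffer size and transfer volume are arbitrary choices, and error handling is omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define PORT   5001            /* arbitrary test port */
#define BUFSZ  (256 * 1024)    /* large application buffer, as on the slide */

/* Receiver: accept one connection and throw all data away. */
static void sink(void)
{
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_addr.s_addr = htonl(INADDR_ANY),
                             .sin_port = htons(PORT) };
    bind(ls, (struct sockaddr *)&a, sizeof a);
    listen(ls, 1);
    int s = accept(ls, NULL, NULL);
    char *buf = malloc(BUFSZ);
    while (read(s, buf, BUFSZ) > 0)
        ;                       /* discard everything */
    close(s); close(ls); free(buf);
}

/* Sender: stream a fixed amount of data with large writes. */
static void source(const char *host, long long total_bytes)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(PORT) };
    inet_pton(AF_INET, host, &a.sin_addr);
    connect(s, (struct sockaddr *)&a, sizeof a);
    char *buf = calloc(1, BUFSZ);
    for (long long sent = 0; sent < total_bytes; sent += BUFSZ)
        write(s, buf, BUFSZ);   /* short-write handling omitted */
    close(s); free(buf);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        source(argv[1], 1LL << 30);   /* send 1 GB to the given receiver */
    else
        sink();
    return 0;
}

Timing the transfer on the sender and watching CPU load on both nodes reproduces the kind of numbers quoted above.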
14
Volker Lindenstruth, November 1999
First Conclusions - Outlook
- Memory bandwidth is the limiting and determining factor; moving data requires significant memory bandwidth.
- The number of TPC data links dropped from 528 to 180.
- Aggregate data rate per link ~34 MB/s @ 100 Hz (see the worked estimate below).
- The TPC has the highest processing requirements; the majority of the TPC computation can be done on a per-sector basis.
- Keep the number of CPUs that process one sector in parallel to a minimum; today this number is 5 due to the TPC granularity -> try to get the sector data directly into one processor.
- Selective readout of TPC sectors can reduce the data rate requirement by a factor of at least 2-5.
- The overall complexity of the L3 processor can be reduced by using PCI-based receiver modules that deliver the data straight into host memory, eliminating the need for VME crates combining the data from multiple TPC links.
- DATE already uses a GSM paradigm as memory pool - no software changes.
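For orientation, the per-link figure follows directly from the earlier numbers: ~60 MB/event spread over 180 links gives ~0.33 MB per link per event, and at 100 Hz this is ~33 MB/s per link, consistent with the ~34 MB/s quoted above.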
15
Volker Lindenstruth, November 1999
PCI Receiver Card Architecture
[Block diagram: optical receiver and data FIFO feed an FPGA with a PCI 66/64 interface; the card pushes events over PCI through the PCI host bridge into a multi-event buffer in host memory (push readout, managed via pointers)]
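As an illustration of the push-readout idea (pointers exchanged through host memory), a hypothetical descriptor layout is sketched below; the structures and field names are assumptions for clarity, not the actual RORC interface.

#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256

struct free_page {              /* written by the host, read by the card */
    uint64_t phys_addr;         /* physical address of a free page */
    uint32_t length;            /* usable bytes in that page */
};

struct completion {             /* written by the card, read by the host */
    uint64_t phys_addr;         /* page that now holds (part of) an event */
    uint32_t bytes_used;
    uint32_t event_id;
    uint32_t end_of_event;      /* nonzero on the last page of an event */
};

struct readout_rings {
    struct free_page  free_ring[RING_SIZE];
    struct completion done_ring[RING_SIZE];
    volatile uint32_t free_head, free_tail;   /* host produces, card consumes */
    volatile uint32_t done_head, done_tail;   /* card produces, host consumes */
};

/* Host side: poll for completed pages and hand them to event building. */
static int poll_completions(struct readout_rings *r, struct completion *out)
{
    if (r->done_tail == r->done_head)
        return 0;                               /* nothing new yet */
    *out = r->done_ring[r->done_tail % RING_SIZE];
    r->done_tail++;
    return 1;
}

The key point of the design is that the card writes directly into host memory and only pointers are exchanged, so no intermediate buffer memory or bus is needed.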
16
Volker Lindenstruth, November 1999
PCI Readout of One TPC Sector
Each TPC sector is read out via four optical links, which are fed by a small derandomizing buffer in the TPC front-end.
The optical PCI receiver modules mount directly in a commercial off-the-shelf (COTS) receiver computer in the counting house.
The COTS receiver processor performs any necessary hit-level functionality on the data in the case of L3 processing.
The receiver processor can also perform lossless compression and simply forward the data to DAQ, implementing the TP baseline functionality.
The receiver processor is much less expensive than any crate-based solution.
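A sketch of the receiver node's event loop under the two modes described above; the function names (get_next_event, run_hit_finder, compress_lossless, forward_to_daq) are placeholders for whatever the real DATE/L3 interfaces provide, not actual APIs.

#include <stddef.h>

struct event { const void *data; size_t size; };

/* Placeholder interfaces (assumed names, to be provided by DATE / L3 code). */
int  get_next_event(struct event *ev);
void run_hit_finder(const struct event *in, struct event *out);
void compress_lossless(const struct event *in, struct event *out);
void forward_to_daq(const struct event *ev);

enum mode { BASELINE_READOUT, L3_PROCESSING };

void receiver_loop(enum mode m)
{
    struct event ev, out;
    while (get_next_event(&ev)) {          /* event pushed into host memory by the RORC */
        if (m == L3_PROCESSING)
            run_hit_finder(&ev, &out);     /* hit-level processing for L3 */
        else
            compress_lossless(&ev, &out);  /* TP baseline: lossless compression only */
        forward_to_daq(&out);              /* ship the result to the event builder */
    }
}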
17
Volker Lindenstruth, November 1999
Overall TPC Intelligent Readout Architecture
[Block diagram: the 36 TPC sectors, the Inner Tracking System, the Photon Spectrometer, Particle Identification and the Muon Tracking Chambers feed their FEE over DDLs into PCI receiver cards (RORC) hosted in LDC/L3CPU and LDC/FEDC nodes (PCI, memory, CPU, NIC); these connect through a switch and the L3 matrix to a farm of GDC/L3CPU nodes and the PDS in the computer center. The trigger system (L0, L1, L2, EDM) distributes trigger data and trigger decisions and collects detector-busy signals; trigger detectors: micro channel plate, zero-degree calorimeter, muon trigger chambers, transition radiation detector.]
- Each TPC sector forms an independent sector cluster.
- The sector clusters merge through a cluster interconnect/network into a global processing cluster.
- The aggregate throughput of this network can be scaled up beyond 5 GB/s at any time, allowing a fall-back to simple lossless binary readout (see the estimate below).
- All nodes in the cluster are generic COTS processors, which are acquired at the latest possible time.
- All processing elements can be replaced and upgraded at any time.
- The network is commercial.
- The resulting multiprocessor cluster is generic and can also be used as an off-line farm.
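For orientation, full lossless binary readout of all 180 TPC links at the ~34 MB/s per link quoted earlier amounts to roughly 180 × 34 MB/s ≈ 6 GB/s, which is why the cluster interconnect must be scalable to beyond 5 GB/s.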
18
Volker Lindenstruth, November 1999
Dead Time / Flow Control
[Diagram: TPC FEE buffer (8 black events) -> optical link -> receiver board (RcvBd, PCI, NIC) -> TPC receiver buffer (> 100 events); event-receipt daisy chain]
Scenario I (see the sketch below):
- TPC dead time is determined centrally.
- For every TPC trigger a counter is incremented.
- For every completely received event the last receiver module produces a message (single-bit pulse), which is forwarded through all nodes after they too have received the event.
- The event-receipt pulse decrements the counter.
- When the counter reaches 7, TPC dead time is asserted (there could be another event already in the queue).
Scenario II:
- TPC dead time is determined centrally, based on rates and assuming worst-case event sizes.
- Overflow protection for the FEE buffers: assert TPC BUSY if 7 events arrive within 50 ms (assuming 120 MB/event, 1 Gbit links).
- Overflow protection for the receiver buffers: ~100 events in 1 second, or a high-water mark in any receiver buffer (preferred way): high-water mark - send XOFF; low-water mark - send XON.
No reverse flow control is needed on the optical link; no dead-time signalling is needed at the TPC front-end.
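A minimal sketch of the Scenario I bookkeeping in C; the counter name, threshold constant and callback structure are illustrative, not from the slide.

#include <stdbool.h>

#define FEE_BUFFER_DEPTH 8                        /* the FEE holds 8 black events */
#define BUSY_THRESHOLD   (FEE_BUFFER_DEPTH - 1)   /* assert BUSY at 7: one more event
                                                     may already be in the queue */

static int events_in_flight;   /* triggers issued minus events fully received */

/* Called for every TPC trigger issued by the central trigger. */
bool on_tpc_trigger(void)
{
    events_in_flight++;
    return events_in_flight >= BUSY_THRESHOLD;    /* true = assert TPC dead time */
}

/* Called when the event-receipt pulse arrives from the last receiver in the
 * daisy chain, i.e. all receivers hold the complete event. */
bool on_event_receipt_pulse(void)
{
    if (events_in_flight > 0)
        events_in_flight--;
    return events_in_flight >= BUSY_THRESHOLD;    /* keep or release dead time */
}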
19
Volker Lindenstruth, November 1999
Summary
- Memory bandwidth is a very important factor in designing high-performance multiprocessor systems; it needs to be studied in detail.
- Do not move data if not required - moving data costs money (except for some granularity effects).
- The overall complexity can be reduced by using PCI-based receiver modules delivering the data straight into host memory, thus eliminating the need for VME.
- General-purpose COTS processors are less expensive than any crate solution.
- An FPGA-based PCI receiver card prototype has been built; the NT driver is completed, the Linux driver almost completed.
- The DDL is already planned as a PCI version.
- No reverse flow control is required for the DDL.
- The DDL URD should be revised by the collaboration ASAP.
- No dead time or throttling needs to be implemented at the front-end.
- Two scenarios exist for implementing it for the TPC at the back-end without additional cost.