Vertex 2005, November 7-11, 2005, Chuzenji Lake, Nikko, Japan. FPGA based signal processing for the LHCb Vertex detector and Silicon Tracker. Guido Haefeli, EPFL, Lausanne
Outline LHCb VELO and Silicon Tracker Short description of the signal chain from the detector to the CPU farm Required signal processing Implementation of the signal processing with an FPGA
LHCb silicon strip detectors Trigger Tracker (TT) 143k channels Inner Tracker (IT) 130k channels Vertex Locator (VELO) 180k channels Olaf Steinkamp “Long Ladder Performance” Lars Eklund “LHCb Vertex Locator (VELO)”
Readout chain (on detector / off detector). Each Beetle readout chip provides data on 4 analog links at 40 MHz. The VELO data is transmitted analog, the Silicon Tracker data optically. Data reception differs for VELO and Silicon Tracker: the VELO data is digitized, the Silicon Tracker data deserialized. Signal processing on the "TELL1" board is performed on FPGAs: signal filtering, clustering, zero suppression, and the interface to the readout network.
The silicon sensors with the Beetle readout chip. VELO sensor with 2048 channels: 16 Beetle front end ASICs, 128 channels per chip. Inner Tracker with 3-chip hybrid (384 channels). Trigger Tracker with 4-chip hybrid (512 channels).
FPGA processor board "TELL1": 16-channel ADC analog receiver (A-Rx), PP-FPGAs, control interface, output stage, 12-way optical receiver (O-Rx). Large FPGAs: 40 LE (logic elements) and 400 bytes of memory per detector channel.
Signal processing (0): 30 Gbit/s in, 3 Gbit/s out.
Signal processing (1). Event synchronization (Sync): change to a common clock domain; verify that the incoming data is from the same event (check the Beetle-specific event tag); tag the data with the experiment-wide event counters; create detector-specific header data including error flags, error counters, status flags, … Cable compensation (FIR): a Finite Impulse Response filter corrects the data after the long analog links. Pedestal: pedestals are calculated on the incoming data (pedestal follower) or downloaded (see example).
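The FIR cable compensation can be sketched as follows. This is a minimal, illustrative filter; the actual number of taps and the tap values are per-link calibration constants that are not given in the slides.

```python
def fir_correct(samples, taps):
    """Apply an FIR filter y[n] = sum_k taps[k] * x[n-k] to one link's samples."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * samples[n - k]  # convolve with the tap coefficients
        out.append(acc)
    return out
```

With taps such as `[1, -0.25]` (hypothetical values) the filter subtracts a fraction of the previous sample, which counteracts the tail left on the line by the preceding pulse.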
Signal processing (2). Channel re-ordering, common mode correction (CMS), clusterization and zero suppression, data linking, … Each detector channel is needed about 20 times during processing: 20 operations applied to 768 channels at a 1.11 MHz event rate = 17 GOperations/s. This calls for massive parallel processing!
How can we process this data? Requirements: fixed input data rate with fixed event size (6 Gbit/s)! Some well known but also a few less well known processing steps: pedestal subtraction, data reordering, clusterization, FIR correction?, CMS?, sequence of the processing chain? Adaptable for different detectors, flexible for future changes (… see next slide!). An ASIC is not an option! About 17 GOperations/s have to be performed, so a CPU (DSP) is not an option either! Large FPGAs are the only solution!
The LHCb trigger system changed in September 2005! "Old" readout with Level-1 buffer and two data streams: 7 GByte/s of data to the event builder ("full readout at 40 kHz"). "New" readout: in total 56 GByte/s of data to the event builder ("full readout at 1 MHz"). The LHCb readout scheme has been changed; thanks to the FPGA-based design, no hardware changes were needed.
Array of processors Distributed FPGA logic is used to build many processors working simultaneously !
Data preparation for processing. The data has to be available simultaneously to all processing channels! To use the logic more efficiently, the processing clock frequency is increased and the data is time-multiplexed!
Data, instructions and schedule. The pace maker imposes the correct timing on the distributed processors: every 450 ns a new processing cycle is started and the incoming data is processed. The fixed data size and fixed event rate are used to pipeline the processing (periodic processing-cycle counter).
Where are the difficulties? High "bit resolution" increases the logic resources! Sorting (channel reordering). Zero suppression generates a variable event data size. Moreover, we often need accurate software models of the processing (in C, C++, …) usable by physicists.
High "bit resolution" increases the required FPGA resources! There are operators that scale linearly with the bit width: adders, comparators, counters. But there are operators that scale quadratically: multipliers, squarers.
The VELO Phi sensor requires difficult reordering. A second metal layer is used for routing the signal lines. R-measuring sensor: strips divided into 45° sectors, no reordering required. Phi-measuring sensor: each readout chip receives inner and outer strips, and the outer strips are not read out in order, so re-ordering is required.
Sorting detector channels (RAM). The main difficulty in reordering is due to the parallel processing! A two-step reordering is applied: an intelligent de-multiplexer separates the inner and outer strip data; an intelligent multiplexer then collects the data from all inner (outer) strips via a readout multiplexer. Conclusion: sorting is complicated. (A lesson: try to avoid this situation in the future!)
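The two-step reordering can be sketched in software. The inner/outer split and the lookup tables here are hypothetical; the real tables encode the Phi-sensor strip geometry.

```python
def reorder(samples, is_inner, inner_order, outer_order):
    """Two-step channel reordering: de-multiplex, then multiplex in strip order."""
    # Step 1 (intelligent de-multiplexer): separate inner and outer strip data
    # into two memories as the samples arrive.
    inner = [s for s, flag in zip(samples, is_inner) if flag]
    outer = [s for s, flag in zip(samples, is_inner) if not flag]
    # Step 2 (intelligent multiplexer): read the data back in geographical
    # strip order, using per-sensor lookup tables.
    return [inner[i] for i in inner_order] + [outer[i] for i in outer_order]
```

In the FPGA the two memories are RAM blocks and the lookup tables drive the read addresses; the software model mirrors that structure one event at a time.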
Zero suppression. The average event size is reduced by zero suppression, but for high-occupancy events the data size is increased by the cluster encoding, so long processing times can occur. De-randomization is required for the zero-suppression processing! Large buffers and buffer overflow control are needed. The processing can still be pipelined, but the average processing time must be respected.
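A simple sketch of zero suppression with cluster encoding. The output format here (cluster start position plus ADC values) is hypothetical; the actual TELL1 cluster format is detector-specific.

```python
def zero_suppress(channels, threshold):
    """Keep only strips above threshold; encode adjacent hits as clusters."""
    clusters = []
    current = None
    for pos, adc in enumerate(channels):
        if adc > threshold:
            if current is None:
                current = (pos, [adc])       # open a new cluster
            else:
                current[1].append(adc)       # extend the current cluster
        else:
            if current is not None:
                clusters.append(current)     # close the cluster
                current = None
    if current is not None:
        clusters.append(current)
    return clusters
```

The encoding overhead per cluster is why a fully occupied event can be larger than the raw data, which is what forces the de-randomizing buffers mentioned above.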
Hardware description. Low level hardware description with VHDL or Verilog is difficult to use for large designs. Given the flexibility for changes offered by the FPGA, one needs automatic generation of simulation models. Many languages are ready to use: SystemC, Handel-C, Impulse C, Confluence.
Example: pedestal follower. 1. Keep the pedestal sum of the last 1024 events in a memory, for each detector channel! 2. Use binary division (/1024) to get the pedestal value (10-bit right shift). 3. Subtract the pedestal value. 4. Update the pedestal sum.
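The four steps above can be modelled as follows. The class name and the seeding of the running sum are assumptions; the shift-by-10 division and the per-channel sum follow the scheme in the slide.

```python
N_EVENTS = 1024  # averaging depth; division becomes a 10-bit right shift

class PedestalFollower:
    def __init__(self, n_channels, initial_pedestal=512):
        # Step 1: one running sum per detector channel, seeded so that
        # sum >> 10 equals the initial pedestal estimate.
        self.sums = [initial_pedestal * N_EVENTS] * n_channels

    def process(self, adc):
        """Pedestal-correct one event and update the followers."""
        corrected = []
        for ch, value in enumerate(adc):
            pedestal = self.sums[ch] >> 10      # step 2: divide by 1024
            corrected.append(value - pedestal)  # step 3: subtract
            self.sums[ch] += value - pedestal   # step 4: update the sum
        return corrected
```

Because the sum changes by only the residual each event, the pedestal estimate drifts slowly and tracks gradual baseline changes without following individual hits.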
Conclusions. We have described the signal processing for the LHCb silicon strip detectors: the data is filtered and corrected for noise, clusterized and zero-suppressed before being sent to the DAQ. Parallel processing makes it possible to cope with 3072 detector channels per board at a 1.11 MHz event rate. For this, large FPGAs are required: 40 LE and 400 bytes per detector channel.
Backup slides below
Clusterization data flow
Principle of the LCMS algorithm. 1. Start from the input data after pedestal correction; calculate the mean value and correct for it. 2. Calculate the slope and correct for it; calculate the RMS value; detect hits. 3. Set the detected hits to zero for a second iteration; calculate and correct the mean value again, then calculate and correct the slope again. 4. Re-insert the hits previously set to zero and apply the strip-individual hit threshold mask (this can reveal additional hits).
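A rough software model of the LCMS iteration described above, assuming simple least-squares mean/slope fits; the cut values and the fixed-point arithmetic of the FPGA implementation may differ.

```python
def lcms(data, hit_sigma=2.0, final_threshold=10):
    """Two-pass linear common mode suppression with hit exclusion."""
    n = len(data)
    x = [i - (n - 1) / 2 for i in range(n)]  # centred strip index

    def correct(values, mask):
        # Fit mean and slope on non-masked strips, subtract from all strips.
        pts = [(xi, v) for xi, v, m in zip(x, values, mask) if not m]
        mean = sum(v for _, v in pts) / len(pts)
        sxx = sum(xi * xi for xi, _ in pts)
        slope = sum(xi * (v - mean) for xi, v in pts) / sxx
        return [v - mean - slope * xi for xi, v in zip(x, values)]

    # First iteration: correct mean and slope, detect hits above an RMS cut.
    corrected = correct(data, [False] * n)
    rms = (sum(v * v for v in corrected) / n) ** 0.5
    mask = [abs(v) > hit_sigma * rms for v in corrected]

    # Second iteration: hits excluded from the fit ("set to zero").
    corrected = correct(data, mask)

    # Re-insert previous hits and apply the strip-individual hit threshold.
    return [v > final_threshold for v in corrected]
```

Excluding the detected hits from the second fit prevents a large signal from biasing the common-mode estimate, which is the point of the two-pass scheme.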
Beetle readout chip. Amplification and storage of 128 detector channels sampled at 40 MHz. Storage is performed in an analog pipeline during the Level-0 latency of 160 clock cycles. A complete event is read out within 900 ns over 4 analog links; each analog link carries the information of 32 detector channels. In addition to the detector data, 4 header words are transmitted on each analog link (4 × header = 100 ns, 32 × data = 800 ns).
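The readout timing quoted above follows directly from the 40 MHz word clock, as this small check shows:

```python
WORD_NS = 25              # one word per 40 MHz clock cycle = 25 ns
header_ns = 4 * WORD_NS   # 4 header words per analog link
data_ns = 32 * WORD_NS    # 32 detector channels per analog link
total_ns = header_ns + data_ns  # complete event readout time per link
```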