1
GSI CBM Volker Lindenstruth
Kirchhoff Institute for Physics, Chair of Computer Science, University of Heidelberg, Germany
2
Overview
Trigger hierarchy: L0 → L1 → L2 → DAQ at 10 MHz → 33 kHz → 16 kHz; the reaction counter provides the L0 trigger.
- TRD (200k channels, 20 time bins, 6 layers, 10 MHz): 5000 kB RAW, 500 kB hits, <2 kB tracklets; 2.5 TB/s → <10 GB/s → <2 GB/s; TRD tracklet search → TRD layer tracking → L1 TMU tracks (100 MB/s).
- RICH (84k channels): 100 kB RAW, 10 kB hits, <1 kB rings; 100 GB/s → <1 GB/s; RICH ring search → rings.
- Silicon (4.5M channels, 7 layers): 500 kB RAW, 50 kB hits; 500 GB/s → 50 GB/s; silicon cluster search → Si-vertex processor → L2 vertex.
- TOF (70k channels, 2 layers): 70 kB RAW, 7 kB hits; 70 GB/s → 7 GB/s; TOF readout.
- HLT / DAQ at 16 kHz.
(Speaker note: if the occupancy is 10%, then 500 kB zero-suppressed, but then 100 kB of tracklets is far too much ...)
3
Overall Observations and Axioms
- No physics or detector requirement sets a maximum trigger latency.
- Can afford to implement large derandomizing buffers at L0.
- Move data only if necessary (keep RAW data in the front-end until L2 accept).
- Process local data locally; ship only results (tracklets, rings, (displaced) vertices).
4
TRD Trigger Architecture
Aggregate in: 2.5 TB/s, aggregate out: <2 GB/s, rate: 1 MHz.
(Overview data-flow diagram with the TRD trigger path highlighted: reaction counter L0 trigger → TRD tracklet search → TRD layer tracking → L1 TMU tracks → HLT / DAQ.)
5
ALICE TRD Overview
- Analog channels with 10 MHz digitization rate; 18+3 channels per multi-chip module (MCM); 64224 sources, 1 trigger bit.
- Track processing on the chamber; trigger and readout integrated; maximum latency 6 µs.
- ~20 space points per tracklet; 4-6 tracklets (layers) per track.
- Hierarchy: multi-chip module (18 channels) → chamber (8 chips, 144 channels, 24 padrows) → stack (6 layers) → detector (5 subdivisions in z, 18 subdivisions in φ).
(MCM floor-plan labels: signal layers, DGND, AGND, DPWR, APWR, preamp, tracklet processor; B marks the magnetic field in the detector sketch.)
6
TRD Electronics Chain
Chain: detector → PASA → ADC → Tracklet Preprocessor (TPP) → Tracklet Processor (TP) → Tracklet Merger (TM) → GTU & DAQ; L1 trigger to the CTP, data to the HLT.
- PASA: detector charge, sensitive preamplifier, shaper.
- ADC: 10 bits, 10 MHz, per channel.
- Tracklet Preprocessor (TPP): digital filter (pedestal, tail correction), fits tracklets, preprocesses the data (see the filter sketch below).
- Tracklet Processor (TP): event buffer, stores RAW data until L1A, compresses and ships RAW data at L2A.
- Tracklet Merger (TM): readout at L1A, merges tracklets in a tree.
- GTU & DAQ: merges the 6 layers into tracks, 270 GB/s input, computes the RoI, ships RAW data.
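The slide only names the TPP's per-channel digital filter (pedestal and tail correction). As a rough, hypothetical illustration of what such a stage does — not the TRAP filter itself, and with invented coefficient values — a minimal C sketch could look like this:

```c
#include <stdint.h>

/* Hypothetical per-channel filter state: pedestal estimate and ion-tail memory. */
typedef struct {
    int32_t pedestal;   /* fixed-point pedestal estimate         */
    int32_t tail;       /* running estimate of the slow ion tail */
} filter_state_t;

/* Process one 10-bit ADC sample of the 10 MHz stream:
 * pedestal subtraction followed by a single-exponential tail cancellation.
 * The constants 230/256 (decay) and 26/256 (coupling) are illustrative only. */
static inline int32_t filter_sample(filter_state_t *s, int32_t adc)
{
    int32_t signal    = adc - s->pedestal;   /* pedestal correction         */
    int32_t corrected = signal - s->tail;    /* subtract the predicted tail */
    /* update the tail model with the corrected pulse */
    s->tail = (s->tail * 230 + corrected * 26) / 256;
    return corrected > 0 ? corrected : 0;    /* clip negative residue       */
}
```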
7
TRD Trigger Timing
Timeline (t in ns, axis ticks every 998 ns up to 5990): drift time → ADC output / calculate fit sums (PASA, Tracklet Preprocessor TPP) → calculate tracklets (TPP / Tracklet Processor TP, event buffer) → data ship (Tracklet Merger TM) → global tracking (GTU), completing within the ~6 µs latency budget.
Power estimate (very crude, speaker note):
- Assume 100% duty cycle and FaRo1 as the baseline (risk: the number of clocked gates will definitely change).
- At 100 MHz, FaRo1 draws 120 mW for 8 channels, i.e. 15 mW/channel; FaRo1 is about 50k gates, so 2.4 mW/kGate.
- FaRoFinal will be about 300k gates for 16 channels (caution: much of that is memory with a comparatively low toggle frequency, so this is likely a high estimate), i.e. ~750 mW total or 60 mW/channel when operational.
- 10 MHz mode during the drift (~2 µs): almost everything runs at 10 MHz (except the PLL etc.); assume 15 mW/channel (1/4 of full power).
- 100 MHz mode for the next 2 µs at full power (60 mW/channel); FaRo is basically off (µW) in stand-by.
- Average on-line power: ~35 mW/channel during the 4 µs per pretrigger; a 10 kHz pretrigger rate (high estimate) gives 40 ms live time per second, i.e. 4% duty cycle, hence ~1.5 mW/channel (see the check below).
- These numbers are too good to be true; add a bit of safety and take 5 mW/channel for now.
- Do not forget power-regulation losses on the detector and losses in the cables (the voltage drop over the cables is between a few 100 mV and 1 V, or a lot of copper is required; several kW stay in the cables).
Readout latency:
- Minimum readout latency (worst case), based on a 6-fold branch-out with 3 branches: 8 clock cycles plus cable delay (80 MHz, 20 m), about 100 ns.
- The average number of tracklets per chamber is 12; a tracklet is 32 bits, i.e. 50 ns on the 8-bit payload links.
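As a check on the duty-cycle arithmetic in the note above (my calculation, using only the numbers quoted there):

$$\bar P \approx 35~\mathrm{mW/ch} \times \underbrace{4~\mu\mathrm{s} \times 10~\mathrm{kHz}}_{4\%~\text{duty cycle}} = 1.4~\mathrm{mW/ch} \approx 1.5~\mathrm{mW/ch},$$

which is then rounded up to the quoted 5 mW/channel safety figure.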
8
Tracklet Fit (Tracklet Preprocessor)
During drift time, accumulate per channel: N = hit count, y_i = position, x_i = time bin, and the sums Σ y_i, Σ x_i y_i, Σ y_i² (plus Σ x_i, Σ x_i²).
After drift time, compute: a = intercept, b = slope, χ² = track quality; merge track segments in the pad row (see the fit sketch below).
(Pipelined calculation: the position is built as an amplitude ratio A_x/A_c over adjacent pads (AP-1, AP, AP+1), the fit summands are calculated, and the results are accumulated in a sum memory, with the help of a LUT.)
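The slide lists only the accumulated sums and the fit outputs. A minimal C sketch of the standard unweighted straight-line fit that such a preprocessor evaluates after the drift time might look like the following; the on-detector pipeline uses LUTs (and presumably fixed-point arithmetic), so the floating-point form here is purely for readability:

```c
/* Sums accumulated during the drift time for one tracklet candidate. */
typedef struct {
    int    n;      /* N  = hit count        */
    double sx;     /* Σ x_i  (time bins)    */
    double sy;     /* Σ y_i  (positions)    */
    double sxx;    /* Σ x_i^2               */
    double sxy;    /* Σ x_i y_i             */
    double syy;    /* Σ y_i^2               */
} tracklet_sums_t;

/* Unweighted straight-line fit y = a + b*x plus a chi^2-like quality,
 * expressed entirely through the accumulated sums. */
static void tracklet_fit(const tracklet_sums_t *s,
                         double *a, double *b, double *chi2)
{
    double det = s->n * s->sxx - s->sx * s->sx;   /* N Σx² − (Σx)²  */
    *b = (s->n * s->sxy - s->sx * s->sy) / det;   /* slope           */
    *a = (s->sxx * s->sy - s->sx * s->sxy) / det; /* intercept       */
    /* residual sum of squares: Σ(y − a − b x)² = Σy² − a Σy − b Σxy */
    *chi2 = s->syy - (*a) * s->sy - (*b) * s->sxy;
}
```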
9
Tracklet (MIMD) Processor
- Four ALUs; shared data via the GRF and D-Mem; shared I-Mem.
- Projection of the preprocessor data inside the MIMD.
- Four Harvard CPUs coupled by a multi-port instruction memory, a multi-port data memory, and a Global Register File (GRF).
- Register-based interface to the preprocessor.
- Two-stage pipeline (fetch/decode, execute/write-back).
- The architecture can be adapted to any general-purpose processor.
10
Tracklet Track Trigger (Tracklet Merger)
(Readout-tree diagram: tracklet mergers (M110–M113) collect the left/right chamber halves of the sublayers (L00–L02 / R00–R02, L10–L12 / R10–R12, L20/L21 / R20/R21) and feed a Readout Board (RB) and the GTU over OC48 links; the boards are numbered 1–19 across the φ sectors.)
11
TRAP1 Die
- 4.5 M transistors, 180 nm DSM process, all clocks gated.
- 21 digital filters (crosstalk + 1/t tail correction), 18 full channels, multi-event buffers.
- 3× 10-bit, 10 MHz ADC; ADC area 0.1 mm².
- PLL clock recovery, Gbit links, BIST, JTAG.
- 4 RISC processors; redundant, fault-tolerant, configurable network.
- Full-custom QPM; tape-out May 12.
- 5000×5000 µm die, 1.8 V core, TOX = 4 nm.
- Assuming 1000 chips/wafer: <$4 per chip, about 25 c per channel.
- TM memories: dual-ported 128×16 bits = 2 kbit.
(Die-plot labels: ID, ADC, Tracklet Preprocessor, FF, IMEM, Event Buffers, DMEM, Core, Tracklet Merger.)
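The quoted per-channel cost follows from the chip cost and the 18 full channels per chip (my arithmetic):

$$\frac{\$4~\text{per chip}}{18~\text{channels per chip}} \approx 22~\text{c/channel} \lesssim 25~\text{c/channel}.$$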
12
ALICE TRD Global Tracking Unit
- Global tracking is performed module-wise (540 modules).
- The input of one GTU module is 6×2 optical fibre-channel links at 2.5 GBd, which are converted into 6×2 data streams of 16 bit / 120 MHz.
- The readout order of the individual chambers is optimized so that histogramming and pattern recognition run in parallel.
- The histogramming engine fills a data field with tracklet candidates according to their position.
- It looks for patterns with a cluster of at least 4 candidates of adequately similar angle (high momentum) and adequate quality (see the sketch below).
- The quality and position of a found hit are sent to the central trigger processor.
- The GTU also handles raw-data readout and compression.
(GTU module, ×5×18: optical receivers and deserializers (6×2 × 2.5 GBd fibre channel → PECL → 6×2 × 16 bit / 120 MHz TTL) feed the histogram engine & pattern-recognition module via the middle plane; the result goes to the CTP.)
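The slide describes the pattern-recognition step only at this level. A minimal software sketch of the idea — bin tracklet candidates by projected position and look for at least four candidates with a similar deflection angle — could look like the following; the bin count, buffer sizes, and angle cut are assumptions for illustration, not the GTU implementation:

```c
#include <math.h>
#include <stdbool.h>

#define POS_BINS  256          /* assumed granularity of the position field   */
#define MAX_CAND   32          /* assumed maximum candidates per position bin */

typedef struct { float angle; float quality; } cand_t;

static cand_t histo[POS_BINS][MAX_CAND];
static int    nfill[POS_BINS];

/* Histogramming engine: enter one tracklet candidate by its projected position. */
void fill_candidate(int pos_bin, float angle, float quality)
{
    if (pos_bin < 0 || pos_bin >= POS_BINS || nfill[pos_bin] >= MAX_CAND) return;
    histo[pos_bin][nfill[pos_bin]++] = (cand_t){ angle, quality };
}

/* Pattern recognition: true if some position bin holds >= 4 candidates whose
 * angles agree within 'angle_cut' (a stand-in for the "adequately similar
 * angle", i.e. high-momentum, requirement; the quality cut is omitted here). */
bool find_track(float angle_cut)
{
    for (int b = 0; b < POS_BINS; ++b) {
        for (int i = 0; i < nfill[b]; ++i) {
            int matches = 1;                       /* seed candidate */
            for (int j = 0; j < nfill[b]; ++j)
                if (j != i && fabsf(histo[b][j].angle - histo[b][i].angle) < angle_cut)
                    ++matches;
            if (matches >= 4) return true;         /* candidate track found */
        }
    }
    return false;
}
```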
13
CBM TRD Outlook
- Detector occupancy is critical.
- Interaction rate > drift time; L0 provides more timestamp-like information.
- Multi-event pile-up has to be detected and sorted out electronically.
- Large numbers of channels and large event buffers can be integrated well.
- Digitize as early as possible; use digital filters.
- Interference of the digital circuitry into the analog front-end is even more critical.
- 20 MHz sampling is used in the figure.
14
L1/L2 Si-Vertex Trigger Processor
Aggregate in: <5 GB/s, rate: 1 MHz.
(Overview data-flow diagram with the silicon trigger path highlighted: silicon cluster search → Si-vertex processor → L2 vertex.)
15
LHCb Vertex Position
(Event-display panels: 2D tracking with the primary vertex and secondary tracks; 3D tracking with secondary vertices; Δx = x_rec − x_MonteCarlo for the primary vertex.)
VELO statistics (2D), 380 events (N_MC = number of Monte Carlo tracks):
- RefB efficiency: 853 N_MC
- RefPrim efficiency: 5179 N_MC
- RefSet efficiency: 20797 N_MC
- AllSet efficiency: 26565 N_MC
- Extra efficiency: 5768 N_MC
- Clone probability: 700 reconstructed
- Ghost probability: 498 reconstructed
- MC tracks found per event: 64
- Primary-vertex resolution: σ(x,y) = 16 µm, σ_z = 47 µm
Speaker notes:
- Started with 500 events, of which 120 are pile-up; only 2D reconstruction.
- 90.01% efficiency to reconstruct the B-decay particle tracks; 853 tracks from B decays.
- RefPrim: fast primary-vertex tracks above 1 GeV; RefSet: all tracks with momentum above 1 GeV; Extra: AllSet minus RefSet (tracks with low momentum).
- (From I. Kisel:) Everything requested was done except the histogram of the primary-vertex z resolution, which is still shifted by 40 µm, presumably because the z of the R-sensitive plane is used instead of the mean z position of the detector station. Instead, the distribution of the normalized x error (pull) of the primary vertex is shown, which is of the same importance as the resolution itself. The pull width is 1.5 (it should be 1.0), which is relatively good since neither the φ of the tracks nor their momentum is known.
- The timing of the 2D and primary-vertex calculation is slightly increased by code added to investigate the behaviour of the primary-vertex reconstruction algorithm. There is a general trend of timing versus the number of R clusters that is combinatorial in nature; a second spread of points at the lower right of the timing histogram corresponds to events in which a second primary vertex was found and therefore rejected at an earlier stage. The mean time was about 2.5 ms; this may not be obvious from the contour representation and may even have increased due to the additional code. Simplifications of the algorithm are being tried (a kind of simulation of the FPGA results, similar to what Konstantin is doing).
- The event display at the upper left shows a typical event; the upper and lower parts correspond to the two (180°) halves of the R stations. The sensor thickness is 200/300 µm for the different prototypes; even 150 µm sensors exist but have not been tested. (Information and further links from I. Kisel.)
16
Scheduled Subevent Transfer @ LHCb
- Packets of 128 bytes are assembled in PCI cards and distributed in the PC cluster, implementing event building and load-balanced data distribution at rates exceeding 1.9 MHz.
- Hardware-initiated DMA through the Scalable Coherent Interface (SCI).
- The final system will implement about 300 PCs; maximum latency 2 ms.
- 128-byte transfers at a rate of 1.96 MHz, hardware initiated; MHz transfer rates demonstrated in a 3×10 torus.
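For orientation (my arithmetic, not stated on the slide), the payload throughput implied per receiving link is

$$128~\mathrm{B} \times 1.96~\mathrm{MHz} \approx 251~\mathrm{MB/s}.$$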
17
MHz Event Rate in PC Clusters
- PCI 32/33, 2× ORCA interleaved: address (1 cc) + wait (4 cc) + data (16 cc) + idle (1 cc) per transfer → f_max = 1.50 MHz.
- PCI 64/66, 1× ALTERA: address (1 cc) + wait (7 cc) + data (16 cc) + idle (10 cc) per transfer → f_max = 1.94 MHz (1.94 MHz corresponds to 34 clocks at 66 MHz).
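The quoted maximum event rates follow directly from the cycle budget per transfer:

$$f_{\max} = \frac{33~\mathrm{MHz}}{(1+4+16+1)~\mathrm{cc}} = 1.50~\mathrm{MHz}, \qquad
f_{\max} = \frac{66~\mathrm{MHz}}{(1+7+16+10)~\mathrm{cc}} = \frac{66~\mathrm{MHz}}{34~\mathrm{cc}} \approx 1.94~\mathrm{MHz}.$$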
18
Trigger Processor Farms
Aggregate in: 9 GB/s, rate: 16 kHz.
(Overview data-flow diagram with the trigger-processor-farm path highlighted: tracks, rings and the L2 vertex feed the HLT / DAQ.)
19
Receiving Detector Data into Cluster
- Probably the cheapest memory available is in commodity computers; use that memory as event and latency buffer.
- Perform first-level processing (e.g. the cluster finder) where the data already is.
- Avoid unnecessary copying of data.
- Use programmable FPGAs to off-load the host processors (FPGA co-processor).
- Make PCI(-X) the baseline bus, enabling the use of all commodity computers.
- Use the CMC form factor for the optical detector-data-link circuitry on a mezzanine.
(Speaker note: possibly combine slides 15 and 17; move figure 17 to slide 15.)
20
Modular Cluster Communication Package
- Building blocks: Data Block Distributor, Data Block Collector, Data Block Merger; publishers announce "new event" to subscribers.
- Load balancing / fan-out: one publisher fans new events out to several subscribers.
- Data Block Merger: several publishers feed subscribers whose merging code republishes the merged data (event M, block 0 and block 1 → event M, blocks 0, 1, ...).
- Bridging between nodes: a SubscriberBridgeHead subscribes upstream, gets the input data from its publisher and sends it over the network; a PublisherBridgeHead reads the data from the network and publishes it downstream.
(A minimal interface sketch follows below.)
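As an illustration of the publisher–subscriber idea — announcing event descriptors rather than copying the data, in line with the "avoid unnecessary copying" guideline above — a hypothetical minimal C interface might look like this; all names here are invented for illustration and are not the framework's API:

```c
#include <stddef.h>
#include <stdint.h>

/* Descriptor of a data block that already sits in (shared) memory;
 * only this small structure travels between publisher and subscriber. */
typedef struct {
    uint64_t event_id;
    uint32_t block_id;
    void    *data;       /* pointer into the shared event buffer */
    size_t   size;
} block_descriptor_t;

/* A subscriber registers a callback that is invoked on "new event". */
typedef void (*new_event_cb)(const block_descriptor_t *desc, void *user);

typedef struct subscriber {
    new_event_cb       on_new_event;
    void              *user;
    struct subscriber *next;
} subscriber_t;

typedef struct {
    subscriber_t *subscribers;   /* fan-out list */
} publisher_t;

/* Announce a new block: every subscriber sees the descriptor, no data copy. */
static void publish(publisher_t *p, const block_descriptor_t *desc)
{
    for (subscriber_t *s = p->subscribers; s; s = s->next)
        s->on_new_event(desc, s->user);
}
```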
21
System benchmark: pp pile-up removal at 270 Hz (TCP/IP, Fast Ethernet)
- Rate > 270 Hz.
- Pipeline per patch (6 patches): File Publisher (FP) → ADC Unpacker (AU) → Cluster Finder (CF, CPU 60-70%) → Event Scatterer (ES) → Tracker (T, 2×) → Event Gatherer (EG) → Event Merger (EM) / Patch Merger (PM) → Slice Merger (SM, CPU 100%).
- Data rates: 6 × 1.8 MB/s net (6 × 2.4 MB/s gross) at the tracker stage; 6 × 131 kB/s net (6 × 374 kB/s gross) at the merger stage.
22
Publisher / Subscriber Fault Tolerance
Not a single event is lost during error recovery.
23
Fault Tolerance in large Clusters
- A PCI Linux PC in every PC: BIOS access, hardware control, POST codes, floppy emulation.
- Remote access to a broken PC.
- Detects hardware and software errors and starts autonomous error correction.
- Web interface for monitoring and interaction.
- Part of the DataGRID project.
24
Cluster RAID
- The disk of each node (A, B, C, D) is divided into quarters of 25% each: local exclusive data, local redundant (non-critical) data, data with backup, and the backup of the other nodes.
- Synchronous backup: parity = d(A) ⊕ d(B) ⊕ d(C) ⊕ d(D).
- Incremental backup: parity_new = parity_old ⊕ d_old(X) ⊕ d_new(X) (see the sketch below).
- Writes are more expensive because of the three operations in the backup; each write requires sending the old data to the other nodes, i.e. 100% more data transfer.
- (Alternatively, distribute block 1 to block n cyclically round-robin over all neighbours, but that only works with snapshots, not incrementally.)
- Distributed block device: enables the implementation of distributed, reliable mass storage within the cluster.
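A minimal sketch of the parity scheme described above (the block size and buffer handling are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096   /* assumed block size */

/* Synchronous backup: the parity block is the XOR of all nodes' data blocks. */
void parity_build(uint8_t *parity, const uint8_t *const data[], size_t nnodes)
{
    for (size_t i = 0; i < BLOCK_SIZE; ++i) {
        uint8_t p = 0;
        for (size_t n = 0; n < nnodes; ++n)
            p ^= data[n][i];
        parity[i] = p;
    }
}

/* Incremental backup: parity_new = parity_old ^ d_old(X) ^ d_new(X),
 * so a single node's write only needs its old and new block contents. */
void parity_update(uint8_t *parity, const uint8_t *d_old, const uint8_t *d_new)
{
    for (size_t i = 0; i < BLOCK_SIZE; ++i)
        parity[i] ^= d_old[i] ^ d_new[i];
}
```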
26
Heidelberg HLT cluster
27
Summary - Conclusions
- PC clusters are a powerful tool up to MHz sub-event rates and GB/s transfer rates; the data is easily transferred into the L1/L2 caches; scalability thanks to Moore's law.
- At higher data or transaction rates: systolic FPGA (co-)processors; ASICs for large channel counts.
- Complex low-power readout and (pre-)trigger circuitry is possible and likely to benefit much from Moore's law.
- High integration density and high-speed processes allow the implementation of very inexpensive and low-power ADCs; the co-existence of ADCs and clocked digital circuitry has been demonstrated to work well.
- The challenging GSI CBM programme does not seem to present any insurmountable problems with respect to trigger and readout.