Published by Jerome Hunter. Modified over 9 years ago.
1
Monitoring and Diagnostics infrastructure & SW for LHC QPS: WorldFIP first analysis
Michel Arruat, Frank Locci, Julien Palluel, BE/CO
2
Summary
Our study focused on one specific bus, DL3J, which QPS reported as a particularly bad case. We also observed interesting results on other buses, but we need more information about the faults themselves.
AGENDA:
Infrastructure and configuration
Results
Overall observations
Conclusions
Solutions
3
Infrastructure and configuration
[Figure: bus layout BEFORE and AFTER the QPS doubling (AFTER: 88 segments). Labels: FEC, optical repeater, repeater, FIPDiag, QPS agents. Locations: surface, tunnel, RE.]
4
Infrastructure and configuration
[Figure: labels IP4, IP3.]
5
Infrastructure and configuration
Buses studied:
DL3J (errors on all QPS agents)
DL3K (few agents)
DL4J (same configuration as DL3J)
DL3D (parallel to DL3J)
Bus load (number of agents and BA cycle in ms):
DL3D: 36 agents, BA 100 ms
DL3J: 20 agents (1st subsegment) + 24 agents (2nd subsegment), total 44, BA 100 ms
DL3K: 31 agents, BA 100 ms
DL4J: 20 agents (1st subsegment) + 24 agents (2nd subsegment), total 44, BA 100 ms
6
Infrastructure and configuration
FIPWatcher (frame analyser) placed at the beginning of the bus.
[Figure: FIPWatcher placement.]
7
Infrastructure and configuration
BA table:
ID_DAT 3000: time variable
WAIT 4 ms
ID_DAT 05: commands
ID_DAT 00: reads
ID_DAT 02: reads
ID_DAT 04: reads
ID_DAT 06: reads
ID_DAT 0100: sync
ID_DAT 067F/057F: FIPDiag variables
APER WINDOW (presence, list presence…)
SYN_WAIT 100 ms
Durations annotated on the slide (their mapping to the entries above is not recoverable): 4 ms, 12 ms, 20 ms, 0.20 ms, 0.50 ms, 1.4 ms.
NB: If the BA takes too long, it ignores the next trigger in order to complete its cycle, which increases the cycle period.
8
Results
On all recorded segments there are no error frames: no bad CRC, no frames that are too long or too short, etc. ==> no electrical problem is disrupting the signal.
This is confirmed by the low error count (almost zero: 2 out of 90 million) seen on FIPDiag (the diagnostic module at the end of the bus).
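To put the "2 out of 90 million" figure in perspective, the ratio can be worked out directly. This is a minimal sketch; it assumes the 90 million refers to frames counted by the diagnostic module, which the slide does not state explicitly.

```python
# Figures quoted on the slide; the unit "frames" is an assumption.
errors = 2
frames = 90_000_000

error_rate = errors / frames          # fraction of frames in error
frames_per_error = frames // errors   # average number of frames between errors

print(f"error rate: {error_rate:.2e}")
print(f"about one error every {frames_per_error:,} frames")
```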
9
Time between last RP_DAT and first ID_DAT (3000): cycle free time
DL3K: with presence list (ID_DAT + RP_DAT = 732 µs + TRs): 28.5 ms; without: 29.9 ms
DL3J: with presence list: 0.16 ms; without: 1.6 ms
DL4J: with presence list: 1.5 ms; without: 3 ms
DL3D: with presence list: 17.6 ms; without: 19 ms
Annotation on the slide: 1.4 ms
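The free-time figures can be compared against the cycle period to see how little margin each bus has. The sketch below is only an illustration: it assumes the 100 ms BA cycle from the bus-load slide, and the helper name `summarize` is hypothetical.

```python
CYCLE_MS = 100.0  # BA cycle period, taken from the bus-load slide

# (free time with presence list, free time without), in ms, as listed on the slide
free_time_ms = {
    "DL3K": (28.5, 29.9),
    "DL3J": (0.16, 1.6),
    "DL4J": (1.5, 3.0),
    "DL3D": (17.6, 19.0),
}

def summarize(table, cycle_ms=CYCLE_MS):
    """Return, per bus, the cost of the presence list and the free share of the cycle."""
    rows = {}
    for bus, (with_pl, without_pl) in table.items():
        rows[bus] = {
            "presence_cost_ms": round(without_pl - with_pl, 2),
            "free_share_pct": round(100.0 * with_pl / cycle_ms, 2),
        }
    return rows

summary = summarize(free_time_ms)
# Print buses from least to most free time
for bus, row in sorted(summary.items(), key=lambda kv: kv[1]["free_share_pct"]):
    print(f"{bus}: {row['free_share_pct']}% of the cycle free, "
          f"presence list costs {row['presence_cost_ms']} ms")
```

On every bus the presence list costs roughly 1.4 to 1.5 ms, and DL3J is left with well under 1% of its cycle free.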
10
Time between two occurrences of the same ID_DAT > 100 ms (strictly speaking, this includes the jitter)
DL3K: 9.8%
DL3J: 100% (a cumulated delay of 2 cycles over 1200 cycles)
DL4J: 2.6%
DL3D: 1%

DL3K
Time (s)    DT (s)      Free time (s)   Frame
0.66695     0.09987     -0.00013        ID_DAT(0567) CRC: AFA7
0.76682     0.09987     -0.00013        ID_DAT(0567) CRC: AFA7
0.86668     0.09986     -0.00014        ID_DAT(0567) CRC: AFA7
0.96669     0.100011    1E-05           ID_DAT(0567) CRC: AFA7
1.066556    0.099866    -0.00013        ID_DAT(0567) CRC: AFA7
1.166419    0.099863    -0.00014        ID_DAT(0567) CRC: AFA7
1.266287    0.099868    -0.00013        ID_DAT(0567) CRC: AFA7
1.366154    0.099867    -0.00013        ID_DAT(0567) CRC: AFA7

DL3J
Time (s)    DT (s)      Free time (s)   Frame
42.09355    0.100191    0.00019         ID_DAT(05AE) CRC: 3BE7
42.19376    0.100211    0.00021         ID_DAT(05AE) CRC: 3BE7
42.29407    0.100311    0.00031         ID_DAT(05AE) CRC: 3BE7
42.39428    0.100211    0.00021         ID_DAT(05AE) CRC: 3BE7
42.49446    0.100181    0.00018         ID_DAT(05AE) CRC: 3BE7
42.59463    0.100171    0.00017         ID_DAT(05AE) CRC: 3BE7
42.69482    0.100191    0.00019         ID_DAT(05AE) CRC: 3BE7
42.795      0.100181    0.00018         ID_DAT(05AE) CRC: 3BE7
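The period statistics can be recomputed from the timestamps in the two samples. The sketch below is a rough cross-check only: it uses the eight timestamps shown per bus (seven periods each), so the fractions will not match the percentages measured over the full recordings, and the helper name `cycle_stats` is hypothetical.

```python
NOMINAL_S = 0.100  # nominal BA cycle of 100 ms

# Timestamps (s) of the same ID_DAT on consecutive cycles, copied from the samples
dl3k_times = [0.66695, 0.76682, 0.86668, 0.96669,
              1.066556, 1.166419, 1.266287, 1.366154]
dl3j_times = [42.09355, 42.19376, 42.29407, 42.39428,
              42.49446, 42.59463, 42.69482, 42.795]

def cycle_stats(times, nominal=NOMINAL_S):
    """Fraction of periods longer than nominal, and mean excess per cycle."""
    periods = [b - a for a, b in zip(times, times[1:])]
    late = sum(1 for p in periods if p > nominal)
    mean_excess = sum(periods) / len(periods) - nominal
    return {"late_fraction": late / len(periods), "mean_excess_s": mean_excess}

stats_k = cycle_stats(dl3k_times)
stats_j = cycle_stats(dl3j_times)

# Extrapolate the DL3J mean excess to 1200 cycles, expressed in nominal cycles
drift_cycles = stats_j["mean_excess_s"] * 1200 / NOMINAL_S

print(f"DL3K: {stats_k['late_fraction']:.0%} of sampled periods above 100 ms")
print(f"DL3J: {stats_j['late_fraction']:.0%} of sampled periods above 100 ms")
print(f"DL3J drift over 1200 cycles: about {drift_cycles:.1f} cycles")
```

In this short sample every DL3J period exceeds 100 ms, and the extrapolated drift is of the same order as the 2 cycles over 1200 quoted on the slide.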
11
Interframes (TR + cable time): sum of all interframe gaps for the variables produced by the agents during one cycle
DL3K: 9648 µs
DL3J: 14593.2 µs
DL4J: 12857.2 µs
DL3D: 11471.2 µs
Annotation on the slide: 1.7 ms (the difference between DL3J and DL4J is 1736 µs)
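The interframe totals can be related to the cycle period and to the number of agents. This is a minimal sketch: the agent counts come from the bus-load slide (taking 36 and 31 as the totals for DL3D and DL3K is an assumption), and the 100 ms cycle is taken from the same slide.

```python
CYCLE_US = 100_000  # 100 ms BA cycle, in microseconds

# Total interframe time per cycle (µs), as listed on the slide
interframe_us = {"DL3K": 9648.0, "DL3J": 14593.2, "DL4J": 12857.2, "DL3D": 11471.2}
# Agent counts from the bus-load slide (assumed to be totals per bus)
agents = {"DL3K": 31, "DL3J": 44, "DL4J": 44, "DL3D": 36}

# Share of the cycle spent in interframe gaps, and average gap time per agent
share_pct = {bus: 100.0 * t / CYCLE_US for bus, t in interframe_us.items()}
per_agent_us = {bus: interframe_us[bus] / agents[bus] for bus in interframe_us}

# DL3J and DL4J have the same number of agents, so the difference isolates the bus itself
extra_dl3j_us = interframe_us["DL3J"] - interframe_us["DL4J"]

for bus in interframe_us:
    print(f"{bus}: {share_pct[bus]:.1f}% of cycle, {per_agent_us[bus]:.0f} µs per agent")
print(f"DL3J spends {extra_dl3j_us:.0f} µs more than DL4J per cycle")
```

With identical agent counts, DL3J still spends about 1.7 ms more per cycle in interframe gaps than DL4J, which exceeds the whole DL3J free time.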
12
Overall observations
Is there enough time to execute the read/write during the callback before the next cycle?
DL4J: ID_DAT 0100 to ID_DAT 3000: 3.6 ms; ID_DAT 3000 to first ID_DAT 05: 4 ms
This can lead to incorrect data, because the callback may try to access data that has already been refreshed by the next cycle. We need to measure how long the callback takes.
Every bus shows many bad MPS_STATUS values (=04) on read variables, so refreshment is NOT_OK.
Every bus has some agents answering FFFFFFFFFF….FFF values. Strange?
Optimization, if possible: a single variable of 96 bytes is more efficient than 4 variables of 24 bytes.
BA table: ID_DAT 3000 time variable, WAIT 4 ms, ID_DAT 05 commands, ID_DAT 00 reads, ID_DAT 02 reads, ID_DAT 04 reads, ID_DAT 06 reads, ID_DAT 0100 sync, ID_DAT 067F/057F, APER WINDOW, SYN_WAIT 100 ms
13
Conclusions
DL3J is the longest bus. The effect is visible in its interframes, which are longer than those of DL4J with the same number of agents. DL3J also carries the largest number of agents; as a result the BA is overloaded and may exceed the 100 ms cycle (free running).
Some buses have very little free time. Is it enough to execute the read/write on the card?
14
Solutions
Increase the BA cycle, to 120 ms for instance.
Move some agents from DL3J to DL3D, at the risk of saturating DL3D.
Remove the diagnostic part, but this is undesirable (and may not be sufficient).
Reduce the TR to a value lower than 70 µs, e.g. 33 µs. Possible?
Use only one variable of 96 bytes. Possible?
Explain/understand the agents' refreshment status (MPS status 04) and the FFFFF….FF data.
ID_DAT order: sometimes in layout order, sometimes not.
To give the callback more time to compute data, use 2 callbacks with 2 dummy sync variables in order to split the ID_DATs into 2 groups.
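Two of these proposals can be given a rough order of magnitude for DL3J. The sketch below is an estimate under stated assumptions, not a measurement: it assumes the current TR is 70 µs, that each interframe gap contains exactly one TR plus cable time (so the number of gaps is at most the total divided by TR), and that the traffic is unchanged when the BA cycle is lengthened. The function name `max_tr_saving_us` is hypothetical.

```python
import math

def max_tr_saving_us(total_interframe_us, tr_now_us=70.0, tr_new_us=33.0):
    """Upper bound on the time saved per cycle by lowering TR.

    Each gap is TR + cable time, hence at least TR long, so the number of
    gaps cannot exceed total_interframe_us / tr_now_us.
    """
    max_gaps = math.floor(total_interframe_us / tr_now_us)
    return max_gaps, max_gaps * (tr_now_us - tr_new_us)

# DL3J figures from the slides
DL3J_INTERFRAME_US = 14593.2
DL3J_FREE_MS = 0.16

gaps, saving_us = max_tr_saving_us(DL3J_INTERFRAME_US)
print(f"at most {gaps} gaps per cycle, saving at most {saving_us / 1000:.1f} ms")

# Lengthening the BA cycle from 100 ms to 120 ms with unchanged traffic
free_at_120_ms = DL3J_FREE_MS + (120 - 100)
print(f"free time with a 120 ms cycle: about {free_at_120_ms:.2f} ms")
```

The TR figure is only an upper bound; the real saving depends on the actual number of frames per cycle and on the cable time, neither of which is given in the slides.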