Debugging of #100019 P. Hristov 04/03/2013

Introduction
Difficult problem
  – The behavior is “random” and depends on the “history”
  – The debugger doesn’t show what actually happens
  – The standard tools (e.g. Valgrind) do not detect it
  – None of the tricks I tried helped
Important problem
  – Many jobs crash with the same message

Binary search for problem identification
Needs a fast multiprocessor machine for compilation (thanks to DAQ!)
The same test has to be repeated several (N) times to be sure that a given revision works or crashes (I used N=5)
In this way I identified rev. 59755: the fix for the “label==0” problem
Recall that this revision also caused #99670 (Increased virtual memory after the "Label 0 fix")
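The search described above can be sketched as follows. This is only an illustration under stated assumptions: revisions are plain integers, `runOnce` is a hypothetical callback that builds and runs one test job, and every revision after the first bad one is assumed to be bad as well. None of these names come from AliRoot.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callback: returns true if one test job at this revision
// finishes without crashing.
using RunOnce = std::function<bool(int revision)>;

// A revision counts as "good" only if all n repetitions pass;
// a single crash marks it as bad.
bool IsGood(int revision, int n, const RunOnce& runOnce) {
    for (int i = 0; i < n; ++i) {
        if (!runOnce(revision)) return false;
    }
    return true;
}

// Returns the first bad revision in (good, bad], assuming `good` passes,
// `bad` crashes, and all revisions after the first bad one are bad too.
int FirstBadRevision(int good, int bad, int n, const RunOnce& runOnce) {
    assert(good < bad);
    while (bad - good > 1) {
        const int mid = good + (bad - good) / 2;
        if (IsGood(mid, n, runOnce)) {
            good = mid;
        } else {
            bad = mid;
        }
    }
    return bad;
}
```

With a random crash, N passing runs never prove that a revision is good; a larger N only lowers the chance of a wrong turn in the search.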

Static initialization order fiasco
See the details in
  – http://www.parashift.com/c++-faq-lite/static-init-order.html
  – http://www.parashift.com/c++-faq-lite/static-init-order-on-intrinsics.html
Short description: in the implementation file
    #include "MyClass.h"
    …
    const int fkLookForTrouble = AnotherClass::GetValue();
    // Methods
    …
    void MyClass::DoMess() {
      // Use fkLookForTrouble
      …
    }
The constant is initialized at program start-up, in an unspecified order relative to static data in other translation units, so GetValue() may run before the data it depends on is ready
We had a similar problem in 2007 with pointers initialized from a factory method; that case was easier since the crash was “less random”
The fix is to use AnotherClass::GetValue() directly
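A minimal single-file sketch of the fix, under assumptions: `fgValue` and `ComputeValue()` are hypothetical, and `DoMess()` returns an `int` here only so that the result can be checked. A single file cannot reproduce the cross-file ordering problem itself, so the risky line is kept as a comment.

```cpp
// Hypothetical stand-in for AnotherClass: GetValue() reads a static member
// that is itself initialized at program start-up.
class AnotherClass {
public:
    static int GetValue() { return fgValue; }
private:
    static int fgValue;
};

// Hypothetical helper; in real code this could be any run-time computation.
static int ComputeValue() { return 42; }
int AnotherClass::fgValue = ComputeValue();

// Risky pattern from the slide, shown only as a comment:
//   const int fkLookForTrouble = AnotherClass::GetValue();
// If that line lives in another translation unit, it may run before
// AnotherClass::fgValue has been initialized.

class MyClass {
public:
    // Returns int (unlike the slide's void DoMess) so the result can be checked.
    int DoMess() const {
        // Fix: ask AnotherClass at the point of use. When DoMess() is called
        // from main(), start-up initialization has already finished.
        return AnotherClass::GetValue();
    }
};
```

The FAQ pages linked above describe a related remedy, wrapping the static object in a function so that it is constructed on first use; the sketch follows the simpler fix named on the slide.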

Test after the fix
Run 195566 – one of the worst runs
  – DONE 2,686 from 4,648 (57.8%)
  – ERROR_V 1,227 (26.5%) – bug
  – EXPIRED 724 (15.6%) – memory
After the fix
  – DONE 4,089 from 4,648 (88.0%)
  – ERROR_V 415 (8.9%) – bug
  – EXPIRED 144 (3.1%) – memory

To Do
Check all similar places in AliRoot and provide a fix
Fix some memory leaks found by Insure++ (an evaluation license was provided by Parasoft)
Rerun the test on run 195566

Old slides

Introduction
More than 15% of the jobs crash in one of two streamers: AliTRDtrackV1::Streamer or AliTRDcluster::Streamer
The problem is reproducible only on SLC5
  – The same Root/AliRoot with the same raw files works on Ubuntu or MacOS
  – If the code is compiled without optimization (-O0), it also works on SLC5
  – If you start directly from the event that crashed, the job is OK => the crash depends on the “history”
  – Sometimes the reconstruction doesn’t crash, and the probability that it is OK again in the next run is higher
Hypothesis
  – memory corruption
  – problem in IO

Localization of the problem
Replace the AliEn OCDB with a local one
Replace the AliEn raw file with a local one => xrootd is not involved in the crash
Reduce the list of detectors:
  – “minimal configuration” to reproduce the crash: ITS, TPC, TRD, PHOS, EMCAL, HLT
Debug printout (gDebug): large and useless

“Simple” debugging with gdb
Find the exact place of the crash
Investigate the content
  – Corrupted structure in CINT
  – Try a watchpoint on the address with the wrong content: this doesn’t work because the corrupted address changes from run to run
Compile only the affected class without optimization
  – The problem is still reproducible, but almost no additional information came out

Debugging with a test function
The test function examines the content of the global list where the corruption occurs
Possibility to “bracket” the place of corruption (the closer to the actual place, the less reproducible)
Localized to the reading of PHOS raw data
Possibility to set a watchpoint (worked once or twice out of many attempts)
Full calling chain: involves TBufferFile, TStreamerInfoReadBuffer, TStreamerInfoActions, TBranchElement, TBranch, TBranchRef, TRefTable, TRef, AliRawEquipmentV2
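The bracketing idea can be sketched like this. It is a generic illustration, not the actual test function: the real one inspected a CINT global list, while `Entry`, `GlobalList()` and the `check` word below are hypothetical stand-ins.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical list entry: `check` equals id ^ kMagic while the entry is intact.
struct Entry {
    std::uint32_t id;
    std::uint32_t check;
};
constexpr std::uint32_t kMagic = 0xA5A5A5A5u;

// Hypothetical global list standing in for the structure that gets corrupted.
std::vector<Entry>& GlobalList() {
    static std::vector<Entry> list;
    return list;
}

void AddEntry(std::uint32_t id) {
    GlobalList().push_back({id, id ^ kMagic});
}

// Test function: returns true if every entry is intact, otherwise reports
// the label of the checkpoint at which the damage was first seen.
bool CheckGlobalList(const char* where) {
    for (const Entry& e : GlobalList()) {
        if (e.check != (e.id ^ kMagic)) {
            std::fprintf(stderr, "corruption seen at checkpoint '%s' (id=%u)\n",
                         where, static_cast<unsigned>(e.id));
            return false;
        }
    }
    return true;
}
```

Calling `CheckGlobalList("before X")` and `CheckGlobalList("after X")` around a suspect step, then moving the two calls closer together, narrows down where the list is damaged.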

Changes in Root/AliRoot
Inspection of all modifications in the affected classes
  – Nothing suspicious
Test with an old Root tag: works, but probably by chance
PHOS raw data format/consistency: tested by the PHOS experts, no changes since 2011
RAW data framework: no changes since 2011

Runs with Valgrind
Memcheck
  – it detects the use of corrupted CINT structures, but not the moment they are corrupted
  – no errors when we only read RAW
SGCheck
  – one invalid write in string operations
  – no errors when we only read RAW
Latest version of Valgrind (3.8.1): no new problems detected

Other tools
Free
  – electric fence: puts 2 words for each allocated word, too “memory hungry” => cannot be easily used
  – duma: clone of electric fence
  – libcwd: not tried
Commercial
  – Insure++: problem with the license server, no reply from Sdt.Support@cern.ch; Parasoft contacted for an evaluation license
  – Purify: the same problem + no experience
  – TotalView: no version is available
Coverity: check carefully the remaining defects

Status on 21/02/13
Hypothesis: “second order” corruption
  – Corruption in the IO caused by unknown code (in allocated memory, since Valgrind doesn’t detect it)
  – Corruption in the CINT structures caused by the problem in IO
Difficult to debug
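Why a write “in allocated memory” can escape Memcheck is illustrated below: the stray write lands in memory that is validly allocated and addressable, so no invalid access is reported. The `Pool` layout is hypothetical and is not the actual bug.

```cpp
#include <cstddef>
#include <vector>

// One allocation shared by two logical regions (hypothetical layout):
// words [0, kIoSize) belong to the IO code, the rest to other bookkeeping.
struct Pool {
    static constexpr std::size_t kIoSize = 4;
    static constexpr std::size_t kMetaSize = 4;
    std::vector<int> words = std::vector<int>(kIoSize + kMetaSize, 0);

    // Buggy writer: it should stop at kIoSize but trusts the caller's count.
    // Every write stays inside the allocation, so it is a legal access.
    void FillIo(std::size_t count, int value) {
        for (std::size_t i = 0; i < count && i < words.size(); ++i) {
            words[i] = value;
        }
    }

    // Reads word i of the bookkeeping region.
    int Meta(std::size_t i) const { return words[kIoSize + i]; }
};
```

Calling `FillIo(Pool::kIoSize + 1, -1)` silently overwrites the first bookkeeping word; the damage shows up only later, when that word is used.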

Additional simplification
    rec.SetRunVertexFinderTracks(kFALSE);
    rec.SetRunMultFinder(kFALSE);
    rec.SetRunCascadeFinder(kFALSE);
    rec.SetFillTriggerESD(kFALSE);
    rec.SetWriteAlignmentData(kFALSE);
    rec.SetRunLocalReconstruction("ITS TPC TRD PHOS HLT");
    rec.SetRunTracking("ITS TPC TRD");
    rec.SetFillESD("HLT");
When HLT is taken out of FillESD, everything works!

Investigation of FillESD
Set “return” at different places to localize the code that causes the crash
Since the success of an attempt is somehow correlated with the previous attempt, always rerun the “crashing” version first and then go to the changed one
Calling chain
  – AliHLTReconstructor::FillESD
  – AliHLTSystem::ProcessHLTOUT
  – AliHLTOUTHandler::ProcessData -> AliHLTTriggerAgent::ProcessData
  – AliHLTOUT::GetDataObject
  – AliHLTMessage::Extract
  – AliHLTMessage::ReadObject (calls TBufferFile::ReadObject)
The object that causes problems is AliHLTGlobalTriggerDecision
Everything works with a default object (no IO)
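The early-return technique can be sketched as below. The slides moved a literal “return” in the source and recompiled; this sketch swaps in a run-time cut point instead, so that one binary can test several positions. The environment variable `DEBUG_CUT_AFTER` and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

// Reads the cut point from a hypothetical environment variable;
// -1 (the default) means "run the whole chain".
int CutPoint() {
    const char* s = std::getenv("DEBUG_CUT_AFTER");
    return s ? std::atoi(s) : -1;
}

// Runs the stages in order and stops right after stage index `cut`
// (if cut >= 0). Returns the number of stages actually executed.
int RunChain(const std::vector<std::function<void()>>& stages, int cut) {
    int executed = 0;
    for (std::size_t i = 0; i < stages.size(); ++i) {
        stages[i]();
        ++executed;
        if (cut >= 0 && static_cast<int>(i) == cut) {
            break;  // plays the role of the "return" placed at this point
        }
    }
    return executed;
}
```

Following the rule above, each trial with a new cut point would be preceded by a run of the known crashing configuration (`cut = -1`), so that both attempts start from a comparable “history”.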

AliHLTGlobalTriggerDecision
This class has not changed for a long time (607 days)
The object is written using Root v5-33-02b
Contains (days since the last change in brackets)
  – AliHLTDomainEntry (594 days)
  – AliHLTComponentDataType (in AliHLTDataTypes.h, 374 days)
  – AliHLTTriggerDomain (807 days)
  – AliHLTTriggerDecision (861 days)
  – AliHLTLogging (209 days)
  – AliHLTCTPData (857 days)
  – AliHLTReadoutList (502 days)
  – AliHLTEventDDLV1 (in AliHLTDataTypes.h, 374 days)

Plans
Try with a realistic AliHLTGlobalTriggerDecision object without IO
Investigate all changes between Root v5-33-02b and v5-34-02
Check with different raw files
