Debugging of #100019, P. Hristov, 04/03/2013


Introduction
- Difficult problem
  – The behavior is “random” and depends on the “history”
  – The debugger doesn’t show what actually happens
  – The standard tools (e.g. valgrind) do not detect it
  – None of the tricks I tried helped
- Important problem
  – Many jobs crash with the same message

Binary search for problem identification
- Needs a fast multiprocessor machine for compilation (thanks to DAQ!)
- The same test has to be repeated several (N) times to be sure that a given revision works or crashes (I used N=5)
- In this way I identified the offending revision: the fix for the “label==0” problem
- Recall that this revision also caused #99670, "Increased virtual memory after the 'Label 0 fix'"
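The search above can be sketched as a bisection over revision numbers in which every revision is tested N times. This is only an illustration, not the script that was actually used: FirstBadRevision, RevisionIsGood and the runTestOnce callback are hypothetical names, and the sketch assumes that revisions are consecutive integers and that every revision after the first bad one crashes at least once in N runs.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: a revision counts as "good" only if all `repeats`
// runs of the test pass, because the crash is intermittent.
bool RevisionIsGood(int revision, int repeats,
                    const std::function<bool(int)>& runTestOnce) {
  for (int i = 0; i < repeats; ++i) {
    if (!runTestOnce(revision)) return false;  // a single crash is enough
  }
  return true;
}

// Returns the first bad revision in (good, bad].
// Assumptions: `good` works, `bad` crashes, and all revisions after the
// first bad one are bad as well.
int FirstBadRevision(int good, int bad, int repeats,
                     const std::function<bool(int)>& runTestOnce) {
  assert(good < bad);
  while (bad - good > 1) {
    const int mid = good + (bad - good) / 2;
    if (RevisionIsGood(mid, repeats, runTestOnce)) {
      good = mid;  // the problem was introduced later
    } else {
      bad = mid;   // the problem is already present here
    }
  }
  return bad;
}
```

With repeats set to 5, as on the slide, each step costs up to five full test runs plus one compilation, which is why a fast build machine matters. A revision that crashes only rarely can still be misclassified as good, so a larger N trades time for confidence.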

Static initialization order fiasco
- See the details in – –
- Short description: in the implementation file

    #include "MyClass.h"
    …
    const int fkLookForTrouble = AnotherClass::GetValue();
    // Methods
    …
    void MyClass::DoMess() {
      // Use fkLookForTrouble
      …
    }

- We had a similar problem in 2007 with pointers initialized from a factory method: that case was easier, since the crash was “less random”
- The fix is to call AnotherClass::GetValue() directly where the value is needed, instead of caching it in a file-scope constant
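A minimal single-file sketch of the pattern and of the fix, under stated assumptions: the bodies of AnotherClass and MyClass, the member fgValue and the function ComputeAtStartup are hypothetical, since the slide shows only the two relevant lines. In the real code the constant and the static it depends on live in different implementation files, so their initialization order is not fixed; here the cached constant is simply defined before the static it reads.

```cpp
#include <cassert>

int ComputeAtStartup();  // hypothetical: stands for any run-time computation

class AnotherClass {
public:
  static int GetValue() { return fgValue; }
private:
  static int fgValue;  // dynamically initialized further below
};

// Risky pattern from the slide: the value is cached during static
// initialization, possibly before fgValue has been set.
const int fkLookForTrouble = AnotherClass::GetValue();

int AnotherClass::fgValue = ComputeAtStartup();

class MyClass {
public:
  // Uses the cached copy: may see a not yet initialized fgValue.
  int DoMessCached() const { return fkLookForTrouble; }
  // The fix: ask AnotherClass at the time of use.
  int DoMessDirect() const { return AnotherClass::GetValue(); }
};

int ComputeAtStartup() { return 42; }

// Callable from main(): by then all statics are initialized.
void CheckDirectAccess() { assert(MyClass().DoMessDirect() == 42); }
```

DoMessDirect is safe as long as it is called after static initialization has finished, for example from main(). The value returned by DoMessCached depends on how the compiler orders and optimizes the initializers, which is exactly the kind of behavior that looks “random”.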

Test after the fix
Test run: one of the worst runs, 4,648 jobs in total

Status     Before the fix     After the fix     Cause
DONE       2,686 (57.8%)      4,089 (88.0%)
ERROR_V    1,227 (26.5%)      415 (8.9%)        bug
EXPIRED    724 (15.6%)        144 (3.1%)        memory

To Do
- Check all similar places in AliRoot and provide a fix
- Fix some memory leaks found by Insure++ (an evaluation license was provided by Parasoft)
- Run the test again

Old slides

Introduction
- More than 15% of the jobs crash in one of the two streamers: AliTRDtrackV1::Streamer or AliTRDcluster::Streamer
- The problem is reproducible only on SLC5
  – The same Root/AliRoot with the same raw files works on Ubuntu or MacOS
  – If the code is compiled without optimization (-O0), it also works on SLC5
  – If you start directly from the event that crashed, the job is OK => the crash depends on the “history”
  – Sometimes the reconstruction doesn’t crash, and then the probability that the next run is also OK is higher
- Hypothesis
  – memory corruption
  – problem in IO

Localization of problem
- Replace the AliEn OCDB with a local one
- Replace the AliEn raw file with a local one => xrootd is not involved in the crash
- Reduce the list of detectors
  – “minimal configuration” to reproduce the crash: ITS, TPC, TRD, PHOS, EMCAL, HLT
- Debug printout (gDebug): large and useless

“Simple” debugging with gdb
- Find the exact place of the crash
- Investigate the content
  – Corrupted structure in CINT
  – Try a watchpoint on the address with the wrong content: this doesn’t work because the corrupted address changes
- Compile only the affected class without optimization
  – The problem is reproducible, but almost no additional information came out

Debugging with test function
- The test function examines the content of the global list, where the corruption occurs
- Possibility to “bracket” the place of corruption (the closer to the actual place, the less reproducible)
- Localized to the reading of PHOS raw data
- Possibility to set a watchpoint (worked once or twice out of many attempts)
- Full calling chain: involves TBufferFile, TStreamerInfoReadBuffer, TStreamerInfoActions, TBranchElement, TBranch, TBranchRef, TRefTable, TRef, AliRawEquipmentV2
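The idea of a test function that brackets the corruption can be sketched as follows. Everything here is hypothetical: the real global list is a CINT structure whose layout is not shown in the slides, so the sketch uses a simple list whose entries carry a check word (Entry, gList, kMagic, AddEntry and CheckGlobalList are illustrative names only).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the global list.
struct Entry {
  std::uint32_t payload;
  std::uint32_t check;  // expected to equal payload ^ kMagic
};

const std::uint32_t kMagic = 0xA5A5A5A5u;

std::vector<Entry> gList;         // hypothetical global list
std::string gFirstBadCheckpoint;  // first checkpoint that saw corruption

void AddEntry(std::uint32_t payload) {
  gList.push_back(Entry{payload, payload ^ kMagic});
}

// The test function: returns true if every entry is still consistent.
// Calling it before and after a suspicious step brackets the corruption.
bool CheckGlobalList(const char* checkpoint) {
  assert(checkpoint != nullptr);
  for (const Entry& e : gList) {
    if ((e.payload ^ kMagic) != e.check) {
      if (gFirstBadCheckpoint.empty()) gFirstBadCheckpoint = checkpoint;
      return false;
    }
  }
  return true;
}
```

Calls such as CheckGlobalList("before PHOS raw reading") and CheckGlobalList("after PHOS raw reading") are then moved closer together until the first failing checkpoint is adjacent to the last passing one. As noted on the slide, adding such calls changes the program, so the crash may become less reproducible the closer the bracket gets.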

Changes in Root/AliRoot
- Inspection of all modifications in the affected classes
  – Nothing suspicious
- Test with an old Root tag: works, but probably by chance
- PHOS raw data format/consistency: tested by the PHOS experts, no changes since 2011
- RAW data framework: no changes since 2011

Runs with Valgrind
- Memcheck
  – it detects the use of corrupted CINT structures, but not the moment they are corrupted
  – no errors when we only read RAW
- SGcheck
  – one invalid write in string operations
  – no errors when we only read RAW
- Latest version of Valgrind (3.8.1): no new problems detected

Other tools
- Free
  – electric fence: puts 2 words for each allocated word, too “memory hungry” => cannot be easily used
  – duma: clone of electric fence
  – libcwd: not tried
- Commercial
  – Insure++: problem with the license server, no reply from Parasoft when contacted for an evaluation license
  – Purify: the same problem + no experience
  – TotalView: no version is available
- Coverity: check carefully the remaining defects

Status on 21/02/13
- Hypothesis: “second order” corruption
  – Corruption in the IO caused by unknown code (in allocated memory, since Valgrind doesn’t detect it)
  – Corruption in the CINT structures caused by the problem in IO
- Difficult to debug

Additional simplification

    rec.SetRunVertexFinderTracks(kFALSE);
    rec.SetRunMultFinder(kFALSE);
    rec.SetRunCascadeFinder(kFALSE);
    rec.SetFillTriggerESD(kFALSE);
    rec.SetWriteAlignmentData(kFALSE);
    rec.SetRunLocalReconstruction("ITS TPC TRD PHOS HLT");
    rec.SetRunTracking("ITS TPC TRD");
    rec.SetFillESD("HLT");

When HLT is out of FillESD, everything works!

Investigation of FillESD
- Set “return” at different places to localize the code that causes the crash
- Since the success of an attempt is somehow correlated with the previous attempt, always rerun the “crashing” version first and then go to the changed one
- Calling chain
  – AliHLTReconstructor::FillESD
  – AliHLTSystem::ProcessHLTOUT
  – AliHLTOUTHandler::ProcessData -> AliHLTTriggerAgent::ProcessData
  – AliHLTOUT::GetDataObject
  – AliHLTMessage::Extract
  – AliHLTMessage::ReadObject (calls TBufferFile::ReadObject)
- The object that causes problems is AliHLTGlobalTriggerDecision
- Everything works with a default object (no IO)

AliHLTGlobalTriggerDecision
- This class has not changed for a long time (607 days)
- The object is written using Root v b.
- Contains
  – AliHLTDomainEntry (594 days since the last change)
  – AliHLTComponentDataType (in AliHLTDataTypes.h, 374 days)
  – AliHLTTriggerDomain (807 days)
  – AliHLTTriggerDecision (861 days)
  – AliHLTLogging (209 days)
  – AliHLTCTPData (857 days)
  – AliHLTReadoutList (502 days)
  – AliHLTEventDDLV1 (in AliHLTDataTypes.h, 374 days)

Plans
- Try with a realistic AliHLTGlobalTriggerDecision object without IO
- Investigate all changes between Root v b and v
- Check with different raw files