An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis Swarup Kumar Sahoo, John Criswell, Vikram Adve Department.


An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis Swarup Kumar Sahoo, John Criswell, Vikram Adve Department of Computer Science University of Illinois at Urbana-Champaign 1

Motivation In-the-field software failures are becoming increasingly common –Software failures result in losses of billions of dollars every year [Charette et al., IEEE Spectrum, 2005] –Increasing the reliability of systems is critical Off-site analysis of production-run failures is difficult –Failures are difficult to reproduce at the development site –The same bug may generate different faults at multiple production sites –Customers have privacy concerns 2

Motivation – Production Site Diagnosis Problem: Failures need to be reproduced quickly, but reliance on checkpoint-based replay limits the usefulness of diagnosis tools Question: Will a simple restart/replay mechanism work? Problem: Minimal test case generation is too slow Question: Can knowledge of fault types and the number of inputs help? To answer these questions, we need to understand the characteristics of software bugs 3

Application Selection Server applications are widely used and mission critical Server applications challenging for diagnosis –Run for long periods of time (-) –Handle large amounts of data (-) –Concurrent (-) –Inputs are well-structured (+) We studied 266 randomly selected bug reports and 30 extra concurrency bug reports from 6 servers * (Apache, Squid, Tomcat, sshd, SVN, MySQL) * A detailed spreadsheet of bugs can be found at 4

Goals and key results of the study How many inputs are needed to trigger the symptoms? –77% of the bugs need just one input (only 12 of 266 bugs need more than 3) What is the time duration from the first fault-triggering input to the symptom? –For 57% of multi-input failures, all inputs are likely to occur within a short time –Time between the first fault-triggering input and the symptom is usually small Which symptoms appear as a manifestation of bugs? –The majority (63%) of bugs result in incorrect outputs –Two applications have fewer incorrect outputs What fraction of failures is deterministic? –82% of bugs showed deterministic behavior Concurrency bugs: very few; nearly all are non-deterministic, need many more inputs, and produce fewer incorrect outputs 5

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 6

Bug Selection Selected a recent major version of the software that had been in production use for at least a year Selected a set of bugs from the bug database with a set of filters (Status field as RESOLVED, Resolution field as FIXED) Randomly selected a set of bugs from the list using a seeded rand() function, yielding 472 server bugs 7
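The selection procedure on this slide (filter on status and resolution, then draw a seeded random sample) can be sketched as follows. This is only an illustration: the record layout, the field names `status` and `resolution`, and the function name `sample_bugs` are assumptions, not the authors' actual scripts or the schema of any particular bug database.

```python
import random

def sample_bugs(reports, k, seed=0):
    """Keep only fixed bugs, then draw a reproducible random sample of size k."""
    # Hypothetical field names; real bug trackers use their own schemas.
    fixed = [r for r in reports
             if r["status"] == "RESOLVED" and r["resolution"] == "FIXED"]
    rng = random.Random(seed)  # seeded, so the same sample can be redrawn later
    return rng.sample(fixed, min(k, len(fixed)))

# Made-up reports, for illustration only.
reports = [
    {"id": 1, "status": "RESOLVED", "resolution": "FIXED"},
    {"id": 2, "status": "NEW", "resolution": ""},
    {"id": 3, "status": "RESOLVED", "resolution": "WONTFIX"},
    {"id": 4, "status": "RESOLVED", "resolution": "FIXED"},
    {"id": 5, "status": "RESOLVED", "resolution": "FIXED"},
]
picked = sample_bugs(reports, k=2, seed=42)
```

Seeding the generator is what makes the sample repeatable: calling `sample_bugs` again with the same reports and seed returns the same bugs.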

Bug Selection Manual Filtering –Removed bugs in development code versions –Removed trivial bugs like build errors, documentation errors etc. –After filtering, 266 bugs remained out of 472 bugs We analyzed each bug (reports, test cases, patches) Classified them into different categories based on –Bug symptom –Reproducibility –#inputs 8

Applications and Software Bugs
Application     Description                        #LOC     #total bugs after sampling   #bugs selected
MySQL 4.x       Database server                    1,028K   90                           55
Tomcat 5        Servlet container and web server   274K     70                           53
Sshd x, 4.x     Secure shell server                27K      61                           54
Apache 2.0.x    Web server                         283K     65                           52
Squid 3.0.x     Caching web proxy                  93K      170                          40
SVN             Version control server             587K     16                           12
Total           ---                                2,018K   472                          266

Limitations Servers only –Studied a subset of server applications Only two Programming languages –5 were in C/C++, 1 in Java Reported bugs only –Unreported bugs are likely to be less frequent –Difficult to reproduce bugs are possibly less likely to get reported Fixed bugs only –Bugs unfixed for a long time may have different properties Human error 10

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 11

Definitions and Terminology
An input is
–Logical input from client to server at the application level
  Login input, HTTP request, SQL query, command from SSH client
An input is not
–Messages coming from sources other than client
  File system, back-end databases, DNS queries
–Inputs creating persistent environment
  SVN checkout command, create/insert/delete commands in database
Example inputs:
  Login
  Select Database db1
  Set sql_mode = FULL_GROUP_BY
  Insert into foo values (1,2)
  Select count(*) from foo group by a

  POST /login.jsp HTTP/1.1
  Host:
  User-Agent: Mozilla/4.0
  Content-Length: 27
  Content-Type: application/x-www-form-urlencoded
  userid=joe&password=guessme…..
12

Definitions and Terminology Symptoms –Incorrect program behavior which is externally visible Incorrect Output –External program output is different from the correct output without any catastrophic symptom 13

Definitions and Terminology Deterministic Bug –Triggers the same symptom each time the application is run with the same set of inputs in the same order on a fixed platform Timing-Dependent Bug –Timing, in addition to order, determines whether the symptom is triggered –A special case of non-deterministic bug Ex: An input arriving before a download input completes crashes the server Non-deterministic Bug –Symptom may not be triggered each time the same requests are input into the application in the same order 14

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 15

Bug Symptoms * Memory errors include Seg Fault, Memory Leak, NULL Pointer Exception etc Most of the bugs (63%) result in incorrect outputs 16

Bug Symptoms Squid and Tomcat have fewer incorrect outputs and many more assertion violations (23%-28%) 17

Bug Symptoms - Implications –New techniques are needed to detect incorrect outputs at run time –Adding assertions or automatically generated program invariants may help in detecting incorrect outputs 18

Bug Reproducibility 82% show deterministic behavior (similar to Chandra et al., DSN’02) Few show timing-dependent or non-deterministic behavior 19

Bug Reproducibility - Implications –Tools should be able to reproduce most bugs by replaying inputs –New techniques are needed to reproduce the small fraction of bugs classified as timing-dependent or non-deterministic, such as time-stamping inputs or controlling thread scheduling 20
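One way to read "time-stamping inputs" is to record each input with its arrival offset and then replay the inputs with the same gaps. The sketch below illustrates that idea only; the names `InputLog` and `replay` are hypothetical, and it does not control thread scheduling, so it would not make a truly non-deterministic bug reproducible on its own.

```python
import time

class InputLog:
    """Records each input with its arrival time relative to the first input."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = None
        self.entries = []  # list of (offset_seconds, payload)

    def record(self, payload):
        now = self._clock()
        if self._start is None:
            self._start = now
        self.entries.append((now - self._start, payload))

def replay(entries, send, sleep=time.sleep):
    """Re-send the payloads in order, waiting the recorded gap before each."""
    previous = 0.0
    for offset, payload in entries:
        sleep(offset - previous)
        send(payload)
        previous = offset

# Demo with a fake clock, so nothing actually waits.
ticks = iter([10.0, 10.5, 12.0])
log = InputLog(clock=lambda: next(ticks))
for request in ["login", "GET /a", "GET /b"]:
    log.record(request)

sent, gaps = [], []
replay(log.entries, send=sent.append, sleep=gaps.append)
```

Passing `clock` and `sleep` as parameters keeps the sketch testable; a real tool would hook the server's input path instead.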

Number of Bug Triggering Inputs 21

Number of Bug Triggering Inputs Excluding Session Setup Inputs Nearly 77% of the bugs need a single input to trigger 11% needed more than one input –Apache/SVN need at most 2 inputs, Squid/Tomcat 3 inputs –Only 12 bugs (excluding the unclear cases) need more than 3 inputs –The remaining 11% were unclear from the reports 22

Number of Bug Triggering Inputs - Implications –Most of the bugs can be reproduced with just a single input –Nearly all of the bugs can be reproduced with a small number of inputs; a few inputs from the session that triggers the bug are enough –The failure symptom occurs shortly after the last faulty input is received (see paper), except for hang or time-out bugs 23

Detailed Analysis

Classification of 22 non-deterministic bugs
Appl    # ≤1-input   # >1-input   Unclear
Total   9 (41%)      10 (45%)     3 (14%)

Classification of 30 multi-input bugs
Appl    Deterministic   Timing-dependent   Non-deterministic
Total   12 (40%)        8 (27%)            10 (33%)
24

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 25

Analysis of Multiple Input Bugs Goal: Time from first fault-triggering input to last input Classified into three categories –Clustered: input requests must occur within some time bound Ex: All inputs should occur within socket timeout period –Likely clustered: fault-triggering inputs are likely to occur within a short duration for most cases Ex: Two successive login requests with wrong passwords –Arbitrary: there is nothing to indicate that inputs must be or are usually clustered within a short duration Ex: Request a static file, Request the same file again 26

Analysis of Multiple Input Bugs
Appl.    Total   Clustered   Likely Clustered   Arbitrary
Squid    5       3           0                  2
Apache   3       0           1                  2
sshd     4       0           3                  1
SVN      3       1           2                  0
MySQL    8       2           2                  4
Tomcat   7       2           1                  4
Total    30      8           9                  13
Out of 30 multi-input bugs, 8 were clustered, 9 were likely clustered, and 13 were arbitrary 27
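As a quick arithmetic check, the per-application counts on this slide can be summed to recover the totals and the 57% figure quoted among the key results (clustered plus likely clustered, out of all multi-input bugs). The counts are copied from the slide; the variable names are illustrative.

```python
# (clustered, likely clustered, arbitrary) per application, from the slide.
rows = {
    "Squid":  (3, 0, 2),
    "Apache": (0, 1, 2),
    "sshd":   (0, 3, 1),
    "SVN":    (1, 2, 0),
    "MySQL":  (2, 2, 4),
    "Tomcat": (2, 1, 4),
}

clustered, likely, arbitrary = (sum(col) for col in zip(*rows.values()))
total = clustered + likely + arbitrary

# Share of multi-input bugs whose inputs occur, or likely occur, close together.
short_window_pct = round(100 * (clustered + likely) / total)
```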

Analysis of Multiple Input Bugs Implications –The majority of multi-input bugs trigger the symptom shortly after the first faulty input; replay tools need to buffer the session inputs and a small suffix of the inputs –Locality of the faulty inputs within an input stream can simplify creation of a reduced test case

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 29

Study of Concurrency Bugs Found very few (3) concurrency bugs in our bug set –Perhaps because servers process each input relatively independently –Even for multi-threaded servers (Apache, MySQL, Tomcat) Separately selected 30 extra concurrency bugs –From 3 server applications (Apache, MySQL, Tomcat) –Searched on keywords like ’race(s),’ ’atomic,’ ’concurrency,’ ’deadlock,’ ’lock(s),’ and ’mutex(s)’ –23 were data race/atomicity violation bugs, 5 were deadlock bugs, 2 were not clear 30

Concurrency Bug Symptom Classification A much higher fraction of bugs are hangs or crashes. Far fewer incorrect outputs (20% overall, but 45% in MySQL). Five (17%) of the concurrency bugs produced different symptoms in different executions.
Appl.   Seg Fault   Crash    Assertion Violation   Hang      Incorrect Output   Multiple Symptoms
Total   3 (10%)     1 (3%)   6 (20%)               9 (30%)   6 (20%)            5 (17%)
31

Concurrency Bug Reproducibility Appl. Deterministic Timing-dependent Non-deterministic Total 2 (7%) 16 (87%) Most of the bugs (87% overall, and 100% in Apache and Tomcat) show non-deterministic behavior. 32

Concurrency Bug Input Characteristics
Appl.   # 0-2 input   # 3-8 input   # >8-input   Unclear    Max #ip
Total   0 (0%)        3 (10%)       17 (57%)     10 (33%)   15000 (max)
All bugs need multiple inputs (>1) to trigger a symptom (excluding session setup inputs) Some of the cases need a large number of inputs Many bugs needed executions with multiple threads and multiple client connections for some time Most bugs can usually be triggered using 2 or 3 threads and client connections 33

Implications for Concurrency Bugs Very few reported bugs are concurrency bugs Implications for tools targeting concurrency bugs –Need new techniques to reliably reproduce symptoms –Need to buffer a larger number of inputs –Need to use inputs from multiple different client connections Validation of results for overall reported bugs –The study of concurrency bugs successfully identified non-deterministic behavior and the need for multiple inputs –A similar methodology found a very low occurrence of these behaviors among the overall reported bugs 34

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple input Bugs Concurrency Bugs Implications Conclusions and Future Work 35

Implications for Automated Tools A systematic procedure to isolate faulty inputs to reproduce failures –Record a prefix of the per-session input To buffer session establishment inputs –Buffer a suffix of the input Most likely to contain faulty inputs needed to trigger the symptom Will a suffix of inputs containing all the faulty inputs trigger the same symptom that was triggered by the original input stream? 36
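The prefix-plus-suffix idea on this slide can be sketched as a small per-session buffer: keep the first few inputs (session establishment) and a bounded window of the most recent ones. The class name `SessionBuffer` and the fixed lengths are assumptions for illustration; the slides do not specify how large either part should be.

```python
from collections import deque

class SessionBuffer:
    """Keeps a fixed-length prefix and a sliding suffix of one session's inputs."""

    def __init__(self, prefix_len, suffix_len):
        self.prefix_len = prefix_len
        self.prefix = []                       # session establishment inputs
        self.suffix = deque(maxlen=suffix_len)  # most recent inputs only

    def record(self, item):
        if len(self.prefix) < self.prefix_len:
            self.prefix.append(item)
        else:
            self.suffix.append(item)  # the oldest suffix entry is dropped when full

    def replay_stream(self):
        """Inputs to replay after a restart: setup first, then the recent suffix."""
        return self.prefix + list(self.suffix)

buf = SessionBuffer(prefix_len=1, suffix_len=2)
for request in ["login", "query 1", "query 2", "query 3"]:
    buf.record(request)
stream = buf.replay_stream()
```

Whether replaying `stream` triggers the same symptom as the full input stream is exactly the question the slide raises, since inputs dropped from the middle may have changed global state.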

Effect of Global State [Figure: diagram with the labels "Crash", "Crash", "???" and "Global State"] 37

Effect of Global State Preliminary experiments to verify the effects of global state –4 reported memory-related bugs from four applications (Sshd, Apache, Squid, NullHTTPD) –Run with different length suffixes of an input stream containing the faulty input Each run produced the same symptom 38

Implications for Automated Tools Diagnosis tools like DDmin (implements delta debugging) [Zeller et al., TOSE 02] –Test small suffixes of inputs before trying a more general algorithm –One can possibly try subsets of small sizes; from our results, trying subsets of 2 or 3 inputs should work for most bugs Diagnosis tools like Triage [Tucek et al., SOSP 08] –Can reduce the input stream to a much smaller set –Symptoms can possibly be triggered by restarting the server and replaying a small number of inputs after the session establishment inputs, which alleviates the need for checkpointing 39
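The "suffixes first, then small subsets" heuristic can be sketched as below. This is not ddmin itself but a simple pre-pass that could run before a general delta-debugging algorithm. The function name `reduce_inputs` and the `triggers` callback (which would restart the server, replay session setup plus the candidate inputs, and report whether the symptom appears) are hypothetical.

```python
from itertools import combinations

def reduce_inputs(inputs, triggers, max_subset=3):
    """Return a small sublist of inputs that still triggers the symptom, or None.

    triggers(candidate) is assumed to replay candidate on a fresh server
    and return True if the failure symptom shows up.
    """
    # Step 1: grow a suffix of the stream until the symptom reappears.
    for n in range(1, len(inputs) + 1):
        suffix = inputs[-n:]
        if not triggers(suffix):
            continue
        # Step 2: within that suffix, try order-preserving subsets of
        # size 1..max_subset, smallest first.
        for size in range(1, min(max_subset, len(suffix)) + 1):
            for combo in combinations(suffix, size):
                if triggers(list(combo)):
                    return list(combo)
        return suffix  # no small subset works; fall back to the whole suffix
    return None

# Toy failure: the symptom needs "b" to be followed later by "d".
def needs_b_then_d(candidate):
    return ("b" in candidate and "d" in candidate
            and candidate.index("b") < candidate.index("d"))

reduced = reduce_inputs(["a", "b", "c", "d", "e"], needs_b_then_d)
```

In the toy run, the shortest failing suffix is `["b", "c", "d", "e"]`, and the first failing subset of it is `["b", "d"]`. For a deterministic single-input bug the search ends as soon as the suffix containing the faulty input is tried.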

Outline Motivation and Findings Methodology and Limitations Definitions and Terminology Classification of Software Bugs Analysis of Multiple Input Bugs Concurrency Bugs Implications Conclusions and Future Work 40

Conclusion and Future Work We report the results of an empirical study of server bugs –Most of the bugs were deterministic –Most of the bugs (77%) needed a single input –The set of inputs for multi-input bugs is usually small and clustered –Many bugs produce incorrect outputs –Very few bugs are concurrency bugs –Most of the concurrency bugs need multiple inputs Future work –Create light-weight detectors to detect incorrect outputs –Build production-site automated tools that automatically diagnose the root cause at the production site: reproduce failures and reduce the input stream to a minimal faulty set 41

Questions Questions? (Detailed Technical Report

Motivation – Server Bugs Server applications are challenging –Run for long periods of time –Handle large amounts of data –Concurrent –Inputs are well-structured Do server bugs lend themselves to being automatically reproduced? Are there characteristics that ease automatic bug diagnosis? Are there characteristics of inputs that ease finding a minimal test case? 43

Summary of Findings Concurrency bugs –Nearly all are non-deterministic –Need more inputs to trigger symptom –Fewer incorrect outputs, more crashes/hangs –17% result in different symptoms in different executions 44

Bug Selection We selected a recent major version of the software –In development and production use for at least a year Selected a set of bugs from Bugzilla with a set of filters –Status field with RESOLVED / VERIFIED / CLOSED –Resolution field with FIXED –Severity field with anything other than TRIVIAL / ENHANCEMENT Randomly selected a set of bugs from the list of bugs –Using a seeded rand() function –For sshd and SVN, selected the complete list of bugs 45

Number of Bug Triggering Inputs 46

Number of Bug Triggering Inputs 47

Number of Bug Triggering Inputs 48

Q & A Checkpointing –OS support, application support (multi-threading complicates checkpointing) –Bug occurs before checkpoint –Cost is minor (read s/w only checkpointing papers….) Static analysis –Simple cases may have different characteristics Simplified test case –Some cases: simplified query –If the faulty input is not removed, it's fine Fault-triggering input Long unfixed bugs - mechanisms prevent unfixed bugs….. (include numbers) 49

Q & A Likely clustered – subjective, conservative ….. based on bug properties, not human assumption Security bugs - very few, won't impact much –I don't know, but many memory bugs may have similar characteristics Structured input – all servers will have structured input; we didn't select servers in a biased way => not likely to be biased 50