Understanding and Dealing with Operator Mistakes in Internet Services
Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Rutgers University, Vivo Project: http://vivo.cs.rutgers.edu
Funding from NSF grants #EIA-0103722, #EIA-9986046, and #CCR-0100798.
Motivation
- Internet services are ubiquitous, e.g., Google, Yahoo!, Amazon, eBay.
- Expectation of 24x7 availability, but service outages still happen! (Typical outage notice: "Sorry... We apologize for the inconvenience, but the system is currently unavailable. Please try your request in an hour. If you require assistance please call Customer Service at 1-866-325-3457.")
- A significant number of outages in Internet services are a result of operator actions [Oppenheimer03]:
  #1: Architecture is complex
  #2: Systems are constantly evolving
  #3: Lack of tools for operators to reason about the impact of their actions (offline testing, emulation, simulation)
- Very little detail is available on operator mistakes; details are strongly guarded by companies and administrators.
Talk Outline
- Approach and Contributions
- Operator Study: Understanding the Mistakes
- Validation: Preventing Exposure of Mistakes
- Conclusion and Future Work
This Work
- Understanding: gather detailed data on operators’ mistakes
  - What categories of mistakes? What is their impact on the service?
  - How do mistakes correlate with experience and impact?
  - Approaches to deal with operator mistakes: prevention, recovery, automation
- Validation: allow operators to evaluate the correctness of their actions prior to exposing them to the service
  - Similar to offline testing, but:
    - Virtual environment (extension of the online environment)
    - Real workload
    - Migration back and forth with minimal operator involvement
Contributions
- Detailed information on operator tasks and mistakes
  - 43 experiments yielding detailed data on operator behavior, including 42 mistakes
  - 64% immediately degraded throughput; 57% were software configuration mistakes
  - Demonstrate that human experiments are possible and valuable
- Designed and prototyped a validation infrastructure
  - Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service
  - 2 techniques to allow operators to validate their actions
- Demonstrated that validation is a promising technique for reducing the impact of operator mistakes
  - 66% of all mistakes observed in the operator study were caught
  - 6 of 9 mistakes caught in live operator experiments with validation
  - Successfully tested with synthetically injected mistakes
Talk Outline
Multi-Tiered Internet Services
[Diagram: client requests arrive at Tier 1 (Web servers), which forward them to Tier 2 (application servers), which in turn query Tier 3 (the database server).]
Tasks, Operators & Training
- Tasks
  - Scheduled maintenance tasks (proactive), e.g., upgrade Apache
  - Diagnose-and-repair tasks (reactive), e.g., diagnose a disk failure
- Operator composition
  - 14 computer science graduate students
  - 5 professional programmers (Ask Jeeves)
  - 2 sysadmins from our department
- Categorization of operators, based on a filled-in questionnaire
  - 11 novices: some familiarity with the setup
  - 5 intermediates: experience with a similar service
  - 5 experts: in charge of a service requiring high uptime
- Operator training
  - Novice operators given warm-up tasks
  - Material describing the service, and detailed steps for the tasks
Experimental Setup
- Service
  - 3-tier auction service and client emulator from Rice University’s DynaServer project
  - Loaded at 35% of capacity
- Machines
  - 2 Web servers (Apache), 5 application servers (Tomcat), 1 database machine (MySQL)
- Operator assistance & data capture
  - Monitor service throughput
  - Modified bash shell for command and result traces
  - Manual observation: noting anomalies in operator behavior, bailing out ‘lost’ operators
Example Trace
- Task: add an application server
- Mistake: Apache misconfiguration
- Impact: degraded throughput
[Throughput trace annotations: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted]
Sampling of Other Mistakes
- Adding a new application server
  - Omission of the new application server from the backend member list
  - Syntax errors, duplicate entries, wrong hostnames
  - Launching the wrong version of the software
- Migrating the database for a performance upgrade
  - Incorrect privileges for accessing the database
  - Security vulnerability
  - Database installed on the wrong disk
Operator Mistakes: Category vs. Impact
- 64% of all mistakes had an immediate impact on service performance; 36% resulted in latent faults
- Obs. #1: A significant number of mistakes can be caught by testing with a realistic environment
- Obs. #2: Undetectable latent errors will still require online-recovery techniques
Operator Mistakes
- Misconfigurations account for 57% of all errors
- Configuration mistakes spanning multiple components are more likely
- Obs. #1: Tools to manipulate and check configurations are crucial
- Obs. #2: Be extremely careful when maintaining multiple versions of software
Operator Categories
- Experts also made mistakes!
- The complexity of the tasks executed by experts was higher
Operator Study Summary
- 43 experiments, 42 mistakes
- 27 (64%) mistakes caused an immediate impact on service performance
- 24 (57%) were software configuration mistakes
- Mistakes were made across all operator categories
- Traces of operator commands & service performance for all experiments are available at http://vivo.cs.rutgers.edu
Talk Outline
Validation of Operator’s Actions
- Validation: allow the operator to check the correctness of his/her actions prior to exposing their impact to the service interface (clients)
- Correctness is tested by:
  1. Migrate the component(s) to a virtual sandbox environment,
  2. Subject it to a real load,
  3. Compare its behavior to a known correct one, and
  4. Migrate back to the online environment
  (a sketch of this workflow appears after this list)
- Types of validation:
  - Replica-based: compare with an online replica (real time)
  - Trace-based: compare with logged behavior
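To make the four-step workflow concrete, here is a minimal sketch in Java. All of the names (Slice, Component, BehaviorComparator, ValidationHarness) are hypothetical interfaces invented for illustration, not the actual Vivo code; they simply mirror the migrate / shunt / compare / migrate-back sequence described above.

```java
// Hypothetical sketch of the validation loop; all names are illustrative.
import java.time.Duration;

interface Component { }
interface Behavior { }

interface Slice {                      // online or sandboxed set of nodes
    void admit(Component c);           // migrate a component into this slice
    void shuntRequestsTo(Slice other); // duplicate live requests into `other`
    void stopShunting();
    Behavior observe(Duration window); // collect throughput/flow/content stats
}

interface BehaviorComparator {
    boolean matches(Behavior reference, Behavior observed);
}

final class ValidationHarness {
    private final Slice online;
    private final Slice sandbox;

    ValidationHarness(Slice online, Slice sandbox) {
        this.online = online;
        this.sandbox = sandbox;
    }

    /** 1. migrate, 2. apply real load, 3. compare, 4. migrate back if correct. */
    boolean validate(Component modified, BehaviorComparator cmp, Duration window) {
        sandbox.admit(modified);                 // 1. move into the sandboxed validation slice
        online.shuntRequestsTo(sandbox);         // 2. duplicate the real client load
        Behavior observed  = sandbox.observe(window);
        Behavior reference = online.observe(window);
        online.stopShunting();
        boolean ok = cmp.matches(reference, observed);  // 3. compare behavior
        if (ok) {
            online.admit(modified);              // 4. migrate back to the online slice
        }
        return ok;
    }
}
```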
Validating a Component: Replica-Based
[Diagram: the service is split into an online slice and a validation slice. Client requests handled by the online Web servers and application servers are shunted through Web server and database proxies to the component under validation, its application state is initialized from the online replica, and its behavior is compared against the replica's.]
Validating a Component: Trace-Based
[Diagram: same online/validation slice split, but requests, responses, and state are logged by the online slice and later replayed through the proxies in the validation slice, with the component's behavior compared against the logged trace.]
Implementation Details
- Shunting performed in the middleware layer
- Each request is tagged with a unique ID all along the request path
- Component proxies can be constructed with little effort: reuse discovery and communication interfaces and a common messaging core
- State management requires a well-defined export and import API; stateful servers often support such an API
- Comparator functions to detect errors: simple throughput, flow, and content comparators (a sketch of a content comparator follows)
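As an illustration of the simplest of these comparators, here is a sketch of a content comparator in Java: it pairs online and validation-slice responses by their unique request IDs and flags any mismatch. The class and method names are hypothetical, not the prototype's actual code.

```java
// Hypothetical content comparator: pairs online and validation responses by
// the unique request ID and flags missing or differing bodies as mistakes.
import java.util.Arrays;
import java.util.Map;

final class ContentComparator {
    /** Each map goes from request ID to the response body captured in that slice. */
    boolean matches(Map<Long, byte[]> onlineResponses,
                    Map<Long, byte[]> validationResponses) {
        for (Map.Entry<Long, byte[]> e : onlineResponses.entrySet()) {
            byte[] candidate = validationResponses.get(e.getKey());
            if (candidate == null || !Arrays.equals(e.getValue(), candidate)) {
                return false;   // missing or differing response: flag a mistake
            }
        }
        return true;
    }
}
```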
Validating Our Prototype: Results
- Live operator experiments
  - Operators were given the option of the type of validation, its duration, and whether to skip validation
  - Validation caught 6 out of 9 mistakes from the 8 experiments with validation
- Mistake-injection experiments
  - Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)
- Operator-emulation experiments
  - Operator command scripts derived from the 42 operator mistakes
  - Both trace-based and replica-based validation caught 22 mistakes
  - Multi-component validation caught 4 latent (component-interaction) mistakes
Reduction in Impact with Validation
Reduction in Mistakes with Validation
Shunting & Buffering Overheads
- Shunting overhead for replica-based validation: 39% additional CPU (all requests and responses are captured and forwarded to the validation slice)
- Trace-based validation is slightly cheaper: 32% additional CPU
- The overhead is incurred on a single component, and only during validation
- Various optimizations can reduce the overhead to 13-22%; examples: response summaries (64 bytes) and sampling (at session boundaries); see the sketch below
- Buffering capacity during state checkpointing and duplication: only about 150 requests needed to be buffered for small state sizes
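As one way to picture the response-summary optimization, the sketch below forwards a fixed-size digest of each response instead of the full body. The choice of SHA-512 (which happens to produce 64 bytes) is purely an assumption for illustration; the slide only states that a 64-byte summary was used.

```java
// Hypothetical illustration of the "response summary" optimization: instead of
// shipping whole response bodies to the validation slice, ship a fixed-size
// digest and compare digests. SHA-512 (64 bytes) is an assumed choice.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class ResponseSummary {
    static byte[] summarize(byte[] responseBody) {
        try {
            return MessageDigest.getInstance("SHA-512").digest(responseBody); // 64 bytes
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-512 unavailable in this JRE", e);
        }
    }

    /** Compare an online response against a validation-slice response by summary only. */
    static boolean sameContent(byte[] onlineBody, byte[] validationBody) {
        return Arrays.equals(summarize(onlineBody), summarize(validationBody));
    }
}
```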
Caveats, Limitations, and Open Issues
- Non-determinism increases the complexity of comparators and proxies
  - E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session IDs, timestamps
- Hard-state management may require operator intervention
  - The component requires initialization prior to online migration
- Bootstrapping the validation
  - Validating an intended modification of service behavior: no traces or replica for comparison!
- How long to validate? What types of validation?
  - Time spent in validation implies reduced online capacity
Conclusions & Future Work
- Gathered data on operator execution & mistakes
  - The majority of the mistakes were configuration errors; many of them degraded system throughput
- Validation is an effective technique to check operator mistakes
  - Simple techniques caught the majority of mistakes
  - Feasible in overhead and implementation effort
  - ‘Validation-ready’ components: hooks for logging, forwarding & buffering messages, saving/restoring state
- Future work: taking validation further
  - Validate operator actions on databases and network components
  - Combine validation with diagnosis to assist operators
  - Other validation techniques: model-based validation
Acknowledgements
We are thankful to our volunteer operators: fellow students, professional programmers, and LCSR staff members.
We would also like to express our gratitude to Christine Hung, Neeraj Krishnan, and Brian Russell for their help in building the monitoring infrastructure in the early stages of the project.
Thank you! Questions?
For more information and traces of operator experiments: http://vivo.cs.rutgers.edu
Backup Slides
Operator Mistakes: Category vs. Impact
- 64% of all mistakes had an immediate impact on service performance; 36% resulted in latent faults
- Obs. #1: A significant number of mistakes can be caught by testing with a realistic workload
- Obs. #2: Undetectable latent errors will still require online-recovery techniques
Mendosus and Slice Isolation
- Mendosus virtualizes a network of nodes on an Ethernet LAN
- Injects network-level failures, including network partitions
- Allows easy isolation of nodes into online and validation slices
- Migration does not require any network-level modifications
Validation Techniques
- Trace-based validation
  - Request/response trace logged to disk
  - State management: state checkpointed to disk is used for initialization
  - Validation scenarios: can have higher directed coverage
- Replica-based validation
  - Real-time forwarding from the online replica
  - State management: state from the replica is used directly for initialization
  - Validation scenarios: reflect current online characteristics
- Multi-component validation
  - Tests interaction with working components from the online slice
(A sketch of trace-based replay follows.)
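The following is a minimal, hypothetical sketch of trace-based replay in Java: logged request/response pairs are replayed against the component under validation and any divergent responses are reported. The record layout and names are illustrative only, not the actual Vivo trace format.

```java
// Hypothetical trace-based validation sketch: replay logged requests against
// the component under validation and report any responses that diverge.
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class TraceReplayer {
    /** One logged request/response pair, keyed by the request's unique ID. */
    record Entry(long requestId, byte[] request, byte[] loggedResponse) { }

    /** Returns the IDs of requests whose replayed response differs from the log. */
    static List<Long> replay(List<Entry> trace, Function<byte[], byte[]> component) {
        return trace.stream()
                .filter(e -> !Arrays.equals(component.apply(e.request()), e.loggedResponse()))
                .map(Entry::requestId)
                .toList();
    }
}
```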
Implementation Details
- Shunting performed in the middleware layer
  - E.g., for the auction service: Apache's mod_jk module, Tomcat valves, and the JDBC driver
- Each request is tagged with a unique ID all along the request path
- Component proxies can be constructed with little effort
  - Reuse discovery and communication interfaces, and add a common request/response messaging core
  - E.g., the auction service required 4 proxies, derived by adding/modifying only 232, 307, 274, and 384 lines of C/Java code (a sketch of such a proxy follows)
- State management requires a well-defined export and import API
  - Stateful servers often support such an API
  - For the Tomcat application server, the regular state manager required a small modification to export state to the validation infrastructure
- Simple comparator functions to detect errors: throughput, flow, and content comparators
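For illustration, here is a hypothetical sketch of such a proxy in Java: it accepts a shunted request (already tagged with its unique ID at the service entry point), forwards it to the component under validation, and hands the result to the comparators. The real proxies were written inside mod_jk, Tomcat valves, and the JDBC driver; the names below are invented.

```java
// Hypothetical shunting proxy in front of a validation-slice component: it
// forwards each shunted request and records (id, request, response) so the
// comparators can pair it with the corresponding online response.
import java.util.function.Function;

final class ShuntingProxy {
    interface ResponseSink {
        void record(long requestId, byte[] request, byte[] response);
    }

    private final Function<byte[], byte[]> componentUnderValidation; // request -> response
    private final ResponseSink sink;                                 // feeds the comparators

    ShuntingProxy(Function<byte[], byte[]> componentUnderValidation, ResponseSink sink) {
        this.componentUnderValidation = componentUnderValidation;
        this.sink = sink;
    }

    /** Called with a request duplicated (shunted) from the online slice; the
     *  unique ID was assigned when the request first entered the service. */
    void onShuntedRequest(long requestId, byte[] request) {
        byte[] response = componentUnderValidation.apply(request);
        sink.record(requestId, request, response);  // comparator matches it against online
    }
}
```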
Shunting & Buffering Overheads
- Various optimizations can reduce the overhead to 13-22%; examples: response summaries (64 bytes) and sampling (at session boundaries)
- Buffering capacity during state checkpointing and duplication: only about 150 requests needed for small state sizes
[Chart annotation: 39% shunting overhead]