Autonomous Recovery in Componentized Internet Applications. Candea et al. Vikram Negi
Introduction Autonomic Problem Approach Results Discussion
The Autonomic Problem To allow the application to recover automatically from transient and intermittent software failures.
The Approach Introduce the ideas: –Microanalysis (fault detection) –Microrebooting (rapid recovery) –External management (recovery action) Integrate and test with JBoss
Design Overview Autonomous Process –Monitoring Java probes –Fault detection Generate Anomaly report –Recovery Takes action Total time to recovery.
J2EE Review J2EE enterprise apps = collection of reusable Java modules JSPs / servlets invoke EJBs, which invoke other EJBs, ... EJB = Java component that complies with a certain interface and provides a service Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application
JBoss Design Open-source J2EE app server Written entirely in Java Microkernel with components held together by JMX (Mgmt Support)
Pinpoint: Detection and Localization Stored observations –IP address of machine, timestamp –Globally unique request ID –Number of calls/returns to EJBs –Association between sender and receiver –Collected SQL queries (updates, reads)
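A minimal sketch of what one stored observation might look like, assuming one record per inter-component call. The class name, field names, and sample values are hypothetical and are not taken from Pinpoint itself; they only mirror the items listed on the slide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observed call between two components (hypothetical layout)."""
    request_id: str   # globally unique request ID
    host_ip: str      # IP address of the machine that saw the call
    timestamp: float  # time of the observation, in seconds
    caller: str       # sending component
    callee: str       # receiving component


def calls_by_pair(observations):
    """Count how often each (caller, callee) association was observed."""
    return Counter((o.caller, o.callee) for o in observations)


# Made-up sample data for illustration only.
observations = [
    Observation("req-1", "10.0.0.5", 1000.0, "ViewItemServlet", "ItemBean"),
    Observation("req-1", "10.0.0.5", 1000.2, "ItemBean", "UserBean"),
    Observation("req-2", "10.0.0.6", 1001.0, "ViewItemServlet", "ItemBean"),
]
pair_counts = calls_by_pair(observations)
```

Grouping by `request_id` instead of by pair would recover the path of a single request through the components.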
Pinpoint: Analysis Analysis engine –Centralized engine –Plugin-based architecture Modeling components –Assume that a component's present behavior and its historical (normal) behavior follow the same probability distribution. –Chi-square (χ²) test to detect when the two distributions differ.
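As a rough illustration of the chi-square comparison on this slide, the sketch below computes Pearson's statistic between a component's historical call distribution and its currently observed call counts. The component names, counts, and the 5% significance level are made up for illustration; the slides do not say how Pinpoint turns the statistic into a score.

```python
def chi_square_statistic(observed, expected_probs):
    """Pearson's chi-square statistic: observed counts vs. expected probabilities."""
    total = sum(observed.values())
    stat = 0.0
    for key, prob in expected_probs.items():
        expected = total * prob
        if expected == 0:
            continue  # skip categories the historical model never saw
        diff = observed.get(key, 0) - expected
        stat += diff * diff / expected
    return stat


# Hypothetical historical (normal) distribution of one EJB's outgoing calls.
historical = {"ItemBean": 0.5, "UserBean": 0.3, "BidBean": 0.2}

# Hypothetical current observations: one close to normal, one skewed.
normal_now = {"ItemBean": 52, "UserBean": 29, "BidBean": 19}
faulty_now = {"ItemBean": 90, "UserBean": 8, "BidBean": 2}

# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_95_DF2 = 5.991

normal_is_anomalous = chi_square_statistic(normal_now, historical) > CRITICAL_95_DF2
faulty_is_anomalous = chi_square_statistic(faulty_now, historical) > CRITICAL_95_DF2
```

With three categories there are two degrees of freedom, which is why the critical value for df = 2 is used here.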
Recovery: a microreboot is not expensive State segregation –Store important state outside the application, in a database. –Persistent state: CMP (container-managed persistence, J2EE) is a requirement for the prototype. –Session state: stored in a modified SSM (external session state store). Containment and reintegration –Microreboot the transitive closure of all inter-EJB references –XML deployment descriptors determine the grouping for the closure –Complete reboot or microreboot
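The containment step above can be sketched as a graph traversal: starting from the faulty EJB, collect every EJB connected to it through references and microreboot them as one group. This sketch assumes references are treated as symmetric (an EJB is grouped with both the beans it references and the beans that reference it), and the reference map is a hypothetical stand-in for what would be read from the XML deployment descriptors.

```python
from collections import deque


def recovery_group(start, references):
    """Return all components connected to `start` through inter-EJB references."""
    # Build a symmetric adjacency map (assumption: direction does not matter).
    neighbors = {}
    for src, targets in references.items():
        for dst in targets:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    # Breadth-first search computes the transitive closure from `start`.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical reference map; the names are illustrative only.
references = {
    "CommitBid": ["Item", "User"],
    "ViewItem": ["Item"],
    "BrowseCategories": ["Category"],
}
group = recovery_group("CommitBid", references)
```

Here a fault in `CommitBid` pulls in `Item`, `User`, and `ViewItem`, while `BrowseCategories` and `Category` stay untouched.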
Recovery Enabling microreboot –Method in the JBoss EJB container –Preserve the class loader
Manage Recovery Recovery policy –Read the failure report; consider components scoring > 1.0 –Microreboot the top n, or all components > 1.0 –Allow a delay (~30 sec) –If the error is still present, retry a few times or reboot completely –Finally, report it to the sysadmin
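The policy on this slide can be sketched as a small escalation loop. Everything here is an assumption made for illustration: the function names, the callback-style interface, the default of three attempts, and the reading of the report as a mapping from component name to score. It is not the recovery manager's actual code.

```python
def suspects(report, threshold=1.0, top_n=None):
    """Components whose score exceeds the threshold, worst first."""
    ranked = sorted(
        (name for name, score in report.items() if score > threshold),
        key=lambda name: report[name],
        reverse=True,
    )
    return ranked if top_n is None else ranked[:top_n]


def recover(read_report, microreboot, full_reboot, notify_admin, wait,
            top_n=None, max_attempts=3, settle_seconds=30):
    """Escalate: microreboot suspects, then full reboot, then notify a human."""
    for _ in range(max_attempts):
        targets = suspects(read_report(), top_n=top_n)
        if not targets:
            return "healthy"
        for name in targets:
            microreboot(name)
        wait(settle_seconds)  # give the system time to settle (~30 sec)

    if not suspects(read_report()):
        return "healthy"

    full_reboot()  # microreboots did not help; reboot completely
    wait(settle_seconds)
    if suspects(read_report()):
        notify_admin()  # last resort: report to the sysadmin
        return "escalated"
    return "healthy"
```

Passing `wait` as a parameter keeps the sketch testable; a real deployment would presumably pass something like `time.sleep`.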
Evaluation Test Framework Applications –Petstore 1.1 (12 components, 233 Java files, 11K LOC) –Petstore 1.3.1 (47 components, 310 Java files, 10K LOC) –RUBiS (21 components, 500 Java files, 25K LOC) Workload –Client simulators driven by a transition table –350 clients (max utilization principle) Faultload –Based on industry experience –No low-level hardware or OS faults
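A transition-table workload can be sketched as a random walk over pages, where each row of the table gives the probabilities of the next page. The page names and probabilities below are invented for illustration; the slides do not give the table used in the evaluation.

```python
import random

# Hypothetical transition table: page -> list of (next page, probability).
TRANSITIONS = {
    "Home": [("Browse", 0.7), ("Login", 0.3)],
    "Login": [("Browse", 1.0)],
    "Browse": [("ViewItem", 0.6), ("Home", 0.4)],
    "ViewItem": [("Bid", 0.3), ("Browse", 0.7)],
    "Bid": [("Home", 1.0)],
}


def simulate_session(transitions, start, steps, rng):
    """Walk the transition table for `steps` page requests, starting at `start`."""
    page = start
    path = [page]
    for _ in range(steps):
        next_pages, weights = zip(*transitions[page])
        page = rng.choices(next_pages, weights=weights, k=1)[0]
        path.append(page)
    return path


# A seeded generator makes one simulated client's session reproducible.
rng = random.Random(42)
session = simulate_session(TRANSITIONS, "Home", 10, rng)
```

Running 350 such walkers concurrently, each issuing the request for its current page, would approximate the client load described above.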
Evaluation: Detection Results similar to other detectors. No discussion of absolute numbers? Faults: forced Java runtime/declared exceptions, call omission, and source code bugs. (1) How well was the fault detected? (2) How well was a major outage detected?
Evaluation: Localization Localization % for an algorithm per fault type CIA > 85% No absolute data again?
Evaluation: Recovery Introduce faults in SSM-RUBiS. Restart SSM-RUBiS or microreboot the component. Observations from 10 trials with 350 concurrent clients.
Full vs. Microreboot Injected a null-reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories, and SB CommitUserFeedback. Microreboot maintains a steady response. 425 vs 3916 failed requests; 61527 vs 56028 successful requests. What error conditions did the other trials have?
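To put the request counts above in proportion, the short calculation below derives failure rates from them. It assumes the first number in each pair belongs to the microreboot run and the second to the full restart, which the slide implies but does not state outright.

```python
def failure_rate(failed, succeeded):
    """Fraction of all requests that failed."""
    return failed / (failed + succeeded)


# Counts from the slide; assignment to microreboot vs. full restart is assumed.
micro_rate = failure_rate(failed=425, succeeded=61527)
full_rate = failure_rate(failed=3916, succeeded=56028)
failed_ratio = 3916 / 425  # how many times more requests failed under full restart

print(f"microreboot failure rate:  {micro_rate:.2%}")
print(f"full restart failure rate: {full_rate:.2%}")
print(f"failed-request ratio:      {failed_ratio:.1f}x")
```

Under that assumption, roughly nine times as many requests fail with a full restart as with a microreboot.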
Total Recovery Time Corrupt SB_ViewItem by setting it to NULL. 19.4 sec TRT, of which 18.5 sec is spent in analysis. Pinpoint is the bottleneck in microreboot recovery.
Is Pinpoint app-generic? Upgrade to Petstore v1.3.2 –Works for the confidence interval. How different was the updated version?
Performance Overhead Results for a 30-min fault-free run w/ 350 clients In-memory vs. external (SSM) session state Marshalling costs
Assumptions Well-defined interfaces for components (.NET, J2EE) Deterministic call paths between components No critical service requests Training data for the statistical model Guidelines (crash-only software)
Discussion Overall one of the good papers, though maybe a bit verbose in the introduction. Integrating framework for earlier work by Candea. Limitations of the present statistical model. Shared EJB state –Modify JIT, disable microreboots (references, static variables) Application –Global data not scrubbed. Cost-benefit: microreboot vs. total reboot
Supplementary Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with the web server to make the app web-accessible) http://people.epfl.ch/george.candea