Towards a Hardware-Software Co-Designed Resilient System
Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou
University of Illinois at Urbana-Champaign
In collaboration with Pradip Bose (IBM) and Subhasish Mitra (Stanford)

Pradeep Ramachandran, University of Illinois, Urbana Champaign

Motivation
Failures will happen in the field
  – Design defects
  – Aging
  – Soft errors
  – Inadequate burn-in
  – Aggressive design for power/performance/reliability
  – …
Low-cost method to detect/recover from all sources of failure?
  – Reliability problem pervasive across many markets
  – Traditional solutions (e.g. nMR) too expensive
  – Must incur low performance, power overhead

A Low-Cost, Unified Reliability Solution
Need to handle only faults that propagate to software
  – Hardware faults appear as software bugs
  – Leverage software reliability solutions for hardware?
One-size-fits-all near-100% coverage often unnecessary
  – Solution must be customizable to application needs

Outline
Motivation of Framework
Unified Framework for H/W + S/W Reliability
Understanding the Impact of H/W Failures on S/W
Future Work

Unified Framework for H/W + S/W Reliability
Unified hardware/software co-designed framework
  – Tackles hardware and software faults
  – Software-centric solutions with near-zero H/W overhead
  – Customizable to app needs, flexible for new error sources
[Figure: three fault-to-error timelines relative to checkpoints. (1) Error undetected. (2) Detection with more overhead: testing detects the error, followed by repair and recovery. (3) Ideal, symptom-based detection: symptom detected, followed by diagnosis, repair, and recovery.]

Framework Components
Detection: Software symptoms, online testing
Recovery: Software/hardware checkpoint and rollback
Diagnosis: Firmware layer for rollback/replay, online testing
Repair/reconfiguration: Redundant, reconfigurable hardware
Need to understand how hardware faults propagate to S/W
  – How do hardware faults become visible to software?
  – What is the latency?
  – Do H/W faults affect application and/or system state?

Methodology
Microarchitecture-level fault injection
  – Trade-off between accuracy and simulation time
  – GEMS timing models for out-of-order processor, memory
  – Simics full-system simulation of Solaris + UltraSPARC III
  – SPEC workloads for ten million instructions
Fault model
  – Stuck-at, bridging faults in many micro-arch structures
Fault detection
  – Crashes detected through hardware-generated fatal traps
      – Misaligned memory access, RED state, watchdog reset, etc.
  – Hangs detected using simple hardware hang detector
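
The two fault models named on this slide can be illustrated at the bit level. The sketch below is a minimal illustration, not the injection code used in the study: the function names, the 64-bit width, and the choice of wired-AND/wired-OR behavior for the bridging fault are all assumptions made for the example.

```python
WIDTH = 64                 # assumed width of the latch or bus being modeled
MASK = (1 << WIDTH) - 1


def stuck_at(value: int, bit: int, level: int) -> int:
    """Return value with one bit forced to a fixed level (0 or 1)."""
    if level:
        return (value | (1 << bit)) & MASK
    return value & ~(1 << bit) & MASK


def bridge(value: int, bit_a: int, bit_b: int, dominant: str = "and") -> int:
    """Return value with two bits shorted together.

    Both bits take the AND (or OR) of their fault-free values. Which of
    the two behaviors applies is an assumption of this sketch.
    """
    a = (value >> bit_a) & 1
    b = (value >> bit_b) & 1
    shared = (a & b) if dominant == "and" else (a | b)
    cleared = value & ~((1 << bit_a) | (1 << bit_b))
    return (cleared | (shared << bit_a) | (shared << bit_b)) & MASK


if __name__ == "__main__":
    word = 0b0110
    print(bin(stuck_at(word, 0, 1)))        # bit 0 forced high
    print(bin(bridge(word, 0, 1, "and")))   # bits 0 and 1 shorted, AND-dominant
```

A permanent fault would apply the same transformation on every use of the affected structure, whereas a transient would apply it once.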

How do Hardware Faults Propagate to Software?
97% faults (w/o FPU) detectable with simple H/W & S/W
  – Need H/W support or S/W monitoring for FPU
> 50% crashes/hangs in OS

S/W Components Corrupted
62% of faults corrupt system state
  – Need to recover system state

Latency to Detection from Application Corruption
80% have latency < 100K instr, amenable to H/W recovery
  – Buffering for 50µs on 2 GHz processor
May need to use software checkpoint/recovery for others
[Figure: total instructions executed between app state corruption and detection]
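
The 50µs figure follows from the 100K-instruction latency and the 2 GHz clock if the processor is taken to retire one instruction per cycle. That IPC of 1 is an assumption of this sketch (the slide does not state it), and the function name is hypothetical.

```python
def buffering_window_us(instructions: int, clock_hz: float, ipc: float = 1.0) -> float:
    """Time in microseconds needed to execute a number of instructions.

    ipc is the assumed number of instructions retired per cycle.
    """
    cycles = instructions / ipc
    seconds = cycles / clock_hz
    return seconds * 1e6


if __name__ == "__main__":
    # 100K instructions at 2 GHz with the assumed IPC of 1: about 50 microseconds
    print(buffering_window_us(100_000, 2e9))
```

A higher sustained IPC would shorten the window proportionally, so 50µs is a conservative bound under this assumption.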

Latency to Detection from OS Corruption
92% of injections result in latency of < 100K OS instructions
  – Amenable to hardware recovery
[Figure: OS-only instructions executed between OS state corruption and detection]

Summary so far
Hardware faults highly visible
  – Over 97% of faults in 6 structures result in crashes/hangs
  – Simple H/W and S/W sufficient
Recovery through checkpointing
  – S/W and/or H/W checkpoints for application recovery
  – H/W checkpoints and buffering for OS recovery

Next Steps (1 of 3)
Improving understanding of fault propagation
  – Accurate fault models, effect of transients, intermittents
  – Lower-level simulations
  – Better workloads
Detection
  – More software-level monitoring
      – Software signals, invariants, perturbations, …
  – H/W support to aid detection in some structures (e.g., FPU)
  – Selective backup testing
Recovery
  – Enhanced detection may reduce latency
  – Explore software vs. hardware, application customizability

Next Steps (2 of 3)
Diagnosis
  – Assume rollback/restart mechanism, multicore system
  – Bug detected: rollback to previous checkpoint, restart on original core
      – Original symptom doesn’t recur: transient h/w bug, or non-deterministic s/w bug; continue execution
      – Original symptom recurs: deterministic s/w bug, or permanent h/w bug; rollback, restart on different core
          – No symptom: permanent defect in original core
          – Symptom: deterministic s/w bug
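
The diagnosis flow on this slide amounts to a two-step decision procedure. The sketch below is one possible reading of it, not the authors' implementation: the function name and the two callables (each standing in for "roll back to the checkpoint, re-execute on that core, and report whether the symptom appeared") are hypothetical.

```python
from typing import Callable


def diagnose(symptom_on_original_core: Callable[[], bool],
             symptom_on_different_core: Callable[[], bool]) -> str:
    """Classify a detected bug by replaying from the previous checkpoint.

    Each callable performs one rollback-and-replay (assumed to be
    provided by the platform) and returns True if the symptom recurs.
    """
    # Step 1: roll back and restart on the original core.
    if not symptom_on_original_core():
        return "transient h/w bug or non-deterministic s/w bug: continue execution"

    # Step 2: the symptom recurred, so roll back and restart on a different core.
    if not symptom_on_different_core():
        return "permanent defect in original core"

    # The symptom follows the software across cores.
    return "deterministic s/w bug"


if __name__ == "__main__":
    # Symptom recurs on the original core but not on another core.
    print(diagnose(lambda: True, lambda: False))
```

The second replay runs only when the first one reproduces the symptom, which matches the slide's ordering and avoids occupying a second core for transient faults.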

Next Steps (3 of 3)
Repair/reconfigure
  – What should be the right field configurable unit?
  – Core, FU, array entries?
Avoidance
  – Dynamic reliability management
Implementation architecture
  – Hardware + firmware + OS
  – Itanium machine check architecture has hooks

Thank You
Questions?

Backup Slides

Types of fatal traps
Faults cause different fatal traps to be thrown before crashes
  – Junk data access leads to memory misalignment
  – Repeated trapping leads to RED state
