DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips
Andrew DeOrio †, Konstantinos Aisopos ‡§, Valeria Bertacco †, Li-Shiuan Peh §
DAC 2011
† University of Michigan, ‡ Princeton University, § Massachusetts Institute of Technology
Reliable Networks on Chip
[Figure: a node consisting of a processor (uP), cache ($), and router (R); stage labels: detection, diagnosis, reconfiguration, recovery; approach labels: Drain, fault-tolerant routing]
Detect if a fault has occurred
Diagnose what fault has occurred
Recover and resume normal operation
Reconfigure network to account for fault
Nodes are disconnected, state is lost!
Recovery Approaches
Checkpoint/recovery approaches
[Figure: node (R, uP, $) with checkpoint buffers: data stuck in checkpoint buffer!]
[Figure: node (R, uP, $) with memory (MEM): high performance overhead!]
Drain takes a reactive approach, incurring performance overhead only when errors occur
Data Recovery with Drain
Recover data lost during reconfiguration
– Emergency links provide alternate path
– Transfers cache contents and architectural state
[Figure: nodes, each with a processor core (uP), local cache ($), and router, plus a memory controller (Mem), connected by primary links and DRAIN emergency links]
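The idea on this slide can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware protocol from the talk: the `Node` class, the `drain` function, the `primary_link_ok` flag, and the `received` store are all hypothetical names introduced for illustration. It shows a node that has lost its primary link handing its cache contents and architectural state to a neighbor over an alternate (emergency) path.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node (core + local cache + router)."""
    name: str
    cache: dict = field(default_factory=dict)       # address -> data
    arch_state: dict = field(default_factory=dict)  # register name -> value
    primary_link_ok: bool = True                    # False once the node is cut off
    received: dict = field(default_factory=dict)    # source node name -> recovered payload


def drain(disconnected: Node, neighbor: Node) -> bool:
    """Move state from a disconnected node to a neighbor.

    Assumption: the emergency link is modeled as a direct hand-off of a copy
    of the cache contents and architectural state. Returns True if a transfer
    happened, False if the node is still reachable over its primary link.
    """
    if disconnected.primary_link_ok:
        return False  # nothing to recover; the node is still connected
    neighbor.received[disconnected.name] = {
        "cache": dict(disconnected.cache),
        "arch_state": dict(disconnected.arch_state),
    }
    # The source node's state has been handed off, so clear the local copies.
    disconnected.cache.clear()
    disconnected.arch_state.clear()
    return True


# Usage: node n0 loses its primary link and is drained into its neighbor n1.
n0 = Node("n0", cache={0x40: 7}, arch_state={"pc": 0x1000}, primary_link_ok=False)
n1 = Node("n1")
drained = drain(n0, n1)
```

The model deliberately does nothing while the primary link works, mirroring the reactive approach described in the talk: cost is paid only when a node actually becomes inaccessible.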
Conclusions
DRAIN is a lightweight recovery mechanism for CMPs
– 5,000 gates per node
Recoup cache data and architectural state from disconnected nodes
Performance overhead only during recovery
– ~3 ms at 1 GHz
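To put the recovery figure in terms of clock cycles, the quoted time can be multiplied by the quoted clock frequency. The helper name `recovery_cycles` below is an assumption made for illustration; the inputs are the approximate values from the slide (~3 ms, 1 GHz), so the result is an approximation as well.

```python
def recovery_cycles(recovery_time_s: float, clock_hz: float) -> int:
    """Convert a recovery time in seconds to clock cycles at a given frequency."""
    return round(recovery_time_s * clock_hz)


# ~3 ms at 1 GHz, using the approximate values quoted on the slide.
cycles = recovery_cycles(3e-3, 1e9)
print(cycles)  # roughly 3 million cycles
```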