Presentation on theme: "DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips Andrew DeOrio †, Konstantinos Aisopos ‡§ Valeria Bertacco †, Li-Shiuan."— Presentation transcript:
DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips Andrew DeOrio †, Konstantinos Aisopos ‡§ Valeria Bertacco †, Li-Shiuan Peh § DAC 2011 † University of Michigan ‡ Princeton University § Massachusetts Institute of Technology
Reliable Networks on Chip 2 R up $ processor cache router Detect if fault has occurred Diagnose what fault has occurred Recover and resume normal operation Reconfigure network to account for fault Drain fault-tolerant routing detectiondiagnosis recon- figuratio recovery nodes are disconnected, state is lost!
Recovery Approaches Checkpoint/recovery approaches Drain takes a reactive approach, incurring performance overhead only when errors occur 3 R uP $ checkpoint buffers data stuck in checkpoint buffer! R uP $ MEM high performance overhead!
Data Recovery with Drain Recover data lost during reconfiguration – Emergency links provide alternate path – Transfers cache contents and architectural state primary link Mem uP $ Router uP $ uP $....................................... processor core local cache memory controller DRAIN emergency link 4
Drain Example up $ M $ $ $ 5
Drain Example up $ M $ $ $ X link failure 6
Drain Example up $ M $ $ $ 7 reconfigure interconnect
Drain Example up $ M $ $ $ X link failure 8
Drain Example up $ M $ $ $ 9 node isolated!
Drain Example up $ M $ $ $ drain connected nodes via primary links 10
Drain Example up $ M $ $ $ drain disconnected node via emergency link 11
Drain Example up $ M $ $ $ drain connected node again 12
Drain Example up $ M $ $ resume normal operation 13 up $
Drain Performance as Links Fail 14 increasing emergency link time decreasing functional network size
Memory Latency Before and After 15
Conclusions DRAIN is a lightweight recovery mechanism for CMPs – 5,000 gates per node Recoup cache data and architectural state from disconnected nodes Performance overhead only during recovery – ~3ms at 1GHz 16