SE-Aware HPC Extension : Selective Data Protection for reducing failures due to soft errors 7/20/2006 Kyoungwoo Lee.


1 SE-Aware HPC Extension : Selective Data Protection for reducing failures due to soft errors
7/20/2006 Kyoungwoo Lee

2 Contents
Motivation
Previous Work
Current Work
Next Step

3 Soft Error
What is a soft error?
Why are soft errors important?
How can a system recover from soft errors?

4 Definition of Soft Error
Soft Error (SE) = Transient Fault = Bit Flip = Single Event Upset (SEU)
A charged particle strikes an electronic circuit and changes the amount of charge stored at a sensitive node, thereby changing its logic state (e.g., ‘0’ to ‘1’ or vice versa)
Random, non-catastrophic, non-destructive, recoverable
Caused by radiation:
  Neutrons
  Alpha particles
  High-energy cosmic rays
  Solar particles
Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005

5 Soft Error Rate
Soft Error Rate (SER) is measured in FIT (Failures in Time): 1 FIT = one failure per one billion (10^9) device-hours
  e.g.: 1,000 FIT per Mbit ≈ 114 years MTTF (Mean Time To Failure)
SER ∝ Nflux × CS × exp(−Qcritical / QS)
  Nflux: intensity of the neutron flux
  CS: area of the cross section of the node
  QS: charge collection efficiency
  Qcritical: minimum charge required for a cell to retain data; Qcritical = C × V, where C is the capacitance and V is the voltage
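The FIT-to-MTTF example above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are illustrative and not from the presentation, and a year is assumed to be 365.25 days.

```python
# One FIT is one failure per 10^9 device-hours.
HOURS_PER_FIT_WINDOW = 1e9
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length


def fit_to_mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a component with the given FIT rate."""
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_WINDOW / fit


def fit_to_mttf_years(fit: float) -> float:
    """Same conversion, expressed in years."""
    return fit_to_mttf_hours(fit) / HOURS_PER_YEAR


# 1,000 FIT corresponds to 10^6 hours, i.e. roughly 114 years.
print(round(fit_to_mttf_years(1_000)))  # 114
```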

6 Importance of SE
Critical SE
High integration and density
  e.g.: 1 GB memory at 1,000 FIT per Mbit → 8 × 10^6 FIT per memory → about 5 days MTTF
Technology advancements
  e.g.: 1,000 FIT per Mbit in 0.18 µm technology → 10,000 to 100,000 FIT per Mbit in 0.13 µm technology
Latitude and altitude
  e.g.: SER is 10 to 100 times higher at flight altitude than at ground level
Voltage scaling
  e.g.: lower voltage decreases Qcritical, which increases SER exponentially
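The density and voltage-scaling examples can be reproduced numerically. The sketch below assumes 1 GB is rounded to 8,000 Mbit, as the slide's 8 × 10^6 figure implies, and uses the proportional model SER ∝ Nflux × CS × exp(−Qcritical / QS) with Qcritical = C × V. The capacitance, QS and voltage values are arbitrary illustrative units, not measured data, so only the ratio between two operating points is meaningful.

```python
import math


def memory_fit(capacity_mbit: float, fit_per_mbit: float) -> float:
    """Total FIT of a memory, assuming every Mbit fails independently."""
    return capacity_mbit * fit_per_mbit


def mttf_days(total_fit: float) -> float:
    """MTTF in days; one FIT is one failure per 10^9 device-hours."""
    return 1e9 / total_fit / 24


def relative_ser(voltage: float, capacitance: float, q_s: float,
                 n_flux: float = 1.0, cross_section: float = 1.0) -> float:
    """Unnormalized SER from the proportional model (arbitrary units)."""
    q_critical = capacitance * voltage
    return n_flux * cross_section * math.exp(-q_critical / q_s)


total = memory_fit(capacity_mbit=8_000, fit_per_mbit=1_000)  # 8e6 FIT
print(f"{mttf_days(total):.1f} days")  # about 5.2 days

# Hypothetical operating points: dropping V from 1.0 to 0.9 with C = 1.0, QS = 0.2.
ratio = relative_ser(0.9, 1.0, 0.2) / relative_ser(1.0, 1.0, 0.2)
print(f"SER grows by a factor of {ratio:.2f}")
```

Because Qcritical appears in the exponent, a linear reduction in voltage multiplies the SER rather than adding to it.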

7 SER Trend
[Figure panels: A. DRAM, B. SRAM, C. Core Logic, D. Contributions in Processors]
Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test of Computers, 2005
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, 2005

8 SE Detection and Recovery
Information redundancy
  e.g.: ECC (Error Correction Coding) and parity
Hardware redundancy
  e.g.: TMR (Triple Modular Redundancy)
Temporal redundancy
  e.g.: checkpointing and recovery
Effects of redundancy on cost, performance and power
  e.g.: ECC implemented with a Hamming code (250 nm)
    Cost: coding/decoding modules and extra bits
    Performance: 1.45 ns for coding and 2.66 ns for decoding
    Power: 14.5 mW for coding and 26.3 mW for decoding
[Figure: Coding and Decoding modules around a Data array with Extra bits]
L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, 2004
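To make the information-redundancy idea concrete, here is a minimal sketch of a Hamming(7,4) code, the smallest single-error-correcting Hamming code. It is only an illustration of the coding and decoding steps; the cache ECC evaluated in the cited work protects wider words, and the function names here are made up for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into the 7-bit codeword
    [p1, p2, d1, p3, d2, d3, d4] (positions 1..7, parity at 1, 2 and 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return (data_bits, syndrome).
    The syndrome is 0 for a clean word, else the 1-based error position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a soft error at position 5
recovered, position = hamming74_decode(word)
print(recovered, position)  # [1, 0, 1, 1] 5
```

The three extra bits per four data bits, plus the XOR logic on every write and read, are the kind of storage, latency and power costs the slide lists.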

9 Related Works
Reliability and Power Management
  Dr. D. Mossé group, Univ. of Pittsburgh
    D. Zhu, R. Melhem, and D. Mossé, “The Effects of Energy Management on Reliability in Real-Time Embedded Systems,” Proc. of ICCAD, Nov
    D. Zhu, R. Melhem, D. Mossé, and E. Elnozahy, “Analysis of an Energy Efficient Optimistic TMR Scheme,” Proc. of ICPDS, Jul
  Dr. G. De Micheli group, Stanford Univ.
    K. Mihic, T. Simunic, and G. De Micheli, “Reliability and Power Management of Integrated Systems,” Proc. of Euromicro Symposium on Digital System Design, 2004
    T. Simunic, K. Mihic, and G. De Micheli, “Optimization of Reliability and Power Consumption in Systems on a Chip,” Proc. of PATMOS, 2005
(Cache) Architecture
  Dr. M. J. Irwin and Dr. N. Vijaykrishnan group, PSU
    L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Soft Error and Energy Consumption Interactions: A Data Cache Perspective,” Proc. of ISLPED, 2004
  Dr. S. M. Reddy group, Univ. of Iowa
    Y. Cai, M. T. Schmitz, A. Ejlali, B. M. Al-Hashimi, and S. M. Reddy, “Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems,” ASP-DAC 2006
Soft Error and Core Logic
  Intel
    S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” IEEE Computer, 2005
  Dr. K. Roy group, Purdue Univ.
    A. Goel, S. Bhunia, H. Mahmoodi, and K. Roy, “Low-Overhead Design of Soft-Error-Tolerant Scan Flip-Flops with Enhanced-Scan Capability,” ASP-DAC 2006

10 Overhead of ECC in Cache
Power: up to 22% power overhead
Performance: 95% access-time overhead
Area: more than 25% area overhead
[Figure: Unprotected Cache (Data) vs. Protected Cache (Coding, Data with Extra bits, Decoding)]

11 Our Approach
HPC offers low performance overhead as well as high energy savings
Not all data are equally critical to failures
  e.g.: pixel data in video applications matter less for quality and reliability, while quantization values matter more
Provide reliability comparable to that of a protected cache, with small power and performance overheads
[Figure: Unprotected Cache (Data); Selective Data Protection HPC (unprotected Data alongside Data with Extra bits, Coding and Decoding); Protected Cache (Data with Extra bits, Coding and Decoding)]

12 Previous Work
“Mitigating Soft Error Failures for Multimedia Applications using Selective Data Protection,” submitted to CASES 2006
Place multimedia data in the SE-unprotected main cache and all other data in the SE-protected mini cache
Achieve failure rates comparable to those of an SE-protected-only cache, with significant reductions in energy, area and performance overheads
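The placement policy described above can be sketched as a simple per-page mapping. This is not the authors' implementation: the `Page` record, the `is_multimedia` flag and the cache labels are hypothetical names, and how a page gets marked as multimedia data is left outside the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Cache(Enum):
    MAIN_UNPROTECTED = "main cache (no SE protection)"
    MINI_PROTECTED = "mini cache (SE-protected)"


@dataclass(frozen=True)
class Page:
    address: int
    is_multimedia: bool  # hypothetical attribute set by the compiler or programmer


def assign_cache(page: Page) -> Cache:
    """Multimedia pages tolerate bit flips, so they go to the unprotected
    main cache; everything else goes to the protected mini cache."""
    return Cache.MAIN_UNPROTECTED if page.is_multimedia else Cache.MINI_PROTECTED


pages = [Page(0x1000, True), Page(0x2000, False), Page(0x3000, True)]
for page in pages:
    print(hex(page.address), "->", assign_cache(page).value)
```

Only the pages routed to the mini cache pay the coding and decoding cost on each access, which is where the energy and performance savings would come from.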

13 Current Work
Main objective: extend this idea to general applications
How to partition data into important and unimportant data
  Intensive simulation study
  Cache Active Time: how long a page stays in the cache
    The longer the cache active time, the more vulnerable the page is to soft errors
  Cache Access: how many times a page is accessed
    The more often a page is accessed, the more it affects the application, resulting in failures
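One way to turn the two measurements into a partitioning decision is to rank pages by a combined score and protect the highest-ranked ones. The slides do not define how the metrics are combined, so the normalized weighted sum below, the `weight_active` parameter and the `budget_pages` limit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    page_id: int
    active_cycles: int  # Cache Active Time: cycles the page stayed in the cache
    accesses: int       # Cache Access: number of accesses to the page


def criticality_scores(stats, weight_active=0.5):
    """Score each page in [0, 1] using a weighted sum of both metrics,
    each normalized by its maximum over all pages (assumed combination)."""
    if not stats:
        return {}
    max_active = max(s.active_cycles for s in stats) or 1
    max_access = max(s.accesses for s in stats) or 1
    return {
        s.page_id: weight_active * s.active_cycles / max_active
        + (1 - weight_active) * s.accesses / max_access
        for s in stats
    }


def select_protected(stats, budget_pages, weight_active=0.5):
    """Return the ids of the highest-scoring pages that fit the protected budget."""
    scores = criticality_scores(stats, weight_active)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:budget_pages])


# Made-up profile of four pages from a hypothetical simulation run.
profile = [
    PageStats(0, active_cycles=900, accesses=10),
    PageStats(1, active_cycles=100, accesses=500),
    PageStats(2, active_cycles=50, accesses=20),
    PageStats(3, active_cycles=800, accesses=400),
]
print(sorted(select_protected(profile, budget_pages=2)))  # [1, 3]
```

Changing `weight_active` shifts the ranking toward long-resident pages or toward frequently accessed ones, which is the trade-off a simulation study would need to explore.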

14 Preliminary Results

15 Next Step
More simulations
Strong metric
  One that combines Cache Active Time and Cache Access
  Temporal and behavioral analysis to support this
Writing a paper for DATE 2007

