
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. Fred A. Bower, Daniel J. Sorin, and Sule Ozev.


1 A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. Fred A. Bower, Daniel J. Sorin, and Sule Ozev

2 overview: Motivation; Current Techniques; Proposed Mechanism for Online Fault Diagnosis; Results; Challenges; Conclusion

3 background: Hard Faults (Electron Migration, Gate Oxide Breakdown); Transient Faults (Single Event Upset)

4 motivation: Process Scaling

5 current fault handling techniques: DIVA; Redundancy

6 hybrid approach: DIVA (error detection and correction); utilize redundancy

7 online diagnosis: Track units. On a DIVA error: error_count++. If (error_count > threshold): YES, deconfigure unit; NO, no action.
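The flow on this slide can be sketched in a few lines of Python. This is only an illustration of the counter-and-threshold idea, not the authors' hardware design: the class name `OnlineDiagnosis`, the method `on_diva_error`, and the unit names are hypothetical, and the sketch assumes every unit used by a flagged instruction has its counter incremented.

```python
from collections import defaultdict


class OnlineDiagnosis:
    """Illustrative per-unit error counters with a deconfiguration threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.error_count = defaultdict(int)  # unit name -> errors attributed so far
        self.deconfigured = set()            # units that have been turned off

    def on_diva_error(self, units_used):
        """Handle one checker-detected error.

        units_used lists the units the faulting instruction passed through
        (the "track units" step). Returns the units deconfigured by this error.
        """
        newly_off = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue
            self.error_count[unit] += 1                   # error_count++
            if self.error_count[unit] > self.threshold:   # YES branch
                self.deconfigured.add(unit)
                newly_off.append(unit)
            # NO branch: no action
        return newly_off


# Hypothetical usage: three errors all involve alu0 but different other units.
diag = OnlineDiagnosis(threshold=2)
diag.on_diva_error(["alu0", "rob3"])
diag.on_diva_error(["alu0", "rob7"])
print(diag.on_diva_error(["alu0", "rs1"]))  # alu0 crosses the threshold here
```

Because only the unit common to all three errors keeps accumulating counts, it is the one that crosses the threshold; the units that appeared once stay configured.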

8 Field Deconfigurable Units (FDU): units that can be turned off in case of a fault. Diagram labels: ALU; DIVA Checker; Reorder Buffer; Reservation Station

9 deconfiguring mechanism: Deconfigure entries in circular buffer; deconfigure entries in tabular structure
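The slide does not say how entries are deconfigured in each kind of structure. One plausible approach, sketched below purely as an assumption, is a per-entry fault bit: a circular buffer skips marked entries when its pointer advances, and a table never allocates a marked entry. The class names `CircularBuffer` and `Table` and their methods are hypothetical, and occupancy tracking for the circular buffer is omitted for brevity.

```python
class CircularBuffer:
    """Circular structure whose tail pointer skips deconfigured entries."""

    def __init__(self, size):
        self.size = size
        self.faulty = [False] * size  # assumed per-entry fault bit
        self.tail = 0

    def deconfigure(self, index):
        self.faulty[index] = True

    def next_slot(self):
        """Return the next usable slot and advance the tail past faulty ones."""
        for _ in range(self.size):
            slot = self.tail
            self.tail = (self.tail + 1) % self.size
            if not self.faulty[slot]:
                return slot
        raise RuntimeError("all entries deconfigured")


class Table:
    """Tabular structure whose allocator never hands out deconfigured entries."""

    def __init__(self, size):
        self.busy = [False] * size
        self.faulty = [False] * size  # assumed per-entry fault bit

    def deconfigure(self, index):
        self.faulty[index] = True

    def allocate(self):
        """Return the index of a free, working entry, or None if there is none."""
        for i in range(len(self.busy)):
            if not self.busy[i] and not self.faulty[i]:
                self.busy[i] = True
                return i
        return None

    def release(self, index):
        self.busy[index] = False
```

In both cases the structure keeps working with reduced capacity, which matches the slide's framing of deconfiguration as turning off individual entries rather than the whole unit.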

10 analysis: Hard fault diagnosis latency; performance impact of losing a component to a hard fault. DIVA: 6% of an Alpha 21264 core; error counters (~1227 bits total); instruction resource usage (19 wires in total); deconfiguration logic. Can be reduced using coarse granularity.

11 challenges: Error count threshold. Related to resource usage; heavily used resources have higher counters. Pipeline flushes before threshold is reached.


13 challenges: Transient faults; independent resource usage. [Figure labels: ERROR; HARD FAULT; TRANSIENT FAULT; units A, B, C, D, E, F; Desired vs. Observed; DIVA CHECKER]

14 limitations: Certain structures cannot be protected: Register File; Issue logic; Common Data Bus (CDB). Transient fault → False Deconfiguration: possibly masked by error counter. Faults in the error counter or deconfiguration logic: periodically test counters; permanently configure or deconfigure FDU upon error. Window of vulnerability: DIVA produces errors until counter saturates.

15 conclusion: As transistors shrink, hard fault rate increases. Current reliability mechanisms: redundancy (TMR); thread-level redundancy; pre-shipment testing and deconfiguration; low-cost solutions such as DIVA. Online diagnosis: low cost and hardware overhead; use FDUs along with DIVA to diagnose faults dynamically; increase yield → binned to a lower performance bin.

16 discussion: What are the advantages of this hybrid scheme over using just a DIVA checker? As process technology gets smaller, can this mechanism significantly increase the lifetime of the processor? As transistors shrink, the number of cores will increase; can this mechanism still be used, as opposed to turning off a faulty core? How can we extend this mechanism to cover the issue logic, singleton resources, and the CDB?

17 citations
images:
Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. 2007. Web.
Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web.
Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. 2012. Web.
Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May 2012. Web.
Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. 2013. Web.
Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. 2011. Web.
Mark Brehob, EECS 470 Lecture Slides.
papers:
Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005.
T. M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int'l Symposium on Microarchitecture, pages 196-207, Nov. 1999.

