
1 Cache Scrubbing in Microprocessors: Myth or Necessity? (Practical Experience Report)
Shubu Mukherjee, Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*
Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE Pacific Rim International Symposium on Dependable Computing, French Polynesia, March 3-5, 2004
* Also, University of Michigan, Ann Arbor

2 Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors to multiple single-bit errors
- Scrubbing
  - periodically read cache blocks and correct all single-bit errors
  - this prevents single-bit errors from accumulating, thereby avoiding temporal double-bit errors
- Our conclusion, given a detected error target of 10-year MTTF
  - scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

3 Origin of Cosmic Rays
- Cosmic rays come from deep space
[Figure: shower of protons (p) and neutrons (n) reaching the Earth's surface]

4 Impact of Neutron Strike on a Si Device
- Strikes release electron & hole pairs that can be absorbed by the source & drain to alter the state of the device
- Secondary source of upsets: alpha particles from packaging
[Figure: neutron strike on a transistor device, with electron (-) and hole (+) pairs between source and drain]

5 Strike Changes State of a Single Bit
[Figure: a single bit flipped between 0 and 1]
- Example solution
  - error correction codes (ECC) for single-bit correction
  - overhead = 7 bits for 64 bits of data

6 Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error
[Figure: two adjacent bits flipped by one strike]
- Example solution
  - SECDED ECC (single error correction, double error detection): 8 bits of code per 64 bits of data
  - interleaving for the more general case

7 Interleaving
- Interleaving converts a spatial multi-bit error into multiple single-bit errors
[Figure: interleaved bits; X = bits covered by a single ECC code, + = bits covered by a different ECC code]
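The effect of interleaving can be illustrated with a small sketch. The slide does not say how physical bits are assigned to ECC codewords, so the modulo layout below (`codeword_of`) is an assumption made for illustration, and all function names are hypothetical.

```python
def codeword_of(bit_index: int, degree: int) -> int:
    """Assumed layout: adjacent physical bits rotate through `degree` codewords."""
    return bit_index % degree


def hits_per_codeword(first_bit: int, burst_len: int, degree: int) -> dict[int, int]:
    """Count how many flipped bits each codeword sees for one burst of adjacent upsets."""
    hits: dict[int, int] = {}
    for bit in range(first_bit, first_bit + burst_len):
        word = codeword_of(bit, degree)
        hits[word] = hits.get(word, 0) + 1
    return hits


# A 4-bit burst with 4-way interleaving: every codeword sees at most one flip,
# so single error correction can repair each of them.
interleaved = hits_per_codeword(first_bit=10, burst_len=4, degree=4)

# The same burst without interleaving (degree 1): all four flips land in one codeword.
flat = hits_per_codeword(first_bit=10, burst_len=4, degree=1)

print(interleaved, flat)
```

Under this assumed layout, a burst no longer than the interleaving degree never places two flips in the same codeword.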

8 Two Separate Strikes on Different Bits: Temporal Double Bit Errors
[Figure: two strikes, one at cycle 100 and one at cycle 1,000,000]
- SECDED ECC (single error correction, double error detection)
  - could detect the error, but cannot correct it
  - if errors accumulate
    - a single-bit correctable error becomes a double-bit detectable error

9 Solutions for Temporal Double Bit Errors
- Natural effects
  - whenever a processor reads a cache block, we can correct the single-bit error
  - check for errors when cache blocks are replaced from the cache
- More powerful ECC
  - SECDED ECC requires 8 bits per 64 bits
    - 7 bits for single-bit correction
    - 8th bit for double-bit detection
    - overhead = 13%
  - ECC with two-bit correction requires 12 bits per 64 bits
    - overhead = 19%
- Scrubbing
  - periodically read memory and correct all single-bit errors
  - prevents single-bit errors from accumulating into temporal double-bit errors
  - standard technique in main memories (DRAMs)
  - our calculations (later) will assume the worst case for soft errors
    - cache blocks don't get scrubbed naturally
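The SECDED check-bit counts quoted above follow from the standard Hamming bound. The sketch below is an illustration, not the authors' code, and its function names are hypothetical. It finds the smallest number of check bits r with 2^r >= data bits + r + 1, then adds one parity bit for double-error detection.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for single error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC check bits plus one overall parity bit for double error detection."""
    return sec_check_bits(data_bits) + 1


sec = sec_check_bits(64)        # check bits for single-bit correction
secded = secded_check_bits(64)  # check bits for SECDED
overhead = secded / 64          # storage overhead relative to the data bits

print(sec, secded, f"{overhead:.1%}")
```

For 64 data bits this gives 7 and 8 check bits, and an overhead of 12.5%, which the slide rounds to 13%.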

10 Memory Hierarchy of a Processor
[Figure: CPU, L1 cache (kilobytes), L2 cache (megabytes), main memory (gigabytes)]
- Do we need to scrub on-chip caches?
  - depends on the size of these caches

11 Detected Unrecoverable Error (DUE)
- Interval-based
  - MTTF = Mean Time to Failure
  - e.g., goal = 10 years MTTF for application crash (Bossen, IRPS 2002)
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 10-year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FIT
  - FIT rates of individual structures add up
  - hypothetical example: Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT = total of 220 FIT
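The MTTF-to-FIT conversion on this slide can be sketched as follows. The function names are hypothetical, and a year is taken as 24 * 365 hours, as on the slide.

```python
HOURS_PER_YEAR = 24 * 365  # same year length the slide uses


def mttf_years_to_fit(mttf_years: float) -> float:
    """FIT = failures per 10**9 hours, so FIT = 10**9 / MTTF in hours."""
    return 1e9 / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Inverse conversion: MTTF in years for a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)


due_budget_fit = mttf_years_to_fit(10)  # FIT budget for a 10-year MTTF goal
print(f"{due_budget_fit:.1f} FIT")
```

A 10-year MTTF corresponds to roughly 11,415 FIT, matching the slide's figure.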

12 MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[1] = 0$
- $P_d[2] = 1/Q$
  - first strike: probability = $Q/Q$
  - second strike: probability = $1/Q$
  - $P_d[2] = (Q/Q) \cdot (1/Q) = 1/Q$

13 MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[3] = [(Q-1)/Q] \cdot [2/Q]$
  - first strike: probability = $Q/Q$
  - second strike (hits a different quadword): probability = $(Q-1)/Q$
  - third strike (hits one of the two quadwords already struck): probability = $2/Q$
  - $P_d[3] = (Q/Q) \cdot [(Q-1)/Q] \cdot (2/Q)$

14 MTTF calculations: probabilities
- 1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[1] = 0$
- $P_d[2] = 1/Q$
- $P_d[3] = [(Q-1)/Q] \cdot [2/Q]$
- $P_d[4] = [(Q-1)/Q] \cdot [(Q-2)/Q] \cdot [3/Q]$
- …
- $P_d[n] = [(Q-1)/Q] \cdot [(Q-2)/Q] \cdot [(Q-3)/Q] \cdots [(Q-n+2)/Q] \cdot [(n-1)/Q]$
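The general term can be checked with a short sketch. This is an illustration with a hypothetical function name, not the authors' program. It evaluates $P_d[n]$ with exact fractions and confirms, for a small $Q$, that the values form a probability distribution.

```python
from fractions import Fraction


def p_double(n: int, q: int) -> Fraction:
    """P_d[n]: the first n-1 strikes hit distinct quadwords, the n-th hits one already struck."""
    if n < 2 or n > q + 1:
        return Fraction(0)  # one strike cannot make a double; q+1 strikes always repeat
    prob = Fraction(n - 1, q)          # n-th strike lands on one of the n-1 struck quadwords
    for k in range(1, n - 1):          # strikes 2 .. n-1 each land on a fresh quadword
        prob *= Fraction(q - k, q)
    return prob


q_small = 8  # tiny cache, chosen only so the sum is easy to inspect
distribution = [p_double(n, q_small) for n in range(1, q_small + 2)]
total = sum(distribution)
print(distribution[1], distribution[2], total)
```

With $Q = 8$ the terms for $n = 2$ and $n = 3$ are $1/8$ and $(7/8)(2/8)$, and all terms sum to exactly 1.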

15 MTTF calculations: Equation
- $M$ = mean number of single-bit errors needed to get a double-bit error
  = expected value of a random variable with $P_d[n]$ as its probability distribution function
- $M$ can be easily generated using a computer program
- MTTF (double-bit error) = $M \times$ MTTF (single-bit error)
- For a 32-megabyte cache ($Q = 2^{22}$ quadwords) and FIT/bit = 0.001 [Normand 1996, Tosaka 1996]:
  MTTF (double-bit error) = $M \times$ MTTF (single-bit error)
  = $2567 \times (1 / \text{Cache FIT})$
  = $2567 \times 10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)$
  = 970 years
- Saleh et al.'s 1990 closed-form equation
  - MTTF (double-bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$ = 970 years, where $f$ = FIT/bit
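A program of the kind the slide mentions might look like the sketch below. It is a reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the cache is assumed to hold $Q$ = bytes / 8 quadwords of 72 bits each, and the sum over $P_d[n]$ is truncated once the remaining probability mass is negligible.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mean_strikes_to_double(q: int, tol: float = 1e-15) -> float:
    """Expected n under P_d[n], accumulated with a running product."""
    distinct = 1.0  # probability that the first n-1 strikes hit distinct quadwords
    mean = 0.0
    n = 2
    while distinct > tol and n <= q + 1:
        mean += n * distinct * (n - 1) / q  # n * P_d[n]
        distinct *= (q - (n - 1)) / q       # strike n also lands on a fresh quadword
        n += 1
    return mean


def double_bit_mttf_years(cache_bytes: int, fit_per_bit: float) -> float:
    """M * MTTF(single-bit error), with 72 bits stored per 8-byte quadword."""
    q = cache_bytes // 8
    cache_fit = fit_per_bit * q * 72
    single_mttf_years = 1e9 / (cache_fit * HOURS_PER_YEAR)
    return mean_strikes_to_double(q) * single_mttf_years


def closed_form_mttf_years(cache_bytes: int, fit_per_bit: float) -> float:
    """Closed form from the slide: (1 / (72 f)) * sqrt(pi / (2 Q)), f converted to per-hour."""
    q = cache_bytes // 8
    f_per_hour = fit_per_bit * 1e-9
    return math.sqrt(math.pi / (2 * q)) / (72 * f_per_hour) / HOURS_PER_YEAR


cache_32mb = 32 * 2**20
m = mean_strikes_to_double(cache_32mb // 8)
numeric = double_bit_mttf_years(cache_32mb, 0.001)
closed = closed_form_mttf_years(cache_32mb, 0.001)
print(round(m), round(numeric), round(closed))
```

For the 32-megabyte example, both the numeric sum and the closed form land close to the slide's values of $M$ = 2567 and an MTTF of about 970 years.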

16 Temporal Double Bit MTTF variations with cache size
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- Temporal double-bit errors make a very small contribution to the DUE rate
  - compared to a goal of 10 years DUE MTTF

17 MTTF with Scrubbing
[Figure: timeline divided into consecutive scrubbing intervals of length I]
- $I$ = scrubbing interval; scrub at the end of each interval $I$
- $N$ = number of scrubbing intervals to reach MTTF
  = expected value of a random variable with probability distribution function $(1 - p_f)^N \cdot p_f$, where $p_f$ = probability of a temporal double-bit error at the end of an interval
- Assuming a 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), and scrubbing once a year ($I$ = 1 year):
  MTTF (double-bit error) = $N \times I$ = $2281 \times 1$ = 2281 years
- Saleh et al. 1990 closed-form equation
  - $\frac{2}{Q \cdot I \cdot (72 f)^2}$ = 2341 years, where $f$ = FIT/bit
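The closed-form value for the scrubbed case can be reproduced with a few lines. This sketch evaluates only the closed-form expression, not the authors' interval-based calculation that yields 2281 years. The function name is hypothetical, and $f$ is converted to failures per bit per year so that it shares a time unit with $I$.

```python
HOURS_PER_YEAR = 24 * 365


def scrubbed_mttf_years(cache_bytes: int, fit_per_bit: float, interval_years: float) -> float:
    """Closed form 2 / (Q * I * (72 f)^2), with f in failures per bit per year."""
    q = cache_bytes // 8  # quadwords, each stored as 72 bits
    word_rate_per_year = 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR
    return 2.0 / (q * interval_years * word_rate_per_year ** 2)


cache_16gb = 16 * 2**30
yearly = scrubbed_mttf_years(cache_16gb, 0.001, 1.0)
monthly = scrubbed_mttf_years(cache_16gb, 0.001, 1.0 / 12)
print(round(yearly), round(monthly))
```

With yearly scrubbing the expression evaluates to about 2341 years. Because the MTTF is inversely proportional to $I$, scrubbing twelve times as often multiplies it by twelve.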

18 Impact of Scrubbing on Temporal Double Bit MTTF
[Figure: 16 Gigabyte Cache]
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- For 16 gigabytes of cache, scrubbing can help
  - compared to a DUE MTTF goal of 10 years

19 Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors to multiple single-bit errors
- Scrubbing
  - periodically read cache blocks and correct all single-bit errors
  - this prevents single-bit errors from accumulating, thereby avoiding temporal double-bit errors
- Our conclusion, given a detected error target of 10-year MTTF
  - scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

20 BACKUPS

21 Raw soft error rate: 0.001 to 0.010 FIT/bit
- Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara, G. A. Woffinden, and S. A. Wender, "Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits," Symposium on VLSI Technology Digest of Technical Papers, 1996.
- E. Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.

