1
Applied Statistics for Intermediate Difficulty Methods
Thoratec Workshop in Applied Statistics for QA/QC, Mfg, and R+D Part 2 of 3: Intermediate Difficulty Methods Instructor : John Zorich Part 2 was designed for students who have taken Part 1 of these workshops, or who have used statistical methods on the job. © 2008 by JOHN ZORICH, 1 1 1
2
John Zorich's Qualifications:
20 years as a "regular" employee in the medical device industry (R&D, Mfg, Quality) ASQ Certified Quality Engineer (since 1996) Statistical consultant+instructor (since 1999) for many companies, including Siemens Medical, Boston Scientific, Stryker, and Novellus Instructor in applied statistics for Ohlone College (CA), Pacific Polytechnic Institute (CA), and KEMA/DEKRA Past instructor in applied statistics for UC Santa Cruz Extension, ASQ Silicon Valley Biomedical Group, & TUV . Publisher of 9 commercial, formally validated, statistical application Excel spreadsheets that have been purchased by over 80 companies, world wide. Applications include: Reliability, Normality Tests & Normality Transformations, Sampling Plans, SPC, Gage R&R, and Power. You’re invited to “connect” with me on LinkedIn. © 2008 by JOHN ZORICH, 2 2 2 2
3
Self-teaching & Reference Texts
RECOMMENDED by John Zorich:
Clements: Handbook of Statistical Methods in Mfg.
Cohen: Statistical Power Analysis
D'Agostino & Stevens: Goodness-of-Fit Techniques
Dovich: Quality Engineering Statistics
Dovich: Reliability Statistics
Gross: A Normal Distribution Course
Kaminsky et al.: Statistics & QC for the Workplace
Kraemer: How Many Subjects?
Mace: Sample-Size Determination
Motulsky: Intuitive Biostatistics
Murphy & Myors: Statistical Power Analysis
Natrella: Experimental Statistics (1st edition; recently re-published)
NIST Engineering Statistics Internet Handbook (found online; free)
Philips: How to Think about Statistics
Thode: Testing For Normality
4
Main Topics in Today's Workshop
Confidence Intervals
Significance Tests: Null Hypothesis; t-Tests, ANOVA, and P-values
Power calculations (e.g., for t-Tests)
Confidence & Reliability Calculations: Attribute ( pass / fail ) data; Variables (measurement) data; Normal vs. Non-Normal data; K-tables; MTTF and MTBF
Normality Tests
This is a lot to cover in 1 day, but your studying of the Student files, plus Instructor accessibility afterwards, completes the course.
5
Distribution of Sample Avgs. vs. Population
(as taught in Part 1 of this Workshop) Theoretical distribution of thousands of individual sample avgs taken from the population, shown alongside the distribution of the population itself. The width of the distribution of avgs is measured in Std Errors; the width of the population's distribution is measured in Std Deviations.
6
(as taught in Part 1 of this Workshop) Calculating a "standard error"
Multiple samples (with replacement) of the same size, from the same population, generated these Avgs: Avg#1, Avg#2, Avg#3, Avg#4, etc., Avg#N. Std Dev of those Avgs = Std Error of the Mean. This is a theoretically correct but impractical method for calculation.
7
Practical formula for "Std Error of Mean"
Standard Error of the (sample) Mean ( estimated from 1 sample ) = Sample Standard Deviation / sqrt( Sample Size ), as taught in Part 1 of this workshop. Practical application ( new topic for Part 2 ): a 95% confidence interval for the Population Mean can be estimated using this equation: Sample Average + / – " t " x (Std Error of Mean).
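For readers who prefer a script to a spreadsheet, here is a minimal sketch of the same calculation in Python (scipy is assumed to be available); the sample values are invented for illustration:

```python
# Minimal sketch: 95% confidence interval for the population mean, built from
# SmplAvg +/- t x (SmplStdDev / sqrt(n)).  The sample values are invented.
import math
import statistics
from scipy import stats

sample = [98.2, 101.5, 99.7, 102.3, 100.1, 97.9]      # hypothetical measurements
n = len(sample)
avg = statistics.mean(sample)
std_err = statistics.stdev(sample) / math.sqrt(n)     # (n-1) StdDev / sqrt(n)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)          # 2-sided 95% t-table value
print(avg - t_crit * std_err, avg + t_crit * std_err)
```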
8
Confidence intervals are serious business!
In 2009, the US FDA told a medical-device start-up company client of John Zorich's that, in regard to the company's planned clinical trials... "The equivalence of the device to the predicate can be demonstrated if the confidence interval for the difference in the mean values for the tested parameter excludes a difference larger than 20% from the predicate." What the FDA was saying is this: Enroll a large number of patients, or we won't approve your product for sale in the United States! Unfortunately, the number of patients required to meet the FDA's confidence-interval mandate was determined to be larger than the company had the time, money, or product to meet (the company had to close down and lay off everyone).
9
Confidence intervals are valuable!
Many statistical tests or methods can be thought of in terms of confidence intervals. For example: A 2-sided t-test is really an examination of whether or not the “Null Hypothesis” is inside or outside the sample’s 2-sided confidence interval. Reliability calculations are really lower 1-sided confidence limits on the observed % in-specification; and (as we’ll see in upcoming slides) because confidence limits are automatically adjusted based upon sample size, ANY sample size is valid. “SPC chart control-limit lines” are really upper and lower 2-sided confidence limits on the current process average.
10
95% confidence interval for Sample’s “Parameter Mean” is 91.6 to 108.4
These 2 curves are theoretical distributions of sample Avgs taken from Populations having identical Std Deviations but different Means ( 91.6 & 108.4, respectively ). White area = 95% of the area under both curves ( colored tails = 5% ). Let's look at this in more detail, using Excel. ( X-axis: Smpl Avg )
11
95% Confidence Interval ( & Limits)
A "95% confidence interval of the mean" can be thought of in at least 2 ways: an interval around the observed mean of a sample, in which interval you can expect (with 95% confidence) to find the true mean (= parameter) of the population from which the sample was taken (e.g., for cable diameters, sample avg = 0.020, 95% conf. interval = to 0.021). THIS IS THE CORRECT & MOST COMMONLY USED INTERPRETATION OF THE TERM. an interval around the true mean of a population, in which interval you can expect to find 95% of all possible random sample means of a given size sample taken from that population. THIS IS NOT A CORRECT USE OF THE TERM !! Some people use this term to refer to the range of 95% of the population data --- that is also incorrect.
12
95% Confidence Interval ( & Limits)
True Mean Sample Mean NOT the "correct" interpretation of what a "Confidence Interval" is. THIS is the "correct" interpretation of what a "Confidence Interval" is. The "Confidence Limits" are at the extreme left and right ends of this range.
13
Where is the true mean ( = the Parameter Mean)?
Answer: Somewhere in the Confidence Interval. How confident can we be of that answer? Answer: It depends on the size of the interval ( the figure shows nested 90%, 95%, and 99% confidence intervals around the Sample Mean ). What is the size of the interval in which we are sure ( = 100% confident) of finding the Parameter Mean? Answer: Minus infinity to plus infinity!
14
Where is the true mean ( = the Parameter Mean)?
Answer: Somewhere in Confidence Interval (at the chosen confidence level, typically 95%). Sample Mean Confidence interval based on a LARGE sample-size Confidence interval based on a MEDIUM sample-size Confidence interval based on a SMALL sample-size The choice of sample size is arbitrary, based upon how narrow you want the confidence interval ( i.e. how precisely you want to know the parameter). CI width is inversely proportional to square root of sample size: width = +/– t x SampleStdev / sqrt (SampleSize)
15
Example of misuse of Confidence Intervals:
In 2009, a billion-dollar manufacturing company submitted to a regulatory agency a report claiming that performance data between the stressed and unstressed new product were not significantly different, because the confidence intervals of the two populations overlapped. The agency officially requested a literature or text book reference that explained such a rationale. After a few rounds of correspondence and re-writings of the report (and still no literature reference), the company consulted a professional statistician, who used a different statistical method to prove equivalence. NOTE: As you can tell from the previous 2 slides, the confidence intervals of virtually "any" two samples can overlap, if you use a small-enough sample size and/or a large-enough confidence level.
16
This is a " t " Table (which is used to calculate Confidence Intervals)
This is a t-curve "A" = the sum of BOTH dark areas of the curve above, expressed as a decimal fraction of the whole area under the curve. " v " or "d.f." is always smaller than the sample size, (in most cases, it’s equal to “sample size – 1” ).
17
"one tailed" this would be 0.05 = 5%
For a sample size of 6 ( d.f. = 5 ), the 95% confidence interval is Avg +/– 2.571 x SEavg ≈ Avg +/– 2.6 x SEavg. The #s in this last line are useful values to memorize.
18
95% upper 1-tail confidence limit of the mean = 106.7
95% of area under curve The solid area (5% of the whole) represents the % of the area under the curve to the left of the sample avg. Smpl Avg
19
Let's look at this in more detail, using Excel
95% lower 1-tail confidence limit of the mean = 93.3 The solid area (5% of the whole) represents the % of the area under the curve to the right of the sample avg. 95% of area under curve Smpl Avg Let's look at this in more detail, using Excel
20
Use this " t " Table to do the exercises on next few slides.
≈ 2.0
21
Class exercise: Confidence Limits
As a group, let's calculate the 90% 2-sided confidence limits for the population mean, assuming the sample has... Avg = 100, Standard deviation = 10, Sample size = 9. Answer: the t-table value for A = 0.10 ( = 1.0 – 0.9 ), d.f. = 8 ( = 9 – 1 ) is 1.860, so the limits are 100 + / – ( 1.860 x 10 / sqrt( 9 ) ) = 100 + / – ( 18.6 / 3 ) = 100 + / – ( 6.2 ) = 93.8 and 106.2. We are 90% sure that the Parameter Avg is between these limits.
22
DIFFICULT exercise: Sample Size
What minimum sample size ( n ) is needed to know the true (Parameter) Average to within +/– 3, with a confidence of 98% ( = tails are 2% = 0.02 ), if we anticipate a sample Std Deviation = 2.83 ? Answer: Conf. Interval = Avg +/– t x SEavg. We need t x SEavg ≤ 3, using the t-table column with A = 0.02:
t x SEavg = 4.541 x 2.83 / Sqrt( 4 ) = 6.43
t x SEavg = 3.365 x 2.83 / Sqrt( 6 ) = 3.89
t x SEavg = 2.998 x 2.83 / Sqrt( 8 ) = 3.00
n = 8
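A small sketch of the same trial-and-error search in Python with scipy (an assumption, not part of the workshop materials); it simply increases n until t x SEavg no longer exceeds 3:

```python
# Sketch of the trial-and-error search: smallest n for which t x SEavg <= 3
# at 98% confidence, assuming the sample StdDev really is 2.83.
import math
from scipy import stats

s, half_width, alpha = 2.83, 3.0, 0.02
n = 2
while stats.t.ppf(1 - alpha / 2, df=n - 1) * s / math.sqrt(n) > half_width:
    n += 1
print(n)   # lands at n = 8, matching the slide's answer
```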
23
This is called a " Z " Table In a normal distribution, +/– Z std values from the Parameter Avg encompasses 2 x A of the population. +/– 2.00 standard values equals 2 x = 95.5% of the area under the normal curve +/– 3.00 standard values equals 2 x = 99.7% of the area under the normal curve © 2008 by JOHN ZORICH, 23 23 23
24
" Z " Table Some statistical tests use Z-tables instead of t-tables, for simplicity when sample sizes are large (sample size does not appear in a Z-table, as it does in a t-Table). However... a " t-Test " always provides a (slightly) more accurate answer than a " Z-test " when comparing most statistics (e.g., a comparison of averages), and so Z-tests will not be taught in this workshop. +/– 2.00 standard values equals 2 x = 95.5% of the area under the normal curve; but on a t-Table, +/– 2.00 std values = just 95.0% even if sample size is 60 !! Only if n = infinity are t-table values identical to Z-table values. © 2008 by JOHN ZORICH, 24 24 24
25
Microsoft Excel "Confidence" function
Microsoft Excel has 3 different "functions" that claim to calculate 1/2 of the confidence interval of the mean: Excel 2007 & earlier: CONFIDENCE ( 1 – confidence, StdDev, sample size ). Starting in Excel 2010: CONFIDENCE.NORM ( 1 – confidence, StdDev, sample size ) and CONFIDENCE.T ( 1 – confidence, StdDev, sample size ). Do not use CONFIDENCE nor CONFIDENCE.NORM, because they base their calculations on the Z-table rather than the t-table and so produce a confidence interval that is too small for a sample mean and a sample standard deviation. Use only the last one ( CONFIDENCE.T ), because it will produce the exact same results as a manual calculation using the formula: SmplAvg +/− t x StdErrorMean.
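A hedged illustration of why the t-based function is the right one, sketched in Python (scipy assumed) rather than Excel; the StdDev and sample size below are arbitrary:

```python
# Sketch: half-widths of a 95% confidence interval computed the CONFIDENCE.NORM
# way (Z-based) vs. the CONFIDENCE.T way (t-based); inputs are arbitrary.
import math
from scipy import stats

std_dev, n, alpha = 10.0, 9, 0.05
z_half = stats.norm.ppf(1 - alpha / 2) * std_dev / math.sqrt(n)         # ~CONFIDENCE.NORM
t_half = stats.t.ppf(1 - alpha / 2, df=n - 1) * std_dev / math.sqrt(n)  # ~CONFIDENCE.T
print(round(z_half, 2), round(t_half, 2))   # the Z-based interval is too narrow
```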
26
What about ATTRIBUTE Confidence Limits?
Over a dozen different methods exist for calculating binomial confidence limits for a sample % defective --- each of those methods gives a different length & different conf. limits !!! The classic method is called the “Exact” binomial --- it can be calculated via Excel's "Beta" function --- for example: UPPER 2-tailed Confidence Limit =betainv( 1 – (1 – C ) / 2 , k + 1 , N – k ); LOWER 2-tailed Confidence Limit =betainv( (1 – C ) / 2 , k , N – k + 1 ), where C = Confidence, N = Sample size (e.g., 100 ), k = observed number (of defects, heads, "yes" votes, etc.). 95% conf. limits for observed 10 defects in sample size = 100: betainv( 1 – (1 – 0.95) / 2 , 10 + 1 , 100 – 10 ) = 0.176 = 17.6%; betainv( (1 – 0.95) / 2 , 10 , 100 – 10 + 1 ) = 0.049 = 4.9%
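The same "Exact" limits can be sketched outside Excel; here is a minimal Python version (scipy assumed) that mirrors the two BETAINV formulas above:

```python
# Sketch of the "Exact" (Clopper-Pearson) 2-sided binomial limits via the Beta
# distribution, mirroring the two BETAINV formulas on the slide.
from scipy import stats

C, N, k = 0.95, 100, 10                                   # confidence, sample size, # observed
lower = stats.beta.ppf((1 - C) / 2, k, N - k + 1)         # LOWER 2-tailed limit
upper = stats.beta.ppf(1 - (1 - C) / 2, k + 1, N - k)     # UPPER 2-tailed limit
print(round(lower, 4), round(upper, 4))                   # ~0.049 and ~0.176 for 10 of 100
```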
27
Continued from previous slide...
If the Parameter % Defective is 17.6%, then we have a 2.5% chance of observing 10 or fewer defective parts in a sample of 100 parts. This is similar to what we saw with variables confidence limits, except here the 2 curves are not identically shaped. If the Parameter % Defective is 4.900%, then we have a 2.5% chance of observing 10 or more defective parts in a sample of 100 parts.
28
Attribute confidence limits, continued
One-sided confidence limits are calculated like so: UPPER 1-tailed Confidence Limit =betainv( C , k + 1 , N – k ); LOWER 1-tailed Confidence Limit =betainv( 1 – C , k , N – k + 1 ). 95% 1-sided limits for observed 10 defects when N = 100: upper = betainv( 0.95 , 10 + 1 , 100 – 10 ); lower = betainv( 1 – 0.95 , 10 , 100 – 10 + 1 ). Which one is useful when calculating the % in-spec for the population from which that sample was taken?
29
Binomial Confidence Limits (cont.)
VERY IMPORTANT NOTE: Do not use the "Z table" or "Poisson" formula / table given in some text books for calculation of binomial confidence limits. Those methods are pre-computer-era approximations for the "Exact" binomial results (binomial calculations are VERY difficult to do by hand, whereas Z and Poisson are easy). Using the 10% Defective example (see previous slides), the approximate confidence limits are... Z table: % and % Poisson: % and % Whereas, the "Exact" (correct !!!) confidence limits are... Binomial: % and % Beta approximation: % and %
30
TESTS OF STATISTICAL SIGNIFICANCE
Most statistical tests are tests of “statistical significance” (e.g., t-Test, Chi-Square, F-test, ANOVA). These next sections discuss this topic in general using the t-Test and ANOVA as examples. © 2008 by JOHN ZORICH, 30 30 30
31
What is Statistical Significance?
Significance (also referred to as either " alpha " or " α " ) is a number between 0 and 1 (or a % between 0% and 100%) that you choose, that... in your opinion, indicates how odd an event has to be... before you think it is fair to conclude that... "something's fishy" or "I'm being conned" or "the null hypothesis is false". For example: the flip of 1 coin by the seminar presenter, plus audience participation (all audience members are asked to now raise a hand high into the air). Significance is the probability of concluding that there is a difference, when in fact there is no difference. © 2008 by JOHN ZORICH
32
Statistical Significance
# of heads in a row (using 1 coin) vs. the approximate chance ( = probability) of such an occurrence (the null hypothesis is that the coin is honest):
1 = 50 %
2 = 25 %
3 = 12 %
4 = 6 % ( ≈ the now-popular "0.05" )
7 = 1 %
JZ used this to describe coin-toss demo results when training 100+ TUV auditors in 2002. We can demo significance with a pair of dice. © 2008 by JOHN ZORICH
33
Statistical Significance
34
t-Tests There are a huge number of different t-Tests --- for example:
To determine if cables from a new Supplier have a mean diameter that we suspect is significantly larger than the value we’ve been getting for months from the old supplier, we perform a one-tailed test of a mean vs. a historical average. To determine if cables from a new Supplier have a mean diameter that is significantly different than the value we’ve been getting for months from the old supplier, we perform a two-tailed test of a mean vs. a historical average. © 2008 by JOHN ZORICH, 34 34 34
35
t-Tests There are at least 3 (mathematically identical) ways to explain how a t-Test works; e.g.: Using confidence intervals; Using P-values; Using the “t-statistic”. The first and second methods can be explained graphically and therefore are easy to understand. The third method is basically one that can be applied without necessarily understanding it, and therefore it should be used with caution. © 2008 by JOHN ZORICH
36
Statistical Significance
When using the “confidence interval” method to explain the t-Test... Significance = 1 – Confidence Therefore, Significance can be any value between 0 and 1 ( = 0% and 100% ). However, because Confidence values are typically relatively large (90% or more), “Significance” values are typically relatively small (10% or less). Commonly used "significance" terms: "significant" = a significance of 0.05 or less "highly significant" = a significance of 0.01 or less © 2008 by JOHN ZORICH, 36 36 36
37
What is the "NULL HYPOTHESIS" ?
If you're testing whether dice are "loaded" (that is, dishonest), the null hypothesis is this: "the dice are honest" If you are testing whether Croatians are taller than Bosnians, the null hypothesis is this: "Croatians and Bosnians are, on average, the same height". If you are testing whether 2 products are significantly different, the "null hypothesis" is that there is no difference between the products. The NULL HYPOTHESIS VALUE is what your "test of significance" assumes is the Parameter, until you get a result that's so odd ( = has such a low probability of occurring, assuming the Null Hypothesis Value is the Parameter) that you decide to "reject the Null Hypothesis" rather than "accept" it. © 2008 by JOHN ZORICH, 37 37 37
38
Confidence Interval explanation of t-Tests
Null Hypothesis Value 95% conf. interval 95% conf. interval Sample Avg +/– t x SEavg (example of when Smpl Avg is smaller than Null Hypoth.) Sample Avg +/– t x SEavg (example of when Smpl Avg is larger than Null Hypoth.) If the value of the Null Hypothesis IS outside the 95% confidence interval, then the Sample Avg IS “statistically significantly different” than the Null Hypothesis Value (the case above does show “significance”)
39
Confidence Interval explanation of t-Tests
Null Hypothesis Value 99% conf. interval 99% conf. interval Sample Avg +/– t x SEavg (example of when Smpl Avg is smaller than Null Hypoth.) Sample Avg +/– t x SEavg (example of when Smpl Avg is larger than Null Hypoth.) If the Null Hypothesis is NOT outside the 99% confidence interval, then the Sample Avg IS NOT “statistically highly significantly different” than the Null Hypothesis Value (the case above does not show “high significance”)
40
Confidence Interval explanation of t-Tests
Null Hypothesis Value Sample Avg Upper 1-sided 95% Confidence Limit on the Sample Avg Sample Avg + t x SEavg (example of when Smpl Avg is smaller than Null Hypoth.) Only if the value of the Null Hypothesis IS larger than the 1-sided upper 95% confidence limit can we say that the Sample Avg IS “statistically significantly smaller” than the Null Hypothesis Value (the case shown above does show “significance”).
41
Confidence Interval explanation of t-Tests
Null Hypothesis Value Sample Avg Upper 1-sided 99% Confidence Limit on the Sample Avg Sample Avg + t x SEavg (example of when Smpl Avg is smaller than Null Hypoth.) Only if the value of the Null Hypothesis IS larger than the 1-sided upper 99% confidence limit can we say that the Sample Avg IS “statistically highly significantly smaller” than the Null Hypothesis Value (the case above does not show “high significance”).
42
Confidence Interval explanation of t-Tests
Null Hypothesis Value Lower 1-sided 95% Confidence Limit on the Sample Avg Sample Avg Sample Avg – t x SEavg (example of when Smpl Avg is larger than Null Hypoth.) Only if the value of the Null Hypothesis IS smaller than the 1-sided lower 95% confidence limit can we say that the Sample Avg IS “statistically significantly larger” than the Null Hypothesis Value (the case shown above does show “significance”).
43
Confidence Interval explanation of t-Tests
Null Hypothesis Value Lower 1-sided 99% Confidence Limit on the Sample Avg Sample Avg Sample Avg – t x SEavg (example of when Smpl Avg is larger than Null Hypoth.) Where would the Lower 99% confidence limit have to be, in order to conclude that the sample average is NOT “statistically highly significantly larger” than the Null Hypothesis Value ?
44
t-Table used in t-Tests is same one used for Confidence Intervals.
Use this in exercise on next slides...
45
Class exercise: 1-sample t-Test
Calculate whether this Sample Avg is "significantly" larger than Null Hypothesis: Null Hypothesis = 100 Sample Avg = 107 Sample Std Dev = 10 Sample size = 9 = 107 – ( x 10 / sqrt( 9 ) ) = 107 – ( / 3 ) = 107 – ( ) = 1.86 = t-table value for A = 0.10, d.f. = ( 9 – 1 ) on t-Table... A = tailed equals A = tailed Null Hypoth. Value is below the Lower 1-tailed Confidence Limit, & so Sample Avg IS statisticly significantly LARGER. Lower 1-tailed Conf. Limit
46
Class exercise: t-Test
Calculate whether this Sample Avg is "significantly" different than Null Hypothesis: Null Hypothesis = 100 Sample Avg = 107 Sample Std Dev = 10 Sample size = 9 = / – ( x 10 / sqrt( 9 ) ) = / – ( / 3 ) = / – ( ) = to 2.306 = t-table value for A = 0.05, d.f. = ( 9 – 1 ) Null Hypoth. Value is INside Conf. Interval, & so Sample Avg IS NOT statisticly significantly DIFFERENT. LowerConLimit UpperConLimit
47
1-tailed vs. 2-tailed Another argued-about point, in the use of tests of significance, is whether or not to use a 1-tailed or 2-tailed test. The most commonly accepted view is this: You MUST make your decision BEFORE you start your study (e.g., when you write your protocol). If you KNOW that the Sample Avg will be larger (or if you KNOW it will be smaller) than the Null Hypothesis Value, and BEFORE the study starts you can prove it (e.g., on the basis of multiple preliminary studies), then you MUST use a 1-tailed test. Otherwise, you MUST use a 2-tailed test. Doing it any other way tempts you to modify your results to create the conclusion you want. © 2008 by JOHN ZORICH
48
" P " vs. "Significance" Significance is the probability chosen by you before the experiment (or study) was conducted, with the understanding that if the "p" value of your experimental result is equal to or less than the significance value, you will reject the null hypothesis. "P" value represents the probability of the experimental result that you actually observed, assuming that the null-hypothesis is true. It is the probability of getting the observed result or a result that's even further out on the "tail" of the null-hypothesis distribution. If the "p" value is equal to or less than the chosen "significance" value, then you can say that the result is "significant" (or highly significant, if significance = 0.01 was chosen). © 2008 by JOHN ZORICH, 48 48 48
49
Do we need to examine this relationship more thoroughly, using Excel?
Each of the X-values is a possible result. Each is a "statistic". The statistics on the tails are less likely than the ones closer to the middle of the curve. The probability of getting 20 heads or more ( = a sample "average" of 20, or an even more extreme and unlikely result ) is 0.05 ( = 5% ).
50
When 30 coins are tossed, any number of heads is possible (from 0 to 30)
, but... ≈ 3% chance of getting 21 or more heads ≈ 28% chance of getting 17 or more heads Result here is NOT significant Result here IS significant Significance point 5% (= our chosen “Alpha”) chance of getting 20 or more heads
51
Distribution of 1000’s of Sample Avgs (all of one size) taken from a theoretical Null Hypothesis Normal population. Any of these are possible to be drawn by chance from the population. A t-Test assumes that the Null Hypothesis is true, and then checks whether or not the Sample Avg being evaluated is in the red tail(s) or not (see next slide) Distribution of the Raw Data in the theoretical Null Hypothesis population.
52
≈ 3% chance of getting this avg or larger.
Theoretical distribution of thousands of individual avgs taken from the population. ≈ 3% chance of getting this avg or larger. ≈ 25% chance of getting this avg or larger. Avg here is NOT significant Avg here IS significant Significance point 5% (= our chosen “Alpha”) chance of getting this Average or larger.
53
“P-value” explanation of t-Tests
"p" values are typically given as part of the output of a statistical test or statistical evaluation, e.g. Assuming the "null hypothesis" is true, a "p" value is the probability of occurrence of the observed result OR a result that's even more extreme. © 2008 by JOHN ZORICH, 53 53 53
54
“P-value” explanation of t-Tests
Are the means of the 2 samples that generated this output from Excel "significantly different " from each other ? No, because the 2-tailed P-value is not 0.05 ( = 5% ) or less. We used the 2-tail value, because the question had to do with being "different", not "larger" or "smaller". However, because the one-tailed p = 0.041, we could say that one mean is statistically significantly smaller than the other --- that seeming contradiction is why there are arguments !!! © 2008 by JOHN ZORICH
55
Instructions for a t-table version of a t-test:
Calculate the number of standard errors your observed mean is from the Null Hypothesis mean; this is your “observed t-statistic”. Compare that value to the appropriate value in a t-table. If your observed t-statistic ≥ the t-table value, you have a “significant” result; otherwise, your result is not significant. For example: Is sample avg significantly different from 100 ? Null Hypothesis = 100 Sample Avg = 107 Sample Std Dev = 10 Sample size = 9 Observed t-statistic = (107−100) / (10 / sqrt ( 9 ) ) = 2.100 2-sided t-table value (at alpha = 0.05, df = 8) = 2.306 2.100 < 2.306, thus 107 is not statistically different from 100. © 2008 by JOHN ZORICH, 55 55 55
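A minimal Python sketch of those instructions (scipy assumed), using the slide's summary statistics rather than raw data:

```python
# Sketch of the t-table version of the test, using the slide's summary numbers.
import math
from scipy import stats

null_mean, smpl_avg, smpl_sd, n = 100.0, 107.0, 10.0, 9
t_obs = (smpl_avg - null_mean) / (smpl_sd / math.sqrt(n))   # observed t-statistic
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)                # 2-sided table value, alpha = 0.05
p_two_sided = 2 * stats.t.sf(abs(t_obs), df=n - 1)
print(round(t_obs, 3), round(t_crit, 3), round(p_two_sided, 3))
# 2.100 < 2.306 (p ~ 0.069), so 107 is not statistically different from 100
```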
56
The Significance of Significance If it’s significant or not significant, who cares, so what ?
An almost-century-old controversy revolves around the role that tests of significance should play in evaluating the results of an experiment. On one side of the controversy are those who preach that a "statistically significant" result indicates that the experimental results are important, and a "non-significant" result indicates that the results are not important. On the other side are those who argue that the importance of a result must be decided upon by the researcher him/herself. If he/she decides "important", then the degree of statistical significance indicates how much confidence one should have in the result, especially in regards to whether or not one should perform a confirmational study. If he/she decides "not important", then the researcher considers the statistical significance to be irrelevant. © 2008 by JOHN ZORICH, 56 56 56
57
The Significance of Significance If it’s significant or not significant, who cares, so what ?
A big company and a small company each conduct clinical trials on their own unique, new, unapproved medical device, both of which are designed to extend the life of seriously ill patients. The results are…
Big company, large sample size vs. Small company, small sample size:
Avg. life extension: months vs. months
Statistically different from 0.00 months? Yes vs. No
FDA gives approval? Yes vs. No
Subsequent big-company advertisements say the product is “Clinically shown to significantly extend life of patients!” Most scientists make the mistake of thinking a “statistically significant” result means it is of practical importance. © 2008 by JOHN ZORICH
58
The Significance of Significance If it’s significant or not significant, who cares, so what ?
[per Murphy & Myors, Statistical Power Analysis] "With a sufficiently large N, virtually any [ test ] statistic will be significantly different from zero, and virtually any null hypothesis that is tested will be rejected." See STUDENT_SmplSize_vs_Significance.XLS for the following t-Test example: Null Hypothesis = , SmplAvg = 999.9, SmplStdDev = 10.0, 1-tailed alpha = 0.05, but if N = 30, not significant, if N = 30,000, yes significant. Read...The Cult of Statistical Significance by Ziliak & McCloskey, 2008, Univ. of Mich. Press, Ann Arbor John Zorich’s Solution (per Ziliak & McCloskey): if the observed difference is NOT a practical importance, then ignore a “significant” result; if the observed difference IS a practical importance, then ignore a “non significant” result (and then repeat the study with a larger sample size). © 2008 by JOHN ZORICH, 58 58 58
59
Understanding the Basis of Analysis of Variance ( ANOVA )
Copyright 2002, by John Zorich and Zorich Technical Consultants
60
Understanding the Basis of Analysis of Variance ( ANOVA )
Example of data to be analyzed in a “One-Factor ANOVA” test: *Effect of Different Conditioning Methods on the Breaking strength of cement briquettes (lbs/in2) Method#1 Method#2 Method#3 * data taken from Juran’s Quality Control Handbook, 4th edition, p
61
Understanding the Basis of Analysis of Variance ( ANOVA )
Example of data to be analyzed in a “Two-Factor ANOVA” test: columns = Part#1, Part#2, Part#3; rows = Operator#1, Operator#2, Operator#3. Data from a Gage R&R study. Values in inches.
62
Understanding the Basis of Analysis of Variance ( ANOVA )
Example of data to be analyzed in a “Three-Factor ANOVA” test: Seal-Strength vs. Sealer Dwelltime, Temperature, Pressure. Columns = Time#1, Time#2, Time#3 (each with sub-columns P1, P2, P3); rows = Temp#1, Temp#2, Temp#3. Data from validation of a pouch sealer. Values in lbs. Copyright 2002, by John Zorich and Zorich Technical Consultants
63
Understanding the Basis of Analysis of Variance ( ANOVA )
For the sake of simplicity, we’re going to discuss only One-Factor ANOVA. The principles & methods that we’ll see here apply to multi-factor ANOVA. We first need to discuss t-Tests, in order to understand some concepts needed for explaining ANOVA tests
64
Understanding the Basis of Analysis of Variance ( ANOVA )
t-Tests and ANOVA Analyses both evaluate Sample Avg’s, to see if they are significantly different from what would be expected if the Samples came from the same population. If the Samples come from the same population, then the differences between Sample Avg’s are due to random chance, rather than caused by their coming from different populations that have truly different Avg’s. t-Test (as used here) evaluates 2 sample averages. ANOVA test evaluates more than 2 sample averages. (FYI: With two samples, a 1-factor ANOVA gives the same p-value as a 2-tailed t-Test.) Copyright 2002, by John Zorich and Zorich Technical Consultants
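As a quick, hedged demonstration of that FYI point, the following Python sketch (scipy assumed; the two data sets are invented) runs both tests on the same two samples:

```python
# Sketch: with two samples, a 2-tailed pooled-variance t-test and a one-factor
# ANOVA give the same p-value (and F = t squared).  Data are invented.
from scipy import stats

a = [10.1, 9.8, 10.4, 10.0, 9.9]
b = [10.6, 10.3, 10.8, 10.5, 10.4]
t_stat, p_t = stats.ttest_ind(a, b, equal_var=True)   # 2-tailed t-test
f_stat, p_f = stats.f_oneway(a, b)                    # 1-factor ANOVA
print(round(p_t, 6), round(p_f, 6), round(t_stat**2, 3), round(f_stat, 3))
```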
65
Understanding the Basis of Analysis of Variance ( ANOVA )
A two-sample t-Test is performed by dividing the difference between Sample Avg’s, by a value calculated from two estimates of the... Population Std Error of the Mean ( = SEM ). If that ratio is too large, then we reject the idea that the Samples came from the same Population. Tables of “ t ” are used to decide if the ratio is “too large”.
66
Understanding the Basis of Analysis of Variance ( ANOVA )
Standard Error of the Mean (SEM): SEM = Std Dev of the Population of all possible Sample Avg’s (of a given Smpl Size, from a given Population). In practice, assuming you have at least 2 samples, SEM can be estimated one of 2 ways (as we saw during our discussion of “Standard Errors” earlier today): SEM (theoretical formula) = Std Dev (n-1) of the Sample Avg’s; SEM (practical formula) = (Pooled) Sample Std Dev / Sqrt(Smpl Size)
67
Understanding the Basis of Analysis of Variance ( ANOVA )
(Pooled) Sample Std Dev is derived from a relatively simple equation found in any introductory Statistics textbook. That equation combines all the individual Sample Std Dev’s into a single, better estimate of the Population Std Dev than is any single one of them. We don’t have time to examine the equation today, but it’s important to know that a Pooled Std Dev is NOT the Avg of the Sample Std Dev’s !!
68
Understanding the Basis of Analysis of Variance ( ANOVA )
EXAMPLE OF A ONE-FACTOR ANOVA ANALYSIS Effect of Different Conditioning Methods on the Breaking Strength of Cement Briquettes (lbs/in2) Method#1 Method#2 Method#3 = Average = Sample size = StdDev (n-1)
69
Understanding the Basis of Analysis of Variance ( ANOVA )
ANOVA table for the “Breaking strength” data ( SS / df / MS / F / P ):
Total: SS ≈ 10,050 , df = 14
Between: SS ≈ 3,510 , df = 2 , MS ≈ 1,755 , F = 3.22 , P ≈ 0.08
Within: SS ≈ 6,540 , df = 12 , MS ≈ 545
If the P value is equal to or less than your chosen “alpha” value, then there is a statistically significant difference between the means of the 3 samples.
70
Understanding the Basis of Analysis of Variance ( ANOVA )
Mean Square ( MS ): Each MS in an ANOVA table is really a variance (variance = square of the standard deviation). In our example: the “ F ” ratio = MS(between) / MS(within) = Variance(between) / Variance(within). If that ratio is too large, then we reject the idea that the Samples that generated the variances came from the same Population. F-tables are used to determine whether or not F-ratios are “too large”.
71
Understanding the Basis of Analysis of Variance ( ANOVA )
From earlier in this discussion, we have these equations: 1) (Pooled Smpl Std Dev) = Population Std Dev 2) SEM = (Std Dev (n-1) of Smpl Avg’s) 3) SEM = (Pooled Smpl Std Dev) / Sqrt(Smpl Size) Rearranging equation 3), we have... 4) Sqrt(Smpl Size) x (SEM) = (Pooled Smpl Std Dev) Substituting the definition of SEM from 2) into 4), we have... 5) Sqrt(Smpl Size) x (Std Dev (n-1) of Smpl Avg’s) = (Pooled Smpl Std Dev) Substituting the “Pooled” definition from 1) into 5), we have... 6) Sqrt(Smpl Size) x (Std Dev (n-1) of Smpl Avg’s) = Population Std Dev These are 2 different estimates of Population Std Dev
72
Understanding the Basis of Analysis of Variance ( ANOVA )
Applying those equations (on the previous slide) to the “Breaking Strength of Cement Briquettes” data, we have... ( table: Method#1, Method#2, Method#3, each with its sample Average, sample Size = 5, and StdDev (n-1) )
StdDev (n-1) of the 3 Sample Averages = 18.7
Estimate#1 of Population StdDev = sqrt( 5 ) x 18.7 = 41.89
Estimate#2 of Population StdDev = Pooled Sample Std Dev = 23.35
73
Understanding the Basis of Analysis of Variance ( ANOVA )
Estimate#1 of population StdDev = 41.89. Estimate#2 of population StdDev = 23.35. F = (Estimate#1)² / (Estimate#2)² = (41.89)² / (23.35)² = 3.22, which is the same value derived from the classic ANOVA calculations described earlier; and of course this works in an F-test because Variance = ( Std Dev )²
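A short Python sketch of this "two estimates" view (numpy assumed; the data below are invented, not the briquette numbers), showing that the squared ratio of the two estimates reproduces the classic F ratio for a balanced one-factor layout:

```python
# Sketch of the "two estimates of the population StdDev" view of one-factor
# ANOVA, for k groups of equal size n (the data below are invented).
import numpy as np

groups = np.array([[62.0, 60.0, 63.0, 59.0, 61.0],
                   [63.0, 67.0, 71.0, 64.0, 65.0],
                   [68.0, 66.0, 71.0, 67.0, 68.0]])
n = groups.shape[1]                                      # observations per group

est1 = np.sqrt(n) * np.std(groups.mean(axis=1), ddof=1)  # from StdDev of group means
est2 = np.sqrt(np.var(groups, axis=1, ddof=1).mean())    # pooled StdDev (equal n)
f_ratio = (est1 / est2) ** 2                             # = MS(between) / MS(within)
print(round(f_ratio, 3))
```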
74
Understanding the Basis of Analysis of Variance ( ANOVA )
Conclusion: Instead of viewing ANOVA in terms of unfamiliar terms (such as: sums of squares, degrees of freedom, and mean squares), view it as a comparison of two estimates of the Population Standard Deviation. One estimate is derived from the Standard Deviation of the Sample Averages, and the other is derived from the Pooled Standard Deviation of the Samples’ individual data points. If the ratio of the squares of the two estimates is too large (as determined by an F-test), then we conclude that not all the samples came from the same population.
75
Choosing Sample Size based upon POWER Calculations
(focus will be on t-Tests) © 2008 by JOHN ZORICH, 75 75 75
76
Statistical Power Power (which is also referred to as either " 1 – Beta" or "1 – ß") is a number between 0 and 1 ( = 0 to 100%) that you calculate (or choose) based upon... Your choice of Confidence ( = 1 – Significance); Your choice of Sample Size; Your choice of the difference (between the populations) that is important to detect; and the Std Deviation of the data (estimated or known) ...and that represents the probability of "detecting" an important difference when there really is one (where "detecting" means the test result is "significant"). Power is the probability of concluding that there is a difference, when in fact there is an important difference (compare to Significance). © 2008 by JOHN ZORICH
77
When is power important?
(as we shall see on the next several slides...) Calculation of "statistical power" is important... only if the actual or desired conclusion from your significance test is that the “Null Hypothesis could very possibly be true” and that the “Alternate Hypothesis is most likely not true”. Another way to say the same thing is... Calculation of “statistical power” is important only if the actual or desired conclusion is that there is "no statistically significant difference“ between the sample result and the Null Hypothesis value. © 2008 by JOHN ZORICH, 77 77 77
78
100 This is the distribution of Possible Sample Averages, assuming that the Parameter Mean = 100 SAMPLE AVG © 2008 by JOHN ZORICH, 78 78 78
79
100 SAMPLE AVG In this case, if we obtain a sample average of ≈ 107 or larger, we have a "significant" result. This represents the situation if we were to perform a 1-tailed t-test of whether or not the Parameter Mean = 100 © 2008 by JOHN ZORICH, 79 79 79
80
There is a Δ = 6 between these 2 populations.
100 106 There is a Δ = 6 between these 2 populations. SAMPLE AVG What if (unbeknownst to us) the Parameter Mean = 106 (that is, we assume incorrectly that the Parameter Mean = 100 ) © 2008 by JOHN ZORICH, 80 80 80
81
If true Mean = 106, a t-test that assumes Mean = 100
SAMPLE AVG If true Mean = 106, a t-test that assumes Mean = 100 won't have a good chance of being "significant" (see next slide). © 2008 by JOHN ZORICH, 81 81 81
82
If the true Mean is 106, a t-test that assumes Mean = 100
Power ≈ 45 % to detect Δ = + 6 100 106 [per Murphy & Myors] "The power of a statistical test is the proportion of the [ true ] distribution of test statistics... that is above the critical value used to establish statistical significance." SAMPLE AVG If the true Mean is 106, a t-test that assumes Mean = 100 will reject that false assumption ≈ 45 % of the time. © 2008 by JOHN ZORICH, 82 82 82
83
100 106 Power ≈ 80 % to detect Δ = + 6 100 106 SAMPLE AVG
"Power" increases as we increase the sample size from 2 (on previous slide) to 6 (on this slide). © 2008 by JOHN ZORICH, 83 83 83
84
100 106 Power ≈ 100 % to detect Δ = + 6 100 106 SAMPLE AVG
[per Murphy & Myors] "...the effects of sample size on statistical power are so profound that is tempting to conclude that a significance test is little more than a roundabout measure of how large the sample is. If the sample is sufficiently small, then [we] never reject the null hypothesis. If the sample is sufficiently large, [we] always reject the null hypothesis." 100 106 Power ≈ 100 % to detect Δ = + 6 SAMPLE AVG "Power" increases as sample size increases (or if the Std Error is reduced some other way). © 2008 by JOHN ZORICH, 84 84 84
85
Statistical Power : t-Tests in general
There are a huge number of different t-tests, each with a different way to calculate their own "standard error". To estimate "power" for any t-test, use the test's Std Error & Null Hypothesis to draw the Null Hypothesis curve, and then draw an identically shaped one in the location of the Alternate Hypothesis (all as we did on a previous slide). © 2008 by JOHN ZORICH, 85 85 85
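One way to sketch such a power calculation is with the noncentral t distribution. The Python below (scipy assumed) uses the slides' Null = 100 and true mean = 106, but the standard deviation and sample size are assumptions, so the printed power will not necessarily match the percentages shown on the earlier slides:

```python
# Sketch: power of a 1-tailed, 1-sample t-test via the noncentral t distribution.
# Null = 100 and true mean = 106 come from the slides; the StdDev and sample
# size below are assumptions.
import math
from scipy import stats

null_mean, true_mean, std_dev, n, alpha = 100.0, 106.0, 10.0, 6, 0.05
df = n - 1
t_crit = stats.t.ppf(1 - alpha, df)                        # 1-tailed critical value
ncp = (true_mean - null_mean) / (std_dev / math.sqrt(n))   # noncentrality parameter
power = stats.nct.sf(t_crit, df, ncp)                      # P(observed t > critical t)
print(round(power, 2))
```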
86
Statistical Power : General Comments:
(per Murphy & Myors) "...power of 0.80 or above is usually judged to be adequate. The 0.80 convention is arbitrary (in the same way that significance criteria of 0.05 or 0.01 are arbitrary), but it seems to be widely accepted." Power is most useful in planning, so that you don't spend time and money on a study only to find out it was doomed from the beginning. All major statistical software programs (such as StatGraphics) will calculate power for any test that the program can perform. Some textbooks explain the concept of Power and how to hand-calculate power values by using formulas and tables found in the books (e.g., Cohen, Kraemer, or Murphy). © 2008 by JOHN ZORICH, 86 86 86
87
For a specific test, computer programs typically output a "Power Curve" chart, with "power" on the Y-axis and "sample size" on the X-axis. © 2008 by JOHN ZORICH, 87 87 87
88
...and/or they output a "Power Curve" chart, with "power" on the Y-axis and some version of the alternate hypothesis on the X-axis (here, Δ is used on the X-axis). © 2008 by JOHN ZORICH, 88 88 88
89
" t-Tests" require "normal" data !!
Is “nearly” good enough? ( The text above is a scanned image from William Mendenhall, Intro. to Probability & Statistics, 5th ed., p. 281 ) © 2008 by JOHN ZORICH, 89 89 89
90
" t-Tests" require "normal" data !!
( in regards to the theorem that underlies the use of the t-Test ...) Don’t we want to be sure our inferences are “valid”? (Central Limit Theorem applies to Means, not to raw data !!) ( The text above is a scanned image from C. L. Chiang's Statistical Methods of Analysis, 2003 by World Press ) © 2008 by JOHN ZORICH, 90 90 90
91
The "original data" (below) is not Normally distributed (it has an “inverse normal” distribution, based upon an analysis not shown here). Therefore, a t-test on the original data gives the wrong answer for significance and power !! . . © 2008 by JOHN ZORICH, 91 91 91
92
Reliability Calculations
© 2008 by JOHN ZORICH, 92 92 92
93
Regulatory requirements:
Product safety regulations (e.g. MDD) require that "risks" be "acceptable" when weighed against "benefits”, which are described in "risk management" docs (e.g., an FMEA). ISO (one such regulation) requires that the “output of risk management” be used as “Design...Input”. For example, if your FMEA states that the risk of failure of a given component is acceptable only if mitigation (e.g., process improvement) reduces the frequency of failure to 0.1% at a confidence of 95%, then you must perform Verification studies on that component to prove it has 100% − 0.1% = 99.9% reliability at 95% confidence, or... You must update your FMEA with the reliability results you observe in your Verification study, and then decide if (at the new frequency-of-failure level) the risk is still "acceptable". © 2008 by JOHN ZORICH, 93 93 93 93
94
NOTE: The field of reliability statistics is vast. Today, we will not cover system reliability, although we will discuss "mean time between failures" and "mean time to failure" (MTBF and MTTF), which are concepts typically associated with electronic finished goods. Rather, we will primarily discuss component failures, and will focus on calculating the time or stress level at which the first failures occur in a population (e.g., the stress level at which the first patients are put at risk of injury or death by a failing medical device component, or the point at which the first space-shuttle tile falls off during re-entry to Earth's atmosphere).
95
Definitions of " Failure " and " Reliability "
In many of the slides in this section of the class, the words " Failure " and " Reliability " are used. By "Failure" is meant that an individual component or product has been put on-test or under inspection and has either not passed specification or has literally failed (e.g., broke, separated, or burst -- it may have passed spec but then been taken past spec, until it eventually failed) --- which meaning is intended is obvious (or should be !!) in each situation. "Failure Rate" refers to the % of a lot or sample that has failed in testing, so far (that is, up to a given stress level). By "Reliability" is meant the % of the lot that does not exhibit "failure" (Reliability = 100% minus the Failure Rate)… AT OR BELOW A SPECIFIC STRESS LEVEL (a level that is typically set equal to the “QC” specifications)
96
"Confidence" = 1 – Significance
Therefore, Confidence is a value between 0 and 1 ( = 0% and 100% ). Typical desired values are 95% or 99%. "Confidence" represents the probability that you are "right" when you make a statistical claim such as... "This product is 99.99% reliable". Reliability calculations are really lower 1-sided confidence limits on the observed % in-specification; and (as we saw previously) because confidence limits are automatically adjusted based upon sample size, ANY sample size is valid. © 2008 by JOHN ZORICH
97
ATTRIBUTE DATA: Pass/Fail testing, 0 or more Failures
Method: Beta Equation (see Krishnamoorthy, Handbook...p.38). Don't use Dovich's Reliability Statistics beta table (many errors!) =betainv ( 1 – C , N – F , F + 1 ), where... C = Confidence desired (expressed as a decimal fraction), N = sample size, F = # of failures seen in the sample. That formula outputs the lower 1-tailed "exact" binomial confidence limit on the success rate (see conf. limit discussion). If no failures in a sample of 299, then 95% confidence in... =betainv( 1 – 0.95 , 299 – 0 , 0 + 1 ) = 0.99 = 99% reliability. If 2 failures in a sample size of 30, then 95% confidence in... =betainv( 1 – 0.95 , 30 – 2 , 2 + 1 ) = 0.80 = 80% reliability
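A minimal Python equivalent of the BETAINV reliability formula (scipy assumed), reproducing the two examples above:

```python
# Sketch of the BETAINV reliability formula: lower 1-tailed exact binomial
# confidence limit on the fraction passing.
from scipy import stats

def reliability(confidence, n, failures):
    # Excel equivalent: BETAINV(1 - C, N - F, F + 1)
    return stats.beta.ppf(1 - confidence, n - failures, failures + 1)

print(round(reliability(0.95, 299, 0), 3))   # ~0.990 -> 99% reliable at 95% confidence
print(round(reliability(0.95, 30, 2), 3))    # ~0.80  -> 80% reliable at 95% confidence
```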
98
We are 95% sure that the Parameter is somewhere in this interval
Why does a sample of 299, with zero failures, equal 95% confidence of at least 99% reliability? A reliability calculation on a binomial proportion is, in effect, a lower 1-sided confidence limit on the observed proportion. It's the lower-most edge of the interval in which we predict we will find the true ("parameter") proportion. For reliability, we get to claim the worst value in that interval (in this case, 99%) We are 95% sure that the Parameter is somewhere in this interval 98% 99% 100% Sample Statistic Lower 1-tailed 95% Confidence Limit on Sample Statistic, when N = 299 and no failures are found in sample.
99
ATTRIBUTE DATA: Pass/Fail testing, 0 or more Failures
If the Sample Size is 100% of the lot, use the BetaInv formula; but if it is between 1% & 100% of the Lot Size, reliability is more accurately calculated using the Hypergeometric function (but it is more work!!): Confidence = 1 – SUM( hypgeomdist( F, N, D, P ) ), summed from F = 0 to F = # of failures seen in the Sample, where N = Sample size, D = P x ( 100% – %Reliability to be determined ), P = Population Size. Keep modifying “%Reliability” until Confidence = 95%. If F = 2, P = 300, N = 30, then %Reliability ≈ 80.9% (vs. 80.5% if using the binomial approximation method); that is... hypgeomdist ( 0 , 30 , 300 x ( 1 – 0.809 ) , 300 ) + hypgeomdist ( 1 , 30 , 300 x ( 1 – 0.809 ) , 300 ) + hypgeomdist ( 2 , 30 , 300 x ( 1 – 0.809 ) , 300 ): the Sum ≈ 0.05 (subtracted from 1.00 equals 95% Confidence). © 2007 by John Zorich -- JOHNZORICH.COM
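The same finite-lot idea can be sketched with scipy's hypergeometric distribution (an assumption; Excel's HYPGEOMDIST is what the slide actually uses). The crude search below steps the claimed reliability down one lot-unit at a time:

```python
# Sketch of the finite-lot (hypergeometric) method: step the claimed reliability
# down until the chance of seeing F or fewer sample failures drops to 1 - Confidence.
from scipy import stats

P, N, F, confidence = 300, 30, 2, 0.95         # lot size, sample size, failures, confidence

def chance_of_f_or_fewer(rel):
    D = round(P * (1 - rel))                   # defectives assumed present in the lot
    return stats.hypergeom.cdf(F, P, D, N)     # P(F or fewer failures in the sample)

rel = 1.0
while chance_of_f_or_fewer(rel) > 1 - confidence:
    rel -= 1.0 / P                             # one lot-unit of reliability at a time
print(round(rel, 3))                           # roughly 0.81, near the slide's ~80.9%
```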
100
Variables Data, Normally Distributed
Situation: Based on analysis of prior R&D work, failures are believed to be "Normally Distributed"; you likewise also have estimates of the Mean and StdDev (e.g., from R&D work). Method (e.g.): K-factor Table for "Normal" data (e.g., see Juran's Q-Handbook, Table V ) To use the table, calculate the “Observed K”. Then, compare the “Observed K” to the K in the "Normal" K-factor Table. “Observed K” = number of Std Deviations that the Process Mean is from nearest side of a 1 or 2-sided specification, i.e., |(SmplAvg – NearestSpecLimit)| ÷ SmplStdDev You can claim the confidence and reliability that is associated with a given normal k-table value, if your “observed k” is equal to or greater than the k-table value. See next slide >>
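A hedged sketch of the "observed K" arithmetic in Python; the sample statistics and spec limit below are hypothetical, and the k-table value is computed from the noncentral t distribution (a standard method for exact 1-sided normal tolerance factors, not a method described on the slide, but it reproduces published values such as Juran's 3.520 for n = 15):

```python
# Sketch: "observed K" from sample statistics, plus the 95%-confidence /
# 99%-reliability k-table value for n = 15 computed via the noncentral t.
import math
from scipy import stats

smpl_avg, smpl_sd, n = 1000.0, 10.0, 15       # hypothetical sample
nearest_spec = 960.0                          # hypothetical 1-sided lower spec limit
observed_k = abs(smpl_avg - nearest_spec) / smpl_sd

conf, reliability = 0.95, 0.99
ncp = stats.norm.ppf(reliability) * math.sqrt(n)            # noncentrality parameter
k_table = stats.nct.ppf(conf, n - 1, ncp) / math.sqrt(n)    # ~3.52 for n = 15
print(round(observed_k, 2), round(k_table, 3))
# observed_k (4.0) >= k_table (~3.52), so 99% reliability at 95% confidence can be claimed
```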
101
( Excerpt from a Normal K-factor table, Juran's QH: K values by confidence and reliability. )
If the "observed K" is at least 3.520, & the population is "Normally Distributed”, & the sample size is 15, then we are 95% confident that the Lot from which the sample came has at least 99% in-spec parts.
102
Why does a sample of 15, whose average is 3.52 std dev’s away from a 1-sided QC spec, equal 95% confidence of at least 99% reliability? A reliability calculation on a sample from a Normal population is, in effect, a lower 1-sided confidence limit on the % in-spec that would be calculated from the observed K. It's the lower-most edge of the 95% confidence interval of % in-spec. For reliability, we get to claim the worst value in that interval (in this case, 99%). We are 95% sure that the Parameter % in-spec is somewhere in this interval (whose lower edge is 99%). Observed Sample Statistic: =NORMSDIST( 3.52 ) = 99.98% in-spec, but no confidence can be claimed for this statement! When the sample Avg is 3.52 stdevs from the one-sided spec, we can claim 99% reliability at 95% confidence.
103
Use this table for the Class Exercise on the next slide.
( Excerpt from Juran's QH Normal K-factor table: k values by confidence. ) Use this table for the Class Exercise on the next slide.
104
Class exercise: Sample size = 20 Avg = 1000 StdDev = 10
Using K-factor Tables, determine the reliability, at 95% confidence, of the population from which this sample was taken (assume the population is "normal"): Sample size = 20, Avg = 1000, StdDev = 10, 2-Sided specification with upper limit = 1040. Answer: The sample mean is 4 StdDevs from the nearest specification limit ( 1040 – 1000 = 40, and 40 / 10 = 4 ). And, on the sample size = 20 line (on the K-table), under 95% confidence, K = 4 is midway between 99% and 99.9% reliability. Therefore, we are 95% confident that the product is more than 99% reliable.
105
K-factors give accurate reliability estimates only if the raw data has a " normal " distribution.
(The text below is a scanned image from Juran's Quality Control Handbook.)
106
Basing reliability on the correct distribution is critical because you're always concerned with the tail regions, & slight differences in the shape of a tail make huge differences in reliability estimates (99.1% vs. 99.9% may be the difference between deciding to launch a new product or not !!). This is the distribution curve for individual values in a hypothetical "normal" population, as estimated from the mean & std deviation of one sample. QC Spec is 1.00
107
"Cumulative Distribution"
Sometimes this is referred to as an " S " curve.
108
Normal Probability Plotting Paper
In the pre-computer days, you would use special graph paper, called Normal Probability Plotting (NPP) paper, to determine if data were "normal". "Normal data” plots as a straight line on NPP paper. What NPP paper does, in effect, is to straighten out the "S" shaped cumulative "probability plot" curve we saw on the previous slide. Different versions of NPP paper are shown on the next two slides, followed by a way to create such paper using MS Excel.
109
This is an S-curve and NPP plot that have been combined by using 2 different Y-axes. "S" curve of the Cumulative Normal Probability Distribution (Y-axis for this is on the left, with ticks from 0.0 to 1.0). Straight line of the Normal Distribution on “Normal Probability Plotting Paper” (Y-axis for this is on the right, with ticks from 0.01 to 0.99; this axis never gets to 1.000).
110
Cumulative % Y-axis never gets to 100% (this is NOT a “log” scale !!).
NORMAL PROBABILITY PLOTTING PAPER ( example assumes Avg = 100, StdDev = 10, & a Normal Distribution; Y-axis gridlines at Z = 2, 1, 0, –1, –2 ). These Z values are taken from a “Normal Distribution Z-table”; however, MS Excel can give them to us automatically (more about this, soon). Z values allow use of Linear Regression.
111
F = Median Rank = ( Rank – 0.3 ) / ( SampleSize + 0.4 )
Definition of " F " Regarding the use of NPP paper, textbooks provide various transformation of % Cumulative values; such transformations are called “plotting positions”; one purpose for them is to allow all data, even the "100%" point, to be plotted onto the Y-axis of NPP paper. In textbooks on Reliability Statistics, such a "plotting position" is given the symbol " F ". There are many different formulas for F, but a commonly used one is... F = Median Rank = ( Rank – 0.3 ) / ( SampleSize ) where “rank" of the lowest value in the data set = 1, next lowest value = 2, next lowest = 3, and so on. A “more accurate and theoretically justified” calculation (per one of the authors of Applied Reliability) is (using Excel)... F = BETAINV ( 0.5 , Rank , SampleSize – Rank + 1 )
112
Plot “F” on a Probability Plot
Rank 1 2 3 4 5 6 7 8 9 10
113
"Normal Probability Paper" using MS Excel
Create an X,Y chart with... X-axis = the observed measurement values, Y-axis = Z( F ). To calculate Z( F ), do the following: Arrange the measurements in order of magnitude. "Rank" of the lowest value = 1, next = 2, & so on. " F " = ( Rank – 0.3 ) / ( Sample Size + 0.4 ). " Z(F) " = Normsinv( F ), the MS Excel function. If Sample Size = 10, then " F " for the lowest rank = ( 1 – 0.3 ) / ( 10 + 0.4 ) = 0.7 / 10.4 = 0.067, and therefore Z(F) = Normsinv( F ) = – 1.50, whereas... " F " for the highest rank = ( 10 – 0.3 ) / ( 10 + 0.4 ) = 9.7 / 10.4 = 0.933, and therefore Z(F) = Normsinv( F ) = + 1.50
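A minimal Python version of that recipe (numpy/scipy assumed; the 10 data values are invented), including the correlation coefficient used later to judge straightness:

```python
# Sketch of the Excel recipe: median-rank plotting positions F and normal
# scores Z(F) for an invented 10-point data set.
import numpy as np
from scipy import stats

data = np.array([9.8, 10.1, 10.3, 10.4, 10.6, 10.7, 10.9, 11.2, 11.5, 12.1])
data = np.sort(data)                      # "arrange measurements in order of magnitude"
n = len(data)
rank = np.arange(1, n + 1)
F = (rank - 0.3) / (n + 0.4)              # median-rank plotting position
Z = stats.norm.ppf(F)                     # Z(F) = NORMSINV(F); -1.50 ... +1.50 for n = 10
r = np.corrcoef(data, Z)[0, 1]            # straightness of the normal probability plot
print(np.round(Z, 2), round(r, 4))
```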
114
Data from a few slides ago, using cumulative Z-table values of F ( that is " Z(F) " )
115
Y-axis = Z( F ) = Normsinv( F ) X-axis = Xi
This is actual data; if the data were "normally distributed", the plotted points would lie on a straight line (on this electronic version of NPP paper). How straight do data points have to be, before claiming that they fit a straight line? John Zorich's personal choice is that the Correlation Coefficient must be at least 0.975, but preferably 0.99 or larger! (Juran says to use your "judgment...the sample is never a perfect fit".)
116
An important point: Mathematical methods, including statistical ones, do not require "data"; they require "numbers". Said differently, the "data" input to some statistical methods does not have to exhibit normality, but the "numbers" do. That is, if the "data" is not normal, then transform it into "normal" numbers.
117
Normal Probability Plotting Transformations
In the pre-computer era of the 20th Century, if data was not straight on regular NPP paper, then transformed NPP paper would be used (sometimes provided at the back of textbooks). If data plots straight on NPP paper whose X-axis is Log( DATA ), then the data distribution is “Log-Normal”; if data plots straight on NPP paper whose X-axis is SquareRoot( DATA ), then the data distribution is "SquareRoot-Normal”. (Both versions plot Cumulative % on the Y-axis.)
118
The upper curve (X-axis scale 100 to 10,000) has been "transformed" into the lower curve (X-axis scale 2 to 4) by taking the Log of each data point.
119
The % of data that is outside the Specification Interval ( = 100 to 10,000 ) is the same, no matter how the data is "transformed" ( table columns: X, Log( X ), Sqrt( X ), 1 / X ). In this case, 2 / 3 of the data is out of spec, whether or not the data is transformed, and no matter how the data is transformed.
120
This is a scanned image from Juran's Q-Handbook ("Basic Statistical Methods" section), which says there that: "These convenient methods do require judgment (e.g., how 'straight' must the [normal probability plot] line be?) because the sample is never a perfect fit...." Data is "Lognormal" if it plots straight on Normal Probability Plotting Paper after the X-axis values are transformed by this formula.
121
Non-normal Data The file called "Student Normal Transformations" uses most of the formulas on the previous pages to create a series of charts that you can use to help determine if data can be transformed into "normality".
122
Actual data from presenter's client...
123
continued from previous slide...
In reliability statistics textbooks, a plot like this, or one that is not even as straight as this, is sometimes shown as an example of a “Normal" distribution; but... even though this data does “pass” the best “tests” for Normality (Anderson-Darling A2*, Cramer-von Mises W2*, and Shapiro-Francia W' ), with test p-values all > 0.425, ... and even though the correlation coefficient is very high... this plot is slightly curved; and therefore this data is not truly normal (it is almost Normal). Is "almost" good enough for critical products? This is the Excel equivalent of a Normal Probability Plot (data is “Normal” if it shows as a straight line on this plot).
124
continued from previous slide...
The "inverse" ( = 1 / X ) transformation gives a much straighter line on "Normal Probability Plotting" paper, and so the distribution is "Inverse Normal" rather than “Normal”. © 2008 by JOHN ZORICH, 124 124 124
125
continued from previous slide...
1 / X = "inverse" F F dfasdf observed ( at 95% confidence --- see next slide) ( at 95% confidence
126
If the input numbers are not "Normal", you get the wrong answer!
( Excerpt from Juran's QH K-factor table: k values by confidence. ) Untransformed, the average on the previous slide is less than 2.2 StdDevs from the Spec, and so is (at 95% confidence) less than 90% reliable when the population is incorrectly assumed to be "normal"... ...but it is almost 4.9 StdDevs from the Spec if inverse-transformed, and so is almost 99.9% reliable (when analyzed correctly).
127
MTTF and MTBF The discussion of MTTF and MTBF on this and the following slides applies to testing of electronic products that occurs AFTER burn-in has eliminated cases of “infant mortality”. That is, these calculations are valid only for the second part of a product’s lifetime, when failures are random and the failure rate is relatively constant, and when therefore those failures are accurately modeled by the “Exponential Distribution”. Typically, assessment of “Exponentiality” is done the same way as testing is done for “Normality”, i.e., with probability plots, as shown on the next slide.
128
Is data exponentially distributed?
30 devices put on-test; 9 failures occurred at these # of hours: 367, 422, 476, 508, 552, 589, 642, 683, 738 Create a plot with Exponential(F) vs. raw data; if the line appears straight, data can be considered exponential. F = as defined on previous slides (using N = 30, not N = 9 ) Exponential(F) = Ln ( 1 / ( 1 – F ) )
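A small Python sketch of that check, using the slide's failure times (numpy assumed):

```python
# Sketch of the exponential-plot check, using the slide's 9 failure times out
# of 30 devices on test (the 21 unfailed devices still count in N).
import numpy as np

times = np.array([367.0, 422, 476, 508, 552, 589, 642, 683, 738])
N = 30
rank = np.arange(1, len(times) + 1)
F = (rank - 0.3) / (N + 0.4)              # plotting position, as defined on earlier slides
y = np.log(1.0 / (1.0 - F))               # Exponential(F) = Ln( 1 / (1 - F) )
r = np.corrcoef(times, y)[0, 1]           # near 1.0 if the data look roughly exponential
print(np.round(y, 3), round(r, 4))
```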
129
MTTF and MTBF MTTF = Mean Time To Failure
This term applies to products that are not repairable; that is, once the product fails, it cannot be repaired. MTBF = Mean Time Between Failures This term applies to products that are repairable; that is, such a product may fail (and be repaired) multiple times during its “lifetime”. Possibly a better term for this might be “Mean time between repairs”.
130
MTTF = (400+100+500+200) / 3 failures = 400 hours
MTTF and MTBF MTTF = Mean Time To Failure is calculated by adding up all the time the on-test devices were functioning correctly during the study, and then dividing by the total number of failures observed during the study. (Diagram: 4 devices on test, failures indicated by “X”; hours in service were 400, 100, 500, and 200; after each failure, the device is taken out of service.) MTTF = ( 400 + 100 + 500 + 200 ) / 3 failures = 400 hours
131
MTBF = (500+500+500+500) / 5 failures = 400 hours
MTTF and MTBF MTBF = Mean Time Between Failures is calculated similarly to MTTF, but takes into consideration that repairs have occurred; all devices are typically in-service for the same length of time. (Diagram: 4 devices on test, each in service for 500 hours; failures indicated by “X”; 5 failures in total; after each failure, the device is quickly repaired & put back into service.) MTBF = ( 500 + 500 + 500 + 500 ) / 5 failures = 400 hours
132
class exercise (taken from a reliability textbook):
Calculate MTTF and MTBF MTTF: 30 devices put on-test; 9 failures occurred at these # of hours: 367, 422, 476, 508, 552, 589, 642, 683, 738 After failure, each failed device was NOT put back into service. After the 9th failure, the study was terminated (as planned). ANSWER: Sum of the 9 failure hours = 4,977 Sum of non-failure hours = (30 – 9) x 738 = 15,498 MTTF = ( 4,977 + 15,498 ) / 9 = 2,275 hours MTBF: Using same data as above, assume all failed devices were immediately repaired (or replaced with a good device, which is the same thing, for purposes of MTBF calcs) and put back into service; all 30 devices were then in-service for 999 hours each. ANSWER: MTBF = ( 30 x 999 ) / 9 = 3,330 hours
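A quick Python check of the arithmetic in this exercise:

```python
# Verifying the class-exercise MTTF and MTBF calculations.
failure_hours = [367, 422, 476, 508, 552, 589, 642, 683, 738]
n_devices, n_failures = 30, len(failure_hours)

# MTTF (study ended at the 9th failure): the 21 survivors each accumulate 738 hours.
total_hours_mttf = sum(failure_hours) + (n_devices - n_failures) * 738
print("MTTF =", total_hours_mttf / n_failures, "hours")    # 2275.0

# MTBF: all 30 devices in service 999 hours each (failed units repaired/replaced).
print("MTBF =", n_devices * 999 / n_failures, "hours")     # 3330.0
```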
133
Confidence Limits on MTTF & MTBF
There is no difference in the way Confidence Limits are calculated for either MTTF or MTBF; but consideration does need to be given to how the study was conducted... Virtually all studies are RIGHT CENSORED (that is, ended before all the on-test devices have failed). “Type I” censored studies are terminated after a pre-defined time (e.g., # of hours, or # of cycles) (as in the MTBF example on the previous slide). “Type II” censored studies are terminated after a pre-defined number of failures have occurred in all the on-test devices combined (as in the MTTF example on the previous slide).
134
Confidence Limits on MTTF & MTBF
To calculate MTTF or MTBF confidence limits, we use a formula from the Chi-squared distribution, because the quantity 2 x (total test time) / MTBF is known to follow a Chi-squared distribution. The shape of such a distribution changes drastically, depending upon the # of instrument failures in the sample. As the number of failures becomes large, the distribution of the chi-square statistic takes on a more “Normal” shape. But if you want your calculation to be as accurate as possible, you should resist the temptation to use the Normal approximation calculation instead of a Chi-squared one. (Chart: chi-square curves for few failures vs. many failures; probability vs. calculated chi-square value.)
135
Confidence Limits on MTTF & MTBF
We could calculate 2-sided intervals and limits, but typically those are not what are sought in a reliability study. Typically, we want to know how bad the product might be. That is, we want to know what is the MTTF or MTBF that we can claim, based on our data --- to do that, we calculate their... LOWER 1-SIDED CONFIDENCE LIMITS: Type I study: 2 x T / Chiinv( 1 – Conf, ( 2 x F ) + 2 ) Type II study: 2 x T / Chiinv( 1 – Conf, ( 2 x F ) ) Where T = Total in-service test time (all devices combined) Chiinv = the Excel function Conf = desired confidence (as a decimal fraction) F = Total number of failures (all devices combined)
136
class exercises: Calculate MTTF & MTBF conf. limits
Calculate lower 1-sided confidence limits for the MTTF & MTBF answers to the “class exercise”, a few slides back. Lower 1-sided 95% confidence limit on the MTTF = 2,275 is...? = 2 x ( 4,977 + 15,498 ) / Chiinv ( 1 – 0.95, 2 x 9 ) = 1,418 hours (using the formula for a Type II censored study) Lower 1-sided 95% confidence limit on the MTBF = 3,330 is...? = 2 x ( 30 x 999 ) / Chiinv ( 1 – 0.95, ( 2 x 9 ) + 2 ) = 1,908 hours (using the formula for a Type I censored study)
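A Python sketch that reproduces these two limits; scipy.stats.chi2.ppf(conf, df) is the equivalent of Excel's CHIINV(1 – conf, df).

```python
# Lower 1-sided confidence limits on MTTF/MTBF for Type I and Type II censored studies.
from scipy.stats import chi2

def lower_limit(total_time, failures, conf, censoring):
    """Lower 1-sided confidence limit on MTTF or MTBF."""
    df = 2 * failures + (2 if censoring == "type1" else 0)   # Type I adds 2 degrees of freedom
    return 2.0 * total_time / chi2.ppf(conf, df)

# Type II (failure-censored) MTTF example: T = 4,977 + 15,498 = 20,475 hours, 9 failures.
print(round(lower_limit(20_475, 9, 0.95, "type2")), "hours")   # ~1418 hours

# Type I (time-censored) MTBF example: T = 30 x 999 = 29,970 hours, 9 failures.
print(round(lower_limit(29_970, 9, 0.95, "type1")), "hours")   # ~1908 hours
```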
137
Reliability Calculations based upon MTTF or MTBF
The following formulas predict with confidence either... the % of the population that has not yet experienced failure after T hours of use (or after T number of cycles), or the probability of an individual device surviving to T without experiencing a failure. Lower 1-sided RELIABILITY CONFIDENCE LIMIT: = e ^ ( − T / MTTF(or MTBF)_Confidence_Limit ) Where T = time (or cycles) at which to calculate reliability e = the base of the natural logarithm = 2.71828... For example, if the lower 1-sided 95% confidence limit for MTTF is 1418 hours, then we can be 95% confident that... e ^ ( − 500 / 1418 ) = 0.7029 = 70.29% of the population will run without failure for 500 hours, or that a single device has a 70.29% chance of not failing in 500 hours.
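A one-line check of this example in Python:

```python
# Reliability at T = 500 hours, given the lower 95%-confidence limit MTTF = 1,418 hours.
import math

T, mttf_lcl = 500, 1418
reliability = math.exp(-T / mttf_lcl)
print(f"{reliability:.2%}")   # about 70.29%
```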
138
Tests for Normality -- primary references
A Normal Distribution Course by J. Gross, (Peter Lang, GmbH, 2004) Testing for Normality by H. C. Thode (Marcel Dekker Inc., 2002) Applied Reliability by Tobias & Trindade (Chapman & Hall, 2nd ed., 1995) How to Test Normality & Other Distributional Assumptions by S. S. Shapiro (ASQC Press, 2nd ed., 1990) Goodness-of-Fit Techniques by D'Agostino & Stephens (Marcel Dekker Inc., 1986) There are dozens of different tests for normality! Thode: "In our review of the literature, we found more tests than we ever imagined existed." Shapiro: "Unfortunately, there is no one overall 'best' test ".
139
Tests for Normality All commercial statistical packages provide Normality Tests. For example, StatGraphics Centurion XV provides: Kolmogorov-Smirnov D, Kuiper’s V, Cramer-Von Mises W2, Watson U2, and Anderson-Darling A2. Such tests typically involve simple algebraic calculations and subsequent comparisons of the results to values in tables. How to perform the tests is detailed in the reference texts mentioned at the end of this webinar. See also the explanations in the “Normality Tests…” demo-spreadsheet found on the statistics page at
140
Tests for Normality
141
142
WARNING about Tests for Normality
Data that is obviously non-normal might not fail a “test for normality”. Your best bet is to rely upon the shape of the Normal Probability Plot to help you decide if data is non-normal. For example, as we saw above in the example involving the use of K-tables on 12 “actual data” points, that data showed an obviously curved line on NPP paper. However, none of the available tests reject normality for that data, as we see on the next slide...
143
Normality tests on the non-normal "actual data" that we saw above:
(Table of normality-test results for that data: none of the tests reject Normality.)
144
Recommendations.... The recommendations by Gross, Thode, Tobias, Shapiro, D'Agostino and John Zorich can be summarized as follows: Plot the data on Normal Probability Plotting paper. If that plot looks curved (even slightly!), the data set is definitely not Normally distributed. If non-Normal, try Normal Probability Plots of transformations of the data, looking for a straight line. Choose straightest plot. Only then, perform one or more of the most highly recommended “Tests for Normality” on the (transformed?) data. If the test(s) pass, you can assume Normality in subsequent statistical analyses of the (transformed?) data (e.g. Normal K-tables, ANOVA, t-Tests, & Cp or Cpk --- all of which have Normality as a requirement for using the method). If no plot looks straight, even after transformation, then use “Reliability Plotting” or “Non Parametric” methods.
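A minimal Python sketch of this workflow using hypothetical data: try several candidate transformations, keep the one whose Normal Probability Plot is straightest (highest probability-plot correlation), and only then run a normality test on the chosen (transformed) data.

```python
# Sketch: choose the transformation with the straightest normal probability plot,
# then test that transformed data for normality. Data are hypothetical.
import numpy as np
from scipy import stats

data = np.array([1.2, 1.5, 1.7, 1.9, 2.2, 2.6, 3.1, 3.9, 5.2, 7.8])   # hypothetical

transforms = {
    "raw":     lambda x: x,
    "log":     np.log,
    "sqrt":    np.sqrt,
    "inverse": lambda x: 1.0 / x,
}

results = {}
for name, f in transforms.items():
    y = f(data)
    _, (slope, intercept, r) = stats.probplot(y, dist="norm")   # r measures plot straightness
    results[name] = (r, y)
    print(f"{name:8s} probability-plot r = {r:.4f}")

best = max(results, key=lambda k: results[k][0])
p = stats.shapiro(results[best][1]).pvalue
print(f"straightest plot: {best}; Shapiro-Wilk p on that data = {p:.3f}")
```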
145
(Normality) In conclusion...
Just because raw data does not fail a “test for normality” does not mean it really is “normal”. Incorrectly concluding normality often leads to incorrectly rejecting product (in the experience of the presenter). It’s simple to use Normality Tests, and it’s simple to choose a Normality Transformation, with commercial software or spreadsheets you create yourself, although you must be willing to make decisions regarding the straightness of NPP plots. In the presenter’s experience in the past few years... the FDA accepts “transformation to normality” (even in PMA supplements) when justified solely on the basis of a curved NPP plot of the raw data (that is, the FDA accepts the “non-normal” conclusion, even if the raw data “passes” a “normality test”); the FDA accepts the use of transformed raw data, based only on (1) the relatively straighter NPP plot of the transformed raw data, and (2) the transformed raw data “passing” a “normality test”.
146
Reliability TEST DATA FROM ZTC CLIENTS
RELIABILITY USING K-FACTORS ** vs. RELIABILITY USING K-FACTORS AFTER DATA TRANSFORMATION:
94.2% (however, data not "normal") vs. 99.7% (data actually had a Log-Normal distribution) --- Crimp-Joint Bond Strength
92.1% (however, data not "normal") vs. 99.999% (data actually had an Inverse-Normal distribution) --- Burst Pressure
99.3% (however, data not "normal") vs. 89.4% (data had a CubeRoot-Normal distribution) --- Lubricity
** Assuming (incorrectly) that data is "normal" and NOT applying any "transformation".
147
How to implement what you learned today?
Read your company's SOP (or ??) on statistical techniques. Ask to read some of the validation protocols and validation reports that relate to your work, and study their "statistics" section (or it might be called the "data analysis" section). Ask your boss to explain statistical statements made in meetings, reports, or SOPs. Ask to be part of the planning team for verification, validation, or new-product "transfers", especially in regards to choosing what sample sizes to use for product evaluations.