1
Probability distribution functions
Normal distribution
Lognormal distribution
Mean, median and mode
Tails
Extreme value distributions
2
Normal (Gaussian) distribution
Probability density function (PDF). What does the figure tell us about the cumulative distribution function (CDF)? The most commonly used distribution is the normal distribution. This may reflect the fact that, if the response of an engineering system is random due to a large number of random properties, none of them dominating, the distribution is likely to be close to normal. The slide shows the equation for the probability density function (PDF), known as the bell-shaped distribution, with a figure taken from Wikipedia. In the figure the mean is denoted by x, while we use the notation μ; the standard deviation is denoted by σ in both the equation and the figure. The area under the PDF in a given region is equal to the probability of the random variable being in that region, so the total area is 1. The figure shows that X has about a 68% chance of being within one standard deviation of the mean, which means that the area of the central region under the curve is about 0.68. The cumulative distribution function is the probability of X being smaller than a given value x, so it is the integral of the PDF from minus infinity to x. Noting that the distribution is symmetric, the CDF at the mean should be 0.5, and the CDF at the mean plus one standard deviation should be about 0.5 + 0.68/2 = 0.84. A more accurate answer from Matlab is normcdf(1) = 0.8413. The well-known six-sigma standard in industry is actually (for historical reasons) a 4.5-sigma standard: 1 - normcdf(4.5) = 3.4e-6, or 3.4 defects per million. The general form of normcdf is normcdf(X, mu, sigma), where the three arguments can be vectors or matrices of the same size.
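For reference, the PDF referred to above is f(x) = (1/(σ√(2π))) exp[−(x − μ)²/(2σ²)]. A minimal check of the quoted probabilities, assuming the Statistics Toolbox function normcdf is available:

normcdf(1)          % 0.8413, probability of being below the mean plus one standard deviation
1 - normcdf(4.5)    % 3.4e-6, the "six-sigma" (really 4.5-sigma) defect probability
normcdf(1, 0, 1)    % same value as the first line, using the general form normcdf(X, mu, sigma)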
3
More on the normal distribution
The normal distribution is denoted N(μ, σ²), with the square giving the variance. If X is normal, Y = aX + b is also normal. What would be the mean and standard deviation of Y? Similarly, if X and Y are normal variables, any linear combination aX + bY is also normal. We can often approximate a function of a normal random variable as normal by using a linear Taylor expansion. Example: X = N(10, 0.5²) and Y = X². Then X² ≈ 100 + 20(X − 10), so Y ≈ N(100, 10²). The notation for a normal distribution is N(μ, σ²), with the square of the standard deviation being the variance. The fact that we specify the variance rather than the standard deviation may be related to the fact that it is easier to estimate the former than the latter, as we will see on the next slide. One of the attractions of the normal distribution is that a linear function of a normal variable is normal, as can be checked from the PDF. For any random variable, if we add a constant we change the mean without changing the standard deviation, so if X has mean μ, X + b will have mean μ + b. Similarly, if we multiply a random variable by a constant a, both the mean and the standard deviation are multiplied by that constant. Just as useful is that any linear combination of normal variables is a normal variable. This extends to any function of normal variables if the randomness induced in the function is small enough that a linear Taylor series of the function is a good approximation. For example, if X = N(10, 0.5²) and Y = X², then we can use the Taylor series expansion X² ≈ 100 + 20(X − 10) to approximate X² as N(100, 10²). In fact, X² has a mean of about 100.25 and a standard deviation of about 10.01, so the approximation is quite good.
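As a minimal Monte Carlo sketch of this claim (base Matlab; the sample size of 10^6 is just illustrative), we can check that Y = X² is indeed close to N(100, 10²):

x = 10 + 0.5*randn(1e6, 1);   % large sample of X = N(10, 0.5^2)
y = x.^2;
mean(y)    % about 100.25
std(y)     % about 10.0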
4
Estimating mean and standard deviation
Given a sample from a normally distributed variable, the sample mean is the best linear unbiased estimator (BLUE) of the true mean. For the variance the equation below gives the best unbiased estimator, but its square root is not an unbiased estimate of the standard deviation. For example, for a sample of 5 from a standard normal distribution, the standard deviation will be estimated on average as 0.94 (with a standard deviation of 0.34). Given a sample from a random variable, the mean of the sample is the best linear unbiased estimator (BLUE) of the true mean. For a normal variable the standard estimate

\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

is also BLUE for the variance. Note that the variance of the sample itself has n instead of n − 1 in the denominator; the n − 1 in the estimator is due to the fact that we estimate the mean from the sample. If the mean is known (e.g., if we estimate from a sample the standard deviation of x⁹ + sin x, where x is N(0,1), whose mean is zero by symmetry), we do use n rather than n − 1. However, taking the square root does not provide an unbiased estimate of the standard deviation. The following Matlab sequence (here with 100,000 repetitions of the sample of 5; the number of repetitions is illustrative)

x = randn(5, 100000); s = std(x); s2 = s.^2;
mean(s), mean(s2), std(s)

shows that for a sample of 5 numbers from the standard normal distribution, the estimate of the standard deviation averages only about 0.94, with a substantial standard deviation of about 0.34, while the variance estimate s2 averages close to 1 as expected. Look in Wikipedia under "unbiased estimation of standard deviation" for more accurate formulas.
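For reference, that Wikipedia article gives the exact correction factor c4(n), with E[s] = c4·σ for a normal sample of size n; a minimal sketch (base Matlab):

n = 5;
c4 = sqrt(2/(n - 1))*gamma(n/2)/gamma((n - 1)/2)
% c4 is about 0.9400 for n = 5, matching the simulated average above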
5
Lognormal distribution
If ln(X) has a normal distribution, X has a lognormal distribution. That is, if X is normally distributed, exp(X) is lognormally distributed. Notation: lnN(μ, σ²). The PDF, mean and variance are given below. The normal distribution is not appropriate for variables that have to be positive, like density or length. The lognormal distribution is one of the popular distributions for such random variables. It is defined such that ln(X) is normally distributed and is therefore often denoted lnN(μ, σ²). The probability density function is

f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left[-\frac{(\ln x-\mu)^2}{2\sigma^2}\right], \quad x>0,

and the mean and variance are then given as

\mu_X = \exp\left(\mu+\frac{\sigma^2}{2}\right), \qquad \sigma_X^2 = \mathrm{Var}(X) = \left(e^{\sigma^2}-1\right)e^{2\mu+\sigma^2}.

The figure in the slide is taken from the Matlab documentation: suppose the income of a family of four in the United States follows a lognormal distribution with μ = log(20,000) and σ² = 1.0 (so μ_X = 32,974 and σ_X = 43,224). The figure is then produced with the following sequence:

x = (10:1000:125010)';
y = lognpdf(x, log(20000), 1.0);
plot(x, y)
set(gca, 'xtick', [0 30000 60000 90000 120000])
set(gca, 'xticklabel', {'0','$30,000','$60,000','$90,000','$120,000'})
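The quoted mean and standard deviation of the income example can also be reproduced with the Statistics Toolbox function lognstat, which takes the mu and sigma of the underlying normal:

[m, v] = lognstat(log(20000), 1.0);
m          % about 32974
sqrt(v)    % about 43224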
6
Question: Suppose the income of a family of four in the United States follows a lognormal distribution with μ = log(20,000) and σ² = 1.0 (μ_X = 32,974, σ_X = 43,224). Looking at the figure, what is your estimate of the mode (that is, the most common income)? Of the median?
7
Mean, mode and median. Mode (highest point) = exp(μ − σ²).
Median (50% of samples) = e^μ. The figure is for μ = 0. The lognormal distribution also allows us to introduce the concepts of mode and median. The mode is the point with the highest PDF, and for the lognormal distribution it is at exp(μ − σ²). The median is the point where 50 percent of the samples will be below and 50% above; that is, the area of the PDF on each side is 0.5, or the value of the CDF there is 0.5. For the lognormal distribution the median is at e^μ. For the income distribution shown on the previous slide, the equations indicate that the mode is $7,357; that is, if we sample many families, the largest concentration (highest point on a histogram) would be near $7,357. The median is $20,000, that is, half of the families would have income below $20,000 and half above. Finally, the mean was $32,974. The figure on this slide shows the lognormal distribution for μ = 0 and two values of σ. For the lower value of σ, the distribution is not strongly skewed, so the mode, median and mean are close. For σ = 1, on the other hand, the distribution is highly skewed, and the three measures are very different, as they are for the income figure on the previous slide.
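A minimal sketch computing the three measures for the income example directly from the formulas above (base Matlab):

mu = log(20000); sigma = 1;
mode_x = exp(mu - sigma^2)        % about 7357
median_x = exp(mu)                % 20000
mean_x = exp(mu + sigma^2/2)      % about 32974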
8
Light and heavy tails. The normal distribution has a light tail; 4.5 sigma is equivalent to a 3.4e-6 failure or defect probability. The lognormal can have a heavy tail (μ = 0, σ = 0.25: 7.5e-4; μ = 0, σ = 1: 0.0075). For many safety problems the probability of failure must be very low, which means that we are interested not in the center of the distribution but in its tails. The normal distribution is light-tailed: being more than 3-4 standard deviations from the mean is very unlikely. For example, on Slide 2 (see notes) we saw that the so-called six-sigma standard, which corresponds to 4.5 standard deviations from the mean, reflects a probability of failure of 3.4 per million. This applies to any normal distribution regardless of the mean and standard deviation. Many distributions, such as income or strength, are heavier tailed, and the lognormal distribution may fit them better. For example, the almost symmetric case in the figure with μ = 0, σ = 0.25 has a probability of 7.5e-4 of being more than 4.5 standard deviations above the mean, and the case with μ = 0, σ = 1 has a probability of 0.0075. This latter case was calculated with the following Matlab sequence:

m = exp(0.5); v = exp(1)*(exp(1)-1); sig = sqrt(v);
sig6 = m + 4.5*sig        % about 11.37
logncdf(sig6, 0, 1)       % 0.9925, so the tail probability is 1 - 0.9925 = 0.0075
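The same calculation for the nearly symmetric case, μ = 0 and σ = 0.25, reproduces the 7.5e-4 quoted above (Statistics Toolbox logncdf assumed):

m = exp(0.25^2/2);                     % mean of the lognormal
v = (exp(0.25^2) - 1)*exp(0.25^2);     % variance, from the formula on Slide 5
sig = sqrt(v);
1 - logncdf(m + 4.5*sig, 0, 0.25)      % about 7.5e-4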
9
Fitting distribution to data
Usually we fit the CDF so as to minimize the maximum distance from the experimental CDF (the Kolmogorov-Smirnov test). Here 20 points were generated from N(3, 1²); the normal fit is N(3.48, 0.93²) and the lognormal fit is lnN(1.24, 0.26²), which have almost the same mean and standard deviation. Given sampling data, we fit a distribution by finding a CDF that is close to the experimental CDF. Usually we use the Kolmogorov-Smirnov (K-S) criterion, which is the maximum difference between the two CDFs. Here this is illustrated by first generating a sample of twenty points from N(3, 1²). The figure shows in blue the experimental CDF with its lower and upper 90% confidence bounds, the normal fit in red, and the lognormal fit in green. The normal fit, N(3.48, 0.93²), indicates a 16% error in the mean and 7% in the standard deviation compared to the distribution used to generate the data. The lognormal fit has almost the same mean and standard deviation, but it is substantially different in the tail. Surprisingly, it is a better fit to the data by the K-S measure than the normal; however, in view of the large uncertainty bounds on the experimental CDF, this is believable. The Matlab sequence used to generate the fit and plot is

x = randn(20,1) + 3;
[ecd, xe, elo, eup] = ecdf(x);
pd = fitdist(x, 'normal')       % returned mu = 3.4819, sigma of about 0.93
pd = fitdist(x, 'lognormal')    % returned mu = 1.2147, sigma of about 0.26
xd = linspace(1, 8, 1000);
cdfnorm = normcdf(xd, 3.4819, 0.93);    % sigma rounded from the fit
cdflogn = logncdf(xd, 1.2147, 0.26);    % sigma rounded from the fit
plot(xe, ecd, 'LineWidth', 2); hold on;
plot(xd, cdflogn, 'g', 'LineWidth', 2)
plot(xd, cdfnorm, 'r', 'LineWidth', 2);
xlabel('x'); ylabel('CDF')
legend('experimental', 'lognormal', 'normal', 'Location', 'SouthEast')
plot(xe, elo, 'LineWidth', 1); plot(xe, eup, 'LineWidth', 1)
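As a minimal sketch of the K-S distance itself (variable names follow the sequence above; the 0.93 is the rounded sigma from the fit), the maximum CDF difference can be approximated at the experimental CDF points:

Fn = normcdf(xe, 3.4819, 0.93);   % fitted normal CDF at the experimental CDF points
ks = max(abs(ecd - Fn))           % approximate K-S distance (the exact statistic also
                                  % checks the value just below each step of the ecdf)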
10
Extreme value distributions
No matter what distribution you sample from, the mean of the sample tends to be normally distributed as the sample size increases (with what mean and standard deviation? see the quick sketch after this list). Similarly, the distributions of the minimum (or maximum) of samples converge to a small family of limiting distributions. Even though there is an infinite number of distributions, there are only three extreme value distributions:
Type I (Gumbel), derived, for example, from the normal.
Type II (Frechet), e.g., maximum daily rainfall.
Type III (Weibull), e.g., weakest-link failure.
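A quick illustration of the first statement (base Matlab; the uniform distribution and sample size are just illustrative): the means of samples of size n have the same mean as the parent distribution and a standard deviation reduced by a factor of sqrt(n).

n = 25;
x = rand(n, 100000);     % uniform samples: mean 0.5, standard deviation 1/sqrt(12)
xbar = mean(x);          % 100,000 sample means
mean(xbar)               % about 0.5
std(xbar)                % about (1/sqrt(12))/sqrt(25) = 0.058
hist(xbar, 50)           % the histogram looks close to a normal distribution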
11
Maximum of normal samples
With the normal distribution, the maximum of a sample is more narrowly distributed than the original distribution. The maximum of 10 standard normal samples has a mean of about 1.54 and a standard deviation of 0.59; the maximum of 100 standard normal samples has a mean of about 2.51 and a standard deviation of 0.43. The normal distribution decays exponentially, that is, it has a light tail. Therefore when you take the maximum of a set of samples, its distribution is narrower than the original distribution. This is illustrated here for the case of 10 samples and 100 samples drawn from the standard normal distribution. The left histogram and the values of the mean and standard deviation are obtained with the Matlab sequence

x = randn(10, 100000); maxx = max(x);
hist(maxx, 50)
mean(maxx), std(maxx)

We see that by the time we use 100 samples the maximum has a standard deviation of only 0.43, compared to 1 for the original distribution.
12
Gumbel distribution. Mean, median, mode and variance
For a large number of samples, the minimum of normal samples converges to a distribution called the Type 1 extreme value distribution, or the Gumbel distribution. The slide provides its PDF and CDF and its mean, median, mode and variance (the standard forms are reproduced below for reference). Note that the distribution is defined for the minimum of a sample; if we want the distribution of the maximum of a sample, we need to look at the minimum of the negative. This was done in fitting a distribution to the maximum of samples of size 10 and 100 drawn from the standard normal distribution: 100,000 such sets of samples were drawn, and the negatives of their maxima were fitted to the Gumbel distribution. The left figure shows that the CDF for samples of 10 is markedly different from the Gumbel, but for 100 they agree quite well. The Matlab sequence for the 100 samples was as follows (in the original sequence the fitted mu and sigma were printed and then typed back in to define them; here they are taken directly from the fitted distribution object):

x = randn(100, 100000); maxx = max(x);
pd = fitdist(-maxx', 'ev')      % extreme value (Gumbel, minimum) fit; reports mu and sigma
mu = pd.mu; sigma = pd.sigma;
[F, X] = ecdf(-maxx);
plot(X, F, 'r'); hold on
xd = linspace(-5.3, -1, 1000);
evcd = evcdf(xd, mu, sigma);
plot(xd, evcd);
legend('-max100 data', 'fitted ev1')
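For reference, the standard forms of the smallest-extreme-value (Gumbel, minimum) distribution with location μ and scale σ, which is the parameterization used by Matlab's 'ev' functions, are

f(x) = \frac{1}{\sigma}\exp\left(\frac{x-\mu}{\sigma}\right)\exp\left[-\exp\left(\frac{x-\mu}{\sigma}\right)\right], \qquad F(x) = 1-\exp\left[-\exp\left(\frac{x-\mu}{\sigma}\right)\right],

with mean μ − γσ (γ ≈ 0.5772, the Euler-Mascheroni constant), median μ + σ ln(ln 2), mode μ, and variance π²σ²/6.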
13
Weibull distribution: probability distribution; its log has a Gumbel distribution.
The Weibull distribution is used to describe the distribution of strength or fatigue life in brittle materials. If it describes time to failure, then k < 1 indicates that the failure rate decreases with time, k = 1 indicates a constant rate, and k > 1 indicates an increasing rate. A third parameter can be added by replacing x with x − c. The Gumbel distribution, being the limiting case for the normal, has a fairly light tail. The Weibull is a heavier-tailed limiting distribution, which can also be seen from the fact that its logarithm obeys the Gumbel distribution. It is also called the Type 3 extreme value distribution. The specific equation for the PDF and the figure on the right are taken from Wikipedia; that figure shows the variety of shapes the Weibull PDF can take. The figure on the left compares the experimental CDF of the logarithm of a sample generated from the Weibull distribution (Matlab wblrnd) with the Gumbel distribution (Matlab 'ev') fitted to that sample. The excellent agreement confirms the relation between the Weibull and the Gumbel.
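For reference, the Wikipedia form of the two-parameter Weibull PDF with scale λ (Matlab's a) and shape k (Matlab's b) is f(x) = (k/λ)(x/λ)^(k−1) exp[−(x/λ)^k] for x ≥ 0. A minimal sketch of the relation used in the left figure, with illustrative parameter values (a = 2 and b = 1.5 are assumed here, not necessarily the values behind the slide's figure):

a = 2; b = 1.5;                        % Weibull scale and shape (assumed values)
x = wblrnd(a, b, 10000, 1);            % Weibull sample
lx = log(x);
pd = fitdist(lx, 'ev');                % Gumbel (minimum) fit; expect mu = log(a), sigma = 1/b
[F, X] = ecdf(lx);
plot(X, F, 'r'); hold on
plot(X, evcdf(X, pd.mu, pd.sigma))     % fitted CDF overlays the experimental CDF
legend('log of Weibull data', 'fitted ev', 'Location', 'SouthEast')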
14
Exercises
1. Estimate how much rain Gainesville will have in 2014, as well as the aleatory and epistemic uncertainty in your estimate.
2. Find how many samples of normally distributed numbers you need in order to estimate the mean with an error that will be less than 5% of the true standard deviation 90% of the time. Use the fact that the mean of a sample of a normal variable has the same mean and a standard deviation that is reduced by the square root of the number of samples.
3. Both the lognormal and Weibull distributions are used to model strength. Fit 100 data points generated from a standard lognormal distribution by both lognormal and Weibull distributions. Repeat with 5 randomly generated samples. In each case measure the distance using the K-S distance, and translate the result into a sentence of the following format: "The maximum difference between the two CDFs is at x = 2, where the true probability of x < 2 is 60%, the probability from the experimental CDF is 61%, the probability from the lognormal fit is 62%, and the probability from the Weibull fit is 64%" (these numbers are invented for the purpose of illustrating the format).
4. Generate a histogram of word lengths in this assignment, including hyphens and the math (e.g., x=2 is a 3-letter word), but not punctuation marks. Select an appropriate number of boxes for the histogram and explain your selection. Then fit the distribution of word lengths with five standard distributions, including normal, lognormal, and Weibull, using the K-S criterion. What distribution fits best? Compare the graphs of the CDFs.