Normal Distribution The shaded area is the probability of z > 1
The normal distribution is actually a family of distributions, all with the same shape and parameterised by the mean µ and the standard deviation σ. It is usually defined by a reference member of the family, which is then used to define the other members. This reference member has µ = 0 and σ = 1.
Definition: A random variable Z has a normal (or Gaussian) distribution with mean 0 and standard deviation 1 if and only if its distribution function Φ(z), defined by Φ(z) = p(Z ≤ z), is given by
$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt.$$
We write Z ~ N(0, 1) and say that Z has a standard normal distribution.
Definition: A random variable X has a normal (or Gaussian) distribution with mean µ and standard deviation σ if and only if
$$\frac{X - \mu}{\sigma} \sim N(0, 1).$$
We write X ~ N(µ, σ²) and say that X has a normal distribution.
The normal distribution is symmetric about its mean µ. In particular, if Z ~ N(0, 1), then p(Z ≤ -z) = p(Z ≥ z), i.e. Φ(-z) + Φ(z) = 1 for all z.
Whatever the values of µ and σ, the area between µ - 2σ and µ + 2σ is always about 0.95 (95%).
Similarly, whatever the values of µ and σ, the area between µ - σ and µ + σ is always about 0.68 (68%).
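As a quick numerical check of these two rules, the sketch below uses R's pnorm function (the normal cumulative distribution function) to compute the area within k standard deviations of the mean. The helper name area_within and the parameter values are assumptions made purely for illustration.

```r
# Area within k standard deviations of the mean for a normal distribution
# with mean mu and standard deviation sigma.
# (area_within is a hypothetical helper name, not a built-in R function.)
area_within <- function(k, mu, sigma) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
}

# Two arbitrary parameter choices: the standard normal, and mean 100 with sd 15.
one_sd_standard <- area_within(1, mu = 0, sigma = 1)
one_sd_other    <- area_within(1, mu = 100, sigma = 15)
two_sd_standard <- area_within(2, mu = 0, sigma = 1)
two_sd_other    <- area_within(2, mu = 100, sigma = 15)

c(one_sd_standard, one_sd_other, two_sd_standard, two_sd_other)
```

The areas do not depend on the choice of µ and σ: they come out at about 0.683 for one standard deviation and about 0.954 for two, so 0.68 and 0.95 are rounded values.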
Example: It has been suggested that IQ scores follow a normal distribution with mean 100 and standard deviation 15. Find the probability that a person chosen at random will have (a) an IQ less than 70, (b) an IQ greater than 110, (c) an IQ between 70 and 110.
In R, the function dnorm gives the density of the normal distribution. Generally more useful, though, is pnorm, which gives the cumulative distribution function.
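The relationship between the two functions can be sketched as follows: pnorm(z) is the area under the dnorm curve to the left of z. The example assumes z = 1, the value used in the opening slide, and uses R's integrate function for the numerical integration; the variable names are illustrative only.

```r
z <- 1

# Area under the standard normal density to the left of z,
# obtained by numerical integration of dnorm.
area_left <- integrate(dnorm, lower = -Inf, upper = z)$value

# The same quantity read directly from the cumulative distribution function.
cdf_left <- pnorm(z)

# Probability of z > 1, computed in two equivalent ways.
upper_tail        <- 1 - pnorm(z)
upper_tail_direct <- pnorm(z, lower.tail = FALSE)

# Symmetry of the standard normal: pnorm(-z) + pnorm(z) should equal 1.
symmetry_sum <- pnorm(-z) + pnorm(z)

c(area_left = area_left, cdf_left = cdf_left,
  upper_tail = upper_tail, symmetry_sum = symmetry_sum)
```

The integral and pnorm agree up to the accuracy of the numerical integration, and the upper tail for z > 1 is roughly 0.16.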
So in the IQ example, the probability of an IQ less than 70 is:
> pnorm(70,100,15)
This is approximately 0.0228, i.e. about 2.3%.
And the probability of an IQ less than 110 is:
> pnorm(110,100,15)
This is approximately 0.7475.
Thus, the probability of an IQ greater than 110 is:
> t=pnorm(110,100,15)
> 1-t
This is approximately 0.2525, i.e. about 25%.
Finally, for the probability of an IQ between 70 and 110, carry out a subtraction:
> pnorm(110,100,15) - pnorm(70,100,15)
This is approximately 0.7248, i.e. about 72%.
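Equivalently, working on the standardised normal (z) scale:
> pnorm(0.6667) - pnorm(-2)
Here -2 and 0.6667 are the values 70 and 110 converted to the z scale by subtracting the mean 100 and dividing by the standard deviation 15. The answer is, of course, the same.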
[Figure: standard normal curve with z = -2 and z = 0.6667 marked]
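The conversion to the z scale can be written out explicitly. This is a minimal sketch for the IQ example; the variable names are chosen here for illustration.

```r
mu    <- 100   # mean of the IQ distribution in the example
sigma <- 15    # standard deviation of the IQ distribution in the example

# Convert the two IQ limits to the standardised (z) scale.
z_low  <- (70 - mu) / sigma
z_high <- (110 - mu) / sigma

# Probability of an IQ between 70 and 110, on the original scale ...
p_raw <- pnorm(110, mu, sigma) - pnorm(70, mu, sigma)

# ... and on the standardised scale.
p_std <- pnorm(z_high) - pnorm(z_low)

c(z_low = z_low, z_high = z_high, p_raw = p_raw, p_std = p_std)
```

The two probabilities agree, because pnorm(x, mu, sigma) is the same as pnorm((x - mu)/sigma).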
The Central Limit Theorem
Let X_1, X_2, ..., X_n be independent, identically distributed random variables with mean µ and variance σ². Let S = X_1 + X_2 + ... + X_n. Then elementary probability theory tells us that E(S) = nµ and var(S) = nσ². The Central Limit Theorem (CLT) further states that, provided n is not too small, S has an approximately normal distribution with the above mean nµ and variance nσ².
In other words, S ~ N(nµ, nσ²) approximately. The approximation improves as n increases. We will use R to demonstrate the CLT.
Let X_1, X_2, ..., X_6 come from the uniform distribution U(0, 1).
[Figure: density of U(0, 1), with height 1 between 0 and 1]
For any uniform distribution on [A, B], the mean µ is equal to $\frac{A+B}{2}$ and the variance σ² is equal to $\frac{(B-A)^{2}}{12}$. So for our distribution, µ = 1/2 and σ² = 1/12.
The Central Limit Theorem therefore states that S should have an approximately normal distribution with mean nµ (i.e. 6 × 0.5 = 3) and variance nσ² (i.e. 6 × 1/12 = 0.5). This gives standard deviation $\sqrt{0.5} \approx 0.707$. In other words, S ~ N(3, 0.5) approximately.
Generate 10,000 results in each of six vectors from the uniform distribution on [0, 1] in R:
> x1=runif(10000)
> x2=runif(10000)
> x3=runif(10000)
> x4=runif(10000)
> x5=runif(10000)
> x6=runif(10000)
Let S = X_1 + X_2 + ... + X_6:
> s=x1+x2+x3+x4+x5+x6
> hist(s,nclass=20)
Consider the mean and standard deviation of S:
> mean(s)
> sd(s)
The values returned should be close to 3 and 0.707 respectively, in agreement with our earlier calculations.
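The same experiment can be written more compactly and made reproducible. The following is a sketch rather than the method used in the slides: it assumes an arbitrary seed, stores the uniform draws in a matrix with one column per X_i, and forms S with rowSums.

```r
set.seed(1)      # arbitrary seed, used only so that the run is reproducible

n    <- 6        # number of uniform variables added together
reps <- 10000    # number of simulated values of S

# One row per simulated value of S, one column per X_i.
x <- matrix(runif(n * reps), nrow = reps, ncol = n)
s <- rowSums(x)

# Simulated mean and standard deviation of S ...
simulated <- c(mean = mean(s), sd = sd(s))

# ... compared with the theoretical values n/2 and sqrt(n/12).
theoretical <- c(mean = n * 0.5, sd = sqrt(n / 12))

rbind(simulated, theoretical)
```

The simulated values vary slightly with the seed, but with 10,000 replications they should lie very close to 3 and 0.707.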
One way to examine whether the distribution is approximately normal is to produce a normal Q-Q plot. This is a plot of the sorted values of the vector S (the “data”) against what is in effect an idealised sample of the same size from the N(0,1) distribution.
If the CLT holds good, i.e. if S is approximately normal, then the plot should show an approximate straight line with intercept equal to the mean of S (here 3) and slope equal to the standard deviation of S (here 0.707).
> qqnorm(s)
From these plots it seems that agreement with the normal distribution is very good, despite the fact that we have only taken n = 6, i.e. the convergence is very rapid!
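The intercept and slope of the Q-Q plot can also be checked numerically. The sketch below regenerates S with an arbitrary seed so that it stands alone, adds a reference line with qqline (which by default passes through the first and third quartiles), and fits a least-squares line to the plotted points. The least-squares fit is an illustrative choice made here, not part of the original slides.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Sum of six U(0, 1) variables, simulated 10,000 times.
s <- rowSums(matrix(runif(6 * 10000), ncol = 6))

# qqnorm draws the plot and returns the plotted points:
# x = theoretical N(0, 1) quantiles, y = sample values.
qq <- qqnorm(s)
qqline(s)     # reference line through the quartiles

# Least-squares line through the Q-Q points.
fit       <- lm(qq$y ~ qq$x)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

c(intercept = intercept, slope = slope)
```

If S is approximately normal, the fitted intercept and slope should be close to 3 and 0.707.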
Application: Confidence Intervals for the Mean
Suppose that the random variables Y_1, Y_2, ..., Y_n model independent observations from a distribution with mean µ and variance σ². Then
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$
is the sample mean.
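Now by the CLT,
$$\bar{Y} \sim N\!\left(\mu, \frac{\sigma^{2}}{n}\right) \quad \text{approximately.}$$
This is because the sample mean is the sum divided by n: the mean nµ of the sum is replaced by nµ/n = µ, and the standard deviation σ√n of the sum is replaced by σ√n/n = σ/√n.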
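The scaling of the standard deviation by 1/√n can be checked by simulation. This sketch reuses the U(0, 1) distribution from the CLT demonstration, with n = 6 and an arbitrary seed; these choices are assumptions made for illustration.

```r
set.seed(1)      # arbitrary seed, for reproducibility only

n    <- 6        # sample size for each mean
reps <- 10000    # number of simulated sample means

# Each row is one sample of size n from U(0, 1); rowMeans gives the sample means.
ybar <- rowMeans(matrix(runif(n * reps), nrow = reps, ncol = n))

mu    <- 1 / 2          # mean of U(0, 1)
sigma <- sqrt(1 / 12)   # standard deviation of U(0, 1)

# Simulated mean and sd of the sample mean, against mu and sigma / sqrt(n).
rbind(simulated   = c(mean = mean(ybar), sd = sd(ybar)),
      theoretical = c(mean = mu,         sd = sigma / sqrt(n)))
```

The simulated standard deviation of the sample mean should be close to σ/√n, which is about 0.118 here.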
Recall from Statistics 2 that, if σ² is estimated by the sample variance s², an approximate confidence interval for µ is given by
$$\bar{y} \pm z\, \frac{s}{\sqrt{n}}.$$
Here ȳ is the observed sample mean, and z is the standard normal quantile determined by the level of confidence required.
So for 95% confidence an approximate interval for µ is given by
$$\bar{y} \pm 2\, \frac{s}{\sqrt{n}}.$$
The value 2 is approximate; a more accurate value (about 1.96) can be obtained from tables or by using the qnorm function in R.
Thus in R, an approximate 95% confidence interval for the mean µ is given by
> mean(y)+c(-1,1)*qnorm(0.975)*sqrt(var(y)/length(y))
where y is the vector of observations. A more accurate confidence interval, allowing for the fact that s² is only an estimate of σ², is given by the function t.test.
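The two intervals can be compared side by side. The data vector y below is hypothetical (30 simulated values with an arbitrary seed), used only so that the example runs; any numeric vector of observations could be used in its place.

```r
set.seed(1)                             # arbitrary seed, for reproducibility only
y <- rnorm(30, mean = 100, sd = 15)     # hypothetical observations

se <- sqrt(var(y) / length(y))          # estimated standard error of the mean

# Approximate 95% interval based on the normal quantile qnorm(0.975), about 1.96.
ci_normal <- mean(y) + c(-1, 1) * qnorm(0.975) * se

# Interval from t.test, which uses the t distribution with n - 1 degrees of freedom.
ci_t <- as.numeric(t.test(y, conf.level = 0.95)$conf.int)

# The same t interval computed by hand with qt, for comparison.
ci_t_manual <- mean(y) + c(-1, 1) * qt(0.975, df = length(y) - 1) * se

rbind(ci_normal, ci_t, ci_t_manual)
```

The t interval is slightly wider than the normal-based one, and the difference shrinks as the number of observations grows.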