DATA ANALYSIS Module Code: CA660 Lecture Block 3.


1 DATA ANALYSIS Module Code: CA660 Lecture Block 3

2 MEASURING PROBABILITIES – RANDOM VARIABLES & DISTRIBUTIONS (Primer)
If a statistical experiment gives rise only to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values x_1, x_2, …, x_n with probabilities p_1, p_2, …, p_n, then the expected or average value of X is defined as
E[X] = Σ_j p_j x_j
and its variance is
VAR[X] = E[X²] − (E[X])² = Σ_j p_j x_j² − (E[X])²
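The two defining formulas above can be sketched directly in code. A minimal Python illustration (the fair-die example is my own, not from the lecture):

```python
# Expected value and variance of a discrete random variable, illustrating
# E[X] = sum_j p_j x_j and VAR[X] = E[X^2] - (E[X])^2.

def expectation(values, probs):
    """E[X] = sum over j of p_j * x_j."""
    return sum(p * x for p, x in zip(probs, values))

def variance(values, probs):
    """VAR[X] = E[X^2] - (E[X])^2."""
    ex = expectation(values, probs)
    ex2 = expectation([x * x for x in values], probs)
    return ex2 - ex * ex

# Fair six-sided die: E[X] = 3.5, VAR[X] = 35/12 ≈ 2.917
vals = [1, 2, 3, 4, 5, 6]
ps = [1 / 6] * 6
print(expectation(vals, ps))  # 3.5
print(variance(vals, ps))
```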

3 Random Variable PROPERTIES – Sums and Differences of Random Variables
Define the covariance of two random variables to be
COVAR[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
If X and Y are independent, COVAR[X, Y] = 0.
Lemmas: E[X ± Y] = E[X] ± E[Y]
VAR[X ± Y] = VAR[X] + VAR[Y] ± 2 COVAR[X, Y]
and E[kX] = k E[X], VAR[kX] = k² VAR[X] for a constant k.
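The variance lemma can be verified numerically on any small joint distribution. A sketch with made-up joint probabilities (not from the lecture):

```python
# Numerical check of the lemma VAR[X + Y] = VAR[X] + VAR[Y] + 2 COVAR[X, Y]
# on a small, illustrative joint distribution over (x, y) pairs.

joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def E(g):
    """Expectation of g(x, y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

ex, ey = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: x * x) - ex ** 2
var_y = E(lambda x, y: y * y) - ey ** 2
cov = E(lambda x, y: x * y) - ex * ey
var_sum = E(lambda x, y: (x + y) ** 2) - E(lambda x, y: x + y) ** 2

print(abs(var_sum - (var_x + var_y + 2 * cov)) < 1e-12)  # True
```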

4 Example: R.V. characteristic properties

           B = 1   B = 2   B = 3   Totals
  R = 1      8      10       9       27
  R = 2      5       7       4       16
  R = 3      6       6       7       19
  Totals    19      23      20       62

E[B] = {1(19) + 2(23) + 3(20)} / 62 = 2.02
E[B²] = {1²(19) + 2²(23) + 3²(20)} / 62 = 4.69
VAR[B] = ?
E[R] = {1(27) + 2(16) + 3(19)} / 62 = 1.87
E[R²] = {1²(27) + 2²(16) + 3²(19)} / 62 = 4.23
VAR[R] = ?

5 Example Contd.
E[B+R] = {2(8) + 3(10) + 4(9) + 3(5) + 4(7) + 5(4) + 4(6) + 5(6) + 6(7)} / 62 = 3.89
E[(B+R)²] = {2²(8) + 3²(10) + 4²(9) + 3²(5) + 4²(7) + 5²(4) + 4²(6) + 5²(6) + 6²(7)} / 62 = 16.47
VAR[B+R] = ? *
E[BR] = {1(8) + 2(10) + 3(9) + 2(5) + 4(7) + 6(4) + 3(6) + 6(6) + 9(7)} / 62 = 3.77
COVAR[B, R] = ?
Alternative calculation to *: VAR[B] + VAR[R] + 2 COVAR[B, R]. Comment?
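One way to fill in the "?" entries is to compute everything from the joint frequency table directly. A Python sketch using the counts from the example (R in rows, B in columns, total 62):

```python
# Joint counts for (R, B) from the worked example; completes the "?" slots
# and checks VAR[B + R] = VAR[B] + VAR[R] + 2 COVAR[B, R].
counts = {
    (1, 1): 8, (1, 2): 10, (1, 3): 9,
    (2, 1): 5, (2, 2): 7,  (2, 3): 4,
    (3, 1): 6, (3, 2): 6,  (3, 3): 7,
}
total = sum(counts.values())  # 62

def E(g):
    return sum(c * g(r, b) for (r, b), c in counts.items()) / total

e_b, e_r = E(lambda r, b: b), E(lambda r, b: r)
var_b = E(lambda r, b: b * b) - e_b ** 2
var_r = E(lambda r, b: r * r) - e_r ** 2
covar = E(lambda r, b: r * b) - e_r * e_b
var_sum = E(lambda r, b: (r + b) ** 2) - E(lambda r, b: r + b) ** 2

print(round(var_b, 3), round(var_r, 3), round(covar, 3))
print(abs(var_sum - (var_b + var_r + 2 * covar)) < 1e-12)  # True
```

The covariance comes out very close to zero, which is the "Comment?" the slide is fishing for: B and R are nearly uncorrelated here.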

6 DISTRIBUTIONS - e.g. MENDEL’s PEAS

7 P.D.F./C.D.F.
If X is a R.V. with a finite or countable set of possible outcomes {x_1, x_2, …}, then the discrete probability distribution of X is p(x_i) = P{X = x_i}, and the D.F. or C.D.F. is F(x) = P{X ≤ x} = Σ_{x_i ≤ x} p(x_i).
While, similarly, for X a R.V. taking any value along an interval of the real number line, F(x) = P{X ≤ x}. So if the first derivative exists, then f(x) = dF(x)/dx is the continuous p.d.f., with F(x) = ∫_{−∞}^{x} f(t) dt and ∫_{−∞}^{∞} f(t) dt = 1.
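In the discrete case, the C.D.F. is just a running sum of the point probabilities. A tiny sketch (the four-point distribution is illustrative only):

```python
from itertools import accumulate

# Discrete c.d.f. from a p.d.f.: F(x) = P{X <= x} accumulates the point
# probabilities; F is non-decreasing and ends at 1.
xs = [0, 1, 2, 3]
pdf = [0.1, 0.4, 0.3, 0.2]
cdf = list(accumulate(pdf))
print([round(c, 10) for c in cdf])  # [0.1, 0.5, 0.8, 1.0]
```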

8 EXPECTATION/VARIANCE
Clearly, E[X] = Σ_i x_i p(x_i) for X discrete and E[X] = ∫ x f(x) dx for X continuous, and in either case VAR[X] = E[X²] − (E[X])².

9 Moments and M.G.F.s
For a R.V. X and any non-negative integer k, the kth moment about the origin is defined as the expected value of X^k, i.e. μ′_k = E[X^k].
Central moments (about the mean): the 1st = 0, i.e. E[X] = μ; the second is the variance, Var{X}.
To obtain moments, use the Moment Generating Function. If X has a p.d.f. f(x), the m.g.f. is the expected value of e^{tX}:
For a continuous variable, M_X(t) = ∫ e^{tx} f(x) dx
For a discrete variable, M_X(t) = Σ_x e^{tx} p(x)
Generally: the rth moment of the R.V. is the rth derivative of M_X(t) evaluated at t = 0.
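The "derivative at t = 0" recipe can be checked numerically with finite differences. A sketch for a Poisson(2) variable, where E[X] = 2 and E[X²] = λ² + λ = 6 (the truncation of the infinite support is my own approximation):

```python
import math

# Moments from the m.g.f. M(t) = E[e^{tX}]: the r-th moment is the r-th
# derivative of M at t = 0, approximated here by finite differences.
lam = 2.0
support = range(60)  # truncation: P{X >= 60} is negligible for lambda = 2
pmf = [math.exp(-lam) * lam ** k / math.factorial(k) for k in support]

def mgf(t):
    return sum(p * math.exp(t * k) for k, p in zip(support, pmf))

h = 1e-4
m1 = (mgf(h) - mgf(-h)) / (2 * h)              # ~ E[X] = lambda
m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2  # ~ E[X^2] = lambda^2 + lambda
print(round(m1, 3), round(m2, 3))  # ≈ 2.0, 6.0
```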

10 PROPERTIES - Expectation/Variance etc. of Prob. Distributions (p.d.f.s)
As for R.V.s generally. For X a discrete R.V. with p.d.f. p(x), then for any real-valued function g, E[g(X)] = Σ_x g(x) p(x), e.g. E[aX + b] = aE[X] + b. This applies for more than 2 R.V.s also.
Variance again has properties similar to those given previously, e.g. VAR[aX + b] = a² VAR[X].

11 MENDEL’s Example
Let X record the no. of dominant A alleles in a randomly chosen genotype; then X is a R.V. with sample space S = {0, 1, 2}.
Outcomes in S correspond to the events {X = 0} = {aa}, {X = 1} = {Aa}, {X = 2} = {AA}.
Note: further, any function of X is also a R.V., e.g. Z = 1 if X > 0 (round seed) and Z = 0 if X = 0 (wrinkled), where Z is a variable for the seed character phenotype.

12 Example contd.
So that, for Mendel’s data, P{X = 0} = 1/4, P{X = 1} = 1/2, P{X = 2} = 1/4, and with
E[X] = 0(1/4) + 1(1/2) + 2(1/4) = 1
and VAR[X] = E[X²] − (E[X])² = 3/2 − 1 = 1/2. And P{Z = 1} = P{X > 0} = 3/4, P{Z = 0} = 1/4.
Note: Z = ‘dummy’ or indicator. Could have chosen e.g. Q as a function of X s.t. Q = 0 round (X > 0), Q = 1 wrinkled (X = 0). Then the probabilities for Q are opposite to those for Z, with E[Q] = 1/4 and VAR[Q] = (1/4)(3/4) = 3/16 = VAR[Z].
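The indicator calculations above can be reproduced in a few lines (using the standard 1/4, 1/2, 1/4 genotype probabilities for the Aa × Aa cross):

```python
# Mendel genotype counts as R.V.s: X = number of dominant A alleles,
# Z = 1 for round phenotype (X > 0), Q = 1 - Z for wrinkled.
p_x = {0: 0.25, 1: 0.5, 2: 0.25}  # aa, Aa, AA

e_x = sum(p * x for x, p in p_x.items())                   # E[X] = 1
var_x = sum(p * x * x for x, p in p_x.items()) - e_x ** 2  # VAR[X] = 1/2

p_z1 = sum(p for x, p in p_x.items() if x > 0)  # P{Z=1} = 3/4
var_z = p_z1 * (1 - p_z1)                       # Bernoulli variance 3/16
p_q1 = 1 - p_z1                                 # P{Q=1} = 1/4
var_q = p_q1 * (1 - p_q1)                       # same variance 3/16
print(e_x, var_x, p_z1, var_z == var_q)  # 1.0 0.5 0.75 True
```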

13 JOINT/MARGINAL DISTRIBUTIONS
Joint cumulative distribution of X and Y, marginal cumulative for X without regard to Y, and joint distribution (p.d.f.) of X and Y are then, respectively,
F(x, y) = P{X ≤ x, Y ≤ y}   (1)
F_X(x) = P{X ≤ x} = F(x, ∞)   (2)
p(x, y) = P{X = x, Y = y}, with marginal p_X(x) = Σ_y p(x, y)   (3)
Similarly for the continuous case, e.g. (2) becomes F_X(x) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(u, y) dy du.

14 Example: Backcross 2-locus model (AaBb × aabb)
Observed and expected frequencies. Genotypic S.R. 1:1; expected S.R. across crosses 1:1:1:1

Cross:           1          2          3          4        Pooled
Genotype frequency
AaBb         310(300)    36(30)    360(300)    74(60)    780(690)
Aabb         287(300)    23(30)    230(300)    50(60)    590(690)
aaBb         288(300)    23(30)    230(300)    44(60)    585(690)
aabb         315(300)    38(30)    380(300)    72(60)    805(690)
Marginal A
Aa           597(600)    59(60)    590(600)   124(120)  1370(1380)
aa           603(600)    61(60)    610(600)   116(120)  1390(1380)
Marginal B
Bb           598(600)    59(60)    590(600)   118(120)  1365(1380)
bb           602(600)    61(60)    610(600)   122(120)  1395(1380)
Sum            1200        120       1200        240       2760
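Summing the joint genotype counts recovers the marginal rows of the table. A sketch using the pooled column:

```python
# Marginal frequencies from the pooled joint backcross counts: summing
# over B gives the A marginal, and over A gives the B marginal.
pooled = {("Aa", "Bb"): 780, ("Aa", "bb"): 590,
          ("aa", "Bb"): 585, ("aa", "bb"): 805}

marginal_a = {}
marginal_b = {}
for (a, b), count in pooled.items():
    marginal_a[a] = marginal_a.get(a, 0) + count
    marginal_b[b] = marginal_b.get(b, 0) + count

print(marginal_a)  # {'Aa': 1370, 'aa': 1390}
print(marginal_b)  # {'Bb': 1365, 'bb': 1395}
print(sum(pooled.values()))  # 2760
```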

15 CONDITIONAL DISTRIBUTIONS
Conditional distribution of X, given that Y = y:
P{X = x | Y = y} = p(x, y) / p_Y(y), for p_Y(y) > 0
where for X and Y independent, P{X = x | Y = y} = p_X(x) and p(x, y) = p_X(x) p_Y(y).
Example: Mendel’s expt. Probability that a round seed (Z = 1) is a homozygote AA, i.e. (X = 2):
P{X = 2 | Z = 1} = P{X = 2, Z = 1} / P{Z = 1} = (1/4) / (3/4) = 1/3
where the numerator is the JOINT probability, i.e. the intersection (AND), as above.
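The Mendel conditional probability in a couple of lines:

```python
# P{X=2 | Z=1} = P{X=2, Z=1} / P{Z=1}: probability a round seed is AA.
p_x = {0: 0.25, 1: 0.5, 2: 0.25}
p_joint = p_x[2]        # {X=2} implies {Z=1}, so the joint is P{X=2} = 1/4
p_z1 = p_x[1] + p_x[2]  # P{Z=1} = 3/4
print(p_joint / p_z1)   # 1/3 = 0.3333...
```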

16 Standard Statistical Distributions – Importance
Modelling practical applications
Mathematical properties are known
Described by few parameters, which have natural interpretations
Bernoulli Distribution. This is used to model a trial/expt. which gives rise to two outcomes: success/failure, male/female, 0/1, … Let p be the probability that the outcome is one and q = 1 − p that the outcome is zero.
E[X] = p(1) + (1 − p)(0) = p
VAR[X] = p(1)² + (1 − p)(0)² − (E[X])² = p(1 − p)
(Figure: two-point probability chart, 1 − p at 0 and p at 1.)
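The Bernoulli moments follow straight from the two-point distribution. A minimal check (p = 0.3 is an arbitrary choice):

```python
# Bernoulli(p): E[X] = p and VAR[X] = p(1 - p), from {0: 1-p, 1: p}.
p = 0.3
mean = p * 1 + (1 - p) * 0
var = p * 1 ** 2 + (1 - p) * 0 ** 2 - mean ** 2
print(mean, round(var, 2))  # 0.3 0.21
```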

17 Standard distributions - Binomial
Binomial Distribution. Suppose that we are interested in the number of successes X in n independent repetitions of a Bernoulli trial, where the probability of success in an individual trial is p. Then
Prob{X = k} = nCk p^k (1 − p)^{n−k},  (k = 0, 1, …, n)
E[X] = np
VAR[X] = np(1 − p)
(Figure: the binomial p.m.f. for n = 4, p = 0.2.)
This is the appropriate distribution to model e.g. the number of recombinant gametes produced by a heterozygous parent for a 2-locus model. The extension for ≥ 3 loci is the multinomial.
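The p.m.f. and its moments can be checked for the n = 4, p = 0.2 case sketched on the slide:

```python
from math import comb

# Binomial(n, p): P{X=k} = C(n,k) p^k (1-p)^(n-k); probabilities sum to 1
# and give E[X] = np, VAR[X] = np(1-p).
n, p = 4, 0.2
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum(k * k * q for k, q in enumerate(pmf)) - mean ** 2
print(round(sum(pmf), 10), round(mean, 10), round(var, 10))  # 1.0 0.8 0.64
```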

18 Standard distributions - Poisson
Poisson Distribution. The Poisson distribution arises as a limiting case of the binomial distribution, where n → ∞ and p → 0 in such a way that np → λ (constant):
P{X = k} = exp(−λ) λ^k / k!,  (k = 0, 1, 2, …)
E[X] = VAR[X] = λ
The Poisson is used to model the no. of occurrences of a certain phenomenon in a fixed period of time or space, e.g.
- particles emitted by a radioactive source in a fixed direction for interval ΔT
- people arriving in a queue in a fixed interval of time
- genomic mapping functions, e.g. cross-over as a random event
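The limiting-case claim can be seen numerically: Bin(n, λ/n) probabilities approach the Poisson p.m.f. as n grows. A sketch (λ = 3, k = 2 chosen arbitrarily):

```python
import math

# Poisson as the binomial limit: Bin(n, lam/n) pmf -> exp(-lam) lam^k / k!.
lam, k = 3.0, 2

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

exact = poisson_pmf(k, lam)
for n in (10, 100, 10000):
    print(n, round(binom_pmf(k, n, lam / n), 6), "vs", round(exact, 6))
```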

19 Other Standard examples: e.g. Hypergeometric, Exponential, …
Consider a population of M items, of which W are deemed to be successes. Let X be the number of successes that occur in a sample of size n, drawn without replacement from the finite population. Then
Prob{X = k} = WCk · M−WCn−k / MCn,  (k = 0, 1, 2, …)
E[X] = nW / M
VAR[X] = nW(M − W)(M − n) / {M²(M − 1)}
Exponential: a special case of the Gamma distribution with n = 1, used e.g. to model the inter-arrival time of customers or the time to arrival of the first customer in a simple queue, fragment lengths in genome mapping, etc. The p.d.f. is
f(x) = λ exp(−λx),  x ≥ 0, λ > 0
     = 0 otherwise
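The hypergeometric mean and variance formulas can be verified against the p.m.f. directly. A sketch with illustrative values M = 20, W = 8, n = 5 (my own choice):

```python
from math import comb

# Hypergeometric: pmf C(W,k) C(M-W, n-k) / C(M,n) sums to 1, with
# E[X] = nW/M and VAR[X] = nW(M-W)(M-n) / (M^2 (M-1)).
M, W, n = 20, 8, 5
pmf = {k: comb(W, k) * comb(M - W, n - k) / comb(M, n)
       for k in range(0, min(n, W) + 1)}

mean = sum(k * p for k, p in pmf.items())
var = sum(k * k * p for k, p in pmf.items()) - mean ** 2
print(abs(mean - n * W / M) < 1e-9)                                      # True
print(abs(var - n * W * (M - W) * (M - n) / (M ** 2 * (M - 1))) < 1e-9)  # True
```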

20 Standard p.d.f.s - Gaussian/Normal
A random variable X has a normal distribution with mean μ and standard deviation σ if it has density
f(x) = (1 / (σ√(2π))) exp{−(x − μ)² / (2σ²)},  −∞ < x < ∞
with E[X] = μ and VAR[X] = σ².
Arises naturally as the limiting distribution of the average of a set of independent, identically distributed random variables with finite variances. Plays a central role in sampling theory and is a good approximation to a large class of empirical distributions. The default assumption in many empirical studies is that each observation is approx. ~ N(μ, σ²).
Statistical tables of the Normal distribution are of great importance in analysing practical data sets. X is said to be a Standardised Normal variable if μ = 0 and σ = 1.
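The density, and the tail areas the statistical tables record, are easy to compute; the C.D.F. of the standardised Normal is available through the error function. A sketch:

```python
import math

# Normal density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi));
# standardising z = (x - mu)/sigma reduces any N(mu, sigma^2) to N(0, 1).
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# P{X <= x} via the error function (how N(0,1) tables are generated):
def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(round(normal_pdf(0), 4))     # 0.3989
print(round(normal_cdf(1.96), 4))  # 0.975, the familiar table value
```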

21 Standard p.d.f.s: Student’s t-distribution
A random variable X has a t-distribution with n d.o.f. (t_n) if it has density
f(x) = Γ((n + 1)/2) / (√(nπ) Γ(n/2)) · (1 + x²/n)^{−(n+1)/2},  −∞ < x < ∞
     = 0 otherwise.
Symmetrical about the origin, with E[X] = 0 and V[X] = n / (n − 2). For small n, the t_n distribution is very flat; for n ≥ 25, the t_n distribution ≈ the standard normal curve.
Suppose Z is a standard Normal variable, W has a χ²_n distribution, and Z and W are independent; then the r.v. T = Z / √(W/n) has the t_n form.
If x_1, x_2, …, x_n is a random sample from N(μ, σ²) and, if we define s² = Σ(x_i − x̄)² / (n − 1), then t = (x̄ − μ) / (s/√n) has a t_{n−1} distribution.
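The density and the moment claims can be checked by crude numerical integration (the grid and cut-off below are my own approximations):

```python
import math

# t_n density: f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) *
# (1 + x^2/n)^(-(n+1)/2). Numeric integration recovers total mass 1,
# E[X] = 0 (by symmetry) and V[X] = n/(n-2).
n = 10
c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))

def t_pdf(x):
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

dx = 0.01
xs = [-50 + i * dx for i in range(10001)]  # grid over [-50, 50]
mass = sum(t_pdf(x) for x in xs) * dx
var = sum(x * x * t_pdf(x) for x in xs) * dx
print(round(mass, 3), round(var, 3))  # ≈ 1.0 and n/(n-2) = 1.25
```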

22 Chi-Square Distribution
A r.v. X has a Chi-square distribution with n degrees of freedom (n a positive integer) if it is a Gamma distribution with λ = 1/2 and r = n/2, so its p.d.f. is
f(x) = x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)),  x > 0
E[X] = n;  Var[X] = 2n
Two important applications:
- If X_1, X_2, …, X_n is a sequence of independently distributed Standardised Normal random variables, then the sum of squares X_1² + X_2² + … + X_n² has a χ² distribution (n degrees of freedom).
- If x_1, x_2, …, x_n is a random sample from N(μ, σ²), then x̄ = Σx_i/n and s² = Σ(x_i − x̄)²/(n − 1), and (n − 1)s²/σ² has a χ² distribution with n − 1 d.o.f., with the r.v.s x̄ and s² independent.
(Figure: χ²_ν density curve.)
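The stated mean and variance can be recovered from the p.d.f. by numerical integration (grid and cut-off are my own crude choices):

```python
import math

# Chi-square(n) p.d.f.: f(x) = x^(n/2 - 1) e^(-x/2) / (2^(n/2) Gamma(n/2)).
# A trapezoid-style check that the density integrates to 1 with
# E[X] = n and VAR[X] = 2n.
def chi2_pdf(x, n):
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

n = 4
dx = 0.001
xs = [i * dx for i in range(1, 80000)]  # integrate over (0, 80)
mass = sum(chi2_pdf(x, n) for x in xs) * dx
mean = sum(x * chi2_pdf(x, n) for x in xs) * dx
var = sum(x * x * chi2_pdf(x, n) for x in xs) * dx - mean ** 2
print(round(mass, 3), round(mean, 3), round(var, 3))  # ≈ 1.0 4.0 8.0
```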

23 F-Distribution
A r.v. X has an F distribution with m and n d.o.f. if it has density function
f(x) = [Γ((m + n)/2) / (Γ(m/2) Γ(n/2))] (m/n)^{m/2} x^{m/2 − 1} (1 + mx/n)^{−(m+n)/2} for x > 0 (a ratio of gamma functions)
     = 0 otherwise.
For X and Y independent r.v.s, X ~ χ²_m and Y ~ χ²_n, then F = (X/m) / (Y/n) ~ F_{m,n}.
One consequence: if x_1, x_2, …, x_m (m ≥ 2) is a random sample from N(μ_1, σ_1²), and y_1, y_2, …, y_n (n ≥ 2) a random sample from N(μ_2, σ_2²), then (s_1²/σ_1²) / (s_2²/σ_2²) ~ F_{m−1, n−1}.
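The ratio-of-chi-squares construction can be sketched by Monte Carlo: build each χ² variate as a sum of squared standard Normals and check that the resulting sample mean sits near E[F] = n/(n − 2). A seeded simulation sketch (sample size and d.o.f. are my own choices):

```python
import random

# F(m, n) variate as (X/m) / (Y/n) for independent X ~ chi2_m, Y ~ chi2_n,
# each built as a sum of squared standard Normal draws.
random.seed(1)
m, n = 5, 10

def chi2_draw(df):
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

draws = [(chi2_draw(m) / m) / (chi2_draw(n) / n) for _ in range(50000)]
mean = sum(draws) / len(draws)
print(round(mean, 2))  # should sit near n/(n-2) = 1.25
```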

