1 Lecture 3: Statistics Review I
Date: 9/3/02
- Distributions
- Likelihood
- Hypothesis tests

2 Sources of Variation
- Definition: Sampling variation arises because we sample only a fraction of the full population (e.g. the mapping population).
- Definition: Experimental error is the variation introduced by the laboratory procedures used to make measurements. Sometimes this error is systematic.

3 Parameters vs. Estimates
- Definition: The population is the complete collection of all individuals or things about which you wish to make inferences. Statistics calculated on populations are parameters.
- Definition: The sample is a subset of the population on which you make measurements. Statistics calculated on samples are estimates.

4 Types of Data
- Definition: Data are discrete when they can take on only countably many different values (e.g. genotypes).
- Definition: Many complex and economically valuable traits are continuous. Such traits are quantitative, and the random variables associated with them are continuous; loci affecting them are quantitative trait loci (QTL).

5 Random Experiments
We are concerned with the outcome of random experiments, e.g.:
- production of gametes
- union of gametes (fertilization)
- formation of chiasmata and recombination events

6 Set Theory I
Set theory underlies probability.
- Definition: A set is a collection of objects.
- Definition: An element is an object in a set.
- Notation: s ∈ S means "s is an element of S".
- Definition: If A and B are sets, then A is a subset of B if and only if s ∈ A implies s ∈ B.
- Notation: A ⊆ B means "A is a subset of B".

7 Set Theory II
- Definition: Two sets A and B are equal if and only if A ⊆ B and B ⊆ A. We write A = B.
- Definition: The universal set is the superset of all other sets, i.e. all other sets are included within it. Often represented as Ω.
- Definition: The empty set contains no elements and is denoted ∅.

8 Sample Space & Event
- Definition: The sample space for a random experiment is the set Ω that includes all possible outcomes of the experiment.
- Definition: An event is a set of possible outcomes of the experiment. An event E is said to happen if any one of the outcomes in E occurs.

9 Example: Mendel I
- Mendel took inbred lines of smooth (AA) and wrinkled (BB) peas and crossed them to make the F1 generation, then crossed the F1 again to make the F2 generation. The smooth allele A is dominant to the wrinkled allele B.
- The random experiment is the random production of gametes and fertilization to produce peas.
- The sample space of genotypes for the F2 is {AA, AB, BB}.

10 Random Variable
- Definition: A function from set S to set T is a rule assigning to each s ∈ S an element t ∈ T.
- Definition: Given a random experiment on sample space Ω, a function from Ω to T is a random variable. We often write X, Y, or Z. If we were very careful, we'd write X(s).
- Simply, X is a measurement of interest on the outcome of a random experiment.

11 Example: Mendel II
- Let X be the number of A alleles in a randomly chosen F2 genotype. X is a random variable.
- The set of possible values of X is {0, 1, 2}.

12 Discrete Probability Distribution
Suppose X is a random variable with possible outcomes {x_1, x_2, …, x_m}. Define the discrete probability distribution for random variable X as
p(x_i) = P(X = x_i), with Σ_{i=1}^{m} p(x_i) = 1.

13 Example: Mendel III
For the F2 cross, with X the number of A alleles: p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.
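The Mendel III distribution can be recovered by brute force. A minimal sketch (not from the original slides): enumerate the equally likely gamete combinations of an F1 × F1 cross and count A alleles.

```python
from itertools import product

# Each F1 parent (genotype AB) transmits A or B with probability 1/2.
gametes = ["A", "B"]

counts = {0: 0, 1: 0, 2: 0}
for g1, g2 in product(gametes, repeat=2):  # 4 equally likely fertilizations
    x = (g1 == "A") + (g2 == "A")          # X = number of A alleles
    counts[x] += 1

p = {x: c / 4 for x, c in counts.items()}
print(p)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

This matches the slide's p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.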

14 Cumulative Distribution
- The discrete cumulative distribution function is defined as F(x) = P(X ≤ x) = Σ_{x_i ≤ x} p(x_i).
- The continuous cumulative distribution function is defined as F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt.

15 Continuous Probability Distribution
If f(x) = dF(x)/dx exists, then f(x) is the continuous probability density. As in the discrete case, the total probability is 1: ∫_{-∞}^{∞} f(x) dx = 1.

16 Expectation and Variance
- E(X) = Σ_i x_i p(x_i) (discrete) or ∫ x f(x) dx (continuous).
- Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]².
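A quick sketch of the expectation and variance definitions, applied to X, the number of A alleles in an F2 genotype:

```python
# F2 distribution of X from the Mendel III example.
dist = {0: 0.25, 1: 0.5, 2: 0.25}

mean = sum(x * p for x, p in dist.items())               # E(X)
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # Var(X)
print(mean, var)  # 1.0 0.5
```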

17 Moments and MGF
- Definition: The r-th moment of X is E(X^r).
- Definition: The moment generating function is M_X(t) = E(e^{tX}). Its r-th derivative at t = 0 equals the r-th moment E(X^r).
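A sketch (not from the slides) checking numerically that the derivative of the MGF at t = 0 recovers E(X), using the F2 distribution of A-allele counts:

```python
import math

dist = {0: 0.25, 1: 0.5, 2: 0.25}  # distribution of X from Mendel III

def mgf(t):
    # M_X(t) = E(e^{tX})
    return sum(p * math.exp(t * x) for x, p in dist.items())

h = 1e-6
first_moment = (mgf(h) - mgf(-h)) / (2 * h)  # numerical d/dt at t = 0
print(round(first_moment, 4))  # ≈ 1.0 = E(X)
```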

18 Example: Mendel IV
- Define the random variable Z as follows: Z = 1 if the pea is smooth (genotype AA or AB), Z = 0 if wrinkled (BB).
- If we hypothesize that smooth dominates wrinkled in a single-locus model, then the corresponding probability model is P(Z = 1) = 3/4, P(Z = 0) = 1/4.

19 Example: Mendel V

20 Joint and Marginal Cumulative Distributions
- Definition: Let X and Y be two random variables. The joint cumulative distribution is F(x, y) = P(X ≤ x, Y ≤ y).
- Definition: The marginal cumulative distribution is F_X(x) = P(X ≤ x) = F(x, ∞).

21 Joint Distribution
- Definition: The joint distribution is p(x, y) = P(X = x, Y = y) in the discrete case (a joint density f(x, y) in the continuous case).
- As before, the sum or integral over the sample space is 1.

22 Conditional Distribution
- Definition: The conditional distribution of X given that Y = y is p(x|y) = p(x, y) / p(y), defined when p(y) > 0.
- Lemma: If X and Y are independent, then p(x|y) = p(x), p(y|x) = p(y), and p(x, y) = p(x)p(y).

23 Example: Mendel VI
P(homozygous | smooth seed) = P(AA) / P(smooth) = (1/4) / (3/4) = 1/3.
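A sketch of the Mendel VI conditional-probability calculation, computed directly from the F2 genotype distribution:

```python
# F2 genotype probabilities; with A dominant, AA and AB peas are smooth.
genotype_p = {"AA": 0.25, "AB": 0.5, "BB": 0.25}
smooth = {"AA", "AB"}

p_smooth = sum(p for g, p in genotype_p.items() if g in smooth)
p_homozygous_and_smooth = genotype_p["AA"]  # only AA is both

p_cond = p_homozygous_and_smooth / p_smooth  # P(homozygous | smooth)
print(round(p_cond, 4))  # 0.3333
```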

24 Binomial Distribution
Suppose there is a random experiment with two possible outcomes, which we call "success" and "failure". Suppose there is a constant probability p of success for each experiment, and that n experiments of this type are independent. Let X be the random variable that counts the total number of successes. Then X ~ Bin(n, p).

25 Properties of Binomial Distribution
- P(X = k) = C(n, k) p^k (1 − p)^{n−k}, for k = 0, 1, …, n.
- E(X) = np, Var(X) = np(1 − p).
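A sketch of the Bin(n, p) pmf, with a direct-summation check of E(X) = np (n and p here are arbitrary illustrative values):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.25
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
print(round(mean, 6))  # 2.5 = n * p
```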

26 Examples: Binomial Distribution
- Recombination fraction θ between two loci: count the number of recombinant gametes among n sampled.
- Phenotype in Mendel's F2 cross: count the number of smooth peas in the F2.

27 Multinomial Distribution
- Suppose you consider genotype in Mendel's F2 cross, or a 3-point cross.
- Definition: Suppose there are m possible outcomes with probabilities p_1, …, p_m, and the random variables X_1, X_2, …, X_m count the number of times each outcome is observed in n trials. Then
P(X_1 = x_1, …, X_m = x_m) = [n! / (x_1! ⋯ x_m!)] p_1^{x_1} ⋯ p_m^{x_m}.
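A sketch of the multinomial pmf, applied to F2 genotype counts (AA, AB, BB) with probabilities (1/4, 1/2, 1/4); the sample of 4 peas is an illustrative assumption:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    # P(X_1 = x_1, ..., X_m = x_m) = n!/(x_1! ... x_m!) * p_1^x_1 ... p_m^x_m
    n = sum(counts)
    coef = factorial(n) / prod(factorial(c) for c in counts)
    return coef * prod(p**c for p, c in zip(probs, counts))

# Probability of observing counts (1 AA, 2 AB, 1 BB) among n = 4 F2 peas.
val = multinomial_pmf([1, 2, 1], [0.25, 0.5, 0.25])
print(val)  # 0.1875
```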

28 Poisson Distribution
- Consider the binomial distribution when p is small and n is large, but np = λ is constant. Then
P(X = k) → e^{−λ} λ^k / k! as n → ∞.
- The distribution obtained is the Poisson distribution.
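A numerical sketch of the limit above: Bin(n, λ/n) probabilities approach the Poisson(λ) pmf as n grows (λ = 2 and k = 3 are illustrative choices):

```python
from math import comb, exp, factorial

lam, k = 2.0, 3
poisson = exp(-lam) * lam**k / factorial(k)  # e^-lam * lam^k / k!

for n in (10, 100, 10000):
    p = lam / n
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    print(n, round(binom, 6), round(poisson, 6))  # binomial -> Poisson
```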

29 Properties of Poisson Distribution
For X ~ Poisson(λ): E(X) = λ and Var(X) = λ.

30 Normal Distribution
Confidence intervals for the recombination fraction can be estimated using the normal distribution.

31 Properties of Normal Distribution
For X ~ N(μ, σ²): f(x) = [1 / (σ√(2π))] exp(−(x − μ)² / (2σ²)), with E(X) = μ and Var(X) = σ².

32 Chi-Square Distribution
Many hypothesis tests in statistical genetics use the chi-square distribution. If Z_1, …, Z_k are independent N(0, 1) random variables, then Σ Z_i² follows a chi-square distribution with k degrees of freedom.

33 Likelihood I
- Likelihoods are used frequently with genetic data because they handle the complexities of genetic models well.
- Let θ be a parameter or vector of parameters that affects the distribution of the random variable X, e.g. θ = (μ, σ²) for the normal distribution.

34 Likelihood II
- Then we can write a likelihood L(θ) = ∏_{i=1}^{n} p(x_i | θ), where we have observed an independent sample of size n, namely x_1, x_2, …, x_n, conditioned on the parameter θ.
- Normally θ is not known to us. To find the θ that best fits the data, we maximize L(θ) over all θ.

35 Example: Likelihood of Binomial
Observing k successes in n trials, L(p) = C(n, k) p^k (1 − p)^{n−k}. Maximizing log L(p) gives the maximum likelihood estimate p̂ = k/n.
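A sketch maximizing the binomial log-likelihood over a grid of p values, confirming the closed-form MLE p̂ = k/n (the counts k = 7, n = 20 are illustrative):

```python
from math import comb, log

k, n = 7, 20

def loglik(p):
    # log L(p) = log C(n, k) + k log p + (n - k) log(1 - p)
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # p in (0, 1)
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.35 = k / n
```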

36 The Score
- Definition: The first derivative of the log likelihood with respect to the parameter is the score, S(θ) = d log L(θ)/dθ.
- For example, the score for the binomial parameter p is S(p) = k/p − (n − k)/(1 − p), which is zero at the MLE p̂ = k/n.
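A sketch (not from the slides) checking the closed-form binomial score against a numerical derivative of the log-likelihood at an arbitrary point p = 0.3:

```python
from math import comb, log

k, n, p = 7, 20, 0.3

def loglik(q):
    return log(comb(n, k)) + k * log(q) + (n - k) * log(1 - q)

h = 1e-7
numeric = (loglik(p + h) - loglik(p - h)) / (2 * h)  # central difference
closed = k / p - (n - k) / (1 - p)                   # score formula
print(round(numeric, 3), round(closed, 3))
```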

37 Information Content
- Definition: The information content is I(θ) = −E[d² log L(θ)/dθ²].
- When the negative second derivative is evaluated at the maximum likelihood estimate, without taking the expectation, it is called the observed information.

38 Hypothesis Testing
- Most experiments begin with a hypothesis. This hypothesis must be converted into a statistical hypothesis.
- Statistical hypotheses consist of a null hypothesis H0 and an alternative hypothesis HA.
- Statistics are used to reject H0 and accept HA. Sometimes we cannot reject H0 and we retain it instead.

39 Rejection Region I
- Definition: Given the cumulative distribution function F(X) of the test statistic X, the critical region for a hypothesis test is the region of rejection: the area under the probability distribution where the observed test statistic is unlikely to fall if H0 is true.
- The rejection region may or may not be symmetric.

40 Rejection Region II
[Figure: distribution of the test statistic under H0, showing rejection regions beyond cutoffs x_l and x_u (two-tailed) or x_c (one-tailed).]

41 Acceptance Region
The region where H0 cannot be rejected.

42 One-Tailed vs. Two-Tailed
- Use a one-tailed test when the hypothesis is directional, e.g. H0: θ ≥ 0.5 vs. HA: θ < 0.5.
- Use a two-tailed test when the hypothesis is non-directional, e.g. H0: θ = 0.5 vs. HA: θ ≠ 0.5.

43 Critical Values
Definition: Critical values are the values corresponding to the cut-off points between the rejection and acceptance regions.

44 P-Value
- Definition: The p-value is the probability, assuming H0 is true, of observing a sample outcome at least as extreme as the one actually observed.
- Reject H0 when the p-value ≤ α.
- The significance level of the test is α.

45 Chi-Square Test: Goodness-of-Fit
- χ² = Σ_{i=1}^{a} (o_i − e_i)² / e_i, where o_i are observed and e_i expected counts over a categories; calculate the e_i under H0.
- χ² is distributed as chi-square with a − 1 degrees of freedom. When the expected values depend on k unknown parameters, df = a − 1 − k.
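A sketch of the goodness-of-fit statistic with hypothetical counts: testing F2 phenotype counts against the 3:1 smooth:wrinkled ratio expected under H0.

```python
observed = [290, 110]  # hypothetical smooth, wrinkled counts
n = sum(observed)
expected = [0.75 * n, 0.25 * n]  # e_i under H0: 3:1 ratio

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 4))  # compare to chi-square with a - 1 = 1 df
```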

46 Chi-Square Test: Test of Independence
- e_ij = n p_0i p_0j
- degrees of freedom = (a − 1)(b − 1)
- Example: test for linkage

47 Likelihood Ratio Test
- G = 2 ln(LR) = 2 ln[L(HA)/L(H0)]
- G ~ χ² with degrees of freedom equal to the difference in the number of parameters.
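A sketch of the likelihood ratio test for binomial data, testing H0: p = 0.5 against the unrestricted alternative p = p̂ (the counts k = 70, n = 100 are illustrative; the binomial coefficient cancels in the ratio):

```python
from math import log

k, n = 70, 100
p_hat, p0 = k / n, 0.5

def loglik(p):
    # log-likelihood up to the constant log C(n, k), which cancels in G
    return k * log(p) + (n - k) * log(1 - p)

G = 2 * (loglik(p_hat) - loglik(p0))
print(round(G, 4))  # compare to chi-square with 1 df
```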

48 LR: Goodness-of-Fit & Independence Test
- goodness-of-fit: G = 2 Σ_i o_i ln(o_i / e_i)
- independence test: G = 2 Σ_i Σ_j o_ij ln(o_ij / e_ij)

49 Compare χ² and Likelihood Ratio
- Both give similar results.
- The LR test is more powerful when there are unknown parameters involved.

50 LOD Score
- LOD stands for log of odds: Z = log10[L(HA)/L(H0)].
- It is commonly denoted by Z.
- The interpretation is that HA is 10^Z times more likely than H0. The p-values obtained via the LR statistic for LOD score Z are approximately 10^(−Z).
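A sketch relating the LOD score to the LR statistic via G = 2 ln(10) · Z, reusing the illustrative binomial counts k = 70, n = 100 with H0: p = 0.5:

```python
from math import log

k, n, p0 = 70, 100, 0.5
p_hat = k / n

def loglik(p):
    return k * log(p) + (n - k) * log(1 - p)

Z = (loglik(p_hat) - loglik(p0)) / log(10)  # log10 likelihood ratio
G = 2 * log(10) * Z                          # same as the LR statistic
print(round(Z, 3), round(G, 3))
```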

51 Nonparametric Hypothesis Testing
- What do you do when the test statistic does not follow some standard probability distribution?
- Use an empirical distribution: assume H0 and resample (bootstrap, jackknife, or permutation) to generate an empirical null distribution of the test statistic.
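A sketch of the permutation approach with hypothetical measurements: build an empirical null distribution for a difference in group means by reshuffling group labels, as exchangeability under H0 allows.

```python
import random

random.seed(0)
group_a = [2.1, 2.5, 1.9, 2.8, 2.4]  # hypothetical measurements
group_b = [1.2, 1.5, 1.1, 1.8, 1.4]

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
pooled = group_a + group_b

null = []
for _ in range(10000):
    random.shuffle(pooled)              # relabel under H0
    a, b = pooled[:5], pooled[5:]
    null.append(sum(a) / 5 - sum(b) / 5)

# Empirical two-sided p-value from the permutation null distribution.
p_value = sum(abs(d) >= abs(observed) for d in null) / len(null)
print(round(observed, 2), p_value)
```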

