1 Bojan Basrak Department of Mathematics, University of Zagreb, Croatia EVA 2005, Gothenburg EXTREME VALUES, COPULAS AND GENETIC MAPPING.

2 Genetic mapping A genetic map gives the relative positions of genes on the chromosomes, with distances between them typically measured in centimorgans (cM). Linkage analysis aims to find the approximate location of genes associated with certain traits in plants and animals. It is a statistical method that compares the genetic similarity between two individuals (at a marker) to the similarity of their physical or psychological traits (phenotype). Among the most studied traits are inheritable diseases.

3 QTL Quantitative trait: a measurable trait that shows continuous variation, e.g. skin pigmentation, height, cholesterol, etc. Quantitative traits are normally influenced by several genes and the environment. QTL, or quantitative trait locus: a locus (or a gene) affecting a quantitative trait. There is even The Journal of Quantitative Trait Loci.

4 Genetic similarity between two individuals at a given locus is typically measured by a number called identity by descent (IBD) status. Two genes of two different people are IBD if one is a physical copy of the other, or if they are both copies of the same ancestral gene. For any two people, IBD status is a number in the set {0,1,2}. In real life, this number typically needs to be estimated.

5 Linkage analysis is very effective with Mendelian inheritance. Mapping genes involved in inheritable diseases can be done by comparing the IBD status of affected relatives (e.g. breast cancer). Mapping QTLs in animals or plants is performed by arranging a cross between two inbred strains which are substantially different in a quantitative trait (e.g. tomato fruit mass or pH).

6 IBD status of two half sibs [Figure: the mother's chromosomes and the chromosomes of two half sibs (Sib 1, Sib 2) after two meioses and some other developments, with loci t and s marked; horizontal axis: distance in Morgans. In the example shown, X(t)=0 and X(s)=1.] X(t) = number of alleles identical by descent at locus t.

7 Recombinations, or more specifically the locations of crossovers in meiosis, are frequently modelled by a stochastic process (the standard choice is the Poisson process, suggested by Haldane in 1919). The process (X(t)) is an ON-OFF process in the case of half sibs, or a sum of two independent such processes in the case of siblings. In particular, under the Poisson process model, (X(t)) is a stationary Markov process. Moreover, X(t) is Bernoulli distributed for each t in the case of half sibs.

8 In the Haldane model, we have [formula involving the recombination probability; not preserved in the transcript]. For simplicity, we assume that the IBD status is known at each marker (i.e. markers are completely genetically informative).
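The formula on this slide is lost, so the following is only a sketch of the standard Haldane relations, not a reconstruction of the slide. It assumes Haldane's map function for the recombination probability between two loci d Morgans apart, and that for half sibs the IBD status at each locus is Bernoulli(1/2) and stays the same between two loci exactly when both meioses recombine or neither does. The function names are illustrative.

```python
import math

def haldane_theta(d):
    """Recombination probability between two loci d Morgans apart (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def half_sib_ibd_same(d):
    """P(X(s) = X(t)) for half sibs: both meioses recombine between the loci, or neither does."""
    theta = haldane_theta(d)
    return theta ** 2 + (1.0 - theta) ** 2

def half_sib_ibd_corr(d):
    """corr(X(s), X(t)) for a stationary process that is Bernoulli(1/2) at each locus."""
    return 2.0 * half_sib_ibd_same(d) - 1.0

for d in (0.01, 0.1, 0.5):
    # The last two columns agree: under these assumptions the correlation is exp(-4 d).
    print(d, haldane_theta(d), half_sib_ibd_corr(d), math.exp(-4.0 * d))
```

Under these assumptions the correlation of the half-sib IBD process decays as exp(-4|t-s|), with distance measured in Morgans.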

9 The human genome consists of over $3 \times 10^9$ base pairs (in two copies) on 23 chromosomes. The average length of a chromosome is 140 cM. The total length of the female (autosomal) genome is 4296 cM; the total length of the male genome is 2851 cM. That is, there is 1 expected crossover per 105 Mb in males and per 88 Mb in females. Thus, on the human genome, 1 cM approximately equals 1 Mb.
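The arithmetic behind these figures can be sketched as follows, assuming a genome of 3000 Mb and using the fact that one Morgan (100 cM) is the length over which one crossover is expected per meiosis. The constant and function names are illustrative.

```python
GENOME_MB = 3000.0   # about 3 x 10^9 base pairs, as quoted on the slide
MALE_CM = 2851.0     # total male genetic length from the slide
FEMALE_CM = 4296.0   # total female (autosomal) genetic length from the slide

def mb_per_crossover(total_cm, genome_mb=GENOME_MB):
    """Physical length (Mb) that carries one expected crossover, i.e. Mb per Morgan."""
    return genome_mb / (total_cm / 100.0)

def mb_per_cm(total_cm, genome_mb=GENOME_MB):
    """Average physical length (Mb) of one centimorgan."""
    return genome_mb / total_cm

print(round(mb_per_crossover(MALE_CM), 1), round(mb_per_crossover(FEMALE_CM), 1))
print(round(mb_per_cm(MALE_CM), 2), round(mb_per_cm(FEMALE_CM), 2))
```

With these inputs the male figure of about 105 Mb is reproduced. The same calculation with 4296 cM gives roughly 70 Mb for females, so the 88 Mb quoted on the slide may rest on a different length estimate. In both cases 1 cM comes out at roughly 1 Mb.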

10 Data From n sib-pairs we observe a sequence of i.i.d. phenotype pairs, with continuous marginal distribution, and a sequence of i.i.d. IBD processes.

11 [Figure: two panels, "IBD 1 at t" and "IBD 0 at t".]

12 Haseman-Elston In 1972, Haseman and Elston suggested testing whether there is a linear regression with negative slope between the squared difference of the two phenotypes and the IBD status at the marker. Soon, this became the standard tool for mapping QTLs in human genetics.
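A minimal simulation sketch of the Haseman-Elston idea, under assumptions that are not from the slides: full sibs with IBD status 0, 1, 2 occurring with probabilities 1/4, 1/2, 1/4, standard normal phenotypes, and illustrative correlation values per IBD class. The regression is of the squared phenotype difference on IBD status.

```python
import random
import statistics

def simulate_sib_pairs(n, rho_by_ibd, rng):
    """Draw (ibd, y1, y2) with y1, y2 standard normal and correlation rho_by_ibd[ibd]."""
    data = []
    for _ in range(n):
        # IBD status of full sibs at a locus: 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
        ibd = rng.choices([0, 1, 2], weights=[1, 2, 1])[0]
        rho = rho_by_ibd[ibd]
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append((ibd, z1, rho * z1 + (1 - rho ** 2) ** 0.5 * z2))
    return data

def haseman_elston_slope(data):
    """Least-squares slope of the squared trait difference on IBD status."""
    x = [ibd for ibd, _, _ in data]
    d2 = [(y1 - y2) ** 2 for _, y1, y2 in data]
    mx, md = statistics.fmean(x), statistics.fmean(d2)
    sxy = sum((a - mx) * (b - md) for a, b in zip(x, d2))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
# Hypothetical correlations: dependence grows with IBD at a linked marker only.
linked = simulate_sib_pairs(5000, {0: 0.1, 1: 0.3, 2: 0.5}, rng)
unlinked = simulate_sib_pairs(5000, {0: 0.3, 1: 0.3, 2: 0.3}, rng)
print(haseman_elston_slope(linked), haseman_elston_slope(unlinked))
```

Since E[(Y1 - Y2)^2] = 2(1 - rho) for standardized phenotypes, a correlation that increases with IBD status shows up as a negative slope, while the slope is near zero at an unlinked marker.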

13 Variance Components Model The variance components model (Fulker and Cherny) essentially assumes that the joint distribution of the phenotypes is bivariate normal, conditionally on the IBD status x, with the same marginal distributions and a correlation that depends on x [formula not preserved in the transcript].

14 Linkage Analysis The main question: does higher IBD status mean stronger dependence between the two trait values? In the variance components model this translates into a test of $H_0$ against $H_A$ on the correlation parameters [hypotheses not preserved in the transcript].

15 Test statistic The statistical test is based on the log-likelihood ratio statistic or, equivalently, on the efficient score statistic.

16 Here the statistic involves the score function and the appropriate entry of the Fisher information matrix; the latter needs to be estimated in practice.

17 [Figure: the statistic Z(t) plotted against position t, with its maximum marked.]

18 Significance in genome-wide scans If we have more than one marker, we need to deal with the issue of multiple testing. The solution to this problem depends on the intermarker spacings and the sample size. One could use permutation tests or other simulation-based methods to obtain p-values. If the sample size is large, one can apply a nice asymptotic theory that determines significance thresholds from the analysis of extremes of certain Gaussian processes (see Lander and Botstein; Siegmund et al.).

19 For an illustration, we assume that the markers are “dense”, that is, IBD status is measured continuously along the genome. It turns out that under our assumptions and the null hypothesis one can show that the process $Z(t)$ converges in distribution, as $n \to \infty$, to an Ornstein-Uhlenbeck process with mean zero and an exponentially decaying covariance function over each chromosome.

20 Now, approximate thresholds for a given significance level can be obtained by studying extremes of the Ornstein-Uhlenbeck process (cf. Leadbetter et al.) over a finite interval. This gives an approximation to the probability that the maximum exceeds a threshold b [formula not preserved in the transcript]. For 23 human chromosomes with an average length of 140 cM and significance level 0.05, we get the threshold b = 4.08 (3.62 on the LOD scale).
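The threshold calculation can be sketched as follows. This is not the slide's own formula, which is lost; it assumes a commonly used one-sided approximation for the maximum of an Ornstein-Uhlenbeck process with covariance exp(-beta|t|), namely an expected number of exceedance clumps mu(b) = C(1 - Phi(b)) + beta L b phi(b) over C chromosomes of total length L Morgans, with genome-wide p-value 1 - exp(-mu(b)). The rate beta = 4 per Morgan is also an assumption (it corresponds to an IBD correlation of exp(-4|t-s|) under the Haldane model), as is the conversion LOD = b^2 / (2 ln 10).

```python
import math

def normal_pdf(b):
    return math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)

def normal_sf(b):
    """Upper tail probability 1 - Phi(b) of the standard normal distribution."""
    return 0.5 * math.erfc(b / math.sqrt(2.0))

def genomewide_p(b, n_chrom=23, total_morgans=23 * 1.4, beta=4.0):
    """Approximate P(max Z(t) > b) over the genome under the assumptions stated above."""
    mu = n_chrom * normal_sf(b) + beta * total_morgans * b * normal_pdf(b)
    return 1.0 - math.exp(-mu)

def threshold(alpha=0.05, lo=2.0, hi=8.0):
    """Solve genomewide_p(b) = alpha by bisection; the p-value decreases in b on [lo, hi]."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if genomewide_p(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b = threshold()
lod = b * b / (2.0 * math.log(10.0))
print(round(b, 2), round(lod, 2))
```

Under these assumptions the computed values land close to the b = 4.08 and LOD 3.62 quoted on the slide.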

21 Other models The asymptotic theory does not change for other, more realistic models of the recombination process (e.g. the Kosambi model or the chi-squared model), since the asymptotic results for extremes of Gaussian processes depend only on the local behavior of the autocorrelation function of the process. Indeed, for all of these models it holds that $\operatorname{corr}(X_s, X_t) \sim 1 - r|t-s|$ as $|t-s|$ converges to 0, so in the limit we obtain a Gaussian process with the same behavior of autocorrelations.

22 Disadvantages The normality assumption is frequently questionable. Correlation can be a very bad measure of dependence if this assumption does not hold. Risch and Zhang (1995) show that "The majority of such pairs provide little power to detect linkage; only pairs that are concordant for high values, low values, or extremely discordant pairs (for example, one in the top 10 percent and other in the bottom 10 percent of the distribution) provide substantial power".

23 Copula The copula of a random pair $(Y_1, Y_2)$ is the distribution function $C$ of the random vector $(F_1(Y_1), F_2(Y_2))$, where we assume that the marginal distributions $F_1$ and $F_2$ of $Y_1$ and $Y_2$ are invertible. Hence the marginal distributions of the copula are both uniform on [0,1]. It is well known that the distribution of a random pair splits into two marginal distributions and the copula. Also, the copula is invariant under continuous increasing transformations.

24 It is straightforward to check that $P(Y_1 \le y_1, Y_2 \le y_2) = C(F_1(y_1), F_2(y_2))$, i.e. the distribution of a random pair splits into two marginal distributions and the copula. The copula is invariant under monotone transformations, that is, $(Y_1, Y_2)$ and $(h(Y_1), h(Y_2))$ have the same copula for any increasing function $h$.
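The invariance can be checked numerically on a sample. The sketch below uses rank-based pseudo-observations (rank divided by N + 1) as a stand-in for $F_1(Y_1)$ and $F_2(Y_2)$; the sample, the seed, and the choice h = exp are illustrative assumptions.

```python
import math
import random

def pseudo_observations(values):
    """Map each value to rank / (N + 1), an empirical version of F(Y) on (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    u = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (len(values) + 1)
    return u

rng = random.Random(1)
y1 = [rng.gauss(0, 1) for _ in range(200)]
y2 = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in y1]

h = math.exp  # any strictly increasing transformation would do
u_before = (pseudo_observations(y1), pseudo_observations(y2))
u_after = (pseudo_observations([h(v) for v in y1]),
           pseudo_observations([h(v) for v in y2]))

# The empirical copula sample is unchanged because h preserves the ordering.
print(u_before == u_after)
```

Any statistic computed from the pseudo-observations alone therefore does not depend on the scale on which the phenotypes were measured.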

25 Basic Examples

26 Linkage analysis rephrased The main question, "Does higher IBD status mean stronger dependence between the two trait values?", could be rephrased as "Does higher IBD status mean that the two trait values have a “more diagonalized” copula?" Note: marginal distributions do not change with IBD status.

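27 Normal Copula The normal copula is the copula of a normally distributed random vector. Thus, if $(Y_1, Y_2)$ is bivariate normal with correlation $\rho$, then the random vector $(F_1(Y_1), F_2(Y_2))$ has the bivariate normal copula. Since it depends only on $\rho$, we denote it by $C_\rho$.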

28 Bivariate Normal Copula

29 New Model Assume that the pair $(Y_1, Y_2)$ has the same copula as in the variance components model, i.e. the bivariate normal copula, conditionally on the IBD status x, and the same (but arbitrary) continuous marginal distributions, i.e. $F_1 = F_2$.

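30 The model is not so new after all: equivalently, there is an increasing function $h$ such that $(h(Y_1), h(Y_2))$ satisfies the assumption of the v.c. model. Suppose that $h(Y_1)$ has the standard normal distribution function $\Phi$; then $F_1(y) = P(h(Y_1) \le h(y)) = \Phi(h(y))$. That is, $h = \Phi^{-1} \circ F_1$.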
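A small numerical sketch of this equivalence, under illustrative assumptions: the common marginal is taken to be standard exponential and the correlation is fixed at 0.5. Pairs are generated from a bivariate normal vector, pushed to exponential marginals, and then mapped back with the transformation that composes the marginal distribution function with the standard normal quantile function.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def exp_cdf(y):
    """Distribution function F_1 of the (assumed) standard exponential marginal."""
    return 1.0 - math.exp(-y)

def exp_quantile(u):
    return -math.log(1.0 - u)

def sample_pair(rho, rng):
    """Return the latent normal pair and the observed pair with a normal copula."""
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0, 1)
    y1 = exp_quantile(STD_NORMAL.cdf(z1))
    y2 = exp_quantile(STD_NORMAL.cdf(z2))
    return (z1, z2), (y1, y2)

def h(y):
    """Normalizing transformation: standard normal quantile of F_1(y)."""
    return STD_NORMAL.inv_cdf(exp_cdf(y))

rng = random.Random(2)
pairs = [sample_pair(0.5, rng) for _ in range(1000)]
# Largest discrepancy between h(Y_1) and the latent normal variable it came from.
max_err = max(abs(h(y1) - z1) for (z1, _), (y1, _) in pairs)
print(max_err)
```

The transformed phenotypes coincide with the latent bivariate normal pair up to rounding, so the variance components assumptions hold on the transformed scale.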

31 We can proceed in two ways: a) we could guess (estimate) $h$, or b) we could guess (estimate) $F_1$. The first method is already frequently applied in practice, while the second one is easier to justify using the empirical distribution function of the phenotypes. To estimate $F_1$ we may use data from a larger sample if available.

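32 Transformation In practice we might have only the 2n phenotypes from our n sib-pairs to estimate the marginal distribution. So we could use the empirical distribution function of these phenotypes. The transformed phenotypes are then the corresponding normal scores.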

33 One can then show a limit theorem for the resulting test statistic as $n \to \infty$ (the precise condition and statement are not preserved in this transcript). Observe that we essentially use the van der Waerden normal scores rank correlation coefficient to measure dependence between the traits. Klaassen and Wellner (1997) showed that this is an asymptotically efficient estimator of the correlation parameter in the bivariate normal copula model.
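A minimal sketch of the van der Waerden normal scores correlation in its classical form, where each coordinate is ranked separately and ranks are mapped to standard normal quantiles of rank / (n + 1). Pooling all phenotypes before ranking, as the transformation slide describes, would be a variation on this. The simulated data (lognormal marginals, latent correlation 0.6) are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by the standard normal quantile of rank / (n + 1)."""
    n = len(values)
    std = NormalDist()
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std.inv_cdf(rank / (n + 1))
    return scores

def van_der_waerden_corr(y1, y2):
    """Normal scores rank correlation; both score vectors share the same sum of squares."""
    s1, s2 = normal_scores(y1), normal_scores(y2)
    return sum(a * b for a, b in zip(s1, s2)) / sum(a * a for a in s1)

rng = random.Random(3)
z1 = [rng.gauss(0, 1) for _ in range(2000)]
z2 = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in z1]
y1 = [math.exp(v) for v in z1]  # lognormal marginals, same normal copula
y2 = [math.exp(v) for v in z2]

r_latent = van_der_waerden_corr(z1, z2)
r_observed = van_der_waerden_corr(y1, y2)
print(r_latent, r_observed)
```

Both calls return the same value, since the coefficient depends on the data only through ranks, and that value should be near the latent correlation of 0.6 used in the simulation.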

34 Hence, it is also an efficient estimator of the maximum correlation coefficient. For a pair of random variables $Y_1$ and $Y_2$, the maximum correlation coefficient is defined as $\sup_{a,b} \operatorname{corr}(a(Y_1), b(Y_2))$, where the supremum is taken over all real transformations $a$ and $b$ such that $a(Y_1)$ and $b(Y_2)$ have finite nonzero variance.

35 Simulation study

36 Application - Lp(a) Twin data on lipoprotein levels, collected in 4 populations in three countries (Australia, the Netherlands, Sweden). Analysis was performed using the variance components method and published by Beekman et al. (2003).

37 Ad hoc transformation

38 Lp(a) - chromosome 1

39 Lp(a) - chromosome 6

40 Discussion The normal copula based method has correct critical levels under the null hypothesis for any marginal distribution. Its power seems to be close to optimal. The method easily extends to general pedigrees, discrete data, multiple QTLs, etc. It is straightforward to implement in any existing software. Other families of copulas (Clayton, Gumbel, etc.) could be more suitable in certain applications.

41 Discrete data In biomedical applications, phenotypes are frequently measured on some ordinal scale; that is, they take values in a finite ordered set indexed by some natural number $l$. If we want to detect whether higher IBD status translates into more similar phenotypic values, we may apply nonparametric methods, or discretize some parametric family of copulas and test whether the parameters change with IBD status.

42 Discrete data

43 Acknowledgments C. Klaassen (UvA, Eurandom) D. Boomsma (VUA) M. Beekman (LUMC) N. Martin (Australia)
