Presentation is loading. Please wait.

Presentation is loading. Please wait.

Association Analysis, Logistic Regression, R and S-PLUS Richard Mott

Similar presentations


Presentation on theme: "Association Analysis, Logistic Regression, R and S-PLUS Richard Mott"— Presentation transcript:

1 Association Analysis, Logistic Regression, R and S-PLUS Richard Mott http://bioinformatics.well.ox.ac.uk/lectures/

2 Logistic Regression in Statistical Genetics Applicable to Association Studies Data: –Binary outcomes (eg disease status) –Dependent on genotypes [+ sex, environment] Aim is to identify which factors influence the outcome Rigorous tests of statistical significance Flexible modelling language Generalisation of Chi-Squared Test

3 What is R ? Statistical analysis package Free Similar to commercial package S-PLUS Runs on Unix, Windows, Mac www.r-project.org Many packages for statistical genetics, microarray analysis available in R Easily Programmable

4 Modelling in R Data for individual labelled i=1…n: –Response y i –Genotypes g ij for markers j=1..m

5 Coding Unphased Genotypes Several possibilities: –AA, AG, GG original genotypes –12, 21, 22 –1, 2, 3 –0, 1, 2 # of G alleles Missing Data –NA default in R

6 Using R Load genetic logistic regression tools > source(‘logistic.R’) Read data table from file –> t <- read.table(‘geno.dat’, header=TRUE) Column names –names(t) –t$y response (0,1) –t$m1, t$m2, …. Genotypes for each marker

7 Contigency Tables in R ftable(t$y,t$m31) prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 >

8 Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

9 The Logistic Model Prob(Y i =0) = exp(  i  exp(  i ))  i =  j x ij b j - Linear Predictor x ij – Design Matrix (genotypes etc) b j – Model Parameters (to be estimated) Model is investigated by –estimating the b j ’s by maximum likelihood –testing if the estimates are different from 0

10 The Logistic Function Prob(Y i =0) = exp(  i  exp(  i ))  Prob(Y=0)

11 Types of genetic effect at a single locus AAAGGG Recessive001 Dominant110 Additive012 Genotype0 

12 Additive Genotype Model Code genotypes as –AA x=0, –AG x=1, –GG x=2 Linear Predictor –  = b 0 + xb 1 P(Y=0|x) = exp(b 0 + xb 1 )/(1+exp(b 0 + xb 1 )) P AA = P(Y=0|x=0) = exp(b 0 )/(1+exp(b 0 )) P AG = P(Y=0|x=1) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 )) P GG = P(Y=0|x=2) = exp(b 0 + 2b 1 )/(1+exp(b 0 + 2b 1 ))

13 Additive Model: b 0 = -2 b 1 = 2 P AA = 0.12 P AG = 0.50 P GG = 0.88  Prob(Y=0)

14 Additive Model: b 0 = 0 b 1 = 2 P AA = 0.50 P AG = 0.88 P GG = 0.98  Prob(Y=0)

15 Recessive Model Code genotypes as –AA x=0, –AG x=0, –GG x=1 Linear Predictor –  = b 0 + xb 1 P(Y=0|x) = exp(b 0 + xb 1 )/(1+exp(b 0 + xb 1 )) P AA = P AG = P(Y=0|x=0) = exp(b 0 )/(1+exp(b 0 )) P GG = P(Y=0|x=1) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 ))

16 Recessive Model: b 0 = 0 b 1 = 2 P AA = P AG = 0.50 P GG = 0.88  Prob(Y=0)

17 Genotype Model Each genotype has an independent probability Code genotypes as (for example) –AA x=0, y=0 –AG x=1, y=0 –GG x=0, y=1 Linear Predictor –  = b 0 + xb 1 +yb 2 two parameters P(Y=0|x) = exp(b 0 + xb 1 +yb 2 )/(1+exp(b 0 + xb 1 +yb 2 )) P AA = P(Y=0|x=0,y=0) = exp(b 0 )/(1+exp(b 0 )) P AG = P(Y=0|x=1,y=0) = exp(b 0 + b 1 )/(1+exp(b 0 + b 1 )) P GG = P(Y=0|x=0,y=1) = exp(b 0 + b 2 )/(1+exp(b 0 + b 2 ))

18 Genotype Model: b 0 = 0 b 1 = 2 b 2 = -1 P AA = 0.5 P AG = 0.88 P GG = 0.27  Prob(Y=0)

19 Models in R response y genotype g AAAGGGmodelDF Recessive001y ~ dominant(g)1 Dominant011y ~ recessive(g)1 Additive012y ~ additive(g)1 Genotype0  y ~ genotype(g)2

20 Data Transformation g <- t$m1 use these functions to treat a genotype vector in a certain way: –a <- additive(g) –r <- recessive(g) –d <- dominant(g) –g <- genotype(g)

21 Fitting the Model afit <- glm( t$y ~ additive(g),family=‘binomial’) rfit <- glm( t$y ~ recessive(g),family=‘binomial’) dfit <- glm( t$y ~ dominant(g),family=‘binomial’) gfit <- glm( t$y ~ genotype(g),family=‘binomial’) Equivalent models: –genotype = dominant + recessive –genotype = additive + recessive –genotype = additive + dominant –genotype ~ standard chi-squared test of genotype association

22 Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b 0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b 1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733. b 2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

23 Analysis of Deviance Chi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial')) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76

24 Model Comparison Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 >

25 Scanning all Markers > logscan(t,model=‘additive’) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 …

26 Multilocus Models Can test the effects of fitting two or more markers simultaneously Several multilocus models are possible Interaction Model assumes that each combination of genotypes has a different effect eg t$y ~ t$m10 * t$m15

27 Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial') > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,2,lower.tail=F) calculate p-value [1] 0.04904584

28 Adding the effects of Sex and other Covariates Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age Fit models of the form fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

29 Adding the effects of Sex and other Covariates Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age anova(fit1,fit2)

30 Multiple Testing Take care interpreting significance levels when performing multiple tests Linkage disequilibrium can reduce the effective number of independent tests Permutation is a safe procedure to determine significance Repeat j=1..N times: –Permute disease status y between individuals –Fit all markers –Record maximum deviance maxdev[j] over all markers Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker –logscan(t,permute=1000) slow!

31 Haplotype Association –Different from multiple genotype models –Phase taken into account –Haplotype association can be modelled in a similar logistic framework Treat haplotypes as extended alleles Fit additive, recessive, dominant & genotype models as before –Eg haplotypes are h = AAGCAT, ATGCTT, etc –y ~ additive(h) –y ~ dominant(h) etc


Download ppt "Association Analysis, Logistic Regression, R and S-PLUS Richard Mott"

Similar presentations


Ads by Google