Presentation is loading. Please wait.

Presentation is loading. Please wait.

Association Analysis, Logistic Regression, R and S-PLUS

Similar presentations


Presentation on theme: "Association Analysis, Logistic Regression, R and S-PLUS"— Presentation transcript:

1 Association Analysis, Logistic Regression, R and S-PLUS
Richard Mott

2 Logistic Regression in Statistical Genetics
Applicable to Association Studies Data: Binary outcomes (eg disease status) Dependent on genotypes [+ sex, environment] Aim is to identify which factors influence the outcome Rigorous tests of statistical significance Flexible modelling language Generalisation of Chi-Squared Test

3 What is R ? Statistical analysis package Free
Similar to commercial package S-PLUS Runs on Unix, Windows, Mac Many packages for statistical genetics, microarray analysis available in R Easily Programmable

4 Modelling in R Data for individual labelled i=1…n: Response yi
Genotypes gij for markers j=1..m

5 Coding Unphased Genotypes
Several possibilities: AA, AG, GG original genotypes 12, 21, 22 1, 2, 3 0, 1, 2 # of G alleles Missing Data NA default in R

6 Using R Load genetic logistic regression tools
> source(‘logistic.R’) Read data table from file > t <- read.table(‘geno.dat’, header=TRUE) Column names names(t) t$y response (0,1) t$m1, t$m2, …. Genotypes for each marker

7 Contigency Tables in R ftable(t$y,t$m31) prints the contingency table
>

8 Chi-Squared Test in R > chisq.test(t$y,t$m31)
Pearson's Chi-squared test data: t$y and t$m31 X-squared = , df = 2, p-value = Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

9 The Logistic Model Prob(Yi=0) = exp(hi)/(1+exp(hi))
hi = Sj xij bj - Linear Predictor xij – Design Matrix (genotypes etc) bj – Model Parameters (to be estimated) Model is investigated by estimating the bj’s by maximum likelihood testing if the estimates are different from 0

10 The Logistic Function Prob(Yi=0) = exp(hi)/(1+exp(hi))

11 Types of genetic effect at a single locus
AA AG GG Recessive 1 Dominant Additive 2 Genotype a b

12 Additive Genotype Model
Code genotypes as AA x=0, AG x=1, GG x=2 Linear Predictor h = b0 + xb1 P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

13 Additive Model: b0 = -2 b1 = 2 PAA = 0.12 PAG = 0.50 PGG = 0.88
Prob(Y=0) h

14 Additive Model: b0 = 0 b1 = 2 PAA = 0.50 PAG = 0.88 PGG = 0.98
Prob(Y=0) h

15 Recessive Model Code genotypes as Linear Predictor
AA x=0, AG x=0, GG x=1 Linear Predictor h = b0 + xb1 P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

16 Recessive Model: b0 = 0 b1 = 2 PAA = PAG = 0.50 PGG = 0.88
Prob(Y=0) h

17 Genotype Model Each genotype has an independent probability
Code genotypes as (for example) AA x=0, y=0 AG x=1, y=0 GG x=0, y=1 Linear Predictor h = b0 + xb1+yb2 two parameters P(Y=0|x) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2)) PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0)) PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1)) PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))

18 Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27
Prob(Y=0) h

19 Models in R response y genotype g
AA AG GG model DF Recessive 1 y ~ dominant(g) Dominant y ~ recessive(g) Additive 2 y ~ additive(g) Genotype a b y ~ genotype(g)

20 Data Transformation g <- t$m1
use these functions to treat a genotype vector in a certain way: a <- additive(g) r <- recessive(g) d <- dominant(g) g <- genotype(g)

21 Fitting the Model Equivalent models: genotype = dominant + recessive
afit <- glm( t$y ~ additive(g),family=‘binomial’) rfit <- glm( t$y ~ recessive(g),family=‘binomial’) dfit <- glm( t$y ~ dominant(g),family=‘binomial’) gfit <- glm( t$y ~ genotype(g),family=‘binomial’) Equivalent models: genotype = dominant + recessive genotype = additive + recessive genotype = additive + dominant genotype ~ standard chi-squared test of genotype association

22 Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) <2e-16 *** b1 genotype(t$m31) b2 genotype(t$m31) --- Signif. codes: 0 `***' `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

23 Analysis of Deviance Chi-Squared Test
> anova(glm( t$y ~ genotype(t$m31), family='binomial')) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL genotype(t$m31)

24 Model Comparison Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance >

25 Scanning all Markers > logscan(t,model=‘additive’)
Deviance DF Pval LogPval m e e m e e m e e m e e m e e m e e m e e m e e m e e m e e m e e

26 Multilocus Models Can test the effects of fitting two or more markers simultaneously Several multilocus models are possible Interaction Model assumes that each combination of genotypes has a different effect eg t$y ~ t$m10 * t$m15

27 Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial') > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL genotype(t$m13) genotype(t$m26) genotype(t$m13):genotype(t$m26) > pchisq(6.03,2,lower.tail=F) calculate p-value [1]

28 Adding the effects of Sex and other Covariates
Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age Fit models of the form fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

29 Adding the effects of Sex and other Covariates
Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age anova(fit1,fit2)

30 Multiple Testing Take care interpreting significance levels when performing multiple tests Linkage disequilibrium can reduce the effective number of independent tests Permutation is a safe procedure to determine significance Repeat j=1..N times: Permute disease status y between individuals Fit all markers Record maximum deviance maxdev[j] over all markers Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker logscan(t,permute=1000) slow!

31 Haplotype Association
Different from multiple genotype models Phase taken into account Haplotype association can be modelled in a similar logistic framework Treat haplotypes as extended alleles Fit additive, recessive, dominant & genotype models as before Eg haplotypes are h = AAGCAT, ATGCTT, etc y ~ additive(h) y ~ dominant(h) etc


Download ppt "Association Analysis, Logistic Regression, R and S-PLUS"

Similar presentations


Ads by Google