Unit 3, Session 1 Statistical Models.


1 Unit 3, Session 1 Statistical Models

2 Outline
Logistic regression
ROC curve and AUC
Linear regression
Kaplan-Meier plot and log-rank test
Cox proportional hazards model

3 Logistic Model
The logistic model is used for case/control studies.
Usage scenario: the response is binary, for example disease/healthy or recurrence/non-recurrence.
$$\log\frac{p(\text{status}=\text{recurrence})}{1-p(\text{status}=\text{recurrence})} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
where $x_i$ are the predictors and $\beta_i$ are the parameters of interest.
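To make the link between the log-odds scale and the probability scale concrete, here is a minimal R sketch. The coefficient values and the expression value are made up purely for illustration; `plogis` and `qlogis` are the base R inverse-logit and logit functions.

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1.0   # intercept
beta1 <- 0.8    # effect of one predictor on the log-odds
x1 <- 2.5       # hypothetical expression value

log_odds <- beta0 + beta1 * x1   # linear predictor on the log-odds scale
p_recur <- plogis(log_odds)      # inverse logit: 1 / (1 + exp(-log_odds))
odds_ratio <- exp(beta1)         # multiplicative change in the odds per unit of x1
```

Applying `qlogis(p_recur)` returns the linear predictor again, which is why each coefficient is read as a change in log-odds.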

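4 Linear Model
Response: continuous, for example weight or gene expression.
Predictors: any variables (for example gene expression).
Model: $$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$$
Assumptions: the error terms are independent and identically distributed, $\epsilon \overset{\text{iid}}{\sim} N(0, \sigma^2)$.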
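As a rough check of what the model and its error assumption mean in practice, the sketch below simulates data from a one-predictor linear model and recovers the parameters with `lm`. The true values (intercept 2, slope 1.5, error standard deviation 0.5) are assumptions chosen for the illustration, not values from the slides.

```r
set.seed(1)
n_sim <- 200
x1 <- rnorm(n_sim)
eps <- rnorm(n_sim, mean = 0, sd = 0.5)   # iid N(0, sigma^2) errors with sigma = 0.5
y <- 2 + 1.5 * x1 + eps                   # hypothetical beta0 = 2, beta1 = 1.5

sim_fit <- lm(y ~ x1)
est <- coef(sim_fit)                      # estimates of beta0 and beta1
resid_sd <- summary(sim_fit)$sigma        # estimate of sigma
```

With 200 observations the estimates should land close to the values used in the simulation.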

5 Survival Methods
Kaplan-Meier plot: a visual comparison of the survival curves between groups.
Cox proportional hazards model and log-rank test: formal statistical tests.
Response: survival time (for example DFS) and the censor indicator.
Predictors: any variables (for example group or specific genes).
Coding: recurrence is censor = 1 and non-recurrence is censor = 0.
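The Kaplan-Meier curve is the product-limit estimate: at each recurrence time the survival probability is multiplied by one minus the fraction of at-risk subjects who recur. The sketch below computes it by hand in base R on made-up follow-up times, using the same coding as above (censor = 1 for recurrence). It is only meant to show the arithmetic; it is not a replacement for `survfit`.

```r
# Hypothetical follow-up times (months) and censor indicator (1 = recurrence, 0 = censored)
surv_time <- c(2, 3, 3, 5, 7, 8, 10, 12)
censor    <- c(1, 1, 0, 1, 0, 1, 0, 0)

event_times <- sort(unique(surv_time[censor == 1]))
n_at_risk <- sapply(event_times, function(t) sum(surv_time >= t))
n_events  <- sapply(event_times, function(t) sum(surv_time == t & censor == 1))
km_surv   <- cumprod(1 - n_events / n_at_risk)   # product-limit estimate

km_table <- data.frame(time = event_times, n_at_risk, n_events, survival = km_surv)
print(km_table)
```

Censored subjects never cause a drop in the curve; they only shrink the at-risk count at later recurrence times.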

6 Load data Toy example data
toy_data<- read.csv("toy_example_data.csv")

7 Logistic Model
Response: recurrence/non-recurrence status
Predictor: the expression of gene HOXB13
library(ROCR)  # provides prediction() and performance()
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
fit.logistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = "logit"))
summary(fit.logistic)
# plot the ROC curve
p <- predict(fit.logistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the AUC
auc <- performance(pr, measure = "auc")
# the slide text is cut off after "auc <-"; reading the value from the
# y.values slot is the usual ROCR idiom and is assumed here
auc <- auc@y.values[[1]]
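The AUC is the probability that a randomly chosen recurrence case gets a higher predicted score than a randomly chosen non-recurrence case. The sketch below computes it by hand through the rank-sum (Mann-Whitney) identity, without ROCR. The function name `auc_rank`, the scores and the labels are all hypothetical and only illustrate the definition.

```r
# AUC via the rank-sum identity; label must be coded 1 = positive, 0 = negative
auc_rank <- function(score, label) {
  r <- rank(score)                 # mid-ranks, so ties count as one half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true status
score_toy <- c(0.2, 0.4, 0.35, 0.8, 0.7, 0.1)
label_toy <- c(0,   0,   1,    1,   1,   0)
auc_toy <- auc_rank(score_toy, label_toy)
```

Here 8 of the 9 possible (positive, negative) pairs are ordered correctly, so the function returns 8/9.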

8 Logistic Regression Result

9 ROC Curve

10 Linear Model
Response: expression of HOXB13
Predictor: expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fit.lm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fit.lm)

11 Linear Regression Result

12 Kaplan-Meier Plot
We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (for example the high and low ratio groups).
library(survival)  # provides Surv() and survfit()
ratio.surv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratio.surv, pVal = TRUE, pX = 0.25, pY = 0.25, title = "Kaplan-Meier plot of toy example", yLab = "Survival Probability")

13 Kaplan-Meier Plot

14 Cox Proportional Hazards Model
We use the high/low ratio group to predict the survival probability. Here the response is the survival time together with the censor information.
fit.cox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fit.cox)
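Since the toy data file is not reproduced here, the sketch below runs the same two formal tests on the `aml` example dataset that ships with the survival package (an assumption made only so the code is self-contained; its variables are `time`, `status` and the two-level group `x`). It shows how a log-rank p-value and a hazard ratio with its confidence interval can be read off.

```r
library(survival)  # Surv(), survdiff(), coxph() and the aml example data

# Log-rank test comparing the two maintenance groups in aml
lr <- survdiff(Surv(time, status) ~ x, data = aml)
lr_p <- 1 - pchisq(lr$chisq, df = 1)   # two groups, so 1 degree of freedom

# Cox proportional hazards model with the same grouping variable
fit_aml <- coxph(Surv(time, status) ~ x, data = aml)
hr <- exp(coef(fit_aml))               # hazard ratio, Nonmaintained vs Maintained
hr_ci <- exp(confint(fit_aml))         # 95% confidence interval for the hazard ratio
```

A hazard ratio above 1 means the second group has the higher hazard at any given time, under the proportional hazards assumption.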

15 Cox Model Result

16 Data Downloading, Processing and Analysis

17 Outline
Download data
Parsing data
Normalization
Variance based filtering (top 25%)
T test based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script.

18 Data Availability
Microdissected dataset: GSE1378
Whole tissue dataset: GSE1379
The easiest way to download the data is the "getGEO" function from the "GEOquery" package.

19 Use "getGEO" to Download Data
We have already downloaded the data; the "getGEO" function can load it locally or fetch it online.
Local (loading_method = 'local'):
geo_Name <- 'GSE1378'
geodata2 <- getGEO(filename = paste0('geo_data/', geo_Name, '_series_matrix.txt.gz'), GSEMatrix = TRUE)
Online (loading_method = 'online'):
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose how the data is loaded.
Note that the downloaded geno matrix is in log2 scale.

20 Parsing Data
Extract the geno matrix, pheno table and feature table:
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), 'data.frame')
Parse the phenotype table to get the variables Age, Size, DFS and censor:
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
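The parsing step relies on each phenotype entry having a "name=value" form, so that splitting on "=" and keeping every second element recovers the values. The sketch below shows this on made-up strings (the vectors `raw_age` and `raw_status` are hypothetical stand-ins for the phenotype columns), next to an equivalent `sub` call that strips everything up to the "=" sign.

```r
# Hypothetical characteristic strings in "name=value" form
raw_age <- c("Age=45", "Age=61", "Age=52")
raw_status <- c("Status=recur", "Status=non-recur", "Status=recur")
n_toy <- length(raw_age)

# Split on "=" and keep every second element (the values)
age_split <- as.numeric(unlist(strsplit(raw_age, split = "="))[seq(2, 2 * n_toy, 2)])

# Equivalent alternative: drop everything up to and including "="
age_sub <- as.numeric(sub("^[^=]*=", "", raw_age))

# Recurrence is coded censor = 1, non-recurrence censor = 0
censor_toy <- ifelse(raw_status == "Status=recur", 1, 0)
```

The `sub` version does not depend on knowing the number of samples in advance, which makes it a little harder to misalign.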

21 Normalization
Gene wise normalization (within each sample, subtract the median log2 value taken over the genes):
tmp_gm <- apply(geno, 2, median)  # one median per sample (column)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
Sample wise normalization (for each gene, divide by its mean over the samples in the original scale):
geno <- apply(geno, c(1, 2), function(x) { 2 ^ x })        # back to the original scale
geno <- t(apply(geno, 1, function(x) { x / (mean(x)) }))   # divide each gene (row) by its mean
geno <- apply(geno, c(1, 2), function(x) { log2(x) })      # return to the log2 scale
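The two normalization steps can also be written with `sweep`, which may be easier to read. The sketch below runs them on a small random matrix (5 hypothetical genes by 4 samples, rows are genes) and checks the first step against the matrix-product form used above. The names `geno_toy`, `step1` and `step2` are made up for the illustration.

```r
set.seed(3)
geno_toy <- matrix(rnorm(5 * 4, mean = 8), nrow = 5, ncol = 4)  # log2 scale, genes x samples

# Step 1: subtract each sample's (column's) median log2 value
col_med <- apply(geno_toy, 2, median)
step1 <- sweep(geno_toy, 2, col_med)

# Same step written as in the slide, with an outer product of ones and medians
step1_slide <- geno_toy - matrix(rep(1, 5), 5, 1) %*% matrix(col_med, 1, 4)

# Step 2: divide each gene (row) by its mean on the original scale, then return to log2
orig_scale <- 2^step1
step2 <- log2(sweep(orig_scale, 1, rowMeans(orig_scale), "/"))
```

After step 1 every column has median zero, and after step 2 every gene has mean one on the original scale.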

22 Variance Based Filtering
Calculate the variance for each gene and choose the top 25%:
# variance based filtering (75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx,]
geno_var_filtered <- geno[var_filtered_idx,]
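A small simulated example makes the effect of the 75th-percentile cutoff easy to see. The matrix below is random noise with 40 hypothetical genes and 12 samples, so exactly the 10 most variable genes should survive.

```r
set.seed(7)
geno_toy <- matrix(rnorm(40 * 12), nrow = 40, ncol = 12)   # 40 hypothetical genes x 12 samples

var_toy <- apply(geno_toy, 1, var)                  # per-gene variance across samples
keep <- var_toy > quantile(var_toy, 0.75)           # TRUE for the top 25% most variable genes
geno_kept <- geno_toy[keep, , drop = FALSE]         # drop = FALSE keeps a matrix even for one row
```

Every kept gene has a larger variance than every discarded gene, which is all the filter guarantees; it says nothing yet about association with recurrence.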

23 T test Based Filtering
For each gene, do a T test between the recurrence and non-recurrence groups. The status variable indicates the group information.
# inside a loop over genes, with i the index of the current gene
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
Filter the genes by the P-value cutoff:
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx,]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx,]
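The per-gene loop around the T test is not shown in full, so here is one way it might look on simulated data. Everything in the sketch is hypothetical: 20 genes, 30 samples, and a shift of 3 planted in gene 1 so that at least one gene should pass a 0.001 cutoff.

```r
set.seed(4)
n_genes <- 20
n_samples <- 30
expr_toy <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
status_toy <- factor(rep(c("recur", "non-recur"), each = n_samples / 2))

# Plant one true difference in gene 1
is_recur <- status_toy == "recur"
expr_toy[1, is_recur] <- expr_toy[1, is_recur] + 3

pvalue_toy <- numeric(n_genes)
for (i in seq_len(n_genes)) {
  sdata <- data.frame(gene_express = expr_toy[i, ], status = status_toy)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_toy[i] <- tmp_test$p.value
}
kept_genes <- which(pvalue_toy < 0.001)
```

With thousands of genes, a fixed P-value cutoff will still let some null genes through by chance, which is one reason to look at the overlap between independent datasets.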

24 Sample Results (GSE1378,microdissected, 0.0011 cutoff)

25 Sample Results (GSE1379, whole tissue dataset, cutoff 0.0011)

26 Statistical Modeling (examples)

27 Outline
Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
Heatmap and Dendrogram
Univariate logistic regression for selected genes and two-gene ratio predictor
Multivariate logistic regression (size and the other two potential predictors)
Survival analysis part 1: Kaplan-Meier plot
Survival analysis part 2: Cox proportional hazards model

28 Overlapped Genes
In the preprocessing step, we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379.
We use the genes that overlap between these two DEG tables for the subsequent analysis.
GSE1378: micro-dissected breast cancer cells (LCM)
GSE1379: whole tissue section
The overlapped genes are HOXB13 (identified twice as AI and BC007092), IL17BR (AF ) and AI (EST).
We will study the prognostic value of these markers.

29 Heatmap and Dendrogram
We use a heatmap and a dendrogram to visually check the relationship (correlation) among genes or samples.
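The plotting code for these figures is not part of the transcript. As a rough sketch of the idea, base R can cluster genes by correlation and draw both plots; the expression matrix below is random and the choice of correlation distance with average linkage is an assumption, not necessarily what produced the slides.

```r
set.seed(6)
expr_hm <- matrix(rnorm(6 * 10), nrow = 6, ncol = 10,
                  dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))

# Correlation-based distance between genes, then hierarchical clustering
gene_dist <- as.dist(1 - cor(t(expr_hm)))
gene_tree <- hclust(gene_dist, method = "average")

plot(as.dendrogram(gene_tree))                                     # dendrogram of the genes
heatmap(expr_hm, Rowv = as.dendrogram(gene_tree), scale = "row")   # heatmap ordered by that tree
```

Genes that end up on neighbouring branches have similar expression profiles across the samples.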

30 Heatmap (microdissected,GSE1378) consistent with the paper

31 Heatmap (whole section tissue, GSE 1379)

32 Model Set 1 Univariate logistic regression for each gene
Response variable: recur/non-recur status
Predictors: one of the overlapped genes, HOXB13 / IL17BR (AF ) / AI240933 (EST)

33 Model Set 2 Univariate logistic regression for ratio of genes
Response variable: recur/non-recur status
Predictors: HOXB13:IL17BR

34 Model Set 3 Multivariate logistic regression
Response variable: recur/non-recur status
Predictors: tumor size, HOXB13:IL17BR, PGR and ERBB2

35 Model Set 4 Survival model
Response variable: DFS (disease free survival time), censor
Predictor: the high/low ratio group, obtained by using "-intercept/beta" from the logistic regression as the cutoff that divides the samples into a high ratio group and a low ratio group
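The value -intercept/beta is the predictor value at which the fitted logistic model gives a recurrence probability of exactly 0.5, because the log-odds intercept + beta * x is zero there. The sketch below checks this on simulated data; the variable names and the simulation parameters are made up for the illustration.

```r
set.seed(5)
ratio_sim <- rnorm(100)                                    # hypothetical log2 gene ratio
status_sim <- rbinom(100, 1, plogis(1 + 1.5 * ratio_sim))  # hypothetical recurrence status

fit_sim <- glm(status_sim ~ ratio_sim, family = binomial(link = "logit"))
b <- coef(fit_sim)
cutoff <- -b[["(Intercept)"]] / b[["ratio_sim"]]           # point where the log-odds equal 0

# Predicted probability at the cutoff
p_at_cutoff <- unname(predict(fit_sim, newdata = data.frame(ratio_sim = cutoff),
                              type = "response"))

# Split the samples into the two groups used in the survival model
ratio_group <- ifelse(ratio_sim > cutoff, "high", "low")
```

When beta is positive, the high ratio group is the one with predicted recurrence probability above 0.5.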

36 Important Note
Please remember that there are two datasets, GSE1378 and GSE1379.
The same sets of models can be fitted on both datasets.
Set the working dataset variable to choose between them:
#working_dataset = "GSE1379" # whole tissue section, GSE1379
working_dataset = "GSE1378" # microdissected breast cancer cells, GSE1378
The following slides use working dataset GSE1378 as the example.

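37 Univariate Logistic Regression for Each Gene
As an example, we check the gene HOXB13.
library(ROCR)  # provides prediction() and performance()
gb_acc = "BC007092" # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc),]
logit_data = data.frame(status = infos_df$status, gene = geno_selected)
# use the data frame column "gene" so the formula is resolved inside logit_data
fit <- glm(status ~ gene, data = logit_data, family = binomial(link = 'logit'))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
# the slide text is cut off after "auc <- auc"; reading the value from the
# y.values slot is the usual ROCR idiom and is assumed here
auc <- auc@y.values[[1]]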

38 Sample Output (gene HOXB13 )

39 ROC (auc 0.796, gene HOXB13 )

40 Univariate Logistic Regression (HOXB13:IL17BR)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
# name the column gene_ratio so the formula below is resolved inside logit_data
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, gene_ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = 'logit'))
summary(fit)
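The code comment about the log2 scale rests on the identity log2(a/b) = log2(a) - log2(b). A quick numeric check with made-up expression values on the original scale:

```r
# Hypothetical original-scale expression values for three samples
hoxb13_raw <- c(8, 2, 16)
il17br_raw <- c(2, 4, 16)

log_ratio_diff <- log2(hoxb13_raw) - log2(il17br_raw)   # difference of log2 values
log_ratio_direct <- log2(hoxb13_raw / il17br_raw)       # log2 of the ratio
```

Both give 2, -1 and 0, so a positive value means HOXB13 is the more highly expressed of the two genes in that sample.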

41 Sample Output (HOXB13:IL17BR)

42 ROC (auc=0.84, HOXB13:IL17BR)

43 Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
gene_name3 = "PGR_3UTR1" # PGR
gene_name4 = "BF108852" # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
geno_selected3 = geno[which(feature$GeneName == gene_name3),]
geno_selected4 = geno[which(feature$GeneName == gene_name4),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
# name the column gene_ratio so the formula below is resolved inside logit_data
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, gene_ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = 'logit'))
summary(fit)

44 Sample Output (Multivariate)

45 ROC (auc = 0.86, Multivariate )

46 Kaplan-Meier Plot (gene ratio high/low group, cutoff = -1.2)

47 Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -1
fit.cox <- coxph(Surv(time, censor) ~ group, data = surv_data)
summary(fit.cox)

48 Sample Output (Cox)

49 Validation: GSE6532 The link to this dataset
Sample size: 87
Number of total markers: 54675
The genes HOXB13 and IL17RB and the ESTs are included in this dataset.
We use this dataset for validation.
Result: the markers are not significant on this independent set.

