1
Unit 3, Session 1 Statistical Models
2
Outline
Logistic Regression
ROC curve and AUC
Linear Regression
Kaplan-Meier plot and log-rank test
Cox Proportional hazards model
3
Logistic Model
The logistic model is used for case/control studies.
Usage scenario: the response is binary, say disease/healthy or recurrence/non-recurrence.
$$\log \frac{\pi_{\text{status}=\text{recurrence}}}{1-\pi_{\text{status}=\text{recurrence}}} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
where $x_i$ are the predictors and $\beta_i$ are the parameters of interest.
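To make the log-odds relationship concrete, here is a minimal sketch in base R on simulated data. The variable names x and y and the coefficient values are made up for illustration and are not taken from the toy dataset. It checks that the fitted log-odds are linear in the predictor and that the inverse logit maps them back to probabilities.

```r
set.seed(1)
n <- 200
x <- rnorm(n)                      # hypothetical predictor
eta <- -0.5 + 1.2 * x              # assumed true log-odds: beta0 + beta1 * x
y <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(y ~ x, family = binomial(link = "logit"))
b <- coef(fit)

log_odds <- unname(b[1] + b[2] * x)               # linear predictor computed by hand
p_hat <- unname(predict(fit, type = "response"))  # fitted probabilities

# the inverse logit of the linear predictor reproduces the fitted probabilities
all.equal(plogis(log_odds), p_hat)
# equivalently, log(p / (1 - p)) recovers the linear predictor
all.equal(log(p_hat / (1 - p_hat)), log_odds)
```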
4
Linear Model
Response: continuous, say weight or gene expression.
Predictors: any variables (say gene expression)
Model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
Assumptions: error term $\varepsilon \sim \text{iid } N(0, \sigma^2)$
5
Survival Methods
Kaplan-Meier plot: visually compare the survival curves between groups
Cox proportional hazards model and log-rank test: formal statistical tests
Response: survival time (say DFS) and censor
Predictors: any variables (say group or specific genes)
Recurrence: censor = 1; non-recurrence: censor = 0
6
Load data Toy example data
toy_data<- read.csv("toy_example_data.csv")
7
Logistic Model Response --- recurrence/non-recurrence status
Predictor --- the expression of gene HOXB13
library(ROCR)  # provides prediction() and performance()
# logistic regression: use gene HOXB13 to predict the recur/non-recur status
fit.logistic <- glm(status ~ gene_HOXB13, data = toy_data, family = binomial(link = 'logit'))
summary(fit.logistic)
# plot ROC curve
p <- predict(fit.logistic, type = "response")
pr <- prediction(p, toy_data$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = "ROC plot of logistic regression")
# calculate the AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]  # the numeric AUC is stored in the y.values slot
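As a cross-check on the ROC-based AUC: the AUC equals the probability that a randomly chosen recurrence case receives a higher predicted score than a randomly chosen non-recurrence case. The sketch below computes this rank-based (Mann-Whitney) form in base R. The helper name auc_by_rank is hypothetical, and the scores and labels are simulated rather than taken from toy_data.

```r
# AUC via the Mann-Whitney rank formula; tied scores receive average ranks
auc_by_rank <- function(score, label) {
  # label: 1 = positive (e.g. recurrence), 0 = negative
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(2)
label <- rep(c(0, 1), each = 50)
score <- rnorm(100, mean = label)   # positives tend to score higher

auc_rank <- auc_by_rank(score, label)

# brute-force check: fraction of (positive, negative) pairs ordered correctly
wins <- outer(score[label == 1], score[label == 0], ">")
auc_pairs <- mean(wins)
all.equal(auc_rank, auc_pairs)
```

With a fitted model, the same helper could be applied to the predicted probabilities and a 0/1 coding of the status column.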
8
Logistic Regression Result
9
ROC Curve
10
Linear Model Response --- expression of HOXB13
Predictor --- expression of IL17BR
# linear model: use gene IL17BR to predict another gene, HOXB13
fit.lm <- lm(gene_HOXB13 ~ gene_IL17BR, data = toy_data)
summary(fit.lm)
11
Linear Regression Result
12
Kaplan-Meier Plot
We use the Kaplan-Meier plot and the log-rank test to check whether survival time differs significantly between groups (say, the high/low ratio groups).
library(survival)  # Surv(), survfit()
# autoplot() with the pVal/pX/pY/yLab arguments is assumed here to come from the survMisc package
ratio.surv <- survfit(Surv(time, censor) ~ ratio_group, data = toy_data)
autoplot(ratio.surv, pVal = TRUE, pX = 0.25, pY = 0.25, title = "Kaplan-Meier plot of toy example", yLab = "Survival Probability")
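The plot can display a p-value, but the log-rank test can also be run directly. A minimal sketch, assuming the survival package is installed. The data are simulated, the hazard rates are arbitrary, and the column names simply mirror those used on the slide.

```r
library(survival)

set.seed(3)
n <- 100
sim <- data.frame(
  ratio_group = rep(c("high", "low"), each = n / 2),
  stringsAsFactors = TRUE
)
# simulated event times: the "high" group is given a larger hazard (illustration only)
event_time <- rexp(n, rate = ifelse(sim$ratio_group == "high", 0.2, 0.1))
censor_time <- runif(n, 0, 15)
sim$time <- pmin(event_time, censor_time)
sim$censor <- as.integer(event_time <= censor_time)  # 1 = event observed, 0 = censored

lr <- survdiff(Surv(time, censor) ~ ratio_group, data = sim)
# the log-rank statistic is chi-square with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
print(lr)
p_value
```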
13
Kaplan-Meier Plot
14
Cox Proportional Hazards Model
We use the high/low ratio group to predict the survival probability. Here the response is the survival time together with the censor information.
fit.cox <- coxph(Surv(time, censor) ~ group, data = toy_data)
summary(fit.cox)
15
Cox Model Result
16
Data Downloading, Processing and Analysis
17
Outline
Download data
Parsing data
Normalization
Variance based filtering (top 25%)
T test based filtering (based on the P-value cutoff)
The above steps are implemented in the "get_DEG_table.R" script.
18
Data Availability
Microdissected dataset: GSE1378
Whole tissue dataset: GSE1379
The easiest way to download the data is to use the "getGEO" function from the "GEOquery" package.
19
Use "getGEO" to Download Data
We have already downloaded the data; the "getGEO" function can load it either locally or online.
library(GEOquery)
Local (loading_method = "local")
geo_Name <- "GSE1378"
geodata2 <- getGEO(filename = paste0('geo_data/', geo_Name, '_series_matrix.txt.gz'), GSEMatrix = TRUE)
Online (loading_method = "online")
geodata <- getGEO(geo_Name, GSEMatrix = TRUE, destdir = "geo_data")
Set the loading_method variable in the get_DEG_table function to "local" or "online" to choose how the data is loaded.
Note that the downloaded geno matrix is in log2 scale.
20
Parsing Data
Extract the geno matrix, pheno table and feature table
idx <- 1
geno <- assayData(geodata[[idx]])$exprs
pheno <- pData(phenoData(geodata[[idx]]))
feature <- as(featureData(geodata[[idx]]), 'data.frame')
Parse the phenotype table to get the variables Age, Size, DFS and censor
# each field has the form "name=value"; splitting on "=" and keeping the even elements extracts the values (n is the number of samples)
infos_df$Age = as.numeric(unlist(strsplit(infos_df$X9, split = "="))[seq(2, 2 * n, 2)])
infos_df$Size = as.numeric(unlist(strsplit(infos_df$X3, split = "="))[seq(2, 2 * n, 2)])
infos_df$DFS = as.numeric(unlist(strsplit(infos_df$X10, split = "="))[seq(2, 2 * n, 2)])
infos_df$censor = ifelse(infos_df$status == "Status=recur", 1, 0)
21
Normalization
Gene-wise normalization (subtract each sample's median log2 value, computed across genes)
tmp_gm <- apply(geno, 2, median)  # one median per column (sample)
geno <- geno - matrix(rep(1, numOfGene), numOfGene, 1) %*% matrix(tmp_gm, 1, n)
Sample-wise normalization (divide each gene by its mean across samples, in the original scale)
geno <- apply(geno, c(1, 2), function(x) { 2 ^ x })  # back to the original scale
geno <- t(apply(geno, 1, function(x) { x / (mean(x)) }))  # divide each row (gene) by its mean
geno <- apply(geno, c(1, 2), function(x) { log2(x) })  # return to log2 scale
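The two normalization steps can also be written with vectorised base R calls. This is a sketch on a small simulated matrix (rows are genes, columns are samples, as in the geno matrix). It is offered as an equivalent formulation for checking understanding, not as the code in get_DEG_table.R; geno_toy and the step names are made up.

```r
set.seed(4)
numOfGene <- 6
n <- 4
geno_toy <- matrix(rnorm(numOfGene * n, mean = 8), numOfGene, n)  # simulated log2 values

# step 1: subtract each column (sample) median
step1 <- sweep(geno_toy, 2, apply(geno_toy, 2, median), "-")

# step 2: in the original scale, divide each row (gene) by its mean, then return to log2
lin <- 2^step1
step2 <- log2(lin / rowMeans(lin))

# the matrix-product and apply() formulation from the slide, for comparison
slide1 <- geno_toy - matrix(rep(1, numOfGene), numOfGene, 1) %*%
  matrix(apply(geno_toy, 2, median), 1, n)
slide2 <- apply(slide1, c(1, 2), function(x) 2^x)
slide2 <- t(apply(slide2, 1, function(x) x / mean(x)))
slide2 <- apply(slide2, c(1, 2), function(x) log2(x))

all.equal(step2, slide2)
```

After step 1 every sample has median zero on the log2 scale; after step 2 every gene has mean one on the original scale.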
22
Variance Based Filtering
Calculate the variance for each gene and keep the top 25%
# variance based filtering (keep genes above the 75th percentile)
var_geno <- apply(geno, 1, var)
var_filtered_idx <- var_geno > quantile(var_geno, 0.75)
feature_var_filtered <- feature[var_filtered_idx,]
geno_var_filtered <- geno[var_filtered_idx,]
23
T test Based Filtering
For each gene, run a t test between the recurrence and non-recurrence groups. The status variable indicates the group.
tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
pvalue_list[i] <- tmp_test$p.value
Filter the genes by the P-value cutoff
ttest_filtered_idx <- which(pvalue_list < cutoff)
feature_ttest_filtered <- feature_var_filtered[ttest_filtered_idx,]
geno_ttest_filtered <- geno_var_filtered[ttest_filtered_idx,]
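The per-gene loop around t.test is not shown on the slide. One way it could look, as a sketch on simulated data: geno_toy, the group labels and the cutoff value 0.05 are made up for illustration, and the first gene is given a group difference on purpose.

```r
set.seed(5)
n_gene <- 20
n_sample <- 30
status <- factor(rep(c("non-recur", "recur"), each = n_sample / 2))
geno_toy <- matrix(rnorm(n_gene * n_sample), n_gene, n_sample)
geno_toy[1, status == "recur"] <- geno_toy[1, status == "recur"] + 3  # planted signal

pvalue_list <- numeric(n_gene)
for (i in seq_len(n_gene)) {
  # one row of the expression matrix becomes the response for this test
  sdata <- data.frame(gene_express = geno_toy[i, ], status = status)
  tmp_test <- t.test(gene_express ~ status, data = sdata, alternative = "two.sided")
  pvalue_list[i] <- tmp_test$p.value
}

cutoff <- 0.05  # illustrative only
ttest_filtered_idx <- which(pvalue_list < cutoff)
geno_ttest_filtered <- geno_toy[ttest_filtered_idx, , drop = FALSE]
```

Note that t.test runs the Welch (unequal variance) test by default; passing var.equal = TRUE gives the pooled-variance version.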
24
Sample Results (GSE1378,microdissected, 0.0011 cutoff)
25
Sample Results (GSE1379, whole tissue dataset, cutoff 0.0011)
26
Statistical Modeling (examples)
27
Outline
Select overlapped genes between GSE1378 and GSE1379 for subsequent analysis
Heatmap and Dendrogram
Univariate logistic regression for selected genes and two-gene ratio predictor
Multivariate logistic regression (size and the other two potential predictors)
Survival analysis part 1: Kaplan-Meier plot
Survival analysis part 2: Cox proportional hazards model
28
Overlapped Genes
In the preprocessing step, we obtained two DEG tables, one for each of the datasets GSE1378 and GSE1379.
We use the genes that overlap between these two DEG tables for the subsequent analysis.
GSE1378: micro-dissected breast cancer cells (LCM)
GSE1379: whole tissue sections
The overlapped genes are HOXB13 (identified twice as AI and BC007092), IL17BR (AF208111) and AI240933 (EST).
We will study the prognostic value of these markers.
29
Heatmap and Dendrogram
We use a heatmap and dendrogram to visually check the relationship (correlation) among genes or samples.
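A minimal sketch of this kind of plot in base R, on a simulated expression matrix. The gene and sample names are placeholders, and the choice of 1 minus Pearson correlation as the distance with average linkage is an assumption, since the slides do not state which distance or linkage was used.

```r
set.seed(6)
expr <- matrix(rnorm(8 * 12), nrow = 8,
               dimnames = list(paste0("gene", 1:8), paste0("sample", 1:12)))
expr[1:4, 1:6] <- expr[1:4, 1:6] + 2  # planted block so that some genes co-vary

# correlation-based distance between genes (rows)
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")
plot(gene_tree, main = "Gene dendrogram (toy data)")

# heatmap() draws row and column dendrograms alongside the colour matrix
heatmap(expr, scale = "row",
        distfun = function(m) as.dist(1 - cor(t(m))),
        hclustfun = function(d) hclust(d, method = "average"))
```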
30
Heatmap (microdissected,GSE1378) consistent with the paper
31
Heatmap (whole section tissue, GSE1379)
32
Model Set 1 Univariate logistic regression for each gene
Response variable: recur/non-recur status
Predictors: one of the overlapped genes, HOXB13 / IL17BR (AF208111) / AI240933 (EST)
33
Model Set 2 Univariate logistic regression for ratio of genes
Response variable : recur/non-recur status Predictors : HOXB13:IL17BR
34
Model Set 3 Multivariate logistic regression
Response variable : recur/non-recur Predictors: tumor size, HOXB13:IL17BR, PGR and ERBB2
35
Model Set 4 Survival model
Response variable: DFS (disease free survival time), censor
Predictor: use "-intercept/beta" from the logistic regression as the cutoff to divide the samples into two groups: a high ratio group and a low ratio group
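The cutoff -intercept/beta is the ratio value at which the fitted logistic model predicts a recurrence probability of exactly 0.5, because the log-odds intercept + beta * ratio equal zero there. A sketch on simulated data; the variable names and coefficient values are made up for illustration.

```r
set.seed(7)
n <- 150
ratio <- rnorm(n, mean = -1, sd = 1.5)           # hypothetical log2 gene ratio
status <- rbinom(n, 1, plogis(1 + 0.9 * ratio))  # 1 = recur, 0 = non-recur

fit <- glm(status ~ ratio, family = binomial(link = "logit"))
cutoff <- unname(-coef(fit)[1] / coef(fit)[2])

# at the cutoff the predicted probability is 0.5
p_at_cutoff <- unname(predict(fit, newdata = data.frame(ratio = cutoff),
                              type = "response"))

# split the samples into the two groups used in the survival models
group <- ifelse(ratio > cutoff, "high", "low")
table(group)
```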
36
Important Note
Remember that there are two datasets, GSE1378 and GSE1379.
The same sets of models can be fit on both datasets.
Set the working dataset variable:
#working_dataset = "GSE1379" # whole tissue section, GSE1379
working_dataset = "GSE1378" # microdissected breast cancer cells, GSE1378
We use working dataset GSE1378 as the example.
37
Univariate Logistic Regression for Each Gene
As an example, we check the gene HOXB13
gb_acc = "BC007092" # HOXB13
geno_selected = geno[which(feature$GB_ACC == gb_acc),]
# the column is named geno_selected so that the formula below resolves inside logit_data
logit_data = data.frame(status = infos_df$status, geno_selected = geno_selected)
fit <- glm(status ~ geno_selected, data = logit_data, family = binomial(link = 'logit'))
p <- predict(fit, type = "response")
pr <- prediction(p, infos_df$status)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main = paste0("ROC plot of gene ", gb_acc))
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]  # the numeric AUC is stored in the y.values slot
38
Sample Output (gene HOXB13 )
39
ROC (auc 0.796, gene HOXB13 )
40
Univariate Logistic Regression (HOXB13:IL17BR)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
# the column is named gene_ratio so that the formula below resolves inside logit_data
logit_data = data.frame(status = infos_df$status, gene1 = geno_selected1, gene2 = geno_selected2, gene_ratio = gene_ratio)
# fit the model
fit <- glm(status ~ gene_ratio, data = logit_data, family = binomial(link = 'logit'))
summary(fit)
41
Sample Output (HOXB13:IL17BR)
42
ROC (auc=0.84, HOXB13:IL17BR)
43
Multivariate Logistic Regression (tumor size, gene ratio, PGR, ERBB2)
gb_acc1 = "BC007092" # HOXB13
gb_acc2 = "AF208111" # IL17BR
gene_name3 = "PGR_3UTR1" # PGR
gene_name4 = "BF108852" # ERBB2
geno_selected1 = geno[which(feature$GB_ACC == gb_acc1),]
geno_selected2 = geno[which(feature$GB_ACC == gb_acc2),]
geno_selected3 = geno[which(feature$GeneName == gene_name3),]
geno_selected4 = geno[which(feature$GeneName == gene_name4),]
# in the log2 scale, the ratio is the difference
gene_ratio = geno_selected1 - geno_selected2
# the column is named gene_ratio so that the formula below resolves inside logit_data
logit_data = data.frame(status = infos_df$status, size = infos_df$Size, gene1 = geno_selected1, gene2 = geno_selected2, gene_ratio = gene_ratio, gene3 = geno_selected3, gene4 = geno_selected4)
# fit the multivariate logistic regression
fit <- glm(status ~ gene_ratio + size + gene3 + gene4, data = logit_data, family = binomial(link = 'logit'))
summary(fit)
44
Sample Output (Multivariate)
45
ROC (auc = 0.86, Multivariate )
46
Kaplan-Meier Plot (gene ratio high/low group, cutoff = -1.2)
47
Cox Proportional Hazards Model (gene ratio high/low group, cutoff = -1
fit.cox <- coxph(Surv(time,censor) ~ group, data = surv_data) summary(fit.cox)
48
Sample Output (Cox)
49
Validation: GSE6532
Sample size: 87
Number of total markers: 54675
Genes HOXB13, IL17RB and the ESTs are included in this dataset.
We use this dataset for validation.
Result: the markers are not significant on this independent set.