Presentation is loading. Please wait.

Presentation is loading. Please wait.

Day 4 Analysis of Qualitative data Anne Segonds-Pichon v

Similar presentations


Presentation on theme: "Day 4 Analysis of Qualitative data Anne Segonds-Pichon v"— Presentation transcript:

1 Day 4 Analysis of Qualitative data Anne Segonds-Pichon v2019-03

2 Outline of this section
Comparing 2 groups (one factor) Fisher’s exact or Chi2 tests Example: Dancing cats More than one factor Logistic regression Example: Methylation data

3 Qualitative data = not numerical
= values taken = usually names (also nominal) e.g. causes of death in hospital Values can be numbers but not numerical e.g. group number = numerical label but not unit of measurement Qualitative variable with intrinsic order in their categories = ordinal Particular case: qualitative variable with 2 categories: binary or dichotomous e.g. alive/dead or presence/absence Can be good or horrible data qualitative data do not use many stats, one is chi square Cats and dogs, first plot it

4 Comparing 2 groups One factor

5 Fisher’s exact and Chi2 Example: cats.dat Cats trained to line dance
2 different rewards: food or affection Question: Is there a difference between the rewards? Is there a significant relationship between the 2 variables? does the reward significantly affect the likelihood of dancing? To answer this type of question: Contingency table Fisher’s exact or Chi2 tests Food Affection Dance ? No dance Is there a difference between the rewards? Dogs more happy, cats more picky (on demonstration is the other way around, so cats more picky But first: how many cats do we need?

6 Exercise 14: Power calculation cats.dat
Preliminary results from a pilot study: 25% line-danced after having received affection as a reward vs. 70% after having received food. How many cats do we need? power.prop.test(p1= 0.25, p2= 0.7, sig.level= 0.05, power= 0.8) Providing the effect size observed in the experiment is similar to the one observed in the pilot study, we will need 2 samples of 18 to 19 cats to reach significance (p<0.05) with a Fisher’s exact test.

7 Plot cats data (From raw data)
cats<-read.delim("cats.dat") head(cats) plot(cats$Training, cats$Dance, xlab = "Training", ylab = "Dance") table(cats)

8 Plot cats data (From raw data)
contingency.table <- table(cats) contingency.table100<-prop.table(contingency.table,1) contingency.table100<-round(contingency.table100*100) contingency.table100 barplot(t(contingency.table100), legend.text=TRUE)

9 Plot cats data (From raw data) Prettier
barplot(t(contingency.table100), col=c("chartreuse3","lemonchiffon2"), cex.axis=1.2, cex.names=1.5, cex.lab=1.5, ylab = "Percentages", las=1) legend("topright", title="Dancing", inset=.05, c("Yes","No"), horiz=TRUE, pch=15, col=c("chartreuse3","lemonchiffon2"))

10 Chi-square and Fisher’s tests
Chi2 test very easy to calculate by hand but Fisher’s very hard Many software will not perform a Fisher’s test on tables > 2x2 Fisher’s test more accurate than Chi2 test on small samples Chi2 test more accurate than Fisher’s test on large samples Chi2 test assumptions: 2x2 table: no expected count <5 Bigger tables: all expected > 1 and no more than 20% < 5 Yates’s continuity correction All statistical tests work well when their assumptions are met When not: probability Type 1 error increases Solution: corrections that increase p-values Corrections are dangerous: no magic Probably best to avoid them

11 Chi-square test Example with ‘cats’
In a chi-square test, the observed frequencies for two or more groups are compared with expected frequencies by chance. With observed frequency = collected data Example with ‘cats’ Your frequency, compare the frequencies with the expected ones observed by chance. HOW DIFFERENT IS WHAT I observe from what I expect? THIS DIFFERENCE CAN BE BIG SO SIGNOFICANT OR NOT

12 Is 25.35 big enough for the test to be significant?
Chi-square test Formula for Expected frequency = (row total)*(column total)/grand total Example: expected frequency of cats line dancing after having received food as a reward: Expected = (38*76)/200=14.44 Alternatively: Probability of line dancing: 76/200 Probability of receiving food: 38/200 (76/200)*(38/200)=0.072 Expected: 7.2% of 200 = 14.44 Distance between expected and observed Chi2 = ( )2/ ( )2/ ( )2 / ( )2/14.4 = 25.35 Is big enough for the test to be significant?

13 Chi-square and Fisher’s Exact tests
Odds of dancing 48/114 = affection 28/10 = food Ratio of the odds food affection = 6.6 Answer: Training significantly affects the likelihood of cats line dancing (p=4.8e-07).

14 Chi-square and Fisher’s Exact tests
From a matrix Sometimes data to test are just a few values: These numbers come from an experiment/query In the entire genome (n=29395), 221 genes are associated with a function (e.g. antigen binding) . From an experiment, we identified 342 genes out of which 23 are showing such an association. Question: Is the enrichment significant? We need to create a matrix with the 4 values: enrichment <- matrix(c(23, 342, 221, 29395), nrow=2, ncol=2, byrow = TRUE) chisq.test(enrichment, correct=F) fisher.test(enrichment) Answer: Yes there is a significant enrichment in our list of genes (p < 2.2e-16).

15 with more than one factor
Comparisons with more than one factor

16 Real Example: DNA Methylation
Genome Read 1 Read 2 Read 3 SRR _0021:5:1:1361: u SRR _0021:5:1:1361: u SRR _0021:5:1:1355: m SRR _0021:5:1:1361: u SRR _0021:5:1:1355: m SRR _0021:5:1:1358: u SRR _0021:5:1:1361: u SRR _0021:5:1:1353: u

17 Chi-square and Fisher’s Exact tests
More than one factor Still comparing 2 groups: one factor/condition of interest (e.g. treatment vs control) but each group has 3 biological samples/3 experiments Example: dna_methylation_format2.csv Values are independent observations of individual DNA bases. A DNA base is either methylated (Meth) or unmethylated (Unmeth). Equivalent to a 2-way ANOVA but with a binary outcome: Binary logistic regression

18 Logistic regression dna_methylation_format2.csv
For each gene: Is there a difference in methylation between the 2 conditions (AL and DR) accounting for the variability between samples? Question 1: is the difference between samples significant? (experimental variability) Question 2: is the difference between conditions significant accounting for the variation between samples?

19 Exercise 15: dna_methylation_format2.csv
Load dna_methylation_format2.csv Use head() to get to know the structure of the data Calculate the percentage of methylation for each sample Plot the percentage for each gene separately and within each gene for each sample. No clues but here is how the graph should look -ish.

20 Exercise 15: dna_methylation_format2.csv - Answers
## Load the file ## dna.methylation <- read.csv("dna_methylation_format2.csv") head(dna.methylation) ## calculate percentage ## dna.methylation$MethPercent<-(dna.methylation$Meth/(dna.methylation$Meth+dna.methylation$Unmeth)*100) ## create 2 files: one for each gene ## dna.methylation.Pno1<-dna.methylation[dna.methylation$Gene=="Pno1",] dna.methylation.Pnpla5<-dna.methylation[dna.methylation$Gene=="Pnpla5",] head(dna.methylation.Pnpla5)

21 Exercise 15: dna_methylation_format2.csv - Answers
## plot percentage methylation for each sample and for each gene ### par(mfrow=c(1,2)) ## dna.methylation.Pno1<-as.data.frame(dna.methylation.Pno1) ## barplot(dna.methylation.Pno1$MethPercent, names.arg = dna.methylation.Pno1$Sample, col=rep(c("lightblue","lightpink"), each=3), main = "Pno1", ylim=c(0,100), cex.names=0.6, ylab = "Percentage Methylation", las=1) ## dna.methylation.Pnpla5<-as.data.frame(dna.methylation.Pnpla5) ## barplot(dna.methylation.Pnpla5$MethPercent, names.arg = dna.methylation.Pnpla5$Sample, main = "Pnpla5", Ta-dah!

22 Exercise 16: dna_methylation_format2.csv
Question 1: is the difference between samples significant? (experimental variability) Work a Chi2 test for each group/gene so that you can workout the experimental variability. Restructure the file in the long format (1 file per gene) for Question 2

23 Exercise 16: dna_methylation_format2.csv - Answers
Question 1: is the difference between samples significant? (experimental variability) Chi2 test dna.methylation.Pnpla5.AL<- dna.methylation.Pnpla5[dna.methylation.Pnpla5$Group=="AL",] chisq.test(dna.methylation.Pnpla5.AL[,4:5], correct=F) 3x2 contingency table Unmeth Meth AL-1174 287 254 AL-1180 121 97 AL-1220 163 187

24 Exercise 16: dna_methylation_format2.csv - Answers
Question 1: is the difference between samples significant? (experimental variability) ## test difference between samples for each gene/condition ### 1 dna.methylation.Pno1.AL<-dna.methylation.Pno1[dna.methylation.Pno1$Group=="AL",] chisq.test(dna.methylation.Pno1.AL[,4:5], correct=F) 2 dna.methylation.Pno1.DR<-dna.methylation.Pno1[dna.methylation.Pno1$Group=="DR",] chisq.test(dna.methylation.Pno1.DR[,4:5], correct=F) 3 dna.methylation.Pnpla5.AL<-dna.methylation.Pnpla5[dna.methylation.Pnpla5$Group=="AL",] chisq.test(dna.methylation.Pnpla5.AL[,4:5], correct=F) 4 dna.methylation.Pnpla5.DR<-dna.methylation.Pnpla5[dna.methylation.Pnpla5$Group=="DR",] 1 2 3 4

25 Exercise 16: dna_methylation_format2.csv - Answers
Question 2: is the difference between conditions significant accounting for the variation between samples? Data need to be in the long format (1 file per gene) Restructure dna.methylation.Pno1 and dna.methylation.Pnpla5 Rename the columns “Methylation” and “Counts” Extra credits if you remove the first and the last columns 

26 Exercise 16: dna_methylation_format2.csv - Answers
## prepare file (long format)## dna.methylation.Pno1.long <- melt(dna.methylation.Pno1[,2:5], ID=c("Sample", "Group")) dna.methylation.Pnpla5.long <- melt(dna.methylation.Pnpla5[,2:5], ID=c("Sample", "Group")) colnames(dna.methylation.Pno1.long)[3:4]<-c("Methylation", "Counts") colnames(dna.methylation.Pnpla5.long)[3:4]<-c("Methylation", "Counts") head(dna.methylation.Pno1.long)

27 Logistic regression dna_methylation_format2.csv y = β0 + β1*x
Question 2: is the difference between conditions significant accounting for the variation between samples? Logistic regression: Equivalent to a 2-way ANOVA or a linear model but with a binary outcome Family: Linear Model (General Linear Model or Generalized Linear Model) y = f(x) One factor: y = Constant + Coefficient * x Two factors: y = Constant + Coefficient1 * x1 + Coefficient2 * x2 y = f(x1, x2 … xn) In R: Outcome ~ factor1 + factor2 With the interaction: Outcome ~ factor1 + factor2 + factor1*factor2 y = β0 + β1*x

28 Logistic regression dna_methylation_format2.csv
Question 2: is the difference between conditions significant accounting for the variation between samples? Logistic regression: Equivalent to a 2-way ANOVA but with a binary outcome In R: Continuous outcome: lm or aov(Outcome ~ factor1 + factor2 + factor1*factor2) = glm(Outcome ~ factor1 + factor2 + factor1*factor2) Binary outcome: lm or aov(does not work) = glm(Outcome ~ factor1 + factor2 + factor1*factor2, family = binomial()) Data need to be in the long format (1 file per gene)

29 Logistic regression dna_methylation_format2.csv
Question 2: is the difference between conditions significant accounting for the variation between samples? dna.model.Pno1 <- glm(Methylation ~ Group+Sample, data=dna.methylation.Pno1.long, family = binomial(), weights=Counts) Group Methylation AL 1 DR Group Methylation Counts AL Unmeth 2 Meth 3 DR 1 4

30 Logistic regression dna_methylation_format2.csv
Question 2: is the difference between conditions significant accounting for the variation between samples? dna.model.Pno1 <- glm(Methylation ~ Group+Sample, data=dna.methylation.Pno1.long, family = binomial(), weights=Counts) summary(dna.model.Pno1)

31 Logistic regression dna_methylation_format2.csv
Question 2: is the difference between conditions significant accounting for the variation between samples? dna.model.Pnpla5<-glm(Methylation ~ Group+Sample,data=dna.methylation.Pnpla5.long, family = binomial(), weights=Counts) summary(dna.model.Pnpla5)

32 Real Example: DNA Methylation
Measures per gene Chi-Square p<0.001 (FDR)

33 Our addresses:


Download ppt "Day 4 Analysis of Qualitative data Anne Segonds-Pichon v"

Similar presentations


Ads by Google