Course in Statistics and Data analysis Course B DAY2 September 2009 Stephan Frickenhaus www.awi.de/en/go/bioinformatics
DAY2: How to import data from Excel. Multivariate data in plots, linear models. ANOVA. Ideas of clustering and modelling.
Import of Data from Excel
In Excel: copy a rectangular part (or all) of your table, paste it into a text file in the Windows Editor, check the column names, and save it as "tab1.txt".
In R: change to the directory where "tab1.txt" is (File -> Change dir.), then load the table into a variable V:
V = read.table(file="tab1.txt", header=T)
You may use the column named "day" as row names:
V = read.table(file="tab1.txt", header=T, row.names="day")
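The import steps can be tried without Excel. The following is only a sketch: the file contents and the columns "size" and "class" are made-up stand-ins for a table pasted from Excel, and the file is written to a temporary directory rather than the working directory.

```r
# Hypothetical stand-in for the tab-separated text saved from the editor.
path <- file.path(tempdir(), "tab1.txt")
writeLines(c("day\tsize\tclass",
             "1\t2.5\tA",
             "2\t3.1\tB",
             "3\t2.8\tA"), path)

# header = TRUE takes the first line as column names.
V <- read.table(file = path, header = TRUE)
str(V)

# Use the column "day" as row names instead of as a data column.
V2 <- read.table(file = path, header = TRUE, row.names = "day")
rownames(V2)
```

With row.names="day" the table has one column less, because "day" now labels the rows.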
Problems
In case of problems with the decimal separator ("," or "."), tell R which character is the decimal point in read.table (e.g., dec="," for files written with decimal commas):
V = read.table(…, dec=".")
If you get a text file with commas, not tabs, separating the columns:
V = read.table(…, sep=",")
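Both cases can be reproduced with two small made-up files. This is a sketch under assumptions: the file names and values are invented for illustration only.

```r
# Case 1: tab-separated file that uses a decimal comma (hypothetical data).
path_dec <- file.path(tempdir(), "tab_decimal_comma.txt")
writeLines(c("day\tsize", "1\t2,5", "2\t3,1"), path_dec)
A <- read.table(path_dec, header = TRUE, dec = ",")

# Case 2: columns separated by commas, decimal point "." (hypothetical data).
path_csv <- file.path(tempdir(), "tab_commas.txt")
writeLines(c("day,size", "1,2.5", "2,3.1"), path_csv)
B <- read.table(path_csv, header = TRUE, sep = ",")

# Both should give a numeric "size" column.
str(A)
str(B)
```

If "size" shows up as character or factor in str(), the decimal separator was probably not recognised.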
Saving result tables
R analysis results, e.g., from filtering, are sometimes exported to text files, which can be imported into Excel or R later.
Do this without quotes around each entry:
write.table(V, file="res.txt", quote=F)
Save only 2 desired columns ("size" and "class"):
write.table(V[, c("size","class")], file="res2.txt")
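A minimal sketch of exporting two columns and reading them back, assuming a small made-up table V with columns "day", "size" and "class". The arguments sep="\t" and row.names=FALSE are additions for this example; they make the file paste cleanly into Excel.

```r
# Hypothetical table standing in for an analysis result.
V <- data.frame(day = 1:3,
                size = c(2.5, 3.1, 2.8),
                class = c("A", "B", "A"))

out <- file.path(tempdir(), "res2.txt")

# Select the two columns by name; each stays a column in the file.
write.table(V[, c("size", "class")], file = out,
            quote = FALSE, sep = "\t", row.names = FALSE)

# Read the file back to check what was written.
back <- read.table(out, header = TRUE, sep = "\t")
back
```

Selecting columns by name keeps one row per observation, whereas stacking the columns with rbind() would write them as two long rows.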
Multivariate Data: Suppose we have measured the diameter and height of diatoms. Work with "diatoms.txt". What is the relation between the two? Is one dependent on the other? What is the strategy of the organism?
Correlation test: Is there a significant correlation? cor.test(D,H) checks whether the observed correlation is significantly different from zero. We find a negative correlation near -1 (strong). A small p-value indicates a significant correlation.
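Since "diatoms.txt" is not reproduced here, the test can be tried on simulated data. This is only a sketch: the values of D and H below are invented so that height falls as diameter grows, and they are not the course data.

```r
set.seed(1)

# Simulated diameters and heights (assumption: roughly constant cell volume).
D <- runif(30, min = 10, max = 30)
H <- 4000 / D^2 * exp(rnorm(30, sd = 0.05))

# Pearson correlation test: the null hypothesis is a correlation of zero.
ct <- cor.test(D, H)
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value of the test
ct$conf.int   # confidence interval for the correlation
```

Printing ct directly shows the same numbers in one summary.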
We can conclude that these diatoms show a special trend: height increases as diameter decreases. What does this mean? Can we say that this has a compensating function? It could be that the cell maintains its volume (centric shape, i.e., a cylinder). Volume V = pi * R^2 * H = 1/4 * pi * D^2 * H. So for constant V we expect a linear relation between H and 1/D^2. We need a regression, and it is found in R: lm(Y~X). Try ?lm to see how. See "diatoms.R".
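The expected relation H = 4V/(pi*D^2) can be checked with a regression of H on 1/D^2. The sketch below uses simulated data with an assumed constant volume vol; it is not the content of "diatoms.R".

```r
set.seed(1)

vol <- 3000                         # assumed constant cell volume
D <- runif(30, min = 10, max = 30)  # simulated diameters
H <- 4 * vol / (pi * D^2) + rnorm(30, sd = 0.3)  # heights with noise

# I() protects the arithmetic so that 1/D^2 is used as the predictor.
fit <- lm(H ~ I(1 / D^2))
coef(fit)

# The slope should be close to 4 * vol / pi if volume is maintained.
slope <- unname(coef(fit)[2])
c(estimated = slope, expected = 4 * vol / pi)
```

A slope estimate can thus be turned into a volume estimate via V = slope * pi / 4.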
Linear models: To fit a model to data. Suppose we have a sample of measured (y,x1,x2,x3). The simplest model showing the influence of all 3 x has the form y=a*x1+b*x2+c*x3+d. The coefficients a,b,c,d are obtained from lm(y~x1+x2+x3). Each coefficient's value may be non-significant, so it could as well be set to zero. summary(lm()) shows these significances.
Check "lm.R": The data y was created with coefficients 1, 1, 0.5 and a random term runif/3. We see the estimates of these coefficients from the fit under "Estimate". Now we could write the fitted model as y.fit(x1,x2,x3)=0.26826+1.00595*x1+1.01167*x2+0.47311*x3. Use this to draw a ± error bar around y.fit. If you want no intercept, use y~x1+x2+x3-1.
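The script "lm.R" is not shown, so the following is a guess at its general shape: data generated with coefficients 1, 1, 0.5 and a uniform random term, then fitted with lm. The sample size, the seed and the point newdata are assumptions; the estimates will differ from the numbers quoted above.

```r
set.seed(42)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)

# True coefficients 1, 1, 0.5 plus a random term runif/3.
y <- 1 * x1 + 1 * x2 + 0.5 * x3 + runif(n) / 3

fit <- lm(y ~ x1 + x2 + x3)
summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)

# Prediction with an interval at a new point, usable as an error bar.
newdata <- data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5)
pred <- predict(fit, newdata = newdata, interval = "prediction")
pred

# Same model without intercept.
fit0 <- lm(y ~ x1 + x2 + x3 - 1)
coef(fit0)
```

The intercept of the first fit absorbs the mean of the random term (about 1/6 for runif/3), which is why it is not zero.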
Conclusions: Variables x with significant coefficients, i.e., Pr(>|t|)
ANOVA: With two different treatments we use the t-test to compare means. The influence of a factor/treatment with more than 2 variants is commonly analysed by ANOVA, i.e., more than two means are compared at the same time. The null hypothesis is that all sample means are from the same population [the treatment has no effect].
ANOVA in R is like linear models, but with factors that influence the means. See dataset ANOVA.txt. Try aov(y~f.c). With a weak p-value, the effect may be unclear because of the other factors.
But which means do differ? f.c has 3 levels. It is not enough to just look at the means of each level. We must make all pairwise comparisons for significance. This is known as a "post-hoc" test. One is TukeyHSD. It gives a table of pairwise tests of means. Since the data is used more than once, we are more likely to discover some effect by chance. HSD corrects the p-values for multiple tests.
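A sketch of the ANOVA followed by the post-hoc test, on simulated data because ANOVA.txt is not reproduced here. The factor name f.c follows the slides; the group means, group size and seed are assumptions.

```r
set.seed(3)

# Factor with 3 levels, 10 observations each (hypothetical design).
f.c <- factor(rep(c("0", "1", "2"), each = 10))

# Level "1" gets a shifted mean; the others share the same mean.
y <- rnorm(30, mean = c(10, 11, 10)[as.integer(f.c)], sd = 1)

# One-way ANOVA: tests whether all level means are equal.
fit <- aov(y ~ f.c)
summary(fit)

# Post-hoc: all pairwise differences with adjusted p-values.
hsd <- TukeyHSD(fit)
hsd$f.c   # columns: diff, lwr, upr, p adj
```

Each row of the table is one pair of levels, e.g., "1-0" compares level 1 with level 0.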
Post-hoc: Almost significant effect when comparing group 1 with group 0; the p-value is adjusted for 3 tests.
Compare with a t-test: So, the adjusted p-value 0.06 from HSD is greater than the unadjusted p-value of the t-test.
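The effect of the adjustment can be seen by placing unadjusted and adjusted p-values side by side. This sketch uses simulated data (group means and seed are assumptions), so the value 0.06 from the slide will not be reproduced. pairwise.t.test with p.adjust.method="none" is used as the unadjusted reference because it pools the standard deviation over all groups, as the ANOVA does.

```r
set.seed(3)
f.c <- factor(rep(c("0", "1", "2"), each = 10))
y <- rnorm(30, mean = c(10, 11, 10)[as.integer(f.c)], sd = 1)

# Plain two-sample t-test of group 1 against group 0 (Welch by default).
tt <- t.test(y[f.c == "1"], y[f.c == "0"])
tt$p.value

# Unadjusted pairwise t-tests with pooled standard deviation.
pw <- pairwise.t.test(y, f.c, p.adjust.method = "none")
pw$p.value

# Tukey HSD: the same comparisons, adjusted for multiple testing.
hsd <- TukeyHSD(aov(y ~ f.c))$f.c
hsd[, "p adj"]
```

For each pair, the HSD p-value should not be smaller than the unadjusted pooled one.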
Ideas of clustering and modeling: Clustering is a way to detect/display groups in data that might point to a factor which affects the sample. Different ways:
– Mapping: plot multivariate data in a special way to see groups
– Discriminant analysis: use a known factor (e.g., strain) to find a mapping that best separates the known groups. Use the discriminant to classify new data!
PCA: Download data PCA.txt. See PCA.R to make a PCA for that multivariate data. PC1 is the axis of the rotated data with maximal variance; PC2 has smaller variance. In the plot, a line is drawn with which we could separate/discriminate the groups.
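PCA.R and PCA.txt are not reproduced here, so the sketch below builds a small made-up data set with two groups and runs a PCA with prcomp from base R. The variable names a, b, c, the group structure and the seed are assumptions.

```r
set.seed(7)

# Two hypothetical groups that differ in variables a and b, not in c.
g <- rep(c(1, 2), each = 25)
X <- cbind(a = rnorm(50, mean = 3 * g),
           b = rnorm(50, mean = 3 * g),
           c = rnorm(50))

# PCA: centre the data, then rotate onto axes of decreasing variance.
pca <- prcomp(X, center = TRUE, scale. = FALSE)
summary(pca)   # standard deviation and proportion of variance per PC
pca$rotation   # loadings: how each variable contributes to each PC

# Coordinates of the samples in the rotated system.
scores <- pca$x
head(scores)

# In an interactive session, plot the first two components by group:
# plot(scores[, 1], scores[, 2], col = g, xlab = "PC1", ylab = "PC2")
```

Because the groups differ along a and b together, they should separate mainly along PC1.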
Linear Discriminant: Check LDA.R and LDA.txt to see similar results. The plots show the original 3-class 3D data in a 2D LDA, and new data (squares) classified (predicted) according to the LDA.
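LDA.R is not shown; one common way to do this in R is lda from the MASS package, which is assumed here and may not be what the script uses. The data below are simulated 3-class 3D points; the class centres, the factor name strain and the new points are made up for illustration.

```r
library(MASS)  # assumption: MASS is installed (it ships with standard R)

set.seed(11)

# Three hypothetical classes with different centres in 3D.
strain <- factor(rep(c("s1", "s2", "s3"), each = 20))
centers <- rbind(c(0, 0, 0), c(4, 0, 2), c(0, 4, 2))
X <- centers[as.integer(strain), ] + matrix(rnorm(60 * 3), ncol = 3)
train <- data.frame(strain = strain, x1 = X[, 1], x2 = X[, 2], x3 = X[, 3])

# Fit the discriminant using the known factor.
fit <- lda(strain ~ x1 + x2 + x3, data = train)

# Classify new, unlabelled points.
new <- data.frame(x1 = c(0.2, 3.8), x2 = c(-0.1, 0.3), x3 = c(0.1, 2.2))
pred <- predict(fit, newdata = new)
pred$class   # predicted class per new point
pred$x       # coordinates of the new points on LD1 and LD2
```

With 3 classes there are at most 2 discriminant axes, which is why the 3D data can be shown in a 2D LDA plot.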