Lab 1: Using data output from Qiime, transformations, quality control

Where to find useful output from QIIME

OTU table: wf_da/uclust_picked_otus/rep_set/pynast_aligned_seqs/rdp_assigned_taxonomy/otu_table/seqs_otu_table.txt
Mapping file: Fasting_Map.txt
Tree file: wf_da/uclust_picked_otus/rep_set/pynast_aligned_seqs/fasttree_phylogeny/seqs_rep_set.tre

OTU table

    OTU ID  CM01_16W  CM01_26W  …  PM12_I  Consensus Lineage
                                           Root;Bacteria;Firmicutes;"Clostridia";Clostridiales;"Lachnospiraceae"
    1                                      Root;Bacteria;Bacteroidetes;Bacteroidetes;Bacteroidales
    2                                      Root;Bacteria;Firmicutes;"Clostridia";Clostridiales;"Lachnospiraceae"

    # read the tab-delimited OTU table; the first column (OTU ID) becomes the row names
    d = read.table('otu_table.txt', header=T, row.names=1, sep='\t')
    # keep the lineage strings, then drop that last column so only counts remain
    taxa.names = d$Consensus.Lineage
    d = as.matrix(d[,-dim(d)[2]])
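The read-and-split step above can be tried without the real file. The sketch below is only an illustration: the sample names S1 to S3, the counts, and the shortened lineages are all made up, and the table is supplied inline through the `text` argument of read.table() instead of a file name.

```r
# Toy tab-delimited OTU table (hypothetical samples S1-S3, invented counts).
txt <- paste(
  "OTU ID\tS1\tS2\tS3\tConsensus Lineage",
  "0\t5\t0\t2\tRoot;Bacteria;Firmicutes;Clostridia",
  "1\t1\t7\t3\tRoot;Bacteria;Bacteroidetes;Bacteroidetes",
  "2\t4\t3\t5\tRoot;Bacteria;Firmicutes;Clostridia",
  sep = "\n")

# Same call as in the lab, but reading from the string above.
d <- read.table(text = txt, header = TRUE, row.names = 1, sep = "\t",
                stringsAsFactors = FALSE)

# "Consensus Lineage" is turned into the syntactic name Consensus.Lineage.
taxa.names <- d$Consensus.Lineage

# Drop the last (lineage) column and keep the counts as a numeric matrix.
d <- as.matrix(d[, -ncol(d)])

print(dim(d))        # OTUs x samples
print(taxa.names)
```

At this point the rows are OTUs and the columns are samples, and the lineages are kept in a separate vector in the same order as the rows.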

Absolute abundance to relative abundance

Because the number of sequences differs from sample to sample, we need to standardize each sample to make the samples comparable. The most natural normalization is to go from absolute abundances to relative abundances, so that each number in the OTU table represents the proportion of sequences from that sample belonging to that OTU.

    # divide every column (sample) by its total count
    d = scale(d, center=F, scale=colSums(d))
    # transpose: samples become rows, OTUs become columns
    d = t(d)
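A small worked example of this normalization, on an invented 3 x 3 count matrix (the names otu0 to otu2 and S1 to S3 are hypothetical). It also checks the result against prop.table(), which is another way to compute column proportions in base R.

```r
# Invented counts: rows are OTUs, columns are samples (each column sums to 10).
counts <- matrix(c(5, 1, 4,
                   0, 7, 3,
                   2, 3, 5),
                 nrow = 3,
                 dimnames = list(c("otu0", "otu1", "otu2"),
                                 c("S1", "S2", "S3")))

# Divide each column by its total, then transpose so samples are rows.
rel <- t(scale(counts, center = FALSE, scale = colSums(counts)))

# Every sample (row) should now sum to 1.
print(rowSums(rel))

# Cross-check against prop.table() with margin = 2 (column proportions).
same <- isTRUE(all.equal(rel, t(prop.table(counts, margin = 2)),
                         check.attributes = FALSE))
print(same)
```

After the transpose, each row is one sample and each entry is the fraction of that sample's sequences assigned to the OTU.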

Writing an R function

A function is an R object that can be called (executed).
A function has inputs: its parameters.
A function returns an output: its value.
The value is given either by an explicit return() call or, by default, by the last expression evaluated inside the function.

    extract.name.level = function(x, level){
      # split the lineage on ';' and append 'Other' as a filler for missing levels
      a=c(unlist(strsplit(x,';')),'Other')
      # keep the first 'level' pieces (or all of them if there are fewer)
      paste(a[1:min(level,length(a))],collapse=';')
    }

Execute the new function

Call the function that we have just created and see what the results are.

    # d is now a numeric matrix, so the lineages are taken from taxa.names
    a = as.character(taxa.names[1])
    a
    extract.name.level(a, 2)
    extract.name.level(a, 3)
    extract.name.level(a, 5)
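To see what these calls return without loading any data, the function can be run on a hand-written lineage. This is a sketch: the lineage strings below are illustrative, and the function body is repeated only so the snippet runs on its own.

```r
# Same helper as in the lab, repeated so this snippet is self-contained.
extract.name.level <- function(x, level) {
  a <- c(unlist(strsplit(x, ";")), "Other")
  paste(a[1:min(level, length(a))], collapse = ";")
}

lineage <- "Root;Bacteria;Firmicutes;Clostridia;Clostridiales"  # 5 levels

lvl2 <- extract.name.level(lineage, 2)   # "Root;Bacteria"
lvl3 <- extract.name.level(lineage, 3)   # "Root;Bacteria;Firmicutes"

# Asking for more levels than the lineage has appends a single "Other".
lvl6 <- extract.name.level(lineage, 6)
lvl7 <- extract.name.level(lineage, 7)   # same as lvl6

# A short lineage is padded the same way.
short <- extract.name.level("Root;Bacteria", 3)  # "Root;Bacteria;Other"

print(c(lvl2, lvl3, lvl6, lvl7, short))
```

The practical consequence is that OTUs classified only to a shallow level end up grouped under an "Other" label when the table is summarized at a deeper level.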

Function to summarize data at different taxonomic levels

    otu2taxonomy = function(x, level, taxa=NULL){
      # default: treat the column names of x as the taxonomy strings
      if(is.null(taxa)){
        taxa = colnames(x)
      }
      if(length(taxa)!=dim(x)[2]){
        stop("taxonomy should have the same length as the number of columns in OTU table")
      }
      level.names = sapply(as.character(taxa), function(x) extract.name.level(x,level=level))
      # for each sample (row), sum the columns that share a name at this level
      t(apply(x, 1, function(y) tapply(y,level.names,sum)))
    }
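A toy run of the summarizing step may make the grouping easier to follow. Everything in this sketch is invented for illustration (two samples, three OTUs, shortened lineages), and both functions are restated in compact form so the snippet runs by itself; here the length check raises an error with stop().

```r
extract.name.level <- function(x, level) {
  a <- c(unlist(strsplit(x, ";")), "Other")
  paste(a[1:min(level, length(a))], collapse = ";")
}

otu2taxonomy <- function(x, level, taxa = NULL) {
  if (is.null(taxa)) taxa <- colnames(x)
  if (length(taxa) != ncol(x)) {
    stop("taxonomy should have the same length as the number of columns in OTU table")
  }
  level.names <- sapply(as.character(taxa),
                        function(tx) extract.name.level(tx, level = level))
  # Per sample (row), add up the OTU columns that share the same name.
  t(apply(x, 1, function(y) tapply(y, level.names, sum)))
}

# Invented relative abundances: samples in rows, OTUs in columns.
x <- matrix(c(0.5, 0.2,
              0.3, 0.3,
              0.2, 0.5),
            nrow = 2,
            dimnames = list(c("S1", "S2"), c("otu0", "otu1", "otu2")))
taxa <- c("Root;Bacteria;Firmicutes;Clostridia",
          "Root;Bacteria;Bacteroidetes;Bacteroidetes",
          "Root;Bacteria;Firmicutes;Clostridia")

# Level 3 is the phylum here, because "Root" counts as level 1.
p <- otu2taxonomy(x, level = 3, taxa = taxa)
print(p)
```

otu0 and otu2 are both Firmicutes, so their columns are added together; the result has one column per phylum and each sample still sums to 1.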

Summarize the OTU table

Explore the OTU tables, summarized at different taxonomic levels, by running the following commands. How many Phyla, Classes, Orders, Families, Genera and OTUs are represented in the dataset? Hint: use the dim function.

    d.genus = otu2taxonomy(d,level=7,taxa=taxa.names)
    d.family = otu2taxonomy(d,level=6,taxa=taxa.names)
    d.order = otu2taxonomy(d,level=5,taxa=taxa.names)
    d.class = otu2taxonomy(d,level=4,taxa=taxa.names)
    d.phylum = otu2taxonomy(d,level=3,taxa=taxa.names)

Mapping file

    SampleID  …  Treatment  Gender  Week  Location  Mouse
    CM01_C       C                  30    Cecal     C1
    CM02_C                                          C2
    CM03_C                                          C3
    PM11_I       P                        Ileal     P11
    PM12_I                                          P12

    ff = read.table('mapping.txt', header=T)

Matching up the rows of the mapping file with the OTU table

Run the following commands:

    ff$SampleID
    rownames(d.phylum)

Notice that the rows in the OTU table are in a different order (sorted alphanumerically) than in the mapping table. The following command orders the rows of the mapping table:

    ff=ff[order(ff$SampleID),]

Check that the rows now match up:

    ff$SampleID == rownames(d.phylum)
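Sorting works when both tables contain exactly the same samples. A slightly more defensive variant, sketched here on invented data, reorders the mapping table by looking up each OTU-table row name with match() and then stops if anything fails to line up. The sample names and the `otu` matrix are hypothetical.

```r
# Invented mapping table, deliberately in a different order than the OTU table.
ff <- data.frame(SampleID  = c("S3", "S1", "S2"),
                 Treatment = c("P", "C", "C"),
                 stringsAsFactors = FALSE)

# Invented summarized table with samples in rows.
otu <- matrix(1:6, nrow = 3,
              dimnames = list(c("S1", "S2", "S3"), c("taxA", "taxB")))

# Put the mapping rows in the order of the OTU table rows.
ff <- ff[match(rownames(otu), ff$SampleID), ]

# Fail loudly if any sample is missing or out of place.
stopifnot(!anyNA(ff$SampleID), all(ff$SampleID == rownames(otu)))

print(ff)
```

Wrapping the comparison in all() gives a single TRUE or FALSE instead of one value per sample, which is easier to check when there are many samples.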

Subsetting the data

    # select columns by name
    vars = c("Root;Bacteria;Bacteroidetes", "Root;Bacteria;Firmicutes", "Root;Bacteria;Proteobacteria")
    d.phylum[,vars]

    # select columns by number
    cnum = c(1,3,6)
    d.phylum[,cnum]

    # dropping columns
    d.phylum[,!colnames(d.phylum) %in% vars]

    # selecting/dropping rows
    d.phylum[1:3,]
    d.phylum[-(1:90),]

    # subset function
    week16fecal = subset(d.phylum, ff$Location == 'Fecal' & ff$Week == 16, select = vars)
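The same row-and-column selection can be written with plain bracket indexing and a logical vector built from the mapping table. The sketch below uses invented data (the names `tab`, `map`, taxA to taxC and S1 to S4 are hypothetical) and passes `drop = FALSE`, so the result stays a matrix even when only one row or column survives.

```r
# Invented relative abundances: 4 samples x 3 taxa, each row sums to 1.
tab <- matrix(c(0.5, 0.4, 0.6, 0.3,
                0.3, 0.4, 0.1, 0.5,
                0.2, 0.2, 0.3, 0.2),
              nrow = 4,
              dimnames = list(c("S1", "S2", "S3", "S4"),
                              c("taxA", "taxB", "taxC")))

# Invented mapping table in the same row order as 'tab'.
map <- data.frame(SampleID = rownames(tab),
                  Location = c("Fecal", "Cecal", "Fecal", "Fecal"),
                  Week     = c(16, 16, 30, 16),
                  stringsAsFactors = FALSE)

# Logical vector: TRUE for week-16 fecal samples.
keep <- map$Location == "Fecal" & map$Week == 16

# Rows by logical vector, columns by name.
sel <- tab[keep, c("taxA", "taxB"), drop = FALSE]
print(sel)
```

This only gives the right rows if the mapping table and the abundance table are in the same sample order.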

Basic plotting: explore on your own

    plot(week16fecal)
    week16map = subset(ff, ff$Location == 'Fecal' & ff$Week == 16)
    plot(week16fecal ~ week16map$Treatment)
    plot(week16fecal[,1] ~ week16map$Treatment,
         main=colnames(week16fecal)[1],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')
    plot(week16fecal[,2] ~ week16map$Treatment,
         main = colnames(week16fecal)[2],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')

Plotting (cont’d)

    # two plots on one figure
    par(mfrow=c(1,2))
    plot(week16fecal[,1] ~ week16map$Treatment,
         main = colnames(week16fecal)[1],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')
    plot(week16fecal[,2] ~ week16map$Treatment,
         main = colnames(week16fecal)[2],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')

Plotting (cont’d)

    # saving the plot
    pdf(file='bf.pdf')
    par(mfrow=c(1,2))
    plot(week16fecal[,1] ~ week16map$Treatment,
         main = colnames(week16fecal)[1],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')
    plot(week16fecal[,2] ~ week16map$Treatment,
         main = colnames(week16fecal)[2],
         xlab='Treatment',
         ylab='Relative abundance (fraction of the total)')
    dev.off()

Heatmaps

    # heatmaps: fecal samples only, phyla with mean relative abundance above 1%
    ma.phylum = subset(d.phylum, ff$Location=='Fecal', colMeans(d.phylum) > 0.01)
    ma.ff = subset(ff, ff$Location=='Fecal')
    library(gplots)
    heatmap.2(ma.phylum, dendrogram="both", scale="col", col=redgreen(75),
              distfun = function(x) dist(x, method='euclidean'),
              key=T, tracecol=NULL)

Heatmaps (factors as side colors)

    # color the rows by treatment; factor() keeps this working when the
    # column was read as character rather than as a factor
    heatmap.2(ma.phylum, dendrogram="both", scale="col", col=redgreen(75),
              distfun = function(x) dist(x, method='euclidean'),
              key=T, tracecol=NULL,
              RowSideColors=rainbow(2)[as.integer(factor(ma.ff$Treatment))])

    # color the rows by week
    heatmap.2(ma.phylum, dendrogram="both", scale="col", col=redgreen(75),
              distfun = function(x) dist(x, method='euclidean'),
              key=T, tracecol=NULL,
              RowSideColors=rainbow(3)[as.integer(factor(ma.ff$Week))])
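The side-color vector can be built and inspected on its own before it is passed to heatmap.2(). This sketch uses an invented treatment column. It also shows why going through factor() matters: read.table() has kept strings as character by default since R 4.0, and as.integer() on character labels such as "C" returns NA.

```r
# Invented treatment labels for four samples.
treatment <- c("C", "P", "C", "P")

# Converting the labels directly gives NA, so no colors would be picked.
direct <- suppressWarnings(as.integer(treatment))
print(direct)

# Going through a factor maps each label to an integer code 1..nlevels.
f <- factor(treatment)
side.cols <- rainbow(nlevels(f))[as.integer(f)]

# One color per sample; samples with the same treatment share a color.
print(data.frame(treatment, side.cols))
```

Using nlevels(f) instead of a hard-coded number keeps the palette the right size if the number of groups changes.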

On your own

Using plot() and heatmap.2(), visually explore the data at a different taxonomic level (Class, Order, Family or Phylum).
Tip: Focus only on highly abundant taxa (>1%, >5%).
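One way to apply the tip is to filter columns by their mean relative abundance before plotting. The sketch below runs on an invented table (names taxA to taxD and S1, S2 are hypothetical); with the lab data, the same indexing would be applied to a summarized table such as the class-level or family-level one.

```r
# Invented relative abundances: 2 samples x 4 taxa, each row sums to 1.
tab <- matrix(c(0.60, 0.50,
                0.36, 0.45,
                0.03, 0.05,
                0.01, 0.00),
              nrow = 2,
              dimnames = list(c("S1", "S2"),
                              c("taxA", "taxB", "taxC", "taxD")))

# Mean relative abundance of each taxon across samples.
avg <- colMeans(tab)

# Keep taxa above 1% and above 5% mean abundance.
above1 <- tab[, avg > 0.01, drop = FALSE]
above5 <- tab[, avg > 0.05, drop = FALSE]

print(colnames(above1))
print(colnames(above5))
```

Raising the threshold leaves fewer columns, which usually makes a heatmap or a set of boxplots easier to read.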

Useful commands: find help pages on these and use them for exploring the data

    colnames(), rownames()
    dim()
    max(), min()
    mean(), sd()
    scale()
    boxplot()
    heatmap.2()