Microarray analysis workflow (diagram)
Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Normalization → Estimation / Testing / Clustering / Discrimination → Biological verification and interpretation
Lecture coverage: Churchill, March 15; Bult, Lectures 5 and 6; Hibbs, Lectures 10 and 11; Blake, Lectures 16 and 17
Project Steps
– Find and download array data
– Normalize array data
– Analyze data, i.e., generate gene lists (differentially expressed genes, genes in clusters, etc.)
– Interpret gene lists using the annotations of the genes in your lists. Gene Ontology terms are available for many organisms, but not all
Getting The Data
– Search GEO (or another public repository) for a data set of interest
– Download the data files, e.g., Affy .CEL files, Affy .CDF files, etc.
– Upload them to your home directory
Normalize the Data
I sent you all a script (2/23/2012) to RMA-normalize the Ackerman array data available from my home directory
– What is a library?
– What does the ReadAffy() function do?
– What are possible arguments for the ReadAffy() function?
– What class of R object is rma.CELData?
– What class of R object is rma.expr?
– What class of R object is rma.expr.df?
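The script itself is not reproduced in these slides. As a rough sketch of what an RMA normalization script of this kind might look like, the code below reuses the object names from the questions above. The helper `expr_to_df()` and the ProbeID-first column layout are assumptions made for illustration, not the contents of the original script. The Bioconductor steps run only if the affy package is installed and .CEL files are present in the working directory.

```r
# Hypothetical helper (not from the original script): turn an expression
# matrix (probesets x arrays) into a data frame whose first column holds
# the probeset IDs.
expr_to_df <- function(expr_matrix) {
  data.frame(ProbeID = rownames(expr_matrix), expr_matrix,
             row.names = NULL, check.names = FALSE)
}

# Look for CEL files in the current working directory
cel_files <- list.files(pattern = "\\.cel$", ignore.case = TRUE)

if (requireNamespace("affy", quietly = TRUE) && length(cel_files) > 0) {
  CELData     <- affy::ReadAffy(filenames = cel_files)  # raw probe-level data
  rma.CELData <- affy::rma(CELData)                     # RMA: background correct, normalize, summarize
  rma.expr    <- Biobase::exprs(rma.CELData)            # extract the expression values
  rma.expr.df <- expr_to_df(rma.expr)                   # assumed layout: ID column + one column per array

  # Inspect the classes asked about in the questions
  print(class(CELData))
  print(class(rma.CELData))
  print(class(rma.expr))
  print(class(rma.expr.df))
}
```

Calling class() on each object is the quickest way to check your answers to the questions.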
Plotting raw probe-level intensities across the Ackerman arrays (not normalized)
jpeg("boxplot.jpeg")
boxplot(CELData, names=CELData$sample, col="blue")
dev.off()
Plotting summarized probeset intensities across the Ackerman arrays (normalized)
mydata <- rma.expr.df
jpeg("normal_boxplot.jpg")
# mydata[-1] drops the first column before plotting
boxplot(mydata[-1], main="Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()
Next time Posted articles from Gary Churchill. – If you only read one article, read Churchill 2004 – See also Gary’s web site: http://churchill.jax.org/software/rmaanova.shtml – Look at Sample Data and Tutorial After that lecture we will begin analysis of microarray data – MAANOVA
Sequence “Space”
Roche 454 – Flow space
– Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain
– Flow space describes the sequence in terms of these base incorporations
– http://www.youtube.com/watch?v=bFNjxKHP8Jc
AB SOLiD – Color space
– Sequencing by DNA ligation, using synthetic DNA molecules that contain two known bases and a fluorescent dye
– Each base is sequenced twice
– http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
Illumina/Solexa – Base space
– Single-base extensions of fluorescently labeled nucleotides with protected 3′ OH groups
– Sequencing via cycles of base addition and detection, followed by deprotection of the 3′ OH
– http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related
GenomeTV – Next Generation Sequencing (lecture)
– http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related
See also: http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html
“Standard” File Formats
– Sequence containers: FASTA, FASTQ, BAM/SAM
– Alignments: BAM/SAM, MAF
– Annotation: BED, GFF/GTF/GFF3, WIG
– Variation: VCF, GVF
FASTQ: Data Format
– Text based
– Encodes sequence calls and quality scores with ASCII characters
– Stores minimal information about the sequence read
– 4 lines per sequence
  Line 1: begins with “@”, followed by a sequence identifier and an optional description
  Line 2: the sequence
  Line 3: begins with “+”, optionally followed by the sequence identifier and description
  Line 4: encoding of the quality scores for the sequence in line 2
References/Documentation
– http://maq.sourceforge.net/fastq.shtml
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
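To make the four-line layout concrete, here is a minimal base R sketch that splits FASTQ text into records. The function name `parse_fastq()` and the example read are made up for illustration. The sketch assumes each sequence and quality string sits on a single line (no line wrapping) and that qualities use the Sanger encoding (ASCII offset 33).

```r
# Parse FASTQ lines (exactly 4 lines per record) into a data frame.
parse_fastq <- function(lines) {
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  idx    <- seq(1, length(lines), by = 4)   # index of each record's first line
  header <- lines[idx]
  plus   <- lines[idx + 2]
  # Line 1 must start with "@", line 3 with "+"
  stopifnot(all(startsWith(header, "@")), all(startsWith(plus, "+")))
  data.frame(
    id       = sub("^@", "", header),       # identifier plus optional description
    sequence = lines[idx + 1],
    quality  = lines[idx + 3],
    stringsAsFactors = FALSE
  )
}

# Made-up example record
fq <- c("@read1 example description",
        "GATTTGGGG",
        "+",
        "!''*((((*")
reads <- parse_fastq(fq)

# Each base has exactly one quality character
stopifnot(all(nchar(reads$sequence) == nchar(reads$quality)))

# Decode the quality string of the first read, assuming offset 33
phred <- utf8ToInt(reads$quality[1]) - 33
print(phred)
```

For real data files, Bioconductor packages such as ShortRead offer more robust readers; the sketch is only meant to show the record structure.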
FASTQ Example
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina (1.3+) stores PHRED quality scores ranging from 0 to 62 with an ASCII offset of 64, whereas Sanger quality scores range from 0 to 93 with an ASCII offset of 33. Solexa quality scores are on a different scale and have to be converted to PHRED quality scores.
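The conversions mentioned above can be sketched in base R as follows. The function names are hypothetical. The offsets (33 for Sanger, 64 for Illumina 1.3+ and Solexa) and the Solexa-to-PHRED formula, Q_PHRED = 10 · log10(10^(Q_Solexa/10) + 1), are the ones given in Cock et al.

```r
# Illumina 1.3+ (PHRED scores, ASCII offset 64) -> Sanger (PHRED, offset 33).
# Only the offset changes; the scores themselves stay the same.
illumina13_to_sanger <- function(qual) {
  intToUtf8(utf8ToInt(qual) - 64L + 33L)
}

# Solexa score -> PHRED score (numeric, not yet rounded)
solexa_to_phred <- function(q_solexa) {
  10 * log10(10^(q_solexa / 10) + 1)
}

# Solexa-encoded string (offset 64) -> Sanger-encoded string (offset 33)
solexa_to_sanger <- function(qual) {
  q_solexa <- utf8ToInt(qual) - 64L
  q_phred  <- round(solexa_to_phred(q_solexa))
  intToUtf8(q_phred + 33L)
}

# "h" is ASCII 104, i.e. a score of 40 at offset 64; at offset 33 that is "I"
print(illumina13_to_sanger("hhhh"))

# The two scales differ mostly at low quality and converge at high quality
print(round(solexa_to_phred(c(-5, 0, 10, 40)), 2))
```

Because the two scales agree closely for high scores, the conversion mainly matters for low-quality bases.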
SAM (Sequence Alignment/Map)
It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited, with a header section and an alignment section
  Header lines begin with “@” (the header section is optional)
  The alignment section has 11 mandatory fields
– BAM is the binary format of SAM
http://samtools.sourceforge.net/
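As an illustration of the layout described above, the following base R sketch separates header lines from alignment lines and names the 11 mandatory fields as they appear in the SAM specification. The function name `parse_sam()` and the example records are made up. In practice you would normally use samtools or a Bioconductor package rather than parsing SAM text by hand.

```r
# The 11 mandatory SAM alignment fields, in order
sam_fields <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

parse_sam <- function(lines) {
  is_header <- startsWith(lines, "@")              # header lines start with "@"
  aln <- strsplit(lines[!is_header], "\t", fixed = TRUE)
  stopifnot(length(aln) > 0, all(lengths(aln) >= 11))
  # Keep the mandatory columns; optional tags (column 12 onward) are ignored here
  core <- do.call(rbind, lapply(aln, function(x) x[1:11]))
  df <- as.data.frame(core, stringsAsFactors = FALSE)
  names(df) <- sam_fields
  for (col in c("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")) {
    df[[col]] <- as.integer(df[[col]])
  }
  list(header = lines[is_header], alignments = df)
}

# Made-up example: two header lines and one alignment with an optional tag
sam <- c("@HD\tVN:1.0\tSO:coordinate",
         "@SQ\tSN:chr1\tLN:1000",
         "read1\t0\tchr1\t100\t60\t9M\t*\t0\t0\tGATTTGGGG\t!''*((((*\tNM:i:0")
parsed <- parse_sam(sam)
print(parsed$alignments[, c("QNAME", "RNAME", "POS", "CIGAR")])

# Bit 0x4 of FLAG marks an unmapped read
is_unmapped <- bitwAnd(parsed$alignments$FLAG, 4L) != 0
print(is_unmapped)
```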
Tutorial Web Site http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml Tutorial 5
RNA Seq Workflow
– Convert data to FASTQ
– Upload files to Galaxy
– Quality control: throw out low quality sequence reads, etc.
– Map reads to a reference genome (many algorithms available; trade-off between speed and sensitivity)
– Data summarization: associate alignments with genome annotations to obtain counts
– Data visualization
– Statistical analysis
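The data summarization step can be pictured with a deliberately simplified base R sketch: count the reads whose alignment start falls inside each annotated gene interval. The function `count_reads()`, the gene coordinates, and the read positions are all made up for illustration. Real summarization tools also account for the full alignment span, strand, spliced reads, and reads that overlap more than one feature.

```r
# Count reads per gene by the leftmost aligned position only (a simplification).
# aln:   data frame with columns chr, pos
# genes: data frame with columns gene, chr, start, end (1-based, inclusive)
count_reads <- function(aln, genes) {
  counts <- vapply(seq_len(nrow(genes)), function(i) {
    sum(aln$chr == genes$chr[i] &
        aln$pos >= genes$start[i] &
        aln$pos <= genes$end[i])
  }, integer(1))
  names(counts) <- genes$gene
  counts
}

# Made-up annotation and alignments
genes <- data.frame(gene  = c("geneA", "geneB", "geneC"),
                    chr   = c("chr1", "chr1", "chr2"),
                    start = c(100, 300, 1),
                    end   = c(200, 400, 50),
                    stringsAsFactors = FALSE)
aln <- data.frame(chr = c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"),
                  pos = c(150, 199, 250, 300, 10, 60),
                  stringsAsFactors = FALSE)

counts <- count_reads(aln, genes)
print(counts)
```

The resulting count vector (one per sample) is the kind of input that the statistical analysis step then works on.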
Typical RNA-Seq Project Work Flow (diagram)
Tissue sample → Total RNA → mRNA → cDNA → Sequencing → FASTQ file → QC → TopHat → Cufflinks → Gene/Transcript/Exon expression → Visualization and statistical analysis
JAX Computational Sciences Service
TopHat
Trapnell et al. (2009). Bioinformatics 25:1105-1111. http://tophat.cbcb.umd.edu/
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
TopHat is better suited to aligning RNA Seq data than aligners such as Maq or BWA because it takes splicing into account during the alignment process.
Trapnell C et al. Bioinformatics 2009;25:1105-1111 TopHat is built on the Bowtie alignment algorithm.
Cufflinks
Trapnell et al. (2010). Nature Biotechnology 28:511-515. http://cufflinks.cbcb.umd.edu/
Assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples