Presentation on theme: "Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo."— Presentation transcript:
1 Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo
2 What is Sequencing?
Sequencing is the process of determining the precise order of nucleotides.
Non-high-throughput sequencing: Sanger sequencing, the basic chain-termination method developed by Frederick Sanger. It generates all possible single-stranded DNA molecules complementary to a given template, beginning at a common 5' base.
3 The Pros and Cons of Sanger Sequencing
Pros: highly accurate; targetable.
Cons: costs about $15 per 1000 base pairs, so sequencing the whole human genome (roughly 3 billion base pairs) would cost about 3 billion / 1000 x $15 = $45 million; low detection rate for alternative alleles.
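As a rough check of the whole-genome cost figure, the arithmetic can be written out directly. This is a minimal sketch that assumes a haploid human genome of roughly 3 billion base pairs sequenced once; the function name `sanger_cost` is illustrative.

```python
def sanger_cost(genome_bp, cost_per_kb):
    """Cost in dollars of sequencing genome_bp bases once at cost_per_kb dollars per 1000 bp."""
    return genome_bp / 1000 * cost_per_kb


# Assumptions: ~3 billion bp genome, $15 per 1000 bp (the per-kb price quoted on the slide).
cost = sanger_cost(3_000_000_000, 15)
print(f"${cost:,.0f}")  # $45,000,000
```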
7 Sequencing Type by Source
RNA: mRNA, small RNA, total RNA.
DNA: whole genome or targeted (exome, mitochondrial, genes of interest, etc.).
8 Sequencing Data
Raw image data is more than 2 TB per sample.
Raw data is about 5-15 GB per single-end sample or 10-30 GB per paired-end sample for RNAseq or exome sequencing. Whole genome data can easily exceed 200 GB per sample.
In general, about 5x the raw data size is needed in disk space to finish processing.
Raw data is usually in FASTQ format, with base quality on the Phred scale.
The older Illumina pipeline encodes quality with the Phred+64 offset; the newer CASAVA 1.8 pipeline uses the Sanger (Phred+33) encoding.
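The two quality encodings differ only in the ASCII offset added to each Phred score. The sketch below decodes a FASTQ quality string under either offset and includes a simple heuristic for guessing the offset. The function names and the threshold of 59 (the lowest character code the older +64 encodings can produce) are assumptions for illustration, and the guess is only reliable when applied to many reads.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristic: characters with ASCII code below 59 can only occur in Phred+33 data."""
    lowest = min(ord(ch) for ch in quality_string)
    return 33 if lowest < 59 else 64


print(decode_quality("II5#", offset=33))  # [40, 40, 20, 2]
print(decode_quality("hhTB", offset=64))  # [40, 20, 2] preceded by 40: [40, 40, 20, 2]
```

Both example strings encode the same four scores, which is why mixing up the offsets silently shifts every quality value by 31.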
9 Single End vs Paired End
Paired-end data contains twice as much data as single-end data.
Paired-end sequencing is more expensive than single-end sequencing.
Paired-end data makes quality control easier (insert size, duplicate removal).
Paired-end data provides more opportunities to detect structural variants.
10 What Can You Obtain from DNAseq?
SNPs (require only a normal or a tumor sample).
Somatic mutations (require a tumor-normal pair).
Copy number variation (works best with whole genome sequencing).
Small structural variants: insertions, deletions.
Large structural variants: translocations, inversions.
11 What Can You Obtain from RNAseq?
Gene expression.
SNPs (only for expressed genes).
Novel splicing variants.
Gene fusions.
RNAseq has been used primarily as a replacement for microarrays.
12 How Does RNAseq Compare to Microarray?
Since 2008, people have been saying that RNAseq will replace microarrays for gene expression profiling.
VANTAGE stopped offering microarray service earlier this year.
Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, (1): p2.
Shendure, J., The beginning of the end for microarrays? Nat Methods, (7): p
13 Data Distribution
Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, (8): p. e71462.
14 Result Consistency
Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, (8): p. e71462.
15 RNAseq vs Microarray: Advantages
 | RNAseq | Microarray
Result type | Rich, not limited to expression | Limited to expression only
Expression | Can quantify expression at the exon and gene level | Can quantify expression at the exon or gene level
Novel discovery | Can be used for novel discovery | Can only detect what is on the chip
Analysis | Difficult | Easy
Interpretation | |
Price for assay | Price has become comparable to microarray; however, the analysis hardware and analysis time may increase the final cost | Price is stable
17 Raw Data
@HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG
NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC
+
18 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139 | the unique instrument name
136 | the run id
FC706VJ | the flowcell id
2 | flowcell lane
2104 | tile number within the flowcell lane
15343 | 'x'-coordinate of the cluster within the tile
197393 | 'y'-coordinate of the cluster within the tile
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y | Y if the read fails filter (read is bad), N otherwise
18 | 0 when none of the control bits are on, otherwise it is an even number
ATCACG | index sequence
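The header fields described above can be pulled apart with two string splits. This is a minimal sketch that assumes the header follows exactly this layout (seven colon-separated fields, a space, then four more); the `IlluminaHeader` type and `parse_header` function are illustrative names, not part of any library.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index_sequence: str


def parse_header(line):
    """Parse one FASTQ header line in the layout shown on the slide."""
    location, description = line.strip().lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile), int(x), int(y),
        int(member), filtered == "Y", int(control), index,
    )


header = parse_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(header.flowcell_id, header.lane, header.fails_filter)  # FC706VJ 2 True
```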
19 Phred Score
Phred quality score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
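The table follows from the standard definition Q = -10 log10(P), where P is the probability of an incorrect base call. A short sketch of the conversion in both directions follows; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


# Rebuild the rows of the table for Q = 10, 20, ..., 50.
for q in range(10, 60, 10):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.3f}%")
```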
20 Quality Control
Quality control should be conducted at multiple steps during sequencing data processing:
Raw data
Alignment
Results (expression for RNA; SNPs/mutations for DNA)
Guo, Y., et al., Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform, 2013.
21 Raw Data QC Tools
FastQC
FASTX-Toolkit
QC3 https://github.com/slzhao/QC3
NGS QC Toolkit
26 Clustering Algorithms
Start with a collection of n objects, each represented by a p-dimensional feature vector x_i, i = 1, ..., n.
The goal is to divide these n objects into k clusters so that objects within a cluster are more “similar” to each other than to objects in other clusters. k is usually unknown.
Popular methods: hierarchical clustering, k-means, SOM, mixture models, etc.
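To make the setup concrete, here is a minimal k-means sketch in plain Python. It is one of the popular methods listed above, not necessarily the one used in the course. It assumes k is supplied by the user and uses squared Euclidean distance as the dissimilarity; the function names, the random initialization, and the toy data are all illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Coordinate-wise mean of a non-empty list of feature vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    return labels, centroids


# Toy data: n = 6 objects with p = 2 features, forming two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

Because k is usually unknown in practice, a sketch like this would typically be run for several values of k and the results compared.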
44 Example of Quantile Normalization
Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5
(Tables: the original expression values for genes G1-G5 in samples S1-S3, followed by each sample column sorted in ascending order: Sort S1, Sort S2, Sort S3.)
45 Take Average for Each Row
(Tables: each row of the sorted S1-S3 table is replaced by its row average, one row at a time.)
46 Reorder
Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5
(Tables: the averaged values in S1-S3 are moved back to each gene's original position, one sample at a time.)
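The three steps (sort each sample, average each row of the sorted table, put the averages back in the original gene order) can be sketched as follows. The data below are made up for illustration and are not the values from the slides; ties are broken by original position, which is a simplifying assumption.

```python
def quantile_normalize(samples):
    """samples maps a sample name to a list of values, one per gene, in a shared gene order."""
    n_genes = len(next(iter(samples.values())))
    # Step 1: sort each sample column independently.
    sorted_columns = [sorted(values) for values in samples.values()]
    # Step 2: average across samples at each rank.
    rank_means = [
        sum(column[rank] for column in sorted_columns) / len(sorted_columns)
        for rank in range(n_genes)
    ]
    # Step 3: give each gene the mean that belongs to its rank within its own sample.
    normalized = {}
    for name, values in samples.items():
        order = sorted(range(n_genes), key=lambda i: values[i])
        result = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            result[gene_index] = rank_means[rank]
        normalized[name] = result
    return normalized


# Illustrative data: four genes measured in three samples.
raw = {"S1": [6, 2, 8, 4], "S2": [1, 7, 3, 5], "S3": [9, 5, 3, 7]}
normalized = quantile_normalize(raw)
print(normalized["S1"])  # [6.0, 2.0, 8.0, 4.0]
print(normalized["S2"])  # [2.0, 8.0, 4.0, 6.0]
```

After normalization every sample contains the same set of values (here 2, 4, 6, 8), so the samples share an identical distribution and differ only in which gene holds which value.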
47 Differential Expression Analysis
Cuffdiff from the Cufflinks package: Trapnell, C., et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, (3): p
DESeq
edgeR
NBPSeq
TSPM
baySeq
48 Which Method Is the Best?
Guo, Y., et al., Evaluation of read count based RNAseq analysis methods. BMC Genomics, Suppl 8: p. S2.
58 Test Your Hypothesis Without Performing Any Analysis
GEO
59 Test Your Hypothesis Without Performing Any Analysis
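60 Functional Analysis
Suppose that in a study we are trying to find out whether the proportion of smoking individuals is significantly different between men and women.
(Figure: a sample space of n individuals divided by sex (M, F) and smoking status, with the regions labeled a, b, c, d.)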
61 Fisher’s Exact Test
 | Male | Female | Total
Smoking | a | b | a + b
Nonsmoking | c | d | c + d
a + b + c + d = n
H0: the proportion of smokers among males equals the proportion of smokers among females.
H1: the proportion of smokers among males differs from the proportion of smokers among females.
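For a 2x2 table with cells a, b, c, d, the test computes the hypergeometric probability of each table that has the same row and column totals, then sums the probabilities of all tables no more likely than the observed one. The sketch below implements that two-sided rule with the standard library only; the function names and the counts are made up for illustration.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 table given its row and column totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value: total probability of tables as likely or less likely than observed."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x ranges over every value the top-left cell can take with the margins fixed.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        prob = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if prob <= observed * (1 + 1e-9):  # small tolerance for floating-point ties
            p_value += prob
    return min(p_value, 1.0)


# Illustrative counts: 3 smoking males, 1 smoking female, 1 nonsmoking male, 3 nonsmoking females.
print(fisher_exact_two_sided(3, 1, 1, 3))
```

With samples this small the test cannot reject H0; the p-value here is 34/70, or about 0.49.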
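62 Fisher’s Exact Test in Functional Analysis
 | Winner genes | Non-winner genes
Breast cancer genes | a | b
Non-breast cancer genes | c | d
(Figure: within the set of all genes, the winner genes and the breast cancer genes overlap; the regions are labeled a, b, c, d.)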
63 Analogy
There are 18000 balls in a box: 200 + 17800. Blindfolded, you randomly draw 100 balls.
What is the probability that you draw fewer than 50
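Drawing without replacement from a box like this is described by the hypergeometric distribution. The sketch below assumes the question concerns how many of the 200 balls of the smaller group end up among the 100 drawn; that reading is an assumption, since the question on the slide is cut off. Function names are illustrative.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """Probability of exactly k marked balls among `draws` balls drawn without replacement."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


def prob_fewer_than(k, population, marked, draws):
    """P(X < k) for the number X of marked balls drawn."""
    return sum(hypergeom_pmf(i, population, marked, draws) for i in range(k))


def prob_at_least(k, population, marked, draws):
    """P(X >= k), the upper tail used when testing for enrichment."""
    upper = min(marked, draws)
    return sum(hypergeom_pmf(i, population, marked, draws) for i in range(k, upper + 1))


# 18000 balls, 200 of them marked, 100 drawn.
print(prob_fewer_than(50, 18000, 200, 100))
print(prob_at_least(5, 18000, 200, 100))
```

On average only about 100 x 200 / 18000, or roughly 1.1, marked balls are expected in the draw, so drawing many of them is very unlikely by chance. The same logic underlies gene-list enrichment tests.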
65 Gene Set Enrichment Analysis
KS test based analysis (Ref).
GSEA does not need a winner list first.
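The idea of working from a full ranked list rather than a winner list can be illustrated with an unweighted, Kolmogorov-Smirnov-like running sum. This is a simplified sketch only: published GSEA implementations weight each hit by its ranking metric and assess significance by permutation, both of which are omitted here. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of a running sum walked down the ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene_set must contain some, but not all, of the ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up for a gene in the set, step down otherwise; the walk ends at zero.
        running += 1 / n_hits if is_hit else -1 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["A", "B", "C", "D", "E", "F", "G", "H"]  # ordered from most to least differential
print(enrichment_score(ranked, {"A", "B"}))  # 1.0: the set sits at the top of the ranking
print(enrichment_score(ranked, {"G", "H"}))  # close to -1: the set sits at the bottom
```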
66 SNV and Indel
Difficult due to a high false positive rate.
RNAMapper (Miller, et al. Genome Research, 2013)
SNVQ (Duitama, et al. BMC Genomics, 2013)
FX (Hong, et al. Bioinformatics, 2012)
OSA (Hu, et al. Bioinformatics, 2012)
67 Microsatellite Instability
Examples:
Yoon, et al. Genome Research, 2013
Zheng, et al. BMC Genomics, 2013
68 RNA Editing and Allele-Specific Expression
RNA editing tools and databases: DARNED, REDIdb, dbRES, RADAR
Allele-specific expression:
asSeq (Sun, et al. Biometrics, 2012)
AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)
69 Exogenous RNA
Virus (same as DNA)
Food RNA (you are what you eat): Wang, et al. PLOS ONE, 2012