Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.


1 Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo

2 What is Sequencing? Sequencing is the process of determining the precise order of nucleotides. Non-high-throughput sequencing: Sanger sequencing, the basic chain-termination method developed by Frederick Sanger in 1977. It generates all possible single-stranded DNA molecules complementary to a given template, beginning at a common 5' base.

3 The Pros and Cons of Sanger Sequencing Pros: Highly accurate; targetable. Cons: Costs about $15 per 1,000 base pairs, so sequencing the whole human genome (about 3 billion base pairs) would cost roughly 3 billion / 1,000 x $15 = $45 million. Low detection rate for alternative alleles.

4

5 Current Generation Sequencing
Platform | Illumina | ABI SOLiD | 454 Life Science
Price | Low | Medium | High
Read Length | 50-100 | 400-1000
Read Depth | High | Low
Difficulty | Easy | High | Easy

6

7 Sequencing Type By Source RNA: mRNA, Small RNA, Total RNA DNA: Whole Genome or targeted (Exome, mitochondrial, genes of interest, etc)

8 Sequencing Data Raw image data is more than 2 TB per sample. Raw data is about 5-15 GB per single-end sample, or 10-30 GB per paired-end sample, for RNAseq or exome sequencing. Whole genome data can easily exceed 200 GB per sample. In general, disk space of about 5x the raw data size is needed to finish processing. Raw data is usually in FASTQ format, with base quality on the Phred scale. The older Illumina pipeline uses the Phred+64 encoding; the newer CASAVA 1.8 pipeline uses the Sanger (Phred+33) encoding.

9 Single End vs Paired End Paired-end data is twice the size of single-end data. Paired-end sequencing is more expensive than single-end sequencing. Quality control is easier with paired-end data (insert size, duplicate removal). Paired-end data provides more opportunities to detect structural variants.

10 What Can You Obtain from DNAseq SNPs (require only normal or tumor); somatic mutations (require a tumor and normal pair); copy number variation (works best with whole genome sequencing); small structural variants: insertions, deletions; large structural variants: translocations, inversions.

11 What Can You Obtain from RNAseq Gene expression; SNPs (only for expressed genes); novel splicing variants; gene fusions. RNAseq has been used primarily as a replacement for microarray.

12 How does RNAseq compare to Microarray? Since 2008, people have been saying that RNAseq will replace microarray for gene expression profiling. VANTAGE stopped offering microarray service earlier this year. 1. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63. 2. Shendure, J., The beginning of the end for microarrays? Nat Methods, 2008. 5(7): p. 585-7.

13 Data Distribution Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

14 Result Consistency Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

15 RNAseq vs Microarray - advantages
Category | RNAseq | Microarray
Result Type | Rich, not limited to expression | Limited to expression only
Expression | Can quantify expression on exon and gene level | Can quantify expression on exon or gene level
Novel Discovery | Can be used for novel discovery | Can only detect what is on the chip
Analysis | Difficult | Easy
Interpretation | Difficult | Easy
Price for assay | Price has become comparable to microarray; however, the analysis hardware and analysis time may increase the final cost | Price is stable

16 Processing RNA

17 Raw data
@HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG
NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC
+
#4=DDDDDDDDDDE
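A FASTQ record such as the one on this slide spans four lines: a header starting with "@", the base calls, a separator starting with "+", and a quality string with one character per base. The following is a minimal Python sketch of a reader for that layout. The function name `read_fastq` and the short example record are illustrative assumptions, not part of the original slides, and the sketch assumes unwrapped (one line per field) records.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        yield header[1:], sequence, quality


# Hypothetical six-base record, used only to exercise the reader.
example = io.StringIO("@read1 1:N:0:ATCACG\nNTGGAG\n+\n#4=DDD\n")
records = list(read_fastq(example))
```

Checking that the sequence and quality strings have equal length catches truncated records early, before they reach an aligner.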

18 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139: the unique instrument name
136: the run id
FC706VJ: the flowcell id
2: flowcell lane
2104: tile number within the flowcell lane
15343: 'x'-coordinate of the cluster within the tile
197393: 'y'-coordinate of the cluster within the tile
1: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y: Y if the read fails filter (read is bad), N otherwise
18: 0 when none of the control bits are on, otherwise it is an even number
ATCACG: index sequence
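The header fields listed on this slide can be split mechanically: the part before the space holds seven colon-separated fields, and the part after it holds four. A minimal Python sketch follows; the class name `IlluminaHeader`, its field names, and the function `parse_header` are illustrative assumptions chosen to mirror the slide's descriptions.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index_sequence: str


def parse_header(line: str) -> IlluminaHeader:
    """Split a CASAVA 1.8 style FASTQ header into its named fields."""
    left, right = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    member, filtered, control, index = right.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member),
        filtered == "Y",  # Y means the read failed the filter
        int(control), index,
    )


header = parse_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
```

With the header parsed, reads that failed the filter can be dropped by checking `header.fails_filter`.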

19 Phred Score
Phred Quality Score | Probability of incorrect base call | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
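The table follows from the definition Q = -10 * log10(P), where P is the probability of an incorrect base call. The Python sketch below converts between the two and decodes a FASTQ quality string. The function names are illustrative, and the default ASCII offset of 33 assumes the Sanger encoding; older Illumina files would need an offset of 64.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn each quality character into its Phred score (offset 33 = Sanger, 64 = older Illumina)."""
    return [ord(ch) - offset for ch in quality]


# Quality string as it appears (truncated) on the raw-data slide.
scores = decode_quality("#4=DDDDDDDDDDE")
```

Decoding with the wrong offset shifts every score by 31, so a quick sanity check on the range of decoded scores is a cheap way to detect which encoding a file uses.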

20 Quality Control Quality control should be conducted at multiple steps during sequencing data processing – Raw data – Alignment – Results (Expression for RNA, and SNP/mutation for DNA) Guo, Y., et al., Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform, 2013.

21 Raw Data QC - Tools
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/
QC3 https://github.com/slzhao/QC3
NGS QC Toolkit http://59.163.192.90:8080/ngsqctoolkit/

22 Raw Data QC - What to Look For

23 Alignment QC - Tools
QC3 https://github.com/slzhao/QC3
QPLOT http://genome.sph.umich.edu/wiki/QPLOT
SAMStat http://samstat.sourceforge.net/

24 Alignment QC - What to Look For

25 Expression QC - Tools MultiRankSeq https://github.com/slzhao/MultiRankSeq

26 Clustering Algorithms Start with a collection of n objects, each represented by a p-dimensional feature vector x_i, i = 1, …, n. The goal is to divide these n objects into k clusters so that objects within a cluster are more “similar” to each other than to objects in other clusters. k is usually unknown. Popular methods: hierarchical, k-means, SOM, mixture models, etc.

27

28

29

30

31

32

33 Distance Calculation in Sequencing Smith-Waterman algorithm. Sequence 1 = ACACACTA; Sequence 2 = AGCACACA; w(gap) = 0; w(match) = +2; w(a, −) = w(−, b) = w(mismatch) = −1
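The Smith-Waterman recurrence fills a score matrix in which each cell is the best of zero, a diagonal step (match or mismatch), or a step that opens a gap in either sequence; the highest cell gives the best local alignment score. The sketch below is a minimal Python version using the slide's two sequences, with match = +2 and mismatch = −1. It assumes a linear gap penalty of −1, taken from the slide's w(a, −) = w(−, b) = −1 term; the function name and the tie-breaking order in the traceback are illustrative choices.

```python
from typing import List, Tuple


def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> Tuple[int, str, str]:
    """Return (best local score, aligned a, aligned b) with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    H: List[List[int]] = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top: List[str] = []
    bottom: List[str] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned_1, aligned_2 = smith_waterman("ACACACTA", "AGCACACA")
```

Under these scoring assumptions the best local score for the two sequences is 12 (seven matches and two gaps); several alignments can share that score, so the traceback returns just one of them.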

34 Distance Calculation in Microarray Pearson Correlation between two profiles (vectors): +1 ≥ Pearson Correlation ≥ −1

35 Similarity Measurements Euclidean Distance
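Both measures named on these slides can be written in a few lines. The Python sketch below computes the Pearson correlation and the Euclidean distance between two expression profiles; the function names and the toy vectors are illustrative assumptions, not data from the slides.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


r_same_shape = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

The two measures answer different questions: correlation ignores scale and compares the shape of two profiles, while Euclidean distance is sensitive to absolute expression level.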

36 Linkage Single Linkage: D(X, Y) = min(d(x, y)), x ϵ X, y ϵ Y Complete Linkage: D(X, Y) = max(d(x, y)), x ϵ X, y ϵ Y Average Linkage:
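The linkage rules translate directly into code. In the Python sketch below, single and complete linkage follow the min and max definitions on the slide. The slide's average-linkage formula did not survive extraction, so the version here assumes the common definition (the mean of all pairwise distances between the two clusters). Function names and the one-dimensional example clusters are illustrative.

```python
from itertools import product
from typing import Callable, Sequence

Distance = Callable[[float, float], float]


def single_linkage(X: Sequence[float], Y: Sequence[float], d: Distance) -> float:
    """Smallest pairwise distance between the two clusters."""
    return min(d(x, y) for x, y in product(X, Y))


def complete_linkage(X: Sequence[float], Y: Sequence[float], d: Distance) -> float:
    """Largest pairwise distance between the two clusters."""
    return max(d(x, y) for x, y in product(X, Y))


def average_linkage(X: Sequence[float], Y: Sequence[float], d: Distance) -> float:
    """Mean of all pairwise distances (assumed definition, see lead-in)."""
    return sum(d(x, y) for x, y in product(X, Y)) / (len(X) * len(Y))


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


cluster_x = [1.0, 2.0]
cluster_y = [4.0, 7.0]
single = single_linkage(cluster_x, cluster_y, abs_diff)
complete = complete_linkage(cluster_x, cluster_y, abs_diff)
average = average_linkage(cluster_x, cluster_y, abs_diff)
```

For these clusters the pairwise distances are 3, 6, 2 and 5, so the three rules give 2, 6 and 4 respectively.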

37 Expression QC - What to Look For

38 Batch Effect

39 Correction of Batch Effect Guo, Y., et al., Statistical strategies for microRNAseq batch effect reduction. Translational Cancer Research, 2014. 3(3): p. 260-265.

40 Normalization of RNAseq Reads Per Kilobase per Million mapped reads (RPKM)
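RPKM scales a gene's read count by both gene length and sequencing depth: RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to the gene, N the total number of mapped reads in the sample, and L the gene length in base pairs. A minimal Python sketch, with an illustrative function name and made-up example numbers:

```python
def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a sample with 10 million mapped reads.
value = rpkm(gene_reads=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
```

Here 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 million mapped reads gives an RPKM of 25.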

41 RNAseq Data Alignment
TopHat2 http://ccb.jhu.edu/software/tophat/index.shtml
MapSplice http://www.netlab.uky.edu/p/bioinfo/MapSplice

42 Gene Quantification
Cufflinks for RPKM http://cufflinks.cbcb.umd.edu/
HTSeq for read count http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html

43 Data
Gene Symbol | 1 | 2 | 3 | 4 | 5 | 6
DDR1 | 9.376298 | 8.961996 | 9.271935 | 8.968211 | 8.663588 | 9.214028
RFC2 | 7.950475 | 7.795976 | 7.124782 | 8.156603 | 7.821047 | 6.613421
HSPA6 | 5.584798 | 5.12491 | 5.77907 | 5.849914 | 5.593596 | 5.042853
PAX8 | 6.355186 | 6.245788 | 6.388794 | 6.737545 | 6.662428 | 6.279758
GUCA1A | 2.961001 | 3.226968 | 3.092915 | 3.187618 | 3.067353 | 3.159364
UBE1L | 7.437969 | 7.422707 | 8.298944 | 6.124551 | 6.263097 | 7.548323
THRA | 6.606546 | 6.687768 | 6.910623 | 7.166293 | 6.711748 | 6.632955
PTPN21 | 7.392678 | 6.772702 | 6.834253 | 6.840313 | 6.813115 | 6.68312
CCL5 | 2.710744 | 2.479818 | 2.51898 | 2.61285 | 2.885117 | 2.668616
CYP2E1 | 3.871231 | 4.085553 | 5.031865 | 5.053069 | 5.080394 | 5.557095
EPHB3 | 4.289411 | 3.771091 | 3.798425 | 3.893421 | 4.01667 | 4.200385
ESRRA | 7.151026 | 7.219117 | 6.900173 | 7.841436 | 7.254173 | 7.119073
CYP2A6 | 4.568492 | 4.33565 | 4.5123 | 4.672211 | 4.587597 | 4.561608
SCARB1 | 6.134823 | 6.440855 | 5.739945 | 6.269867 | 5.534482 | 5.281546
TTLL12 | 9.346916 | 8.955574 | 8.868433 | 9.825905 | 9.387397 | 9.1008
C2orf59 | 4.42666 | 5.219388 | 4.799542 | 5.204245 | 4.846079 | 3.934838
WFDC2 | 4.706794 | 4.974295 | 5.149892 | 4.417064 | 4.273504 | 4.638822
MAPK1 | 4.777312 | 4.797072 | 4.249238 | 4.252584 | 3.687591 | 4.412024
MAPK1 | 7.875045 | 7.902457 | 7.572943 | 8.10576 | 7.793828 | 7.635768
ADAM32 | 4.629726 | 5.27395 | 4.351249 | 5.249061 | 5.204216 | 5.412291

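44 Example of Quantile Normalization
Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5
Original:
Gene | S1 | S2 | S3
G1 | 2 | 4 | 4
G2 | 5 | 4 | 14
G3 | 4 | 6 | 8
G4 | 3 | 5 | 8
G5 | 3 | 3 | 9
Sort S1: 2 3 3 4 5
Sort S2: 3 4 4 5 6
Sort S3: 4 8 8 9 14
Sorted (each column is sorted independently, so a row no longer corresponds to a single gene):
S1 | S2 | S3
2 | 3 | 4
3 | 4 | 8
3 | 4 | 8
4 | 5 | 9
5 | 6 | 14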

45 Take Average for Each Row
Sorted:
S1 | S2 | S3
2 | 3 | 4
3 | 4 | 8
3 | 4 | 8
4 | 5 | 9
5 | 6 | 14
Averaged (every entry in a row is replaced by the row mean; the last row mean, 25/3, is shown as 8):
S1 | S2 | S3
3 | 3 | 3
5 | 5 | 5
5 | 5 | 5
6 | 6 | 6
8 | 8 | 8

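46 Reorder
Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5
Averaged:
S1 | S2 | S3
3 | 3 | 3
5 | 5 | 5
5 | 5 | 5
6 | 6 | 6
8 | 8 | 8
Reordered (each averaged value is placed back at the original position of the gene that held that rank in the sample):
Gene | S1 | S2 | S3
G1 | 3 | 5 | 3
G2 | 8 | 5 | 8
G3 | 6 | 8 | 5
G4 | 5 | 6 | 5
G5 | 5 | 3 | 6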
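The three steps of the quantile normalization example (sort each sample, average across each sorted row, put the averages back in the original gene order) can be sketched in plain Python as follows. The function name `quantile_normalize` is illustrative, and ties within a sample are broken by original row order, which is one of several possible conventions.

```python
from typing import List


def quantile_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Rows are genes, columns are samples; returns the normalized matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # Step 1: sort each sample independently.
    sorted_cols = [sorted(col) for col in columns]
    # Step 2: average across samples at each rank.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    # Step 3: give every gene the mean that belongs to its rank in each sample.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable for ties
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Genes G1..G5 (rows) by samples S1..S3 (columns), as in the slide example.
original = [
    [2, 4, 4],
    [5, 4, 14],
    [4, 6, 8],
    [3, 5, 8],
    [3, 3, 9],
]
normalized = quantile_normalize(original)
rounded = [[round(v) for v in row] for row in normalized]
```

After normalization every sample has the same set of values (3, 5, 5, 6 and 25/3), only assigned to different genes; rounding reproduces the whole-number table shown on the slide.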

47 Differential Expression Analysis
Cuffdiff from Cufflinks package: Trapnell, C., et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 2012. 7(3): p. 562-78.
DESeq http://bioconductor.org/packages/release/bioc/html/DESeq.html
edgeR http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
NBPSeq http://cran.r-project.org/web/packages/NBPSeq/index.html
TSPM http://omictools.com/sequencing/rna-seq/normalization-de/tspm-r-s2496.html
baySeq http://www.bioconductor.org/packages/release/bioc/html/baySeq.html

48 Which Method Is the Best? Guo, Y., et al., Evaluation of read count based RNAseq analysis methods. BMC Genomics, 2013. 14 Suppl 8: p. S2.

49 Consistency

50

51 Inconsistency
Method, Adj pvalue, Log2FC, Rank:
DESeq 0.278 3.002572
edgeR 0.047 2.92712
baySeq 0.907 NA 24962
Cuffdiff <0.001 5.8313
Samples: Disease1, Disease2, Disease3, Control1, Control2, Control3
Read count (IGHG2) 3912038338634102821764
Total Read count 49870084 65550902 71454121 35641084 44863975 49052840
Adjusted Read Count 78311471782292360

52 Combined Approach Guo, Y., et al., MultiRankSeq: Multiperspective Approach for RNAseq Differential Expression Analysis and Quality Control. BioMed Research International, 2014. 2014: p. 8.
Gene | log2FoldChange(DESeq2) | pValue(DESeq2) | pAdj(DESeq2) | log2FoldChange(edgeR) | pValue(edgeR) | pAdj(edgeR) | log2FoldChange(raw) | 1-Likelihood(baySeq) | AdjLikelihood(baySeq) | rank(DESeq) | rank(edgeR) | rank(baySeq) | rankMethod1
ENSMUSG00000090862_Rps13 | -5.86335 | 7.02E-210 | 1.07E-205 | -6.19904 | 1.80E-109 | 4.01E-105 | -6.02447 | 4.21E-07 | 1.81E-07 | 1 | 1 | 4 | 6
ENSMUSG00000058546_Rpl23a | -3.67515 | 3.27E-140 | 2.49E-136 | -3.75807 | 2.14E-58 | 2.38E-54 | -3.57503 | 2.53E-05 | 5.21E-06 | 2 | 2 | 5 | 9
ENSMUSG00000091957_Gm8841 | -4.68658 | 1.91E-72 | 7.27E-69 | -5.33723 | 1.60E-50 | 8.90E-47 | -5.14873 | 4.71E-05 | 1.70E-05 | 4 | 4 | 8 | 16
ENSMUSG00000062683_Atp5g2 | -4.62274 | 4.86E-69 | 1.48E-65 | -5.27352 | 4.06E-50 | 1.81E-46 | -5.06178 | 7.54E-05 | 2.83E-05 | 5 | 5 | 10 | 20
ENSMUSG00000082697_Gm12913 | -3.94956 | 8.13E-80 | 4.13E-76 | -4.21738 | 4.01E-52 | 2.98E-48 | -4.10532 | 0.000296 | 9.17E-05 | 3 | 3 | 14 | 20
ENSMUSG00000060128_Gm10075 | -4.59774 | 7.59E-64 | 1.65E-60 | -5.34137 | 2.30E-42 | 6.41E-39 | -5.13809 | 2.68E-05 | 8.81E-06 | 7 | 8 | 6 | 21
ENSMUSG00000058558_Rpl5 | -3.06317 | 8.73E-68 | 2.22E-64 | -3.1805 | 1.29E-42 | 4.12E-39 | -2.99193 | 0.000313 | 0.000119 | 6 | 7 | 16 | 29
ENSMUSG00000063316_Rpl27 | -3.26559 | 2.86E-62 | 5.45E-59 | -3.4434 | 2.48E-43 | 9.20E-40 | -3.27313 | 0.000409 | 0.000151 | 8 | 6 | 18 | 32
ENSMUSG00000073702_Rpl31 | -2.71836 | 4.03E-41 | 4.72E-38 | -2.8753 | 7.47E-33 | 1.67E-29 | -2.70413 | 0.000453 | 0.000167 | 13 | 10 | 19 | 42
ENSMUSG00000085279_Gm15965 | -4.21114 | 1.41E-33 | 1.34E-30 | -6.49343 | 3.47E-24 | 4.29E-21 | -6.53916 | 7.18E-05 | 2.31E-05 | 16 | 18 | 9 | 43
ENSMUSG00000078686_Mup9 | -3.77538 | 8.13E-32 | 6.19E-29 | -4.7123 | 5.65E-22 | 5.47E-19 | -4.58514 | 1.71E-13 | | 20 | 23 | 1 | 44
ENSMUSG00000093337_Mir5109 | 2.810771 | 9.06E-36 | 9.20E-33 | 3.112215 | 3.04E-32 | 6.16E-29 | 3.311894 | 0.000751 | 0.000244 | 15 | 11 | 22 | 48
ENSMUSG00000049517_Rps23 | -2.38227 | 3.15E-44 | 4.37E-41 | -2.45472 | 9.73E-29 | 1.67E-25 | -2.25997 | 0.001089 | 0.000337 | 11 | 13 | 25 | 49

53 Presentation Using Heatmap and Cluster Zhao, S., et al., Advanced Heat Map and Clustering Analysis Using Heatmap3. BioMed Research International, 2014. 2014: p. 6.

54 Difference Between Heatmaps

55 Questions We Can Answer with Cluster Microarray data quality checking – Do replicates cluster together? – Do similar conditions, time points, tissue types cluster together?

56 Presentation Using Volcano Plot

57 Presentation Using Circos Plot

58 Test Your Hypothesis Without Performing Any Analysis GEO http://www.ncbi.nlm.nih.gov/geo/

59 Test Your Hypothesis Without Performing Any Analysis

60 Functional Analysis Suppose that in a study we are trying to find out whether the proportion of smokers is significantly different between men and women. Diagram labels: Sample Space n; M; F; Smoking; a, b, c, d.

61 Fisher’s Exact Test
 | Male | Female | Total
Smoking | a | b | a + b
Nonsmoking | c | d | c + d
Total | a + c | b + d | a + b + c + d = n
H0: The proportion of smoking in males == the proportion of smoking in females
H1: The proportion of smoking in males != the proportion of smoking in females
http://www.graphpad.com/quickcalcs/contingency1.cfm
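For a 2x2 table with cells a, b, c, d, Fisher's exact test fixes the row and column totals and uses the hypergeometric distribution to ask how likely a table at least as extreme as the observed one is. The Python sketch below computes a two-sided p-value by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention for "two-sided". The function name and the example counts are illustrative assumptions.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1 = a + b          # e.g. all smokers
    col1 = a + c          # e.g. all males
    n = a + b + c + d

    def prob(k: int) -> float:
        # Hypergeometric probability that the top-left cell equals k.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so equally likely tables are not lost to rounding.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Hypothetical counts: 3 male smokers, 1 female smoker, 1 male and 3 female nonsmokers.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

With these counts the p-value is 34/70, about 0.49, so such a small table gives no evidence against H0.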

62 Fisher’s Exact Test – in Functional Analysis
 | Winner Genes | Non Winner Genes
Breast Cancer Genes | a | b
Non Breast Cancer Genes | c | d
Diagram labels: All Genes; Breast Cancer Genes; Winner Genes; a, b, c, d

63 Analogy There are 18000 Balls: 200 + 17800 in a box. Blindfolded, you randomly draw 100 balls. What is the probability that you draw less than 50
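The ball-drawing analogy is a hypergeometric question: drawing without replacement from a box in which a known number of balls are "marked". The slide's question is cut off, so the Python sketch below makes two assumptions that are not in the original: that the 200 balls are the marked group, and that the event of interest is drawing fewer than a given number of them. Function names are illustrative.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, marked: int, draws: int) -> float:
    """Probability of exactly k marked balls among the draws (no replacement)."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


def prob_fewer_than(threshold: int, population: int, marked: int, draws: int) -> float:
    """Probability of drawing strictly fewer than `threshold` marked balls."""
    upper = min(threshold, draws + 1, marked + 1)
    return sum(hypergeom_pmf(k, population, marked, draws) for k in range(upper))


# Assumed reading of the slide: 18000 balls, 200 of them marked, 100 drawn.
p_zero = hypergeom_pmf(0, 18000, 200, 100)
p_fewer_than_50 = prob_fewer_than(50, 18000, 200, 100)
```

In the functional-analysis setting the same calculation applies with genes in place of balls: the box is all genes, the marked balls are the genes in a category, and the draw is the winner list.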

64 WebGestalt http://bioinfo.vanderbilt.edu/webgestalt/

65 Gene Set Enrichment Analysis KS test based analysis (Ref). GSEA does not need a winner list first. http://www.broadinstitute.org/gsea/index.jsp

66 SNV and Indel Difficulty due to high false positive rate. RNAMapper (Miller, et al. Genome Research, 2013) SNVQ (Duitama, et al. BMC Genomics, 2013) FX (Hong, et al. Bioinformatics, 2012) OSA (Hu, et al. Bioinformatics, 2012)

67 Microsatellite instability Examples: Yoon, et al. Genome Research 2013 Zheng, et al. BMC Genomics, 2013

68 RNA Editing and Allele-specific expression RNA editing tools and database DARNED, REDidb, dbRES, RADAR Allele-specific expression asSeq (Sun, et al. Biometrics, 2012) AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)

69 Exogenous RNA Virus (Same as DNA) Food RNA (you are what you eat) Wang, et al. PLOS ONE, 2012

70 nonCoding RNA

