STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology NGS Part I RNA-Seq Short Reads Sequence Analysis Feb 29, 2012 Daniel Fernandez and Alejandro.


1 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology NGS Part I RNA-Seq Short Reads Sequence Analysis Feb 29, 2012 Daniel Fernandez and Alejandro Quiroz

2 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology 1st ACT (1 hour): Introduction. INTERLUDE: Chill Out Sessions with DJ Bowtie (10 min). 2nd ACT (1 hour 50 min): Homework help Q4 and Q5.

3 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Central Dogma of MB [Figure labels: GENOME, TRANSCRIPTOME, BIOLOGY, REVERSE ENGINEERING]

4 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Reverse Engineering: We can use sequencing to find the genome state. [Figure labels: RNA-Seq, Transcription] (Wang, Z., Nature Reviews Genetics 2009)

5 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Reverse Engineering: Once sequenced, the problem becomes computational. [Figure labels: cells, library preparation, sequencer, sequenced reads, alignment, genome, read coverage]

6 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Overview of the session. We'll cover the 3 main computational challenges of sequence analysis for counting applications: (1) Read mapping: placing short reads in the genome. (2) Reconstruction: finding the regions that originated the reads. (3) Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

7 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Trapnell, Salzberg, Nature Biotechnology 2009

8 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Short read mapping software for RNA-Seq
Seed-extend aligners:
  Aligner   Short indels   Use base qual
  Maq       No             Yes
  BFAST     Yes            No
  GASSST    Yes            No
  RMAP      Yes            Yes
  SeqMap    Yes            No
  SHRiMP    Yes            No
Burrows-Wheeler (B-W) aligners:
  Aligner   Use base qual
  BWA       Yes
  Bowtie    No
  Soap2     No

9 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology What software to use. If read quality is good (error rate < 1%) and there is a reference, BWA is a very good choice. If read quality is not good, or the reference is phylogenetically far (e.g. wolf to dog), and you have a server with enough memory, SHRiMP or BFAST should be a sensitive but relatively fast choice. What about RNA-Seq?

10 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology RNA-Seq read mapping is more complex than just sequencing. [Figure scale labels: 10s kb, 100s bp] RNA-Seq reads can be spliced, and spliced reads are most informative.

11 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Method 1: Seed-extend spliced alignment

12 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Method 2: Exon-first spliced alignment

13 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Short read mapping software for RNA-Seq
Seed-extend aligners:
  Aligner   Short indels   Use base qual
  GSNAP     No             No
  QPALMA    Yes            No
  STAMPY    Yes            Yes
  BLAT      Yes            No
Exon-first aligners:
  Aligner     Use base qual
  MapSplice   No
  SpliceMap   No
  TopHat      No
Exon-first aligners will map contiguous reads first, at the expense of spliced hits.

14 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology IGV: Integrative Genomics Viewer (The Broad Institute of MIT and Harvard). A desktop application for the visualization and interactive exploration of genomic data: microarrays, epigenomics, RNA-Seq, NGS alignments, comparative genomics.

15 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Visualizing read alignments with IGV. [Figure panels: long marks, medium marks, punctate marks]

16 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Visualizing read alignments with IGV: RNA-Seq. Gap between reads spanning exons.

17 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Visualizing read alignments with IGV: RNA-Seq close-up. What are the gray reads? We will revisit later.

18 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Overview of the session. The 3 main computational challenges of sequence analysis for counting applications: (1) Read mapping: placing short reads in the genome. (2) Reconstruction: finding the regions that originate the reads. (3) Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

19 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Scripture for RNA-Seq: Extending segmentation to discontiguous regions

20 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology The transcript reconstruction problem. [Figure scale labels: 10s kb, 100s bp] Challenges: Genes exist at many different expression levels, spanning several orders of magnitude. Reads originate from both mature mRNA (exons) and immature mRNA (introns), and it can be problematic to distinguish between them. Reads are short and genes can have many isoforms, making it challenging to determine which isoform produced each read. There are two main approaches to this problem; first, let's discuss Scripture's.

21 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Scripture Overview: (1) Map reads. (2) Scan "discontiguous" windows. (3) Merge windows & build transcript graph. (4) Filter & report isoforms.

22 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Method I: Direct assembly

23 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Method II: Genome-guided

24 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Transcriptome reconstruction method summary

25 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Pros and cons of each approach. Transcript assembly methods are the obvious choice for organisms without a reference sequence. Genome-guided approaches are ideal for annotating high-quality genomes, expanding the catalog of expressed transcripts, and comparing transcriptomes of different cell types or conditions. Hybrid approaches are suited to lesser-quality genomes or to transcriptomes that underwent major rearrangements, such as in cancer cells. More than 1000-fold variability in expression levels makes transcriptome assembly a harder problem than regular genome assembly. Genome-guided methods are very sensitive to alignment artifacts.

26 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology RNA-Seq transcript reconstruction software
Assembly:
  Program       Published
  Oases         No
  Trans-ABySS   Yes
  Trinity       No
Genome guided:
  Cufflinks
  Scripture

27 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Differences between Cufflinks and Scripture. Scripture was designed with annotation in mind. It reports all possible transcripts that are significantly expressed given the aligned data (maximum sensitivity). Cufflinks was designed with quantification in mind. It limits reported isoforms to the minimal number that explains the data (maximum precision).

28 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Differences between Cufflinks and Scripture: Example. [Figure tracks: Annotation, Scripture, Cufflinks, Alignments]

29 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Overview of the session. The 3 main computational challenges of sequence analysis for counting applications: (1) Read mapping: placing short reads in the genome. (2) Reconstruction: finding the regions that originate the reads. (3) Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

30 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Quantification. Fragmentation of transcripts results in length bias: longer transcripts have higher counts. Different experiments have different yields. Normalization is required for cross-lane comparisons: reads per kilobase of exonic sequence per million mapped reads, RPKM (Mortazavi et al., Nature Methods 2008). This is all good when genes have one isoform.
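The RPKM normalization named on this slide can be illustrated with a short calculation. The following is a minimal sketch, not code from the course: the function name `rpkm` and the example numbers are hypothetical, and it assumes the read count, exonic length, and number of mapped reads in the lane are already known.

```python
def rpkm(read_count, exonic_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads."""
    if exonic_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exonic length and library size must be positive")
    kilobases = exonic_length_bp / 1_000          # corrects for length bias
    millions = total_mapped_reads / 1_000_000     # corrects for lane yield
    return read_count / (kilobases * millions)


# Hypothetical numbers: 500 reads on a 2,000 bp transcript,
# in a lane with 10 million mapped reads.
short_gene = rpkm(500, 2_000, 10_000_000)

# A transcript twice as long with twice as many reads gets the same value,
# which is the point of the per-kilobase correction.
long_gene = rpkm(1_000, 4_000, 10_000_000)
```

Both calls divide the count by (kilobases × millions), so the two hypothetical genes end up with the same normalized expression even though their raw counts differ.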

31 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Quantification with multiple isoforms How do we define the gene expression? How do we compute the expression of each isoform?

32 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Computing gene expression Idea1: RPKM of the constitutive reads (Neuma, Alexa-Seq, Scripture)

33 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Computing gene expression: isoform deconvolution

34 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Computing gene expression: isoform deconvolution. If we knew the origin of the reads we could compute each isoform's expression. The gene's expression would be the sum of the expression of all its isoforms: $E = \mathrm{RPKM}_1 + \mathrm{RPKM}_2 + \mathrm{RPKM}_3$
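Since the origin of each read is not known, one common way to approach deconvolution is to estimate it iteratively. The sketch below is a toy expectation-maximization loop written for illustration only; it is not the algorithm of any particular program. It assumes each read comes with the set of isoforms it is compatible with, that a read is equally likely to start anywhere along an isoform, and that the function name `em_isoform_fractions` and its inputs are hypothetical.

```python
def em_isoform_fractions(compat, lengths, n_iter=200):
    """Estimate the fraction of reads produced by each isoform.

    compat[i] is the set of isoform indices read i is compatible with;
    lengths[j] is the length of isoform j in bp.
    """
    if any(len(isoforms) == 0 for isoforms in compat):
        raise ValueError("every read needs at least one compatible isoform")
    n_iso = len(lengths)
    theta = [1.0 / n_iso] * n_iso  # start from equal fractions
    for _ in range(n_iter):
        expected = [0.0] * n_iso
        for isoforms in compat:
            # E-step: split the read among its compatible isoforms,
            # proportionally to fraction / length.
            weights = {j: theta[j] / lengths[j] for j in isoforms}
            total = sum(weights.values())
            for j, w in weights.items():
                expected[j] += w / total
        # M-step: new fractions are the expected share of reads.
        theta = [e / len(compat) for e in expected]
    return theta


# Hypothetical data: two isoforms of equal length; 60 reads unique to
# isoform 0, 20 unique to isoform 1, and 20 compatible with both.
reads = [{0}] * 60 + [{1}] * 20 + [{0, 1}] * 20
fractions = em_isoform_fractions(reads, [1000, 1000])
```

In this example the ambiguous reads end up split in the same 3:1 ratio as the unambiguous ones. The estimated fractions could then be converted into per-isoform RPKM values and summed to obtain the gene's expression.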

35 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Programs to measure transcript expression
  Program     Implemented method
  Alexa-seq   Gene expression by constitutive exons
  ERANGE      Gene expression by using all exons
  Scripture   Gene expression by constitutive exons
  Cufflinks   Transcript deconvolution by solving the maximum likelihood problem
  MISO        Transcript deconvolution by solving the maximum likelihood problem
  RSEM        Transcript deconvolution by solving the maximum likelihood problem

36 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Impact of library construction methods

37 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Library construction improvements: Paired-end sequencing (adapted from the Helicos website)

38 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Paired-end reads are easier to associate to isoforms. [Figure: read pairs P1, P2, P3 shown against Isoform 1, Isoform 2, Isoform 3] Paired ends increase isoform deconvolution confidence: P1 originates from isoform 1 or 2 but not 3; P2 and P3 originate from isoform 1. Do paired-end reads also help identify reads originating in isoform 3?

39 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology We can estimate the insert size distribution: (1) Get all single-isoform reconstructions. (2) Splice and compute the insert distance. (3) Estimate the empirical insert size distribution. [Figure: read pairs P1, P2 with insert distances d1, d2]

40 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology … and use it for probabilistic read assignment. [Figure: insert distances d1, d2 implied under Isoform 1, Isoform 2, Isoform 3, scored by $P(d > d_i)$]
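One way to read the last two slides is sketched below. This is an illustrative assumption, not the method of a specific tool: the empirical insert sizes are taken from pairs that map to single-isoform genes, each candidate isoform implies an insert distance for a given pair, and the tail probability P(d > d_i) scores how plausible that distance is. The function names, the normalization across isoforms, and the example numbers are all hypothetical.

```python
from bisect import bisect_right


def tail_probability(sorted_inserts, d):
    """Empirical P(insert > d) from a sorted list of observed insert sizes."""
    n_le = bisect_right(sorted_inserts, d)  # observations <= d
    return (len(sorted_inserts) - n_le) / len(sorted_inserts)


def assignment_weights(sorted_inserts, implied_distances):
    """Turn per-isoform tail probabilities into normalized weights.

    implied_distances maps an isoform name to the insert distance the
    read pair would have if it came from that isoform.
    """
    scores = {iso: tail_probability(sorted_inserts, d)
              for iso, d in implied_distances.items()}
    total = sum(scores.values())
    if total == 0:
        # No isoform gives a plausible insert distance.
        return {iso: 0.0 for iso in scores}
    return {iso: s / total for iso, s in scores.items()}


# Hypothetical insert sizes observed on single-isoform genes.
inserts = sorted([180, 190, 200, 210, 220])

# Hypothetical pair: 200 bp apart under isoform 1, but 1000 bp apart
# under isoform 2 (for example because an extra exon lies in between).
weights = assignment_weights(inserts, {"isoform 1": 200, "isoform 2": 1000})
```

Here the pair is assigned entirely to isoform 1, because no observed insert is as long as the distance implied by isoform 2.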

41 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology And improve quantification (Katz et al., Nature Methods 2010)

42 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Paired-end reads improve reconstructions. Paired-end data complement the connectivity graph.

43 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology And merge regions. [Figure panels: single reads, paired reads]

44 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Or split regions. [Figure panels: single reads, paired reads]

45 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Summary. Paired-end reads are now routine in Illumina and SOLiD sequencers. Paired-end alignment is supported by most short read aligners. Transcript quantification depends heavily on paired-end data. Transcript reconstruction is greatly improved when using paired ends (work in progress).

46 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology The libraries we will work with are strand specific

47 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Summary. Several methods now exist to build strand-specific RNA-Seq libraries. Quantification methods support strand-specific libraries. For example, Scripture will compute expression on both strands if desired.

48 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Overview of the session. The 3 main computational challenges of sequence analysis for counting applications: (1) Read mapping: placing short reads in the genome. (2) Reconstruction: finding the regions that originate the reads. (3) Quantification: assigning scores to regions, and finding regions that are differentially represented between two or more samples.

49 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology The problem. Find genes that have different expression between two or more conditions. Find genes with isoforms expressed at different levels between two or more conditions. Find differentially used splicing events. Find alternatively used transcription start sites. Find alternatively used 3′ UTRs.

50 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Differential gene expression using RNA-Seq. (Normalized) read counts vs. hybridization intensity: we observe the individual events.

51 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology The Poisson model. Suppose you have 2 conditions and R replicates for each condition, with each replicate in its own lane L. Let's consider a single gene G. Let $C_{ik}$ be the number of reads aligned to G in lane i of condition k, with k = 1, 2 and i = 1, …, R. Assume for simplicity that all lanes give the same number of reads (otherwise introduce a normalization constant). Assume $C_{ik}$ follows a Poisson distribution with unknown mean $m_{ik}$. Use a GLM to estimate $m_{ik}$ using two parameters, a gene-dependent parameter $a$ and a sample-dependent parameter $s_k$: $\log(m_{ik}) = a + s_k$, to obtain two estimators $m_1$ and $m_2$. Alternatively, estimate a single mean $m$ using all replicates for all conditions.

52 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology The Poisson model. G is differentially expressed when $m_1 \neq m_2$. Is $P(C_{i1}, C_{i2} \mid m)$ close to $P(C_{i1} \mid m_1)\,P(C_{i2} \mid m_2)$? The likelihood ratio test is ideal to see this, and since the two models differ by one parameter, the test statistic follows a $\chi^2$ distribution with 1 degree of freedom. The $\chi^2$ can be used to assess significance. For details see Auer and Doerge, "Statistical Design and Analysis of RNA Sequencing Data", Genetics; Marioni et al., "RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays", Genome Research 2008.
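The likelihood ratio test for a single gene can be sketched with the standard library alone. This is a minimal illustration under the slide's simplifying assumption that all lanes yield the same number of reads, in which case the maximum-likelihood Poisson means are simply the per-condition and pooled averages. The function names and the example counts are hypothetical.

```python
import math


def poisson_loglik(counts, mean):
    """Log-likelihood of independent Poisson counts sharing one mean."""
    if mean == 0:
        # A zero mean can only explain all-zero counts.
        return 0.0 if all(c == 0 for c in counts) else float("-inf")
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)


def poisson_lrt(counts_1, counts_2):
    """Likelihood ratio test of m1 == m2 for one gene.

    counts_1 and counts_2 hold the read counts of the gene in the
    replicate lanes of condition 1 and condition 2.
    """
    m1 = sum(counts_1) / len(counts_1)
    m2 = sum(counts_2) / len(counts_2)
    pooled = list(counts_1) + list(counts_2)
    m = sum(pooled) / len(pooled)
    stat = 2.0 * (poisson_loglik(counts_1, m1)
                  + poisson_loglik(counts_2, m2)
                  - poisson_loglik(pooled, m))
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for one gene, three lanes per condition.
stat_same, p_same = poisson_lrt([10, 10, 10], [10, 10, 10])
stat_diff, p_diff = poisson_lrt([10, 12, 11], [100, 95, 105])
```

With identical counts the statistic is essentially zero and the p-value close to 1; with a roughly ninefold difference the p-value becomes vanishingly small. With unequal lane depths a normalization constant would have to enter the means, as the slide notes.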

53 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Cufflinks differential isoform usage. Let a gene G have n isoforms and let $p_1, \dots, p_n$ be the estimated fraction of expression of each isoform. Call this the isoform expression distribution P for G. Given two samples, testing for differential isoform usage amounts to determining whether $H_0: P_1 = P_2$ or $H_1: P_1 \neq P_2$ is true. To compare distributions, Cufflinks uses an information-content-based metric of how different two distributions are, called the Jensen-Shannon divergence. The square root of the JS divergence is normally distributed.
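For two distributions, the Jensen-Shannon divergence is the entropy of their average minus the average of their entropies. The snippet below is a small sketch of that standard definition (using natural logarithms), not code taken from Cufflinks; the function names and the isoform fractions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def jensen_shannon_divergence(p1, p2):
    """JS divergence between two isoform expression distributions."""
    if len(p1) != len(p2):
        raise ValueError("distributions must cover the same isoforms")
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return entropy(mid) - (entropy(p1) + entropy(p2)) / 2


# Hypothetical isoform fractions for one gene in two samples.
same_usage = jensen_shannon_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
switched = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
js_distance = math.sqrt(switched)  # the square root is the quantity tested
```

Identical isoform usage gives a divergence of zero, and a complete isoform switch gives the maximum value, ln 2.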

54 STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology RNA-Seq differential expression software
  Program   Underlying model                                           Notes
  DegSeq    Normal; mean and variance estimated from replicates        Works directly from reference transcriptome and read alignment
  EdgeR     Negative binomial                                          Gene expression table
  DESeq     Negative binomial                                          Gene expression table
  Myrna     Empirical                                                  Sequence reads and reference transcriptome

