Computational Methods for Analysis of Single Cell RNA-Seq Data


1 Computational Methods for Analysis of Single Cell RNA-Seq Data
Ion Măndoiu, Computer Science & Engineering Department, University of Connecticut

2 Outline
Intro to RNA-Seq
  Next-generation sequencing technologies
  RNA-Seq applications
  Analysis challenges for single cell data
Typical analysis pipeline for single-cell RNA-Seq
  Primary analysis: reads QC, mapping, and quantification
  Secondary analysis: cells QC, normalization, clustering, and differential expression
  Tertiary analysis: functional annotation
Conclusions

3 2nd Gen. Sequencing: Illumina

4 2nd Gen. Sequencing: Illumina

5 2nd Gen. Sequencing: Ion Torrent
Ion Torrent uses a high-density array of micro-machined wells to perform the sequencing reaction in a massively parallel way. Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer, and beneath that a proprietary ion sensor. The sequencer sequentially floods the chip with one nucleotide after another; in each cycle, the voltage change recorded at a well is proportional to the number of incorporated bases.
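
The flow-by-flow readout described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_flowgram`, the four-base flow order "TACG", and the noise-free integer signal are all hypothetical simplifications, not the instrument's actual flow order or signal model.

```python
def simulate_flowgram(read, flow_order="TACG", max_flows=1000):
    """Return (nucleotide, incorporated_bases) for each flow.

    `read` is the sequence of bases being incorporated in one well.
    The signal in each flow is idealized as the exact homopolymer length.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # All consecutive copies of the flowed base are incorporated at once.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


if __name__ == "__main__":
    for base, signal in simulate_flowgram("TTACGG"):
        print(base, signal)
```

A signal of 2 for a flow means a homopolymer of length two; real signals are noisy, which is why long homopolymers are hard to call on this kind of platform.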

6 3rd Gen. Sequencing: PacBio SMRT

7 3rd Gen. Sequencing: PacBio SMRT

8 3rd Gen. Sequencing: Oxford Nanopore

9 Standard (Bulk) RNA-Seq
Reverse transcribe into cDNA & shatter into fragments
Sequence fragment ends
Map reads
Transcriptome reconstruction
Gene expression quantification
Isoform expression quantification
[Figure: mRNA molecules with AAAAAA tails; reads mapped to a gene with exons A B C D E]

10 Alternative splicing [Griffith and Marra 07]

11 Alternative Splicing
Pal S. et al., Genome Research, June 2011

12 Transcriptome Reconstruction

13 Common Approaches
De novo (genome-independent reconstruction)
  Trinity, Oases, TransABySS
  de Bruijn k-mer graph
Genome guided
  Scripture: reports "all" transcripts
  Cufflinks, IsoLasso, SLIDE: minimize set of transcripts explaining reads
Annotation guided
  RABT: simulate reads from annotated transcripts

14 Genome-Guided Transcriptome Reconstruction – Multiple Solutions
[Figure: splice graph over exons 1 to 7 with candidate transcripts t1 to t4]

15 Which Solution is Most Likely?
TRIP: select smallest set of transcripts with good statistical fit between
  the fragment length distribution empirically determined during library preparation
  the fragment length distribution implied by "mapping" read pairs
[Figure: example with labels 1, 3, 2 and values 500, 300, 200]

16 TRIP Results
100x coverage, 2x100bp paired-end reads; annotations for genes

17 Why Single Cell RNA-Seq?
Macaulay and Voet, PLOS Genetics, 2014

18 Challenges
Low RNA input + low RT efficiency
  Especially problematic for low expression genes
Macaulay and Voet, PLOS Genetics, 2014

19 Challenges
Stochastic effects (e.g., transcriptional bursting) are hard to distinguish from regulated transcriptional heterogeneity
PCR amplification bias results in distortion of transcript abundances

20 SMARTer RNA-Seq Protocol

21 Correcting PCR Bias using UMIs (STRT-C1)
Islam et al.
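
The idea behind UMI-based correction can be sketched as follows: reads that share the same cell, gene, and UMI are treated as PCR copies of one original molecule and counted once. This is a minimal illustration, not the STRT-C1 pipeline; the function name `umi_counts` and the `(cell, gene, umi)` input format are assumptions, and no correction for sequencing errors within UMIs is attempted.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct UMIs per (cell, gene).

    `reads` is an iterable of (cell, gene, umi) tuples, one per mapped read.
    Duplicate UMIs for the same cell and gene are collapsed to one molecule.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


if __name__ == "__main__":
    reads = [
        ("c1", "g1", "AAC"),
        ("c1", "g1", "AAC"),  # PCR duplicate of the read above
        ("c1", "g1", "GGT"),
        ("c1", "g2", "AAC"),
    ]
    print(umi_counts(reads))
```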

22 Outline
Intro to RNA-Seq
  RNA-Seq applications
  Analysis challenges for single cell data
Typical analysis pipeline for single-cell RNA-Seq
  Primary analysis: reads QC, mapping, and quantification
  Secondary analysis: cells QC, normalization, clustering, and differential expression
  Tertiary analysis: functional annotation
Conclusions

23 Typical analysis pipeline for single-cell RNA-Seq
Reads QC -> Read mapping -> Quantification -> Cells QC -> Normalization -> Clustering -> Differential expression -> Functional analysis

24 Reads QC
Tools to analyze and preprocess fastq files
FASTX
  Charts quality statistics
  Filters sequences based on quality
  Trims sequences based on quality
  Collapses identical sequences into a single sequence
PRINSEQ
  Generates read length and quality statistics
  Filters reads based on length, quality, GC content and other criteria
  Trims reads based on length/position or quality scores

25 [Figure-only slide; no text recovered]

26 [Figure-only slide; no text recovered]

27 [Figure-only slide; no text recovered]

28 Read mapping
RNA-Seq read mapping strategies:
Ungapped mapping (with mismatches) to genome
  Cannot align reads spanning exon junctions
Local alignment (Smith-Waterman) to genome
  Very slow
Spliced alignment to genome
  Computationally harder than ungapped alignment, but much faster than local alignment
Mapping on transcript libraries
  Fastest, but cannot align reads from un-annotated transcripts
Mapping on exon-exon junction libraries
  Cannot align reads overlapping un-annotated exons
Hybrid approaches

29 Read mapping
Comparison of spliced read mapping tools
Kim et al.

30 Quantification
Cannot use raw read counts (why not?)
Islam et al.

31 Quantification
CPM = counts per million
  Ignores multireads -> underestimates expression of genes in large families
  Does not normalize for gene length -> cannot compare CPMs between genes
  Comparing CPMs between samples assumes similar transcriptome size
RPKM/FPKM = reads/fragments per kilobase per million [Mortazavi et al. 08]
  Fractionally allocates multireads based on unique read estimates
  Length for multi-isoform genes?
  Comparing FPKM between samples assumes similar (weighted) transcriptome size
TPM = transcripts per million
  Still relative measure of expression, but comparable between samples
  Most accurate estimation methods use multireads and isoform level estimation
UMI counts
  Absolute measure of expression?
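
The three relative measures above differ only in what they divide by. The sketch below uses the standard textbook definitions for a single sample; the function names and the simple per-gene `counts` and `lengths_bp` inputs are assumptions made for illustration, and multiread allocation and isoform-level estimation are deliberately ignored.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: scale by library size and gene length."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]


if __name__ == "__main__":
    counts = [10, 20, 70]
    lengths_bp = [1000, 2000, 7000]
    print(cpm(counts))
    print(rpkm(counts, lengths_bp))
    print(tpm(counts, lengths_bp))
```

In this toy input the read counts are proportional to gene length, so CPM differs across genes while RPKM and TPM are equal for all three, which is the gene-length effect noted above. TPM values always sum to one million within a sample; RPKM values do not.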

32 Quantification
Gene ambiguous reads
Isoform ambiguous reads
[Figure: reads mapped to exons A B C D E]

33 Quantification
Expectation-maximization approach (IsoEM, RSEM)
[Figure: exons A, B, C; positions i and j; labels Fa(i), Fa(j)]

34 EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
[Figure values: 0.5, 2.5, 1, 1.5; 0.2, 0.5, 1]

35 EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
4. Update transcript frequencies using maximum likelihood estimates
[Figure values: 0.5, 2.5, 1, 1.5; 0.5/6, 2.5/6, 1/6, 1.5/6]

36 EM Algorithm
1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
4. Update transcript frequencies using maximum likelihood estimates
5. Repeat steps 2-4 until convergence
[Figure values: 0.5/6, 2.5/6, 1/6, 1.5/6]
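
The five EM steps can be sketched in a few lines. This is a simplified illustration, not the IsoEM or RSEM implementation: the function name and input format are assumptions, transcript lengths and fragment-length weights are ignored, and the starting frequencies are uniform rather than random so that the result is reproducible.

```python
def em_transcript_frequencies(read_compat, n_transcripts, n_iter=200, tol=1e-9):
    """Estimate transcript frequencies from ambiguous read assignments.

    `read_compat` holds, for each read, the list of transcript indices
    the read is compatible with.
    """
    # Step 1: starting frequencies (uniform here instead of random).
    freqs = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # Steps 2-3: split each read among its transcripts in proportion
        # to the current frequencies, and sum the fractions per transcript.
        for compat in read_compat:
            denom = sum(freqs[t] for t in compat)
            for t in compat:
                expected[t] += freqs[t] / denom
        # Step 4: maximum likelihood update = normalized expected counts.
        total = sum(expected)
        new_freqs = [e / total for e in expected]
        # Step 5: stop once the frequencies no longer change.
        delta = max(abs(a - b) for a, b in zip(freqs, new_freqs))
        freqs = new_freqs
        if delta < tol:
            break
    return freqs


if __name__ == "__main__":
    # Two reads unique to transcript 0, one ambiguous, one unique to transcript 1.
    reads = [[0], [0], [0, 1], [1]]
    print(em_transcript_frequencies(reads, 2))
```

For this toy input the fixed point satisfies p = (2 + p) / 4, so the estimates approach 2/3 and 1/3.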

37 Cells QC
[Figure panels: Detected genes/cell -- main population; Detected genes/cell -- bi-modal distribution; Detected genes/cell -- minor population]

38 Normalization
Batch effects can be larger than biological effects, but can be corrected by normalization procedures
[Figure panels: CPM & TPM datasets pre-quantile normalization; CPM & TPM datasets post-quantile normalization]

39 Normalization
Quantile normalization (Irizarry et al 2002)
Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means)
  - Highest value gets matched to highest value in reference
  - 2nd highest gets matched to 2nd highest value in reference
  - And so on
[Figure panels: Distribution of TPMs; Reference distribution]
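
The rank-matching procedure can be sketched as below, using the "distribution of means" as the reference. The function name and the list-of-lists input are assumptions for illustration; ties are broken by original position rather than averaged, which a production implementation would usually handle more carefully.

```python
def quantile_normalize(cells):
    """Quantile-normalize expression vectors.

    `cells` is a list of equal-length lists, one per cell, holding
    CPM/FPKM/TPM values for the same genes in the same order.
    """
    n_genes = len(cells[0])
    sorted_cells = [sorted(values) for values in cells]
    # Reference distribution: mean of the r-th smallest value across cells.
    reference = [
        sum(s[rank] for s in sorted_cells) / len(cells) for rank in range(n_genes)
    ]
    normalized = []
    for values in cells:
        # Gene indices ordered from lowest to highest expression in this cell.
        order = sorted(range(n_genes), key=lambda g: values[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every cell has exactly the same set of values; only the assignment of values to genes differs between cells.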

40 Principal Component Analysis
Linear transformation of the data:
  1st component = direction of max. variance
  2nd component = orthogonal to 1st, max. residual variance
Used for dimensionality reduction (ignore high components)
Visualization for exploratory analysis
Feature selection
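
To make "direction of max. variance" concrete, the sketch below finds the first principal component by power iteration on the sample covariance matrix. It is a dependency-free illustration under stated assumptions: the function name is hypothetical, only the first component is computed, and the fixed starting vector could in rare cases be orthogonal to the leading direction. In practice a linear algebra library (SVD or eigendecomposition) would normally be used.

```python
import math


def first_principal_component(rows, n_iter=1000, tol=1e-12):
    """Return (direction, variance, scores) for the first principal component.

    `rows` is a list of samples (e.g., cells), each a sequence of features.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Deterministic, non-symmetric starting vector.
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        w = [x / norm for x in w]
        change = max(abs(x - y) for x, y in zip(v, w))
        v = w
        if change < tol:
            break
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores


if __name__ == "__main__":
    direction, variance, scores = first_principal_component([(0, 0), (1, 1), (2, 2)])
    print(direction, variance, scores)
```

The `scores` are the one-dimensional coordinates of each sample along the component, which is what is plotted in exploratory PCA views.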

41 What makes a good clustering?
Homogeneity: Elements within a cluster are close to each other
Separation: Elements in different clusters are further apart from each other
[Figure panels: Bad clustering; Good clustering]

42 Many clustering algorithms!
Algorithm / Parameters
K-means
  K = number of clusters
Fuzzy c-means clustering (FCM)
  K = number of clusters
  d = degree of fuzziness
Hierarchical clustering (HCS)
  Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
  Method = average, centroid, complete, median, single
EM clustering
  S = number of initial seeds
  I = number of iterations
SNN-Cliq
  n = size of the nearest neighbor list
  r = density threshold of quasi-cliques
  m = threshold on the overlapping rate for merging

43 K-Means Clustering
Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids

44-47 K-Means Clustering
[Figure sequence over four slides: scatter plot with axes "expression in condition 1" and "expression in condition 2" (ticks 1 to 5), showing centroids k1, k2, k3 in successive frames]
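
The K-means goal on slide 43 is usually pursued with Lloyd's iterations: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The sketch below is one such implementation under stated assumptions; the function names are hypothetical, initial centroids are supplied by the caller, and the procedure is only guaranteed to reach a local optimum.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(points, initial_centroids, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [tuple(float(x) for x in c) for c in initial_centroids]
    for _ in range(n_iter):
        # Assignment step.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


if __name__ == "__main__":
    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

Because the result depends on the starting centroids, it is common to run the procedure from several initializations and keep the solution with the smallest total squared distance.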

48 Clustering
Accuracy measures
TP, FP, FN, TN are counted over pairs of objects (e.g., TP = pairs placed in the same cluster that also share a ground truth class)
Purity
  $\mathrm{Purity} = \frac{1}{N} \sum_i \max_j |v_i \cap u_j|$
  U: set of ground truth classes; V: set of the computed clusters; N: total # of objects in dataset
Adjusted Rand Index (ARI)
  $\mathrm{ARI} = \frac{\binom{N}{2}(TP+TN) - [(TP+FP)(TP+FN)+(FN+TN)(FP+TN)]}{\binom{N}{2}^2 - [(TP+FP)(TP+FN)+(FN+TN)(FP+TN)]}$
Rand Index (RI)
  $\mathrm{RI} = (TP+TN)/(TP+FP+FN+TN)$
F1 Score
  $\mathrm{F1} = 2 \times TP/(2 \times TP+FP+FN)$
Mirkin's index (MI)
  Counts the disagreements in data pairs between two clusterings: the ratio of the number of disagreeing pairs to the total number of pairs. A lower value indicates better clustering.
Hubert's index (HI)
  HI = RI - MI
Corr
  Maximum weighted Pearson correlation between average expression value of each class at ground truth and computed cluster
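
The pair-counting measures can be computed directly from two label vectors. The sketch below follows the standard definitions of purity, RI, F1, and ARI, and takes Mirkin's index as the fraction of disagreeing pairs as described on the slide; the function names are assumptions, the Corr measure is not covered, and the quadratic loop over pairs is only suitable for small datasets. No guards are included for degenerate inputs that make a denominator zero.

```python
from collections import Counter
from itertools import combinations


def pair_counts(truth, pred):
    """Return (TP, FP, FN, TN) counted over all pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def purity(truth, pred):
    """Fraction of objects that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


def clustering_scores(truth, pred):
    tp, fp, fn, tn = pair_counts(truth, pred)
    m = tp + fp + fn + tn  # total number of pairs, C(N, 2)
    ri = (tp + tn) / m
    mi = (fp + fn) / m  # fraction of disagreeing pairs
    f1 = 2 * tp / (2 * tp + fp + fn)
    x = (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)
    ari = (m * (tp + tn) - x) / (m * m - x)
    return {"purity": purity(truth, pred), "RI": ri, "MI": mi,
            "HI": ri - mi, "F1": f1, "ARI": ari}


if __name__ == "__main__":
    print(clustering_scores([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]))
```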

49 Accuracy comparison (Pollen et al. 2014, MiSeq)

50 Accuracy comparison (Pollen et al. 2014, HiSeq)

51 Accuracy comparison (Zeisel et al. 2015)

52 Differential expression
Tests for differential gene expression must take both fold change and statistical significance into account
[Figure: expression comparisons marked DE (*) with fold change (FC) labels; only FC = 1.5 is recoverable]

53 Differential expression
Many reliable DE methods for data with replicates
  edgeR [Robinson et al., 2010]
  DESeq [Anders et al., 2010]
When no/few replicates are available, bootstrapping provides a robust alternative
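
One generic way to combine a fold-change threshold with a resampling-based measure of support is sketched below. It is not the bootstrapping method evaluated in this talk and not how edgeR or DESeq work: it simply resamples the expression values of one gene within each group and reports how often the fold change reaches the threshold. The function name, the pseudocount, and the restriction to up-regulation in the second group are all assumptions.

```python
import random


def bootstrap_fold_change(group_a, group_b, min_fc=1.5, n_boot=1000,
                          pseudocount=1.0, seed=0):
    """Return (observed fold change of B over A, bootstrap support).

    Support is the fraction of bootstrap resamples in which the fold
    change is at least `min_fc`.
    """
    rng = random.Random(seed)

    def fold_change(a, b):
        mean_a = sum(a) / len(a)
        mean_b = sum(b) / len(b)
        # The pseudocount avoids division by zero for unexpressed genes.
        return (mean_b + pseudocount) / (mean_a + pseudocount)

    observed = fold_change(group_a, group_b)
    hits = 0
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its size.
        a = [rng.choice(group_a) for _ in group_a]
        b = [rng.choice(group_b) for _ in group_b]
        if fold_change(a, b) >= min_fc:
            hits += 1
    return observed, hits / n_boot


if __name__ == "__main__":
    a = [10, 12, 11, 9, 10]
    b = [30, 28, 33, 31, 29]
    print(bootstrap_fold_change(a, b))
```

A gene would be called differentially expressed only if both the observed fold change and the bootstrap support exceed chosen cutoffs, which reflects the point that fold change alone is not enough.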

54 Differential expression
Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5

55 Functional analysis
[Figure: enrichment analysis workflow. Experimental data (gene expression table) and gene-set databases (a priori knowledge + existing experimental data) feed an ENRICHMENT TEST; the resulting enrichment table (e.g., Spindle, Apoptosis) supports interpretation & hypotheses]
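
A common choice for the enrichment test is the one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The slide does not say which test is used, so this is an assumption, as are the function name and its arguments.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_universe: number of genes considered
    n_in_set:   genes in the gene set (e.g., annotated "Apoptosis")
    n_selected: genes in the list of interest (e.g., DE genes)
    n_overlap:  selected genes that are also in the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 10 genes in total, 4 in the gene set, 3 selected, 2 of them in the set.
    print(enrichment_p_value(10, 4, 3, 2))
```

When many gene sets are tested, the resulting p-values would normally be adjusted for multiple testing before interpretation.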

56 [Figure-only slide; no text recovered]

57 [Figure-only slide; no text recovered]

58 Outline
Intro to RNA-Seq
  RNA-Seq applications
  Analysis challenges for single cell data
Typical analysis pipeline for single-cell RNA-Seq
  Primary analysis: reads QC, mapping, and quantification
  Secondary analysis: cells QC, normalization, clustering, and differential expression
  Tertiary analysis: functional annotation
Conclusions

59 Conclusions
The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols
  ATAC-Seq, GT-Seq, Methyl-Seq, ...
Primary analysis is compute intensive
  Requires server/cluster/cloud + Linux + scripting
  Galaxy framework provides web-based interface to many tools
Most secondary/tertiary analyses can be done on PC/Mac using R environment (some programming)
  Many can be done using web-based tools and user-friendly apps (we'll use JMP)

60 Conclusions
Development of single-cell specific analysis methods is critical for fully realizing the potential of the technology
  Allele specific expression
  Biomarker selection
  Cell type assignment
  Lineage reconstruction
  Characterization of heterogeneity
Joint analysis of bulk and single cell data is still needed to get unbiased cell type frequencies
  Can also identify and characterize cell types missed by current capture protocols

61 Single cells AND computational deconvolution

62 Acknowledgements
Sahar Al Seesi, Adrian Caciula, Marius Nicolae, Elham Sherafat, Craig Nelson, Serghei Mangul, Yvette Temate Tiagueu, Alex Zelikovsky, Edward Hemphill, James Lindsay

