Computational Methods for Analysis of Single Cell RNA-Seq Data

Name: Computational Methods for Analysis of Single Cell RNA-Seq Data
Uploaded: 2017-07-12T15:55:34+00:00
Duration: PTM34S40
Channel: Willis Carson
Description: Computational Methods for Analysis of Single Cell RNA-Seq Data

Computational Methods for Analysis of Single Cell RNA-Seq Data
Ion Măndoiu Computer Science & Engineering Department University of Connecticut

Outline Intro to RNA-Seq
Next-generation sequencing technologies RNA-Seq applications Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq Primary analysis: reads QC, mapping, and quantification Secondary analysis: cells QC, normalization, clustering, and differential expression Tertiary analysis: functional annotation Conclusions

2nd Gen. Sequencing: Illumina
3

2nd Gen. Sequencing: Illumina
4

2nd Gen. Sequencing: ION Torrent
ION Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ION sensor The sequencer sequentially floods the chip with one nucleotide after another; in each cycle the voltage change recorded at a well is proportional to the number of incorporated bases

3rd Gen. Sequencing: PacBio SMRT

3rd Gen. Sequencing: Oxford Nanopore

Standard (Bulk) RNA-Seq
AAAAAA AAAAAA AAAAAA Reverse transcribe into cDNA & shatter into fragments Sequence fragment ends A B C D E Map reads A B C D E Transcriptome reconstruction Gene expression quantification Isoform expression quantification

Alternative splicing [Griffith and Marra 07]

Alternative Splicing Pal S. et all , Genome Research, June 2011

Transcriptome Reconstruction

Common Approaches De novo (genome independent reconstruction)
Trinity, Oases, TransABySS de Brujin k-mer graph Genome guided Scripture Reports “all” transcripts Cufflinks, IsoLasso, SLIDE Minimize set of transcripts explaining reads Annotation guided RABT Simulate reads from annotated transcripts

Genome-Guided Transcriptome Reconstruction – Multiple Solutions
1 7 4 2 3 6 5 1 7 4 2 3 6 5 t1 : 1 7 4 3 6 5 t2 : 1 7 4 2 3 5 t3 : t4 : 1 7 4 3 5

Which Solution is Most Likely?
TRIP: select smallest set of transcripts with good statistical fit between fragment length distribution empirically determined during library preparation implied by “mapping” read pairs 1 3 2 500 300 200

100x coverage, 2x100bp pe reads; annotations for genes
TRIP Results 100x coverage, 2x100bp pe reads; annotations for genes

Why Single Cell RNA-Seq?
Macaulay and Voet, PLOS Genetics, 2014

Challenges Low RNA input + low RT efficiency
Especially problematic for low expression genes Macaulay and Voet, PLOS Genetics, 2014

Challenges Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity PCR amplification bias results in distortion of transcript abundances

SMARTer RNA-Seq Protocol

Correcting PCR Bias using UMIs (STRT-C1)
Islam et al.

RNA-Seq applications Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq Primary analysis: reads QC, mapping, and quantification Secondary analysis: cells QC, normalization, clustering, and differential expression Tertiary analysis: functional annotation Conclusions

Differential expression
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Tools to analyze and preprocess fastq files FASTX ( Charts quality statistics Filters sequences based on quality Trims sequences based on quality Collapses identical sequences into a single sequence PRINSEQ ( Generates read length and quality statistics Filters reads based on length, quality, GC content and other criteria Trims reads based on length/position or quality scores

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis 27

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis RNA-Seq read mapping strategies: Ungapped mapping (with mismatches) to genome Cannot align reads spanning exon-junctions Local alignment (Smith-Waterman) to genome Very slow Spliced alignment to genome Computationally harder than ungapped alignment, but much faster than local alignment Mapping on transcript libraries Fastest, but cannot align reads from un-annotated transcripts Mapping on exon-exon junction libraries Cannot align reads overlapping un-annotated exons Hybrid approaches

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Comparison of spliced read mapping tools Kim et al.

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Cannot use raw read counts (why not?) Islam et al.

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis CPM = count per million Ignores multireads  underestimates expression of genes in large families Does not normalize for gene length  cannot compare CPMs b/w genes Comparing CPMs between samples assumes similar transcriptome size RPKM/FPKM = reads/fragments per kilobase per million [Mortazavi et al. 08] Fractionally allocates multireads based on unique read estimates Length for multi-isoform genes? Comparing FPKM between samples assumes similar (weighted) transcriptome size TPM: transcripts per million Still relative measure of expression, but comparable between samples Most accurate estimation methods use multireads and isoform level estimation UMI counts Absolute measure of expression?

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Gene ambiguous reads A B C D E Isoform ambiguous reads

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Expectation-maximization approach (IsoEM, RSEM) A B C i j Fa(i) Fa (j)

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis EM Algorithm 3. Compute expected #reads for each transcript 0.5 2.5 1 1.5 1. Start with random transcript frequencies 0.2 0.5 1 2. Fractionally allocate reads to transcripts

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis EM Algorithm 3. Compute expected #reads for each transcript 0.5 2.5 1 1.5 0.5/6 2.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 4. Update transcript frequencies using maximum likelihood estimates

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis EM Algorithm 3. Compute expected #reads for each transcript 0.5/6 2.5/6 1/6 1.5/6 1. Start with random transcript frequencies 2. Fractionally allocate reads to transcripts 4. Update transcript frequencies using maximum likelihood estimates 5. Repeat steps 2-4 until convergence

Detected genes/cell -- main population
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Detected genes/cell -- main population Detected genes/cell -- bi-modal distribution Detected genes/cell -- minor population

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Batch effects can be larger than biological effects, but can be corrected by normalization procedures CPM & TPM datasets pre-quantile normalization CPM & TPM datasets post-quantile normalization

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Quantile normalization (Irizarry et al 2002) Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means) - Highest value gets matched to highest value in reference - 2nd highest gets mapped to 2nd highest value in reference - And so on Distribution of TPMs Reference distribution

Principal Component Analysis
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Principal Component Analysis Linear transformation of the data: 1st component = direction of max. variance 2nd component = orthogonal on 1st, max. residual variance Used for dimensionality reduction (ignore high components) Visualization for exploratory analysis Feature selection

What makes a good clustering?
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis What makes a good clustering? Homogeneity: Elements within a cluster are close to each other Separation: Elements in different clusters are further apart from each other Bad clustering Good clustering

Many clustering algorithms!
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Many clustering algorithms! Algorithm Parameters K-means K = Number of clusters Fuzzy c-means Clustering (FCM) K = number of clusters d = Degree of fuzziness Hierarchical Clustering (HCS) Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman Method = average, centroid, complete, median, single EM Clustering S = Number of initial seeds I = Number of iteration SNN-Cliq n = Size of the nearest neighbor list r = Density threshold of quasi-cliques m = Threshold on the overlapping rate for merging.

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis K-Means Clustering Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis K-Means Clustering 1 2 3 4 5 expression in condition 1 expression in condition 2 k1 k2 k3

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis K-Means Clustering 5 k1 4 3 expression in condition 2 k3 2 k2 1 1 2 3 4 5 expression in condition 1

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis K-Means Clustering 5 k1 4 3 expression in condition 2 k2 2 k3 1 1 2 3 4 5 expression in condition 1

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Accuracy measures Purity 𝑃𝑢𝑟𝑖𝑡𝑦= 1 𝑁 𝑖 𝑣 𝑖 ∩ 𝑢 𝑗 U: set of ground truth classes; V: set of the computed clusters; N:total # of objects in dataset Adjusted Rand Index (AR) 𝐴𝑅𝐼= 𝑁 2 𝑇𝑃+𝑇𝑁 −[(𝑇𝑃+𝐹𝑃)(𝑇𝑃+𝐹𝑁)+(𝐹𝑁+𝑇𝑁)(𝐹𝑃+𝑇𝑁)] 𝑁 −[(𝑇𝑃+𝐹𝑃)(𝑇𝑃+𝐹𝑁)+(𝐹𝑁+𝑇𝑁)(𝐹𝑃+𝑇𝑁)] Rand Index (RI) RI= (TP+TN)/(TP+FP+FN+TN) F1 Score F1 Score= 2×TP/(2×TP+FP+FN) Mirkin’s index (MI) It counts the number of disagreements in data pairs between two clustering. It is the ratio of the number of disagreeing pairs to the total number of pairs. Lower value of Mirkin’s index indicates better clustering. Hubert’s index (HI) HI = RI – MI Corr Maximum weighted Pearson correlation between average expression value of each class at ground truth and computed cluster

Accuracy comparison (Pollen et al. 2014, MiSeq)
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Accuracy comparison (Pollen et al. 2014, MiSeq)

Accuracy comparison (Pollen et al. 2014, HiSeq)
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Accuracy comparison (Pollen et al. 2014, HiSeq)

Accuracy comparison (Zeisel et al. 2015)
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Accuracy comparison (Zeisel et al. 2015)

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Tests for differential gene expression must take both fold change and statistical significance into account DE * * FC = FC = FC = 1.5

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Many reliable DE methods for data with replicates edgeR [Robinson et al., 2010] DESeq [Anders et al., 2010] When no/few replicates available bootstrapping provides a robust alternative

Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5

existing experimental data
Reads QC Read mapping Quantification Cells QC Normalization Clustering Differential expression Functional analysis Gene expression table Enrichment Table Spindle Apoptosis ENRICHMENT TEST Interpretation & Hypotheses Experimental Data Gene-set Databases A priori knowledge + existing experimental data

RNA-Seq applications Analysis challenges for single cell data Typical analysis pipeline for single-cell RNA-Seq Primary analysis: reads QC, mapping, and quantification Secondary analysis: cells QC, normalization, clustering, and differential expression Tertiary analysis: functional annotation Conclusions

Conclusions The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols ATAC-Seq, GT-Seq, Methyl-Seq, … Primary analysis is compute intensive Requires server/cluster/cloud + linux + scripting Galaxy framework ( provides web-based interface to many tools Most secondary/tertiary analyses can be done on PC/Mac using R environment (some programming) Many can be done using web-based tools and user-friendly apps (we’ll use JMP)

Conclusions Development of single-cell specific analysis methods critical for fully realizing the potential of the technology Allele specific expression Biomarker selection Cell type assignment Lineage reconstruction Characterization of heterogeneity Joint analysis of bulk and single cell data still needed to get unbiased cell type frequencies Can also identify and characterize cell types missed by current capture protocols

Single cells or AND computational deconvolution

Acknowledgements Sahar Al Seesi Adrian Caciula Marius Nicolae
Elham Sherafat Craig Nelson Adrian Caciula Serghei Mangul Yvette Temate Tiagueu Alex Zelikovsky Edward Hemphill James Lindsay

Computational Methods for Analysis of Single Cell RNA-Seq Data

Similar presentations

Presentation on theme: "Computational Methods for Analysis of Single Cell RNA-Seq Data"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computational Methods for Analysis of Single Cell RNA-Seq Data

Similar presentations

Presentation on theme: "Computational Methods for Analysis of Single Cell RNA-Seq Data"— Presentation transcript:

Similar presentations

About project

Feedback