Application of Clustering to Identify Cell Types from Single-Cell mRNA Expression Data
Elham Sherafat, Ion Mandoiu
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269
University of Connecticut School of Engineering

INTRODUCTION

Recent RNA-seq technologies have facilitated the generation of single-cell transcriptome data. However, tools and computational strategies that work for bulk cell-population RNA-seq data cannot always be applied successfully to the study of gene expression at the single-cell level, so new analysis tools are needed. In this work we look at the application of clustering methods to different RNA-seq datasets, in particular the identification of heterogeneous cell types from single-cell transcriptomes.

Previous studies have clustered multiple tissue samples by their bulk expression profiles. Clustering methods can also be used to identify hidden tissue heterogeneity on the basis of expression profiles. Clustering methods in this context fall into two groups [3]; which group applies depends mainly on the availability of prior knowledge or expectations about the relationships between cells.

Gene expression profiles are the input to the clustering methods. Most of the time these values are measured or estimated imprecisely because of biological or technical noise, which substantially affects the accuracy of the clustering result. We applied different clustering methods to different datasets and evaluated how well they cope with noisy, high-dimensional data.

DATASET DESCRIPTION

Dataset #1: RNA-Seq of neural cells (MiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = euclidean; method = complete; I = 200; S = 200; n = 5; r = 0.7; m = 0.5

Dataset #2: RNA-Seq of neural cells (HiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = cityblock; method = complete; I = 200; S = 200; n = 5; r = 0.4; m = 0.4

Dataset #3: RNA-Seq of mouse somatosensory cortex and hippocampal CA1 cells [6]
• 3005 cells
• Ground truth clusters: Astrocytes_ependymal, Endothelial-mural, Interneurons, Microglia, Oligodendrocytes, Pyramidal CA1, Pyramidal SS and 47 subtypes
• Feature selection: top 500 genes using ceftools published by [6]
• Best parameters: K = 47; d = 1.2; metric = seuclidean; method = ward; I = 200; S = 200

Dataset #4: qPCR of mouse hematopoietic system [1]
• 327 cells
• Ground truth clusters:
  • HSC (Hematopoietic stem cells)
  • CMP (Common myeloid progenitors)
  • GMP (Granulocyte/monocyte progenitors)
  • MEP (Megakaryocyte/erythroid progenitors)
  • CLP (Common lymphoid progenitors)
  • MPP (Multipotent progenitor cells)
• Feature selection: top 280 genes from [1]
• Best parameters: K = 6; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.6; m = 0.7

Dataset #5: RNA-Seq of mouse distal lung epithelial cells [4]
• 80 cells
• Ground truth clusters: Clara (Scgb1a1), Ciliated (Foxj1), AT1 (Pdpn, Ager), AT2 (Sftpc, Sftpb), BP (alveolar bipotential progenitor)
• Number of genes: 8,578
• Feature selection: top 8,578 genes from [4], then the first 10 PCs after applying PCA
• Best parameters: K = 5; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.7; m = 0.9

PCA RESULTS

[Figure panel: PCA results for the datasets; the plots are not part of this transcript.]

METHODS

We selected five clustering methods, K-means, fuzzy c-means, hierarchical clustering, EM clustering and SNN-Cliq [5], as tools to reveal cell-type heterogeneity in six different datasets. SNN-Cliq was developed recently for clustering RNA-seq data. Some datasets come with their own rules for reducing the number of features. Although these rules already reduce the dimensionality substantially, applying a feature extraction method such as PCA sometimes further improves the final performance of the algorithms. Each algorithm has parameters, listed in the table below. Results are reported for the best parameter setting found for each dataset.

Algorithm: Parameters

K-means:
  K = number of clusters
Fuzzy c-means Clustering (FCM):
  K = number of clusters
  d = degree of fuzziness
Hierarchical Clustering (HCS):
  Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
  Method = average, centroid, complete, median, single
EM Clustering:
  K = number of clusters
  S = number of initial seeds
  I = number of iterations
SNN-Cliq:
  n = size of the nearest neighbor list
  r = density threshold of quasi-cliques
  m = threshold on the overlapping rate for merging

CONCLUSION & FUTURE WORK

• HC and model-based clustering methods (EM) performed well on most datasets; the other clustering methods had less consistent performance.
• The best method and the best parameters depend on the dataset.
• A limitation of most current methods is that they do not model the noise in the expression level estimates. Density-based clustering may be a good way of handling noise.
• An additional advantage of model-based methods is that they can incorporate prior knowledge in the inference process; the value of incorporating such prior knowledge is currently under evaluation.
• Further increases in accuracy may benefit from time-series data, as in SCUBA [7].
• With an increased number of cells, scalability becomes a concern. In ongoing work we will explore scalable alignment-free clustering methods.

REFERENCES

[1] Guo, Guoji, et al. "Mapping cellular hierarchy by single-cell analysis of the cell surface repertoire." Cell Stem Cell 13.4 (2013): 492-505.
[2] Pollen, Alex A., et al. "Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex." Nature Biotechnology (2014).
[3] Stegle, Oliver, et al. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16.3 (2015): 133-145.
[4] Treutlein, B., et al. "Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq." Nature 509 (2014): 371-375.
[5] Xu, Chen, and Zhengchang Su. "Identification of cell types from single-cell transcriptomes using a novel clustering method." Bioinformatics (2015): btv088.
[6] Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
[7] Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.

EVALUATION METRICS

Seven measures are used to evaluate the clustering algorithms:

• Purity
• Adjusted Rand Index (AR)
• Rand Index (RI): RI = (TP+TN)/(TP+FP+FN+TN)
• F1 Score: F1 = 2×TP/(2×TP+FP+FN)
• Mirkin's index (MI): counts the disagreements in data pairs between two clusterings. It is the ratio of the number of disagreeing pairs to the total number of pairs. A lower value of Mirkin's index indicates a better clustering.
• Hubert's index (HI): HI = RI - MI
• Corr: maximum weighted Pearson correlation between the average expression value of each ground-truth class and that of each computed cluster

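The pair-counting measures above (RI, F1, MI, HI) can be illustrated with a minimal Python sketch. The poster does not spell out TP, FP, FN and TN, so the sketch assumes the usual pair-based reading: a pair of cells is a true positive when both the ground truth and the computed clustering put the two cells together, a false positive when only the clustering does, a false negative when only the ground truth does, and a true negative otherwise. The function names `pair_counts` and `pair_metrics` and the toy labels are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count TP, FP, FN, TN over all unordered pairs of cells.

    Assumption (not stated on the poster): a pair is "positive" when the
    computed clustering places both cells in the same cluster.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_truth:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_metrics(truth, pred):
    """Rand index, F1, Mirkin's index and Hubert's index as defined above."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("need at least two cells to form a pair")
    ri = (tp + tn) / total                    # RI = (TP+TN)/(TP+FP+FN+TN)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0   # F1 = 2TP/(2TP+FP+FN)
    mi = (fp + fn) / total                    # disagreeing pairs / all pairs
    hi = ri - mi                              # HI = RI - MI
    return {"RI": ri, "F1": f1, "MI": mi, "HI": hi}


# Toy example: six cells, three ground-truth groups, one cell misassigned.
truth = ["I", "I", "II", "II", "III", "III"]
pred = [0, 0, 0, 1, 2, 2]
print(pair_counts(truth, pred))
print(pair_metrics(truth, pred))
```

For this toy labelling the pair counts are TP = 2, FP = 2, FN = 1 and TN = 10, which gives RI = 0.8 and MI = 0.2. Under the ratio definition used here, MI is simply 1 - RI, so HI carries the same ranking information as RI on a different scale. Enumerating all pairs is quadratic in the number of cells, which is fine for datasets of the size listed above but would call for a contingency-table formulation on much larger ones.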
