Comparative Gene Expression Analysis: Data Analysis Issues and Solutions
Vipin Kumar
William Norris Professor and Head, Department of Computer Science

Problem Definition
- Goal: gain biological insights by analyzing which genes have the same or divergent behavior across the two organisms
- Techniques can identify pairs of orthologous genes between two organisms
  - C. albicans and S. cerevisiae have 4000 such pairs
11/10/2018

One Approach (Judith Berman et al.)
- Step 1: Identify clusters of functionally related orthologous genes within one organism
  - Select a functionally related group of genes
  - Find clusters using similarities computed from the gene expression data of the organism
- Step 2: Split each cluster into two clusters
  - Use the similarities computed from the gene expression data of the second organism
  - Analyze for similarities and differences

Problems With Step 1
- Clustering techniques may produce incorrect clusters due to
  - Noise
  - Varying cluster sizes
  - Varying cluster density
  - Non-globular cluster shape
  - High-dimensional data
  - Clusters that exist in subsets of the attributes
- Clusters may be overlapping
- Normalization
- Choice of similarity measure

Problems With Step 2
- Given a decomposition of genes into functionally coherent clusters for two organisms, A and B, there is a wide variety of relationships between the clusters of the two organisms
- Some relationships are not captured by the current approach
  - Example: a cluster of genes in organism A may (1) be split into two standalone clusters, or (2) be split into two groups that are just a part of larger clusters
- Focusing on one cluster at a time does not take into account cross-talk between functional categories

Alternative #1: Similarity-Based Approach
- Directly compare the pattern of similarities of a gene g in both organisms
  - The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms
  - Degree of similarity reflects the degree of overlap
- Assign a value between 0 and 1 to each pair that indicates the divergence or conservation of functionality
  - A value of 0 implies divergence of function
  - A value of 1 implies conservation of function
  - Intermediate values indicate intermediate degrees of conservation/divergence
[Figure: orthologous pair of genes]

Shared Nearest Neighbor Approach
- The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms

Shared Nearest Neighbor Approach
- For each pair of orthologues of a gene g in organisms A and B
  - Assign a measure based on the overlap of the k-nearest-neighbor lists
  - Various possibilities
    - Fraction of overlap in the k-nearest-neighbor lists (0 indicates no overlap, 1 indicates complete overlap)
    - Use a weighted measure (high weight for high ranks)
- A pair of orthologues with a high value of the measure is likely to have conserved behavior
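The unweighted variant of this measure (fraction of overlap between the two k-nearest-neighbor lists) can be sketched as follows. This is a minimal illustration, not the method used in the slides: it assumes Pearson correlation as the similarity, assumes both organisms' expression profiles are keyed by the same orthologue-pair identifier, and uses made-up toy data. The names `pearson`, `knn_list`, and `snn_overlap` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def knn_list(gene, expr, k):
    """Return the k genes most correlated with `gene`, best first."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: pearson(expr[gene], expr[g]), reverse=True)
    return others[:k]


def snn_overlap(gene, expr_a, expr_b, k):
    """Fraction of the k nearest neighbors of `gene` shared by both organisms."""
    shared = set(knn_list(gene, expr_a, k)) & set(knn_list(gene, expr_b, k))
    return len(shared) / k


# Toy data (assumed): each key stands for one orthologous pair,
# each list is that gene's expression profile in one organism.
expr_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],
          "g3": [4, 3, 2, 1], "g4": [1, 3, 2, 4]}
expr_b = {"g1": [1, 2, 3, 4], "g2": [4, 3, 2, 1],
          "g3": [1, 2, 3, 5], "g4": [1, 3, 2, 4]}

print(snn_overlap("g1", expr_a, expr_b, k=2))  # 0.5
```

In the toy data, g1's two nearest neighbors are g2 and g4 in organism A but g3 and g4 in organism B, so half of the list is shared. A rank-weighted variant would replace the set intersection with a sum of weights over the shared neighbors.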

Alternative #2: Contrast Sets (motivated by Bay and Pazzani, KDD 99)
- A contrast set is a set of genes that have very high similarity (in expression patterns) in one organism and low similarity in the other organism
- Contrast sets can be overlapping
- The set of candidates is exponentially large
- Recent advances make it possible to prune the search space and compute them efficiently
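To make the definition concrete, here is a deliberately naive sketch that enumerates every gene set of a fixed size and keeps those that are tight in organism A and loose in organism B. It is not the Bay and Pazzani algorithm and does none of the pruning mentioned above, so it is only usable on tiny inputs. The function name, the pairwise-threshold criterion, and the `high`/`low` cutoffs are all assumptions made for illustration.

```python
from itertools import combinations


def contrast_sets(genes, sim_a, sim_b, size, high, low):
    """Naively list gene sets that are tight in A but loose in B.

    sim_a and sim_b map frozenset({g, h}) to a similarity value.
    A set qualifies when every pair in it has similarity >= high
    in A and <= low in B (an assumed criterion).
    """
    found = []
    for subset in combinations(sorted(genes), size):
        pairs = [frozenset(p) for p in combinations(subset, 2)]
        tight_in_a = all(sim_a[p] >= high for p in pairs)
        loose_in_b = all(sim_b[p] <= low for p in pairs)
        if tight_in_a and loose_in_b:
            found.append(subset)
    return found


# Toy similarities (assumed): g1, g2, g3 move together in A only.
genes = ["g1", "g2", "g3", "g4"]
core = {"g1", "g2", "g3"}
sim_a, sim_b = {}, {}
for g, h in combinations(genes, 2):
    inside = g in core and h in core
    sim_a[frozenset((g, h))] = 0.9 if inside else 0.1
    sim_b[frozenset((g, h))] = 0.1 if inside else 0.8

print(contrast_sets(genes, sim_a, sim_b, size=3, high=0.8, low=0.3))
```

Swapping `sim_a` and `sim_b` in the call looks for sets that are tight in B and loose in A. The number of candidate subsets grows combinatorially with the number of genes, which is the blow-up the slide refers to.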

Alternatives for Step 2
- Assume that the output of Step 1 is accurate
- Could apply statistical tests for comparing distributions
  - T-test commonly used for comparing individual genes
  - Issues for comparing clusters using this scheme
    - Need to define a multi-dimensional version of the t-test
    - Only tests equality of the sample means
    - Assumes that the conditions are the same for the samples
- Could apply techniques developed for comparing partitions (Strehl and Ghosh, 2002)
  - Measures of distance between partitions
  - Evaluate which clusters contribute most to the distance
  - Catch: works only for the same data set (correlation matrices for the two organisms in this case)
- Need a more general solution
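As one example of a partition-comparison measure, the sketch below computes normalized mutual information (NMI) between two labelings of the same genes, using the geometric-mean normalization. The slide does not say which measure was intended, so treating NMI as the measure is an assumption, and `nmi` is a hypothetical helper name. NMI is a similarity (1 for identical partitions), so a distance would be something like 1 minus this value.

```python
from collections import Counter
from math import log, sqrt


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from joint and marginal cluster counts.
    mi = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    # Entropy of each partition.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster partition carries no information
    return mi / sqrt(h_a * h_b)


# Same grouping under different label names vs. an unrelated grouping.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(round(same, 6), round(unrelated, 6))
```

Both label lists must index the same items in the same order, which is exactly the "same data set" restriction noted in the catch above.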

General Solution to Step 2
- Compare sets of clusters derived from two different but related data sets
- Biologically inspired overlap-based approach:
  - Consider cluster C1 of genes for the first organism and C2 for the second
  - |C1∩C2|/|C2| > α1 implies genes in C2 are still working together for a function similar to C1
  - Else, |C1∩C2|/|C2| < α2 implies genes in C2 have diverged into some other functional category
- Guidelines for choosing the α’s:
  - Ideally, α1→1 and α2→0
  - α1 should be small enough to allow splits into more than two clusters
  - Similarly, α2 should be just high enough to be able to identify outliers
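The overlap test above can be sketched in a few lines. The thresholds 0.8 and 0.2 are illustrative assumptions, not values from the slides, and the "undetermined" label for ratios that fall between α2 and α1 is also an assumption, since the slides do not say how that case is handled. `classify_overlap` is a hypothetical name.

```python
def classify_overlap(c1, c2, alpha1, alpha2):
    """Label cluster c2 relative to c1 using the ratio |C1 ∩ C2| / |C2|."""
    ratio = len(c1 & c2) / len(c2)
    if ratio > alpha1:
        return ratio, "conserved"   # C2 still works together like C1
    if ratio < alpha2:
        return ratio, "diverged"    # C2 moved to another functional category
    return ratio, "undetermined"    # between the thresholds (assumed label)


# Toy clusters of orthologue identifiers (assumed data).
c1 = {"g1", "g2", "g3", "g4", "g5"}
candidates = {
    "C2a": {"g1", "g2", "g3", "g4"},
    "C2b": {"g5", "g6", "g7", "g8", "g9", "g10"},
    "C2c": {"g1", "g2", "g8", "g9"},
}

for name, c2 in candidates.items():
    ratio, label = classify_overlap(c1, c2, alpha1=0.8, alpha2=0.2)
    print(f"{name}: ratio={ratio:.2f} -> {label}")
```

Because the ratio is normalized by |C2| rather than |C1|, a small C2 that sits entirely inside C1 counts as conserved, which is what allows C1 to split into several clusters in the second organism.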