
1 Comparative Gene Expression Analysis: Data Analysis Issues and Solutions
Vipin Kumar, William Norris Professor and Head, Department of Computer Science

2 Problem Definition
Goal: gain biological insights by analyzing which genes behave the same or diverge across the two organisms
Techniques exist for identifying pairs of orthologous genes between two organisms
C. albicans and S. cerevisiae have 4000 such pairs
11/10/2018

3 One Approach (Judith Berman, et al.)
Step 1: Identify clusters of functionally related orthologous genes within one organism
  Select a functionally related group of genes
  Find clusters using similarities computed from the gene expression data of the organism
Step 2: Split each cluster into two clusters
  Use the similarities computed from the gene expression data of the second organism
  Analyze for similarities and differences

4 Problems With Step 1
Clustering techniques may produce incorrect clusters due to
  Noise
  Varying cluster sizes
  Varying cluster density
  Non-globular cluster shape
  High-dimensional data
  Clusters that exist in subsets of the attributes
Clusters may be overlapping
Normalization
Choice of similarity measure

5 Problems With Step 2
Given a decomposition of genes into functionally coherent clusters for two organisms, A and B, there is a wide variety of relationships between the clusters of the two organisms
Some relationships are not captured by the current approach
  Example: a cluster of genes in organism A may (1) be split into two standalone clusters, or (2) be split into two groups that are each just a part of larger clusters
Focusing on one cluster at a time does not take into account cross-talk between functional categories

6 Alternative #1: Similarity-Based Approach
Directly compare the pattern of similarities of a gene g in both organisms
  The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms
  Degree of similarity reflects the degree of overlap
Assign a value between 0 and 1 to each pair that indicates the divergence or conservation of functionality
  A value of 0 implies divergence of function
  A value of 1 implies conservation of function
  Intermediate values indicate intermediate degrees of conservation/divergence
[Figure: orthologous pair of genes]

7 Shared Nearest Neighbor Approach
The idea is that the function of a gene is conserved if its relationship to other genes is similar in both organisms

8 Shared Nearest Neighbor Approach
For each pair of orthologues of a gene g in organisms A and B
  Assign a measure based on the overlap of the k-nearest-neighbor lists
Various possibilities
  Fraction of overlap in the k-nearest-neighbor lists (0 indicates no overlap, 1 indicates complete overlap)
  A weighted measure (high weight for high ranks)
A pair of orthologues with a high value of the measure is likely to have conserved behavior
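The unweighted "fraction of overlap" variant can be sketched as below. This is only an illustration under stated assumptions: Pearson correlation stands in for the similarity measure (the slides do not fix one), both organisms' expression profiles are keyed by a shared orthologue-pair ID, and the names `pearson`, `knn_list`, `snn_overlap`, and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(vx * vy)

def knn_list(gene, expr, k):
    """Return the k genes most similar to `gene` within one organism, best first."""
    others = [g for g in expr if g != gene]
    others.sort(key=lambda g: pearson(expr[gene], expr[g]), reverse=True)
    return others[:k]

def snn_overlap(gene, expr_a, expr_b, k):
    """Fraction of the k-nearest-neighbor list shared between organisms A and B.

    0 means no shared neighbors, 1 means identical neighbor sets.
    """
    neighbors_a = set(knn_list(gene, expr_a, k))
    neighbors_b = set(knn_list(gene, expr_b, k))
    return len(neighbors_a & neighbors_b) / k

# Toy data (hypothetical): keys are orthologue-pair IDs shared by both organisms.
expr_a = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [1, 2, 3, 5],
    "g4": [4, 3, 2, 1],
    "g5": [1, 3, 2, 4],
}
expr_b = {
    "g1": [1, 2, 3, 4],
    "g2": [1, 2, 3, 4.5],
    "g3": [4, 3, 2, 1],
    "g4": [3, 1, 4, 2],
    "g5": [1, 2, 3, 5],
}

score = snn_overlap("g1", expr_a, expr_b, k=2)
print(score)  # 0.5: g1 keeps neighbor g2 but swaps g3 for g5
```

The rank-weighted variant mentioned on the slide would replace the set intersection with a sum of weights over shared neighbors; the weighting scheme is not specified in the slides, so it is left out here.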

9 Alternative #2: Contrast Sets (motivated by Bay and Pazzani, KDD 99)
A contrast set is a set of genes that have very high similarity (in expression patterns) for one organism and low similarity for the other organism
Contrast sets can be overlapping
The set of candidates is exponentially large
Recent advances make it possible to prune the search space and compute them efficiently
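A minimal sketch of testing whether one candidate gene set qualifies as a contrast set in the sense above. It rests on several assumptions that are not from the slides: "similarity of a set" is taken to be the mean pairwise similarity, the thresholds `high` and `low` are arbitrary illustrative values, and similarities are stored in a dict keyed by `frozenset` pairs. It does not show the pruned search over candidate sets.

```python
from itertools import combinations

def mean_pairwise(genes, sim):
    """Mean similarity over all unordered gene pairs in `genes`."""
    pairs = list(combinations(sorted(genes), 2))
    return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

def is_contrast_set(genes, sim_a, sim_b, high=0.8, low=0.3):
    """True if the set is tightly similar in one organism and loose in the other.

    `high` and `low` are hypothetical thresholds chosen for illustration.
    """
    if len(genes) < 2:
        return False  # a single gene has no pairwise similarity
    a = mean_pairwise(genes, sim_a)
    b = mean_pairwise(genes, sim_b)
    return (a >= high and b <= low) or (b >= high and a <= low)

# Toy similarity tables (hypothetical values) for organisms A and B.
sim_a = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): 0.85,
    frozenset(("g2", "g3")): 0.95,
}
sim_b = {
    frozenset(("g1", "g2")): 0.1,
    frozenset(("g1", "g3")): 0.2,
    frozenset(("g2", "g3")): 0.0,
}

print(is_contrast_set({"g1", "g2", "g3"}, sim_a, sim_b))
```

Because every subset of the genes is a candidate, a brute-force loop over this check grows exponentially with the number of genes, which is why the pruning mentioned on the slide matters.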

10 Alternatives for Step 2
Assume that the output of step 1 is accurate
Could apply statistical tests for comparing distributions
  The t-test is commonly used for comparing individual genes
  Issues for comparing clusters using this scheme
    Need to define a multi-dimensional version of the t-test
    Only tests equality of the sample means
    Assumes that the conditions are the same for the samples
Could apply techniques developed for comparing partitions (Strehl and Ghosh, 2002)
  Measures of distance between partitions
  Evaluate which clusters contribute most to the distance
  Catch: works only for the same data set (correlation matrices for the two organisms in this case)
Need a more general solution
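For reference, the single-gene t-test mentioned above can be written as follows. The slides do not say which variant is meant; the unequal-variance (Welch) form is shown here as an assumption, and the symbols are introduced only for illustration: \bar{x}_A, s_A^2, and n_A are the sample mean, sample variance, and number of measurements of the gene in organism A, and likewise for B.

```latex
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
```

The numerator is just the difference of the two sample means, which is why the test only addresses equality of means and says nothing about other differences between the two expression patterns.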

11 General solution to step 2
Compare sets of clusters derived from two different but related data sets
Biologically inspired overlap-based approach: consider cluster C1 of genes for the first organism and C2 for the second
  |C1∩C2|/|C2| > α1 implies the genes in C2 are still working together for a function similar to C1
  Else, |C1∩C2|/|C2| < α2 implies the genes in C2 have diverged into some other functional category
Guidelines for choosing the α’s:
  Ideally, α1→1 and α2→0
  α1 should be small enough to allow splits into more than two clusters
  Similarly, α2 should be just high enough to be able to identify outliers
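The overlap rule can be sketched as a small function. This is an illustration under assumptions: clusters are given as sets of orthologue IDs shared across organisms, the threshold values used in the example are arbitrary, and the label "undetermined" for ratios between α2 and α1 is a choice made here, since the slides do not say how that middle band is handled.

```python
def classify_overlap(c1, c2, alpha1, alpha2):
    """Classify cluster C2 (second organism) relative to C1 (first organism).

    Uses the ratio |C1 ∩ C2| / |C2| from the slide:
      ratio > alpha1  -> "conserved"  (genes in C2 still work together as in C1)
      ratio < alpha2  -> "diverged"   (genes in C2 moved to another category)
      otherwise       -> "undetermined" (assumption: not specified in the slides)
    """
    c1, c2 = set(c1), set(c2)
    if not c2:
        raise ValueError("C2 must contain at least one gene")
    ratio = len(c1 & c2) / len(c2)
    if ratio > alpha1:
        return "conserved"
    if ratio < alpha2:
        return "diverged"
    return "undetermined"

# Hypothetical clusters and thresholds, for illustration only.
c1 = {"g1", "g2", "g3", "g4"}
print(classify_overlap(c1, {"g1", "g2", "g3"}, alpha1=0.8, alpha2=0.2))  # conserved
print(classify_overlap(c1, {"g7", "g8", "g9"}, alpha1=0.8, alpha2=0.2))  # diverged
print(classify_overlap(c1, {"g1", "g9"}, alpha1=0.8, alpha2=0.2))        # undetermined
```

Note that the ratio is normalized by |C2| only, so a small C2 fully contained in a large C1 counts as conserved. That asymmetry is what lets one cluster in the first organism map onto several smaller clusters in the second.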

