The Incompatible Desiderata of Gene Cluster Properties Rose Hoberman Carnegie Mellon University joint work with Dannie Durand.

Slides:



Advertisements
Similar presentations
Sorting by reversals Bogdan Pasaniuc Dept. of Computer Science & Engineering.
Advertisements

Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU),
Greedy Algorithms CS 466 Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix of.
Mauro Sozio and Aristides Gionis Presented By:
Gene an d genome duplication Nadia El-Mabrouk Université de Montréal Canada.
Introduction to Bioinformatics
Evaluating the Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.
A New Biclustering Algorithm for Analyzing Biological Data Prashant Paymal Advisor: Dr. Hesham Ali.
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.
Heuristic alignment algorithms and cost matrices
Bioinformatics and Phylogenetic Analysis
Introduction to Bioinformatics Algorithms Greedy Algorithms And Genome Rearrangements.
Of Mice and Men Learning from genome reversal findings Genome Rearrangements in Mammalian Evolution: Lessons From Human and Mouse Genomes and Transforming.
Sequence Comparison Intragenic - self to self. -find internal repeating units. Intergenic -compare two different sequences. Dotplot - visual alignment.
Sequence similarity.
Genomic Rearrangements CS 374 – Algorithms in Biology Fall 2006 Nandhini N S.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
Similar Sequence Similar Function Charles Yan Spring 2006.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Fast identification and statistical evaluation of segmental homologies in comparative maps Peter Calabrese 1, Sugata Chakravarty 2 and Todd Vision 3 1.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
7-2 Estimating a Population Proportion
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Pearson Prentice-Hall, Inc.Chap 8-1 Statistics for Managers Using Microsoft® Excel 5th Edition.
Inductive Generalizations Induction is the basis for our commonsense beliefs about the world. In the most general sense, inductive reasoning, is that in.
Greedy Algorithms And Genome Rearrangements An Introduction to Bioinformatics Algorithms (Jones and Pevzner)
Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 18, 2010 Lecture hour 18 Nataša Pržulj
Greedy Algorithms CS 498 SS Saurabh Sinha. A greedy approach to the motif finding problem Given t sequences of length n each, to find a profile matrix.
Chapter 3. Community Detection and Evaluation May 2013 Youn-Hee Han
Identifying conserved segments in rearranged and divergent genomes Bob Mau, Aaron Darling, Nicole T. Perna Presented by Aaron Darling.
Basic terms:  Similarity - measurable quantity. Similarity- applied to proteins using concept of conservative substitutions Similarity- applied to proteins.
통계적 추론 (Statistical Inference) 삼성생명과학연구소 통계지원팀 김선우 1.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Section 7-1 Review and Preview.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
BPS - 3rd Ed. Chapter 161 Inference about a Population Mean.
Measures of Conserved Synteny Work was funded by the National Science Foundation’s Interdisciplinary Grants in the Mathematical Sciences All work is joint.
PREETI MISRA Advisor: Dr. HAIXU TANG SCHOOL OF INFORMATICS - INDIANA UNIVERSITY Computational method to analyze tandem repeats in eukaryote genomes.
Confidence Interval Estimation For statistical inference in decision making:
Significance Tests for Max-Gap Gene Clusters Rose Hoberman joint work with Dannie Durand and David Sankoff.
Evaluation of gene-expression clustering via mutual information distance measure Ido Priness, Oded Maimon and Irad Ben-Gal BMC Bioinformatics, 2007.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
1 MAVID: Constrained Ancestral Alignment of Multiple Sequence Author: Nicholas Bray and Lior Pachter.
Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.
Gene Family Size Distributions Brought to You By Your Neighorhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman.
Statistical Tests We propose a novel test that takes into account both the genes conserved in all three regions ( x 123 ) and in only pairs of regions.
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
CHAPTER 2.3 PROBABILITY DISTRIBUTIONS. 2.3 GAUSSIAN OR NORMAL ERROR DISTRIBUTION  The Gaussian distribution is an approximation to the binomial distribution.
1 Repeats!. 2 Introduction  A repeat family is a collection of repeats which appear multiple times in a genome.  Our objective is to identify all families.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Learning and Removing Cast Shadows through a Multidistribution Approach Nicolas Martel-Brisson, Andre Zaccarin IEEE TRANSACTIONS ON PATTERN ANALYSIS AND.
A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Intelligent Exploration for Genetic Algorithms Using Self-Organizing.
3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.
Conservation of Combinatorial Structures in Evolution Scenarios
3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.
Tests for Gene Clustering
Identifying conserved spatial patterns in genomes
Mattew Mazowita, Lani Haque, and David Sankoff
A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee.
SEG5010 Presentation Zhou Lanjun.
How to Define a Cluster? constructive or algorithmic definition
Discovering Frequent Poly-Regions in DNA Sequences
Presentation transcript:

The Incompatible Desiderata of Gene Cluster Properties Rose Hoberman Carnegie Mellon University joint work with Dannie Durand

How to detect segmental homology? Intuitive notions of what gene clusters look like Enriched for homologous gene pairs Neither gene content nor order is perfectly preserved How can we define a gene cluster formally?

Definitions will be application-dependent If the goal is to estimate the number of inversions, then gene order should be preserved If the goal is to find duplicated segments, allow some disorder

Gene Clusters Definitions Large-Scale Duplications Vandepoele et al 02 McLysaght et al 02 Hampson et al 03 Panopoulou et al 03 Guyot & Keller, 04 Kellis et al, Genome rearrangements Bourque et al, 05 Pevzner & Tesler 03 Coghlan and Wolfe Functional Associations between Genes Tamames 01 Wolf et al 01 Chen et al 04 Westover et al Algorithmic and Statistical Communities Bergeron et al 02 Calabrese et al 03 Heber & Stoye 01...

Groups find very different clusters when analyzing the same data

Cluster locations differ from study to study  Inference of duplication mechanism for individual genes varies greatly The Genomes of Oryza sativa: A History of Duplications Yu et al, PLoS Biology 2005

Goals: Characterizing existing definitions Formal properties form a basis for comparison Gene cluster desiderata

Outline Introduction  Brief overview of gene cluster identification Proposed properties for comparison Analysis of data: nested property

Detecting Homologous Chromosomal Segments ( a marker-based approach ) 1. Find homologous genes 2. Formally define a “gene cluster” 3. Devise an algorithm to identify clusters 4. Statistically verify that clusters indicate common ancestry

Cluster definitions in the literature Descriptive: r-windows connected components (Pevzner & Tesler 03) common intervals (Uno and Tagiura 00) max-gap … Constructive : LineUp (Hampson et al 03) CloseUp (Hampson et al 05) FISH (Calabrese et al 03) AdHoRe (Vandepoele et al 02) Gene teams (Bergeron et al 02) greedy max-gap (Hokamp 01) … Require search algorithms Harder to reason about formally

Cluster definitions in the literature Descriptive: r-windows connected components (Pevzner & Tesler 03) common intervals (Uno and Tagiura 00) max-gap … Constructive : LineUp (Hampson et al 03) CloseUp (Hampson et al 05) FISH (Calabrese et al 03) AdHoRe (Vandepoele et al 02) Gene teams (Bergeron et al 02) greedy max-gap (Hokamp 01) … I illustrate properties with a few definitions

r-windows Two windows of size r that share at least m homologous gene pairs (Calvacanti et al 03, Durand and Sankoff 03, Friedman & Hughes 01, Raghupathy and Durand 05) r = 4, m ≥ 2

max-gap cluster A set of genes form a max-gap cluster if the gap between adjacent genes is never greater than g on either genome Widely used definition in genomic studies g  2 g  3

Outline Introduction Brief overview of existing approaches  Proposed properties for comparison Analysis of data: nested property

Proposed Cluster Properties  Symmetry Size Density Order Orientation Nestedness Disjointness Isolation Temporal Coherence

Symmetry Many existing cluster algorithms are not symmetric with respect to chromosome =? clusters found

Asymmetry: an example FISH (Calabrese et al, 2003) Constructive cluster definition: clusters correspond to paths through a dot-plot Publicly available software Statistical model

Asymmetry: an example FISH Euclidian distance between gene pairs is constrained Paths in the dot-plot must always move to the right

Switching the axes yields different clusters FISH Euclidian distance between markers is constrained Paths in the dot-plot must always move to the right

Ways to regain symmetry 1. Paths in the dot-plot must always move down and to the right  miss the inversion 2. Paths can move in any direction  statistics becomes difficult Regaining symmetry entails some tradeoffs

Proposed Cluster Properties Symmetry  Size  Density Order Orientation Nestedness Disjointness Isolation Temporal Coherence

Cluster Parameters size: number of homologous pairs in the cluster length: total number of genes in the cluster density: proportion of homologous pairs (size/length) size = 5, length = 12 density = 5/12 gap ≤ 3 genes

cluster grows to its natural size cluster of size m may be of length m to g ( m -1)+ m maximal length grows as size grows gap  g length  r cluster size is constrained cluster of size m may be of length m to r maximal length is fixed, regardless of cluster size max-gap clusters r-windows

A tradeoff: local vs global density max-gap constrains local density only weakly constrains global density (≥ 1/(g+1)) r-window constrains global density only weakly constrains local density (maximum possible gap ≤ r-m)

Even when global density is high, Density = 12/18 a region may not be locally dense

Size vs Density: An example Maximum GapCluster SizePost-Processing McLysaght et al, 2002 constrainedtest statistic Panopoulou et al, 2003 test statisticconstrained merged nearby clusters Application: all-against-all comparison of human chromosomes to find duplicated blocks

Panopoulou et al 2003 Size >= 2 Gap ≤ Gap Size Large and Dense Small but dense Large but less dense McLysaght et al, 2002 Gap ≤ 30, Size ≥ 6 A Tradeoff in Parameter Space

Proposed Cluster Properties Symmetry Size Density  Order  Orientation Disjointntess Isolation Nestedness Temporal Coherence

Order and Orientation Local rearrangements will cause both gene order and orientation to diverge Overly stringent order constraints could lead to false negatives Partial conservation of order and orientation provide additional evidence of regional homology density = 6/8

Wide Variation in Order Constraints None (r-windows, max-gap,...) Explicit constraints: Limited number of order violations (Hampson et al, 03) Near-diagonals in the dot-plot (Calabrese et al 03,...) Test statistic (Sankoff and Haque, 05) Implicit constraints: via the search algorithm (Hampson et al 05,...)

Proposed Cluster Properties Symmetry Size Density Disjointness Isolation Order Orientation  Nestedness Temporal Coherence

Nestedness In particular, implicit ordering constraints are imposed by many greedy, agglomerative search algorithms Formally, such search algorithms will find only nested clusters A cluster of size m is nested if it contains sub-clusters of size m-1,...,1

Greedy Algorithms Impose Order Constraints A greedy, agglomerative algorithm initializes a cluster as a single homologous pair searches for a gene in proximity on both chromosomes either extends the cluster and repeats, or terminates g = 2

Greediness: an example (Bergeron et al, 02) No greedy, agglomerative algorithm will find this cluster There is no max-gap cluster of size 2 (or 3) In other words, the cluster is not nested g = 2 A max-gap cluster of size four

Thus: different results when searching for max-gap clusters Greedy algorithms agglomerative find nested max-gap clusters Gene Teams algorithm (Bergeron et al 02; Beal et al 03,...) divide-and-conquer finds all max-gap clusters, nested or not

An example of a greedy search: CloseUp (Hampson et al, Bioinformatics, 2005) Software tool to find clusters Goal: statistical detection of chromosomal homology using density alone Method: greedy search for nearby matches terminates when density is low randomization to statistically verify clusters

A comparative study (Hampson et al, 05) Empirical comparison: CloseUp: “density alone”, but greedy LineUp and ADHoRe: density + order information evaluated accuracy on synthetic data Is order information necessary or even helpful for cluster detection?

Result: CloseUp had comparable performance Their conclusion: order is not particularly helpful My conclusion: results are actually inconclusive, since CloseUp implicitly constrains order A comparative study (Hampson et al, 05) Is order information necessary or even helpful for cluster detection?

Proposed Cluster Properties Symmetry Size Density Order Orientation Nestedness  Disjointness  Isolation Temporal Coherence

Gene clusters: islands of homology in a sea of interlopers How can we formally describe this intuitive notion?

Islands of Homology Disjoint: A homologous gene pair should be a member of at most one cluster Isolated: The minimum distance between clusters should be larger than the maximum distance between homologous gene pairs within the cluster

Various types of constraints lead to overlapping (or nearby) clusters that cannot be merged If we search for clusters with density ≥ ½: If we search for nested max-gap clusters, g=1:

Our Proposed Cluster Properties Symmetry Size Density Disjointness Isolation Order Orientation Nestedness Temporal Coherence

Temporal coherence Divergence times of homologous pairs within a block should agree now before time

Outline Introduction Brief overview of existing approaches Proposed properties for comparison  My analysis of data: nested property Many groups use a greedy, agglomerative search to find gene clusters Does a greedy search have a large effect on the set of clusters identified in real data?

Data Gene orthology data: bacterial: GOLDIE database eukaryotes: InParanoid database 10,33817,70922,216Human & Chicken 14,76825,38322,216Human & Mouse 1,3154,2454,108E. coli & B. subtilis orthologsgenes (2)genes (1) pairwise genome comparisons

Methods Maximal max-gap clusters Gene Teams software Maximal nested max-gap clusters simple greedy heuristic (no merging) For each genome comparison and gap size:

Percent of gene teams that are nested Percentage Nested

Number of genes in some gene team of size 7 or greater that are not in any nested cluster of 7 or greater Chicken/Human

Results For the datasets analyzed, a nestedness constraint does not appear too conservative However, we didn’t survey a wide range of evolutionary distances expect nestedness to decrease with evolutionary distance open question: are there more rearranged datasets for which the proportion of nested clusters is much smaller?

Is nestedness desirable? A nestedness constraint: offers a middle ground between no order constraints and strict order However, nestedness provides no formal description of order constraints is restrictive rather than descriptive We may instead prefer methods that allow for parameterization of degree of disorder consider order conservation in the statistical tests

Conclusion Proposed 9 properties to compare and evaluate methods for identifying gene clusters Illustrated cluster differences due to cluster definition search algorithm statistics Incompatible Desiderata: these properties are intuitively natural yet many are surprisingly difficult to satisfy with the same definition

Acknowledgements David Sankoff The Durand Lab Barbara Lazarus Fellowship Sloan Foundation NHGRI, Packard Foundation

Discussion are our intuitions about clusters reasonable? which cluster properties are important or desirable? how can we quantitatively evaluate cluster definitions? what are the tradeoffs between methods? how can better definitions be designed?