Evaluating the Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.

Slides:



Advertisements
Similar presentations
A Simpler 1.5-Approximation Algorithm for Sorting by Transpositions Tzvika Hartman Weizmann Institute.
Advertisements

Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU),
Gene an d genome duplication Nadia El-Mabrouk Université de Montréal Canada.
GENE TREES Abhita Chugh. Phylogenetic tree Evolutionary tree showing the relationship among various entities that are believed to have a common ancestor.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Summer Bioinformatics Workshop 2008 Comparative Genomics and Phylogenetics Chi-Cheng Lin, Ph.D., Professor Department of Computer Science Winona State.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
The Cobweb of life revealed by Genome-Scale estimates of Horizontal Gene Transfer Fan Ge, Li-San Wang, Junhyong Kim Mourya Vardhan.
Current Approaches to Whole Genome Phylogenetic Analysis Hongli Li.
The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.
Heuristic alignment algorithms and cost matrices
Finding approximate palindromes in genomic sequences.
What you should know by now Concepts: Pairwise alignment Global, semi-global and local alignment Dynamic programming Sequence similarity (Sum-of-Pairs)
Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007.
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL Arunesh Mishra CMSC 838 Presentation Authors : Dmitri Mikhailov,
Genomic Rearrangements CS 374 – Algorithms in Biology Fall 2006 Nandhini N S.
Similar Sequence Similar Function Charles Yan Spring 2006.
Fast identification and statistical evaluation of segmental homologies in comparative maps Peter Calabrese 1, Sugata Chakravarty 2 and Todd Vision 3 1.
Inferring Phylogeny using Permutation Patterns on Genomic Data 1 Md Enamul Karim 2 Laxmi Parida 1 Arun Lakhotia 1 University of Louisiana at Lafayette.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
1 Chapter 20 Two Categorical Variables: The Chi-Square Test.
Genome projects and model organisms Level 3 Molecular Evolution and Bioinformatics Jim Provan.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Surface Simplification Using Quadric Error Metrics Michael Garland Paul S. Heckbert.
Chapter 5 Genome Sequences and Gene Numbers. 5.1Introduction  Genome size vary from approximately 470 genes for Mycoplasma genitalium to 25,000 for human.
How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction Jeffrey Boucher Theobald Laboratory.
The Incompatible Desiderata of Gene Cluster Properties Rose Hoberman Carnegie Mellon University joint work with Dannie Durand.
Chapter 26: Phylogeny and the Tree of Life Objectives 1.Identify how phylogenies show evolutionary relationships. 2.Phylogenies are inferred based homologies.
VAST 2011 Sebastian Bremm, Tatiana von Landesberger, Martin Heß, Tobias Schreck, Philipp Weil, and Kay Hamacher Interactive-Graphics Systems TU Darmstadt,
Genome Organization and Evolution. Assignment For 2/24/04 Read: Lesk, Chapter 2 Exercises 2.1, 2.5, 2.7, p 110 Problem 2.2, p 112 Weblems 2.4, 2.7, pp.
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Yeast genome sequencing: the power of comparative genomics MEDG 505, 03/02/04, Han Hao Molecular Microbiology (2004)53(2), 381 – 389.
Genome Rearrangements Unoriented Blocks. Quick Review Looking at evolutionary change through reversals Find the shortest possible series of reversals.
whole-genome duplications and large segmental duplications… …seem to be a common feature in eukaryotic genome evolution …play a crucial role in the evolution.
Comp. Genomics Recitation 3 The statistics of database searching.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Identifying conserved segments in rearranged and divergent genomes Bob Mau, Aaron Darling, Nicole T. Perna Presented by Aaron Darling.
Genome Analysis II Comparative Genomics Jiangbo Miao Apr. 25, 2002 CISC889-02S: Bioinformatics.
Significance Tests for Max-Gap Gene Clusters Rose Hoberman joint work with Dannie Durand and David Sankoff.
Comparative genomics of Gossypium and Arabidopsis: Unraveling the consequences of both ancient and recent polyploidy Junkang Rong, John E. Bowers, Stefan.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Genome Rearrangement By Ghada Badr Part I.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.
Gene Family Size Distributions Brought to You By Your Neighorhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman.
Statistical Tests We propose a novel test that takes into account both the genes conserved in all three regions ( x 123 ) and in only pairs of regions.
BPS - 5th Ed. Chapter 221 Two Categorical Variables: The Chi-Square Test.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.
WABI: Workshop on Algorithms in Bioinformatics
Original Synteny Vincent Ferretti, Joseph H. Nadeau, David Sankoff, 1996 Presented by: Suzy Sun.
A Hybrid Algorithm for Multiple DNA Sequence Alignment
Two-sided p-values (1.4) and Theory-based approaches (1.5)
Tests for Gene Clustering
Identifying conserved spatial patterns in genomes
By Chunfang Zheng and David Sankoff, 2014
Mattew Mazowita, Lani Haque, and David Sankoff
Volume 11, Issue 3, Pages (March 2018)
A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee.
Multiple Genome Rearrangement
Volume 10, Issue 11, Pages (March 2015)
How to Define a Cluster? constructive or algorithmic definition
Speaker : 樂正、張耿健、張馭荃、葉光哲 Author
Volume 11, Issue 3, Pages (March 2018)
Tests of Significance Section 10.2.
Presentation transcript:

Evaluating the Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

Gene content and order are preserved rearrangement, mutation Similarity in gene content large scale duplication or speciation event original genome

TBOX Cryb, Lhx, Nos, Tbx, Tcf, Prkar MHC Abc, C3/4/5, Col, Hsp, Notch, Pbx, Psmb, Rxr, Ten HOX Achr, Ccnd, Cdc, Cdk, Dlx, En, Evx, Gli, Hh, Hox, If, Inhb, Nhr, Npy/Ppy, Wnt FGR Adr, Ank, Egr, Fgfr, Pa, Vmat, Lpl MATN Eya, Hck, Matn, Myb, Myc, Sdc, Src … … Local Spatial Evidence of Duplication

Gene Clustering for Functional Inference in Bacterial Genomes The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: , 1999.

Clusters are Commonly Used in Genomic Analyses Genomic Analyses Blanc et al 2003, recent polyploidy in Arabidopsis Venter et al 2001, sequence of the human genome Overbeek et al 1999, inferring functional coupling of genes in bacteria Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice Vision et al 2000, duplications in Eukaryotes Lawrence and Roth 1996, identification of horizontal transfers Tamames 2001, evolution of gene order conservaiton in prokaryotes Wolfe and Shields 1997, ancient yeast duplication McLysaght02, genomic duplication during early chordate evolution Coghlan and Wolfe 2002, comparing rates of rearrangements Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast Chen et al 2004, operon prediction in newly sequenced bacteria Blanchette et al 1999, breakpoints as phylogenetic features...

Clusters are Commonly Used in Genomic Analyses Genomic Analyses Blanc et al 2003, recent polyploidy in Arabidopsis Venter et al 2001, sequence of the human genome Overbeek et al 1999, inferring functional coupling of genes in bacteria Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice Vision et al 2000, duplications in Eukaryotes Lawrence and Roth 1996, identification of horizontal transfers Tamames 2001, evolution of gene order conservaiton in prokaryotes Wolfe and Shields 1997, ancient yeast duplication McLysaght02, genomic duplication during early chordate evolution Coghlan and Wolfe 2002, comparing rates of rearrangements Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast Chen et al 2004, operon prediction in newly sequenced bacteria Blanchette et al 1999, breakpoints as phylogenetic features... Need to assess cluster significance

Given: a genome: G = 1, …, n unique genes a set of m special genes Can we find a significant cluster of (a subset of) the m homologs?

Can we find a significant cluster of the red genes ? How do we formally define a cluster? Given: a genome: G = 1, …, n unique genes a set of m special genes

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 size = 3 genes

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red gene Example: cluster length ≤ 6 length = 6

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red gene Example: cluster length ≤ 6 length = 6

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) Example: density ≥ 0.5 density = 6/11

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) Example: density ≥ 0.5 density = 6/11

Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) –compactness: maximum gap between adjacent red genes gap ≤ 4 genes

Max-Gap Cluster Commonly used in genomic analyses Expandable, ensures minimum local and global density Efficient algorithm to find them (Bergeron, Corteel, Raffinot 2002) gap  g

Max-Gap Clusters are Commonly Used in Genomic Analyses Genomic Analyses Blanc et al 2003, recent polyploidy in Arabidopsis Venter et al 2001, sequence of the human genome Overbeek et al 1999, inferring functional coupling of genes in bacteria Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice Vision et al 2000, duplications in Eukaryotes Lawrence and Roth 1996, identification of horizontal transfers Tamames 2001, evolution of gene order conservation in prokaryotes Wolfe and Shields 1997, ancient yeast duplication McLysaght02, genomic duplication during early chordate evolution Coghlan and Wolfe 2002, comparing rates of rearrangements Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast Chen et al 2004, operon prediction in newly sequenced bacteria Blanchette et al 1999, breakpoints as phylogenetic features...

Statistical models Statistical Models Calabrese03 Danchin03 Durand03 Ehrlich97 Trachtulec01 HumanGenomeScience

A gap between statistical tests and cluster definitions used in practice Genomic Analyses Blanc et al 2003 Venter et al 2001 Overbeek et al 1999 Vandepoele et al 2002 Vision et al 2000 Lawrence and Roth 1996 Tamames 2001 Wolfe and Shields 1997 McLysaght02 Coghlan and Wolfe 2002 Seoighe and Wolfe 1998 Chen et al 2004 Blanchette et al Statistical Models Calabrese03 Danchin03 Durand03b Ehrlich97 Trachtulec01 HumanGenomeScience no analytical statistical model for max-gap clusters statistical significance assessed through Monte Carlo randomization

A gap between statistical tests and cluster definitions used in practice Genomic Analyses Blanc et al 2003 Venter et al 2001 Overbeek et al 1999 Vandepoele et al 2002 Vision et al 2000 Lawrence and Roth 1996 Tamames 2001 Wolfe and Shields 1997 McLysaght02 Coghlan and Wolfe 2002 Seoighe and Wolfe 1998 Chen et al 2004 Blanchette et al Statistical Models Calabrese03 Danchin03 Durand03b Ehrlich97 Trachtulec01 HumanGenomeScience

This Talk Analytical statistical tests –reference region model with m genes of interest –complete and incomplete clusters Relationship between cluster parameters and significance Future extension to whole-genome comparison

Outline Introduction Complete Clusters Incomplete Clusters Genome Comparison Open problems

Complete Clusters Given –a genome: G = 1, …, n unique genes –a set of m special genes (the “red” genes) –a maximum-gap size g Null hypothesis: Random gene order Alternate hypotheses: Evolutionary history Functional selection Test statistic –the probability of observing all m genes in a max-gap cluster in G ?

Probability of a Complete Cluster Count how many of the permutations of m red genes and n-m blue genes contain a max-gap cluster 1.At how many positions in the genome can we locate a cluster (e.g. place the leftmost red gene)? 2.Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that they form a max-gap cluster?

w = (m-1)g + m ways to choose m-1 gaps ways to place the first gene and still have w-1 slots left edge effects

w-1 For clusters at the end of the genome: –Length of the cluster is constrained –Sum of the gaps is constrained: Counting clusters at the end of the genome

How many ways can we form a max-gap cluster of size m with length ≤ l ? g1g1 g2g2 g3g3 g m-1 l < w

g1g1 g2g2 g3g3 g m-1 l < w = Number of ways of choosing m-1 gaps between 0 and g so sum ≤ l-m Number of ways of rolling m-1 dice with faces labeled 0 to g so faces total ≤ l-m How many ways can we form a max-gap cluster of size m with length ≤ l ?

g1g1 g2g2 g3g3 g m-1 = Number of ways of choosing m-1 gaps between 0 and g so sum ≤ l-m Number of ways of rolling m-1 dice with faces labeled 0 to g so faces total ≤ l-m How many ways can we form a max-gap cluster of size m with length ≤ l ?

Accounting for edge effects is the least efficient part of calculation –Eliminate for faster approximation when w << n Now we can calculate probabilities for various parameter values Final Probability

Probability of Observing a Complete Cluster g m n = 500

Significant Parameter Values (α = ) n = 500

Significant Parameter Values (α = ) n = 500

Outline Motivation Complete Clusters Incomplete Clusters Genome Comparison Open problems

Incomplete clusters Is it significant to find a max-gap cluster of size h < m ? Test statistic: probability of finding at least one cluster Given: m genes of interest h=3 m=5, g=1

Enumerating clusters by starting position will lead to overcounting of permutations that include more than one cluster Probability of incomplete gene clusters m = 6, h = 3, g = 1

Alternative Approach Enumerate all permutations that do not contain any clusters of size h or larger Dynamic programming –Iteratively place “red” or “blue” genes making sure not to create any cluster of size h or larger by judicious placement of “blue genes”

k = 4 c = 0, j = 8 r = 6 Dynamic programming Algorithm r total number of genes to place c size of cluster so far k number of special genes j distance to previous cluster n = 6, m = 4, h = 3, g=1

k = 4 c = 0, j = 8 r = 6 c = 1, j = 0 c = 0, j = 8 k = 4 r = 5 k = 3 r = 5 Dynamic programming Algorithm r total number of genes to place c size of cluster so far k number of special genes j distance to previous cluster n = 6, m = 4, h = 3, g=1

k = 2 r = 4 k = 4 c = 0, j = 8 r = 6 c = 1, j = 0 c = 0, j = 8 k = 4 r = 5 k = 3 r = 5 c = 1, j = 1 k = 3 r = 4 c = 1, j = 0 Dynamic programming Algorithm r total number of genes to place c size of cluster so far k number of special genes j distance to previous cluster n = 6, m = 4, h = 3, g=1

k = 2 r = 4 k = 4 c = 0, j = 8 r = 6 c = 1, j = 0 c = 0, j = 8 k = 4 r = 5 k = 3 r = 5 c = 1, j = 1 k = 3 r = 4 c = 1, j = 0 k = 2 r = 3 c = 2, j = 0 Dynamic programming Algorithm r total number of genes to place c size of cluster so far k number of special genes j distance to previous cluster X n = 6, m = 4, h = 3, g=1

Incomplete cluster significance  h  g m ( ) significant region of parameter space shown in paper n = 1000 m = 50 n = 500

Outline Motivation Complete Clusters Incomplete Clusters Extensions Genome Comparison

Possible Extensions Tandem duplications Gene families Extensions for prokaryotic genomes –Gene orientation –Circular genomes –Physical distance between genes Whole genome comparison

Assuming identical gene content … What is the probability that at least k genes form a max-gap cluster in both genomes? g  30

The probability of finding a max-gap cluster of size at least k is always one An Odd Property Example: g =1

The probability of finding a max-gap cluster of size at least k is always one Example: g =1 An Odd Property

The probability of finding a max-gap cluster of size at least k is always one Example: g =1 gap An Odd Property A cluster of size k does not necessarily contain a cluster of size k-1! gap

The probability of finding a max-gap cluster of size at least k is always one There will always be a cluster of size n Example: g =1 An Odd Property

Conclusions Presented statistical tests for max-gap clusters –Evaluate the significance of clusters of a pre- specified set of genes –Choose parameters effectively –Understand trends

Thank You