A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.

A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew Moore Russell Schwartz Jeffrey Lawrence (Univ. of Pittsburgh, Biological Sciences) David Sankoff (Univ. of Ottawa, Math & Statistics)

Spatial Comparative Genomics Comparing the spatial organization of elements within related genomes helps us understand how genome structure changes and evolves.

Spatial Comparative Genomics Key task: identify homologous blocks, chromosomal regions that descended from the same region in the genome of the common ancestor

Many applications require the identification of homologous chromosomal regions Reconstruction of large-scale genomic changes Infer chromosomal rearrangements (i.e. Pevzner and Tesler 03) Find evidence of whole-genome duplications (i.e. Blanc et al 03) Phylogenetic analysis Infer an ancestral genetic map (i.e. Bourque et al 04) Build species trees from whole-genome data (i.e. Blanchette et al 99) Functional annotation Predict operons (i.e. Chen et al 04) Infer functional associations between genes (i.e. Overbeek et al 99)

Challenges in detecting homologous blocks Local mutation: For distantly related genomes, little sequence similarity will remain except in genic regions Translocations, fissions: Interchromosomal rearrangements break blocks into small pieces, and disrupt 1-to-1 mapping between chromosomes Inversions: Intrachromosomal rearrangements shuffle gene order Gene loss/Duplications/Insertions: Indels result in homologous blocks that share few homologous genes

Gene content and order are highly conserved rearrangement, mutation, loss Similarity in gene content large scale duplication or speciation event original genome Neither gene content nor order is strictly preserved gene clusters

Gene clusters are characterized by shared gene content and possibly scrambled gene order Many unrelated regions will also share homologous gene pairs In distantly related genomes, homologous blocks may be evident only as gene clusters To identify ancient homologous blocks statistical tests are essential

Statistical Tests for Gene Clusters Given: a pair of regions that share some number of genes, not necessarily in the same order Goal: design formal statistical tests to show that local similarities in gene content could not have occurred by chance Approach: Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order

Overview of Thesis 1. Statistical tests for max- gap clusters in one or two genomes 2. Statistical tests for clusters spanning three genomes 3. Application of cluster statistics: an algorithm for ortholog detection g  = 2 g  = 3

Outline Intro and thesis overview Background: identifying gene clusters 1. Max-gap cluster statistics 2. Statistics for clusters in multiple genomes 3. Ortholog detection using max-gap statistics

Steps in identifying gene clusters 1. Identify homologous genes 2. Search the genome for clusters 3. Test cluster significance

Gene Homology Identification of homologous gene pairs generally based on sequence similarity still an imprecise science preprocessing step Depending on how homologs are identified, the homology map may have different properties each gene may be limited to a fixed number of homologs Number of homologs per gene One-to-oneAt most one Two-to-oneAt most two Many-to-manyUnlimited

Steps in identifying gene clusters 1. Identify homologous genes 2. Search the genome for clusters 3. Test cluster significance

How to formally define a gene cluster? Many definitions have been proposed Four definitions for which efficient search algorithms have been designed: 1. common substring 2. common interval 3. r-windows 4. max-gap clusters

The Model Representation of a genome: a sequence of non-overlapping genes the distance or “gap” between genes is equal to the number of intervening genes gap  = 3 gap  = 1

Common substring A set of genes form a common substring if they form a substring (in either direction), in both genomes order preserved no rearrangements no insertion/loss/duplication no constraint on length

Common interval A set of genes form a common interval if they are adjacent in both genomes order can be scrambled rearrangements and duplications allowed no insertion/loss no constraint on length

An r -window Two windows of size r that share at least m homologous gene pairs m and r are user-specified parameters rearrangements, duplications, deletions, and insertions allowed length constraint r = 6 m= 5

A max-gap g-cluster A set of genes form a max-gap g-cluster if the gap between adjacent genes is never greater than g, in either genome g is a user-specified parameter allows rearrangements, duplications, insertions, deletions cluster length is not constrained g = 1 gap  > 1 gap  = 1 g = 2 gap  = 2

Outline Intro and thesis overview Background: identifying gene clusters 1. Max-gap cluster statistics 2. Statistics for clusters in multiple genomes 3. Ortholog detection using max-gap statistics

Max-gap cluster statistics The max-gap cluster definition is the most widely used definition in genomic analyses There was no formal statistical model developed for this cluster definition I used a combinatorial approach to determine the probability of observing max-gap clusters by chance, for two common search scenarios: reference set whole-genome comparison

Given: a genome containing n genes a set of m genes of interest If the genes are randomly ordered, what is the probability that at least h of the m genes will cluster together? Reference set Hoberman, Sankoff, Durand. RECOMB Comp Genomics, 2005.

Whole Genome Comparison Given: 1. two genomes of n genes 2. m homologous genes pairs 3. a maximum gap size g Problem What is the probability of observing a maximal max-gap cluster of size exactly h, if the genes in both genomes are randomly ordered? g  3 Hoberman, Sankoff, Durand. J Computational Biology, 2005.

With these parameters, a cluster must contain at least 7 homologous gene pairs to reject the null hypothesis Cluster Probability 1000 genes, g=10 Summary This formal statistical framework for max-gap clusters is useful for data analysis h, cluster size

Demonstrated that in some cases larger clusters are more likely to occur by chance than smaller clusters 1000 genes, g=20 This formal statistical framework for max-gap clusters is useful for data analysis understanding trends, and properties of cluster definitions Summary Cluster Probability h, cluster size Durand & Hoberman. Trends in Genetics, 2006.

If we’re interested in clusters containing h=5-10 genes, then we should choose a gap size between g=1 and g=5 1000 genes Summary This formal statistical framework for max-gap clusters is useful for data analysis understanding trends, and properties of cluster definitions choosing parameter values / designing studies

Open Problems: max-gap statistics Statistical models for genome self-comparison (to find duplicated blocks) Statistical tests, for a many-to-many homology map Search algorithms and statistics for max-gap clusters spanning more than two regions

Outline Intro and thesis overview Background: identifying gene clusters 1. Max-gap cluster statistics 2. Statistics for clusters in multiple genomes An application: identifying blocks duplicated by WGD Review of r-window statistics for pairwise comparison My results: r-window statistics for three regions 3. Ortholog detection using max-gap statistics

Whole genome duplication (WGD) WGD chromosome 2 chromosome 1 Ancestral chromosome All genes and regulatory regions are duplicated simultaneously Opportunity for large-scale adaptation (citations?) The higher yields and sizes of modern crop species The huge diversity of fish species The increased genome size and complexity of vertebrates

Duplicated segments can be difficult to identify by direct comparison WGD chromosome 2 chromosome 1 Ancestral chromosome chr 2 chr 1

Duplicated segments can be difficult to identify by direct comparison Ancestral chromosome chr 2 chr 1 Reciprocal gene loss WGD Only 11-28% of genes in the pre-WGD genome are retained in duplicate

Arabidopsis thaliana region 1 Rice chr 2 Arabidopsis thaliana region 2 Simillion et al, PNAS 2002 Comparisons of multiple regions offer significantly more power to detect highly diverged homologous segments

Existing tests for gene clusters are almost exclusively for clusters spanning at most two regions Need tests for three or more chromosomal regions

Window Sampling Sample two windows of r genes, surrounding the genes of interest Do the two regions of interest share a significant number of homologs? An example: was the insulin gene duplicated in a whole-genome duplication? r = 6 m = 3 W2W2 W1W1 Gene of interest

m r-m Durand and Sankoff, J Comp Biology, 2003 The probability of observing a cluster by chance, assuming random gene order depends only on: r, the window size m, the number of shared homologs n 1 and n 2, the genome sizes W2W2 W1W1 r = 6 m = 3

With three regions there are many more quantities to consider Three-way overlap: x 123 Pairwise overlaps: x 12, x 23, x 13 x3x3 x 23 x 123 x2x2 x 12 x1x1 x 13 No existing statistical approach considers all these quantities in a single test. W3W3 W1W1 W2W2 W1W1 W3W3 W2W2

x3x3 x 23 x 123 x2x2 x 11 x1x1 x 13 W1W1 W2W2 W3W3 Method reviewed in Simillion et al, Bioessays, 2004 Require that at least two pairwise clusters are independently significant P(X > x 123 +x 13 ) < α and P(X > x 123 +x 23 ) < α Previous Statistical Approach W1W1 W3W3 W2W2

x3x3 x 23 x 123 x2x2 x 11 x1x1 x 13 W1W1 W2W2 W3W3 Method reviewed in Simillion et al, Bioessays, 2004 Require that at least two pairwise clusters are independently significant P(X > x 123 +x 13 ) < α and P(X > x 123 +x 23 ) < α Previous Statistical Approach W3W3 W2W2 W1W1

x3x3 x 23 x 123 x2x2 x 11 x1x1 x 13 W1W1 W2W2 W3W3 Method reviewed in Simillion et al, Bioessays, 2004 Require that at least two pairwise clusters are independently significant P(X > x 123 +x 13 ) < α and P(X > x 123 +x 23 ) < α Ignores pairwise overlap between W 1 and W 2 Does not consider the greater impact of x 123 Fails to consider all data jointly Previous Statistical Approach

Our approach Given three randomly selected windows of length r, we assess the probability of observing as extreme a cluster under a null hypothesis of random gene order. We derived probability expressions for two models: two windows sampled from a post-WGD genome, and one window sampled from a pre-WGD genome three windows sampled from three distinct genomes x3x3 x 23 x 123 x2x2 x11x11 x1x1 x 13 W1W1 W2W2 W3W3 Raghupathy, Hoberman, Durand. J Bioinformatics & Comput Biol, 2007.

About 84% of duplicates a lost An example: Whole Genome Duplication in S. cerevisiae We investigated statistical trends of three-region clusters, based on a model constructed from this yeast data 1000 3600 single-copy genes in both S. cerevisiae and K. waltii 450 genes retained in duplicate in S. cerevisiae, but with only a single copy in K. waltii 500 genes in the S. cerevisiae genome but not K. waltii K. waltii S. cerevisiae S. kluyveri WGD S. cerevisiae 1 K. lactis S. cerevisiae 2 Byrne & Wolfe, Genome Research 2005

The effect of retained duplicates on cluster significance W post1 W pre W post2 3 0 0 3 2 1 0 2 W pre W post1 W post2 W pre W post1 W post2 h = x 12 + x 13 + x 123, the number of genes shared between W post and W pre h = 3 W post1 W pre W post2

Even a single gene retained in duplicate has a large impact on cluster significance W post1 W pre W post2 a) h r=100 1 2 3 4 5 6 W post1 W pre W post2 b)

W post1 W pre W post2 x 12 = x 13 b) a) W post1 W pre W post2 With a pairwise approach retained duplicates are disregarded a pairwise test of cluster b would fail to reject the null hypothesis even though such a cluster is highly unlikely to occur by chance Even when there are no genes shared between the duplicated region, pairwise tests are still overly conservative pairwise tests require each pair of regions to be independently significant

Summary x3x3 x 23 x 123 x2x2 x 12 x1x1 x 13 W1W1 W2W2 W3W3 Preliminary results suggest this approach can detect homologous regions sharing fewer genes than can be identified by pairwise comparisons Developed the first statistical approach that considers both pairwise and three-way overlaps in a single test

Open problems: r -window statistics Three-region clusters for whole-genome comparison Many-to-many homology mapping Comparisons of more than three regions

Outline Intro and thesis overview Background: identifying gene clusters 1. Max-gap cluster statistics 2. Statistics for clusters in multiple genomes 3. Ortholog detection using max-gap statistics The problem A cluster-based method for ortholog prediction Evaluation

Orthologs are a fundamental unit of comparison in many comparative genomics studies Species 1 Species 2 Ancestral Species

Orthologs are a fundamental unit of comparison in many comparative genomics studies c2c2 c c’ 2 c c’ 1 ’ c1c1 Species 1 Species 2 Ancestral Species Duplication Speciation c’

Orthologs are a fundamental unit of comparison in many comparative genomics studies c2c2 c c c1c1 Species 1 Species 2 Ancestral Species (MRCA) Duplication Speciation Orthologs: related through speciation Paralogs: related through duplication Descended from a single gene in the MRCA Descended from distinct genes in the MRCA c’ 2 c’ 1 c’

Orthologs are a fundamental unit of comparison in many comparative genomics studies c2c2 c c c1c1 Species 1 Species 2 Ancestral Species (MRCA) Duplication Speciation Homologs: related through common ancestry c’ 2 c’ 1 c’

Many applications utilize orthologs Genome annotation orthologs often have similar functions Identifying regulatory sequences conserved sequences near orthologous genes are often functional Building species trees only orthologs arise through speciation, so it is essential that only orthologous genes are analyzed. Analyzing rearrangements and finding orthologous blocks orthologs are often used as markers

Ortholog Prediction Given: a bipartite graph, with edges representing homologous pairs Goal: classify each gene pair as paralogous or orthologous Common strategy: select a matching—orthologs correspond to matched pairs c2c2 B c’ 1 c’ 2 c1c1 Genome 1 Genome 2

Traditional methods Orthologs identified by comparing gene sequences c2c2 c1c1 c1’c1’ c2’c2’ 10 -12 10 -62 10 -100 10 -63

Traditional methods Orthologs identified by comparing gene sequences, Most common method: bi-directional best hits (BBH) a pair (a,b) is predicted to be orthologous if b is a’s top sequence match and a is b’s top sequence match 10 -12 10 -62 10 -100 10 -63 c2c2 c1c1 c’ 1 c’ 2

Augmenting with spatial information Similar sequences + similar spatial context is a better indicator of orthology than similar sequences alone c2c2 c’ 2 c c’ 1 c1c1 c’ c Orthologs cannot always be identified by sequence comparison c1c1 c2c2 c’ 2 c’ 1

A cluster-based approach to ortholog detection Assumption 1: Clusters arose from a single ancestral segment Assumption 2: If two regions are orthologous then the genes within the regions are most likely orthologous Clusters specify local ortholog assignments c1c1 c2c2 c’ 2 c’ 1

If a gene is a member of more than one cluster, a conflict may arise. The goal: Select the best subset of gene clusters, such that no clusters conflict, and every gene is a member of a cluster, i.e. the clusters cover both genomes. A Challenge c2c2 c’ 2 c’ 1

A Greedy approach Select a cluster definition Identify a set of interesting clusters Do select the best gene cluster from this set assign orthologs within the cluster remove all remaining clusters that are inconsistent with these assignments Until all genes are covered

What is the optimal cluster definition? E. coli Salmonella typhimurium Yersinia Pestis Pseudomonas aeruginosa E. coli

What is the optimal cluster definition? E. coli Salmonella typhimurium Yersinia Pestis Pseudomonas aeruginosa E. coli Probable orthologs Probable paralogs Unknown

Previous approaches Method Cluster Definition Interesting clusters Ranking criteria LCS common substrings maximal clusters length of cluster CIGAL common intervals maximal clusters product of two lengths Blin et al, RECOMB Comparative Genomics, 2005 and 2006

Method Cluster Definition Interesting clusters Ranking criteria LCS common substrings Maximal clusters Length of cluster CIGAL common intervals Maximal clusters Product of two lengths Max-gapmax-gap??

Previous Work Sequence only Spatial only BBH Inparanoid OrthoMCL Kellis 04 YGOB Ensembl MSOAR LCS CIGAL Bourque 05 Me Swenson 05 Zheng 05 Burgetz 06 Cannon 03

Two challenges in extending this approach to allow gaps in conserved blocks Problem 1: Given two conflicting clusters, how do we decide which cluster is “best”? Problem 2: How do we select a set of “interesting” clusters? Method Cluster Definition Interesting clusters Ranking criteria LCS common substrings Maximal clusters Length of cluster CIGAL common intervals Maximal clusters Product of two lengths Max-gapmax-gap??

Problem 1: Which cluster is “best”? Previous approaches assumed bigger is always better A large cluster with many gaps may be more likely to occur than a small cluster with few gaps Idea: Use cluster significance to rank clusters Evaluate each cluster based on its size and max-gap cluster of size 4 cluster of size 5 c2c2 c’ 2 c’ 1

Problem 1: Which cluster is “best”? Idea: Use cluster significance to rank clusters Evaluate each cluster based on its size and max-gap Solution: implemented two methods Analytical: my probability expressions for max-gap clusters, that assume one-to-one homology Monte-Carlo: randomly order genomes, identify clusters, select matchings, and count matching size and gaps

Problem 2: What is the set of “interesting” clusters? Previous approaches found all maximal clusters A non-maximal max-gap cluster may be more significant than the larger maximal cluster that contains it Idea: Identify all dominant clusters A max-gap cluster is dominant if there is no large cluster that contains it and has the same or smaller max-gap dominant 5-cluster maximal 5-cluster non-maximal 5-cluster

Problem 2: What is the set of “interesting” clusters? Idea: Identify all dominant clusters Solution: designed and implemented a new algorithm efficiently identifies all dominant max-gap clusters, with gap no greater than a user-specified value, g max modified the HomologyTeams software, a top-down algorithm which finds maximal max-gap clusters HomologyTeams: He and Goldwasser J Comp Biol 2005

Method Cluster Definition Interesting clusters Ranking criteria LCS common substrings Maximal clusters Length of cluster CIGAL common intervals Maximal clusters Product of two lengths Max-gapmax-gap Dominant clusters Cluster Significance or highest sequence similarity (if tied) BBHN/A (only considers gene sequences)

Identify the set of Do Select the compute significance analytically or by Monte Carlo break ties based on sequence similarity Assign orthologs within the cluster Remove all remaining clusters that are inconsistent with these assignments Until the genome is covered dominant g-clusters, with g ≤ g max best gene cluster most significant cluster from this set Greedy approach interesting clusters from this set

Eight single-chromosome γ-proteobacteria species A ground truth based on Uniprot gene names Genes with the same name considered orthologs Evaluate our predictions using three metrics: precision: percentage of predicted orthologs that are correct recall: percentage of orthologs predicted to be orthologs F1 measure: harmonic mean of precision and recall Report results for all 28 genome pairs plus: Average: macro-average performance, averaged over genome pairs Overall: micro-average performance, averaged over all gene pairs

Identify the set of dominant g-clusters, with g ≤ g max Do Select the most significant cluster from this set compute significance analytically or by Monte Carlo break ties based on sequence similarity Assign orthologs within the cluster Remove all remaining clusters that are inconsistent with these assignments Until the genome is covered Greedy approach

Analytical vs. Monte Carlo Estimation g max = 5, no sequence data Analytical Monte Carlo

Is allowing gaps useful? (no sequence data) CIGAL: Common intervals Max-gap g max = 10 Max-gap g max = 5 Max-gap g max = 0

Order vs gaps ( g max = 5, no sequence data ) LCS: Conserved order, no gaps CIGAL: Scrambled order, no gaps Max-gap: Scrambled order, gaps

Adding sequence information ( g max = 5 ) No sequence data With sequence data

Four Ortholog Prediction Methods Max-gap vs LCS Max-gap with sequence data Precision Recall

Four Ortholog Prediction Methods Precision Recall Max-gap vs CIGAL Max-gap with sequence data

Four Ortholog Prediction Methods Max-gap vs BBH Recall Precision Max-gap with sequence data

Summary New method that utilizes max-gap cluster statistics Improved performance over previous methods based on spatial information alone Slightly lower precision and slightly higher recall than BBHs Useful platform for comparing cluster definitions and test statistics Demonstrated the importance of combining sequence and spatial information

Open problems: ortholog detection Improved methods for combining sequence and spatial information Consider gene order conservation when computing cluster significance Apply to a wider range of genomic datasets Sequence only Spatial only BBH Inparanoid OrthoMCL Kellis Zheng 05 YGOB Ensembl MSOAR LCS CIGAL Me Burgetz 06 Cannon 03 Bourque 05 Swenson 05

Overview of Thesis 1. Statistical tests for max- gap clusters in one or two genomes 2. Statistical tests for clusters spanning three genomes 3. Application of cluster statistics: an algorithm for ortholog detection g  = 2 g  = 3

Future Directions New search algorithms and statistical methods for many-to-many homology maps clusters spanning more than three genomes New cluster definitions and statistical tests greater discriminatory power could be achieved by considering additional cluster properties global cluster density extent of gene order conservation gene orientation divergence times

Acknowledgements Data: A. K. Bansal Cedric Chauve Software: Anne Bergeron He & Goldwasser Collaborators: Narayanan Raghupathy Dannie Durand David Sankoff Durand lab Funders? more!

Orthologs are a fundamental unit of comparison in many comparative genomics studies d2d2 d d1’d1’ d1d1 Species 1 Species 2 Ancestral Species (MRCA) Duplication Speciation Orthologs: related through speciation Paralogs: related through duplication

A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.

Similar presentations

Presentation on theme: "A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew.

Similar presentations

Presentation on theme: "A Statistical Framework for Spatial Comparative Genomics Rose Hoberman Carnegie Mellon University, March 2007 Thesis Committee Dannie Durand (chair) Andrew."— Presentation transcript:

Similar presentations

About project

Feedback