Tao Jiang Department of Computer Science

A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search
Tao Jiang
Department of Computer Science
University of California, Riverside
Joint work with X. Chen, Z. Fu, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi

Outline
An introduction to orthology
Existing ortholog assignment methods
Ortholog assignment via genome rearrangement
  An introduction to genome rearrangement
  Computing signed reversal distance with duplicates
    Minimum Common Substring Partition
    Maximum Cycle Decomposition
Experimental results
Summary and future directions
9/17/2018

Orthology
[Figure: gene tree of a gene family in mouse, chicken, and frog, with a duplication event and speciation events; the labels mark homolog, paralog, and ortholog relationships.]

Orthology – the more complicated picture
True exemplar: the direct descendant of the ancestral gene of a given set of inparalogs.
Main ortholog pair: two true exemplar genes of two co-orthologous gene sets.
Outparalogs: evolved via a duplication prior to a given speciation event.
Inparalogs: evolved via a duplication posterior to a given speciation event.
[Figure: gene tree over genomes G1, G2, and G3 with the events Speciation 1, Gene duplication 1, Speciation 2, and Gene duplication 2.]

Significance
Orthologous genes in different species are evolutionary and functional counterparts.
Many methods use orthologs in a critical way:
  Function inference
  Protein structure prediction
  Motif finding
  Phylogenetic analysis
  Pathway reconstruction
  and more ...
Identification of orthologs, especially exemplar genes, is a fundamental and challenging problem.

Existing Methods
Methods based on sequence similarity:
  BBH, Inparanoid/Multiparanoid, PhiGs, COG/KOG, OrthoMCL, MGD, TOGA/EGO, KEGG, HomoloGene
Methods based on phylogenetic trees:
  Reconciled tree, Orthostrapper, OrthologID, RAP, RIO, PhyOP, TreeFam
Methods that take gene locations into account:
  Shared genomic synteny

Observations
Sequence similarity-based methods assume that the evolutionary rates of all genes in a homologous family are equal, so that divergence time can be estimated by comparing gene sequences.
Tree-based methods rely critically on the correctness of the reconstructed gene and species trees.
Gene location-based methods do not consider global genome rearrangements.

Molecular Evolution
Local mutation:
  Base substitution
  Base insertion
  Base deletion
Global rearrangement and duplication:
  Inversion/Reversal
  Translocation
  Transposition
  Fusion/Fission
  Duplication/Loss
A complete ortholog assignment system should make use of information from both levels of molecular evolution.

Genome Rearrangement Operations
Reversal (inversion)
Translocation
Fusion
Fission

Example
[Figure: the ancestral genome a1 b c a2 d e f g undergoes speciation; the descendant lineages then accumulate a reversal, duplications (adding the copies a3 and a4), and a fission to give the two present-day genomes.]
Given the evolutionary scenario, main ortholog pairs and inparalogs can be identified in a straightforward way.

The Parsimony Approach
Identify homologs using sequence similarity search (e.g., BLASTp).
Reconstruct the evolutionary scenario on the basis of the parsimony principle: postulate the minimum possible number of rearrangement events and duplication events in the evolution of two closely related genomes since their splitting, and use this scenario to assign orthologs.
The ortholog assignment problem can then be formulated as the problem of finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.

RD (Rearrangement-Duplication) Distance
RD distance between genomes G and H:
$RD(G, H) = R(G, H) + D(G, H)$
$R(G, H)$ denotes the number of rearrangement events in a most parsimonious transformation.
$D(G, H)$ denotes the number of gene duplications in a most parsimonious transformation.

The key algorithmic problem: SRDD
Signed Reversal Distance with Duplicates
Setting:
  Two related (unichromosomal) genomes
  No inparalogs, i.e., no post-speciation duplications
  No gene losses, and thus equal gene content
  Only reversals have occurred
  Duplicated genes are present
How to find a shortest sequence of reversals
Generalizes the problem of sorting by reversal
Almost untouched in the literature
A high-throughput system for assigning orthologs on a genome scale.

When there are no (post-speciation) duplications
The most parsimonious rearrangement scenario may suggest the true orthology.

Sorting by reversal
Sorting a permutation into the identity by reversals
Distinct genes only
Signed vs. unsigned version:
  Sorting signed permutation
  Sorting unsigned permutation
[Figure: an example permutation.]
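To make the reversal operation concrete, here is a minimal sketch of a signed reversal together with a brute-force breadth-first search for the reversal distance of very small signed permutations. This is an illustration only: it is not the Hannenhalli-Pevzner algorithm, the function names `reverse_segment` and `reversal_distance_bfs` are hypothetical, and the search is exponential in the permutation length.

```python
from collections import deque


def reverse_segment(perm, i, j):
    """Reverse perm[i:j] and flip the sign of every gene in the segment."""
    return perm[:i] + tuple(-x for x in reversed(perm[i:j])) + perm[j:]


def reversal_distance_bfs(perm):
    """Exact signed reversal distance to the identity, by exhaustive BFS.

    Only practical for a handful of genes: the state space has 2^n * n! entries.
    """
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        # Try every non-empty segment [i, j).
        for i in range(len(cur)):
            for j in range(i + 1, len(cur) + 1):
                nxt = reverse_segment(cur, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
    return None


print(reversal_distance_bfs([1, -3, -2, 4]))  # one reversal of the middle segment suffices
```

The search treats every reversal as one step, so the first time the identity is dequeued its recorded depth is the minimum number of reversals.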

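Sorting signed permutation
Hannenhalli-Pevzner (HP) theory
Polynomial-time solvable
Breakpoint graph: breakpoint, cycle, hurdle, fortress
HP formula: $d(\pi) = b(\pi) - c(\pi) + h(\pi) + f(\pi)$, where $b$ is the number of breakpoints, $c$ the number of cycles, $h$ the number of hurdles, and $f$ is 1 if $\pi$ is a fortress and 0 otherwise
[Figure: an example permutation and its breakpoint graph, with d = 3 – …]
Hannenhalli and Pevzner, STOC, 1995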
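As a small illustration of the breakpoint notion used in HP theory, the sketch below counts breakpoints of a signed permutation framed by 0 and n+1, and derives the standard lower bound that follows from a reversal changing at most two adjacencies. It does not compute cycles, hurdles, or fortresses, so it is not the HP formula itself; the function names are hypothetical.

```python
import math


def signed_breakpoints(perm):
    """Count breakpoints of a signed permutation framed by 0 and n+1.

    Consecutive elements a, b form an adjacency when b - a == 1;
    every other consecutive pair is a breakpoint.
    """
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def breakpoint_lower_bound(perm):
    """A reversal removes at most two breakpoints, so d >= ceil(b / 2)."""
    return math.ceil(signed_breakpoints(perm) / 2)


print(signed_breakpoints([1, -3, -2, 4]), breakpoint_lower_bound([1, -3, -2, 4]))
```

The bound is not tight in general: for the permutation 2 1 it gives 2, while three reversals are actually needed.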

Sorting unsigned permutation
NP-hard (Caprara, 1997)
Breakpoint graph: maximum alternating cycle decomposition (NP-hard)
1.375-approximation (Berman, et al. 2002)
[Figure: an example permutation, its breakpoint graph, and an alternating cycle decomposition, with d = 3 – …]
Caprara, RECOMB, 75-83, 1997

A brief history
                                  signed                    unsigned
Kececioglu and Sankoff (1995)                               2-approximation
Bafna and Pevzner (1996)          1.5-approximation         1.75-approximation
Hannenhalli and Pevzner (1995)    Polynomial                Special cases – polynomial
Caprara (1997)                                              NP-hard
Christie (1998)
Bader, et al (2001)               Linear – distance only
Berman, et al (2002)                                        1.375-approximation
The work has also been extended to genomes with multiple chromosomes (Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003).

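SRDD – The exhaustive method
Given genomes G and H, consider the set of all the possible ortholog assignments. Each assignment yields a pair of genomes after orthologs have been assigned, in which no duplicates remain, so the signed reversal distance can be computed for it; the exhaustive method takes the minimum over all assignments.
Example: assume one family with ten duplicated genes in each genome. This family alone admits 10! = 3,628,800 one-to-one assignments.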
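The growth of the search space can be checked with a few lines of arithmetic. The sketch below assumes equal gene content, so a family with k copies in each genome admits k! one-to-one assignments and the families multiply independently; `count_assignments` is a hypothetical helper, not part of any published tool.

```python
from math import factorial, prod


def count_assignments(family_sizes):
    """Number of one-to-one ortholog assignments under equal gene content.

    family_sizes lists, for each gene family, the number of copies present
    in each of the two genomes.
    """
    return prod(factorial(k) for k in family_sizes)


print(count_assignments([10]))        # a single family of size ten
print(count_assignments([2, 2, 3]))   # several small families multiply
```

Even a modest number of small families makes exhaustive enumeration impractical, which motivates the heuristics in the rest of the talk.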

SRDD – Hardness
SRDD is NP-hard, even when the maximum size of a gene family is limited to two.
Reduction from the problem of sorting an unsigned permutation by reversals: an unsigned permutation is mapped to a signed sequence with duplicates.
[Figure: the mapping, with two cases (Case 1 and Case 2), each labeled "No breakpoint".]

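SRDD – A lower bound
Partial graph: nodes are labeled by gene extremities (h and t), and for each pair of labels the graph records the number of edges linking two nodes carrying those labels.
The number of breakpoints is computed from these edge counts.
Let G and H be a pair of related genomes. Their reversal distance is lower bounded by a function of the number of breakpoints.
[Figure: partial graphs of two genomes over the extremities 1h, 1t, 2h, 2t, 3h, 3t, 4h, 4t.]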

(Sub)optimal assignment rules
[Figure: Rule one, illustrated on the gene order a b c f d e, distinguishing trivial and non-trivial cases.]
[Figure: Rule two, illustrated on the gene orders a b c f d and -e -d -b -c, distinguishing trivial and non-trivial cases.]

The MCSP problem
Minimum Common Substring Partition
This may help eliminate many duplicates, but is different from syntenic blocks.
Given two related genomes G and H, partition both into the same collection of substrings using as few substrings as possible.
Without loss of generality, assume that the first genes of the two related genomes are identical positive singletons, and likewise the last genes.
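A minimal sketch of what a common substring partition means for signed genomes is given below. It assumes that a block matches either an identical block or its reversed, sign-flipped copy (as suggested by pairings such as 1 2 with -2 -1), and it only verifies a proposed partition; it does not search for a minimum one. The names `canonical` and `is_common_partition` are hypothetical.

```python
from collections import Counter


def canonical(block):
    """Canonical form of a signed block: the block itself or its
    reversed, sign-flipped copy, whichever sorts first."""
    forward = tuple(block)
    backward = tuple(-g for g in reversed(block))
    return min(forward, backward)


def split(genome, cuts):
    """Cut a genome into blocks; cuts are the start indices of blocks after the first."""
    bounds = [0] + list(cuts) + [len(genome)]
    return [tuple(genome[a:b]) for a, b in zip(bounds, bounds[1:])]


def is_common_partition(g, h, g_cuts, h_cuts):
    """True when the two block multisets agree up to reversal with sign flip."""
    blocks_g = Counter(canonical(b) for b in split(g, g_cuts))
    blocks_h = Counter(canonical(b) for b in split(h, h_cuts))
    return blocks_g == blocks_h


G = [1, 2, 3, 4]
H = [1, -3, -2, 4]
print(is_common_partition(G, H, [1, 3], [1, 3]))  # blocks (1) (2 3) (4) on both sides
```

Using a multiset comparison rather than a set keeps the check correct when the same block occurs more than once, which is exactly the situation created by duplicated genes.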

MCSP – Hardness
Let k-MCSP denote the version of MCSP where each gene family is of size at most k.
The problem k-MCSP is NP-hard, for any k > 1.
Petr Kolman gave a linear time O( )-approximation algorithm for k-MCSP (MFCS'05), and thus k-SRDD. The approximation ratio was recently improved to O(k).
Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004

MCSP – Pair-match graph
A pair-match graph
Single match vs. pair match
Incompatible pair-matches
The maximum independent set problem on the pair-match graph is equivalent to the minimum common substring partition problem, i.e.,
$L(G, H) = n - |\mathrm{MIS}(V, E)|$
[Figure: examples of single matches and pair matches between G and H, including G: 1 2 paired with H: -2 -1.]
Goldstein, Kolman, and Zheng, ISAAC, 2004

MCSP – Approximation Algorithm
APPROX-MCSP(G, H)   /* G and H are a pair of related genomes */
  1. Construct the pair-match graph for G and H
  2. Find an approximation of the vertex cover of the pair-match graph
  3. Identify segments based on the pair-matches not in the cover
  4. Output all the segments as a common substring partition
The size of the common substring partition found by APPROX-MCSP is bounded in terms of r and n, where r is the ratio of the approximation algorithm for vertex cover and n is the genome size. In particular, for 2-MCSP, the algorithm achieves an approximation ratio of 1.5.
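The slides do not say which vertex cover approximation is used. As one possible stand-in, the sketch below shows the textbook maximal-matching 2-approximation on a graph given as an edge list; the vertices left outside the cover form an independent set, which is the role the compatible pair-matches play in the steps above. The function names and the toy graph are assumptions for illustration.

```python
def approx_vertex_cover(edges):
    """Classic 2-approximation: take both endpoints of every edge that is
    not yet covered (i.e. of a greedily built maximal matching)."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def independent_complement(vertices, edges):
    """Vertices outside the cover; no edge can join two of them."""
    return set(vertices) - approx_vertex_cover(edges)


# Toy conflict graph: vertices stand for pair-matches, edges for incompatibilities.
vertices = ["p1", "p2", "p3", "p4", "p5"]
edges = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
print(sorted(independent_complement(vertices, edges)))
```

Any other vertex cover approximation could be substituted here; only the guarantee on the cover size changes.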

Maximum cycle decomposition
What if there are still some duplicates?
Given any two genomes without duplicated genes, the rearrangement distance between them is given by the (revised) HP formula (Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003).
We could approximate the minimum rearrangement distance between two genomes by decomposing the complete-breakpoint graph so as to maximize a quantity defined from the number of cycles and paths and the number of …

MSOAR
MSOAR is a high-throughput system for ortholog assignment between closely related genomes.
MSOAR employs a heuristic algorithm to calculate the rearrangement/duplication (RD) distance between two genomes using the suboptimal assignment rules, MCSP, and MCD; the result can be used to reconstruct a most parsimonious evolutionary scenario.
MSOAR extends SOAR by allowing for multi-chromosomal genomes and the detection of inparalogs.

“Noise” gene pair detection
The previous steps determine a one-to-one gene matching between two genomes.
Unmatched genes are removed and marked as inparalogs.
Remove gene pairs whose deletion decreases the rearrangement distance by at least two. Since each deleted pair incurs two duplications, the RD distance will not increase.
These deleted genes form inparalogs.
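The removal rule can be sketched as a greedy loop. The sketch below takes the rearrangement distance as a caller-supplied function, because computing it is the hard part and is not reproduced here; the greedy order, the name `remove_noise_pairs`, and the toy distance used in the example are all assumptions, not MSOAR's actual procedure.

```python
def remove_noise_pairs(pairs, rearrangement_distance):
    """Repeatedly delete a matched gene pair whose removal lowers the
    rearrangement distance by at least two.

    Each deletion adds two duplications, so R + D does not increase.
    Returns (kept_pairs, removed_pairs).
    """
    kept = list(pairs)
    removed = []
    changed = True
    while changed:
        changed = False
        base = rearrangement_distance(kept)
        for pair in kept:
            rest = [q for q in kept if q != pair]
            if base - rearrangement_distance(rest) >= 2:
                kept = rest
                removed.append(pair)
                changed = True
                break  # recompute the baseline before testing further pairs
    return kept, removed


# Toy stand-in for the distance: each pair carries a fixed cost.
toy_pairs = [("a", 0), ("b", 3), ("c", 1), ("d", 2)]
toy_distance = lambda ps: sum(cost for _, cost in ps)
print(remove_noise_pairs(toy_pairs, toy_distance))
```

With a real distance function the saving from deleting one pair depends on which other pairs remain, which is why the baseline is recomputed after every deletion.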

An outline of MSOAR
Input: Dataset A, Dataset B
Homology search:
  1. Apply all-vs.-all comparison by BLASTp
  2. Only select the BLAST hits with similarity score above cutoff
  3. Keep up to five top bi-directional best hits
Assign orthologs by minimizing RD distance:
  1. Apply suboptimal rules
  2. Apply minimum common substring partition
  3. Maximum cycle decomposition
  4. Detect inparalogs by identifying “noise” gene pairs
Output: List of orthologous gene pairs
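The homology-search filter might look like the sketch below. It assumes the BLASTp hits have already been parsed into (gene_a, gene_b, score) tuples, and it reads "up to five top bi-directional best hits" as: keep a pair when each gene is among the other's top k hits. That reading, the tuple format, and the function name `top_k_bidirectional` are assumptions.

```python
from collections import defaultdict


def top_k_bidirectional(hits, cutoff, k=5):
    """Filter parsed hits (gene_a, gene_b, score).

    A pair survives when its score exceeds the cutoff and each gene is
    among the other's k best-scoring partners.
    """
    by_a, by_b = defaultdict(list), defaultdict(list)
    for a, b, score in hits:
        if score > cutoff:
            by_a[a].append((score, b))
            by_b[b].append((score, a))
    # Best k partners of every gene, from each side.
    top_a = {a: {b for _, b in sorted(v, reverse=True)[:k]} for a, v in by_a.items()}
    top_b = {b: {a for _, a in sorted(v, reverse=True)[:k]} for b, v in by_b.items()}
    return sorted((a, b) for a, partners in top_a.items()
                  for b in partners if a in top_b[b])


hits = [("h1", "m1", 100.0), ("h1", "m2", 50.0), ("h2", "m1", 10.0)]
print(top_k_bidirectional(hits, cutoff=20.0, k=1))
```

Keeping several candidates per gene, rather than only the single best hit, is what leaves room for the rearrangement-based steps to choose among homologs.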

Simulated data test
Simulated genome G: 100 distinct genes
Simulated genome H: randomly perform reversals on G to obtain another genome
Experiments
  One: randomly copy some genes and insert them back into one of the genomes
  Two: randomly copy some genes and insert them back into both genomes
(Inserted genes are inparalogs by definition.)
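A simulation of this kind can be sketched in a few lines. The numbers of reversals and inserted copies below are placeholders rather than the values used in the experiments, inserted copies keep the orientation of the gene they copy, and the function names are hypothetical.

```python
import random


def apply_random_reversals(genome, n_reversals, rng):
    """Apply signed reversals to random non-empty segments."""
    g = list(genome)
    for _ in range(n_reversals):
        i, j = sorted(rng.sample(range(len(g) + 1), 2))
        g[i:j] = [-x for x in reversed(g[i:j])]
    return g


def insert_random_copies(genome, n_copies, rng):
    """Copy randomly chosen genes and insert each copy at a random position."""
    g = list(genome)
    for _ in range(n_copies):
        gene = rng.choice(g)
        g.insert(rng.randrange(len(g) + 1), gene)
    return g


rng = random.Random(0)                        # fixed seed for reproducibility
G = list(range(1, 101))                       # 100 distinct genes
H = apply_random_reversals(G, 10, rng)        # placeholder: 10 reversals
H_dup = insert_random_copies(H, 5, rng)       # placeholder: 5 inserted copies
print(len(G), len(H), len(H_dup))
```

Because the inserted copies are known, the true orthology is known by construction, which is what allows an assignment method to be scored on such data.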

Simulated data test
Randomly generate two genomes ( , , , )
Average on 20 random instances for each parameter set
Our heuristic algorithm vs. the iterated exemplar algorithm (Sankoff, Bioinformatics, 1999)

Real data
Homo sapiens: Build 36.1 human genome assembly (UCSC hg18, March 2006), 20161 protein sequences in total
Mus musculus: Build 36 mouse genome assembly (UCSC mm8, February 2006), 19199 protein sequences in total

MSOAR vs. Inparanoid
Validation: official gene symbols extracted from the UniProt release 6.0 (September 2005)
For human protein sequences and mouse protein sequences, MSOAR assigned orthologs between Human and Mouse, among which are true positives, 1748 are unknown pairs and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%.
The comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001)

MSOAR vs. Inparanoid
[Figure: gene order on human chromosome 20 (SNRPB, STK35, TGM3, TGM6, ZNF343, TMC2, NOL5A, IDH3B) aligned with mouse chromosome 2 (Snrpb, Stk35, Tgm3, Tgm6, Tmc2, Nol5a, Idh3b).]
The ortholog pair SNRPB (human) and Snrpb (mouse) are not bi-directional best hits, so they could be missed by sequence-similarity-based ortholog assignment methods like Inparanoid.

Number of main ortholog pairs assigned by MSOAR across the chromosome pairs

An alignment between syntenic blocks and MSOAR blocks

Validation by HCOP
The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGS, MGD and HGNC.

Other validations
By PANTHER protein sequence classification (ftp://ftp.pantherdb.org/sequence_classifications/)
MSOAR identified ortholog pairs with valid Geneid between human and mouse, among which pairs have both orthologous genes in the same protein subfamily.

Summary and future work
Presented a novel approach to assign orthologs between two genomes via genome rearrangement and gene duplication
Introduced a rearrangement/duplication (RD) distance for genome comparisons
Proposed a heuristic algorithm for assigning orthologs under maximum parsimony
Developed a high-throughput system for ortholog assignment (MSOAR)
Tested the system on simulated data and real genomic data of human and mouse:
  MSOAR vs. iterated exemplar algorithm
  MSOAR vs. Inparanoid
  Various validation methods
Future directions:
  More efficient algorithms for MCSP and MCD
  Refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.)
  Ortholog assignment for multiple genome comparison
  More explicit treatment of one-to-many and many-to-many orthology relationships

References
X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Computing the assignment of orthologous genes via genome rearrangement. Proc. 3rd Asia-Pacific Bioinformatics Conference (APBC), 2005.
X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Assignment of orthologous genes via genome rearrangement. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2-4, 2005.
Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. A parsimony approach to genome-wide ortholog assignment. Proc. 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2006.
Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. MSOAR: A high-throughput ortholog assignment system based on genome rearrangement. Submitted, 2007.
Z. Fu and T. Jiang. Clustering of main orthologs for multiple genomes. To be presented at LSI Conference on Computational Systems Biology (CSB), 2007.

Acknowledgement
NSF
DoE Genomes to Life (GtL) program
National Key Project for Basic Research
NSFC
Changjiang Visiting Professorship, Tsinghua Univ.
Discussion with Marek Chrobak, Petr Kolman, and Lan Liu on MCSP and MCIP
