Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina

Combinatorial and Statistical Approaches in Gene Rearrangement Analysis Jijun Tang Computer Science and Engineering University of South Carolina jtang@cse.sc.edu (803) 777-8923

Outline
- Background
- Branch-and-Bound Algorithms for the Median Problem
- Maximum Likelihood Methods for Phylogenetic Reconstruction
- Post-Analysis
- Conclusions

Simple Rearrangements

Phylogenetic Reconstruction

Rearrangement Phylogeny

Median Problem
- Goal: find M so that D_AM + D_BM + D_CM is minimized
- NP-hard for most metric distances
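To make the objective concrete, here is a minimal sketch that finds a median by exhaustive enumeration. It assumes linear unichromosomal signed permutations and swaps in the breakpoint distance for simplicity; that is not the HP distance used in these slides, and all function names are hypothetical. The n!·2^n candidates it visits illustrate why exact solvers need pruning.

```python
from itertools import permutations, product

def adjacencies(genome):
    """Adjacency set of a linear signed permutation, framed by 0 and n+1."""
    framed = [0] + list(genome) + [len(genome) + 1]
    # (a, b) and (-b, -a) are the same adjacency read from either strand.
    return {min((a, b), (-b, -a)) for a, b in zip(framed, framed[1:])}

def breakpoint_distance(g, h):
    """Number of adjacencies of g that are absent from h (assumed distance)."""
    return len(adjacencies(g) - adjacencies(h))

def exhaustive_median(a, b, c):
    """Return (score, M) minimizing d(A,M) + d(B,M) + d(C,M) over all candidates."""
    n = len(a)
    best = None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            m = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(m, x) for x in (a, b, c))
            if best is None or score < best[0]:
                best = (score, m)
    return best

score, median = exhaustive_median([1, 2, 3, 4], [1, -3, -2, 4], [2, 1, 4, 3])
print(score, median)
```

Even at n = 4 the loop scores 384 candidates, and the count grows factorially with n.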

Multichromosomal Reversal Median Problem
- Goal: find a median genome that minimizes the sum of the multichromosomal HP distances on the three edges
- Events considered: reversal, translocation, fusion, fission
- Exact and heuristic solvers exist for the Unichromosomal Reversal Median Problem (reversals are the only events)

Capless Breakpoint Graph
- Genome A → non-perfect matching M(A)
- Let a, b be adjacent genes in A; then (a^t, b^h) is an edge in M(A)
- A genome is composed of a set of edges and ends
- Matchings naturally correspond to undirected genomes (flipping a chromosome does not alter the matching)

Example
- Genomes: A = {‹-5, 1, 6, 3›, ‹2, 4›}, B = {‹1, 6›, ‹-5, -4, -3, -2›}
- [Figure: matchings M(A) and M(B), with A-ends and B-ends marked, and the resulting adjacency graph]

Capless Breakpoint Graph
- [Figure: G(A,B) with A-ends and B-ends marked, including AB-paths of length 0]
- Notation: C(A,B) = #cycles, AB = #AB-paths, AA = #AA-paths, BB = #BB-paths in G(A,B); n = #genes
- Example: n = 6, C(A,B) = 1, AB = 4, so d_HP ≥ 6 - 1 - 4/2 = 3
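The counts in this example can be reproduced with a short sketch. It assumes linear chromosomes and one particular extremity convention (a positive gene is read tail-to-head), which may differ from the convention in the figure without changing the component counts; function names are hypothetical.

```python
from collections import Counter

def matching(genome):
    """Map each extremity to its neighbor across an adjacency; ends stay unmatched."""
    def left(g):
        return (abs(g), "t" if g > 0 else "h")
    def right(g):
        return (abs(g), "h" if g > 0 else "t")
    partner = {}
    for chrom in genome:
        for x, y in zip(chrom, chrom[1:]):
            u, v = right(x), left(y)
            partner[u], partner[v] = v, u
    return partner

def classify_components(a, b):
    """Count cycles and AA-, AB-, BB-paths in the union of M(A) and M(B)."""
    ma, mb = matching(a), matching(b)
    genes = sorted({abs(g) for chrom in a for g in chrom})
    seen, counts = set(), Counter()
    for start in ((g, e) for g in genes for e in "th"):
        if start in seen:
            continue
        seen.add(start)
        stack, ends = [start], []
        while stack:
            v = stack.pop()
            # A vertex unmatched in M(A) is an A-end; likewise for B.
            ends += [name for name, m in (("A", ma), ("B", mb)) if v not in m]
            for m in (ma, mb):
                w = m.get(v)
                if w is not None and w not in seen:
                    seen.add(w)
                    stack.append(w)
        # No ends means a cycle; an isolated vertex is an AB-path of length 0.
        counts["".join(sorted(ends)) or "cycle"] += 1
    return counts

A = [[-5, 1, 6, 3], [2, 4]]
B = [[1, 6], [-5, -4, -3, -2]]
counts = classify_components(A, B)
n = 6
lower_bound = n - counts["cycle"] - counts["AB"] / 2
print(dict(counts), lower_bound)
```

For the two example genomes this yields one cycle and four AB-paths (two of length 0), so the bound evaluates to 3, matching the slide.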

A Lower Bound of the HP Distance
- A simpler lower bound uses only the number of genes, cycles, and paths
- Derived from Hannenhalli and Pevzner (1995): d_HP(A,B) ≥ n - C(A,B) - AB/2 + AA - BB
- Pseudo-cycle of A and B: [formula not recovered]

Pseudo-cycle Distance Median Problem
- Pseudo-cycle distance: [formula not recovered]
- Pseudo-cycle Distance Median Problem (PMP): find a median genome that minimizes the sum of the pseudo-cycle distances on the three edges
- We use the pseudo-cycle distance as a lower bound for the HP distance to derive an RMP solver

Branch-and-Bound Algorithm
- Enumerate the solution genomes gene by gene (genome enumeration)
- After enumerating a gene, compute an upper bound based on the partial solution genome
- Bound: check whether the upper bound of the partial solution is less than the current criterion
- Branch: if it is, discard the partial genome and enumerate another gene; otherwise update the criterion and continue the enumeration
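The enumerate/bound/branch loop can be sketched on a simpler problem. The version below is an illustration under stated assumptions, not the solver from these slides: it uses the breakpoint distance on linear unichromosomal signed permutations, and it minimizes a distance with a lower bound, which is the mirror image of maximizing pseudo-cycles with an upper bound. All names are hypothetical.

```python
def canon(a, b):
    """(a, b) and (-b, -a) denote the same adjacency."""
    return min((a, b), (-b, -a))

def adjacencies(genome):
    framed = [0] + list(genome) + [len(genome) + 1]
    return {canon(a, b) for a, b in zip(framed, framed[1:])}

def total_distance(m, genomes):
    """Sum over inputs of the adjacencies of m that the input lacks."""
    am = adjacencies(m)
    return sum(len(am - adjacencies(g)) for g in genomes)

def branch_and_bound_median(genomes):
    n = len(genomes[0])
    targets = [adjacencies(g) for g in genomes]
    # Initial criterion: the best input genome used as the median.
    best_score, best_genome = min((total_distance(g, genomes), list(g)) for g in genomes)

    def cost(a, b):
        adj = canon(a, b)
        return sum(adj not in t for t in targets)

    def extend(prefix, used, bound):
        nonlocal best_score, best_genome
        # Bound: fixed adjacencies already cost `bound`; it can only grow.
        if bound >= best_score:
            return
        last = prefix[-1] if prefix else 0
        if len(prefix) == n:
            final = bound + cost(last, n + 1)
            if final < best_score:
                best_score, best_genome = final, list(prefix)
            return
        # Branch: try every unused gene in both orientations.
        for gene in range(1, n + 1):
            if gene in used:
                continue
            for signed in (gene, -gene):
                extend(prefix + [signed], used | {gene}, bound + cost(last, signed))

    extend([], frozenset(), 0)
    return best_score, best_genome

print(branch_and_bound_median([[1, 2, 3, 4], [1, -3, -2, 4], [2, 1, 4, 3]]))
```

The bound is valid because every adjacency fixed in the prefix remains in any completed genome, so the partial cost never overestimates the final score.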

Genome Enumeration for Multichromosomal Genomes
- [Figure: enumeration tree for genomes on genes {1, 2, 3}]

Features
- Main components: contraction operation; upper bound on the number of pseudo-cycles; genome enumeration
- Extension of Caprara’s method for unichromosomal genomes (1999)

Contraction Operation
- Contraction of an edge e = {a^t, b^h} on M(A): M(A)/e
- [Figure: Case (1), Case (2), Case (3)]

Upper Bound on the Number of Pseudo-cycles
- Let S be a genome and Z = {G_1, G_2, G_3} a set of three input genomes
- The maximal γ(S,Z) is denoted by γ*
- Based on the triangle inequality, an upper bound on the number of pseudo-cycles can be derived: [formula not recovered]

Notes
- qn - γ* is a lower bound on the sum of pseudo-cycle distances between any S and each genome in Z = {G_1, G_2, G_3}
- Given an edge e, assume genome S contains e and maximizes γ(S,Z); let Z’ = {G_1/e, G_2/e, G_3/e} and assume S’ maximizes γ(S’,Z’); then S = S’ ∪ {e}

Upper Bound Test
- At a step of the algorithm, the current partial solution is S_i = {e_1, e_2, …, e_i}
- The upper bound UB_Si on γ(S,Z) over the genomes containing S_i is the following: [formula not recovered]
- Let UB be the current upper bound
- If UB_Si < UB, then the best upper bound of the genomes containing S_i is worse than UB

Branch-and-Bound Algorithm for Multichromosomal Genomes
- Compute an initial upper bound (UB) from the input genomes
- In each step, either an end or an edge is fixed in the solution
- End fixing: mark a node as an end of a chromosome
- Edge fixing: fix an edge e in the current partial solution genome S_i

Genome Enumeration for Multichromosomal Genomes
- [Figure: enumeration tree for genomes on genes {1, 2, 3}; red lines denote end fixing, black lines denote edge fixing]

Properties
- Can be extended to compute a given tree using iterative or progressive approaches
- However, median computation is still difficult: large nuclear genomes, complex events
- We also need to search for the best tree in the large tree space
- Number of trees for N species, and for 20 species: [formulas not recovered]
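The size of the tree space can be illustrated with the standard counts of binary trees on labeled leaves: (2N-5)!! unrooted and (2N-3)!! rooted. Whether the slide showed the rooted or the unrooted count is not recoverable, so the sketch below prints both; the function names are hypothetical.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)

def rooted_binary_trees(n):
    """Number of rooted binary trees on n >= 2 labeled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)

for n in (4, 5, 10, 20):
    print(n, unrooted_binary_trees(n), rooted_binary_trees(n))
```

For 20 species the unrooted count is already on the order of 10^20, which rules out exhaustive tree search.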

Statistical Approaches
- Combinatorial approaches are the focus of genome rearrangement research
- Only one MCMC method exists
- Maximum likelihood (ML) methods have been very popular in sequence phylogenetic analysis
- Bootstrapping (data resampling) is a popular method for assessing the quality of the obtained trees
- It is hard to apply ML and bootstrapping directly to gene order data

Sequence ML Phylogeny
- For each position, generate all possible tree structures
- Based on the evolutionary model, calculate the likelihood of these trees and sum them to get the column likelihood
- Calculate the tree likelihood by multiplying the likelihoods of all positions
- Choose the tree with the greatest likelihood

Example
- A: acgcaa
- B: acataa
- C: atgtca
- D: gcgtta
- [Figure: three candidate trees, with leaf orders ABCD, ACBD, ADCB]

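All Possible Evolutionary Paths (Column 1)
- [Figure: tree with leaves a, a, a, g; each internal node ranges over a, c, g, t]

Likelihood for One Path
- [Figure: tree with leaves a, a, a, g and one assignment of internal states: a, g, t]

Sum of All Paths (Column 1)
- [Figure: tree with leaves a, a, a, g; each internal node ranges over a, c, g, t]

Whole Sequence
- [Figure: tree on leaves A, B, C, D]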
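The per-column sum over internal states and the product over columns can be sketched as follows. The sketch makes several assumptions that the slides do not state: the Jukes-Cantor (JC69) model, a single branch length t for all five branches, and the reading of the three candidate trees as the quartet pairings AB|CD, AC|BD, and AD|BC. Function names are hypothetical.

```python
import math
from itertools import product

STATES = "acgt"

def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {(x, y): (same if x == y else diff) for x in STATES for y in STATES}

def column_likelihood(tree, column, t=0.1):
    """Sum over all internal-state assignments (u, v) of a quartet tree.

    tree = ((i, j), (k, l)): leaves i, j attach to internal node u and
    leaves k, l to internal node v; all five branches have length t.
    """
    p = jc69(t)
    (i, j), (k, l) = tree
    return sum(
        0.25 * p[u, column[i]] * p[u, column[j]] * p[u, v]
        * p[v, column[k]] * p[v, column[l]]
        for u, v in product(STATES, repeat=2)
    )

def log_likelihood(tree, sequences, t=0.1):
    """Product of column likelihoods over the alignment, as a sum of logs."""
    return sum(math.log(column_likelihood(tree, col, t)) for col in zip(*sequences))

seqs = ["acgcaa", "acataa", "atgtca", "gcgtta"]  # A, B, C, D from the example
trees = {"AB|CD": ((0, 1), (2, 3)), "AC|BD": ((0, 2), (1, 3)), "AD|BC": ((0, 3), (1, 2))}
scores = {name: log_likelihood(tr, seqs) for name, tr in trees.items()}
print(max(scores, key=scores.get), scores)
```

A real ML program would also optimize the branch lengths for each tree rather than fixing them.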

MLBE
- Convert the gene orders into binary sequences based on adjacencies
- Convert the binary sequences into protein or DNA sequences
- Use RAxML to compute an ML tree on the sequences
- Binary encoding was used before for parsimony analysis, with reasonable results
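The first step, turning gene orders into binary sequences, might look like the sketch below: one column per adjacency observed in any genome, with 1 marking presence. It assumes linear unichromosomal genomes without end caps, and the function names are hypothetical. The later conversion to DNA or protein characters and the RAxML run are not shown.

```python
def canon(a, b):
    """(a, b) and (-b, -a) denote the same adjacency."""
    return min((a, b), (-b, -a))

def adjacency_set(genome):
    return {canon(a, b) for a, b in zip(genome, genome[1:])}

def binary_encode(genomes):
    """Return (columns, rows): one 0/1 string per genome, one column per adjacency."""
    sets = [adjacency_set(g) for g in genomes]
    columns = sorted(set().union(*sets))
    rows = ["".join("1" if c in s else "0" for c in columns) for s in sets]
    return columns, rows

columns, rows = binary_encode([[1, 2, 3], [1, -2, 3], [-3, -2, -1]])
for row in rows:
    print(row)
```

The first and third genomes are the same order read from opposite strands, so they receive identical rows.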

Binary Encoding

MLBE Sequences

Experimental Setup
- Generate random trees of N taxa: each tree is equally likely; the birth-death model is preferred
- Starting from the root, apply r events along each edge: r is the expected number of events; the actual number is sampled from 1…2r
- Compare the inferred tree with the true tree using the RF rate
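Evolving a genome along one edge could be simulated as in the sketch below. It assumes a uniform integer draw from 1…2r for the event count, linear unichromosomal genomes, and an 80/20 inversion/transposition mix as in the equal-content experiments; these choices and the function names are illustrative assumptions, not the exact simulator behind the results.

```python
import random

def inversion(genome, rng):
    """Reverse a random segment and flip the signs of its genes."""
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def transposition(genome, rng):
    """Cut a random segment and reinsert it at a random position."""
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))
    segment, rest = genome[i:j], genome[:i] + genome[j:]
    k = rng.randint(0, len(rest))
    return rest[:k] + segment + rest[k:]

def evolve_edge(genome, r, rng, p_inversion=0.8):
    """Apply a random number of events (1..2r) along one tree edge."""
    events = rng.randint(1, 2 * r)
    for _ in range(events):
        op = inversion if rng.random() < p_inversion else transposition
        genome = op(genome, rng)
    return genome, events

rng = random.Random(42)
child, n_events = evolve_edge(list(range(1, 11)), 3, rng)
print(n_events, child)
```

A transposition may occasionally reinsert the segment where it was cut, which this sketch counts as an event anyway.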

Experimental Results (Equal Content 1) 80% inversion, 20% transposition

Experimental Results (Equal Content 2) 80% inversion, 20% transposition

Experimental Results (Unequal 1) 90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Experimental Results (Unequal 2) 90% inversion, 10% of del/ins/dup, 5-30 genes per segment

Multistate Encoding

MLME Results (200 genes 20 genomes)

MLME Results (1000 genes 20 genomes)

Post Analysis
- Bootstrapping has been widely used to assess the quality of sequence phylogenies
- The same procedure is impossible for gene order data since there is only one character
- We tested jackknifing on simulated data to determine: whether jackknifing is useful; the best jackknifing rate; the threshold for the support values

DNA Bootstrapping

Bootstrapping Results

Jackknifing Procedure
- Generate a new dataset by removing half of the genes from the original genomes (orders are preserved)
- Compute a tree on the new dataset
- Repeat K times to obtain K replicates
- Obtain a consensus tree with support values

An Example: New Genomes
- Original genomes:
  1 2 3 4 5 6 7 8 9 10
  1 -4 5 2 8 10 9 -7 -6 3
  …
- New genomes:
  1 3 5 7 9
  1 5 9 -7 3
  …
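Generating one jackknife replicate can be sketched as below. The parameter `remove_fraction` and the function names are hypothetical, and the sketch assumes every genome contains the same gene set (equal content).

```python
import random

def restrict(genome, kept):
    """Keep only the genes in `kept`, preserving order and signs."""
    return [g for g in genome if abs(g) in kept]

def jackknife_replicate(genomes, remove_fraction, rng):
    """Remove the same random subset of genes from every genome."""
    genes = sorted({abs(g) for g in genomes[0]})
    n_keep = len(genes) - int(round(remove_fraction * len(genes)))
    kept = set(rng.sample(genes, n_keep))
    return [restrict(g, kept) for g in genomes]

g1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
g2 = [1, -4, 5, 2, 8, 10, 9, -7, -6, 3]
print(restrict(g1, {1, 3, 5, 7, 9}), restrict(g2, {1, 3, 5, 7, 9}))
replicate = jackknife_replicate([g1, g2], 0.5, random.Random(0))
```

Keeping genes 1, 3, 5, 7, 9 reproduces the new genomes in the example; a tree is then computed on each replicate.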

Jackknifing Rate

Support Value Threshold - FP
- Up to 90% of false positives (FP) can be identified with 85% as the threshold
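Applying a support threshold can be sketched as follows, assuming each replicate tree is represented as a set of clades (frozensets of taxon names). This representation and the function names are hypothetical; the 0.85 default comes from the threshold on this slide.

```python
from collections import Counter

def support_values(replicate_trees):
    """Fraction of replicate trees that contain each clade."""
    k = len(replicate_trees)
    counts = Counter(clade for tree in replicate_trees for clade in tree)
    return {clade: c / k for clade, c in counts.items()}

def low_support(reference_tree, replicate_trees, threshold=0.85):
    """Clades of the reference tree whose support falls below the threshold."""
    support = support_values(replicate_trees)
    return {c: support.get(c, 0.0) for c in reference_tree
            if support.get(c, 0.0) < threshold}

ab, ac, abc = frozenset("AB"), frozenset("AC"), frozenset("ABC")
replicates = [{ab, abc}] * 16 + [{ac, abc}] * 4
flagged = low_support({ab, abc}, replicates)
print(flagged)
```

Here the clade {A, B} appears in 16 of 20 replicates (80%) and is flagged, while {A, B, C} has full support.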

Trees with FP

Support Value Threshold - FN

Low Support Branches

Jackknife Properties
- Jackknifing is necessary and useful for gene order phylogeny, and a large number of errors can be identified
- A 40% jackknifing rate is reasonable
- 85% is a conservative threshold; 75% can also be used
- Low-support branches should be examined in detail

Conclusions
- Great progress has been made in genome rearrangement research
- We are able to handle real-size data
- Now the question is what data: data quality and biological modeling
- Ancestral genome reconstruction is still difficult
- Putting everything together has just started

Thank You!
