CSCI2950-C Lecture 6 Genome Rearrangements and Duplications

Presentation transcript:

CSCI2950-C Lecture 6 Genome Rearrangements and Duplications http://cs.brown.edu/courses/csci2950-c/

Outline
Recap: Sorting By Reversals & Breakpoint Graphs
Multichromosomal Rearrangements
Duplications: Segmental and Whole-Genome
Probabilistic Genome Rearrangements

Signed Permutations But genes (and DNA) have directions (5’ to 3’), so we should consider signed permutations: π = 1 -2 -3 4 -5

Sorting by reversals: 5 steps

Sorting by reversals: 4 steps

Sorting by reversals: 4 steps What is the reversal distance for this permutation? Can it be sorted in 3 steps?

Breakpoint graph 1-dimensional construction
Transform π = < 2, -4, -3, 5, -8, -7, -6, 1 > into γ = < 1, 2, 3, 4, 5, 6, 7, 8 > by reversals.
Vertices: i → ia ib, -i → ib ia, plus the end vertices 0b and 9a.
Edges: match the ends of consecutive blocks in π and in γ.
Superimpose the two matchings.

Breakpoint graph Breakpoints
Each reversal acts between 2 breakpoints and removes at most 2 of them, so d ≥ # breakpoints / 2 = 6/2 = 3.
Theorem (Hannenhalli-Pevzner 1995): d(π) = n + 1 – c(π) + h(π) + f(π), where c(π) = # cycles; h and f are rather complicated, but can be computed from the graph in polynomial time.
Here, d = 8 + 1 – 5 + 0 + 0 = 4.
Breakpoints are not independent: the breakpoint graph shows the dependencies between them.
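The counts on this slide can be reproduced with a short sketch. It assumes one common encoding, in which each signed element x becomes the vertex pair (2x-1, 2x), reversed when x is negative; the function names are illustrative and not from the lecture. It computes only n + 1 – c(π), so the hurdle and fortress terms h and f are ignored.

```python
def count_breakpoints(perm):
    """Count adjacent pairs (a, b) in 0, perm, n+1 with b - a != 1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)


def count_cycles(perm):
    """Count alternating black/gray cycles in the breakpoint graph of perm
    against the identity (assumed encoding: x -> 2x-1, 2x)."""
    n = len(perm)
    seq = [0]
    for x in perm:
        seq += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    seq.append(2 * n + 1)

    # Black edges join the vertices at positions (0,1), (2,3), ... of seq.
    black = {}
    for i in range(0, len(seq), 2):
        u, v = seq[i], seq[i + 1]
        black[u], black[v] = v, u

    # Gray edges join 2k and 2k+1, so the gray partner of v is v ^ 1.
    seen, cycles = set(), 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]      # follow the black edge
            seen.add(w)
            v = w ^ 1         # then follow the gray edge
    return cycles


pi = [2, -4, -3, 5, -8, -7, -6, 1]
breakpoints = count_breakpoints(pi)
cycles = count_cycles(pi)
lower_bound = len(pi) + 1 - cycles   # n + 1 - c(pi), without h and f
```

For the permutation on the slide this gives 6 breakpoints and 5 cycles, so n + 1 – c(π) = 4, which agrees with the stated distance because h = f = 0 here.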

Oriented and Unoriented Cycles
Oriented cycles: a proper reversal ρ acts on black edges and increases the number of cycles, c(ρπ) – c(π) = 1. [Figure: ρ rearranges x, x+1, …, y, y+1 into x, y, …, x+1, y+1.]
Unoriented cycles: no proper reversal acts on an unoriented cycle. These are “impediments” in sorting by reversals.

Safe Reversals
Let Δc = c(ρπ) – c(π) and Δh = h(ρπ) – h(π). A reversal ρ is safe if Δc – Δh = 1.
Oriented cycles: a proper reversal acts on black edges, with c(ρπ) – c(π) = 1.
Unoriented cycles: [Figure: 2 1 3 with c(π) = 2, h(π) = 1; after a reversal, -1 -2 3 with c(π) = 2, h(π) = 0.]
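For very small permutations the distance can be checked by brute force. The sketch below is a plain breadth-first search over signed permutations, not the Hannenhalli-Pevzner algorithm, and is only practical for a handful of elements; the function name is illustrative. Applied to the two small permutations on this slide, it agrees with d = n + 1 – c + h: 3 + 1 – 2 + 1 = 3 for 2 1 3 and 3 + 1 – 2 + 0 = 2 for -1 -2 3.

```python
from collections import deque


def reversal_distance(perm):
    """Exact signed reversal distance to the identity by BFS (tiny n only)."""
    start = tuple(perm)
    target = tuple(range(1, len(perm) + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        # Try every reversal of the segment cur[i..j], flipping signs.
        for i in range(len(cur)):
            for j in range(i, len(cur)):
                flipped = tuple(-x for x in reversed(cur[i:j + 1]))
                nxt = cur[:i] + flipped + cur[j + 1:]
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
    return None  # not reached: every signed permutation can be sorted


d_unoriented = reversal_distance([2, 1, 3])
d_after = reversal_distance([-1, -2, 3])
```

The state space has n! · 2^n elements, so this is a sanity check rather than a usable method.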

Algorithm Outline
Reversal_Sort(π)
  While π not sorted
    if π has a “long cycle”
      Select ρ [a padding of π]
    else if π has an oriented component
      Select a safe reversal in component
    else if π has a hurdle
      Select ρ [Hurdle merging or cutting]
    else if π is a fortress
      Select ρ [superhurdle merging]
    π ← π · ρ
  endwhile

Breakpoint graph ⇒ rearrangement scenario

Cell Division and Mutation
Somatic mutations that occur during cell division are a major contributor to the development of cancer.
Types of change: single nucleotide, copy number, structural.
We will focus on structural changes and later on copy number, which is not to say that single nucleotide changes are less important.
What is the effect of structural changes?

Types of Rearrangements
Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6
Translocation: 1 2 3 4 5 6 → 1 2 6 4 5 3
Fusion: 1 2 3 4 5 6 → 1 2 3 4 5 6 (two chromosomes joined into one)
Fission: the inverse of fusion (one chromosome split into two)
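A signed reversal is simple to state in code: reverse the order of a segment and flip the sign of every element in it. The sketch below uses 0-based inclusive indices, which is a convention chosen here rather than one taken from the slides.

```python
def reversal(perm, i, j):
    """Return perm with the segment perm[i..j] reversed and sign-flipped.

    Indices are 0-based and inclusive (an assumed convention).
    """
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


before = [1, 2, 3, 4, 5, 6]
after = reversal(before, 2, 4)        # reverses the block 3 4 5
restored = reversal(after, 2, 4)      # a reversal undoes itself
```

Here `after` is the slide's example 1 2 -5 -4 -3 6, and applying the same reversal again restores the original order.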

Multichromosomal rearrangements Translocation
(5 9 4 10) (–6 –1 11 7 –2) → (5 9 11 7 –2) (–6 –1 4 10)
By concatenating chromosomes, this may be mimicked by a single reversal.
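One concatenate for which a single reversal reproduces this translocation can be checked directly. In the sketch below the second chromosome is written in its flipped form before concatenating; that choice of representation, the helper names, and the 0-based indices are assumptions made for illustration.

```python
def flip(chrom):
    """The same chromosome read from the other end."""
    return [-x for x in reversed(chrom)]


def reversal(seq, i, j):
    """Reverse and sign-flip seq[i..j] (0-based, inclusive)."""
    return seq[:i] + [-x for x in reversed(seq[i:j + 1])] + seq[j + 1:]


A = [5, 9, 4, 10]
B = [-6, -1, 11, 7, -2]

# Concatenate A with the flipped representation of B.
concat = A + flip(B)                  # 5 9 4 10 | 2 -7 -11 1 6

# One reversal spanning the chromosome boundary.
result = reversal(concat, 2, 6)       # 5 9 11 7 -2 | -10 -4 1 6

# The boundary now sits after the fifth element.
new_A = result[:5]
new_B = flip(result[5:])              # flip back for comparison
```

The result is (5 9 11 7 –2) and (–6 –1 4 10), as on the slide. Concatenating A with B unflipped does not admit such a one-step reversal, which is the point of the following slides.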

Multichromosomal rearrangements Translocation Most concatenates don’t work! The first reversal just flipped a whole chromosome to position it correctly. This is an artifact of our genome representation; it is not a biological event. We want to avoid such artifacts.

Multichromosomal rearrangements Translocation Most concatenates don’t work! These concatenates required 3 reversals instead of 1! The second reversal just flipped a whole chromosome to position it correctly; this is an artifact of our genome representation, not a biological event. We want to avoid such extra steps and artifacts.

Multichromosomal rearrangements Fission and fusion
Fission: (1 2 3 4 5) ( ) → (1 2) (3 4 5); fusion is the inverse.
By concatenating chromosomes, this may be mimicked by a single reversal.
Evolution: Human chromosome 2 is the fusion of two chromosomes that remain separate in other hominoids (chimpanzees, orangutans, gorillas).

Multichromosomal rearrangements Fission and fusion
Fission: (1 2 3 4 5) ( ) → (1 2) (3 4 5); fusion is the inverse.
By concatenating chromosomes, this may be mimicked by a single reversal.
Flipping the whole chromosome (3 4 5) gives a different representation (–5 –4 –3) of the same chromosome.
Chromosome ends ( ) ( ) must be tracked too.
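To see why chromosome ends must be tracked, the fission can be replayed on a concatenate with explicit end markers. The encoding below is hypothetical: the numbers 101 to 104 stand for the four chromosome ends ("caps"), and a chromosome is whatever lies between an opening and a closing cap. None of these names come from the lecture.

```python
CAPS = {101, 102, 103, 104}   # hypothetical labels for chromosome ends


def flip(chrom):
    return [-x for x in reversed(chrom)]


def reversal(seq, i, j):
    """Reverse and sign-flip seq[i..j] (0-based, inclusive)."""
    return seq[:i] + [-x for x in reversed(seq[i:j + 1])] + seq[j + 1:]


def split_chromosomes(concat):
    """Cut a capped concatenate into chromosomes (content between cap pairs)."""
    chroms, current, inside = [], [], False
    for x in concat:
        if abs(x) in CAPS:
            if inside:               # closing cap: chromosome is complete
                chroms.append(current)
                current = []
            inside = not inside
        else:
            current.append(x)
    return chroms


def canonical(chrom):
    """Pick one of the two equivalent readings of a chromosome."""
    return max(chrom, flip(chrom))


# (1 2 3 4 5) followed by an empty chromosome ( ).
concat = [101, 1, 2, 3, 4, 5, 102, 103, 104]
before = split_chromosomes(concat)

# A single reversal covering 3 4 5 and the two inner caps.
after_concat = reversal(concat, 3, 7)
after = [canonical(c) for c in split_chromosomes(after_concat)]
```

The reversal leaves (1 2) and (–5 –4 –3); the second is the flipped reading of (3 4 5), which `canonical` normalizes.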

Multichromosomal rearrangements Concatenates
Concatenate all the chromosomes of a genome into a single sequence. These concatenates represent the same genome:
(5 9 4 10) (8 3) (–6 –1 11 7 –2)
(8 3) (2 –7 –11 1 6) (5 9 4 10)
Permuting the order of chromosomes and flipping chromosomes do not count as biological events.
Chromosome ends ( ) ( ) ( ) are included and are distinguishable.
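Whether two concatenates describe the same genome can be tested by normalizing away chromosome order and orientation. The sketch below does this with a canonical form per chromosome; the function names are illustrative, and chromosome ends are not modeled.

```python
def flip(chrom):
    """The same chromosome read from the other end."""
    return tuple(-x for x in reversed(chrom))


def canonical(chrom):
    """Choose one fixed reading out of a chromosome and its flip."""
    return max(tuple(chrom), flip(chrom))


def genome_key(chromosomes):
    """Order- and orientation-independent description of a genome."""
    return sorted(canonical(c) for c in chromosomes)


first = [[5, 9, 4, 10], [8, 3], [-6, -1, 11, 7, -2]]
second = [[8, 3], [2, -7, -11, 1, 6], [5, 9, 4, 10]]
same_genome = genome_key(first) == genome_key(second)

# Changing gene order inside a chromosome gives a different genome.
third = [[5, 9, 4, 10], [3, 8], [-6, -1, 11, 7, -2]]
different = genome_key(first) != genome_key(third)
```

The two concatenates from the slide compare equal, while swapping genes inside (8 3) does not.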

Multichromosomal rearrangements Results Theorem (Tesler 2002): Let d = minimum total number of reversals, translocations, fissions, and fusions among all rearrangement scenarios between two genomes. By carefully choosing concatenates of the genomes, we can usually mimic a most parsimonious scenario by a d-step reversal scenario on the concatenates with no chromosome flips or chromosome permutations. There are pathological cases requiring a (d + 1)-step reversal scenario with one chromosome flip. Total time O((n + N)^2).

Multichromosomal rearrangements Results n = # of blocks, N = # of chromosomes Distance is the minimum number of reversals, fissions, fusions, translocations. Solution method: use suitable concatenates to obtain an equivalent “sorting by reversals” problem. The H-P algorithm has a nonconstructive step that required a lot of work to fix. It pertains to choosing concatenates to avoid flips and chromosome permutations. (Tesler 2002) does this constructively.

GRIMM Web Server Real genome architectures are represented by signed permutations Efficient algorithms to sort signed permutations have been developed GRIMM web server computes the reversal distances between signed permutations:

GRIMM Web Server http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM 22 dense pages to fix gaps

Other Types of Rearrangements
Transpositions: 1 2 3 4 5 6 → 1 2 5 3 4 6
Duplication Transposition: 1 2 3 4 5 6 → 1 2 3 4 5 3 4 6
Duplications are very frequent in cancer genomes.
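Both operations can be written as list manipulations. In this sketch a transposition swaps two adjacent blocks and a duplication inserts a copy of a block at another position; the half-open 0-based index convention and the function names are assumptions made here.

```python
def transposition(perm, i, j, k):
    """Swap the adjacent blocks perm[i:j] and perm[j:k] (half-open, 0-based)."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def duplication(perm, i, j, k):
    """Insert a copy of the block perm[i:j] before position k."""
    return perm[:k] + perm[i:j] + perm[k:]


genome = [1, 2, 3, 4, 5, 6]
moved = transposition(genome, 2, 4, 5)    # block 3 4 moves past 5
copied = duplication(genome, 2, 4, 5)     # copy of 3 4 inserted after 5
```

These reproduce the two examples on the slide. After a duplication the sequence is no longer a permutation, which is what makes the problems on the next slides harder.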

Duplications
What problem to solve?
Given G ∈ {1, …, n}^N (a “permutation with duplicates”) and the identity i = (1 2 … n), find reversals ρ1, ρ2, …, ρt, duplications δ1, …, δs, and a permutation σ such that σ(ρ1, …, ρt, δ1, …, δs) i = G and s + t is minimal.
Example: 1 2 3 4 5 6 → 1 2 3 4 5 3 4 -2 -3 6 ???
HARD!!! (NP-hard?)

Duplications (2)
What problem to solve?
Given: G ∈ {1, …, n}^N (a “permutation with duplicates”) and H = σG for a permutation σ.
Find: reversals ρ1, ρ2, …, ρt such that ρ1 … ρt G = H and t is minimal.
Signed reversal distance with duplicates is NP-hard (Chen, et al. 2005).
If a 1-1 mapping of repeated elements (orthologs) in G to H is given, then the problem reduces to reversal distance.

Duplications (3)
What problem to solve?
Given: G ∈ {1, …, n}^N (permutation with duplicates).
Find: a permutation π, reversals ρ1, ρ2, …, ρs, and duplications δ1, …, δt such that ρ1 … ρs δ1 … δt π = G and t is minimal.
Solution when there are at most two duplicates per gene and a restricted class of duplications: El-Mabrouk and Sankoff (2002).

Whole Genome Duplication Genome is doubled – extra copy of each element. Subsequently undergoes reversals. Genome Halving Problem. Given a duplicated genome P, recover the ancestral pre-duplicated genome R minimizing the reversal distance from the perfect duplicated genome R ⊕ R to the duplicated genome P. (El-Mabrouk and Sankoff 1998-2003)

Whole Genome Duplication Genome is doubled – extra copy of each element. Subsequently undergoes reversals. If the copies of each element are labeled uniquely, then the problem reduces to the reversal distance problem.

Reversal Distance and Duplications
Let d(G, H) = reversal distance between G and H.
The problem of computing d(P, R ⊕ R) is unsolved.
min_R d(P, R ⊕ R) is solvable in polynomial time.

Breakpoint Graph
π = 2 -4 -3 5 -8 -7 -6 1 9, with vertices 0h 2t 2h 4h 4t 3h 3t 5t 5h 8h 8t 7h 7t 6h 6t 1t 1h 9t
γ = 1 2 3 4 5 6 7 8 9, with vertices 0h 1t 1h 2t 2h 3t 3h 4t 4h 5t 5h 6t 6h 7t 7h 8t 8h 9t
G(π, γ): the two superimposed; in a/b notation the vertices of π read 0b 2a 2b 4b 4a 3b 3a 5a 5b 8b 8a 7b 7a 6b 6a 1a 1b 9a

Genome Halving: Exhaustive Doubled genome with 2n genes. Compute reversal distance on all 2^n labelings of genes.
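The exhaustive approach starts by enumerating labelings. The sketch below generates all 2^n ways to name the two copies of each of the n genes as copy 1 and copy 2; the representation of a labeled gene as a (signed gene, copy) pair and the integer encoding of the lettered genome (a=1, b=2, g=3, d=4, f=5, e=6, c=7) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def labelings(doubled):
    """Yield every labeling of a doubled genome as (signed gene, copy) pairs."""
    positions = defaultdict(list)
    for idx, gene in enumerate(doubled):
        positions[abs(gene)].append(idx)
    genes = sorted(positions)

    # For each gene, decide whether its first occurrence is copy 1 or copy 2.
    for choice in product((1, 2), repeat=len(genes)):
        labeled = [None] * len(doubled)
        for gene, first_copy in zip(genes, choice):
            i, j = positions[gene]
            labeled[i] = (doubled[i], first_copy)
            labeled[j] = (doubled[j], 3 - first_copy)
        yield labeled


# P = -a-b+g+d+f+g+e-a+c-f-c-b-d-e under the assumed integer encoding.
P = [-1, -2, 3, 4, 5, 3, 6, -1, 7, -5, -7, -2, -4, -6]
all_labelings = list(labelings(P))
```

With n = 7 genes this already gives 128 labelings, each of which would need its own reversal distance computation, so the approach does not scale.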

Genome Halving Weak Genome Halving Problem. For a given duplicated genome P, find a perfect duplicated genome R ⊕ R and a labeling of gene copies that maximizes the number of black-gray cycles c(G) in the breakpoint graph G(P, R ⊕ R) of the labeled genomes P and R ⊕ R. (Alekseyev and Pevzner 2006)
Theorem (Hannenhalli-Pevzner 1995): d(π) = n + 1 – c(π) + h(π) + f, where c = # cycles; h = # hurdles; f = 1 if π is a fortress.

Contracted Breakpoint Graph Breakpoint graph construction
π = 2 -4 -3 5 -8 -7 -6 1 9, with vertices 0h 2t 2h 4h 4t 3h 3t 5t 5h 8h 8t 7h 7t 6h 6t 1t 1h 9t
γ = 1 2 3 4 5 6 7 8 9, with vertices 0h 1t 1h 2t 2h 3t 3h 4t 4h 5t 5h 6t 6h 7t 7h 8t 8h 9t
G(π, γ): the two superimposed.
Implicit were obverse edges (xt, xh): π is a black-obverse alternating path and γ is a gray-obverse alternating path.

Contracted Breakpoint Graph With duplicates, there are pairs of vertices with the same label. Contract these identical vertices.

Contracted Breakpoint Graph
P = −a−b+g+d+f+g+e−a+c−f−c−b−d−e
R = −a−b−d−g+f−c−e
G’(P, R ⊕ R): each gray edge is a pair of parallel edges.

Cycle Decompositions In H-P theory, c(π) = # of cycles in maximal cycle decomposition was key parameter. Strategy: analyze cycle decompositions of contracted breakpoint graph

Cycle Decompositions Genomes P and Q G(P,Q) breakpoint graph for some labeling Black-gray cycle decomposition ??? G’(P,Q) contracted breakpoint graph Induced black-gray cycle decomposition Labeling Problem. Given a black-gray cycle decomposition of the contracted breakpoint graph G′(P,Q) of duplicated genomes P and Q, find labeling of P and Q that induces this cycle decomposition. Does not always have a solution.

Maximal black-gray cycle decomposition
P = −a−b+g+d+f+g+e−a+c−f−c−b−d−e
R = −a−b−d−g+f−c−e
[Figure panels: contracted breakpoint graph G’(P, R ⊕ R); BG graph corresponding to G’; maximal black-gray cycle decomposition of G’.]

Cycle Decomposition
P = −a−b+g+d+f+g+e−a+c−f−c−b−d−e
R = −a−b−d−g+f−c−e
[Figure panels: P as black-obverse cycle; (c) maximal black-gray cycle decomposition C of G’; (e) superimposing the two graphs gives a breakpoint graph inducing the cycle decomposition in (c).]

Genome Halving Algorithm: Outline
Input: Doubled genome P
Construct the BO (black-obverse) graph for P by gluing identical edges.
Introduce gray edges “optimally” to create the BOG (black-obverse-gray) graph G’ with a single gray-obverse cycle (!!!).
R = gray-obverse cycle in G’.
Find a maximal black-gray cycle decomposition of G’ and a labeling of Q = R ⊕ R.

Alternative Rearrangement Metrics
Thus far, distance has been posed as the minimum number of rearrangements transforming one permutation into the identity: a parsimony assumption in evolution.
Score S(ρ) for a rearrangement ρ. Parsimony: S(ρ) = 1 for all ρ, so S(ρ1, ρ2, …, ρt) = Σ S(ρi) = t.
Length-weighted reversals: S(ρ) = l(ρ)^α, where l(ρ) = length of the reversed subsequence (Bender, et al. 2008).
Many of the resulting optimization problems are NP-hard.
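The additive score is easy to evaluate for a given scenario. In the sketch below each reversal is described only by its 0-based inclusive endpoints, a representation chosen here for illustration; setting α = 0 recovers the parsimony score, since l(ρ)^0 = 1.

```python
def scenario_score(reversals, alpha):
    """Sum of l(rho)**alpha over a scenario.

    reversals: list of (i, j) endpoints, 0-based and inclusive (assumed),
    so a reversal has length j - i + 1.
    """
    return sum((j - i + 1) ** alpha for i, j in reversals)


scenario = [(0, 0), (2, 4), (1, 6)]          # lengths 1, 3 and 6
parsimony = scenario_score(scenario, 0)      # counts reversals
linear = scenario_score(scenario, 1)         # total reversed length
quadratic = scenario_score(scenario, 2)
```

Under α = 1 a scenario of many short reversals can score better than one long reversal, so the optimal scenario generally depends on α.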

Probabilistic Genome Rearrangements
Pr[rearrangement ρ] = p. Compute Pr[rearrangement sequence ρ1 … ρn].
Inversions occur according to a Poisson process (York, et al. 2002). L inversions: Pr[L | λ] = e^{-λ} λ^{L} / L!
There are n(n+1)/2 possible inversions, each occurring with equal probability.
Ω = {inversion sequences}. For X = ρ1 … ρLx ∈ Ω, Pr[X | λ] = (e^{-λ} λ^{Lx} / Lx!) (n(n+1)/2)^{-Lx}
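The probability of an inversion sequence depends only on its length Lx, on λ, and on n. A sketch that evaluates it in log space, to avoid underflow for long sequences, follows; the function name is illustrative.

```python
import math


def log_prob_sequence(length, lam, n):
    """log Pr[X | lambda] for an inversion sequence X with `length` inversions
    on n markers: Poisson(length; lambda) times a uniform choice among
    n(n+1)/2 inversions at each step."""
    log_poisson = -lam + length * math.log(lam) - math.lgamma(length + 1)
    log_uniform = -length * math.log(n * (n + 1) / 2)
    return log_poisson + log_uniform


# Two inversions on n = 3 markers with lambda = 1.5.
log_p = log_prob_sequence(2, 1.5, 3)
direct = math.exp(-1.5) * 1.5 ** 2 / math.factorial(2) * 6 ** -2
```

Exponentiating `log_p` gives the same value as the direct formula; for realistic sequence lengths only the log form stays numerically usable.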

Probabilistic Genome Rearrangements
Pr[X, λ | π] = Pr[X, λ, π] / Pr[π] = Pr[π | X, λ] Pr[X | λ] Pr[λ] / Pr[π] = (1) ((e^{-λ} λ^{Lx} / Lx!) (n(n+1)/2)^{-Lx}) (1/λmax) / Pr[π]
Problem: How to evaluate this distribution?
Solution: Iteratively sample from Ω × (0, λmax]: (X0, λ0) → (X1, λ1) → (X2, λ2) → …
After a long time, the chain reaches its stationary distribution. Markov chain Monte Carlo

MCMC Genome Rearrangements
How to update? (Xi, λi) → (Xi+1, λi+1)
Alternate updates of λ and X (Metropolis-Hastings algorithm): (Xi, λi) → (Xi, λi+1) → (Xi+1, λi+1)
Pr[λ | X, π] ∝ Pr[X | λ] Pr[λ] ∝ e^{-λ} λ^{Lx} Pr[λ]
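The λ update can be sketched with a random-walk Metropolis step targeting e^{-λ} λ^{Lx} on (0, λmax], that is, with the uniform prior Pr[λ] = 1/λmax. The proposal distribution, step size, and starting value below are assumptions; the slides do not specify them, and the update of X is not shown.

```python
import math
import random


def log_target(lam, length, lam_max):
    """Unnormalized log Pr[lambda | X, pi] under a uniform prior on (0, lam_max]."""
    if not 0 < lam <= lam_max:
        return -math.inf
    return -lam + length * math.log(lam)


def sample_lambda(length, lam_max, steps, step_size=1.0, seed=0):
    """Random-walk Metropolis samples of lambda for a fixed sequence length."""
    rng = random.Random(seed)
    lam = min(1.0, lam_max)          # arbitrary starting point inside the support
    samples = []
    for _ in range(steps):
        proposal = lam + rng.uniform(-step_size, step_size)
        log_ratio = (log_target(proposal, length, lam_max)
                     - log_target(lam, length, lam_max))
        # Symmetric proposal, so the acceptance probability is min(1, ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            lam = proposal
        samples.append(lam)
    return samples


samples = sample_lambda(length=4, lam_max=50.0, steps=20000, step_size=2.0, seed=1)
tail = samples[5000:]                # discard an initial burn-in portion
posterior_mean = sum(tail) / len(tail)
```

When λmax is large, the target is close to a Gamma distribution with shape Lx + 1, so the sample mean should settle near Lx + 1. That also means λ could be drawn directly instead of by a Metropolis step.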

MCMC Genome Rearrangements: Updating X
(Xi, λi+1) → (Xi+1, λi+1)
Choose a section to replace with probability q(l, j), where l = length and πj = starting permutation.
Generate a new subpath from πα to πβ.
Use the breakpoint graph G(πα, πβ) to choose an inversion sequence where Δ(c) = 1 with high probability.

MCMC Genome Rearrangements

MCMC Genome Rearrangements Can we use this approach for other genome rearrangement operations? Translocations, duplications, etc.

References
G. Tesler: “Efficient algorithms for multichromosomal genome rearrangements.” J. Comput. Syst. Sci. 65(3): 587-609 (2002)
Xin Chen, Jie Zheng, Zheng Fu, Peng Nan, Yang Zhong, Stefano Lonardi, Tao Jiang: “Assignment of Orthologous Genes via Genome Rearrangement.” IEEE/ACM Trans. Comput. Biology Bioinform. 2(4): 302-315 (2005)
N. El-Mabrouk: “Reconstructing an ancestral genome using minimum segments duplications and reversals.” J. Comput. Syst. Sci. 65(3): 442-464 (2002)
N. El-Mabrouk, David Bryant, David Sankoff: “Reconstructing the pre-doubling genome.” RECOMB 1999: 154-163
M. Alekseyev & P. Pevzner: “Colored de Bruijn Graphs and the Genome Halving Problem.” IEEE/ACM Trans. Comput. Biology Bioinform. 4(1): 98-107 (2007)
Bender, et al.: “Improved bounds on sorting by length-weighted reversals.” J. of Computer and System Sciences 74: 744–774 (2008)
York, et al.: “Bayesian Estimation of the Number of Inversions in the History of Two Chromosomes.” J. of Computational Biol. (2002)