Presentation transcript: "String Metrics in Classification of Mobile Genetic Elements" (Discrete Mathematical Biology, Math 8803)
String Metrics in Classification of Mobile Genetic Elements
  Discrete Mathematical Biology, Math 8803
  Ryan Wagner, Biology/Bioinformatics PhD student
  Sources: www.yale.edu/turner/projects/ecoli.htm, www.geneticengineering.org/evolution, http://pdb.lbl.gov/microscopies
String Metrics in Classification of Mobile Genetic Elements
  1. Mathematical relevance
  2. Biological relevance
  3. Test and review of four distance methods
  4. What was good, bad, and ugly
Introduction: distances on strings
  Formal definition of a distance function, D:
    $D(a, b) = 0 \iff a = b$, the identity axiom
    $D(a, b) = D(b, a)$, the symmetry axiom
    $D(a, b) + D(b, c) \ge D(a, c)$, the triangle inequality
  Tree additivity: a matrix of pairwise distances is additive when a tree can be built from it in which the distance between any two leaves (sequences) equals the sum of the lengths of the edges connecting them (Baake and Haeseler, 1997).
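The two properties on this slide can be checked numerically on a small distance matrix. The sketch below is illustrative only: the function names and both 4x4 matrices are hypothetical and not taken from the presentation. Additivity is tested with the standard four-point condition (for every four leaves, the two largest of the three pairwise sums are equal).

```python
from itertools import combinations, permutations

def is_metric(d, tol=1e-9):
    """Check identity, symmetry, and the triangle inequality on a square matrix."""
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:
            return False                      # D(a, a) must be 0
        for j in range(n):
            if i != j and d[i][j] <= tol:
                return False                  # distinct items need D > 0
            if abs(d[i][j] - d[j][i]) > tol:
                return False                  # symmetry
    for i, j, k in permutations(range(n), 3):
        if d[i][j] + d[j][k] < d[i][k] - tol:
            return False                      # triangle inequality
    return True

def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest pairwise sums must coincide."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Hypothetical tree: leaf edges 1, 2, 3, 4 and one internal edge of length 1.
tree_like = [[0, 3, 5, 6],
             [3, 0, 6, 7],
             [5, 6, 0, 7],
             [6, 7, 7, 0]]

# Hypothetical shortest-path distances on a 4-cycle: a metric, but not additive.
cycle_like = [[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]]

print(is_metric(tree_like), is_additive(tree_like))
print(is_metric(cycle_like), is_additive(cycle_like))
```

The second matrix shows that satisfying the three metric axioms does not by itself guarantee that the four-point condition holds.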
Introduction: mathematical distance vs. evolutionary distance
  Mathematical distance:
    the three metric properties form the basis for its characterization
    may also be characterized by Turing machine computability (Ahlbrandt et al., 2004)
    amenable to both alignment-based and alignment-free methods
  Evolutionary distance:
    when obtained by common statistical correction techniques, fails to satisfy the triangle inequality
    the tree additivity property may hold where the triangle inequality fails
    not developed for alignment-free distances on DNA strings
Introduction: mobile genetic elements
  Plasmid: an autonomous, self-replicating circular piece of DNA found outside the chromosome in bacteria (www8.nos.noaa.gov/coris_glossary).
  Bacteriophage: a virus that attacks and infects bacterial cells (www.ncbi.nlm.nih.gov/ICTBdb/ICTVdB, www.microbe-edu.org/etudiant).
  Transposon: a DNA sequence capable of moving to new locations within the same cell.
Methods: data collection and software
  DNA sequences for replication initiation (RepA) and division partition (ParA) in both plasmids and host chromosomes were obtained from NCBI, www.ncbi.nlm.nih.gov/genomes/lproks.cgi
  DNA sequences of selected plasmids from gram-negative bacteria were also obtained from NCBI.
  Neighbor-joining trees were constructed for each set of pairwise distances using PHYLIP, http://evolution.genetics.washington.edu/phylip.html
  A custom Perl script was used to generate matrices of pairwise distances:
    4
    G_lovleyi 0.000000 0.864000 0.887000 0.844000
    Acidovoro 0.864000 0.000000 0.664000 0.724000
    Acid_JS42 0.887000 0.664000 0.000000 0.836000
    Xanth_axo 0.844000 0.724000 0.836000 0.000000
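The matrix on this slide has the layout of a PHYLIP square distance file: a taxon count, then one row per taxon with the name padded to 10 characters. The author's script was written in Perl and is not shown; the Python sketch below only illustrates writing and re-reading that layout, and the function names are hypothetical.

```python
def to_phylip_square(names, matrix):
    """Format a square distance matrix in PHYLIP layout (10-character names)."""
    lines = [str(len(names))]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)           # truncate or pad the name to 10 columns
        lines.append(label + " ".join(f"{value:.6f}" for value in row))
    return "\n".join(lines)

def from_phylip_square(text):
    """Parse the layout written by to_phylip_square back into names and rows."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    names, matrix = [], []
    for line in lines[1:count + 1]:
        names.append(line[:10].strip())
        matrix.append([float(field) for field in line[10:].split()])
    return names, matrix

names = ["G_lovleyi", "Acidovoro", "Acid_JS42", "Xanth_axo"]
matrix = [[0.000, 0.864, 0.887, 0.844],
          [0.864, 0.000, 0.664, 0.724],
          [0.887, 0.664, 0.000, 0.836],
          [0.844, 0.724, 0.836, 0.000]]

phylip_text = to_phylip_square(names, matrix)
print(phylip_text)
```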
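Methods: edit distance
  Here is where horizontal gene transfer begins to cause problems.
  Data structure in the custom script for the test input ATTGCGAGC (rows) and ATGCGACC (columns):
          A  T  G  C  G  A  C  C
       0  1  2  3  4  5  6  7  8
    A  1  0  1  2  3  4  5  6  7
    T  2  1  0  1  2  3  4  5  6
    T  3  2  1  2  3  4  5  6  7
    G  4  3  2  1  2  3  4  5  6
    C  5  4  3  2  1  2  3  4  5
    G  6  5  4  3  2  1  2  3  4
    A  7  6  5  4  3  2  1  2  3
    G  8  7  6  5  4  3  2  3  4
    C  9  8  7  6  5  4  3  2  3
  Edit distance = 3, read from the lower-right corner; no traceback is needed.
  Note: the table values correspond to scoring a substitution as a deletion plus an insertion (cost 2). With unit-cost substitutions, the Levenshtein distance for this pair is 2 (delete one T, substitute G with C).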
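The table-filling procedure can be sketched in a few lines. This is a minimal illustration, not the author's Perl script; the `sub_cost` parameter is an assumption introduced here to show how the substitution cost changes the corner value. On the slide's test input, a substitution cost of 2 gives 3 and a unit substitution cost gives 2.

```python
def edit_distance(a, b, sub_cost=1):
    """Dynamic-programming edit distance; only the final cell is returned."""
    prev = list(range(len(b) + 1))            # row for the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]                            # cost of deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else sub_cost
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

print(edit_distance("ATTGCGAGC", "ATGCGACC", sub_cost=2))  # 3
print(edit_distance("ATTGCGAGC", "ATGCGACC", sub_cost=1))  # 2
```

Keeping only two rows at a time is enough here because no traceback is required.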
Methods: the problem with edit distance
  Consider GTGACGTACTATTGC and GTACTATTGCGTGAC: a 5-character block delete/insert, edit distance = 8.
  Consider GTGACGTACTATTGC and GTGAGTACTATTGCC: a 1-character delete/insert, edit distance = 2.
  Allowing block deletions, block insertions, and block reversals gives a better approximation to the recombinant nature of DNA evolution (Long-Hui, 2004). However, the least-constrained application of block edit distance has $O(n^3)$ time complexity, and constrained block edit distance computation is NP-hard (Lopresti and Tomkins, 1997).
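Both pairs on this slide differ by a single block move, even though their character-level edit distances differ widely. The brute-force search below makes that concrete. It is only an illustration with hypothetical function names; it is not one of the block edit distance algorithms cited on the slide.

```python
def apply_block_move(s, start, length, dest):
    """Cut s[start:start+length] and reinsert it at index dest of the remainder."""
    block = s[start:start + length]
    rest = s[:start] + s[start + length:]
    return rest[:dest] + block + rest[dest:]

def find_single_block_move(a, b):
    """Return (start, length, dest) of one block move turning a into b, else None.

    Brute force over every block and destination, shortest blocks first.
    """
    if len(a) != len(b):
        return None
    n = len(a)
    for length in range(1, n):
        for start in range(n - length + 1):
            for dest in range(n - length + 1):
                if apply_block_move(a, start, length, dest) == b:
                    return start, length, dest
    return None

a = "GTGACGTACTATTGC"
move_5 = find_single_block_move(a, "GTACTATTGCGTGAC")   # the 5-character case
move_1 = find_single_block_move(a, "GTGAGTACTATTGCC")   # the 1-character case
print(move_5, move_1)
```

A distance that counts block operations would score both pairs as one event, which is the motivation for the block edit models mentioned above.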
Methods: Euclidean distance over dinucleotide counts
  A new paradigm: complexity-based distance metrics that employ neither alignments nor dynamic programming.
  a = GTGACGTACTATTGC
  b = GTACTATTGCGTGAC
  Computation of count vectors for a and b:
    dinucleotide   a   b   L2
    TC + GA        1   1   0
    TG + CA        2   2   0
    CT + AG        1   1   0
    AC + GT        4   4   0
    TT + AA        1   1   0
    CC + GG        0   0   0
    CG             1   1   0
    AT             1   1   0
    GC             1   1   0
    TA             2   2   0
  $L_2 = \frac{1}{16} \sum_{ij} |a^*_{ij} - b^*_{ij}|$, where $a^*_{ij} = \mathrm{freq}(ij) / (\mathrm{freq}(i)\,\mathrm{freq}(j))$; here $L_2 = 0$.
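The count table can be reproduced with a short sketch. It pools each dinucleotide with its reverse complement, as the row labels on the slide suggest, and takes the Euclidean distance between the raw pooled count vectors. It does not implement the normalized a*_ij quantities, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def pooled_dinucleotide_counts(seq):
    """Count overlapping dinucleotides, pooling each with its reverse complement."""
    counts = Counter()
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        # Use the alphabetically smaller of the pair as the class label,
        # e.g. "TC" and "GA" are both stored under "GA".
        counts[min(dinucleotide, reverse_complement(dinucleotide))] += 1
    return counts

def euclidean_distance(counts_a, counts_b):
    keys = set(counts_a) | set(counts_b)
    return sqrt(sum((counts_a[key] - counts_b[key]) ** 2 for key in keys))

counts_a = pooled_dinucleotide_counts("GTGACGTACTATTGC")
counts_b = pooled_dinucleotide_counts("GTACTATTGCGTGAC")
print(dict(counts_a))
print(euclidean_distance(counts_a, counts_b))  # 0.0
```

Because b is a rotation of a that preserves every dinucleotide count, the two vectors coincide and the distance is zero, whereas the edit distance for the same pair is 8.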
Methods: compression distance by the Burrows-Wheeler transform (scheme from Mantaci et al., 2008)
  Rotations of a (blue list) and b (red list):
    a0 = GTGACGTACTATTGC     b0 = GTACTATTGCGTGAC
    a1 = TGACGTACTATTGCG     b1 = TACTATTGCGTGACG
    a2 = GACGTACTATTGCGT     b2 = ACTATTGCGTGACGT
    a3 = ACGTACTATTGCGTG     b3 = CTATTGCGTGACGTA
    ...
    a14 = CGTGACGTACTATTG    b14 = CGTACTATTGCGTGA
  The two lists are then merged.
Merged list is then sorted (each rotation is followed by its last character):
    ACGTACTATTGCGTG  G
    ACTATTGCGTGACGT  T
    ATTGCGTGACGTACT  T
    CGTACTATTGCGTGA  A
    CGTGACGTACTATTG  G
    CTATTGCGTGACGTA  A
    GACGTACTATTGCGT  T
    GCGTGACGTACTATT  T
    GTACTATTGCGTGAC  C
    GTGACGTACTATTGC  C
    TACTATTGCGTGACG  G
    TATTGCGTGACGTAC  C
    TGACGTACTATTGCG  G
    TGCGTGACGTACTAT  T
    TTGCGTGACGTACTA  A
  Color sequence (B = blue, R = red): BRBRBRBRBRBRRBBRRBRBBRRBBRRBBRBRBRBRBRBRRBBRRBRBBRRBBRRB
  The column of last characters is the Burrows-Wheeler transform. Note the runs of nucleotides.
  Sequence color is then correlated to the Burrows-Wheeler column. If the color counts in each segment of runs are equal, the sum is 0; otherwise, sum the total of unequal colors.
  Distance = 2
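The transform itself can be sketched directly from the rotation table: build every rotation, sort them, and read off the last column. This covers only the Burrows-Wheeler step for a single sequence, not the color-based distance of Mantaci et al.; the function names are hypothetical.

```python
def rotations(seq):
    """All cyclic rotations of seq, starting with seq itself."""
    return [seq[i:] + seq[:i] for i in range(len(seq))]

def burrows_wheeler_transform(seq):
    """Last column of the lexicographically sorted rotation table."""
    return "".join(rotation[-1] for rotation in sorted(rotations(seq)))

a = "GTGACGTACTATTGC"
b = "GTACTATTGCGTGAC"
print(sorted(rotations(a))[0])          # ACGTACTATTGCGTG
print(burrows_wheeler_transform(a))     # GTTAGATTCCGCGTA
print(burrows_wheeler_transform(b))     # GTTAGATTCCGCGTA
```

Since b is a rotation of a, both sequences share the same rotation table and therefore the same transform.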
Methods: rank distance
  Related to Hamming distance, but less sensitive to insertions/deletions (from Dinu and Sgarro, 2006).
  a = GTGACGTACTATTGC
  b = GTACTATTGCGTGAC
  Index each base and correlate its position in one sequence with its position in the other: e.g., take the first occurrence of G in a and in b and compute the difference in their positions, then take the second occurrence of G in a and in b and compute the difference in their positions, and so on.
Methods: rank distance
  a = GTGACGTACTATTGC
  b = GTACTATTGCGTGAC
  Position differences for the first three occurrences of G: 0, 6, 5.
  Sum the rank counts for G, then repeat the procedure for T, A, and C.
  Sum the rank counts for all four bases and normalize by the arithmetic mean of the sequence lengths.
  Distance = 0.01667, cf. normed edit distance = 0.5333
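The per-base bookkeeping can be sketched as follows. This is a simplified illustration with hypothetical function names: it pairs the k-th occurrence of each base in a with the k-th occurrence in b and sums the absolute position differences. It assumes both sequences have the same base composition (extra unmatched occurrences are ignored), and it stops at the unnormalized sums, since the script that produced the reported normalized value is not shown.

```python
from collections import defaultdict

def occurrence_positions(seq):
    """Map each base to the 1-based positions at which it occurs."""
    positions = defaultdict(list)
    for index, base in enumerate(seq, start=1):
        positions[base].append(index)
    return positions

def position_difference_sums(a, b):
    """Per-base sum of |position in a - position in b| over matched occurrences."""
    positions_a = occurrence_positions(a)
    positions_b = occurrence_positions(b)
    sums = {}
    for base in sorted(set(positions_a) | set(positions_b)):
        # zip pairs the k-th occurrence in a with the k-th occurrence in b.
        pairs = zip(positions_a[base], positions_b[base])
        sums[base] = sum(abs(i - j) for i, j in pairs)
    return sums

sums = position_difference_sums("GTGACGTACTATTGC", "GTACTATTGCGTGAC")
print(sums)                 # {'A': 6, 'C': 2, 'G': 12, 'T': 10}
print(sum(sums.values()))   # 30
```

For G, the individual differences are 0, 6, 5, and 1, the first three of which appear on the slide.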
Results of attempt to cluster by mobile element type
  NJ tree based on a multiple sequence alignment, the customary bioinformatics approach. Sequences from different taxonomic groups pair closely, which is diagnostic of mobile genetic elements.
Results of attempt to cluster by mobile element type Edit distance tree gives same topology
Results of attempt to cluster by mobile element type
  Euclidean distance over dinucleotide counts and rank distance successfully group two plasmids.
Results of attempt to cluster by mobile element type Burrows-Wheeler compression pairwise distances do not give a clear clustering.
Why did the BWT distances not perform well?
  Insurmountable problem: the BWT distance script, as given, could not compute distances on whole plasmids.
  Diagnosis: the time complexity of the BWT is $O(n \log n)$, but the space complexity is $O(n^2)$.
  RepA-ParA sequence data were too short for useful shared repeat regions to appear.
  Remedy: run complete plasmid sequences through the BWT distance script.
  Mantaci et al. (2008) also found that their BWT distance does not satisfy the triangle inequality.
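The quadratic space cost comes from storing all n rotations of a length-n sequence. One possible workaround, sketched below under the assumption that this is indeed the bottleneck, is to sort rotation start indices and compare rotations character by character, so that only O(n) extra memory is held. This is not the author's script and is not tuned for speed; production tools typically build a suffix array instead. The function name is hypothetical.

```python
from functools import cmp_to_key

def bwt_by_index_sort(seq):
    """Burrows-Wheeler transform without materializing the rotation table."""
    n = len(seq)

    def compare_rotations(i, j):
        # Compare the rotations starting at i and j one character at a time.
        for offset in range(n):
            char_i = seq[(i + offset) % n]
            char_j = seq[(j + offset) % n]
            if char_i != char_j:
                return -1 if char_i < char_j else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare_rotations))
    # The last character of the rotation starting at i is seq[i - 1].
    return "".join(seq[(i - 1) % n] for i in order)

print(bwt_by_index_sort("GTGACGTACTATTGC"))  # GTTAGATTCCGCGTA
```

The output matches the last column of the sorted rotation table for the example sequence a.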
Can dinucleotide counts or rank distance be made to perform better in separating mobile elements?
  Li et al. (2004) used trinucleotide counts combined with higher-order nucleotide word counts to accurately infer an evolutionary tree of mammalian mitochondrial DNA.
  Such simple methods cannot hope to approximate Kolmogorov complexity distance. Recall that Kolmogorov complexity is related to the length of the Turing machine needed to transform sequence a into sequence b (Li et al., 2004).
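Kolmogorov complexity is not computable, so in practice it is commonly approximated with an off-the-shelf compressor, giving what is usually called the normalized compression distance. The sketch below uses Python's `zlib` as a stand-in compressor; the choice of compressor, the function names, and the test sequences are assumptions made here for illustration and were not part of the presentation.

```python
import random
import zlib

def compressed_size(text):
    """Length in bytes of the zlib-compressed text (a crude complexity proxy)."""
    return len(zlib.compress(text.encode("ascii"), 9))

def normalized_compression_distance(x, y):
    size_x = compressed_size(x)
    size_y = compressed_size(y)
    size_xy = compressed_size(x + y)
    return (size_xy - min(size_x, size_y)) / max(size_x, size_y)

# Hypothetical inputs: a highly repetitive sequence and a pseudo-random one.
repetitive = "ACGT" * 200
generator = random.Random(0)
scrambled = "".join(generator.choice("ACGT") for _ in range(800))

print(normalized_compression_distance(repetitive, repetitive))
print(normalized_compression_distance(repetitive, scrambled))
```

A sequence compared with itself scores close to 0, while unrelated sequences score close to 1. General-purpose compressors behave poorly on very short inputs, so values for sequences as short as the 15-base examples in these slides would not be meaningful.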
Open issues
  So far, only dinucleotide counts have been developed for clustering of mobile elements (Blaisdell et al., 1996).
  BWT distance and rank distance were developed to cluster mammalian mitochondrial DNA (Mantaci et al., 2008; Dinu and Sgarro, 2006).
  Rank distance has not been shown to satisfy the triangle inequality.
  Can it be proven whether or not a pairwise distance satisfying the triangle inequality yields an additive tree?
References
  Ahlbrandt, C., Benson, G., and Casey, W. (2004) Minimal entropy probability between genome families. Journal of Mathematical Biology. 48:563-590.
  Baake, E. (1998) What can and cannot be inferred from pairwise sequence comparisons? Mathematical Biosciences. 154:1-21.
  Blaisdell, B.E., Campbell, A.M., and Karlin, S. (1996) Similarities and dissimilarities of phage genomes. Proc. Natl. Acad. Sci. 93:5854-5859.
  Dinu, L.P. and Sgarro, A. (2006) A low-complexity distance for DNA strings. Fundamenta Informaticae. 76:361-372.
  Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P. (2004) The similarity metric. IEEE Transactions on Information Theory XX(Y).
  Long-Hui, W., Juan, L., Zhou, H-B., and Feng, Shi. (2004) A new distances metric and its application in phylogenetic tree construction. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
  Lopresti, D. and Tomkins, A. (1997) Block edit models for approximate string matching. Theoretical Computer Science. 181:159-179.
  Mantaci, S., Restivo, A., and Sciortino, M. (2008) Distance measure for biological sequences: Some recent approaches. International Journal of Approximate Reasoning. 47:109-124.