Presentation on theme: "The BioQUEST Curriculum Consortium at Clark Atlanta University Atlanta, Georgia Feb. 14-16, 2003 Evolutionary Bioinformatics Education: a National Science."— Presentation transcript:
The BioQUEST Curriculum Consortium at Clark Atlanta University Atlanta, Georgia Feb. 14-16, 2003 Evolutionary Bioinformatics Education: a National Science Foundation Chautauqua Course
More data yields stronger analyses, if done carefully! Mosaic ideas and evolutionary ‘importance.’ Multiple Sequence Alignment & Analysis. Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)
So what; why even bother? Applications: Probe, primer, and motif design; Graphical illustrations; Comparative ‘homology’ inference; Molecular evolutionary analysis. OK — well, how do you do it? Applicability?
Dynamic programming’s complexity increases exponentially with the number of sequences being compared (an N-dimensional matrix): $\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$
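To get a feel for how quickly that exponent bites, here is a minimal sketch that simply evaluates [sequence length] raised to the [number of sequences]. The function name `dp_cells` and the example length of 300 residues are illustrative assumptions, not part of the original slides.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Rough count of cells in an N-dimensional dynamic programming matrix.

    Assumes all sequences share the same length (a simplification).
    """
    return seq_len ** n_seqs


if __name__ == "__main__":
    # Hypothetical example: proteins of 300 residues each.
    for n in range(2, 7):
        print(f"{n} sequences -> {dp_cells(300, n):,} cells")
```

Each added sequence multiplies the work by the sequence length, which is why full multidimensional dynamic programming is only practical for a handful of short sequences.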
See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, but note the severely limiting restrictions! ‘Global’ heuristic solutions
Therefore: pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment. Multiple Sequence Dynamic Programming
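The first stage of that procedure (compare everything pairwise, then find each sequence's most similar partner) can be sketched as follows. This is only an illustration under stated assumptions: the scoring values (match +1, mismatch -1, linear gap -2) and the function names `nw_score` and `most_similar_partners` are hypothetical, and real programs such as ClustalW or PileUp use substitution matrices, affine gap penalties, and a full guide tree.

```python
from itertools import combinations


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def most_similar_partners(seqs: dict) -> dict:
    """Score every pair, then report each sequence's best-scoring partner."""
    scores = {}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        scores[(n1, n2)] = scores[(n2, n1)] = nw_score(s1, s2)
    return {
        name: max((other for other in seqs if other != name),
                  key=lambda other: scores[(name, other)])
        for name in seqs
    }


if __name__ == "__main__":
    toy = {"a": "GATTACA", "b": "GATTTACA", "c": "CCCGGG"}  # made-up sequences
    print(most_similar_partners(toy))
```

A progressive aligner would go on to align the closest pairs first and then merge the resulting groups; that merging step is deliberately left out here.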
Web resources for pairwise, progressive multiple alignment: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html. However, very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical once your dataset has reached a certain size. You’ll know it when you’re there!
Reliability and the Comparative Approach — explicit homologous correspondence; manual adjustments based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeqLab and the Ribosomal Database Project:
Structural & Functional correspondence in the Wisconsin Package’s SeqLab
Work with proteins! If at all possible — Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
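"Translate, then align" implies one more step when you need the nucleotide alignment back: thread the original codons onto the aligned protein so that every gap spans a whole codon. The sketch below assumes the standard genetic code, in-frame DNA, and a protein alignment produced elsewhere; the function names `translate` and `thread_codons` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(dna: str) -> list:
    """Cut in-frame DNA into complete codons, dropping any trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna: str) -> str:
    """Translate in-frame DNA; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(codon, "X") for codon in split_codons(dna))


def thread_codons(aligned_protein: str, dna: str, gap: str = "-") -> str:
    """Map codons back onto a gapped protein so indels never split a codon."""
    codons = split_codons(dna)
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues != len(codons):
        raise ValueError("aligned protein and DNA disagree in length")
    remaining = iter(codons)
    return "".join(gap * 3 if ch == gap else next(remaining) for ch in aligned_protein)


if __name__ == "__main__":
    dna = "ATGGCCAAA"                       # made-up example
    print(translate(dna))                  # protein to feed to the aligner
    print(thread_codons("M-AK", dna))      # codon-aware nucleotide alignment
```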
Beware of aligning apples and oranges [and grapefruit]! Paralogous versus orthologous; genomic versus cDNA; mature versus precursor.
Mask out uncertain areas —
Complications: Order dependence. Not that big of a deal. Substitution matrices and gap penalties. A very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).
Complications cont.: Format hassles! Specialized format conversion tools such as GCG’s From’ and To’ programs and PAUPSearch. Don Gilbert’s public domain ReadSeq program.
Still more complications: Discrepancies in how indels and missing data (i.e. gaps) are designated cause headaches: ., -, ~, ?, N, or X ... Help!
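One common workaround is to normalize gap symbols before passing an alignment from one program to another. The following is a minimal sketch under the assumption that `.` and `~` both mean "gap" in the source file and that the target program wants `-`. Which symbols really mean gap versus missing data depends on the programs involved, so check before converting; the name `normalize_gaps` is hypothetical.

```python
def normalize_gaps(seq: str, gap_symbols: str = ".~-", gap: str = "-") -> str:
    """Replace every symbol in gap_symbols with a single gap character.

    Missing-data or ambiguity symbols (?, N, X) are deliberately left alone,
    because their meaning differs between nucleotide and protein data.
    """
    table = str.maketrans({symbol: gap for symbol in gap_symbols})
    return seq.translate(table)


if __name__ == "__main__":
    print(normalize_gaps("~~AC..GT-N?"))  # made-up aligned sequence
```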
The consensus and motifs: Conserved regions can be visualized with a sliding window approach and appear as peaks. [Figure label: P-Loop] Let’s concentrate on the first peak seen here to simplify matters.
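A sliding window view of conservation can be sketched like this. The per-column score used here (fraction of sequences sharing the most common residue) is a deliberately simple stand-in chosen for illustration, not the similarity measure behind the plot on the slide, and the function names are hypothetical.

```python
from collections import Counter


def column_conservation(alignment: list, gap: str = "-") -> list:
    """Fraction of sequences carrying the most common residue in each column."""
    scores = []
    for column in zip(*alignment):
        residues = [ch for ch in column if ch != gap]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def sliding_window(scores: list, width: int) -> list:
    """Average the per-column scores over a window; peaks mark conserved regions."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]


if __name__ == "__main__":
    toy = ["GKS", "GKT", "GKS"]  # made-up three-column alignment
    print(sliding_window(column_conservation(toy), width=2))
```

Wider windows smooth the curve and make broad conserved domains stand out; narrow windows preserve short motifs.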
The first GTP binding domain of EF-1α/Tu: A consensus isn’t necessarily the biologically “correct” combination. A simple consensus throws much information away! Therefore, motif definition.
The EF-1α/Tu P-Loop: Defined as (A,G)x4GK(S,T). A one-dimensional ‘regular-expression’ of a conserved site. Not necessarily biologically meaningful. Motifs are limited in their ability to discriminate a residue’s ‘importance.’
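Because the motif really is a regular expression, it can be tried directly with Python's `re` module. In this sketch (A,G)x4GK(S,T) is written as `[AG].{4}GK[ST]`; the test sequence is a short made-up string, and `find_p_loops` is a hypothetical helper name.

```python
import re

# (A,G)x4GK(S,T): A or G, any four residues, G, K, then S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein: str) -> list:
    """Return (start, matched_text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(protein.upper())]


if __name__ == "__main__":
    toy = "MTIGHVDHGKTTL"  # illustrative fragment, not a database entry
    print(find_p_loops(toy))
```

Note that the pattern answers only yes or no at each position: a conservative substitution and a wildly different residue are rejected equally, which is the limitation the slide points out.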
Conclusions: So how do we include ‘all’ the information of a multiple sequence alignment, or of a region within an alignment, in a description that doesn’t throw anything away? Enter, for remote homology searching, the ‘profile’: profile algorithms, incl. ‘traditional’ Gribskov profiles, Expectation Maximization (MEME’s), and Hidden Markov Models (HMMer’s). FOR MORE INFO... Explore my Web Home and contact me for specific long-distance bioinformatics assistance and collaboration.
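The core idea of a profile, a position-specific table of scores built from every column of the alignment, can be sketched in a few lines. This is a simplified log-odds matrix with a flat pseudocount and a uniform background, assumed here purely for illustration; it is not the Gribskov method, MEME, or HMMER, all of which use more sophisticated weighting and gap handling. The alignment and function names are made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids


def build_profile(alignment: list, pseudocount: float = 1.0) -> list:
    """One dict of log2-odds scores per alignment column (uniform background)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(ch for ch in column if ch in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score_segment(profile: list, segment: str) -> float:
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


if __name__ == "__main__":
    toy = ["GHVDHGKT", "GHIDHGKS", "AHVDHGKT"]  # made-up aligned motif region
    prof = build_profile(toy)
    print(score_segment(prof, "GHVDHGKT"), score_segment(prof, "WWWWWWWW"))
```

Unlike the regular expression, every residue at every position contributes a graded score, so nothing in the alignment is thrown away outright.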
Many fine texts are starting to become available in the field. To ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc. They asked me to contribute a chapter on multiple sequence analysis using GCG software. Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics: A Theoretical And Practical Approach: http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID= X&isVariant=0. Both volumes are now available. AND FOR EVEN MORE INFO...
References:
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A., pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (Copyright ) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.
Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84.
Gupta, S.K., Kececioglu, J.D., and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) ( ) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22.