Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution.

1 Woods Hole, Massachusetts July 25, 2005, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution

2 Multiple Sequence Alignment & Analysis thru GCG’s SeqLab. More data yields stronger analyses, if done carefully! Mosaic ideas and evolutionary ‘importance.’ Steven M. Thompson, Florida State University School of Computational Science (SCS)

3 But first a prelude: My definitions Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, and organisms, all the way to populations. Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases. Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules. Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes. Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

4 And a ‘way’ to think about it: the reverse biochemistry analogy. Work from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round. Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques. The computer and molecular databases are an essential part of this process.

5 The exponential growth of molecular sequence databases & cpu power. Doubling time ~ 1 year! Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
    Year  BasePairs    Sequences
    1982  680338       606
    1983  2274029      2427
    1984  3368765      4175
    1985  5204420      5700
    1986  9615371      9978
    1987  15514776     14584
    1988  23800000     20579
    1989  34762585     28791
    1990  49179285     39533
    1991  71947426     55627
    1992  101008486    78608
    1993  157152442    143492
    1994  217102462    215273
    1995  384939485    555694
    1996  651972984    1021211
    1997  1160300687   1765847
    1998  2008761784   2837897
    1999  3841163011   4864570
    2000  11101066288  10106023
    2001  15849921438  14976310
    2002  28507990166  22318883
    2003  36553368485  30968418
    2004  44575745176  40604319
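
As a rough check on the doubling-time remark, the sketch below estimates an average doubling time for GenBank base pairs from the 1982 and 2004 rows of the slide's table. It assumes steady exponential growth over the whole period, which is a simplification, and the function name is purely illustrative.

```python
import math

def doubling_time(start_value, end_value, years):
    """Average doubling time, assuming steady exponential growth."""
    doublings = math.log2(end_value / start_value)
    return years / doublings

# Base-pair totals for 1982 and 2004, as listed on the slide.
bp_1982 = 680338
bp_2004 = 44575745176
estimate = doubling_time(bp_1982, bp_2004, 2004 - 1982)
print(f"average doubling time: {estimate:.2f} years")
```

Averaged over all 22 years the estimate comes out somewhat above one year, the same order as the slide's rough figure; individual years grow faster or slower than the average.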

6 Back to multiple sequence alignment: applicability? So what; why even bother? Applications: probe/primer and motif/profile design; graphical illustrations; comparative ‘homology’ inference; molecular evolutionary analysis. OK, well, how do you do it?

7 Dynamic programming’s complexity increases exponentially with the number of sequences being compared, because the comparison needs an N-dimensional matrix: complexity = [sequence length]^(number of sequences)
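
To make the exponent concrete, this small sketch counts the cells of a full N-dimensional dynamic programming matrix, assuming every sequence has the same length. The function name and the length of 300 are illustrative choices, not values from the slides.

```python
def dp_cells(sequence_length, number_of_sequences):
    """Cells in a full N-dimensional dynamic programming matrix."""
    return sequence_length ** number_of_sequences

# Watch the matrix size explode as sequences are added.
for n in (2, 3, 5, 10):
    print(n, "sequences of length 300:", dp_cells(300, n), "cells")
```

Two sequences need 90,000 cells; every additional sequence multiplies that by another factor of 300, which is why exact multiple alignment quickly becomes impractical.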

8 ‘Global’ heuristic solutions. See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, http://searchlauncher.bcm.tmc.edu/, but note the severely limiting restrictions!

9 Multiple Sequence Dynamic Programming. Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
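
The first two steps of that procedure (compare all pairs, then find each sequence's most similar partner) can be sketched as follows. This is not the code of PileUp or any other program named in the talk: it is a minimal illustration using a plain global alignment score, and the match, mismatch, and gap values, the function names, and the toy sequences are all assumptions.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch style) alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def most_similar_partner(seqs):
    """Map each sequence name to the name of its best-scoring partner."""
    scores = {
        frozenset((x, y)): nw_score(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }
    return {
        name: max(
            (other for other in seqs if other != name),
            key=lambda other: scores[frozenset((name, other))],
        )
        for name in seqs
    }

# Toy data: "a" and "b" differ at one position, "c" is unrelated.
toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
partners = most_similar_partner(toy)
print(partners)
```

Only scores are computed here; a real progressive aligner would also keep the tracebacks and then merge the resulting groups into the final alignment.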

10 Reliability and the Comparative Approach — explicit homologous correspondence; manual adjustments based on knowledge, especially structural, regulatory, and functional sites. Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp

11 Structural & Functional correspondence in the Wisconsin Package’s SeqLab

12 Work with proteins! If at all possible — Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
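
One way to picture "translate, then align": once the protein alignment exists, its gaps can be threaded back onto the coding sequence three bases at a time, so no indel ever lands inside a codon. The sketch below assumes an in-frame coding sequence with one codon per aligned residue and no stop codon; the function name is hypothetical and not part of GCG.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Copy the gaps of an aligned protein onto its coding sequence, codon by codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if residues != len(codons) or len(cds) % 3 != 0:
        raise ValueError("coding sequence does not match the protein length")
    out = []
    next_codon = 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)  # a protein gap becomes a whole-codon gap
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)

print(thread_codons("M-K", "ATGAAA"))
```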

13 Beware of aligning apples and oranges [and grapefruit]! Paralogous versus orthologous; genomic versus cDNA; mature versus precursor.

14 Mask out uncertain areas —

15 Complications: Order dependence. Not that big of a deal. Substitution matrices and gap penalties. A very big deal! Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

16 Complications cont.: Format hassles! Specialized format conversion tools such as GCG’s From… and To… programs and PAUPSearch. Don Gilbert’s public domain ReadSeq program.

17 Still more complications: indel and missing data symbol (i.e. gap) designation discrepancy headaches: ".", "-", "~", "?", "N", or "X". Help!
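
A small normalizer can take some of the pain out of these discrepancies. The sketch below rewrites "." and "~" as "-" and leaves "?", "N", and "X" alone, on the assumption that the first two mark alignment gaps while the others mark missing or ambiguous data. Which symbols count as gaps depends on the program that wrote the file, so treat the `GAP_LIKE` set as something to adjust.

```python
GAP_LIKE = ".~"  # assumption: these characters denote alignment gaps in the input

def normalize_gaps(aligned, target="-"):
    """Rewrite every gap-like character as a single target gap symbol."""
    table = str.maketrans({ch: target for ch in GAP_LIKE})
    return aligned.translate(table)

print(normalize_gaps("AC..GT~~N?"))
```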

18 Web resources for pairwise, progressive multiple alignment:
    http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
    http://pbil.univ-lyon1.fr/alignment.html
    http://www.ebi.ac.uk/clustalw/
    http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

19 If large datasets become intractable for analysis on the Web, what other resources are available? Desktop software solutions — public domain programs are available, but... complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but... license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

20 Therefore, UNIX server-based solutions. Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so commercial products, e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] and the SeqLab Graphical User Interface, simplify matters for administrators and users. One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere! Operating system: UNIX command line operation hassles; communications software: telnet, ssh, and terminal emulation; X graphics; file transfer: ftp and scp/sftp; and editors: vi, emacs, pico (or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

21 The Genetics Computer Group — The Accelrys Wisconsin Package for Sequence Analysis Begun in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia Inc. U.S.A., Accelrys Division, under the brand new name, as of May 2005, Discovery Studio GCG. The suite contains almost 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to sophisticated results. Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and, the output from many programs can be used as input for other programs. Used all over the world by more than 30,000 scientists at over 950 institutions in more than 35 countries, so learning it here will likely be useful at any other research institution that you may end up at.

22 Specifying sequences, GCG style, in order of increasing power and complexity. To answer the always perplexing GCG question, “What sequence(s)?....”:
The sequence is in a local GCG format single sequence file in your UNIX account (GCG Reformat and all From… and To… programs).
The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:”, always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{}”, containing the sequence specification, e.g. a wildcard: { * }.
Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@”. Furthermore, one can supply attribute information within list files to specify something special about the sequence.
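
The four kinds of specification can be told apart by their punctuation alone. The sketch below is a hypothetical helper, not part of the Wisconsin Package, that classifies a specification string by the conventions described on this slide: a leading "@", trailing braces, or a colon.

```python
def classify_spec(spec):
    """Guess which kind of GCG sequence specification a string is."""
    spec = spec.strip()
    if spec.startswith("@"):
        return "list file"
    if spec.endswith("}") and "{" in spec:
        return "multiple sequence file"
    if ":" in spec:
        return "database entry"
    return "single sequence file"

for example in ("example.seq", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"):
    print(example, "->", classify_spec(example))
```

The order of the checks matters: the "@" and brace tests run before the colon test, so a list file or multiple sequence file is recognized first.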

23 ‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs):
    This is a small example of GCG single sequence format.
    Always put some documentation on top, so in the future
    you can figure out what it is you're dealing with! The
    line with the two periods is converted to the checksum line.
    example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..
           1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
          51  GATTTAATAG CATGCGATCC CATGGGA
SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
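
Reading such a file back is mostly a matter of finding the line that ends in two periods. The sketch below is a simplified reader, not GCG's own parser: it treats everything after the first line ending in ".." as sequence and keeps only letters, so position numbers and spaces are dropped. It would also drop gap or stop symbols, and it does not verify the checksum.

```python
def read_gcg_sequence(text):
    """Return the sequence letters that follow the '..' divider line."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    body = "".join(lines[index + 1:])
    # Keep letters only: position numbers and blanks are layout, not data.
    return "".join(ch for ch in body if ch.isalpha())

example = """This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099 ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""
sequence = read_gcg_sequence(example)
print(len(sequence), sequence[:10])
```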

24 Logical terms for the Wisconsin Package. These are easy: they make sense and you’ll have a vested interest.
Sequence databases, nucleic acids:
    GENBANKPLUS   all of GenBank plus EST and GSS subdivisions
    GBP           all of GenBank plus EST and GSS subdivisions
    GENBANK       all of GenBank except EST and GSS subdivisions
    GB            all of GenBank except EST and GSS subdivisions
    BA            GenBank bacterial subdivision
    BACTERIAL     GenBank bacterial subdivision
    EST           GenBank EST (Expressed Sequence Tags) subdivision
    GSS           GenBank GSS (Genome Survey Sequences) subdivision
    HTC           GenBank High Throughput cDNA
    HTG           GenBank High Throughput Genomic
    IN            GenBank invertebrate subdivision
    INVERTEBRATE  GenBank invertebrate subdivision
    OM            GenBank other mammalian subdivision
    OTHERMAMM     GenBank other mammalian subdivision
    OV            GenBank other vertebrate subdivision
    OTHERVERT     GenBank other vertebrate subdivision
    PAT           GenBank patent subdivision
    PATENT        GenBank patent subdivision
    PH            GenBank phage subdivision
    PHAGE         GenBank phage subdivision
    PL            GenBank plant subdivision
    PLANT         GenBank plant subdivision
    PR            GenBank primate subdivision
    PRIMATE       GenBank primate subdivision
    RO            GenBank rodent subdivision
    RODENT        GenBank rodent subdivision
    STS           GenBank (sequence tagged sites) subdivision
    SY            GenBank synthetic subdivision
    SYNTHETIC     GenBank synthetic subdivision
    TAGS          GenBank EST and GSS subdivisions
    UN            GenBank unannotated subdivision
    UNANNOTATED   GenBank unannotated subdivision
    VI            GenBank viral subdivision
    VIRAL         GenBank viral subdivision
Sequence databases, amino acids:
    GENPEPT        GenBank CDS translations
    GP             GenBank CDS translations
    SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
    SWP            all of Swiss-Prot and all of SPTrEMBL
    SWISSPROT      all of Swiss-Prot (fully annotated)
    SW             all of Swiss-Prot (fully annotated)
    SPTREMBL       Swiss-Prot preliminary EMBL translations
    SPT            Swiss-Prot preliminary EMBL translations
    P              all of PIR Protein
    PIR            all of PIR Protein
    PROTEIN        PIR fully annotated subdivision
    PIR1           PIR fully annotated subdivision
    PIR2           PIR preliminary subdivision
    PIR3           PIR unverified subdivision
    PIR4           PIR unencoded subdivision
    NRL_3D         PDB 3D protein sequences
    NRL            PDB 3D protein sequences
General data files:
    GENMOREDATA  path to GCG optional data files
    GENRUNDATA   path to GCG default data files

25 GCG MSF & RSF format. The trick is to not forget the braces and ‘wild card,’ e.g. filename{ * }, when specifying!
RSF (this is SeqLab’s native format):
    !!RICH_SEQUENCE 1.0
    ..
    {
    name  ef1a_giala
    descrip  PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
    type  PROTEIN
    longname  /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
    sequence-ID  Q08046
    checksum  7342
    offset  23
    creation-date  07/11/2001 16:51:19
    strand  1
    comments  ////////////////////////////////////////////////////////////
MSF:
    !!AA_MULTIPLE_ALIGNMENT 1.0
    small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..
    Name: a49171  Len: 425  Check: 537   Weight: 1.00
    Name: e70827  Len: 577  Check: 21    Weight: 1.00
    Name: g83052  Len: 718  Check: 9535  Weight: 1.00
    Name: f70556  Len: 534  Check: 3494  Weight: 1.00
    Name: t17237  Len: 229  Check: 9552  Weight: 1.00
    Name: s65758  Len: 735  Check: 111   Weight: 1.00
    Name: a46241  Len: 274  Check: 3514  Weight: 1.00
    //
    //////////////////////////////////////////////////
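
The "Name:" lines of an MSF header are regular enough to pull apart with one pattern. The sketch below is an illustrative reader for just those lines, not a full MSF parser; the regular expression and the dictionary keys are assumptions based on the header shown on this slide.

```python
import re

# One "Name:" line per sequence: name, length, checksum, weight.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)

def msf_entries(header_text):
    """List the sequences declared in an MSF header."""
    return [
        {
            "name": m.group(1),
            "len": int(m.group(2)),
            "check": int(m.group(3)),
            "weight": float(m.group(4)),
        }
        for m in NAME_LINE.finditer(header_text)
    ]

header = """!!AA_MULTIPLE_ALIGNMENT 1.0
 small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..
 Name: a49171  Len: 425  Check: 537  Weight: 1.00
 Name: e70827  Len: 577  Check: 21  Weight: 1.00
//
"""
entries = msf_entries(header)
print([entry["name"] for entry in entries])
```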

26 The List File Format. The ‘way’ SeqLab works! Remember the @ sign!
    An example GCG list file of many elongation 1a and Tu factors
    follows. As with all GCG data files, two periods separate
    documentation from data.
    ..
    my-special.pep  begin:24  end:134
    SwissProt:EfTu_Ecoli
    Ef1a-Tu.msf{*}
    /usr/accounts/test/another.rsf{ef1a_*}
    @another.list
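
A list file can be read with the same two-period convention. The sketch below is a hypothetical reader, not GCG code: it skips the documentation above the ".." line, then splits each data line into a sequence specification and any "key:value" attributes such as begin and end. Treating a file without a divider as all data is an assumption of this sketch.

```python
def read_list_file(text):
    """Return (specification, attributes) pairs from GCG list file text."""
    lines = text.splitlines()
    start = 0  # assumption: with no '..' divider, every line is data
    for index, line in enumerate(lines):
        if line.strip().endswith(".."):
            start = index + 1
            break
    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue
        spec, attrs = fields[0], {}
        for field in fields[1:]:
            key, _, value = field.partition(":")
            attrs[key] = value
        entries.append((spec, attrs))
    return entries

example = """An example GCG list file.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""
listed = read_list_file(example)
print(listed[0])
```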

27 SeqLab, GCG’s X-based GUI! SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface: GDE + WPI = SeqLab. It requires an X-Windowing environment, either native on UNIX computers (including LINUX; not installed by default on Mac OS X [v.10+] systems, but see Apple’s free X11 package or XDarwin) or emulated with X-Server software on personal computers.

28 FOR MORE INFO... Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html. Contact me (stevet@bio.fsu.edu) for specific long-distance bioinformatics assistance and collaboration. Conclusions: Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” He continues: “... if any lesson is to be drawn... it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician.... We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

29 AND FOR EVEN MORE INFO... Many texts are now available in the field. To ‘honk-my-own-horn’ a bit, check out: Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html); Humana Press’ Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0); and Horizon Scientific Press’ Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html). They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.

30 References
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, California, U.S.A., pp. 28–36.
Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013–2018.
Eddy, S.R. (1996) Hidden Markov models. Current Opinion in Structural Biology 6, 361–365.
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755–763.
Felsenstein, J. (1993–2005) PHYLIP (Phylogeny Inference Package). Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.
Feng, D.F. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 25, 351–360.
Genetics Computer Group (Copyright 1982–2005) Program Manual for the Wisconsin Package, Version 10.3, Accelrys, subsidiary of Pharmacopeia Inc.
Gilbert, D.G. (1993 [C release] and 1999 [Java release]) ReadSeq, public domain software distributed by the author. http://iubio.bio.indiana.edu/soft/molbio/readseq/ Bioinformatics Group, Biology Department, Indiana University, Bloomington, Indiana, U.S.A.
Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, 4355–4358.
Gupta, S.K., Kececioglu, J.D. and Schaffer, A.A. (1995) Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. Journal of Computational Biology 2, 459–472.
Smith, R.F. and Smith, T.F. (1992) Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for comparative protein modelling. Protein Engineering 5, 35–41.
Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989–1993) Illinois Natural History Survey, (1994) personal copyright, (1995–2000) Smithsonian Institution, Washington D.C., U.S.A., and (2001–2005) Florida State University, School of Computational Science, Tallahassee, Florida, U.S.A.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. (1997) The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25, 4876–4882.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673–4680.

31 On to a demonstration of some of SeqLab’s multiple sequence dataset capabilities: Glutathione Reductase, G-protein coupled TM7 receptors, primate prions, Human Papilloma Virus L1 major coat protein, Major Histocompatibility Class II, Vicilin seed storage proteins, and Elongation Factor 1α/Tu.

