Presentation is loading. Please wait.

Presentation is loading. Please wait.

A BioInformatics Survey... just a taste, with an emphasis on the GCG suite. Steven M. Thompson Steven M. Thompson Florida State University School of Computational.

Similar presentations


Presentation on theme: "A BioInformatics Survey... just a taste, with an emphasis on the GCG suite. Steven M. Thompson Steven M. Thompson Florida State University School of Computational."— Presentation transcript:

1 A BioInformatics Survey... just a taste, with an emphasis on the GCG suite. Steven M. Thompson Steven M. Thompson Florida State University School of Computational Science and Information Technology (CSIT) CSIT

2 the Woods Hole Oceanographic Institution Woods Hole, MA Feb. 21 – 22, 2003 A GCG ¥ SeqLab Introduction for:

3 Introductory Overview: What is bioinformatics, genomics, sequence analysis, computational molecular biology... The Reverse Biochemistry Analogy. Using sequence analysis tools, one can infer all sorts of functional, evolutionary, and, perhaps, structural insight into a gene, without the need to isolate and purify massive amounts of protein! The computer is an essential part of this entire process.

4 Definitions: Biocomputing and computational biology are fairly synonymous and both describe the use of computers and computational techniques to analyze biological systems. Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available biological databases. Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules. Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes. Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

5 The exponential growth of molecular sequence databases & cpu power. Year BasePairs Sequences Doubling time ~ 1 year!

6 Database Growth (cont.) The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of February 2003, 115 complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available. The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature. Celera Genomics ScienceNature Celera Genomics ScienceNature

7 Some neat stuff from the papers: We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA — Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000! The protein coding region of the genome is only about 1% or so, much of the remainder “junk” is “jumping,” “selfish DNA” of which much may be involved in regulation and control genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

8 What are these databases like? What are primary sequences? (Central Dogma: DNA —> RNA —> protein) Primary refers to one dimension — all of the “symbol” information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence, however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases.

9 What are sequence databases? These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another. North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. NCBIGenBankNCBIGenBank Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D. PIRNRL_3DPIRNRL_3D Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot. European Molecular Biology Laboratory EBIExPasyEMBLSwiss-ProtEuropean Molecular Biology Laboratory EBIExPasyEMBLSwiss-Prot Asia: The DNA Data Bank of Japan (DDBJ). DDBJ

10 Content & Organization: Most sequence database installations are examples of complex ASCII/Binary databases, but they usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They often contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing index functions. Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure. Nucleic acid databases are split into subdivisions based on taxonomy (historical). Protein databases are often organized into sections by level of annotation.

11 What are other biological databases? Three dimensional structure databases: Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database. Protein Data BankRutgers Nucleic Acid DatabaseProtein Data BankRutgers Nucleic Acid Database Still more; these can be considered ‘non-molecular’: Reference Databases: e.g. OMIMOMIM — Online Mendelian Inheritance in Man OMIM PubMed/MedLinePubMed/MedLine — over 11 million citations from more than 4 thousand bio/medical scientific journals. PubMed/MedLine Phylogenetic Tree Databases: e.g. the Tree of Life. Tree of LifeTree of Life Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). WITKEGGWITKEGG Population studies data — which strains, where, etc. And then databases that most biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates....

12 So how does one do Bioinformatics? Often on the InterNet over the World Wide Web: SiteURL (Uniform Resource Locator)Content Nat’l Center Biotech' Info'Nat’l Center Biotech' Info'http://www.ncbi.nlm.nih.gov/databases/analysis/software Nat’l Center Biotech' Info' PIR/NBRFPIR/NBRFhttp://www-nbrf.georgetown.edu/protein sequence database PIR/NBRF IUBIO Biology ArchiveIUBIO Biology Archivehttp://iubio.bio.indiana.edu/database/software archive IUBIO Biology Archive Univ. of MontrealUniv. of Montrealhttp://megasun.bch.umontreal.ca/database/software archive Univ. of Montreal Japan's GenomeNetJapan's GenomeNethttp://www.genome.ad.jp/databases/analysis/software Japan's GenomeNet European Mol' Bio' Lab'European Mol' Bio' Lab'http://www.embl-heidelberg.de/databases/analysis/software European Mol' Bio' Lab' European BioinformaticsEuropean Bioinformaticshttp://www.ebi.ac.uk/databases/analysis/software European Bioinformatics The Sanger InstituteThe Sanger Institutehttp://www.sanger.ac.uk/databases/analysis/software The Sanger Institute Univ. of Geneva BioWebUniv. of Geneva BioWebhttp://www.expasy.ch/databases/analysis/software Univ. of Geneva BioWeb ProteinDataBankProteinDataBankhttp://www.rcsb.org/pdb/3D mol' structure database ProteinDataBank Molecules R UsMolecules R Ushttp://molbio.info.nih.gov/cgi-bin/pdb/3D protein/nuc' visualization Molecules R Us The Genome DataBaseThe Genome DataBasehttp://www.gdb.org/The Human Genome Project The Genome DataBase Stanford GenomicsStanford Genomicshttp://genome-www.stanford.edu/various genome projects Stanford Genomics Inst. for Genomic Res’rchInst. for Genomic Res’rchhttp://www.tigr.org/esp. microbial genome projects Inst. for Genomic Res’rch HIV Sequence DatabaseHIV Sequence Databasehttp://hiv-web.lanl.gov/HIV epidemeology seq' DB HIV Sequence Database The Tree of LifeThe Tree of Lifehttp://tolweb.org/tree/phylogeny.htmloverview of all phylogeny The Tree of Life Ribosomal Database Proj’Ribosomal Database Proj’http://rdp.cme.msu.edu/html/databases/analysis/software Ribosomal Database Proj’ WIT MetabolismWIT Metabolismhttp://wit.mcs.anl.gov/WIT2/metabolic reconstruction WIT Metabolism Harvard Bio' LaboratoriesHarvard Bio' Laboratorieshttp://golgi.harvard.edu/nice bioinformatics links list Harvard Bio' Laboratories NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics... BLASTEntrezSRSGCGSeqLabphylogeneticsBLASTEntrezSRSGCGSeqLabphylogenetics

13 So what are the alternatives... ? Desktop software solutions — public domain programs are available, but... complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. Omiga, MacVector, DNAsis, DNAStar, etc., but... license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

14 Therefore, UNIX server-based solutions (e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] ): One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere! Operating system: UNIX command line operation hassles; communications software — telnet, ssh, xdmcp, etc. and terminal emulation; X graphics; file transfer — ftp, Mac Fetch, and scp/sftp; and editors — vi, emacs, pico (or desktop word processing followed by file transfer [save as "text only!"]). Basic UNIX for Neophytes.

15 The Genetics Computer Group — The Accelrys Wisconsin Package for Sequence Analysis. Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc. The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results. Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and, the output from many programs can be used as input for other programs. Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

16 To answer the always perplexing GCG question — “What sequence(s)?....” The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs) The sequence is in a local GCG database in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression and they are case insensitive. The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard — { * }. Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, Furthermore, one can supply attribute information within list files to specify something special about the sequence. Specifying sequences, GCG style; in order of increasing power and complexity:

17 This is a small example of GCG single sequence format. Always put some documentation on top, so in the future you can figure out what it is you're dealing with! The line with the two periods is converted to the checksum line. example.seq Length: 77 July 21, :30 Type: N Check: ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA 1 ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA 51 GATTTAATAG CATGCGATCC CATGGGA ‘Clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs) — SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!

18 Logical terms for the Wisconsin Package — Sequence databases, nucleic acids:Sequence databases, amino acids: GENMBLPLUSall of GenEMBL plus EST and GSS subdivisionsGENPEPTGenBank CDS translations GEPall of GenEMBL plus EST and GSS subdivisionsGPGenBank CDS translations GENEMBLall of GenEMBL except EST and GSS subdivisionsSWISSPROTPLUSall of Swiss-Prot and all of SPTrEMBL GEall of GenEMBL except EST and GSS subdivisionsSWPall of Swiss-Prot and all of SPTrEMBL BAGenEMBL bacterial subdivisionSWISSPROTall of Swiss-Prot (fully annotated) BACTERIALGenEMBL bacterial subdivisionSWall of Swiss-Prot (fully annotated) ESTGenEMBL EST (Expressed Sequence Tags) subdivisionSPTREMBLSwiss-Prot preliminary EMBL translations GSSGenEMBL GSS (Genome Survey Sequences) subdivisionSPTSwiss-Prot preliminary EMBL translations HTCGenEMBL High Throughput cDNAPall of PIR Protein HTGGenEMBL High Throughput GenomicPIRall of PIR Protein INGenEMBL invertebrate subdivisionPROTEINPIR fully annotated subdivision INVERTEBRATEGenEMBL invertebrate subdivisionPIR1PIR fully annotated subdivision OMGenEMBL other mammalian subdivisionPIR2PIR preliminary subdivision OTHERMAMMGenEMBL other mammalian subdivisionPIR3PIR unverified subdivision OVGenEMBL other vertebrate subdivision PIR4PIR unencoded subdivision OTHERVERTGenEMBL other vertebrate subdivision NRL_3DPDB 3D protein sequences PATGenEMBL patent subdivision NRLPDB 3D protein sequences PATENTGenEMBL patent subdivision PHGenEMBL phage subdivision PHAGEGenEMBL phage subdivision General data files: PLGenEMBL plant subdivision GCGCORE path to main GCG files PLANTGenEMBL plant subdivision GENMOREDATApath to GCG optional data files PRGenEMBL primate subdivision GENRUNDATApath to GCG default data files PRIMATEGenEMBL primate subdivision ROGenEMBL rodent subdivision RODENTGenEMBL rodent subdivision STSGenEMBL (sequence tagged sites) subdivision SYGenEMBL synthetic subdivision SYNTHETICGenEMBL synthetic subdivision TAGSGenEMBL EST and GSS subdivisions UNGenEMBL unannotated subdivision UNANNOTATEDGenEMBL unannotated subdivision VIGenEMBL viral subdivision VIRALGenEMBL viral subdivision These are easy — they make sense and you’ll have a vested interest.

19 GCG MSF & RSF format — The trick is to not forget the Braces and ‘wild card,’ e.g. filename{ * }, when specifying! !!RICH_SEQUENCE 1.0..{ name ef1a_giala descrip PileUp type PROTEIN longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala} sequence-ID Q08046 checksum 7342 offset 23 creation-date 07/11/ :51:19 strand 1 comments //////////////////////////////////////////////////////////// !!AA_MULTIPLE_ALIGNMENT 1.0 small.pfs.msf MSF: 735 Type: P July 20, :53 Check: small.pfs.msf MSF: 735 Type: P July 20, :53 Check: Name: a49171 Len: 425 Check: 537 Weight: 1.00 Name: a49171 Len: 425 Check: 537 Weight: 1.00 Name: e70827 Len: 577 Check: 21 Weight: 1.00 Name: e70827 Len: 577 Check: 21 Weight: 1.00 Name: g83052 Len: 718 Check: 9535 Weight: 1.00 Name: g83052 Len: 718 Check: 9535 Weight: 1.00 Name: f70556 Len: 534 Check: 3494 Weight: 1.00 Name: f70556 Len: 534 Check: 3494 Weight: 1.00 Name: t17237 Len: 229 Check: 9552 Weight: 1.00 Name: t17237 Len: 229 Check: 9552 Weight: 1.00 Name: s65758 Len: 735 Check: 111 Weight: 1.00 Name: s65758 Len: 735 Check: 111 Weight: 1.00 Name: a46241 Len: 274 Check: 3514 Weight: 1.00 Name: a46241 Len: 274 Check: 3514 Weight: 1.00 // ////////////////////////////////////////////////// This is SeqLab’s native format

20 The List File Format — An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data... my-special.pepbegin:24end:134 The ‘way’ SeqLab works! remember sign!

21 SeqLab — GCG’s X-based GUI! Seqlab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface: GDE + WPI = SeqLab Requires an X-Windowing environment — either native on UNIX computers (including LINUX, but not included by Apple in Mac OS X [v.10+] but see XDarwin), or emulated with X-Server Software on personal computers.

22 And to ‘honk-my-own-horn’ a bit, check out the new Current Protocols in Bioinformatics from John Wiley & Sons, Inc: They asked me to contribute a chapter on multiple sequence analysis using GCG software. Humana Press, Inc. also asked me to contribute. I’ve got two chapters in their Introduction to Bioinformatics — A Theoretical And Practical Approach: /Product.pasp?txtCatalog= HumanaBooks&txtCategor y=&txtProductID= X&isVariant=0http://www.humanapress.com /Product.pasp?txtCatalog= HumanaBooks&txtCategor y=&txtProductID= X&isVariant=0. /Product.pasp?txtCatalog= HumanaBooks&txtCategor y=&txtProductID= X&isVariant=0 Both publishers say the volumes should be available February 2003! Participate in the tutorial following this lecture and visit my Web page stuff: Contact me for specific bioinformatics assistance and/or long distance collaboration. F OR M ORE I NFO...

23 The SeqLab Tutorial — Elongation Factor 1  from a sorted collection of ‘lower’ Eukaryotes. How to search databases, analyze and interpret pair-wise comparisons for significance, and prepare and analyze multiple sequence alignments.

24 Supplement — How the algorithms work; accompanying illustrations for the Tutorial Introduction: Dot Matrix Techniques, Dynamic Programming, Score Matrices, Significance, Database Similarity Searching, Multiple Sequence Analysis.

25 What about Homology? Inference through homology is a fundamental principle of biology! What is homology — in this context it is similarity great enough such that common ancestry is implied. Walter Fitch, a famous molecular evolutionist, likes to relate the analogy — homology is like pregnancy, you either are or you’re not; there’s no such thing as 65% pregnant!

26 Pairwise Comparisons : The Dot Matrix Method. Provides a ‘Gestalt’ of all possible alignments between two sequences. To begin — very simple 0, 1 (match, nomatch) identity scoring function. Put a dot wherever symbols match. A way to see similarities —

27 Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

28 Noise due to random composition effects contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).

29 The phenylalanine transfer RNA molecule from yeast plotted against itself using a window size to 7 and the stringency value to 5. As a general guide pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is just filtered away enough to enable you to see that desired feature.

30 RNA comparisons of the reverse, complement of a sequence to itself can often be very informative. The yeast tRNA sequence is compared to its reverse, complement using the same 5 out of 7 stringency setting as previously. The stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal running oppositely than before.

31 22 GAGCGCCAGACT G 12, 22 || | ||||| | A 48 CTGGAGGTCTAG A 3 Base position 22 through position 33 base pairs with (think — is quite similar to the reverse- complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.

32 Pairwise Comparisons: Dynamic Programming. A ‘brute force’ approach just won’t work. The computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences, without considering gaps at all. If the two sequences are approximately the same length (N), this is a N 2 problem. To include gaps, the calculation needs to be repeated 2N times to examine the possibility of gaps at each possible position within the sequences, now a N 4N problem. Therefore, An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that: 1) you maximize the number of matching symbols between 1 and 2; 2) you minimize the number of indels within 1 and 2; and 3) you minimize the number of mismatched symbols between 1 and 2. Therefore, the actual solution can be represented by: Therefore, the actual solution can be represented by: S i-1 j-1 or S i-1 j-1 or max S i-x j-1 + w x-1 or max S i-x j-1 + w x-1 or S ij = s ij + max 2 < x < i max S i-1 j-y + w y-1 max S i-1 j-y + w y-1 2 < y < I 2 < y < I Where S ij is the score for the alignment ending at i in sequence 1 and j in sequence 2, s ij is the score for aligning i with j, w x is the score for making a x long gap in sequence 1, w y is the score for making a y long gap in sequence 2, allowing gaps to be any length in either sequence.

33 An oversimplified example — total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])

34 Optimum Alignments — There will probably be more than one best path through the matrix and none of them may be the biologically CORRECT alignment. Starting at the top and working down as we did, then tracing back, I found two optimum alignments: cTATAtAaggcTATAtAagg | ||||| | |||| cg.TAtAaT.cgT.AtAaT. Each of these solutions yields a traceback total score of 22. This is the number optimized by the algorithm, not any type of a similarity or identity score! Even though one of these alignments has 6 exact matches and the other has 5, they are both optimal according to the relatively strange criteria by which we solved the algorithm. Software will report only one of these solutions. Do you have any ideas about how others could be discovered? Answer — Often if you reverse the solution of the entire dynamic programming process, other solutions can be found! Global versus local solution: negative numbers in match matrix and pick best diagonal within overall graph.

35 What about proteins — conservative replacements and similarity as opposed to identity, and similarity versus homology! Similarity is not automatically homology. Homology always means related by descent from a common ancestor. Values whose magnitude is   4 are drawn in outline characters to make them easier to recognize. Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity. BLOSUM62 amino acid substitution matrix Henikoff, S. and Henikoff, J. G. (1992). A B C D E F G H I K L M N P Q R S T V W X Y Z 4 A B C D E F G H I K L M N P Q R S T V W X Y x Z x

36 Significance — When is an Alignment Worth Anything Biologically? Monte Carlo simulations: Z score = [ ( actual score ) - ( mean of randomized scores ) ] ( standard deviation of randomized score distribution ) Many Z scores measure the distance from a mean using a simplistic Monte Carlo model assuming a normal distribution, in spite of the fact that ‘sequence-space’ actually follows what is know as an ‘extreme value distribution;’ however, the Monte Carlo method does approximate significance estimates pretty well.

37 Histogram Key: Each histogram symbol represents 604 search set sequences Each histogram symbol represents 604 search set sequences Each inset symbol represents 21 search set sequences Each inset symbol represents 21 search set sequences z-scores computed from opt scores z-scores computed from opt scores z-score obs exp (=) (*) (=) (*) < :== : : := := :* :* :* :* :* :* :===* :===* :=========* :=========* :==================*== :==================*== :===============================*===== :===============================*===== :===========================================*==== :===========================================*==== :=====================================================*=== :=====================================================*=== :==========================================================* :==========================================================* :===========================================================* :===========================================================* :======================================================== * :======================================================== * :=================================================== * :=================================================== * :=============================================* :=============================================* :====================================== * :====================================== * :============================== * :============================== * :========================= * :========================= * :=====================* :=====================* :=================* :=================* :=============*= :=============*= :==========* :==========* :========* :========* :======* :======* :=====* :=====* :====* :====* :===* :===* :==* :==* :=* :=* :=* :=* :=* :=* :* :* :* :* :* :* :* :========= * :* :========= * :* :=========* :* :=========* :* :===== * :* :===== * :* :=== * :* :=== * :* :=== * :* :=== * :* :== * :* :== * :* :==* :* :==* :* :=* :* :=* :* :=* :* :=* :* :=* :* :=* :* :* :* :* :* :* :* :* :* :* :* :* :* :* :* :* > :*= :*======================================= These are the best hits, those most similar sequences with a Pearson z- score greater than 120 in this search. ‘Sequence-space’ actually follows the ‘extreme value distribution.’ Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated. The particulars of how BLAST and FastA do this differ, but the ‘take- home’ message is the same: The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database and the lower its Z score will be, i.e. is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.

38 Pairwise Comparisons — Database Searching BLAST — Basic Local Alignment Search Tool, developed at NCBI. 1)Normally NOT a good idea to use for DNA against DNA searches (not optimized); 2)Prefilters repeat and “low complexity” sequence regions; 4)Can find more than one region of gapped similarity; 5)Very fast heuristic and parallel implementation; 6)Restricted to precompiled, specially formatted databases; FastA — and its family of relatives, developed by Bill Pearson at the University of Virginia. 1)Works well for DNA against DNA searches (within limits of possible sensitivity); 2)Can find only one gapped region of similarity; 3)Relatively slow, should usually be run in the background; 4)Does not require specially prepared, preformatted databases. Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces the complexity of the search problem from N 2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database searching the heuristic restricts search space by calculating a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. Versions available of each for DNA-DNA, DNA-protein, protein-DNA, and protein-protein searches. Translations done ‘on the fly’ for mixed searches.

39 The algorithms — BLAST: FastA: Two word hits on the same diagonal above some similarity threshold triggers ungapped extension until the score isn’t improved enough above another threshold: the HSP. Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1. Combine nonoverlapping init regions on different diagonals: initn. Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score. Initiate gapped extensions using dynamic programming for those HSP’s above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

40 What about more than just two sequences? Dynamic programming’s complexity increases exponentially with the number of sequences being compared: N-dimensional matrix.... complexity=[sequence length] number of sequences

41 Multiple Sequence Alignment and Analysis — Therefore — pairwise, progressive dynamic programming restricts the solution to the neighbor-hood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment. Conserved regions can be visualized with a sliding window approach and appear as peaks. Concentrate on the first peak seen here.

42 Motifs — GHVDHGKS A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one- dimensional ‘pattern descriptors.’ PROSITE Database of protein families and domains - over 1,000 motifs. This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine- Lysine pair, followed by either a Serine or a Threonine. But motifs can not convey any degree of the ‘importance’ of the residues.

43 Enter the Profile Given a multiple sequence alignment, how can we use all of the information contained in it to find ever more remotely similar sequences, that is those “Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those BLAST/Fast E values above ~10 -5 or so? Use a position specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter! The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment — the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.

44 Some generalities: Work with proteins! If at all possible — Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.

45 Beware of aligning apples and oranges [and grapefruit]! Parologous versus orthologous; genomic versus cDNA; mature versus precursor.

46 Advanced methodologies — Many wondrous things can be accomplished based on combinations of all the previous techniques. PSI-BLAST uses profile methods to iterate database searches. Motif profiles can be discovered in unaligned sequences using Expectation Maximization or optimized in aligned sequences with Hidden Markov Modeling statistics. Secondary structure can be predicted in many cases. See which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at MODEL.html. MODEL.html MODEL.html Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide. PAUP*PHYLIPWorkshop on Molecular EvolutionPAUP*PHYLIPWorkshop on Molecular Evolution

47 See the listed references and WWW sites. Many fine texts are also starting to become available in the field. To learn more - Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” He continues: “... if any lesson is to be drawn... it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician.... We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.” Conclusions —

48 Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A. Genetics Computer Group (GCG), a division of Accelrys, a member of Pharmacopeia, Inc. (Copyright ) Program Manual for the Wisconsin Package, Version 10.3, San Diego, California, USA. Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A. Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio Nucleic Acids Research 22, Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS, 10, Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, , National Biomedical Research Foundation, Washington D.C., U.S.A. Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) ( ) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A. Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244,


Download ppt "A BioInformatics Survey... just a taste, with an emphasis on the GCG suite. Steven M. Thompson Steven M. Thompson Florida State University School of Computational."

Similar presentations


Ads by Google