Presentation is loading. Please wait.

Presentation is loading. Please wait.

A BioInformatics Survey... just a taste. Steve Thompson Steve Thompson Florida State University School of Computational Science and Information Technology.

Similar presentations


Presentation on theme: "A BioInformatics Survey... just a taste. Steve Thompson Steve Thompson Florida State University School of Computational Science and Information Technology."— Presentation transcript:

1 A BioInformatics Survey... just a taste. Steve Thompson Steve Thompson Florida State University School of Computational Science and Information Technology (CSIT) CSIT

2 Florida State University School of Information Studies LIS 4722 Information Representation Shawne Miksa Spring Semester March 20 & 22, 2001

3 What is bioinformatics, genomics, sequence analysis, computational molecular biology... ? The Reverse Biochemistry Analogy. Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! The computer and molecular databases are a necessary, integral part of this entire process.

4 High quality training is essential! Why: graduates need to be competitive on the world biotechnology market. A perusal of employment listings in scientific journals or e-news groups (e.g. and clearly illustrates this point; over half are often bioinformatics/ biocomputing type positions. An Alfred P. Sloan Foundation Report from a couple of years ago, "Hiring Patterns Experienced by Students Enrolled in Bioinformatics/Computational Biology Programs” (http://www.sloan.org/programs/scitech_page1.htm, May 1999) provides some early insights to the trend. The biotechnology sector either in academia or in commerce, especially the pharmaceutical industry, is the obvious employer, but opportunities abound, in fields as diverse as hospital administration and genetic counseling, to large scale sequencing centers and software development companies. There is no lack of incentive and the situation is unlikely to change for quite some time, especially with the completion of so many genome projects on the horizon. All that newly sequenced DNA needs to be analyzed and annotated; it is, and will continue to be, an enormous job. This is the essence of 'data-mining.'

5 Definitions: Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of a biological system, from individual molecules to organisms to overall ecology. Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database (more later). Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules. Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes. Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

6 The exponential growth of molecular sequence databases The exponential growth of molecular sequence databases & cpu power. Year BasePairs Sequences

7 Database Growth (cont.) The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of February complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available. The International Human Genome Sequencing Consortium announced the completion of a "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first assembly of the human genome. Both articles were recently published mid-February of this year in the journals Science and Nature. Celera GenomicsScienceNatureCelera GenomicsScienceNature

8 Some neat stuff from the papers: We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA — Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25’ and 35,000! The protein coding region of the genome is only about 1% or so, much of the remainder is “jumping” “selfish DNA” of which much may be involved in regulation and control. Over genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome!

9 What are these databases like? What are primary sequences? (Central Dogma: DNA —> RNA —> protein) Primary refers to one dimension — all of the “symbol” information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one letter alphabetic codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural information are not included within this sequence, however, much of this type of information is available in the reference documentation sections associated with primary sequences in the databases. What are sequence databases? These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another. North America: National Center for Biotechnology Information (NCBI): GenBank & GenPept. NCBIGenBankNCBIGenBank Also Georgetown University’s NBRF Protein Identification Resource: PIR & NRL_3D. PIRNRL_3DPIRNRL_3D Europe: European Molecular Biology Laboratory (also EBI & ExPasy): EMBL & Swiss-Prot. EBIExPasyEMBLSwiss-ProtEBIExPasyEMBLSwiss-Prot Asia: The DNA Data Bank of Japan (DDBJ). DDBJ Content & Organization. Most sequence databases are examples of complex ASCII/Binary databases, but usually are not Oracle or SQL or Object Oriented (proprietary ones often are). They contain several very long text files containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions. Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure. Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical).

10 What about other types of biological databases? Three dimensional structure databases: Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database. Protein Data BankRutgers Nucleic Acid DatabaseProtein Data BankRutgers Nucleic Acid Database Still more; these can be considered ‘non-molecular’: Reference Databases: e.g. OMIMOMIM — Online Mendelian Inheritance in Man OMIM PubMed/MedLinePubMed/MedLine — over 11 million citations from more than 4 thousand bio/medical scientific journals. PubMed/MedLine Phylogenetic Tree Databases: e.g. the Tree of Life. Tree of LifeTree of Life Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). WITKEGGWITKEGG Population studies data — which strains, where, etc. And then databases that most biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates....

11 So how does one do Bioinformatics? Often on the InterNet over the World Wide Web: SiteURL (Uniform Resource Locator)Content Nat’l Center Biotech' Info'http://www.ncbi.nlm.nih.gov/databases/analysis/software PIR/NBRFhttp://www-nbrf.georgetown.edu/protein sequence database Johns Hopkins BioInfo'http://www.bis.med.jhmi.edu/bioInformatics.htmldatabases/analysis/software Harvard Bio' Laboratorieshttp://golgi.harvard.edu/databases/analysis/software IUBIO Biology Archivehttp://iubio.bio.indiana.edu/database/software archive Univ. of Montrealhttp://megasun.bch.umontreal.ca/database/software archive Japan's GenomeNethttp://www.genome.ad.jp/databases/analysis/software European Mol' Bio' Lab'http://www.embl-heidelberg.de/databases/analysis/software European Bioinformaticshttp://www.ebi.ac.uk/databases/analysis/software The Sanger Institutehttp://www.sanger.ac.uk/databases/analysis/software Univ. of Geneva BioWebhttp://www.expasy.ch/databases/analysis/software ProteinDataBankhttp://www.rcsb.org/pdb/3D mol' structure database Molecules R Ushttp://molbio.info.nih.gov/cgi-bin/pdb3D protein/nuc' visualization The Genome DataBasehttp://www.gdb.org/The Human Genome Project Stanford Genomicshttp://genome-www.stanford.edu/various genome projects Inst. for Genomic Res’rchhttp://www.tigr.org/esp. microbial genome projects HIV Sequence Databasehttp://hiv-web.lanl.gov/HIV epidemeology seq' DB The Baylor Search Launchhttp://searchlauncher.bcm.tmc.edu/sequence search launcher Pedro's BioMol Res' Toolshttp://www.public.iastate.edu/~pedro/research_tools.htmlbig bookmark list BioToolKithttp://www.biosupplynet.com/cfdocs/btk/btk.cfmannotated molbio tool links Felsenstein's PHYLIP sitehttp://evolution.genetics.washington.edu/phylip.htmlphylogenetic inference The Tree of Lifehttp://phylogeny.arizona.edu/tree/phylogeny.htmloverview of all phylogeny Ribosomal Database Proj’http://www.cme.msu.edu/RDP/databases/analysis/software WIT Metabolismhttp://wit.mcs.anl.gov/WIT2metabolic reconstructions BIOSCI/BIONEThttp://net.bio.netbiologists' news groups Access Excellencehttp://www.accessexcellence.org/biology teaching and learning CELLS alive!http://www.cellsalive.com/animated microphotography Genetics Computer Grouphttp://www.gcg.com/Wisconsin S.A. Package NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics... BLASTEntrezSRSGCG’sSeqLabphylogeneticsBLASTEntrezSRSGCG’sSeqLabphylogenetics

12 What about Homology? Inference through homology is a fundamental principle of biology! What is homology — in this context it is similarity great enough such that common ancestry is implied. Walter Fitch, a famous molecular evolutionist, likes to relate the analogy — homology is like pregnancy, you either are or you’re not; there’s no such thing as 65% pregnant! How to see similarities — Pairwise Comparisons : The Dot Matrix Method. Provides a ‘Gestalt’ of all possible alignments between two sequences. To begin — very simple 0, 1 (match, nomatch) identity scoring function. Put a dot wherever symbols match.

13 Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

14 Noise due to random composition effects contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).

15 The phenylalanine transfer RNA molecule from yeast plotted against itself using a window size to 7 and the stringency value to 5. As a general guide pick a window size about the same size as the feature that you are trying to recognize and a stringency such that unwanted background noise is just filtered away enough to enable you to see that desired feature.

16 RNA comparisons of the reverse, complement of a sequence to itself can often be very informative. The yeast tRNA sequence is compared to its reverse, complement using the same 5 out of 7 stringency setting as previously. The stem-loop, inverted repeats of the tRNA clover-leaf molecular shape become obvious. They appear as clearly delineated diagonals running perpendicular to an imaginary main diagonal running oppositely than before.

17 22 GAGCGCCAGACT G 12, 22 || | ||||| | A 48 CTGGAGGTCTAG A 3 Base position 22 through position 33 base pairs with (think — is quite similar to the reverse-complement of) itself from base position 37 through position 48. MFold, Zuker’s RNA folding algorithm uses base pairing energies to find the family of optimal and suboptimal structures; the most stable structure found is shown to possess a stem at positions 27 to 31 with 39 to 43. However the region around position 38 is represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows ‘reality’ lies somewhere in between.

18 Pairwise Comparisons: Dynamic Programming. A ‘brute force’ approach just won’t work. The computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences, without considering gaps at all. If the two sequences are approximately the same length (N), this is a N 2 problem. To include gaps, the calculation needs to be repeated 2N times to examine the possibility of gaps at each possible position within the sequences, now a N 4N problem. Therefore, An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that: 1)you maximize the number of matching symbols between 1 and 2; 2)you minimize the number of indels within 1 and 2; and 3)you minimize the number of mismatched symbols between 1 and 2. Therefore, the actual solution can be represented by: S i-1 j-1 or S i-1 j-1 or max S i-x j-1 + w x-1 or max S i-x j-1 + w x-1 or S ij = s ij + max 2 < x < i max S i-1 j-y + w y-1 max S i-1 j-y + w y-1 2 < y < I 2 < y < I Where S ij is the score for the alignment ending at i in sequence 1 and j in sequence 2, s ij is the score for aligning i with j, w x is the score for making a x long gap in sequence 1, w y is the score for making a y long gap in sequence 2, allowing gaps to be any length in either sequence.

19 An oversimplified example: total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])

20 Optimum Alignments There will probably be more than one best path through the matrix and none of them may be the biologically CORRECT alignment. Starting at the top and working down as we did, then tracing back, I found two optimum alignments: cTATAtAaggcTATAtAagg | ||||| | |||| cg.TAtAaT.cgT.AtAaT. Each of these solutions yields a traceback total score of 22. This is the number optimized by the algorithm, not any type of a similarity or identity score! Even though one of these alignments has 6 exact matches and the other has 5, they are both optimal according to the relatively strange criteria by which we solved the algorithm. Software will report only one of these solutions. Do you have any ideas about how others could be discovered? Answer — Often if you reverse the solution of the entire dynamic programming process, other solutions can be found! Global versus local solution: negative numbers in match matrix and pick best diagonal within overall graph.

21 What about proteins — conservative replacements and similarity as opposed to identity. The nitrogenous bases, A,C, T, G, are either the same or they’re not, but amino acids can be similar, genetically and structurally! Values whose magnitude is   4 are drawn in outline characters to make them easier to recognize. Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity. BLOSUM62 amino acid substitution matrix. Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: { GAP_CREATE 12 GAP_EXTEND 4 } A B C D E F G H I K L M N P Q R S T V W X Y Z 4 A B C D E F G H I K L M N P Q R S T V W X Y x Z x

22 Significance: When is an Alignment Worth Anything Biologically? Monte Carlo simulations: Z score = [ ( actual score ) - ( mean of randomized scores ) ] ( standard deviation of randomized score distribution ) Many Z scores measure the distance from a mean using a simplistic Monte Carlo model assuming a normal distribution, in spite of the fact that ‘sequence-space’ actually follows what is know as an ‘extreme value distribution;’ however, the Monte Carlo method does approximate significance estimates pretty well.

23 Pairwise Comparisons: Database Searching BLAST — Basic Local Alignment Search Tool, developed at NCBI. 1)Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized); 2)Prefilters repeat and “low complexity” sequence regions; 4)Can find more than one region of gapped similarity; 5)Very fast heuristic and parallel implementation; 6)Restricted to precompiled, specially formatted databases; FastA — and its family of relatives, developed by Bill Pearson at the University of Virginia. 1)Works well for DNA against DNA searches (within limits of possible sensitivity); 2)Can find only one gapped region of similarity; 3)Relatively slow, should usually be run in the background; 4)Does not require specially prepared, preformatted databases. Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces the complexity of the search problem from N 2 for dynamic programming to N, the length of all the sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database searching the heuristic restricts search space by calculating a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. Versions available of each for DNA-DNA, DNA-protein, protein-DNA, and protein- protein searches. Translations done ‘on the fly’ for mixed searches.

24 The algorithms: BLAST: FastA: Two word hits on the same diagonal above some similarity threshold triggers ungapped extension until the score isn’t improved enough above another threshold: the HSP. Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1. Combine non- overlapping init regions on different diagonals: initn. Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score. Initiate gapped extensions using dynamic programming for those HSP’s above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

25 Histogram Key: Each histogram symbol represents 604 search set sequences Each histogram symbol represents 604 search set sequences Each inset symbol represents 21 search set sequences Each inset symbol represents 21 search set sequences z-scores computed from opt scores z-scores computed from opt scores z-score obs exp (=) (*) (=) (*) < :== : : := := :* :* :* :* :* :* :===* :===* :=========* :=========* :==================*== :==================*== :===============================*===== :===============================*===== :===========================================*==== :===========================================*==== :=====================================================*=== :=====================================================*=== :==========================================================* :==========================================================* :===========================================================* :===========================================================* :======================================================== * :======================================================== * :=================================================== * :=================================================== * :=============================================* :=============================================* :====================================== * :====================================== * :============================== * :============================== * :========================= * :========================= * :=====================* :=====================* :=================* :=================* :=============*= :=============*= :==========* :==========* :========* :========* :======* :======* :=====* :=====* :====* :====* :===* :===* :==* :==* :=* :=* :=* :=* :=* :=* :* :* :* :* :* :* :* :========= * :* :========= * :* :=========* :* :=========* :* :===== * :* :===== * :* :=== * :* :=== * :* :=== * :* :=== * :* :== * :* :== * :* :==* :* :==* :* :=* :* :=* :* :=* :* :=* :* :=* :* :=* :* :* :* :* :* :* :* :* :* :* :* :* :* :* :* :* > :*= :*======================================= These are the best hits, those most similar sequences with a Pearson z-score greater than 120 in this search. ‘Sequence-space’ actually follows the ‘extreme value distribution.’ Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E value, can be calculated. The particulars of how BLAST and FastA do this differ, but the ‘take-home’ message is the same: The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database and the lower its Z score will be, i.e. is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.

26 Multiple Sequence Analysis: Multiple Sequence Alignment. Dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix ideas.... Therefore — pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment. Conserved regions can be visualized with a sliding window approach and appear as peaks. Let’s concentrate on the first peak seen here.

27 Motifs GHVDHGKS A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one- dimensional ‘pattern descriptors.’ PROSITE Database of protein families and domains - over 1,000 motifs. This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine. But motifs can not convey any degree of the ‘importance’ of the residues.

28 Enter the Profile Given a multiple sequence alignment, how can we use all of the information contained in it to find ever more remotely similar sequences, that is those “Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those BLAST/Fast E values above ~10 -5 or so? Use a position specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter! The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment — the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.

29 Advanced methodologies Many wondrous things can be accomplished based on combinations of all the previous techniques. PSI-BLAST uses profile methods to iterate database searches. Profiles can be discovered in unaligned sequences to discover motifs using expectation maximization and/or hidden Markov model statistical methods. Secondary structure can be predicted in many cases. See heidelberg.de/predictprotein/predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate representations if the similarity is great enough between your protein and one in which the structure has been solved through experimental means. See SwissModel at heidelberg.de/predictprotein/predictprotein.html Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide.Workshop on Molecular Evolution

30 So what about training? How do you get training in this field; what are you supposed to do? Read all you can and explore the Web sites and, if you’re serious, get involved in one of the training programs, usually at the graduate level, around the country. See the URL’s coming up... See the URL’s coming up... What you can do here at FSU...

31 BioComputing Education Six Major Proposal Foci at FSU: Current Workshops Current Workshops — continue to offer and further expand GCG SeqLab workshop series; each session currently offered twice per semester. Current Workshops Modules (such as this one) — incorporate across the curricula within existing courses, interdisciplinary by nature. Graduate Course — practical, project-oriented approach. Collaborate with proposed Math course. Undergraduate Genomics Course — survey, practical WWW techniques, implications, & ethics. Computational Molecular Biology Program — in association and cooperation with students’ present major department. Pros and Cons... Summer Short Course — long-range. Participants from world-wide disparate disciplines learning bioinformatics techniques and theory.

32 GCG SeqLab Workshop Series Presently four different sessions: Intro to SeqLab & Multiple Sequence AnalysisIntro to SeqLab & Multiple Sequence Analysis and its supplement supplement Intro to SeqLab & Multiple Sequence Analysissupplement Rational Primer Design Rational Primer Design Database Searching & Pairwise Comparisons — Significance Database Searching & Pairwise Comparisons — Significance Molecular Evolutionary Phylogenetics Molecular Evolutionary Phylogenetics FOR MORE INFO...

33 Education/Training Programs I helped to develop one of the first at Washington State University. They are still relatively rare, but more appear all the time. Washington State University Washington State University Biocomputing education URL’s:

34 See the listed references and WWW sites and participate in my bioinformatics workshop series. FOR MORE INFO... Contact CSIT (http://www.csit.fsu.edu/) for general questions; me for specific bioinformatics assistance and/or collaboration. Gunnar von Heijne in his quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: “Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.” He continues: “... if any lesson is to be drawn... it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician.... We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.” Conclusions

35 Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic Local Alignment Tool. Journal of Molecular Biology 215, Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A. Genetics Computer Group (GCG), Inc. (Copyright ) Program Manual for the Wisconsin Package, Version 10.1, Madison, Wisconsin, USA Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A. Gribskov M., McLachlan M., Eisenberg D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84, Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio Nucleic Acids Research 22, Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS, 10, Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, , National Biomedical Research Foundation, Washington D.C., U.S.A. Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) ( ) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A. Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, References

36 Laboratory Exercise: A structural variant of this simple protein molecule causes the notorious condition “Mad-Cow Disease.” Use the Web sites you learned about in this lecture to investigate: What is the molecule’s name? What is the name of the disease in human beings? Is it caused by a virus or bacteria or other pathogen? Do humans have a gene for this protein, and, if so, what is its name, where is it located, and what does it do? What is the name of one of the most similar genes or proteins in non-vertebrates? Is this similarity significant? Just explore — no credit.


Download ppt "A BioInformatics Survey... just a taste. Steve Thompson Steve Thompson Florida State University School of Computational Science and Information Technology."

Similar presentations


Ads by Google