Special Topics BSC4933/5936: An Introduction to Bioinformatics. Florida State University The Department of Biological Science www.bio.fsu.edu Sept. 2, 2003

BioInformatics Databases, the “T” in van Engelen’s talk... just a glimpse. Steven M. Thompson, Florida State University School of Computational Science and Information Technology (CSIT)

But first some comments — If van Engelen’s talk scared you, don’t worry. His lecture is designed to provide an overview of the “way” that computers “think.” Don’t get hung up on the details. If you were bored, just wait. Lots and lots more ‘exciting’ algorithms are coming up. Something is guaranteed to challenge you!

To begin, some terminology: Just what the heck is an algorithm!? Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.” So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task. And what about bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology...?

Some Definitions, lots of overlap:
Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database (much more later).
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

One way to think about the field — The Reverse Biochemistry Analogy. Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases & cpu power:

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    14396883064    13602262

Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Doubling time ~ one year.
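The doubling time can be checked against the GenBank figures themselves. The sketch below is only an illustration: it assumes pure exponential growth between the first (1982) and last (2001) rows, and the function name doubling_time is made up for this example. It gives an average over the whole period, so individual years will differ.

```python
import math

# (year, base pairs) taken from the first and last rows of the GenBank table.
first_year, first_bp = 1982, 680338
last_year, last_bp = 2001, 14396883064

def doubling_time(t0, n0, t1, n1):
    """Average doubling time in years, assuming pure exponential growth."""
    doublings = math.log2(n1 / n0)  # how many times the total doubled
    return (t1 - t0) / doublings

dt = doubling_time(first_year, first_bp, last_year, last_bp)
print(f"Average doubling time 1982-2001: {dt:.2f} years")
```

Running the same function on adjacent rows shows how uneven the growth was from year to year.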

Database Growth (cont.): The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of October 2002, 99 complete, finished genomes are publicly available for analysis, not counting all the virus and viroid genomes available. The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. Both articles were published mid-February 2001 in the journals Science and Nature.

Some neat stuff from the papers: We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:
Traditional, text-book estimates of the number of genes were often in the 100,000 range; turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!
The protein coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping’ ‘selfish DNA’, of which much may be involved in regulation and control.
Some 100-200 genes were reported to have been transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (Later shown to be not true by more extensive analyses, and to be due to gene loss rather than transfer.)

What are sequence databases?
These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format. An ‘alphabet soup’ of three major database organizations around the world is responsible for maintaining most of this data. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:
North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown University’s National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) & NRL_3D (Naval Research Lab sequences of known three-dimensional structure).
Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT & TrEMBL amino acid sequence databases.
Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).

A little history: developments that affect software and the end user. The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequences. Databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes with many more yet to come. Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change it throws a wrench in the works. NCBI’s ASN.1 format and its Entrez interface attempt to circumvent some of these frustrations. However, database format is much debated as many bioinformaticians argue for relational or object-oriented standards. Unfortunately, until all biologists and computer scientists worldwide agree on one standard and all software is (re)written to that standard, neither of which is likely to happen very quickly, format issues will probably remain the most confusing and troubling aspect of working with primary sequence data.

So what are these databases like? Just what are primary sequences? (Central Dogma: DNA —> RNA —> protein) Primary refers to one dimension — all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide (van Engelen’s “P”). The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes (van Engelen’s “Alphabet”). Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA translations in a DNA database! However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
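To make the “alphabet” idea concrete, here is a minimal sketch that checks a sequence against the standard IUPAC one-letter codes, ambiguity codes included. The constant and function names are made up for this illustration, and the test strings are the opening residues of the EF-1 alpha example entry used in these slides.

```python
# IUPAC one-letter codes; ambiguity codes stand for sets of possible residues.
NUCLEOTIDE_CODES = set("ACGTU" "RYSWKMBDHVN")
AMINO_ACID_CODES = set("ACDEFGHIKLMNPQRSTVWY" "BZX")

def invalid_symbols(sequence, alphabet):
    """Return the set of symbols in `sequence` that are not in `alphabet`."""
    return {symbol for symbol in sequence.upper() if symbol not in alphabet}

dna = "acgggtttgccgccagaaca"       # start of the EF-1 alpha mRNA
protein = "MGKEKTHINIVVIGHVDSGK"    # start of its translation

print(invalid_symbols(dna, NUCLEOTIDE_CODES))       # empty set: all legal
print(invalid_symbols(protein, AMINO_ACID_CODES))   # empty set: all legal
print(invalid_symbols("ACGT-J", NUCLEOTIDE_CODES))  # flags '-' and 'J'
```

Gap characters and other package-specific symbols are deliberately left out here; real programs each make their own choices about them.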

Content & Organization: Sequence database installations are commonly a complex ASCII/Binary mix, usually not relational or Object Oriented (but proprietary ones often are). They’ll contain several very long text files each containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions. Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure well enough.

More organization stuff:
Nucleic Acid DB’s:
  GenBank/EMBL/DDBJ: all Taxonomic categories + HTC’s, HTG’s, & STS’s
  “Tags”: EST’s, GSS’s
Amino Acid DB’s:
  SWISS-PROT
  TrEMBL
  PIR: PIR1, PIR2, PIR3, PIR4
  NRL_3D
  GenPept
Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.

Parts and problems:
All sequence databases contain these elements:
  Name: LOCUS, ENTRY, ID all are unique identifiers.
  Definition: A brief, one-line, textual sequence description.
  Accession Number: A constant data identifier.
  Source and taxonomy information.
  Complete literature references.
  Comments and keywords.
  The all important FEATURE table!
  A summary or checksum line.
  The sequence itself.
But: Each major database as well as each major suite of software tools that you are likely to use has its own distinct format requirements. This can be a huge problem and an enormous time sink, even with helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming familiar with some of the common formats is a big help. Look for key features of each type of entry:

GenBank and GenPept format:

LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//
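To show why recognizing the keywords matters, here is a rough sketch of how a program might pull the top-level fields out of a GenBank-style entry. It is not how any particular package does it: the function name is hypothetical, it assumes keywords start in the first column, and it stops at the FEATURES table rather than parsing it.

```python
def parse_genbank_header(record_text):
    """Collect top-level keyword fields that precede the FEATURES table."""
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break
        if line and not line[0].isspace():
            # A keyword in column one starts a new field.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current and line.strip():
            # Indented lines (sub-keywords such as ORGANISM, or wrapped text)
            # are simply appended to the field above them.
            fields[current] += " " + line.strip()
    return fields

sample = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
SOURCE      human.
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
"""
header = parse_genbank_header(sample)
print(header["ACCESSION"])
print(header["LOCUS"].split()[0])
print(header["SOURCE"])
```

Note how much the sketch leans on column positions; that is exactly why a format change throws a wrench in the works.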

EMBL and SWISS-PROT format:

ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//
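The two-letter line codes make this format easy to pick apart by machine. The following is a minimal sketch, assuming the usual layout of a two-letter code, three spaces, then content; the function name group_by_line_code is invented for this example and the sample is a shortened copy of the entry above.

```python
from collections import defaultdict

def group_by_line_code(entry_text):
    """Group the content of each line under its two-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        code, content = line[:2], line[5:].rstrip()
        if code.strip():               # skips blank and sequence lines
            groups[code].append(content)
    return groups

sample = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OX   NCBI_TaxID=9606, 9913, 9986;
//
"""
fields = group_by_line_code(sample)
entry_name = fields["ID"][0].split()[0]
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
print(entry_name, accessions)
```

The ID name and the accession numbers come out as separate items, which mirrors the point made earlier: databases share accession codes but not proper identifier names.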

PIR/NBRF format:

ENTRY            EFHU1 #type complete
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         #formal_name Homo sapiens #common_name man
                 #cross-references taxon:9606
DATE             30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change…..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   #authors      Rao, T.R.; Slobin, L.I.
   #journal      Nucleic Acids Res. (1986) 14:2409
   #title        Structure of the amino-terminal end of mammalian elongation…
   #accession    B24977
      ##molecule_type mRNA
      ##residues      1-82,'A',84-94 ##label RAO
      ##cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1; PID:g31110…….
GENETICS
   #gene         GDB:EEF1A1; EEF1A; EF1A
      ##cross-references GDB:118791; OMIM:130590
   #map_position 6q14-6q14
   #introns      48/3; 108/3; 207/3; 258/1; 343/3; 422/1
CLASSIFICATION   SF003007 #superfamily translation elongation factor Tu;
                 translation elongation factor Tu homology
KEYWORDS         GTP binding; methylated amino acid; nucleotide binding;
                 P-loop; phosphoprotein; protein biosynthesis; RNA binding
FEATURE
   1-223            #domain eEF-1 alpha domain I, GTP-binding #status predicted #label EF1\
   8-156            #domain translation elongation factor Tu homology #label ETU\
   14-21            #region nucleotide-binding motif A (P-loop)\
   153-156          #region GTP-binding NKXD motif\
   245-330          #domain eEF-1 alpha domain II, tRNA-binding #status predicted #label EF2\
   332-462          #domain eEF-1 alpha domain III, tRNA-binding #status predicted #label EF3\
   36,55,79,165,318 #modified_site N6,N6,N6-trimethyllysine (Lys) #status predicted\
   301,374          #binding_site glycerylphosphorylethanolamine (Glu) (covalent) #status predicted
SUMMARY          #length 462 #molecular_weight 50141
SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K

Pearson FastA format:

>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIVGVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDNMLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPLQDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALSEALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHPGQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAAIVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGKVTKSAQKAQKAK

Only one line of documentation allowed!

GCG single sequence format:

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted
F;8-156/Domain: translation elongation factor Tu homology
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) #status predicted

EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308  ..

  1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK
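FastA is about the simplest of these formats to read: a title line starting with “>” and then nothing but sequence. A minimal reader might look like the sketch below; the function name read_fasta is made up for this illustration, and the sample holds only the first 30 residues of the EFHU1 entry, wrapped over two lines to show that wrapping does not matter.

```python
def read_fasta(text):
    """Return a list of (title, sequence) pairs from FastA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:          # close the previous record
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)            # sequence may span many lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

sample = """\
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGK
STTTGHLIYK
"""
records = read_fasta(sample)
print(records)
```

Everything a program learns about the sequence beyond the residues has to fit on that single title line, which is the price of the format's simplicity.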

GCG MSF & RSF format:

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments …………….

!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..
Name: a49171  Len: 425  Check: 537   Weight: 1.00
Name: e70827  Len: 577  Check: 21    Weight: 1.00
Name: g83052  Len: 718  Check: 9535  Weight: 1.00
Name: f70556  Len: 534  Check: 3494  Weight: 1.00
Name: t17237  Len: 229  Check: 9552  Weight: 1.00
Name: s65758  Len: 735  Check: 111   Weight: 1.00
Name: a46241  Len: 274  Check: 3514  Weight: 1.00
//
……………

RSF is SeqLab’s native format.

Specialized ‘sequence’-type DB’s:
Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam....
Databases that contain multiple sequence entries aligned, e.g. RDP and ALN.
Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.
Databases of species specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.
And on and on.... See Amos Bairoch’s excellent links page: http://us.expasy.org/alinks.html and the wonderful Human Genome Ensembl Project at http://www.ensembl.org/ that tries to tie it all together.

What about other types of biological databases? Three dimensional structure databases: the Protein Data Bank and Rutgers Nucleic Acid Database. These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution are clearly indicated. Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation. Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own. See Molecules R Us at http://molbio.info.nih.gov/cgi-bin/pdb/.

Other types of Biological DB’s:
Still more; these can be considered ‘non-molecular’:
Genomic linkage mapping databases for most large genome projects (w/ pointers to sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, ....
Reference Databases (also w/ pointers to sequences), e.g.:
  OMIM: Online Mendelian Inheritance in Man.
  PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).
Population studies data: which strains, where, etc.
And then databases that many biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates....

So how do you access and manipulate all this data? Often on the InterNet over the World Wide Web:

Site                         URL (Uniform Resource Locator)            Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/              databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/           protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/             database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/          database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                  databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/            databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                     databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                  databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                     databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                  3D mol' structure database
Molecules R Us               http://molbio.info.nih.gov/cgi-bin/pdb/   3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                       The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/           various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                      esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                  HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html     overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/html/              databases/analysis/software
WIT Metabolism               http://wit.mcs.anl.gov/WIT2/              metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                 nice bioinformatics links list

With tools like NCBI’s Entrez & EMBL’s SRS...

Net access software examples:
Internet surfing tools: a World Wide Web browser such as NetScape or MS Explorer.
  Advantage: Can access last night’s updates, fun, fast, efficient.
  Disadvantage: Reformatting usually essential, and... very easy to get lost and/or distracted in cyberspace!
E-Mail at NCBI’s Retrieve Server and other similar servers.
  Advantage: Can access last night’s updates.
  Disadvantage: Slow, inefficient, and reformatting essential.
Anonymous FTP through many servers worldwide.
  Advantage: Can access last night’s updates, very fast.
  Disadvantage: Difficult, inefficient, reformatting essential, and... often no graphical user interface (GUI).
NCBI’s Network Entrez program: a client/server database access solution installed on your machine with a very nice GUI.
  Advantage: Very fast, powerful and efficient; relational links between different databases; neighboring concept.
  Disadvantage: Reformatting usually essential.

But problems sometimes arise with the Net, like bad connections, and there’s always format hassles, etc. So what are some of the alternatives... ? Desktop software solutions — public domain programs are available, but... complicated to install, configure, and maintain. User must be pretty computer savvy. So, commercial software packages are available, e.g. Sequencher, MacVector, DNAsis, DNAStar, etc., but... license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, server-based solutions (e.g. the Wisconsin Package): we’re talking UNIX server computers here. One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere! Operating system: command line operation hassles; communications software (telnet, ssh, xdmcp, etc.) and terminal emulation; X graphics; file transfer (ftp, Mac Fetch, and scp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]). Therefore, those in the optional lab section went through a UNIX tutorial last week and everybody else is encouraged to take a look: Week One BioComputing Basics ( http://bio.fsu.edu/~stevet/BSC5936/Lab1.pdf ).
Within the GCG suite, LookUp is an SRS derivative used to find a sequence of interest from local GCG server databases.
  Advantage: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to reformat, it’s all GCG.
  Disadvantage: DB’s only as new as administrator maintains them.

The Genetics Computer Group — the Wisconsin Package for Sequence Analysis. Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc. The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results. Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and, the output from many programs can be used as input for other programs. Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

To answer the always perplexing GCG question, “What sequence(s)?....”
Specifying sequences, GCG style; in order of increasing power and complexity:
1. The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)
2. The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:”, always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.
3. The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{}”, containing the sequence specification, e.g. a wildcard: { * }.
4. Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@”. Furthermore, one can supply attribute information within list files to specify something special about the sequence.
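The four kinds of specification can be told apart from their punctuation alone. The sketch below is not GCG code; it is a hypothetical classifier (classify_spec is an invented name) that only looks at the “@”, “{}”, and “:” conventions described above, and it would be fooled by a file path that happens to contain a colon.

```python
import re

def classify_spec(spec):
    """Classify a GCG-style sequence specification by its syntax alone."""
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list_file", spec[1:])
    match = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if match:
        # file name, then the sequence specification inside the braces
        return ("multiple_sequence_file", match.group(1), match.group(2).strip())
    if ":" in spec:
        logical_name, _, entry = spec.partition(":")
        # logical names are case insensitive, so normalize them
        return ("database_entry", logical_name.upper(), entry)
    return ("single_sequence_file", spec)

for example in ["my-special.pep", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"]:
    print(example, "->", classify_spec(example))
```

The braces are tested before the colon on purpose, so a multiple sequence file is recognized first whatever else its name contains.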

Logical terms for the Wisconsin Package:

Sequence databases, nucleic acids:
GENBANKPLUS, GBP      all of GenBank plus EST and GSS subdivisions
GENBANK, GB           all of GenBank except EST and GSS subdivisions
BA, BACTERIAL         GenBank bacterial subdivision
EST                   GenBank EST (Expressed Sequence Tags) subdivision
GSS                   GenBank GSS (Genome Survey Sequences) subdivision
HTC                   GenBank High Throughput cDNA
HTG                   GenBank High Throughput Genomic
IN, INVERTEBRATE      GenBank invertebrate subdivision
OM, OTHERMAMM         GenBank other mammalian subdivision
OV, OTHERVERT         GenBank other vertebrate subdivision
PAT, PATENT           GenBank patent subdivision
PH, PHAGE             GenBank phage subdivision
PL, PLANT             GenBank plant subdivision
PR, PRIMATE           GenBank primate subdivision
RO, RODENT            GenBank rodent subdivision
STS                   GenBank (sequence tagged sites) subdivision
SY, SYNTHETIC         GenBank synthetic subdivision
TAGS                  GenBank EST and GSS subdivisions
UN, UNANNOTATED       GenBank unannotated subdivision
VI, VIRAL             GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT, GP           GenBank CDS translations
SWISSPROTPLUS, SWP    all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW         all of Swiss-Prot (fully annotated)
SPTREMBL, SPT         Swiss-Prot preliminary EMBL translations
P, PIR                all of PIR Protein
PROTEIN, PIR1         PIR fully annotated subdivision
PIR2                  PIR preliminary subdivision
PIR3                  PIR unverified subdivision
PIR4                  PIR unencoded subdivision
NRL_3D, NRL           PDB 3D protein sequences

General data files:
GENMOREDATA           path to GCG optional data files
GENRUNDATA            path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

The List File Format: the ‘way’ SeqLab works!

    An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data.
    ..
    my-special.pep begin:24 end:134
    SwissProt:EfTu_Ecoli
    Ef1a-Tu.msf{*}
    /usr/accounts/test/another.rsf{ef1a_*}
    @another.list
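To make the layout of a list file concrete, here is a minimal Python sketch of how one might be read. It is an illustration under stated assumptions, not GCG's own parser: the function name `parse_list_file` is hypothetical, the first line containing two periods is taken as the divider, and attributes are assumed to be whitespace-separated `name:value` tokens as in the `begin:24 end:134` example.

```python
def parse_list_file(text: str) -> list[tuple[str, dict[str, str]]]:
    """Return (specification, attributes) pairs from list-file text.

    Hypothetical helper for illustration; real GCG list files may
    carry extra header lines and attribute rules not handled here.
    """
    lines = text.splitlines()
    # Everything up to the first line containing ".." is documentation.
    for index, line in enumerate(lines):
        if ".." in line:
            data_lines = lines[index + 1:]
            break
    else:
        raise ValueError("no '..' divider between documentation and data")

    entries = []
    for line in data_lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        spec, attributes = tokens[0], {}
        for token in tokens[1:]:
            # Assumed attribute form: name:value, e.g. begin:24
            name, _, value = token.partition(":")
            attributes[name.lower()] = value
        entries.append((spec, attributes))
    return entries


if __name__ == "__main__":
    example = (
        "Documentation goes here.\n"
        "..\n"
        "my-special.pep begin:24 end:134\n"
        "SwissProt:EfTu_Ecoli\n"
        "@another.list\n"
    )
    for spec, attributes in parse_list_file(example):
        print(spec, attributes)
```

An entry that begins with an at sign names another list file, so a fuller reader would call itself recursively on those entries; that step is left out here to keep the sketch short.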

Conclusions: There’s a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner.

To learn more: See the listed references and WWW sites. Many fine texts are also starting to become available in the field. For even more info, see http://bio.fsu.edu/~stevet/workshop.html. Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

Next, a special treat: a colleague of mine, Misha Taylor, will further discuss the nature of databases, with particular emphasis on relational and object oriented data structures.

General Bioinformatics References:

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2001) Program Manual for the Wisconsin Package, Version 10.2, Madison, Wisconsin, USA 53711.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile Analysis: Detection of Distantly Related Proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS 10, 671-675.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.

Swofford, D.L. (1989-1993) PAUP (Phylogenetic Analysis Using Parsimony). Illinois Natural History Survey; (1994) personal copyright; and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTALW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice. Nucleic Acids Research 22, 4673-4680.

von Heijne, G. (1987) Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.

Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.
