Databases in Bioinformatics

From a PowerPoint presentation by Mark Pallen, Professor of Bacterial Pathogenesis, University of Birmingham

Databases in Bioinformatics
Sequence databases
Sequence analysis
Functional genomics
Literature databases
Structural databases
Metabolic pathway databases
Specialised databases

The definitive source….

DNA Sequence databases
Main repositories:
  GenBank (US)
  EMBL (Europe)
  DDBJ (Japan)
Primary databases
DNA sequences are identical across the three repositories

PubMed is…
National Library of Medicine's search service
>14 million citations in MEDLINE
links to participating online journals
PubMed tutorial (via side bar)

Entrez integrates…
the scientific literature
DNA and protein sequence databases
3D protein structure data
population study data sets
assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases

OMIM is…
Online Mendelian Inheritance in Man
catalog of human genes and genetic disorders
edited by Dr. Victor McKusick and others at Johns Hopkins University (JHU)

Taxonomy Browser is…
browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
taxonomy information such as genetic codes
molecular data on extinct organisms

Structure site includes…
Molecular Modelling Database (MMDB)
biopolymer structures obtained from the Protein Data Bank (PDB)
Cn3D (a 3D-structure viewer)
vector alignment search tool (VAST)

How can I use PubMed at NCBI to find literature information?

PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966.

MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM. The MeSH vocabulary is used for indexing journal articles for MEDLINE. This controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

PubMed search strategies
Try the tutorial on the left sidebar
Use Boolean queries, e.g. lipocalin AND disease
Try using “limits”
Try “LinkOut” to find external resources

1 AND 2: lipocalin AND disease (96 results)
1 OR 2: lipocalin OR disease (1.9 million results)
1 NOT 2: lipocalin NOT disease (729 results)
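The three Boolean operators behave like set operations on the records matched by each term. The sketch below is only an illustration with invented record IDs (they are not real PubMed identifiers): AND is an intersection, OR a union, and NOT a difference. The same arithmetic applies to the counts on the slide, where 96 + 729 gives the total number of lipocalin records.

```python
# Toy illustration: each search term maps to the set of record IDs that mention it.
# The IDs are made up for this example; they are not real PubMed identifiers.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease            # lipocalin AND disease
either = lipocalin | disease          # lipocalin OR disease
only_lipocalin = lipocalin - disease  # lipocalin NOT disease

print("AND:", sorted(both))
print("OR: ", sorted(either))
print("NOT:", sorted(only_lipocalin))

# AND and NOT split the lipocalin records into two disjoint groups.
assert len(both) + len(only_lipocalin) == len(lipocalin)
```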

Fulltext Literature Databases
Highwire
Google Scholar
Google Print
Useful for finding information about genes buried in tables in papers, invisible to PubMed

From Highwire ...Stanford University

What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
  X         GenBank genomic DNA sequence
  NT_       Genomic contig
  Rs        dbSNP (single nucleotide polymorphism)
RNA
  N         An expressed sequence tag (1 of 170)
  NM_       RefSeq DNA sequence (from a transcript)
protein
  NP_       RefSeq protein
  AAC02945  GenBank protein
  Q         SwissProt protein
  1KT7      Protein Data Bank structure record
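Because the prefix of an accession number hints at the database it came from, a small lookup can guess the record type. The following is a minimal sketch, not an official NCBI rule set: the patterns are assumptions based on the examples in the slide (plus the common three-letters-and-five-digits shape of GenBank protein accessions), and the function name `classify_accession` is hypothetical.

```python
import re

# Prefix rules inferred from the slide's examples. Real accession formats are
# more varied, so anything that matches no rule is reported as "unknown".
RULES = [
    (re.compile(r"^NM_\d+"), "RefSeq DNA sequence (from a transcript)"),
    (re.compile(r"^NP_\d+"), "RefSeq protein"),
    (re.compile(r"^NT_\d+"), "Genomic contig"),
    (re.compile(r"^rs\d+$", re.IGNORECASE), "dbSNP (single nucleotide polymorphism)"),
    (re.compile(r"^[A-Z]{3}\d{5}$"), "GenBank protein"),
    (re.compile(r"^[0-9][A-Za-z0-9]{3}$"), "Protein Data Bank structure record"),
]


def classify_accession(accession: str) -> str:
    """Return a best-guess record type for an accession string."""
    accession = accession.strip()
    for pattern, label in RULES:
        if pattern.match(accession):
            return label
    return "unknown"


# "NM_000001" and "rs12345" are placeholder identifiers, not the RBP4 records.
for example in ["1KT7", "AAC02945", "NM_000001", "rs12345", "hello"]:
    print(example, "->", classify_accession(example))
```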

How can I use NCBI (or other sites) to find information about a protein or gene?

FASTA format

Graphics format

Question #4: How can I find information about a particular disease? Answer: Try OMIM

Sequence Databases
Annotated sequence databases
  SWISS-PROT, GenBank etc.
  Usage: identifying function, retrieving information
Low-annotation sequence databases
  EST databases, high-throughput genome sequences
  Usage: discovery of new genes

Sequence databases are the databases most molecular biologists are familiar with. These include:

Annotated sequence databases (GenBank, EMBL, SWISS-PROT etc.)
In these databases, each record is annotated and includes reference information in addition to the sequence. This reference information allows the database to be searched using keywords. Annotated sequence databases are typically used for identifying the function of unknown sequences by database similarity searching. If an unknown sequence matches a sequence in one of these databases, its function can often be inferred from the annotations associated with the matching sequence.

Low-annotation sequence databases
These databases contain mainly sequence data, usually with only minimal annotations because many of the sequences are uncharacterized. These databases can be useful as a source of new gene sequences.

Specialized databases
These contain a subset of sequences, usually of a specific type, formatted or annotated in a specific way useful for biologists working on these sequences.

General Protein Databases
SWISS-PROT
  Manually curated
  High-quality annotations, less data
GenPept/TrEMBL
  Translated coding sequences from GenBank/EMBL
  Few annotations, more up to date
PIR
  Phylogenetic-based annotations
All 3 now combining efforts to form UniProt

Unlike the main nucleotide sequence databases, which all contain the same sequence data, the main public protein sequence databases each have their own “personality”.

SWISS-PROT, which comes, not so surprisingly, from Switzerland, is a manually curated database. As such it has very high quality, consistent annotations, which make it very suitable for keyword searching. However, since the validation and annotation processes take time, SWISS-PROT does not contain as much data as other protein databases. SWISS-PROT has recently become private and as such is available freely only to academics.

GenPept and TrEMBL are databases generated automatically by translating coding sequences (CDS features) from GenBank and EMBL respectively. TrEMBL sequences eventually get annotated and transferred into SWISS-PROT. The quality of annotations in these databases is not as high as that of SWISS-PROT, but these databases are a lot more up to date than SWISS-PROT.

PIR is also annotated, but its annotations are different from those of SWISS-PROT. The format is often less convenient for text searching, but entries contain information about the record’s superfamily, which is often hard to find in other databases.

Low-annotation Databases
ESTs (Expressed Sequence Tags)
  Low quality sequences generated by high-volume sequencing of the 3’ or 5’ end of cDNAs
High-throughput genome sequences
  Produced by mass-sequencing of genomic DNA

Low-annotation databases contain sequences about which little is known, and include sequences obtained in large scale genome survey projects. For example, ESTs (Expressed Sequence Tags) are obtained by single pass sequencing of the 3’ or 5’ end of cDNAs derived from a specific library. These sequences are usually incomplete, and often of poor quality, with frequent sequencing errors. However, the sheer volume of sequences obtained in this manner makes EST databases a useful resource in which to identify new genes and new gene functions, to extend an existing sequence, or to locate exons in genomic DNA sequences. ESTs now make up about 40% of GenBank.

High-throughput genome sequences are the genomic DNA equivalent of ESTs, and can be a potential source of new genes, especially poorly expressed genes which would not be detected in an EST library.

The GenBank EST database is mirrored on BioNavigator.

Non-redundant Databases
Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Examples:
  NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
  NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)

The existence of several sequence databases which may not necessarily contain the same sequence information can present a problem when attempting to search as many sequences as possible. Either one searches every database and gets mostly the same entries repeated across the many databases (not to mention the extra work of repeating the same search with different databases), or one searches only a subset of available databases and risks missing out on some valuable entry.

This concern is now addressed by creating non-redundant databases which combine several databases together and remove the duplicate entries. For example, the non-redundant nucleotide database maintained at NCBI contains GenBank, plus the sequences in EMBL which are not in GenBank, plus the sequences in DDBJ which are not in GenBank or EMBL, plus the sequences in PDB Nucleic which are not in any of the other 3 databases.

Non-redundant databases built from a collection of databases usually contain only sequence data, since the annotation format is not consistent across databases. Note also that in the process of removing duplicate sequences, some very similar but not perfectly identical sequences (including variants, mutations etc.) can be removed from the non-redundant database.
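The merge described above can be sketched in a few lines. This is an illustrative toy, not the procedure NCBI actually uses: the function name `build_nonredundant` and the records are invented, and only exact duplicates (ignoring case) are dropped, whereas real non-redundant builds may also collapse near-identical sequences as the notes warn.

```python
def build_nonredundant(databases):
    """Merge databases in priority order, keeping the first record seen per sequence.

    `databases` is a list of (name, records) pairs; `records` is a list of
    (identifier, sequence) tuples. Only exact duplicates are removed.
    """
    seen = set()
    merged = []
    for db_name, records in databases:
        for identifier, sequence in records:
            key = sequence.upper()
            if key in seen:
                continue  # already contributed by a higher-priority database
            seen.add(key)
            merged.append((db_name, identifier, sequence))
    return merged


# Invented records: E1 duplicates G1, and D1 duplicates E2 (different case).
genbank = [("G1", "ATGGCC"), ("G2", "ATGTTT")]
embl = [("E1", "ATGGCC"), ("E2", "ATGAAA")]
ddbj = [("D1", "atgaaa"), ("D2", "ATGCCC")]

nr = build_nonredundant([("GenBank", genbank), ("EMBL", embl), ("DDBJ", ddbj)])
for row in nr:
    print(row)
```

The order of the input list encodes the priority: GenBank first, then whatever EMBL adds, then whatever DDBJ adds.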

Sequence & Structure Databases
PDB (Protein Databank)
  Stores 3-dimensional atomic coordinates for biological molecules including proteins and nucleic acids
  Data obtained by X-ray crystallography, NMR, or computer modeling
MMDB (Molecular Modeling database)
  Over 28,000 3D macromolecular structures, including proteins and polynucleotides
SCOP (Structural Classification of Proteins)
  Classification of proteins according to structural and evolutionary relationships

Most protein tertiary structure prediction methods are very dependent on existing protein structures. These structures are obtained by experimental methods (X-ray diffraction or NMR) or by computer modeling. The PDB database is the major repository of protein structures (and to some extent of nucleic acid structures). The structures are stored as atomic coordinates.

There are other structural databases which add value to the raw data stored in PDB. For example, the SCOP database classifies proteins according to structural similarity and evolutionary relationships. It presents the user with a hierarchical classification of proteins in families and superfamilies, with links to relevant PDB structures.
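To make "stored as atomic coordinates" concrete: in the classic PDB flat-file format each atom is one fixed-column ATOM (or HETATM) line, with the x, y and z coordinates in columns 31-38, 39-46 and 47-54. The sketch below reads a few of those fields; the function name `parse_atom_line` is hypothetical and the sample atom is invented rather than taken from a real entry.

```python
def parse_atom_line(line):
    """Extract a few fixed-column fields from a PDB ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


# Build an illustrative record in the fixed-column layout (values are invented).
sample = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "MET", "A", 1, 27.340, 24.430, 2.614
)

atom = parse_atom_line(sample)
print(sample)
print(atom)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in real files can touch without a separating space.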

File Formats
GenBank/GB, GenBank flatfile format
NBRF format
EMBL, EMBL flatfile format
Swissprot
GCG, single sequence format of GCG software
DNAStrider, for common Mac program
Pearson/Fasta, a common format used by Fasta programs and others
Phylip3.2, sequential format for Phylip programs
Phylip, interleaved format for Phylip programs (v3.3, v3.4)
Plain/Raw, sequence data only (no name, document, numbering)
MSF, multi sequence format used by GCG software
PAUP's multiple sequence (NEXUS) format
ASN.1 format used by NCBI
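As an illustration of how simple some of these formats are, here is a minimal sketch of a Pearson/FASTA reader and a conversion to Plain/Raw. It assumes the usual convention that a line starting with ">" is a header and the following lines hold the sequence; the function names and the two sample records are invented for the example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Pearson/FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_raw(records):
    """Plain/Raw output: sequence data only, one sequence per line."""
    return "\n".join(sequence for _, sequence in records)


example = """>seq1 first example record
ATGGCC
TTTAAA
>seq2 second example record
GGGCCC
"""

records = parse_fasta(example)
print(records)
print(to_raw(records))
```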

EMBL Format
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP
RX   MEDLINE;
RX   PUBMED;
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2): (1991).
XX
RN   [6]
RP
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P
DR   MENDEL; 11000; Trirp;1162;
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number=" "
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA "
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
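Every line of an EMBL flat-file entry starts with a two-letter line code (ID, AC, DE, OS and so on), with XX used as a spacer, which makes the format easy to read programmatically. The following is a minimal sketch, assuming the content starts at column 6 as in the record above; the function name `group_embl_lines` is hypothetical, and the short entry is an excerpt of the same record.

```python
from collections import defaultdict


def group_embl_lines(text):
    """Group the content of an EMBL flat-file entry by its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code in ("XX", "//") or not code.strip():
            continue  # skip spacer lines, the end-of-entry marker and sequence lines
        fields[code].append(content)
    return fields


entry = """\
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
"""

fields = group_embl_lines(entry)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(accessions)
print(fields["OS"])
print(len(fields["OC"]), "classification lines")
```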

GenBank Format
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
LOCUS       SCU bp DNA PLN JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), (1996)
  MEDLINE
  PUBMED
REFERENCE   3 (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene
                     /gene="AXL2"
     CDS
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT  a c g t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
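In the ORIGIN section of a GenBank record the sequence is broken into blocks of ten bases, with a running position number at the start of each line. A short sketch of how the plain sequence might be recovered is shown below, using the first two ORIGIN lines of the sample record; the function name `origin_to_sequence` is hypothetical.

```python
from collections import Counter


def origin_to_sequence(origin_lines):
    """Join GenBank ORIGIN lines into one sequence, dropping numbers and spaces."""
    blocks = []
    for line in origin_lines:
        for token in line.split():
            if not token.isdigit():  # skip the leading position number
                blocks.append(token)
    return "".join(blocks)


origin = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]

sequence = origin_to_sequence(origin)
print(len(sequence))
print(Counter(sequence))  # per-base totals, the kind of figure BASE COUNT reports
```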

Swissprot format

Specialized Sequence Databases
Focus on a specific type of sequences
Sequences are often modified or specially annotated
Usage depends on the database
Examples:
  Ribosomal RNA databases
  Immunology databases

In addition to the general sequence repositories, a number of specialized sequence databases have appeared on the WWW. These databases are usually designed to meet the needs of a particular section of the research community and often store the sequences in a unique format or with annotations not found in the more general sequence databases. Examples of these databases include ribosomal RNA databases and immunological sequence databases.

Protein domain databases
Pfam
  Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
SMART (a Simple Modular Architecture Research Tool)
  Identification and annotation of genetically mobile domains and the analysis of domain architectures
CDD
  Combines SMART and Pfam databases
  Easier and quicker search

Sequence Motif Databases
Scan Prosite and PRINTS
Store conserved motifs occurring in nucleic acid or protein sequences
Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables

A sequence motif is a pattern of conservation in a group of sequences which reflects a common function or structure in these sequences. Motifs can be represented in several ways in a computer, and the type of analysis that can be performed with the motif depends on the representation.

For example, the most common representation for a motif is a consensus sequence which shows which residues are allowed or not allowed at each position of the sequence. This type of representation is very common in the literature but has a number of drawbacks in that it provides only a qualitative description of the motif. As a result, consensus sequences are either too specific and do not represent all the motifs they are meant to represent, or they are not specific enough and match false positives.

Other representations of motifs include multiple sequence alignments of the sequences, and numerical representations such as profiles and weight matrices which assign a “weight” to each possible residue in the motif. Hidden Markov models (HMMs) are statistical representations of motifs where each possible residue at each position is assigned a probability which can depend not only on the position but also on its neighborhood.
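A consensus pattern of the "allowed / not allowed at each position" kind maps naturally onto a regular expression. The sketch below translates a simplified subset of the PROSITE pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for anchors) and tries it on the N-glycosylation pattern N-{P}-[ST]-{P}. The function name `prosite_to_regex` and the two test sequences are invented, and less common syntax is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a Python regex string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            piece = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"   # residues that are not allowed
        else:
            piece = core                      # single residue or [ABC] class
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(pieces)


glyco = prosite_to_regex("N-{P}-[ST]-{P}")
print(glyco)

for protein in ["MKNGSAL", "MKNPSAL"]:  # invented test sequences
    hit = re.search(glyco, protein)
    print(protein, "->", hit.group(0) if hit else "no match")
```

The second sequence shows the qualitative nature of such patterns: a single proline in the forbidden position removes the match entirely, with no notion of a near miss.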

Ribosomal RNA Databases
RDP (Michigan State University, USA)
rRNA database (University of Antwerp, Belgium)
Ribosomal RNA sequences are pre-aligned according to their secondary structure
Usage: creating data sets for molecular phylogeny, especially for microbial taxonomy and identification

Ribosomal RNA has been found to be a very useful molecule in the study of evolution, since it is present in every lifeform and can be aligned on the basis of its secondary structure to produce multiple sequence alignments from which phylogenetic trees can be built. This is particularly useful for characterizing new bacteria and classifying them.

Two ribosomal RNA sequence databases are available on the WWW. In these databases, the rRNA sequences have been pre-aligned according to their secondary structure so that researchers can extract multiple sequence alignments for their organisms of choice directly from the database. Ribosomal RNA sequences from new organisms can then be aligned to the alignment extracted from the database, and a phylogenetic tree built from the results. The rRNA databases also provide additional information on the sequences, including graphical representations of their predicted secondary structures.

Immunological Sequence Databases
The Kabat Database of Sequences of Proteins of Immunological Interest
  Sequences are classified according to antigen specificity, and available in pre-aligned format
The Immunogenetics database (IMGT)
  Focuses on immunoglobulins, T-cell receptors and MHC genes

The major sequence databases do not cater well to the study of immunological proteins, which have a very large number of variants and would clutter the database if they were all stored as individual records. The Kabat database attempts to store all these protein sequences, classified not only according to type, but also according to antigenic specificity. The IMGT database focuses on the gene sequences rather than the proteins, and stores information about variable and conserved regions. It includes software for displaying 3D structures, aligning sequences etc.

Genome Databases
Focus on one organism or group of organisms
Examples:
  Colibase (E. coli and related species)
  GDB (human)
  Flybase (Drosophila)
  WormBase (C. elegans)
  AtDB (Arabidopsis)
  SGD (S. cerevisiae)

Genome databases aim to summarize the available information about a particular organism’s genome (or a group of related organisms). Almost every collaborative genome project has an associated genome database which allows the collaborators to access and modify the data.

Expression Databases
RNA expression
  Results of microarray experiments measuring the change in specific mRNA content under certain conditions
  ArrayExpress (EBI) and GEO (NCBI)
  Not user friendly
Proteome databases
  2D gel electrophoresis images representing the protein content of a cell or tissue under specific conditions
  SWISS 2D PAGE

Expression data is a relative newcomer on the bioinformatics scene. Several repositories of 2D gel images for proteome studies have been established, with the main one found at the Expasy site in Switzerland. However, at this writing there is no central repository of RNA expression data generated by microarray analysis. When available, the data is usually released on the original authors’ web site. The storage, management and automated analysis of the large amount of data being generated by expression studies are still an open issue, as is the integration of these data with the sequence databases.

Other Database Types
Literature
  MEDLINE
  HighWire
Variation
  dbSNP
  HGBase
Metabolic pathways
  KEGG
  WIT
Organisms and nomenclature
  Taxonomies
  Mendel

There are many other types of databases available either on the WWW or as stand-alone collections (on CD-ROM for example). Most scientists would be familiar with literature databases, MEDLINE in particular. Other databases are not so well known outside a particular field of expertise. The advantage of the WWW is that it allows individuals to make databases publicly available very easily. However, data published in this fashion are often difficult to integrate with other information.

Methods for Accessing Data
local installation
screen scraping
BioPerl
FTP sites
DAS

Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Screen scraping often involves ignoring binary data (usually images or multimedia data) and formatting elements that obscure the essential, desired text data. Optical character recognition software is a kind of visual scraper.

There are a number of synonyms for screen scraping, including: data scraping, data extraction, web scraping, page scraping, web page wrapping and HTML scraping (the last four being specific to scraping web pages).

Local Installations
SRS
  Need to obtain a license from Lion Biosciences
Download data from FTP sites
Ensembl
  "framework to organise biology around the sequences of large genomes"

Screen Scraping
URL spoofing
  construction of URLs that replicate the query
HTML parsing
  extraction of results from HTML pages returned by the query
Requirements
  HTML module
  knowledge of the query mechanism
Method NOT advocated by most data providers
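The two steps can be sketched with the Python standard library alone: `urllib.parse.urlencode` builds a URL that replicates a form query, and `html.parser.HTMLParser` pulls the table cells out of the returned page. Everything specific here is an assumption made for illustration: the base URL `https://example.org/search`, the parameter names, and the HTML page, which is a hard-coded string so that the example runs without contacting any server.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_query_url(base_url, **params):
    """Replicate a form submission by encoding the parameters into the URL."""
    return base_url + "?" + urlencode(params)


class TableCellScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.rows[-1][-1] = self.rows[-1][-1].strip()
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical endpoint and parameters; no request is actually sent.
url = build_query_url("https://example.org/search", db="protein", term="lipocalin AND disease")
print(url)

# Stand-in for the HTML page a real query would return.
page = (
    "<html><body><table>"
    "<tr><td>1KT7</td><td> structure record </td></tr>"
    "<tr><td>AAC02945</td><td><b>GenBank</b> protein</td></tr>"
    "</table></body></html>"
)
scraper = TableCellScraper()
scraper.feed(page)
print(scraper.rows)
```

The scraper depends entirely on the page keeping its current table layout, which is one reason data providers tend to discourage this approach.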

BioPerl
BioPerl is a collection of modules that facilitates the development of Perl scripts for bioinformatics applications.

ReadSeq
Converts input DNA/AA sequence to specified format
Usage:
readseq my1st.seq my2nd.seq -all -format=genbank -output=my.gb
Online Manual:
