4 Types of Databases

Primary Databases
  Original submissions by experimentalists
  Content controlled by the submitter
  Examples: GenBank, SNP, GEO, PubChem Substance

Derivative Databases
  Built from primary data
  Content controlled by third party (NCBI)
  Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound

Remember biology's Central Dogma: DNA → RNA → protein. "Primary" refers to one-dimensional "symbol" information written in the sequential order necessary to specify a particular biological molecular entity, be it polypeptide or polynucleotide.

Primary databases serve as a repository of experimentalist sequences (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq: reference sequences; UniGene: genes compared to genetic loci on genomes).

5 What is Entrez?

Entrez Global Query is an integrated search and retrieval system for the databases of the National Center for Biotechnology Information (NCBI).
It provides access to all NCBI databases simultaneously with a single query string and user interface.
It supports Boolean operators and search term tags that limit parts of the search statement to particular fields.
A query returns a unified results page showing the number of hits in each database; each count links to the actual search results for that database.

A text search and retrieval engine, NOT A DATABASE.
A tool for finding biologically linked data.
A virtual workspace for manipulating large datasets.

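As a minimal sketch of how Boolean operators and field tags combine into a single query string, the helpers below assemble a query from tagged terms. The function names `tag` and `combine` are hypothetical and not part of any NCBI library; the example assumes field tags such as [organism] and [title].

```python
def tag(term, field=None):
    """Quote multi-word terms and append an optional [field] tag."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]" if field else text


def combine(operator, *clauses):
    """Join clauses with a Boolean operator (AND, OR, NOT) and parenthesize."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(clauses) + ")"


# Restrict the organism, and accept either of two title terms.
query = combine(
    "AND",
    tag("Homo sapiens", "organism"),
    combine("OR", tag("thyroid peroxidase", "title"), tag("TPO", "title")),
)
print(query)
```

The resulting string can be pasted into the Entrez search box; the parentheses make the grouping of the OR clause explicit.
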
6 Entrez Databases

Each record is assigned a UID
  a unique integer identifier for internal tracking
  the GI number for Nucleotide
Each record is given a Document Summary
  a summary of the record's content (DocSum)
Each record is assigned links to biologically related UIDs
Each record is indexed by data fields
  [author], [title], [organism], and many others

7 The Entrez System

[Figure: the Entrez system linking Journals, UniGene, Books, SNP, PubMed, PubMed Central, UniSTS, Nucleotide, PopSet, Protein, ProbeSet, DNA/RNA overview, Genome, Structure, Taxonomy, CDD, 3D Domains, OMIM]

10 An Entrez Database - Nucleotide

GenBank: Primary Data (97.9%)
  original submissions by experimentalists
  submitters retain editorial control of records
  archival in nature
RefSeq: Derivative Data (2.1%)
  curated by NCBI staff
  NCBI retains editorial control of records
  record content is updated continually

11 What is GenBank? NCBI's Primary Sequence Database

Nucleotide-only sequence database
Archival in nature
Each record is assigned a stable accession number

GenBank Data
  Direct submissions (traditional records)
  Batch submissions (EST, GSS, STS)
  ftp accounts (genome data)

Three collaborating databases
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory (EMBL) Database

12 A Traditional GenBank Record

The Flatfile Format: Header, Feature Table, Sequence

LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3 (bases 1 to 1931)
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:
FEATURES             Location/Qualifiers
     source
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene
                     /gene="AFS1"
     CDS
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO "
                     /db_xref="GI: "
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLENHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHSLELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWWANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGSEEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLTKVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMADFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIKGMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHILSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//

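The three parts of the flatfile (header, feature table, sequence) can be separated with very little code. The following is a rough sketch, not a complete parser: the function names `parse_header` and `parse_sequence` are hypothetical, sub-keywords such as ORGANISM are simply folded into their parent field, and the feature table is skipped entirely. For real work an established parser would be a safer choice.

```python
def parse_header(flatfile_text):
    """Collect top-level header fields of a GenBank flat file into a dict.

    Reading stops at FEATURES. Indented continuation lines are appended to
    the most recent top-level keyword. Values are lists because keywords
    such as REFERENCE can repeat.
    """
    fields = {}
    current = None
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            break
        if not line.strip():
            continue
        if not line[0].isspace():
            keyword, _, value = line.partition(" ")
            current = keyword
            fields.setdefault(current, []).append(value.strip())
        elif current is not None:
            fields[current][-1] += " " + line.strip()
    return fields


def parse_sequence(flatfile_text):
    """Return the sequence between ORIGIN and // with numbers and spaces removed."""
    pieces = []
    in_origin = False
    for line in flatfile_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(pieces)


sample = """\
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
KEYWORDS    .
FEATURES             Location/Qualifiers
ORIGIN
        1 ttcttgtatc ccaaacatct
//
"""
header = parse_header(sample)
sequence = parse_sequence(sample)
print(header["ACCESSION"], sequence)
```
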
13 An Example Record – M17755

Indexing for Nucleotide UID

Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna
                      gbdiv pri
                      srcdb genbank

14 M17755: Feature Table

TPO [gene name]
CDS position in bp
thyroiditis [text word]
thyroid peroxidase [protein name]
protein accession

15 Sequence: 99.99% Accurate

The sequence itself is not indexed… Use BLAST for that!

16 RefSeq: NCBI's Derivative Sequence Database

The RefSeq database is a collection of taxonomically diverse, non-redundant, and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein.

Non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes.
Updated to reflect current sequence data and biology.
Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC).

Similar to a review article, a RefSeq integrates information across multiple sources at a given time and hence provides a foundation for uniting sequence data with genetic and functional information.
RefSeqs are generated to provide reference standards for purposes ranging from genome annotation to reporting the locations of sequence variation in medical records.

17 The common RefSeq accession prefixes

Prefix   Molecular type
NC_      Complete genomic molecule (chromosome; microbial or organelle genome)
NT_      Genomic contig
NM_      Curated mRNA
XM_      mRNA (computed)
NP_      Curated protein
XP_      Protein (computed)
NR_      Curated RNA
XR_      RNA (computed)

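Because the prefix encodes the molecule type, an accession can be classified with a simple lookup. This is a small sketch covering only the eight prefixes in the table above; the function name `classify_accession` is hypothetical, and other RefSeq prefixes exist that this lookup does not know about.

```python
# Prefix-to-type table copied from the slide; not an exhaustive list.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NT_": "Genomic contig",
    "NM_": "Curated mRNA",
    "XM_": "mRNA (computed)",
    "NP_": "Curated protein",
    "XP_": "Protein (computed)",
    "NR_": "Curated RNA",
    "XR_": "RNA (computed)",
}


def classify_accession(accession):
    """Return the molecule type for a RefSeq accession, or None if the
    first three characters are not one of the prefixes in the table."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


for acc in ("NM_000537", "M17755"):
    print(acc, "->", classify_accession(acc))
```

A GenBank accession such as M17755 has no underscore prefix, so the lookup returns None for it.
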
18 Entrez Gene and RefSeq

[Diagram labels: Gene, GenBank, RefSeq, Nucleotide]

Entrez Gene is the central repository for information about a gene available at NCBI, and it often provides links to sites beyond NCBI.
Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs).
Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases.

20 Beyond RefSeq

If your organism does not have RefSeqs…
  UniGene: gene-based clusters of cDNAs and ESTs
  WGS sequences in Entrez Nucleotide (wgs[prop])
  Trace Archive

21 What is UniGene?

A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  Now informed by genome hits (new)
Nonredundant set of gene-oriented clusters
  Each cluster a unique gene
  Information on tissue types and map locations
  Includes known genes and uncharacterized ESTs
  Useful for gene discovery and selection of mapping reagents

Clusters of ESTs are built automatically from sequence similarity. Each cluster represents a gene.

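To illustrate the idea of grouping sequences into clusters from pairwise similarity, here is a toy single-linkage sketch using union-find. It is an assumption-laden simplification and not the UniGene procedure: the similarity hits are taken as given (in practice they would come from a tool such as MegaBLAST), and the function name `cluster` and the identifiers in the example are made up.

```python
def cluster(sequence_ids, similar_pairs):
    """Group sequences so that any two linked by a chain of similarity
    hits end up in the same cluster (single linkage)."""
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


ids = ["est1", "est2", "est3", "mrna1", "est4"]
hits = [("est1", "est2"), ("est2", "mrna1")]  # hypothetical similarity hits
result = cluster(ids, hits)
print(result)
```

Here est1 and mrna1 land in the same cluster even though they were never compared directly, which is both the strength and the weakness of single linkage: one spurious hit can merge two unrelated clusters.
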
22 Organisms in UniGene

Top Ten
  1. Human
  2. Rice
  3. Mouse
  4. Cow
  5. Wheat
  6. Zebrafish
  7. Pig
  8. Chicken
  9. Frog (X. laevis)
  10. Frog (X. tropicalis)

23 Finding UniGene Clusters

by link
by Entrez search

25 Entrez Protein

GenPept (DDBJ, EMBL, GenBank)   4,444,405
RefSeq                          ,753,167
PIR                             ,395
Swiss Prot                      ,005
PDB                             ,621
PRF                             ,079
Third Party Annotation          ,219
Total                           ,693,891

The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.

26 Protein Sources and Links

PIR: no mRNA!
RefSeq: NM_000537
SWISS-PROT: no mRNA!
GenPept: M17755

27 PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published worldwide. It has 12 million records dating back to 1966.
To impose uniformity and consistency on the indexing of biomedical literature, the MeSH vocabulary is used to index journal articles for MEDLINE.
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM.

35 General Protein Databases

SWISS-PROT
  Manually curated
  High-quality annotations, less data
GenPept/TrEMBL
  Translated coding sequences from GenBank/EMBL
  Few annotations, more up to date
PIR
  Phylogenetic-based annotations
All 3 now combining efforts to form UniProt (http://www.uniprot.org)

Unlike the main nucleotide sequence databases, which all contain the same sequence data, the main public protein sequence databases each have their own "personality".

SWISS-PROT, which comes, not so surprisingly, from Switzerland, is a manually curated database. As such it has very high-quality, consistent annotations, which make it well suited to keyword searching. However, since the validation and annotation processes take time, SWISS-PROT does not contain as much data as other protein databases. SWISS-PROT has recently become private and as such is freely available only to academics.

GenPept and TrEMBL are databases generated automatically by translating coding sequences (CDS features) from GenBank and EMBL respectively. TrEMBL sequences eventually get annotated and transferred into SWISS-PROT. The quality of annotations in these databases is not as high as that of SWISS-PROT, but these databases are a lot more up to date than SWISS-PROT.

PIR is also annotated, but its annotations differ from those of SWISS-PROT. The format is often less convenient for text searching, but entries contain information about the record's superfamily, which is often hard to find in other databases.

37 Non-redundant Databases

Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Examples:
  NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
  NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)

The existence of several sequence databases that do not necessarily contain the same sequence information presents a problem when attempting to search as many sequences as possible. Either one searches every database and gets mostly the same entries repeated across them (not to mention the extra work of repeating the same search against different databases), or one searches only a subset of the available databases and risks missing a valuable entry.

This concern is addressed by creating non-redundant databases, which combine several databases and remove the duplicate entries. For example, the non-redundant nucleotide database maintained at NCBI contains GenBank, plus the sequences in EMBL that are not in GenBank, plus the sequences in DDBJ that are not in GenBank or EMBL, plus the sequences in PDB Nucleic that are not in any of the other 3 databases.

Non-redundant databases built from a collection of databases usually contain only sequence data, since the annotation format is not consistent across databases. Note also that in the process of removing duplicate sequences, some very similar but not perfectly identical sequences (including variants, mutations, etc.) can be removed from the non-redundant database.

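The "GenBank, plus what EMBL adds, plus what DDBJ adds" construction described above can be sketched as a merge in priority order. This is an illustrative simplification with made-up accessions and sequences: the function name `build_nonredundant` is hypothetical, and only exact duplicates (ignoring case) are removed, whereas a real non-redundant build may also collapse near-identical sequences.

```python
def build_nonredundant(databases):
    """Merge databases in priority order, keeping the first copy of each sequence.

    databases: list of (database_name, {accession: sequence}) pairs.
    Returns {sequence: (database_name, accession)} for the retained entries.
    """
    retained = {}
    for db_name, records in databases:
        for accession, sequence in records.items():
            key = sequence.upper()
            if key not in retained:
                retained[key] = (db_name, accession)
    return retained


# Hypothetical toy data: identifiers and sequences are invented for illustration.
genbank = {"G1": "ACGT", "G2": "GGCC"}
embl = {"E1": "acgt", "E2": "TTAA"}   # E1 duplicates G1
ddbj = {"D1": "TTAA", "D2": "CCCC"}   # D1 duplicates E2

nr = build_nonredundant([("GenBank", genbank), ("EMBL", embl), ("DDBJ", ddbj)])
for sequence, source in nr.items():
    print(sequence, source)
```

Only the sequence survives as the key, which mirrors the point that annotations are not carried into a merged non-redundant set.
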
38 Protein domain databases

Pfam (http://www.sanger.ac.uk/Software/Pfam/)
  Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
SMART, a Simple Modular Architecture Research Tool (http://smart.embl-heidelberg.de/help/smart_about.shtml)
  Identification and annotation of genetically mobile domains and the analysis of domain architectures
CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
  Combines SMART and Pfam databases
  Easier and quicker search

39 Sequence Motif Databases

Scan Prosite (http://www.expasy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
  Store conserved motifs occurring in nucleic acid or protein sequences
  Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables

A sequence motif is a pattern of conservation in a group of sequences which reflects a common function or structure in these sequences.

Motifs can be represented in several ways in a computer, and the type of analysis that can be performed with the motif depends on the representation.

For example, the most common representation for a motif is a consensus sequence, which shows which residues are allowed or not allowed at each position of the sequence. This type of representation is very common in the literature but has a number of drawbacks in that it provides only a qualitative description of the motif. As a result, consensus sequences are either too specific and do not represent all the motifs they are meant to represent, or they are not specific enough and match false positives.

Other representations of motifs include multiple sequence alignments of the sequences, and numerical representations such as profiles and weight matrices, which assign a "weight" to each possible residue in the motif. Hidden Markov models (HMMs) are statistical representations of motifs where each possible residue at each position is assigned a probability which can depend not only on the position but also on its neighborhood.

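The difference between a qualitative consensus and a numerical representation can be seen in a small sketch. Starting from a toy alignment (invented for illustration), the code below builds a residue frequency table, derives a consensus string from it, and scores candidates against the table. The function names, the 0.5 threshold, and the use of "x" for a variable position are all assumptions of this sketch; real profiles and weight matrices typically add pseudocounts and background frequencies, which are omitted here.

```python
from collections import Counter


def frequency_table(aligned):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    table = []
    for column in zip(*aligned):
        counts = Counter(column)
        table.append({res: n / len(aligned) for res, n in counts.items()})
    return table


def consensus(table, threshold=0.5):
    """Most frequent residue per column, or 'x' if none exceeds the threshold."""
    out = []
    for column in table:
        residue, freq = max(column.items(), key=lambda item: item[1])
        out.append(residue if freq > threshold else "x")
    return "".join(out)


def score(table, candidate):
    """Product of per-position frequencies; 0.0 if any residue was never observed."""
    if len(candidate) != len(table):
        raise ValueError("candidate length must match the motif length")
    result = 1.0
    for column, residue in zip(table, candidate):
        result *= column.get(residue, 0.0)
    return result


alignment = ["CAHC", "CGHC", "CAHC", "CSHC"]  # toy alignment, not real data
table = frequency_table(alignment)
print(consensus(table))
print(score(table, "CGHC"), score(table, "CTHC"))
```

The consensus treats position 2 as fully variable and would accept a residue never seen in the alignment, while the frequency table distinguishes an observed residue from an unobserved one, which is the quantitative information the consensus discards.
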