Primary vs. Derivative Databases [diagram]: Sequencing centers and labs submit sequences to GenBank, the primary database, which is updated ONLY by submitters (divisions shown: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL). Derivative databases (UniGene, UniSTS, and RefSeq, the latter through the LocusLink and Genomes pipelines and the annotation pipeline) are built from these data by algorithms and curators, and are updated continually by NCBI.
Types of Databases. Primary Databases: original submissions by experimentalists. Remember biology's Central Dogma: DNA → RNA → protein. "Primary" refers to one-dimensional 'symbol' information, written in sequential order, that specifies a particular biological molecular entity, be it a polypeptide or a polynucleotide. Content is controlled by the submitter. Examples: GenBank, SNP, GEO, PubChem Substance. Derivative Databases: built from primary data. Content is controlled by a third party (NCBI). Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound.
What is Entrez? Entrez Global Query is an integrated search and retrieval system for the databases of the National Center for Biotechnology Information (NCBI). It provides access to all NCBI databases simultaneously through a single query string and user interface. It supports Boolean operators and search term tags that limit parts of the search statement to particular fields. A search returns a unified results page showing the number of hits in each database; each count links to the actual search results for that database. Entrez is a text search and retrieval engine, NOT A DATABASE: a tool for finding biologically linked data, and a virtual workspace for manipulating large datasets.
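As a rough illustration of Boolean operators and field tags, the sketch below assembles an Entrez query string and wraps it in a request URL for the NCBI E-utilities ESearch endpoint. It targets one database at a time rather than the global query page, it only builds the URL (no request is sent), and the helper names `field_term`, `build_query`, and `esearch_url` are made up for this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(term: str, field: str) -> str:
    """Tag a search term with an Entrez field, e.g. "Homo sapiens"[organism]."""
    # Multi-word terms are quoted so they are searched as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def build_query(*clauses: str, op: str = "AND") -> str:
    """Join tagged clauses with one Boolean operator (AND, OR, or NOT)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(clauses)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for one database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = build_query(
    field_term("thyroid peroxidase", "title"),
    field_term("Homo sapiens", "organism"),
)
print(query)
print(esearch_url("nucleotide", query))
```

The same query string can be pasted directly into the Entrez search box; the URL form is only needed for scripted access.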
Entrez Databases. Each record is assigned a UID, a unique integer identifier for internal tracking (the GI number for Nucleotide). Each record is given a Document Summary (DocSum), a summary of the record’s content. Each record is assigned links to biologically related UIDs. Each record is indexed by data fields: [author], [title], [organism], and many others.
The Entrez System [diagram]: Entrez at the center, linking Nucleotide, Protein, Structure, PubMed, PopSet, Genome, OMIM, Taxonomy, Books, ProbeSet, 3D Domains, UniSTS, SNP, CDD, UniGene, Journals, and PubMed Central.
An Entrez Database: Nucleotide. GenBank, primary data (97.9%): original submissions by experimentalists; submitters retain editorial control of records; archival in nature. RefSeq, derivative data (2.1%): curated by NCBI staff; NCBI retains editorial control of records; record content is updated continually.
What is GenBank? NCBI’s primary sequence database: a nucleotide-only sequence database, archival in nature, in which each record is assigned a stable accession number. GenBank data arrive as direct submissions (traditional records), batch submissions (EST, GSS, STS), and through ftp accounts (genome data). Three collaborating databases: GenBank, the DNA Database of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL) Database.
A Traditional GenBank Record: The Flatfile Format

Header

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
            eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.

Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

Sequence

ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
          ...
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
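To make the three parts of the flatfile more concrete, here is a minimal sketch that pulls a few values out of the header and feature table of the AY182241 record. The helpers `parse_locus`, `parse_version`, and `cds_protein_length` are hypothetical names written for this example; they assume the whitespace-separated layout shown on the slide and a simple `start..end` CDS location, which real records do not always follow. For real work a maintained parser such as Biopython's `Bio.SeqIO` would be the usual choice.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line: name, length, 'bp', molecule, topology, division, date."""
    fields = line.split()
    if fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "mol_type": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number, and GI."""
    _, accession_version, gi = line.split()
    accession, version = accession_version.split(".")
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi.split(":")[1]),
    }


def cds_protein_length(location: str) -> int:
    """Protein length for a simple 'start..end' CDS that includes its stop codon."""
    start, end = (int(part) for part in location.split(".."))
    nucleotides = end - start + 1
    if nucleotides % 3:
        raise ValueError("CDS length is not a multiple of three")
    return nucleotides // 3 - 1  # the stop codon is not translated


locus = parse_locus("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
version = parse_version("VERSION AY182241.2 GI:32265057")
print(locus["division"], version["version"], cds_protein_length("54..1784"))
```

For the CDS at 54..1784 this gives 1731 nucleotides, or 577 codons including the stop, so a 576-residue protein, which matches the length of the `/translation` qualifier in the record.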
An Example Record: M17755
Indexing for Nucleotide UID 4680720

Field                  Indexed Terms
[primary accession]    M17755
[title]                Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]             Homo sapiens
[sequence length]      3060
[modification date]    1999/04/26
[properties]           biomol mrna; gbdiv pri; srcdb genbank
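One way to picture field indexing is as an inverted index from (field, term) pairs to UIDs. The sketch below is only a toy model of that idea, not a description of how Entrez is implemented; the `FieldIndex` class is invented for this example, and the second record (UID 1) is a made-up entry added so that the AND search has something to exclude.

```python
from collections import defaultdict


class FieldIndex:
    """Toy inverted index: (field, term) -> set of record UIDs."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, uid: int, field: str, term: str) -> None:
        # Terms are lower-cased so that searches are case-insensitive.
        self._postings[(field, term.lower())].add(uid)

    def search(self, field: str, term: str) -> set:
        # Return a copy so callers cannot modify the index by accident.
        return set(self._postings.get((field, term.lower()), ()))


index = FieldIndex()

# Indexed terms for M17755, taken from the slide (UID 4680720).
for field, term in [
    ("primary accession", "M17755"),
    ("organism", "Homo sapiens"),
    ("sequence length", "3060"),
    ("modification date", "1999/04/26"),
    ("properties", "biomol mrna"),
    ("properties", "gbdiv pri"),
    ("properties", "srcdb genbank"),
]:
    index.add(4680720, field, term)

# Hypothetical second record, present only for contrast.
index.add(1, "organism", "Homo sapiens")
index.add(1, "properties", "srcdb refseq")

# A Boolean AND between two tagged terms is a set intersection.
hits = index.search("organism", "homo sapiens") & index.search("properties", "srcdb genbank")
print(hits)
```

Note that nothing in this index covers the sequence itself, only the text fields of the record.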
M17755: Feature Table [annotated record]. Callouts: CDS position in bp; protein accession. Indexed terms: TPO [gene name]; thyroiditis [text word]; thyroid peroxidase [protein name].
Sequence: 99.99% Accurate. The sequence itself is not indexed… Use BLAST for that!
RefSeq: NCBI’s Derivative Sequence Database. The RefSeq database is a collection of taxonomically diverse, non-redundant, and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein. It holds non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes, and is updated to reflect current sequence data and biology. Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq integrates information from multiple sources at a given time, and so provides a foundation for uniting sequence data with genetic and functional information. RefSeqs are generated to provide reference standards for purposes ranging from genome annotation to reporting locations of sequence variation in medical records.
Common RefSeq accession prefixes

Accession prefix    Molecule type
NC_                 Complete genomic molecule (chromosome; microbial or organelle genome)
NT_                 Genomic contig
NM_                 Curated mRNA
XM_                 mRNA (computed)
NP_                 Curated protein
XP_                 Protein (computed)
NR_                 Curated RNA
XR_                 RNA (computed)
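The prefix table lends itself to a simple lookup. The following sketch maps an accession to its molecule type using only the eight prefixes listed on the slide; the function name `refseq_molecule_type` is an assumption for this example, and other RefSeq prefixes that exist are deliberately not covered.

```python
from typing import Optional

# Prefix -> molecule type, limited to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "Complete genomic molecule",
    "NT_": "Genomic contig",
    "NM_": "Curated mRNA",
    "XM_": "mRNA (computed)",
    "NP_": "Curated protein",
    "XP_": "Protein (computed)",
    "NR_": "Curated RNA",
    "XR_": "RNA (computed)",
}


def refseq_molecule_type(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq accession, or None if the prefix is not listed."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(refseq_molecule_type("NM_000537"))  # a RefSeq mRNA accession
print(refseq_molecule_type("M17755"))     # a GenBank accession, so no RefSeq prefix
```

The underscore in the third position is what separates RefSeq accessions from GenBank accessions such as M17755 or AY182241.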
Entrez Gene and RefSeq. Entrez Gene is the central repository for information about a gene available at NCBI, and often provides links to sites beyond NCBI. Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs). Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases. [diagram: GenBank, RefSeq, Gene, Nucleotide]
Beyond RefSeq. If your organism does not have RefSeqs: UniGene (gene-based clusters of cDNAs and ESTs); WGS sequences in Entrez Nucleotide (wgs[prop]); the Trace Archive.
What is UniGene? A gene-oriented view of sequence entries. MegaBlast-based automated sequence clustering, now informed by genome hits (new!). A nonredundant set of gene-oriented clusters: each cluster is a unique gene, with information on tissue types and map locations. Includes known genes and uncharacterized ESTs. Useful for gene discovery and selection of mapping reagents.
Organisms in UniGene: Top Ten. 1. Human 2. Rice 3. Mouse 4. Cow 5. Wheat 6. Zebrafish 7. Pig 8. Chicken 9. Frog (X. laevis) 10. Frog (X. tropicalis)
Finding UniGene Clusters: by link, or by Entrez search.
Entrez Protein

Source                          Records
GenPept (DDBJ, EMBL, GenBank)   4,444,405
RefSeq                          1,753,167
PIR                               222,395
Swiss-Prot                        189,005
PDB                                68,621
PRF                                12,079
Third Party Annotation              4,219
Total                           6,693,891
Protein Sources and Links [diagram]: PIR, RefSeq, SWISS-PROT, GenPept; labels: NM_000537, M17755, "no mRNA!"
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published worldwide, with 12 million records dating back to 1966. To make the indexing of biomedical literature uniform and consistent, journal articles in MEDLINE are indexed with the MeSH vocabulary. MeSH stands for "Medical Subject Headings": the list of vocabulary terms used for subject analysis of biomedical literature at NLM.
General Protein Databases. SWISS-PROT: manually curated, high-quality annotations; less data. GenPept/TrEMBL: translated coding sequences from GenBank/EMBL; few annotations; more up to date. PIR: phylogenetic-based annotations. All three are now combining efforts to form UniProt (http://www.uniprot.org).
Swiss-Prot format: http://us.expasy.org/sprot/userman.html
Non-redundant Databases. Sequence data only: they cannot be browsed and can only be searched using a sequence. They combine sequences from more than one database. Examples: NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA); NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein).
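The idea of combining several databases into one non-redundant set can be sketched as merging on the sequence itself: identical sequences collapse into a single entry that remembers every source identifier. This is only a simplified illustration; the function `build_nonredundant` and all record identifiers and sequences below are invented for the example, and how the actual NR sets are built is not described on the slide.

```python
def build_nonredundant(sources: dict) -> dict:
    """Merge {database: {record_id: sequence}} into {sequence: [database:record_id, ...]}."""
    merged = {}
    for database, records in sources.items():
        for record_id, sequence in records.items():
            # Case is normalised so 'acgt' and 'ACGT' count as the same sequence.
            key = sequence.upper()
            merged.setdefault(key, []).append(f"{database}:{record_id}")
    return merged


# Toy input: every identifier and sequence here is hypothetical.
sources = {
    "SWISS-PROT": {"sp_toy1": "MEFRVHLQ", "sp_toy2": "MKPEPEAS"},
    "GenPept": {"gp_toy1": "MEFRVHLQ"},
    "PDB": {"pdb_toy1": "mkpepeas", "pdb_toy2": "MLELFEAS"},
}

nr = build_nonredundant(sources)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```

Five input records reduce to three unique sequences here, and each entry still lists where its copies came from.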
Protein domain databases. Pfam (http://www.sanger.ac.uk/Software/Pfam/): a collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. SMART (a Simple Modular Architecture Research Tool; http://smart.embl-heidelberg.de/help/smart_about.shtml): identification and annotation of genetically mobile domains and the analysis of domain architectures. CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi): combines the SMART and Pfam databases; easier and quicker search.
Sequence Motif Databases. Scan Prosite (http://www.expasy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/) store conserved motifs occurring in nucleic acid or protein sequences. Motifs can be stored as consensus sequences, alignments, or statistical representations such as residue frequency tables.
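Two of these motif representations can be sketched in a few lines. The first function translates a PROSITE-style consensus pattern into a Python regular expression; the second builds a per-column residue frequency table from aligned motif instances. Both function names are made up for this example, the converter handles only the basic pattern elements (x, [..], {..}, repeat counts, and the < and > anchors), and the aligned instances are toy data. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re
from collections import Counter


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        starts = element.startswith("<")   # '<' anchors the pattern at the N-terminus
        ends = element.endswith(">")       # '>' anchors the pattern at the C-terminus
        element = element.strip("<>")
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                    # x matches any residue
            core = "."
        elif core.startswith("{"):         # {ABC} means any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if low and high:                   # x(2,4) becomes .{2,4}
            core += "{%s,%s}" % (low, high)
        elif low:                          # x(3) becomes .{3}
            core += "{%s}" % low
        parts.append(("^" if starts else "") + core + ("$" if ends else ""))
    return "".join(parts)


def frequency_table(instances: list) -> list:
    """Count residues per column across equal-length aligned motif instances."""
    return [Counter(column) for column in zip(*instances)]


regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "AANGSAA")        # toy sequence
table = frequency_table(["NGSA", "NKTA", "NGTV"])
print(regex, match.start(), dict(table[2]))
```

A consensus pattern gives a yes-or-no match, while the frequency table keeps how often each residue occurs at each position, which is the starting point for the statistical representations mentioned on the slide.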