Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center firstname.lastname@example.org.
1 Biological Databases
Chi-Cheng Lin, Ph.D., Associate Professor, Department of Computer Science, Winona State University – Rochester Center
2 Biological Databases
  - Data Domains
  - Types of Databases - By Scope
  - Types of Databases - By Level of Curation
  - GenBank
  - RefSeq
Acknowledgement: The presentation includes adaptations from NCBI’s Introduction to Molecular Biology Information Resources Modules
3 Data Domains
Types of data generated by molecular biology research:
  - nucleotide sequences (DNA and mRNA)
  - protein sequences
  - 3-D protein structures
  - complete genomes and maps
Also now have:
  - gene expression
  - genetic variation (polymorphisms)
4 Types of Databases - By Scope
Comprehensive
  - Contain data from many organisms and many different types of sequences. Examples:
    - Nucleotide
      - GenBank (overview)
      - EMBL: European Molecular Biology Laboratory
      - DDBJ: DNA Data Bank of Japan
      (The three databases above comprise the International Nucleotide Sequence Database Collaboration and currently include sequence data from >120,000 species.)
    - Protein, such as Swiss-Prot
    - Protein Structure, such as PDB: Protein Data Bank
    - Genomes and Maps, such as Entrez Genomes
Specialized
  - Contain data from individual organisms, specific categories/functions of sequences, or data generated by specific sequencing technologies.
5 Types of Databases - By Level of Curation
Archival data
  - repository of information
  - redundant; might have many sequence records for the same gene, each from a different lab
  - submitters maintain editorial control over their records: what goes in is what comes out
  - no controlled vocabulary
  - variation in annotation of biological features
Curated data
  - non-redundant; one record for each gene, or each splice variant
  - each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article
  - records contain value-added information that has been added by one or more experts
7 100's of Databases
  - 100's of databases available (example).
Which Ones to Use?
  - Easiest to start with a single search system (such as Entrez) that combines data from the most commonly used comprehensive databases
  - If the user wants additional specialized databases, search the database and software directories
8 GenBank
  - archival database of nucleotide sequences from >130,000 organisms
  - records annotated with coding region (CDS) features also include amino acid translations
  - each record represents the work of a single lab
  - redundant; can have many sequence records for a single gene
  - part of the International Nucleotide Sequence Database Collaboration
  - more information about GenBank...
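To illustrate what the amino acid translation attached to a CDS feature represents, here is a minimal sketch in Python. It assumes the standard genetic code and a coding sequence that starts in frame at its first base; the function name translate_cds is hypothetical and is not part of any GenBank tool. Real GenBank records can specify a different translation table or reading-frame offset through feature qualifiers, which this sketch ignores.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon. Codons containing
    characters other than A, C, G, T (or U) are reported as 'X'.
    Trailing bases that do not form a full codon are ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept mRNA-style input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```

The toy sequence ATGGCCTAA is made up for illustration: ATG (Met) and GCC (Ala) are followed by the stop codon TAA.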
9 International Nucleotide Sequence Database Collaboration
Collaboration among:
  - DDBJ - DNA Data Bank of Japan
  - EMBL - European Molecular Biology Laboratory, UK
  - GenBank - National Center for Biotechnology Information, NLM, NIH
10 RefSeq
  - Database of reference sequences
  - Curated
  - Non-redundant; one record for each gene, or each splice variant, from each organism represented
  - A representative GenBank record is used as the source for a RefSeq record
  - Value-added information is added by one or more experts
  - Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article
  - Variety of accession number prefixes (NM_, NP_, etc.) and status codes (provisional, reviewed, etc.)
  - RefSeq includes genomic DNA, mRNA, and protein sequences, so it organizes information according to the model of the central dogma of biology
  - Accessible through Entrez, BLAST, and the FTP site
  - RefSeq records are available in various Entrez databases such as Nucleotide, Protein, and Genome, and are also accessible from Entrez Gene records
  - more about RefSeq
11 RefSeq Scope and Accessions
Different record types for different molecules from the central dogma of biology:
  - Genomic DNA
    - NC_ complete genome, complete chromosome, complete plasmid
    - NG_ genomic region
    - NT_ genomic contig
  - mRNA - NM_123456
  - Protein - NP_123456
Gene and protein models from genome annotation projects:
  - XM_ mRNA
  - XR_ RNA (non-coding transcripts)
  - XP_ protein
  - more about RefSeq scope and accessions...
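The prefix scheme on this slide can be sketched as a small lookup table in Python. The table covers only the prefixes listed above; the names REFSEQ_PREFIXES and classify_refseq_accession are hypothetical, chosen for illustration, and RefSeq defines further prefixes that are not handled here.

```python
# Prefix -> (molecule type, True if the record is a model produced by
# a genome annotation project). Only prefixes from the slide are listed.
REFSEQ_PREFIXES = {
    "NC_": ("genomic DNA: complete genome, chromosome, or plasmid", False),
    "NG_": ("genomic DNA: genomic region", False),
    "NT_": ("genomic DNA: genomic contig", False),
    "NM_": ("mRNA", False),
    "NP_": ("protein", False),
    "XM_": ("mRNA", True),
    "XR_": ("RNA (non-coding transcript)", True),
    "XP_": ("protein", True),
}


def classify_refseq_accession(accession):
    """Return (molecule type, is_model) for a RefSeq accession.

    Returns None when the first three characters are not one of the
    prefixes in REFSEQ_PREFIXES.
    """
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    # NM_123456 and NP_123456 are the placeholder accessions from the slide.
    for acc in ("NM_123456", "NP_123456", "XM_123456", "AB123456"):
        print(acc, "->", classify_refseq_accession(acc))
```

Keeping the model flag next to the molecule type mirrors the slide's split between curated record types (N prefixes) and gene and protein models from genome annotation projects (X prefixes).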
12 RefSeq Status Codes
Level of curation. Examples:
  - Provisional: has not yet been subject to individual review, but is thought to be well supported and to represent a valid transcript and protein
  - Reviewed: has been reviewed by NCBI staff or by a collaborator
  - Predicted: is predicted and has not been subject to individual review
  - Genome Annotation: identifies RefSeq records provided by the NCBI Genome Annotation process
  - more about RefSeq status codes