Introduction to databases (Lec-3)
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way by establishing, maintaining, and disseminating the information contained in databases.
Introduction to databases
Databases are effectively electronic filing cabinets: a convenient and efficient method of storing vast amounts of information.
- Central, shareable resources
- Many different types of databases, depending on:
  - Nature of the information being stored
  - Manner of data storage
Primary & Secondary databases
Primary and secondary databases are used to address different aspects of sequence analysis, because they store different levels of protein sequence information.
Primary or derived databases
- Primary databases: experimental results are entered directly into the database
- Secondary databases: results of analysis of primary databases
Aggregates of many databases / composite databases
- Links to other data items
- Combination of data
- Consolidation of data
Primary sequence databases
Early 1980s
Nucleic acid: EMBL (Europe), GenBank (USA), DDBJ (Japan)
Protein: PIR, SWISS-PROT, TrEMBL, NRL-3D
PIR: Protein Information Resource
EMBL: European Molecular Biology Laboratory
TrEMBL: Translated EMBL
MIPS: Munich Information Center for Protein Sequences
EMBL: The nucleotide sequence database from the European Bioinformatics Institute (EBI). It has sequences from direct author submissions, genome sequencing groups, scientific literature, and patent applications.
DDBJ: The DNA Data Bank of Japan, produced, maintained, and distributed at the National Institute of Genetics.
GenBank: The DNA database from the National Center for Biotechnology Information (NCBI).
Principal requirements of a database
The principal requirements on the public data services are:
• Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
• Supporting data - database users will need to examine the primary experimental data, either in the database itself or by following cross-references back to network-accessible laboratory databases.
• Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
• Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
• Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
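The integration requirement can be pictured as records that carry cross-references, plus a way to follow those links into another database. The following is a minimal sketch under stated assumptions: the `Record` class, the `follow_xrefs` helper, the database names `NUC` and `PROT`, and the accession numbers are all hypothetical and do not correspond to any real data service.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    database: str      # name of the database holding this record (hypothetical)
    accession: str     # identifier within that database (hypothetical)
    description: str
    xrefs: list = field(default_factory=list)  # (database, accession) pairs


def follow_xrefs(record, databases):
    """Return the records that `record` points to, skipping unresolved links."""
    linked = []
    for db_name, accession in record.xrefs:
        target = databases.get(db_name, {}).get(accession)
        if target is not None:
            linked.append(target)
    return linked


# Toy in-memory "databases"; every name and accession here is invented.
nucleotide_db = {
    "N0001": Record("NUC", "N0001", "example gene, nucleotide sequence",
                    xrefs=[("PROT", "P0001")]),
}
protein_db = {
    "P0001": Record("PROT", "P0001", "example protein",
                    xrefs=[("NUC", "N0001")]),
}
databases = {"NUC": nucleotide_db, "PROT": protein_db}

linked = follow_xrefs(nucleotide_db["N0001"], databases)
print([(r.database, r.accession) for r in linked])
```

Real data services expose such links through their own record formats and interfaces; the sketch only shows the idea of resolving a cross-reference from one resource into another.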
Exercise
Look for a gene of interest in the three primary nucleic acid databases and compare the information given in each of them.
Figure: Flowchart of sequence data from labs and literature to primary sequence databases and subsequent secondary databases.
Sources: sequencing centers, literature, researchers
Primary sequence database
- Nucleic acid, e.g. GenBank, EMBL, DDBJ
- Amino acid, e.g. SwissProt and PIR
Secondary sequence database
- Protein domains & families, metabolic pathways, e.g. RefSeq and the Conserved Domain Database (CDD) within NCBI
CDD: The Conserved Domain Database is a resource for the annotation of functional units in proteins. Its collection of domain models includes a set curated by NCBI, which utilizes 3D structure to provide insights into sequence/structure/function relationships.
RefSeq: A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by the National Center for Biotechnology Information (NCBI).
Always remember that:
- The data within primary databases are only as reliable as the data submitted.
- This reliability depends primarily on the methods used to produce the data.
- Regardless of who obtains the sequence data, nucleic acid and amino acid sequencing results are subject to errors.
Protein sequence databases
- The protein sequence database was developed at the National Biomedical Research Foundation (NBRF) in the early 1960s by Margaret Dayhoff to investigate evolutionary relationships among proteins.
- From 1988 onwards, it was maintained collectively by the Protein Information Resource (PIR) at NBRF, the International Protein Information Database of Japan (JIPID), and the Martinsried Institute for Protein Sequences (MIPS).
Examples of molecular sequence types in NCBI records
Type: Description
Genome: The complete genome of an organism.
Sequence tagged site (STS): A unique segment of DNA that occurs only once in a genome and marks a particular location. Can be generated from genomic DNA or cDNA.
Draft sequences: Pieces of a genome that are compiled from a DNA or cDNA library. Usually a large collection of contigs that are in the process of being ordered and catalogued.
Type: Description
Chromosome: The whole sequence of a single chromosome.
Locus: A known location on a chromosome for a particular gene or collection of genes that codes for a specific function.
Contig: A contiguous segment of a chromosome made by joining overlapping clones or sequences.
Type: Description
Gene: Whole gene sequence for a protein.
Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.
Complete CDS: A complete coding sequence for a protein.
Type: Description
mRNA: A complete mRNA sequence for a protein coding region.
Expressed sequence tag (EST): A partial sequence of cDNA in mRNA form from either the 5' or 3' end of a gene sequence.
Complementary DNA sequence (cDNA): A cDNA sequence in mRNA form.
Complete CDS: A complete coding sequence for a protein.
Protein sequence databases
SWISS-PROT
- Started in 1986 at the University of Geneva and EMBL
- Now maintained by the Swiss Institute of Bioinformatics (SIB) and EBI/EMBL
TrEMBL
- Started in 1996
- Follows the SWISS-PROT format and contains translations of coding sequences in EMBL
- Also contains synthetic sequences, short amino acid fragments, and coding sequences that do not encode real proteins
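To make "translations of coding sequences" concrete, here is a minimal sketch of translating a nucleotide coding sequence into an amino acid sequence. It is not the actual TrEMBL pipeline. It assumes the standard genetic code, a reading frame that starts at the first base, and translation that stops at the first stop codon; the function name `translate_cds` and the example sequence are invented for illustration.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon.

    Codons containing unexpected characters are reported as 'X'.
    A trailing incomplete codon is ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGAAATGGTGA"))  # ATG AAA TGG TGA -> MKW
```

A production translation step would also need to handle alternative genetic codes, annotated reading frames, and partial coding sequences, none of which are covered here.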
Composite protein sequence databases
- A composite database merges a variety of different primary sources.
- Composite databases obviate the need to interrogate multiple resources.
- A composite database can eliminate identical sequence copies only, or eliminate both identical and highly similar sequences.
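The merging step can be sketched as follows. This is a toy illustration of removing identical sequence copies only, not the procedure used by any particular composite database. The `Entry` class, the `merge_non_identical` function, the source names, accessions, and sequences are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source: str      # name of the primary database the entry came from
    accession: str   # identifier within that source (invented here)
    sequence: str    # amino acid sequence, one-letter code


def merge_non_identical(sources):
    """Merge entries from several sources, keeping one entry per distinct sequence.

    Sources are scanned in the order given, so the first source acts as the
    highest-priority one when the same sequence appears more than once.
    """
    seen = set()
    merged = []
    for entries in sources:
        for entry in entries:
            key = entry.sequence.upper()  # compare sequences case-insensitively
            if key not in seen:
                seen.add(key)
                merged.append(entry)
    return merged


# Toy data: accessions and sequences are invented for illustration only.
db_a = [Entry("DB_A", "A0001", "MKTAYIAKQR"), Entry("DB_A", "A0002", "MSLLTEVETY")]
db_b = [Entry("DB_B", "B0001", "mktayiakqr"), Entry("DB_B", "B0002", "MGSSHHHHHH")]

composite = merge_non_identical([db_a, db_b])
print([e.accession for e in composite])  # B0001 duplicates A0001 and is dropped
```

Eliminating highly similar sequences as well would require a similarity measure and a threshold (for example from pairwise alignment), which this sketch deliberately leaves out.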