Presentation on theme: "Archives and Information Retrieval"— Presentation transcript:
1 Archives and Information Retrieval CSC 487/687 Computing for Bioinformatics
2 Introduction
Learning objectives:
What is the general arrangement of biological data in the public databases?
To learn the information retrieval skills that will allow you to make effective use of the databases.
To become familiar with basic operations.
How does one retrieve information on a particular subject in the literature?
Many databanks embed tutorials that make it easy to explore their facilities.
3 Primary public domain bioinformatics servers
Facilities:
European Bioinformatics Institute (EBI), United Kingdom
National Center for Biotechnology Information (NCBI), United States
GenomeNet (KEGG and DDBJ), Japan
Services: databases and analysis tools.
Notes: Data Library; DNA Data Bank of Japan (DDBJ), National Institute of Genetics, Japan.
4 The Archives
Massive amounts of biological experimental data.
These biological information databases can be classified into two types:
First-level databases: built from the raw data obtained in experiments. The original data pass through only simple classification, reorganization, and annotation.
Second-level databases: further reorganized, based on ..., in order to achieve specific goals.
5 The Archives
Some examples:
First-level databases:
Nucleic acid sequence databases: GenBank, EMBL Data Library, DNA Data Bank of Japan (DDBJ)
Protein sequence databases: SWISS-PROT, PIR
Protein structure database: PDB
Second-level databases:
GDB (Human Genome Database): a center for the collection of human genetic mapping data, which relies on leading human geneticists to act as curators for the data.
TRANSFAC: a database of eukaryotic transcription factors, their genomic binding sites, and DNA-binding profiles.
SCOP: Structural Classification of Proteins.
6 Nucleic acid sequence databases
International Nucleotide Sequence Database Collaboration:
NCBI (GenBank), USA (1982)
EMBL (Data Library), Europe (1982)
DDBJ (DNA Data Bank), Japan (1988)
This is a three-way partnership among the National Center for Biotechnology Information (USA), the European Bioinformatics Institute, and the National Institute of Genetics (Japan). The raw data are identical, but the storage format and the nature of the annotation vary among the three.
7 NCBI
Established in the USA in 1988 as a national resource for molecular biology information. NCBI:
creates public databases
conducts research in computational biology
develops software tools for analyzing genome data
disseminates biomedical information
All of this serves a better understanding of the molecular processes affecting human health and disease.
8 Nucleic acid sequence databases
GenBank:
Nucleic acid sequences and protein sequences
Associated literature
Biological annotation
A new release is made every two months
GenBank information retrieval system
GenBank is an annotated collection of all publicly available DNA sequences, together with the protein sequences, literature, and biological annotation connected with them. It is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). The data are exchanged among these three organizations on a daily basis. GenBank is the primary sequence repository: it contains the annotated sequences submitted by the original authors, and only the original authors can change them.
9 NCBI ENTREZ
A platform that provides access and links to databases of biological information.
Entrez offers access through the following database divisions: PubMed, GenBank, Protein databases, Genomes, PopSet, Taxonomy, OMIM, MedLine.
10 NCBI ENTREZ
Literature database: PubMed, MedLine.
GenBank: database of all publicly available DNA sequences.
Protein databases: amino acid sequences from SwissProt, PIR, PRF, and PDB, plus translations of annotated coding regions in GenBank and RefSeq.
Genomes: database of genomes from organisms and viruses.
PopSet: aligned sequences submitted as a set from a population, phylogenetic, or mutation study, collected to analyze the evolutionary relatedness of a population. The alignments describe events such as evolution and population variation.
Taxonomy: names of all organisms that are represented in the NCBI genetic databases by at least one nucleotide or protein sequence.
OMIM (Online Mendelian Inheritance in Man): a catalog of human genes and genetic disorders.
PubMed Central (PMC): the U.S. National Library of Medicine's digital archive of life sciences journal literature. Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.
RefSeq (Reference Sequence): a collection that aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. RefSeqs are used as a reagent for the functional annotation of some genome sequencing projects, including those of human and mouse.
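The slides describe Entrez as a web platform. As a rough illustration of how the same divisions can be queried programmatically, here is a minimal sketch that builds a search URL for NCBI's E-utilities ESearch service. The helper name `esearch_url` and the example query are assumptions made for illustration and are not part of the original slides; the sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (the programmatic interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper, for illustration only).

    db     -- Entrez database name, e.g. "pubmed", "nucleotide", "protein"
    term   -- query string, optionally with field tags such as [Title]
    retmax -- maximum number of record identifiers to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: a literature search restricted to article titles.
url = esearch_url("pubmed", "cystic fibrosis[Title]")
print(url)
```

Changing `db` to another division name, such as `"protein"` or `"taxonomy"`, would point the same query at a different Entrez database. Fetching and parsing the response is left out on purpose.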
11 PubMed Central (PMC)
The U.S. National Library of Medicine's digital archive of life sciences journal literature.
Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.
12 OMIM: Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders, linking genes to diseases.
Begun by Victor A. McKusick at Johns Hopkins University.
A good place to start when you want to research a certain disease or biological molecule.
This database is cross-referenced to PubMed and other NCBI-based databases.
13 How to submit sequence data to GenBank
BankIt: a web-based submission interface
Sequin: a submission program
15 Protein databases
The Protein Information Resource (PIR) was established in 1984 by the National Biomedical Research Foundation (NBRF).
The PIR Protein Sequence Database evolved from the original NBRF Protein Sequence Database, which was developed over 20 years.
PIR-International is a collaboration between NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID).
Together they collect and publish what is now the oldest and largest database of biomolecular sequence, source, literature, and feature information.
16 PIR
PIR-International Protein Sequence Database: an annotated, non-redundant, and cross-referenced database of protein sequences.
Cross-references: major nucleic acid, literature, genome, structure, sequence alignment, and family databases.
PIR maintains several auxiliary databases to help annotation and for integrity checking:
PIR-ALN (PIR Alignment Database): sequence alignments of superfamilies, families, and homology domains, produced from information in the Protein Sequence Database.
FAMBASE (Family Database): a searchable database containing a single representative sequence from each protein family.
RESID (Database of Amino Acid Modifications): covalent protein modifications, based on feature information in the Protein Sequence Database.
All the databases can be accessed on the PIR Web site and contain hypertext links to each other and to relevant external databases. The Web site is being redesigned to include new BLAST similarity search engines and pattern matching capabilities. The latest quarterly release of the databases can be accessed through the ATLAS multi-database retrieval software on the Atlas CD-ROM, or downloaded by FTP.
18 SWISS-PROT http://www.ebi.ac.uk/swissprot/
A well-annotated protein sequence database established in 1986.
It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
It is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases.
Note: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot have been incorporated into UniProt (Universal Protein Resource), a one-stop shop allowing easy access to all publicly available information about protein sequences.
19 PROSITE http://ca.expasy.org/prosite/
A method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences.
It consists of a database of biologically significant sites and patterns, formulated in such a way that appropriate computational tools can rapidly and reliably identify to which known family of proteins (if any) a new sequence belongs.
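To make the idea of a computable pattern concrete, here is a minimal sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_sites` and the two toy sequences are invented for illustration. The sketch assumes only the basic pattern elements (single residues, `x`, `[...]` sets, `{...}` exclusions, repeat counts, and `<`/`>` anchors) and is not a complete implementation of the PROSITE syntax.

```python
import re

# One PROSITE element: optional "<" anchor, a residue specification,
# an optional repeat count such as (3) or (2,4), and an optional ">" anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a regular expression string."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                       # x means any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {P} means any residue except P
        # Single letters and [ST]-style sets are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(pieces)


def find_sites(pattern, sequence):
    """Return 0-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


# N-glycosylation site pattern, applied to two made-up toy sequences.
N_GLYCOSYLATION = "N-{P}-[ST]-{P}."
print(prosite_to_regex(N_GLYCOSYLATION))
print(find_sites(N_GLYCOSYLATION, "AANGSAA"))  # N followed by G, S, A
print(find_sites(N_GLYCOSYLATION, "AANPSAA"))  # proline after N blocks the site
```

A regular expression is a convenient stand-in here because each PROSITE element maps onto one regex token. Real scanning tools also handle profiles and rules, which this sketch ignores.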
20 PDB http://www.rcsb.org/pdb/
The Protein Data Bank (PDB) is the single international repository for public data on the 3-dimensional structures of biological macromolecules.
It was established at Brookhaven National Laboratory in the United States.
The contents are primarily experimental data derived from X-ray crystallography and NMR experiments.
RasMol can display the 3D structure of a biological macromolecule from a PDB file.
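Viewers such as RasMol read atomic coordinates from the fixed-column text records of a PDB file. The following is a minimal sketch of how one ATOM record can be parsed, assuming the standard column layout of the legacy PDB format. The function name `parse_atom_record` is hypothetical, and the sample record is a toy line with made-up coordinates, not data from any real entry.

```python
def parse_atom_record(line):
    """Extract a few fields from one fixed-column ATOM record.

    Slices use 0-based Python indices; in the 1-based column numbering of
    the PDB format, the x coordinate, for example, occupies columns 31-38.
    """
    return {
        "serial": int(line[6:11]),         # atom serial number
        "name": line[12:16].strip(),       # atom name, e.g. CA
        "res_name": line[17:20].strip(),   # residue name, e.g. GLY
        "chain_id": line[21],              # chain identifier
        "res_seq": int(line[22:26]),       # residue sequence number
        "x": float(line[30:38]),           # orthogonal coordinates
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Toy record assembled with explicit field widths so the columns line up.
# All values are invented for illustration.
SAMPLE = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA", "GLY", "A", 1, 1.5, -2.25, 3.0
)

atom = parse_atom_record(SAMPLE)
print(atom["name"], atom["res_name"], atom["chain_id"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"])
```

Slicing by column position, not splitting on whitespace, matters because adjacent fields in a PDB record can touch each other with no space in between.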