Archives and Information Retrieval


1 Archives and Information Retrieval
CSC 487/687 Computing for Bioinformatics

2 Introduction
Learning objectives:
What is the general arrangement of biological data in the public databases?
Learn the information retrieval skills that will allow you to make effective use of the databases.
Become familiar with basic operations.
How does one retrieve information on a particular subject in the literature?
Tutorials are embedded in many databanks, which makes it easy to explore their facilities.

3 Primary public domain bioinformatics servers
Facilities:
European Bioinformatics Institute (EBI), United Kingdom
National Center for Biotechnology Information (NCBI), United States
GenomeNet (KEGG & DDBJ), Japan
These servers offer databases and analysis tools.
DDBJ is the DNA Data Bank of Japan, a data library at the National Institute of Genetics, Japan.

4 The Archives
Massive biological experimental data
These biological information databases can be classified into two types:
First-level databases: hold the original data obtained from experiments, which has passed through only simple classification, reorganization, and annotation.
Second-level databases: further reorganized, based on.., in order to achieve some specific goals.

5 The Archives
Some examples:
First-level databases
Nucleic acid sequence databases: GenBank, EMBL Data Library, DNA Data Bank of Japan (DDBJ)
Protein sequence databases: SWISS-PROT, PIR
Protein structure database: PDB
Second-level databases
GDB (Human Genome Database): a center for the collection of human genetic mapping data, curated by leading human geneticists.
TRANSFAC: a database of eukaryotic transcription factors, their genomic binding sites, and DNA-binding profiles.
SCOP: Structural Classification of Proteins.

6 Nucleic acid sequence databases
International Nucleotide Sequence Database Collaboration, a triple partnership:
NCBI (GenBank), USA (1982)
EMBL (Data Library), Europe (1982)
DDBJ (DNA Data Bank), Japan (1986)
The partners are the National Center for Biotechnology Information (USA), the European Bioinformatics Institute, and the National Institute of Genetics (Japan).
The raw data are identical, but the format in which they are stored and the nature of the annotation vary among them.
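The point that the same record is stored in different formats can be illustrated with a small sketch. It assumes the usual flat-file conventions: GenBank-style records carry the accession on an `ACCESSION` line, while EMBL-style records carry it on an `AC` line terminated by semicolons. The two sample records and the accession `DEMO00001` are hypothetical and heavily abbreviated; real records contain many more fields.

```python
from typing import Optional

# Hypothetical, abbreviated records (not real database entries).
GENBANK_STYLE = """LOCUS       DEMO00001     12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demo record.
ACCESSION   DEMO00001
ORIGIN
        1 acgtacgtac gt
//"""

EMBL_STYLE = """ID   DEMO00001; SV 1; linear; genomic DNA; STD; SYN; 12 BP.
AC   DEMO00001;
DE   Hypothetical demo record.
SQ   Sequence 12 BP;
     acgtacgtac gt        12
//"""


def extract_accession(record: str) -> Optional[str]:
    """Return the first accession found in a GenBank- or EMBL-style record."""
    for line in record.splitlines():
        if line.startswith("ACCESSION"):
            # GenBank style: keyword followed by whitespace-separated accessions.
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
        if line.startswith("AC "):
            # EMBL style: two-letter line code, accessions separated by ';'.
            first = line[2:].strip().split(";")[0].strip()
            return first or None
    return None


print(extract_accession(GENBANK_STYLE))
print(extract_accession(EMBL_STYLE))
```

Both calls return the same accession, which mirrors the slide: identical underlying data, different layout.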

7 NCBI
Established in the USA in 1988 as a national resource for molecular biology information. NCBI:
creates public databases
conducts research in computational biology
develops software tools for analyzing genome data
disseminates biomedical information
All of this serves a better understanding of the molecular processes affecting human health and disease.

8 Nucleic acid sequence databases
GenBank contains known nucleic acid and protein sequences, together with the literature references and biological annotation connected with them. It is an annotated collection of all publicly available DNA sequences.
A new release of GenBank is made every two months.
Access is through the GenBank information retrieval system.
GenBank is part of the International Nucleotide Sequence Database Collaboration, which also includes the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). The data are exchanged among these three organizations on a daily basis.
GenBank is a primary sequence repository: it contains the annotated sequences submitted by the original authors, and only the original authors can change them.

9 NCBI ENTREZ
A platform that provides access and links to databases of biological information.
Entrez offers access through the following database divisions:
PubMed
GenBank
Protein databases
Genomes
PopSet
Taxonomy
OMIM
MedLine

10 NCBI ENTREZ
PubMed / MedLine: literature database.
GenBank: database of all publicly available DNA sequences.
Protein databases: amino acid sequences from SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.
Genomes: database of genomes from organisms and viruses.
PopSet: DNA sequences collected to analyze the evolutionary relatedness of a population. The database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study; these alignments describe such events as evolution and population variation.
Taxonomy: the names of all organisms that are represented in the NCBI genetic databases by at least one nucleotide or protein sequence.
OMIM (Online Mendelian Inheritance in Man): a catalog of human genes and genetic disorders.
PubMed Central (PMC): the U.S. National Library of Medicine's digital archive of life sciences journal literature. Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.
RefSeq: the Reference Sequence collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq standards serve as the basis for medical, functional, and diversity studies; they provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses. RefSeqs are used as a reagent for the functional annotation of some genome sequencing projects, including those of human and mouse.
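Besides the web interface, the Entrez databases can be queried programmatically. The sketch below only builds a search URL and makes no network request. The base URL, the `esearch.fcgi` endpoint, the parameters `db`, `term`, and `retmax`, and the database identifiers in `ENTREZ_DB` are stated from memory of NCBI's E-utilities and should be treated as assumptions to verify against the current NCBI documentation.

```python
from urllib.parse import urlencode

# Assumed E-utilities base URL; verify against NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Assumed mapping from the divisions named on the slides to Entrez db identifiers.
ENTREZ_DB = {
    "PubMed": "pubmed",
    "GenBank": "nucleotide",
    "Protein databases": "protein",
    "Genomes": "genome",
    "PopSet": "popset",
    "Taxonomy": "taxonomy",
    "OMIM": "omim",
    "PubMed Central": "pmc",
}


def esearch_url(division: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez division and a free-text query."""
    query = urlencode({"db": ENTREZ_DB[division], "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


print(esearch_url("PubMed", "cystic fibrosis"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cystic+fibrosis&retmax=20
```

The same function covers every division in the table, which reflects the Entrez idea of a single access point to many linked databases.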

11 PubMed Central
PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature.
Access to the full text of articles in PMC is free, except where a journal requires a subscription for access to recent articles.

12 OMIM-Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders, with genes linked to diseases
Begun by Victor A. McKusick at Johns Hopkins University
A good place to start when you want to research a certain disease or biological molecule
Cross-referenced to PubMed and other NCBI-based databases

13 How to submit sequence data to GenBank
BankIt: a web-based interface
Sequin: a program

14 In-class exercise

15 Protein databases
The Protein Information Resource (PIR) was established in 1984 by the National Biomedical Research Foundation (NBRF).
The PIR Protein Sequence Database evolved from the original NBRF Protein Sequence Database, developed over 20 years.
PIR-International is a collaboration between NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID). Together they collect and publish what is now the oldest and largest database of biomolecular sequence, source, literature, and feature information.

16 PIR
PIR-International Protein Sequence Database: an annotated, non-redundant, and cross-referenced database of protein sequences. Entries are cross-referenced to major nucleic acid, literature, genome, structure, sequence alignment, and family databases.
PIR maintains several auxiliary databases to help annotation and for integrity checking:
PIR-ALN (PIR Alignment Database): sequence alignments of superfamilies, families, and homology domains, produced from information in the Protein Sequence Database.
FAMBASE (Family Database): a searchable database containing a single representative sequence from each protein family.
RESID (Database of Amino Acid Modifications): covalent protein modifications, based on feature information in the Protein Sequence Database.
All the databases can be accessed on the PIR Web site and contain hypertext links to each other and to relevant external databases. The Web site is being redesigned to include new BLAST similarity search engines and pattern matching capabilities. The latest quarterly release of the databases can be accessed through the ATLAS multi-database retrieval software on the Atlas CD-ROM and downloaded by FTP.

17 PIR

18 SWISS-PROT
A well-annotated protein sequence database established in 1986.
It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
A curated protein sequence database that provides a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases.
Note: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot have been incorporated into UniProt (the Universal Protein Resource), a one-stop shop allowing easy access to all publicly available information about protein sequences.

19 PROSITE
A method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences.
A database of biologically significant sites and patterns, formulated in such a way that, with appropriate computational tools, it can rapidly and reliably identify to which known family of proteins (if any) a new sequence belongs.
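How a site pattern can be matched against a new sequence with "appropriate computational tools" can be sketched by translating a PROSITE-style pattern into a regular expression. This is a minimal illustration, not the scanning software the database itself provides. It assumes the commonly described pattern syntax: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for sequence ends. The example pattern `N-{P}-[ST]-{P}` is the widely cited N-glycosylation site; the test sequences are made up.

```python
import re
from typing import List


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:                            # repetition such as x(2) or x(2,4)
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element in ("x", "X"):        # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element               # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))      # N[^P][ST][^P]
print(find_sites("N-{P}-[ST]-{P}", "AANGSAA"))  # [2]
```

A lookahead is used in `find_sites` so that overlapping occurrences of a short pattern are all reported rather than only the first of each overlapping group.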

20 PDB
The Protein Data Bank (PDB) is the single international repository for public data on the 3-dimensional structures of biological macromolecules.
It was established at Brookhaven National Laboratory in the United States.
The contents are primarily experimental data derived from X-ray crystallography and NMR experiments.
RasMol can display the 3D structure of a biological macromolecule from a PDB file.
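Viewers such as RasMol work from the atomic coordinates stored in a PDB file. The sketch below reads one coordinate record, assuming the fixed-column layout of the legacy PDB text format as commonly documented (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54). The column positions are stated from memory and should be checked against the format specification. The `Atom` type, both function names, and the sample coordinates are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM line using fixed column positions (0-based slices)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def format_atom_line(atom: Atom) -> str:
    """Write the coordinate part of an ATOM line (columns 1-54 only)."""
    return (
        f"{'ATOM':<6}"          # columns 1-6: record name
        f"{atom.serial:>5}"     # columns 7-11: atom serial number
        " "                     # column 12: blank
        f"{atom.name:^4}"       # columns 13-16: atom name
        " "                     # column 17: alternate location (blank here)
        f"{atom.res_name:>3}"   # columns 18-20: residue name
        " "                     # column 21: blank
        f"{atom.chain_id:1}"    # column 22: chain identifier
        f"{atom.res_seq:>4}"    # columns 23-26: residue sequence number
        "    "                  # columns 27-30: insertion code and blanks
        f"{atom.x:8.3f}{atom.y:8.3f}{atom.z:8.3f}"  # columns 31-54: coordinates
    )


# Made-up atom used only to exercise the round trip.
example = Atom(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614)
line = format_atom_line(example)
print(line)
print(parse_atom_line(line))
```

Building the sample line with `format_atom_line` rather than typing it by hand keeps the columns aligned, which matters because the format is positional rather than delimiter-based.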
