Biological Databases

Biological Databases. By: Lim Yun Ping (e-mail: yunping@chitre.net), National University of Singapore

Overview: Introduction. What is a database? What types of databases can we access? What roles do they play? What type of information can we get from them? How do we access this information?

What is a database? A convenient method of organizing vast amounts of information, which allows for proper storing, searching and retrieving of data. Before analyzing the data, we need to assemble them into central, shareable resources.

Why databases? They provide a means to handle and share large volumes of biological data, support large-scale analysis efforts, make data access easy and keep the data up to date, and link knowledge obtained from various fields of biology and medicine.

Different Database Types: The type depends on the nature of the information stored (sequences, 2D gel or 3D structure images) and on the manner of storage (flat files, tables in a relational database, etc.). In this course we are more concerned with the different types of databases than with the particular storage method.
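The difference between the two manners of storage can be sketched in a few lines of Python. This is only an illustration: the tab-separated layout, the table name `entry`, and the two records are assumptions made up for the example, not the format of any real database.

```python
import sqlite3

# Hypothetical flat file: one record per line, fields separated by tabs.
FLAT_FILE = (
    "SEQ0001\texample protein one\tMKTAYIAK\n"
    "SEQ0002\texample protein two\tMSDNE\n"
)

def read_flat_file(text):
    """Return one (id, description, sequence) tuple per non-empty line."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line]

def load_into_table(rows):
    """Store the same records as rows of a relational table (in memory)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    return conn

rows = read_flat_file(FLAT_FILE)

# Flat file: the program scans every record itself.
flat_hits = [row[0] for row in rows if "two" in row[1]]

# Relational table: the same question is expressed as a query.
conn = load_into_table(rows)
table_hits = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM entry WHERE description LIKE ?", ("%two%",)
    )
]

print(flat_hits, table_hits)
```

Both approaches return the same record; what changes is who does the searching, the reading program or the database engine.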

Features: Most of the databases have a web interface to search for data. The common mode of searching is by keywords. Users can choose to view the data or save it to their computer. Cross-references help to navigate from one database to another easily.

Biological Databases

Types of Biological Databases Accessible: There are many different types of database, but for routine sequence analysis the following are initially the most important: primary databases, secondary databases and composite databases.

Primary databases: Contain sequence data, either nucleic acid or protein. Examples of primary databases include nucleic acid databases (EMBL, GenBank, DDBJ) and protein databases (SWISS-PROT, TrEMBL, PIR).

Secondary databases: Sometimes known as pattern databases. They contain results from the analysis of the sequences in the primary databases. Examples of secondary databases include PROSITE, Pfam, BLOCKS and PRINTS.
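To make the idea of a pattern database concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It handles only a simplified subset of the pattern syntax (single residues, `x`, `[..]`, `{..}`, repeat counts and the `<` / `>` anchors), and the test sequence is invented for the example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        repeat = ""
        if "(" in element:
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or an [ABC] class
        pieces.append(
            ("^" if anchored_start else "") + core + repeat
            + ("$" if anchored_end else "")
        )
    return "".join(pieces)

# N-glycosylation site pattern: N, then anything but P, then S or T,
# then anything but P.
regex = prosite_to_regex("N-{P}-[ST]-{P}")

# Hypothetical sequence used only to exercise the pattern.
sequence = "MKNGSAQNPTL"
hits = [match.start() for match in re.finditer(regex, sequence)]
print(regex, hits)
```

The second asparagine in the test sequence is followed by a proline, so only the first one is reported.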

Composite databases: Combine different primary database sources. They make querying and searching efficient, without the need to go to each of the primary databases. Examples of composite databases include NRDB (Non-Redundant DataBase) and OWL.
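One way to picture how a composite, non-redundant database is assembled is the sketch below. It is an assumption-laden illustration, not the procedure used by NRDB or OWL: the source names, identifiers and sequences are hypothetical, and redundancy is defined simply as exact sequence identity.

```python
def build_composite(sources):
    """Merge entries from several sources, one entry per distinct sequence.

    `sources` maps a source name to a list of (identifier, sequence) pairs.
    Returns a dict mapping each sequence to its "source:identifier"
    cross-references.
    """
    composite = {}
    for source_name, entries in sources.items():
        for identifier, sequence in entries:
            key = sequence.upper()  # treat upper and lower case as identical
            composite.setdefault(key, []).append(f"{source_name}:{identifier}")
    return composite

# Hypothetical primary sources with one sequence in common.
sources = {
    "db_a": [("A1", "MKTAYIAK"), ("A2", "MSDNE")],
    "db_b": [("B7", "mktayiak"), ("B8", "MGLSD")],
}

composite = build_composite(sources)
print(len(composite))
print(composite["MKTAYIAK"])
```

Four source entries collapse into three composite entries, and the shared sequence keeps cross-references to both sources, so a single search covers every primary database.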

Nucleic acid Databases: NCBI: http://www.ncbi.nlm.nih.gov/ (NCBI, at the NIH campus, USA). EMBL: http://www.embl-heidelberg.de/ (European Molecular Biology Laboratory, headquartered in Heidelberg, Germany; its nucleotide sequence database is maintained at the EBI outstation in Hinxton, UK). DDBJ: http://www.ddbj.nig.ac.jp (DNA Data Bank of Japan).

The International Sequence Database Collaboration: GenBank, EMBL, DDBJ

The International Sequence Database Collaboration: These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region, e.g. EMBL collects from Europe and GenBank from the USA. Every 24 hours, the databases automatically update each other with the new sequences collected from each region. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours. This is an important consideration in your choice of database: if you need accurate and up-to-date information, you must search an up-to-date database.

Amount Of Data Grows Rapidly: As of June 2003, there were 32,528,249,295 bases in 25,592,865 sequences.
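A quick calculation puts these two figures in proportion. The sketch below uses only the numbers quoted on the slide; the variable names are chosen for the example.

```python
# Figures quoted for June 2003.
TOTAL_BASES = 32_528_249_295
TOTAL_SEQUENCES = 25_592_865

# Mean number of bases per sequence record.
average_length = TOTAL_BASES / TOTAL_SEQUENCES
print(f"{average_length:.0f} bases per sequence on average")
```

The average comes out at roughly 1,271 bases per record, which suggests that most entries are short sequences rather than whole genomes.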

How to access them: Main sites: NCBI (http://www.ncbi.nlm.nih.gov/), EMBL (http://www.embl-heidelberg.de/) and DDBJ (http://www.ddbj.nig.ac.jp). Full release every two months. Incremental and cumulative updates daily. Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/ (66.3 gigabytes of data).

The Internet and WWW

NCBI: http://www.ncbi.nlm.nih.gov/ (NCBI, a division of NLM at the NIH campus, USA). ExPASy: http://www.expasy.org (Swiss Institute of Bioinformatics). KEGG: http://www.genome.ad.jp/kegg/kegg2.html (Kyoto Encyclopedia of Genes and Genomes).

National Center for Biotechnology Information: Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease. http://www.ncbi.nlm.nih.gov/

Entrez: Entrez is a search and retrieval system that integrates information from databases at NCBI.
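Entrez is normally used through the web interface, but a search can also be expressed as a URL. The sketch below assumes NCBI's E-utilities `esearch` endpoint, which the slides do not mention, and only builds the request address without sending it. The helper name `esearch_url` and the search term are chosen for the example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption for this sketch;
# the presentation itself only shows the Entrez web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(database, term, max_results=20):
    """Build an Entrez search URL for one database and one query term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

# Keyword search of the nucleotide database.
url = esearch_url("nucleotide", "BNIP", max_results=5)
print(url)
```

Opening the resulting address in a browser or fetching it with an HTTP client would return the matching record identifiers. Check NCBI's usage guidelines before sending requests in bulk.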

BNIP

Accession Number: unique identifier. Source: organism's common name. Brief description of the sequence. Formal scientific name of the source organism. Contains information on the publications, such as the authors and the titles of the journal articles that discuss the data reported in the record. Contains the contact information of the submitter. Contains information about the genes, gene products and regions of biological significance reported in the sequence, together with the length of the sequence, the scientific name of the source organism, the Taxon ID number and the map location.

Coding sequence: the region of nucleotides that corresponds to the amino acid sequence; this location also contains the start and stop codons. Region of biological interest. The amino acid translation corresponding to the nucleotide coding sequence.
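The relationship between a coding sequence and its amino acid translation can be sketched as follows. The example assumes the standard genetic code and a coding sequence already extracted from the record. The function name `translate_cds` and the short test sequence are made up for the illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate a coding sequence from its start codon to the first stop codon."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("coding sequence should begin with the start codon ATG")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical coding sequence: ATG GCC AAA TAA.
print(translate_cds("ATGGCCAAATAA"))
```

The start codon gives the leading methionine, and the stop codon ends the protein without contributing a residue.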

How to understand the output: Unique identifiers: each entry in a database must have a unique identifier, such as the EMBL identifier (ID) or the GenBank accession number (AC). Other information is stored along with the sequence. Each piece of information is written on its own line, with a code defining the line, for example DE, description; OS, organism species; AC, accession number. Relevant biological information is usually described in the feature table (FT).
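Because every line starts with a code, such an entry is easy to read by program. The sketch below is a minimal reader for line-coded entries. The sample entry is entirely hypothetical, and the parser ignores many details of the real EMBL and SWISS-PROT formats (for instance, the sequence block).

```python
# Meaning of the line codes mentioned above.
LINE_CODES = {
    "ID": "identifier",
    "AC": "accession number",
    "DE": "description",
    "OS": "organism species",
    "FT": "feature table",
}

def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = {}
    for line in text.splitlines():
        if len(line) < 2 or line.startswith("//"):  # blank line or end of entry
            continue
        code, value = line[:2], line[2:].strip()
        name = LINE_CODES.get(code, code)
        fields.setdefault(name, []).append(value)
    return fields

# Hypothetical entry, written only to exercise the parser.
ENTRY = """\
ID   EXAMPLE1
AC   TEST0001;
DE   Hypothetical example entry,
DE   description continued.
OS   Homo sapiens
FT   CDS             1..9
//
"""

entry = parse_entry(ENTRY)
print(entry["accession number"])
print(" ".join(entry["description"]))
```

Lines that repeat a code, such as the two DE lines, are collected in order, so a long description can be joined back into one string.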

GenBank Flat File Format: Refer to the Summary Description of the GenBank Flat File Format, or to http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

ExPASy: Expert Protein Analysis System, the proteomics server of the Swiss Institute of Bioinformatics (SIB), dedicated to the analysis of protein sequences and structures. http://www.expasy.org/

Databases on the ExPASy server: SWISS-PROT and TrEMBL - Protein knowledgebase; PROSITE - Protein families and domains; SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis; ENZYME - Enzyme nomenclature; SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules; SWISS-MODEL Repository - Automatically generated protein models.

SWISS-PROT: A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. http://tw.expasy.org/sprot/

TrEMBL: Computer-annotated supplement to SWISS-PROT.

Enzyme nomenclature database: http://tw.expasy.org/enzyme/

ENZYME Database: A repository of information relating to the nomenclature of enzymes. It describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided.

Access to ENZYME: by EC number; by enzyme class; by description (official name) or alternative name(s); by chemical compound; by cofactor.
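Access by EC number and by enzyme class are closely related, because the first field of an EC number is the enzyme class. The sketch below parses an EC number into its four levels and looks up the top-level class. The helper names are chosen for the example, and the class list includes translocases (class 7), which were added to the nomenclature after this presentation was written.

```python
# Top-level Enzyme Commission classes.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

def parse_ec_number(ec):
    """Split an EC number such as 'EC 1.1.1.1' into four levels.

    A dash marks an unspecified level and is returned as None.
    """
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return tuple(None if part == "-" else int(part) for part in parts)

def enzyme_class(ec):
    """Return the name of the top-level class of an EC number."""
    return EC_CLASSES[parse_ec_number(ec)[0]]

def share_levels(ec_a, ec_b, depth):
    """True if two EC numbers agree on their first `depth` levels."""
    return parse_ec_number(ec_a)[:depth] == parse_ec_number(ec_b)[:depth]

print(parse_ec_number("EC 1.1.1.1"), enzyme_class("EC 1.1.1.1"))
print(share_levels("1.1.1.1", "1.1.3.4", depth=2))
```

Comparing only the leading levels is what a search by enzyme class amounts to: every entry whose number starts with the same fields belongs to the same class or subclass.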

KEGG: Kyoto Encyclopedia of Genes and Genomes. http://www.genome.ad.jp/kegg/kegg2.html

A structured database containing information about metabolic pathways in many organisms.

KEGG Part of the GenomeNet database system Linked to all accessible databases by search engines; LIGAND & BRITE

Link to other pathways Enzyme Compound

Summary Biological databases represent an invaluable resource in support of biological research. We can learn much about a particular molecule by searching databases and using available analysis tools. A large number of databases are available for that task. Some databases are very general while some are very specialised. For best results we often need to access multiple databases.

Common database search methods include keyword matching, sequence similarity, motif searching, and class searching. The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.
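Keyword matching, the simplest of these search methods, can be sketched in a few lines. The records below are hypothetical, and real retrieval systems add indexing, field restrictions and Boolean operators on top of this basic idea.

```python
def keyword_search(records, query):
    """Return identifiers of records whose text contains every keyword.

    Matching is case-insensitive and looks for each keyword as a substring.
    """
    keywords = query.lower().split()
    return [
        identifier
        for identifier, text in records.items()
        if all(keyword in text.lower() for keyword in keywords)
    ]

# Hypothetical record descriptions, keyed by made-up identifiers.
records = {
    "R1": "Alcohol dehydrogenase, liver isoform",
    "R2": "Aldehyde dehydrogenase, mitochondrial",
    "R3": "Hexokinase, liver isoform",
}

print(keyword_search(records, "dehydrogenase"))
print(keyword_search(records, "liver dehydrogenase"))
```

Adding a second keyword narrows the result from two records to one, which illustrates how much the quality of the results depends on the quality of the query.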

Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult. Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS. Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries. Access to biological databases is so important that today virtually every molecular biological project starts and ends with querying biological databases.

The End