The National Center for Biotechnology Information (NCBI): a primary resource for molecular biology information
www.ncbi.nih.gov
Database Resources
NCBI Mission
… to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease.
What does this involve?
- Creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics
- Facilitating the use of such databases and software by the research and medical community
- Coordinating efforts to gather biotechnology information both nationally and internationally
- Performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules
What is a Database?
- A model or representation of some aspect of the real world
- An organized collection of data
- May contain many different types of data
- Coherent, consistent and designed for a specific purpose
- A computational system for managing and querying the data
What is a Database?
- A collection of information organized in such a way that a computer program can quickly select desired pieces of data
- An electronic filing system
- Traditional databases are organized by fields, records, and files: a field is a single piece of information; a record is one complete set of fields; a file is a collection of records.
- For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.
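The field / record / file vocabulary can be sketched in a few lines of Python. This is only an illustration of the telephone book analogy: the class name `Record`, the helper `find_by_name`, and the two entries are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One record: a complete set of the three fields."""
    name: str        # field 1
    address: str     # field 2
    telephone: str   # field 3


# The "file" is simply a collection of records (invented sample entries).
phone_book = [
    Record("Ada Example", "1 Sample Street", "555-0100"),
    Record("Bob Example", "2 Sample Street", "555-0101"),
]


def find_by_name(file, name):
    """Select the records whose name field equals the requested name."""
    return [record for record in file if record.name == name]


matches = find_by_name(phone_book, "Bob Example")
print(matches[0].telephone)  # prints 555-0101
```

Scanning a list like this is fine for a handful of records; a DBMS exists precisely so that the same kind of selection stays fast and safe on very large collections.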
What is a Database?
- To access information from a database, you need a database management system (DBMS): a collection of programs that enables you to enter, organize, and select data in a database.
- Most molecular biology databases primarily use relational database management systems (RDBMS).
Relational Database
- A relational database is like a large spreadsheet: each field is a column, each row is an entry.
- Relational databases use a set of tables to organize data.
- Each entry must be unambiguously identified.
- Names are not reliable, e.g. an incorrectly assigned gene function.
- Unique IDs (UIDs) are used instead, e.g. in GenBank these are called accession numbers.

UID       Name            Sequence     Quality Value
BU039022  PP_LEa0001A01f  CATACAAAT …  35
BU039057  PP_LEa0001B17f  TACGGCTAC …  28
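As a rough sketch of the idea that every entry needs a unique ID, the table above can be loaded into an in-memory SQLite database with the UID declared as the primary key. The table name `est` and the column names are assumptions made for this example, and the sequences are only the truncated prefixes shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; the UID column plays the role of the accession number.
conn.execute("""
    CREATE TABLE est (
        uid      TEXT PRIMARY KEY,
        name     TEXT,
        sequence TEXT,
        quality  INTEGER
    )
""")

# Rows copied from the slide (sequences are truncated prefixes only).
rows = [
    ("BU039022", "PP_LEa0001A01f", "CATACAAAT", 35),
    ("BU039057", "PP_LEa0001B17f", "TACGGCTAC", 28),
]
conn.executemany("INSERT INTO est VALUES (?, ?, ?, ?)", rows)

# A second entry with an existing UID is refused by the database.
try:
    conn.execute(
        "INSERT INTO est VALUES (?, ?, ?, ?)",
        ("BU039022", "duplicate_entry", "AAAA", 10),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM est").fetchone()[0]
print(duplicate_rejected, row_count)  # prints True 2
```

The names in the second column could repeat or be wrong without the database noticing; only the UID is guaranteed to identify exactly one row.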
Relational Database
Achieving consistency
- Repeated information is stored in a single place.
- Only one copy needs to be updated.
[Schema diagram: tables Sequence (UID, Definition, Locus, Accession, Taxonomy ID*, Sequence) and Taxonomy (Taxonomy ID*, Genus, Species), plus reference tables with the labels Ref Index*, UID, Medline ID, Ref Index, Medline ID, Authors, Title, Journal]
* May be referred to by a secondary ID
* May be referred to indirectly via an index
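A minimal sketch of "only one copy needs to be updated", using two of the tables from the diagram. The column subset, the accession-like IDs, and the genus and species strings are placeholders invented for the example, not real records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxonomy (
        taxonomy_id INTEGER PRIMARY KEY,
        genus       TEXT,
        species     TEXT
    );
    CREATE TABLE sequence (
        uid         TEXT PRIMARY KEY,
        definition  TEXT,
        taxonomy_id INTEGER REFERENCES taxonomy(taxonomy_id)
    );
    -- Placeholder data: two sequences share one taxonomy row.
    INSERT INTO taxonomy VALUES (1, 'Genusa', 'speciesa');
    INSERT INTO sequence VALUES ('SEQ0001', 'example clone 1', 1);
    INSERT INTO sequence VALUES ('SEQ0002', 'example clone 2', 1);
""")

# Correct the species name once, in the single place where it is stored.
conn.execute(
    "UPDATE taxonomy SET species = ? WHERE taxonomy_id = ?",
    ("speciesb", 1),
)

# Joining the two tables shows the correction for every sequence.
joined = conn.execute("""
    SELECT s.uid, t.genus, t.species
    FROM sequence AS s
    JOIN taxonomy AS t ON t.taxonomy_id = s.taxonomy_id
    ORDER BY s.uid
""").fetchall()
print(joined)
```

Had the genus and species been copied into every sequence row, the same correction would have needed one update per row, with the risk of missing some.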
Relational Database
- The language used is SQL (Structured Query Language).
- Easy to understand (essentially English?)
- Relatively consistent across RDBMS
- Supplies a set of commands to define tables, insert data and make queries
- Queries take the form: SELECT some fields FROM some table WHERE some condition is met
- E.g. SELECT accession, sequence FROM sequence WHERE accession = 'BU039022' returns: BU039022 CATACAAATACTGCTACHTAAATC …
- More complex queries require two or more tables to be joined to produce a result
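The example query can be tried against a throwaway SQLite database. This sketch assumes a two-column `sequence` table; the stored sequences are just the truncated prefixes that appear on the slides. Note that the accession is a text value, so it has to be quoted in the SQL statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a table, insert data, then make a query: the three kinds of command.
conn.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, sequence TEXT)")
conn.execute("INSERT INTO sequence VALUES ('BU039022', 'CATACAAATACTGCTACHTAAATC')")
conn.execute("INSERT INTO sequence VALUES ('BU039057', 'TACGGCTAC')")

result = conn.execute(
    "SELECT accession, sequence FROM sequence WHERE accession = 'BU039022'"
).fetchall()
print(result)  # prints [('BU039022', 'CATACAAATACTGCTACHTAAATC')]
```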
Relational Database
- Most public databases built on an RDBMS do not allow users to query the database directly with SQL.
- An ill-formed query can overload or crash the system.
- SQL still too complex for biologists?
- Provide a search interface for the user instead, e.g. the user enters a phrase and the database identifies what part of the database should be searched.
- The queries that make it through the web interface have to be translated to SQL.
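One way such a translation layer might look is sketched below. This is not how GenBank or Entrez is implemented; the function `search`, the table layout, the sample definitions, and the simplified accession pattern are all assumptions for illustration. The user's phrase is never pasted into the SQL text; it is passed as a parameter, so an ill-formed or malicious phrase cannot change the query.

```python
import re
import sqlite3

# Simplified, illustrative pattern (1-2 letters followed by 5-6 digits);
# real accession formats are more varied.
ACCESSION_PATTERN = re.compile(r"^[A-Z]{1,2}\d{5,6}$")


def search(conn, phrase):
    """Decide which column to search, then run a parameterized query."""
    phrase = phrase.strip()
    if ACCESSION_PATTERN.match(phrase.upper()):
        sql = "SELECT accession, definition FROM sequence WHERE accession = ?"
        params = (phrase.upper(),)
    else:
        sql = ("SELECT accession, definition FROM sequence "
               "WHERE definition LIKE ? ORDER BY accession")
        params = (f"%{phrase}%",)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (accession TEXT PRIMARY KEY, definition TEXT)")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [
        ("BU039022", "example leaf cDNA clone"),   # invented definitions
        ("BU039057", "example root cDNA clone"),
    ],
)

print(search(conn, "bu039022"))      # looked up by accession
print(search(conn, "root"))          # looked up by free text
print(search(conn, "x' OR '1'='1"))  # treated as plain text, matches nothing
```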
Relational Database: Example GenBank Query
What Constitutes a Good Database?
- Broad coverage of the chosen topic
- Up-to-date information gathering
- Curated
- Support staff
- Commitment to the future
- Good query interface
Issues for Molecular Biological Databases?
- Annotation
- Archives
- Updates
- Redundancy
Issues for Molecular Biological Databases?
Annotation
- Adding biological information to genome sequence
- Textual descriptive information
Correctness
- Many genes are incorrectly annotated.
- A function may be assigned to a novel gene from a similar sequence that may itself be incorrectly annotated, so the error is propagated throughout the database.
- Routine error
Quality
- Expert or non-expert curation?
- Who provided the curation?
- Is there any biological verification?
- What vocabulary is used?
- Has there been any peer review?
Issues for Molecular Biological Databases?
Archival Quality
- Is the database archival or curated?
- Can the same data be recovered later?
- Don't overwrite the primary key (each accession number).
- The best databases note any changes to the data.
Updates
- How often is the database updated?
- Major databases take direct submissions.
- Only the direct submitter can make changes, even if you can prove the record is wrong.
- When is a sequence finished?
- How is annotation updated as more knowledge becomes available?
Redundancy
- This is a major issue: how do we deal with it without losing potentially valuable information?
- Also relates to archival quality
NCBI and GenBank
- GenBank is the genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information in them.
- GenBank is hosted within NCBI.
- Researchers submit their sequences to GenBank.
- NCBI provides analysis and retrieval resources for the data in GenBank (and many other NCBI-hosted databases).
NCBI Databases (http://www.ncbi.nlm.nih.gov/guide/all/#Databases_)
- Nucleotide Database
- EST (dbEST)
- GSS (dbGSS)
- Protein Database
- Structure Database
- Genome
- 3D Domains
- Conserved Domains
- UniSTS
- Gene
- UniGene
- HomoloGene
- Reference Sequence (refseq)
- SNP (dbSNP)
- dbVAR – large scale genomic variation
- dbGAP – integration of genotype & phenotype
- PopSet Database
- Taxonomy Database
- GEO Profiles
- GEO Datasets
- Cancer Chromosomes
- Epigenomics
- PubMed Central
- Journals
- MeSH
- Bookshelf
- OMIM Database
Retrieving Data from NCBI using Entrez
Entrez is a text-based retrieval system that integrates all the information resources available at the NCBI, such as:
1. Scientific literature
2. DNA and protein sequence databases
3. 3D protein structure and protein domain data
4. Population study datasets
5. Expression data
6. Assemblies of complete genomes
7. Taxonomic information
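Entrez is normally used through the web search box, but the same text queries can also be sent programmatically. The sketch below only builds a search URL for NCBI's E-utilities `esearch` service and does not contact the server. E-utilities are not covered in these slides, so treat the details as assumptions: the helper name `build_esearch_url` is made up, `nuccore` is used as the nucleotide database name, and `[Accession]` restricts the term to the accession field.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; not given on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request for a text query."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


url = build_esearch_url("nuccore", "BU039022[Accession]", retmax=5)
print(url)
```

Opening the printed URL in a browser should return the matching record IDs as XML; for bulk use, NCBI's usage guidelines on request rates apply.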
Understanding GenBank records
- Go to http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#ModificationsDateB
- Click on the links on the left to get a description of what each term means.
- Copy each description into a Word document and, once completed, save the document on your Drupal web site.
Entrez Sequences Help http://www.ncbi.nlm.nih.gov/books/NBK44864/