Molecular Biology Databases


Molecular Biology Databases NCBI, DDBJ, EMBL and others These slides are obtained and/or modified from the NCBI website "Field Guide to NCBI" slides and from http://www.biotech.ufl.edu/WorkshopsCourses/summerBCourse/IntroToDataBases.ppt

What is a Database? A database can be defined as "a collection of data arranged for ease and speed of search and retrieval." A DNA database contains individual records or data entries of the DNA sequences as well as information about the sequences. A DNA database often contains flat files. These are relatively simple database systems in which each database is contained in a single table. In contrast, relational database systems can use multiple tables to store information, and each table can have a different record format. http://omni.ac.uk/about/ This is the link to Omni, a guide to other databases in biology and medicine.

GenBank as a Database GenBank is the National Institutes of Health (NIH) genetic sequence database, an annotated collection of all publicly available DNA sequences. It is maintained by the National Center for Biotechnology Information (NCBI) within the NIH. Sequences are said to be annotated when a researcher ascribes a function to some or all of the sequences. For example, if I know a sequence of 1000 nucleotides contains a gene beginning at nucleotide 200 and ending at nucleotide 800 and say so in the record, then it is considered annotated.

Anatomy of a Genome InfoSystem
Information structure
– Records of hierarchical, complex documents; tables of rows and columns of numbers, letters, words
– Table of contents, reports, indexing (as a reference book)
– Browse through available structure
– Search and retrieve according to biological questions
– Bulk data selection & retrieval for other uses
Information content
– Primary: literature (referenced, abstracted and curated), sequence and feature analyses, maps, controlled vocabulary/ontologies relevant to biology, people, research methods, contacts, etc.
– Metadata describing primary data, along with protocols, notes, sources
Informatics / software
– "Back-end" database, data collection, management, with some analyses
– "Front-end" information services (hypertext web, document search/retrieval methods); ease of understanding and usage (HCI)
– "Middleware" glue code, software, etc.
– Specialized applications for genome data: maps, BLAST searches, ontologies

History of Sequence Databases The first bioinformatics databases were constructed a few years after the first protein sequences began to become available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT protein sequence database began in 1987.

GenBank History DNA databases began in the early 1980s with a database called GenBank, which was originated by the U.S. Department of Energy to hold the short stretches of DNA sequence that scientists were just beginning to obtain from a range of organisms. In the early days of GenBank, rooms of technicians sat at keyboards consisting of only the four letters A, C, T and G, tediously entering the DNA-sequence information published in academic journals.

The National Center for Biotechnology Information
Created as a part of NLM in 1988
Establish public databases (U.S. National DNA Sequence Database)
Perform research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information

GenBank History Newer communication technologies enabled researchers to dial up GenBank and deposit their sequence data directly. The administration of GenBank was transferred to the National Institutes of Health's National Center for Biotechnology Information (NCBI). With the advent of the World Wide Web, researchers could access the data in GenBank for free from around the globe. Once the Human Genome Project (HGP) began in 1990, DNA-sequence data in GenBank began to grow exponentially. With the introduction of high-throughput sequencing in the 1990s, additions to GenBank skyrocketed.

An Interesting Metaphor For Bioinformatics Information Flow and Databases Cooks generate and enter the data. Data Management makes it into a stew of blended information. The waiters take the data from the servers to the public. The diners are placing orders for the information they wish to consume.

Molecular Databases
Primary Databases
  Original submissions by experimentalists
  Database staff organize but don’t add additional information
  Example: GenBank, SNP, GEO
Derivative Databases
  Human curated: compilation and correction of data. Example: SWISS-PROT, NCBI RefSeq mRNA
  Computationally derived. Example: UniGene
  Combinations. Example: NCBI Genome Assembly

What, the scientists submit their own DNA sequences? Who checks for errors? Who makes people actually send their data to the database so all can share it? Learn from the successes and failures of GenBank/EMBL, which hold extensive publicly shared bio-data. Carrot/stick approach: granting agencies and journals began requiring scientists to publish sequence data. Patented sequences must be entered in the databases too. However, there is significant error in the public databanks because the data are owned by the submitting scientists, and there are no inducements to update records or go back and correct errors.

Primary vs. Derivative Databases [Figure: sequence fragments flowing between the labeled boxes Sequencing Centers, Labs, GenBank, Curators, Algorithms, RefSeq, UniGene and Genome Assembly.]

GenBank is NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
GenBank data:
  Direct submissions (traditional records)
  Batch submissions (EST, GSS, STS)
  ftp accounts (genome data)
Three collaborating databases:
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory (EMBL) Database

Why use Bioinformatics Databases?
Speed of information retrieval
Increasing size of data sets
Amount of information available
Save time and money by simulating experiments prior to the actual experiment (a.k.a. in silico)

How do you access Databases?
Search engines: programs that allow you to search the database
Links from other sites to the search engines
Programs that directly link to the search engines

Boolean Logic
Why do we use Boolean operators? To narrow your search and get fewer superfluous results.
What are the Boolean operators?
AND: looks for entries with both terms
OR: looks for entries with one term or the other
NOT (or BUTNOT): looks for entries with one term but not the other
* (Wildcard): looks for ALL entries that contain the term with the * after it

Citations that contain the descriptors Food ‘AND’ Allergy only.

OR: Citations that contain the descriptors Food ‘OR’ Allergy. This is a bigger set.

Citations that contain the descriptors Allergy ‘NOT’ Food

* (Wildcard): Citations that contain the descriptors Allerg* (Allergies, Allergy, Allergen)
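The four operators above map directly onto set operations. The sketch below uses a made-up index of citation IDs (the IDs, the `index` dictionary and the `wildcard` helper are illustrative assumptions, not part of any NCBI interface) to show which citations each query would return.

```python
# Toy citation index: each descriptor maps to the set of citation IDs tagged with it.
# All IDs and descriptors here are invented for illustration.
index = {
    "food": {1, 2, 3, 4},
    "allergy": {3, 4, 5},
    "allergen": {6},
    "allergies": {4, 7},
}


def wildcard(prefix):
    """Union of the ID sets for every descriptor that starts with `prefix`."""
    hits = set()
    for term, ids in index.items():
        if term.startswith(prefix):
            hits |= ids
    return hits


food_and_allergy = index["food"] & index["allergy"]  # AND: both terms
food_or_allergy = index["food"] | index["allergy"]   # OR: either term (bigger set)
allergy_not_food = index["allergy"] - index["food"]  # NOT: first term without second
allerg_star = wildcard("allerg")                     # Allerg*: any matching descriptor

print(sorted(food_and_allergy))  # [3, 4]
print(sorted(food_or_allergy))   # [1, 2, 3, 4, 5]
print(sorted(allergy_not_food))  # [5]
print(sorted(allerg_star))       # [3, 4, 5, 6, 7]
```

AND can only shrink a result set and OR can only grow it, which is why AND and NOT are the operators to reach for when a search returns too many hits.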

GenBank as a Database GenBank identifiers are unique combinations of numbers and letters used to index GenBank sequence entries. They can be used to retrieve information about a particular gene or DNA sequence from the GenBank database. This information also includes links to similar sequence entries and other public databases, making it a relational database as well as a flat file database.

What is GenBank?
NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
GenBank data:
  Direct submissions: individual records (BankIt, Sequin)
  Batch submissions via email (EST, GSS, STS)
  ftp accounts: sequencing centers
Data shared among three collaborating databases:
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory Database (EMBL) at EBI

The International Sequence Database Collaboration [Figure: GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (NIG, CIB) exchanging submissions and updates; retrieval systems shown are Entrez, SRS and getentry.]

GenBank: NCBI’s Primary Sequence Database
Release 131, August 2002
18,197,119 records
22,616,937,182 nucleotides
110,000+ species
83.65 gigabytes of data
Full release every two months; incremental and cumulative updates daily; available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
Lots of sequences in the databases!

GenBank: NCBI’s Primary Sequence Database
Release 135, April 2003
24,027,936 records
31,099,264,455 nucleotides
120,000+ species
114 gigabytes
Full release every two months; incremental and cumulative updates daily; available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
Increased dramatically since August 2002

GenBank: NCBI’s Primary Sequence Database
Release 139, December 2003
30,968,418 records
36,553,368,485 nucleotides
>140,000 species
138 gigabytes, 570 files
Full release every two months; incremental and cumulative updates daily; available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
Increased dramatically since April!

The Growth of GenBank [Figure: sequence records (millions) and total base pairs (billions) plotted by year, 1982 to 2003.] Release 139: 31.0 million records, 36.6 billion nucleotides. Average doubling time ≈ 12 months. Doubling time is currently less than 1 year and still accelerating.
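A doubling time can be estimated from any two releases by assuming exponential growth between them. The sketch below applies that assumption to the nucleotide counts of Releases 131, 135 and 139 listed on the preceding slides; the month offsets are counted from the release dates shown there, and the function name is only illustrative. An estimate from a short window like this depends strongly on which two releases are chosen, so it need not match the long-run average read off the growth chart.

```python
import math

# (months after Release 131, total nucleotides), taken from the release slides.
releases = {
    131: (0, 22_616_937_182),   # August 2002
    135: (8, 31_099_264_455),   # April 2003
    139: (16, 36_553_368_485),  # December 2003
}


def doubling_time_months(first, last):
    """Doubling time in months, assuming exponential growth between two releases."""
    t0, n0 = releases[first]
    t1, n1 = releases[last]
    return (t1 - t0) * math.log(2) / math.log(n1 / n0)


for first, last in [(131, 135), (135, 139), (131, 139)]:
    months = doubling_time_months(first, last)
    print(f"Release {first} -> {last}: doubling time about {months:.1f} months")
```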

The Entrez System

Entrez Nucleotides
Primary
  GenBank / EMBL / DDBJ     35,116,960
Derivative
  RefSeq                    259,219
  Third Party Annotation    3,182
  PDB                       4,703
Total                       35,384,248

Entrez Protein
GenPept (GB, EMBL, DDBJ)    3,442,298
RefSeq                      856,191
Third Party Annotation      3,834
Swiss Prot                  144,508
PIR                         282,821
PRF                         12,079
Total                       3,442,298
BLAST nr                    1,642,191

Organization of GenBank: GenBank Divisions
Records are divided into 17 divisions: 11 traditional, 5 high throughput, 1 patent (file counts in parentheses).

Traditional divisions: direct submissions (Sequin and BankIt); accurate; well characterized
  PRI (27) Primate
  PLN (10) Plant and Fungal
  BCT (8) Bacterial and Archaeal
  INV (6) Invertebrate
  ROD (11) Rodent
  VRL (3) Viral
  VRT (4) Other Vertebrate
  MAM (1) Mammalian (ex. ROD and PRI)
  PHG (1) Phage
  SYN (1) Synthetic (cloning vectors)
  UNA (1) Unannotated
Bulk divisions: batch submission (email and FTP); inaccurate; poorly characterized
  EST (288) Expressed Sequence Tag
  GSS (98) Genome Survey Sequence
  HTG (61) High Throughput Genomic
  STS (3) Sequence Tagged Site
  HTC (3) High Throughput cDNA
Patent division
  PAT (11 files)

The traditional divisions are generally taxonomic.
  1. PRI - primate sequences
  2. ROD - rodent sequences
  3. MAM - other mammalian sequences
  4. VRT - other vertebrate sequences
  5. INV - invertebrate sequences
  6. PLN - plant, fungal, and algal sequences
  7. BCT - bacterial sequences
  8. VRL - viral sequences
  9. PHG - bacteriophage sequences
  10. SYN - synthetic sequences
  11. UNA - unannotated sequences
The high throughput divisions are based on genome sequencing projects.
  12. EST - EST sequences (expressed sequence tags)
  14. STS - STS sequences (sequence tagged sites)
  15. GSS - GSS sequences (genome survey sequences)
  16. HTG - HTGS sequences (high throughput genomic sequences)
  17. HTC - unfinished high-throughput cDNA sequencing
Patent division:
  13. PAT - patent sequences
Entrez query: gbdiv_xxx[Properties]
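The gbdiv_xxx[Properties] pattern can be filled in with any of the division codes above. The sketch below only builds the query string and a search URL; nothing is sent over the network. The E-utilities `esearch.fcgi` endpoint and the `db=nucleotide` parameter are assumptions that do not come from the slides, and the helper names `division_term` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Division codes as listed on the slide.
DIVISIONS = {
    "PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "VRL", "PHG", "SYN", "UNA",
    "EST", "STS", "GSS", "HTG", "HTC", "PAT",
}

# Assumed endpoint (NCBI E-utilities ESearch); not taken from the slides.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def division_term(code):
    """Return the Entrez term that restricts a search to one GenBank division."""
    if code.upper() not in DIVISIONS:
        raise ValueError(f"unknown GenBank division: {code!r}")
    return f"gbdiv_{code.lower()}[Properties]"


def esearch_url(code, extra_terms=""):
    """Build (but do not fetch) a search URL limited to one division."""
    term = division_term(code)
    if extra_terms:
        term = f"{term} AND {extra_terms}"
    return ESEARCH + "?" + urlencode({"db": "nucleotide", "term": term})


print(division_term("PRI"))
print(esearch_url("est", "myosin"))
```

Combining the division term with AND is the Boolean narrowing described earlier: it keeps only the hits that sit in the chosen division.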

Traditional GenBank Divisions
Direct submissions (Sequin and BankIt); accurate; well characterized
  BCT Bacterial and Archaeal
  INV Invertebrate
  MAM Mammalian (ex. ROD and PRI)
  PHG Phage
  PLN Plant and Fungal
  PRI Primate
  ROD Rodent
  SYN Synthetic (cloning vectors)
  VRL Viral
  VRT Other Vertebrate

A Helpful Resource This is a link to a sample annotated GenBank Record. Click on any of the underlined links to learn more about the file structure. http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

What is an Accession Number? An accession number is a label used to identify a sequence in the various databases. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
  X02775       GenBank genomic DNA sequence
  NT_030059    Genomic contig
  rs7079946    dbSNP (single nucleotide polymorphism)
  N91759.1     An expressed sequence tag (1 of 170)
  NM_006744    RefSeq DNA sequence (from a transcript)
  NP_007635    RefSeq protein
  AAC02945     GenBank protein
  Q28369       SwissProt protein
  1KT7         Protein Data Bank structure record

GenBank Flat File Format When you click on an entry, you have opened a GenBank Flat File Information includes: The Name of the gene The Accession number Journal articles

GenBank Flat File Format Information (Cont) Structural information of the gene (e.g. intron/exon boundaries, promoters, etc.) The code for the protein The code for the DNA (or RNA; for an mRNA record it is the cDNA of the mRNA that was sequenced)

A Traditional GenBank Record

LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Slide callouts: DEFINITION = title; ACCESSION AF062069 = accession number; VERSION AF062069.2 = version number; GI:7144484 = GI number; ORGANISM lineage = NCBI’s Taxonomy.

GenBank Record: Feature Table

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Slide callouts: /protein_id="AAC16332.2" and /db_xref="GI:7144485" are the GenPept protein IDs.
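Because every field in a flat file sits on a keyword-labelled line, a few regular expressions are enough to pull out the identifiers and check that the record is internally consistent. The sketch below works on four lines copied from the AF062069 record; the function names are illustrative, and this is a minimal parser for these lines only, not a general GenBank reader.

```python
import re

# Lines copied from the AF062069 record shown on the slides.
locus_line = "LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000"
version_line = "VERSION     AF062069.2  GI:7144484"
cds_line = "     CDS             258..3302"
base_count_line = "BASE COUNT     1201 a    689 c    782 g   1136 t"


def parse_version(line):
    """Split a VERSION line into (accession, version number, GI number)."""
    m = re.match(r"VERSION\s+(\w+)\.(\d+)\s+GI:(\d+)", line)
    return m.group(1), int(m.group(2)), int(m.group(3))


def parse_location(line):
    """Return (start, end) from a simple 'start..end' feature location."""
    m = re.search(r"(\d+)\.\.(\d+)", line)
    return int(m.group(1)), int(m.group(2))


def parse_base_count(line):
    """Return a dict such as {'a': 1201, ...} from a BASE COUNT line."""
    return {base: int(n) for n, base in re.findall(r"(\d+)\s+([acgt])\b", line)}


accession, version, gi = parse_version(version_line)
locus_length = int(re.search(r"(\d+) bp", locus_line).group(1))
start, end = parse_location(cds_line)
cds_length = end - start + 1          # locations are 1-based and inclusive
counts = parse_base_count(base_count_line)

print(accession, version, gi)                 # AF062069 2 7144484
print(sum(counts.values()) == locus_length)   # True: base counts add up to 3808
print(cds_length, cds_length % 3)             # 3045 0: a whole number of codons
```

The two checks are the kind of consistency test a curator or a script can run on any record: the base counts should sum to the LOCUS length, and a complete coding region should span a multiple of three nucleotides.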

Multiple Formats are available for Sequence Data Historically, the DNA and protein analysis software was written at the same time as the databases were being established, so the database formats and the software co-evolved. Sequence analysis software needs simpler formats than the databases do, for speed; otherwise the program must be allowed to ignore most of the excess information.

FastA format is a very popular solution

>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

FASTA Definition Line: >gi|603218|gb|U18238.1|MSU18238
  >           start of the definition line
  603218      gi number
  gb          database identifier
  U18238.1    accession number
  MSU18238    locus name
Database identifiers:
  gb   GenBank
  emb  EMBL
  dbj  DDBJ
  sp   SWISS-PROT
  pdb  Protein Databank
  pir  PIR
  prf  PRF
  ref  RefSeq
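The definition line is easy to take apart in code because its fields are separated by the | character. The sketch below handles only the five-field gi|number|db|accession|locus layout shown on this slide; the function name `parse_defline` is illustrative, and other definition-line layouts would need their own handling.

```python
# Database tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(line):
    """Split a '>gi|number|db|accession|locus description' line into its fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifier, _, description = line[1:].partition(" ")
    fields = identifier.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("expected the gi|number|db|accession|locus layout")
    _, gi, tag, accession, locus = fields
    return {
        "gi": int(gi),
        "database": DB_TAGS.get(tag, tag),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


defline = ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
for key, value in parse_defline(defline).items():
    print(f"{key:12} {value}")
```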

FASTA format

Graphics format

ASN.1 Format ASN.1, or Abstract Syntax Notation One, is an International Organization for Standardization (ISO) data representation format used to achieve interoperability between platforms. NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records. ASN.1 permits computers and software systems of all types to reliably exchange both the data structure and content.

NCBI Software Development Tool Kit The "NCBI Toolbox" is a set of software and data exchange specifications used by NCBI to produce portable, modular software for molecular biology. The software in the Toolbox is primarily designed to read ASN.1 format records. It is available to the public in the toolbox/ncbi_tools directory of NCBI's ftp site, and can be used in its own right or as a foundation for building tools with similar properties. The readme files in the toolbox and toolbox/ncbi_tools directories of the FTP site contain more information about the toolbox and ASN.1.

Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Medicago sativa glucose-6-phosphate dehydrogenase mRNA, and translated products" ,
    source {
      org {
        taxname "Medicago sativa subsp. sativa" ,
        db {
          {
            db "taxon" ,
            tag id 56147 } } ,
        orgname {
          name binomial {
            genus "Medicago" ,
            species "sativa" ,
            subspecies "subsp. sativa" } ,
          mod {

Slide labels: FASTA, Nucleotide, Protein, GenPept, GenBank, ASN.1

NCBI Toolbox

Toolbox Sources
ftp> open ftp.ncbi.nih.gov
.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

/************************************************************************
*
* asn2ff.c
*   convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Database Tools aren’t keeping pace Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly larger volume of data held in public, private and commercial databases, the tools used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago. Many are simple extensions of the original academic systems, which have served the needs of both academic and commercial users for many years. These systems are now beginning to fall behind as they struggle to keep up with the pace of change in the pharma industry.

Database Tools aren’t keeping pace Databases are still gathered, organized, disseminated and searched using flat files. Relational databases are still few and far between, and object-relational or fully object-oriented systems are rarer still in mainstream applications. Interfaces still rely on command lines, on fat client interfaces that must be installed on every desktop, or on HTML/CGI forms. While the tools were in the hands of bioinformatics specialists, pharmas were relatively undemanding of them. Now that the problems have expanded to cover the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharma R&D informatics requirements.

There is more than one type of DNA sequence in GenBank
Genomic sequences, made from genomic DNA: these do contain introns and LOTS of DNA that never becomes messenger RNA. (mRNA codes for proteins.)
cDNA sequences, made from mRNA: these don’t contain the introns.
ESTs: short stretches of cDNA sequence that are a sort of “rough draft”.
mtDNA: from mitochondrial genomes.
SNPs: single nucleotide polymorphisms, with some DNA variation.

Quality of the Sequence is Variable Some of the DNA is sequenced several times before it is added to the databases. Some of the DNA is sequenced very quickly on automated equipment and is input directly from the computers. Both are important types of information. The “draft” is corrected by curators who assemble the pieces into the genome.

Genome Sequencing [Figure: flow diagram with the labels whole BAC insert (or genome), shredding, cloning, isolating, sequencing, GSS division or trace archive, assembly, and Draft Sequence (HTG division).]

Working Draft Sequence gaps

Assembly Required. All the data is still in the pieces used to assemble the genomes. So, that means all the overlapping pieces are still in the databases. So, searching comes up with many versions and shorter subclones: pieces which are used to assemble the “genomic contigs” or contiguous pieces which are assembled into whole chromosomes. Sometimes you want to use the smaller pieces, since handling the whole chromosome is awkward in sequence analysis.

HTG Division: High Throughput Genome Records of 40,000 to > 350,000 bp. [Figure: one record shown at phase 1 (Acc = AC109609.1), phase 2 (Acc = AC109609.6) and phase 3 (Acc = AC109609.10), with the division labels HTG and ROD.]

HTG Division: High Throughput Genome 40,000 to > 350,000 bp

Whole Genome Shotgun

STS Division: Sequence Tagged Sites
Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
PCR with STS primers gives one product per genome
Basis of Radiation Hybrid Mapping
Used by: UniGene, Genome Assembly
Related resource: Electronic PCR http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi

Be aware of errors in the databases Sequence errors: genome projects’ error rate is 1/10,000 nucleotides; ESTs’ error rate is 1/100 nucleotides. Annotation errors: Many databases annotate their sequences using automated computer programs. These programs do not always give correct annotations. SwissProt is a protein database curated and annotated manually by biologists. It’s regarded as the most reliable database, but does not have the most up-to-date sequence information.
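The two error rates quoted above translate into very different expectations for a single read. The sketch below works through the arithmetic for a 500-nucleotide sequence; the length is an arbitrary example, and the error-free probability assumes that errors occur independently at each position, which is a simplification not stated on the slide.

```python
from fractions import Fraction

# Error rates quoted on the slide.
GENOME_ERROR_RATE = Fraction(1, 10_000)  # finished genome project sequence
EST_ERROR_RATE = Fraction(1, 100)        # single-pass EST reads


def expected_errors(length_nt, rate):
    """Expected number of wrong bases in a sequence of the given length."""
    return length_nt * rate


def prob_error_free(length_nt, rate):
    """Chance that no base is wrong, assuming independent errors (a simplification)."""
    return float(1 - rate) ** length_nt


length = 500  # arbitrary example length
print(expected_errors(length, EST_ERROR_RATE))             # 5
print(float(expected_errors(length, GENOME_ERROR_RATE)))   # 0.05
print(f"{prob_error_free(length, EST_ERROR_RATE):.4f}")
print(f"{prob_error_free(length, GENOME_ERROR_RATE):.4f}")
```

A 500-nt EST is therefore expected to carry about five wrong bases, while the same stretch of finished genomic sequence will usually carry none.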

There is a Lot of Sequence in the Databases One problem is finding what you are looking for in the database. Try putting in the search term human beta hemoglobin into the nucleotide database. It won’t be easy to find the sequence in the 88 pages of hits! RefSeq was invented to help you find some of the common sequences based on a human (or now, a computer) looking over all the similar submissions of the same sequence to the database. RefSeq corrects some of those sequence errors by comparing lots of sequences.

RefSeq: NCBI’s Derivative Sequence Database
Curated transcripts and proteins (reviewed): human, mouse, rat, fruit fly, zebrafish, Arabidopsis, C. elegans
Human model transcripts and proteins
Assembled genomic regions (contigs): draft human genome, mouse genome
Chromosome records: microbial, organelle

RefSeq Benefits
non-redundancy
explicitly linked nucleotide and protein sequences
updates to reflect current sequence data and biology
data validation
format consistency
distinct accession series
stewardship by NCBI staff and collaborators

The RefSeq Accession Numbers: NCBI Reference Sequences
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted Transcript (human, mouse)
  XP_123456  Predicted Protein (human, mouse)
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence (human)
Assemblies
  NT_123456  Contig (Mouse and Human)
  NW_123456  WGS Supercontig (Mouse)
  NC_123456  Chromosome (Microbial, Arabidopsis)
Organisms shown: human, mouse, rat, fruit fly, zebrafish, Arabidopsis, microbial
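Since the two-letter prefix carries the meaning, a record type can be read straight off a RefSeq accession. The sketch below is a lookup built only from the prefixes on this slide; the function name is illustrative, and accessions without one of these prefixes (for example plain GenBank accessions) simply return None.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted transcript",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NT": "Contig",
    "NW": "WGS supercontig",
    "NC": "Chromosome",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq accession, or None if it has no known prefix."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


# Accessions quoted elsewhere in these slides.
for acc in ["NM_006744", "NP_007635", "NT_030059", "NC_002695", "X02775"]:
    print(acc, "->", describe_refseq(acc))
```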

GenBank Sequences: Human Lipoprotein Lipase

Curated RefSeq Records: NM_, NP_

Alignment Based Models

Alignment Based Models AA change In alignment, computers are used to line up two similar sequences and note where the matches and mismatches are.
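Once two sequences have been lined up, finding the matches and mismatches is a position-by-position comparison. The sketch below assumes the two sequences are already aligned and of equal length, so it does not perform the alignment itself (no gaps are inserted); the example sequences are made up.

```python
def compare_aligned(seq_a, seq_b):
    """Count matches and list 1-based mismatch positions for two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    mismatches = [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]
    matches = len(seq_a) - len(mismatches)
    return matches, mismatches


# Invented example: a single-base difference at position 5.
matches, mismatches = compare_aligned("ATTGACTA", "ATTGGCTA")
print(matches)     # 7
print(mismatches)  # [(5, 'A', 'G')]
```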

Alignment Generated Transcripts: XM_, XP_

RefSeq Contig: NT_, NW_

RefSeq Chromosomes: NC_
LOCUS       NC_002695  5498450 bp  DNA  circular  BCT  02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S., Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T., Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying the verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
  PUBMED    10734605
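The LOCUS line of the record packs several facts into one row. The sketch below splits the line from this slide into named fields; it assumes exactly the eight whitespace-separated fields seen here, which is a simplification, since LOCUS lines in other records can omit or add fields. The function name `parse_locus` and the dictionary keys are illustrative.

```python
def parse_locus(line: str) -> dict:
    """Split a GenBank LOCUS line with the eight-field layout shown above."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS":
        raise ValueError("unexpected LOCUS line layout")
    _, name, length, unit, molecule, topology, division, date = fields
    if unit != "bp":
        raise ValueError("expected a length given in bp")
    return {
        "name": name,
        "length_bp": int(length),
        "molecule": molecule,
        "topology": topology,
        "division": division,   # BCT is the bacterial division
        "date": date,
    }


record = parse_locus("LOCUS NC_002695 5498450 bp DNA circular BCT 02-OCT-2001")
print(record["name"], record["length_bp"], record["topology"])
```

For real work, an established parser such as the one in Biopython is a better choice than hand-written splitting.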

Integrated WWW Access: BLAST and Entrez

Some Web Statistics (July 2001)
- 25 million hits per day
- 150,000 / 190,000 / 240,000 users per day
- 1.2 million Entrez searches
- PubMed alone: 1 million searches
- BLAST alone: 80,000 searches per day
- 3 terabytes of data downloaded daily via FTP

Users per day (chart; x-axis: 1997 to 2001; annotation: Christmas Day)

Bulk GenBank Divisions
Batch submission and htg (email and ftp); inaccurate, poorly characterized
- EST: Expressed Sequence Tag
- STS: Sequence Tagged Site
- GSS: Genome Survey Sequence
- HTG: High Throughput Genomic

EST Division: Expressed Sequence Tags
Diagram: nucleus (30,000 genes); RNA gene products; make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5’ and 3’ example reads: gatccantgccatacg, ctcgccaattcnntcg)
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
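The N characters scattered through these reads mark bases the sequencer could not call, which is one visible sign of the lower quality of single-pass ESTs. The sketch below parses FASTA-formatted text and reports the fraction of N bases per record. The record names `read_a` and `read_b` are hypothetical; the two short sequences are the example reads from the slide's diagram.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def ambiguous_fraction(sequence: str) -> float:
    """Fraction of bases reported as N (uncalled) in a sequence."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


# Hypothetical record names; sequences copied from the slide's diagram.
example = """>read_a
gatccantgccatacg
>read_b
ctcgccaattcnntcg
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq), "bp,", f"{ambiguous_fraction(seq):.3f} N fraction")
```

The same two functions can be applied to the full IMAGE:275615 reads to compare the quality of the 5' and 3' ends.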

UniGene: A gene-oriented view of sequence entries
UniGene collects expressed sequence tags (ESTs) into clusters, attempting to form one cluster per gene. Use UniGene to study where in the body your gene is expressed, when it is expressed, and how abundant it is.

UniGene http://www.ncbi.nlm.nih.gov/UniGene/
- MegaBlast-based automated sequence clustering
- Nonredundant set of gene-oriented clusters
- Each cluster a unique gene
- Information on tissue types and map locations
- Includes well-characterized genes and novel ESTs
- Useful for gene discovery and selection of mapping reagents

EST hits (figure labels: A.t. serine protease mRNA; A.t. mRNA; 5’ EST hits; 3’ EST hits)

Arabidopsis UniGene Statistics (UniGene Build 14, Apr. 9th, 2002)
      39,855  mRNAs + gene CDSs
      87,006  EST, 3'reads
      42,137  EST, 5'reads
  +   32,571  EST, other/unknown
  ----------
     201,569  total sequences in clusters
Final Number of Clusters (sets)
===============================
      26,808  sets total
      25,474  sets contain at least one known gene
      17,654  sets contain at least one EST
      16,326  sets contain both genes and ESTs
Genome: 115,000,000 bp; 25,498 expected genes; 5% uncharacterized transcripts

Hs UniGene Statistics (UniGene Build 148, Apr. 8th, 2002)
      73,419  mRNAs + gene CDSs
   1,181,855  EST, 3'reads
   1,461,928  EST, 5'reads
  +  616,609  EST, other/unknown
  ----------
   3,333,811  total sequences in clusters
Final Number of Clusters (sets)
===============================
      98,816  sets total
      22,431  sets contain at least one known gene
      97,618  sets contain at least one EST
      21,233  sets contain both genes and ESTs
Genome: 3,000,000,000 base pairs; 30 K expected genes; 80% uncharacterized transcripts
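The cluster counts on the two statistics slides can be cross-checked with inclusion-exclusion: clusters with a known gene, plus clusters with an EST, minus clusters with both. The sketch below applies this to the numbers on the slides, assuming every cluster contains at least one known gene or one EST. For human the result equals the 98,816 total shown; for Arabidopsis it gives 26,802, slightly below the 26,808 shown, so that assumption may not hold exactly for that build.

```python
def total_clusters(with_gene: int, with_est: int, with_both: int) -> int:
    """Cluster total by inclusion-exclusion.

    Assumes every cluster holds at least one known gene or one EST.
    """
    return with_gene + with_est - with_both


# Counts copied from the two UniGene statistics slides.
human = total_clusters(with_gene=22_431, with_est=97_618, with_both=21_233)
arabidopsis = total_clusters(with_gene=25_474, with_est=17_654, with_both=16_326)

print("Human clusters:      ", human)
print("Arabidopsis clusters:", arabidopsis)

# Clusters supported only by ESTs, with no known gene in them.
print("Human EST-only clusters:", 97_618 - 21_233)
```

The large number of EST-only clusters in human, compared with Arabidopsis, is what lies behind the much higher share of uncharacterized transcripts quoted for human.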

UniGene Collections (Jul, 2002)
Species               Common name    Sequences   Clusters
Homo sapiens          human          3,569,546   101,602
Mus musculus          mouse          2,332,864    84,247
Rattus norvegicus     rat              334,582    62,220
Danio rerio           zebrafish        197,266    15,404
Bos taurus            cow              128,914    10,295
Xenopus laevis        frog             162,269    18,984
D. melanogaster       fruit fly        250,655    11,115
Anopheles gambiae     mosquito          43,126     2,556
Plants
Arabidopsis thaliana  thale cress      210,693    26,875
Oryza sativa          rice              78,632    15,802
Triticum aestivum     wheat            139,447    12,575
Hordeum vulgare       barley           160,518     7,324
Zea mays              maize (corn)     131,668    10,301