Introduction to biological databases

Antoine van Kampen Bioinformatics Laboratory

Measurements Can you mention a type of experiment that produces large amounts of data?

High energy physics

The instrument, 27 km in circumference, is almost as big as Paris itself!
ATLAS, one of the main detectors at CERN’s Large Hadron Collider, is about half as big as Notre Dame cathedral. The LHC will produce 15 petabytes of data per year. For comparison, this is about … DVDs. (1 PByte = 1,000,000 Gigabyte)
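The DVD comparison can be reproduced with simple arithmetic. The sketch below assumes a single-layer DVD holds 4.7 GB (a figure that is not on the slide) and uses the decimal convention 1 PByte = 1,000,000 GByte. The function name `dvds_needed` is made up for illustration.

```python
import math

GB_PER_PB = 1_000_000   # decimal convention: 1 PByte = 1,000,000 GByte
DVD_CAPACITY_GB = 4.7   # assumption: single-layer DVD, not stated on the slide


def dvds_needed(petabytes, dvd_capacity_gb=DVD_CAPACITY_GB):
    """Return the number of whole DVDs needed to hold `petabytes` of data."""
    return math.ceil(petabytes * GB_PER_PB / dvd_capacity_gb)


if __name__ == "__main__":
    # Yearly LHC output quoted on the slide: 15 petabytes.
    print(dvds_needed(15))
```

Under these assumptions the yearly LHC output corresponds to roughly 3.2 million DVDs.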

How does this compare to data production in biomedical sciences?

Data production in biomedical sciences
Examples of measurement technologies: Atomic Absorption Spectroscopy Serial Analysis of Gene Expression (SAGE) DNA sequencing

Atomic Absorption Spectroscopy
Uses absorption of light to measure the concentration of gas-phase atoms.

Use of calibration curve to determine sodium concentration (sample absorbance = 0.65)
[Figure: calibration curve, absorbance (0.2 to 1.0) against concentration (2 to 8 ppm). Reading off the sample absorbance gives concentration Na+ = 7.3 ppm.]

Excel or pocket calculator
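A calibration curve of this size really is pocket-calculator work. As a minimal sketch, the code below fits a straight line through four standards by least squares and inverts it for the sample absorbance of 0.65. The standard readings are hypothetical values chosen to lie on a line close to the one on the slide; they are not the original measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_absorbance(absorbance, slope, intercept):
    """Invert the calibration line to get a concentration in ppm."""
    return (absorbance - intercept) / slope


# Hypothetical standards (ppm Na+ and measured absorbance), for illustration only.
STANDARDS_PPM = [2.0, 4.0, 6.0, 8.0]
STANDARDS_ABS = [0.178, 0.356, 0.534, 0.712]

if __name__ == "__main__":
    slope, intercept = fit_line(STANDARDS_PPM, STANDARDS_ABS)
    sample_ppm = concentration_from_absorbance(0.65, slope, intercept)
    print(f"Na+ concentration: {sample_ppm:.1f} ppm")
```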

Measuring gene expression with SAGE
Serial Analysis of Gene Expression Velculescu et al (1995) Science

SAGE library (20,000 to >250,000 tags)

Tag         Count
CTCTTCAAAA  1
CTCTTCAACC  2
CTCTTCACGG  2
CTCTTCAGGA  2
CTCTTCAGGG  1
CTCTTCAGGT  1
CTCTTCGAGA  8
CTCTTCTCCC  4
CTCTTCTGCC  1
CTCTTCTTTG  1
CTCTTGGACT  1
CTCTTGTGGT  1
CTCTTTACTA  1
CTCTTTCCCT  1
CTCTTTGATT  2
CTCTTTTGAT  1
CTGAAAAAAA  3
CTGAAAACCA  4
CTGAAACAGC  1
CTGAAACCCC  6
CTGAAACCCT  4
CTGAAACCTT  1
CTGAAACTGA  1
CTGAAAGGCT  2
CTGAAATCCT  1
CTGAAATCTA  3
CTGAAATGAG  2
CTGAAATTCG  2
CTGAACAAAG  2
CTGAACAAGA  1
CTGAACCTGA  1

This many tags do not always fit in Microsoft Excel.
Challenges:
- Information about the identity of a tag is lost
- You need dedicated software
- Statistical analysis is needed to identify differentially expressed genes
- Use software for statistical analysis: SPSS, SPLUS, R, ...
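To illustrate what "dedicated software" does at its simplest, here is a minimal sketch that counts tags and rescales the counts to tags per million, so that libraries of different sizes can be compared. The function names are invented for this example, and the rescaling is a common convention rather than something prescribed on the slide.

```python
from collections import Counter


def count_tags(tags):
    """Count how often each 10-bp tag occurs (case-insensitive)."""
    return Counter(tag.upper() for tag in tags)


def tags_per_million(counts):
    """Rescale raw counts to tags per million, to compare libraries of different size."""
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


if __name__ == "__main__":
    # Tiny excerpt of the library shown on the slide, expanded back into single tags.
    sequenced = ["CTCTTCGAGA"] * 8 + ["CTCTTCTCCC"] * 4 + ["CTGAAACCCC"] * 6
    counts = count_tags(sequenced)
    for tag, tpm in sorted(tags_per_million(counts).items()):
        print(tag, counts[tag], round(tpm))
```

A real analysis would also map each tag back to a gene and apply a statistical test between libraries, which is where packages such as R come in.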

DNA sequencing

Sanger sequencing From DNA to Autoradiograms

Automated Sequencing ABI 3100 Automated Capillary DNA Sequencer

Electropherogram

Custom-designed factory-style conveyor belt robots:
perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions. Sequencing factories Whitehead institute Automated sequencing

2001
The HGP consortium publishes its working draft in Nature (15 February), and Celera publishes its draft in Science (16 February).
- Human genome: 3 billion bases
- In total 23 billion bases were sequenced (7.5-fold coverage)
- This comprises 23 Gbyte of data
- Public Human Genome Project: 3 billion US dollars
- Private Celera genome project (Craig Venter): 300 million US dollars
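Fold coverage is simply the number of bases sequenced divided by the genome size. The sketch below applies that definition to the rounded figures on the slide. It gives about 7.7 rather than the quoted 7.5, presumably because the slide rounds the genome size down to 3 billion bases (an assumption, since the exact inputs are not given).

```python
def fold_coverage(bases_sequenced, genome_size):
    """Average number of times each base of the genome was sequenced."""
    return bases_sequenced / genome_size


if __name__ == "__main__":
    # Rounded figures from the slide: 23 billion bases sequenced, 3 billion base genome.
    print(f"{fold_coverage(23e9, 3e9):.1f}-fold coverage")
```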

Traditional versus high throughput DNA sequencing
Sanger, one run:
- ? hours (human genome took 15 years)
- 1-384 sequences
- nt per sequence
- 1 KB KB data = 1, ,000 bases
- Year: 1963 – now
Roche 454, one run:
- 7.5 hours
- 1,000,000 sequences
- 500 nt per sequence
- 35 GB data (including images)
- 500 MB data (excluding images) = 500,000,000 bases
- Year: 2005 – now
ABI SOLiD, one run:
- 3-5 days
- 150,000,000 sequences
- nt per sequence
- 2-4 TB data (raw data) = 15,000,000,000 bases
- Year: 2005 – end 2011
Data throughput increases with every update.

Genome projects
- Human genome project (1 genome, pooled from several)
- Exome sequencing (~10 individuals)
- Genome of the Netherlands (770 individuals)
- 1000 Genomes Project (1000 individuals)
- 10K UK project (10,000 individuals)
- …
Many centers have one or more high throughput sequencers.

The ‘biomedical’ laboratories
AMC FNWI

DNA isolation Beckman Biomek FX Mass spectrometer Sequencing

Affymetrix gene expression

Clinical data: the hospital as a laboratory

A bioinformatics laboratory
The Cell Statistics Algorithms JAVA, PHP Biomedical researchers are analyzing and interpreting data

Biological databases

Biological databases
- Hundreds of databases freely available
- Commercial databases (e.g., GeneLogic, BioMax)
- Many databases filled by the scientific community

Publication policies of scientific journals

Molecular Biology Database Collection 2013
- Published by Nucleic Acids Research (scientific journal)
- Currently 1512 databases
- Latest issue: 132 new databases
- Coverage is far from exhaustive?

Molecular Biology Database Collection
Criteria for inclusion:
- Thoroughly curated
- Of interest to a wide variety of biologists (primarily bench scientists)
- Comprehensiveness of coverage
- Degree of added value (e.g., manual curation)
- Likely to be maintained for a long period of time

Types of Databases
Primary Databases (core databases)
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP, GEO
Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
Primary databases serve as a repository of experimentalists' sequences (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq: reference sequences; UniGene: genes compared to genetic loci on genomes).

Another list: 325 biological pathway databases

Application of databases
- Archival of data
- Exploration of data to advance biological knowledge
- In silico analysis (expression, gene discovery)
- Comparative genomics
- Integration with experimental data
- Planning of new experiments (generation of new hypotheses)

Database access
Web interface
- Keyword search
- Browsing / cross-linking
- BLAST
From other applications
- Application programming interface (API)
- Web services
- In-house / third party software
Download
- Full database
- Specific records
- Different formats (text, xml, rdf, etc)
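As an illustration of access "from other applications", the sketch below builds a request URL for a single GenBank record through NCBI's E-utilities web service. The endpoint and the parameter names (`db`, `id`, `rettype`, `retmode`) are written from memory of the E-utilities interface and should be checked against the current NCBI documentation. The helper names are invented, and `fetch_record` needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed E-utilities endpoint; verify against the NCBI documentation before use.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="nuccore"):
    """Build the URL that requests one record in GenBank flat-file format."""
    params = {"db": database, "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_record(accession):
    """Download the record as text (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # AF124527 is the peach ethylene receptor record used later in these slides.
    print(build_efetch_url("AF124527"))
```

The same record can of course be found through the web interface; the point of the API route is that a script can repeat the query for thousands of accessions.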

Elixir: research domains
The purpose of ELIXIR is to construct and operate a sustainable infrastructure for biological information in Europe to support life science research and its translation to medicine and the environment, the bio-industries and society.

Growth of biological data
- The size of several databases (e.g., GenBank) grows at an exponential rate
- The number of databases increases rapidly
- Data management/maintenance is becoming a full-time project and increasingly complex
- Repeat queries on a regular basis

The National Center for Biotechnology Information (NCBI)
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information

Web Access: www.ncbi.nlm.nih.gov
NCBI homepage. Logo will take you back to home page. About NCBI provides introduction to the NCBI and contains basic information on genetics and bioinformatics.

Entrez Integrates Most of Them!
[Figure: Entrez at the center, linked to the following databases]
Gene, UniGene, CancerChromosomes, UniSTS, Homologene, SNP, Genome, Nucleotide, PopSet, GEO, Books, PubMed, Taxonomy, GEO Datasets, MeSH, OMIM, Protein, PMC, Journals, Domains, Structure, 3D Domains

What is GenBank? NCBI’s Primary Sequence Database
- Nucleotide only sequence database
- Archival in nature
- Redundant
GenBank Data
- Direct submissions (traditional records)
- Batch submissions (EST, GSS, STS)
- ftp accounts (genome data)
Three collaborating databases
- GenBank
- DNA Database of Japan (DDBJ)
- European Molecular Biology Laboratory (EMBL-EBI) Database

GenBank: NCBI’s Primary Sequence Database
- Full release every two months
- Incremental updates daily
- Available only via ftp: ftp.ncbi.nih.gov/genbank/

GenBank 2008
Public Collections of DNA and RNA Sequence Reach over 145 Gigabases
These 145,000,000,000 bases of the genetic code represent both individual genes and partial and complete genomes (>61,000,000 sequences) of over 240,000 organisms, e.g. humans, elephants, earthworms, fruit flies, apple trees, and bacteria.
- 120 of the more than 370 microbial genomes were submitted in the past year.
- About 3000 new organisms are added each month.
- 16% of the sequences are of human origin.
- GenBank is maintained by the National Center for Biotechnology Information (NCBI).
- Submitters contributed over 15 million new DNA sequences to the database last year.
- The size of GenBank doubles about every 18 months.
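A doubling time of 18 months means the size after t months is the current size multiplied by 2^(t/18). The sketch below simply extrapolates the 2008 figure of 145 gigabases under that rule. It assumes the doubling time stays constant, which is an idealisation, and the function name is invented for this example.

```python
def projected_size(current_size, months_ahead, doubling_time_months=18.0):
    """Extrapolate a size that doubles every `doubling_time_months` months."""
    return current_size * 2 ** (months_ahead / doubling_time_months)


if __name__ == "__main__":
    size_2008_gigabases = 145  # figure quoted on the slide
    for years in (1.5, 3, 5):
        size = projected_size(size_2008_gigabases, years * 12)
        print(f"after {years} years: about {size:.0f} gigabases")
```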

Genbank 2013 Benson et al (2013) NAR

Explosive data growth: size of databases
about 27 years > 145 Giga bases

DNA sequencing rate

Petabyte = 10^15 byte = 1,000 Terabyte = 1,000,000 Gigabyte
European Nucleotide Archive (ENA)
- Established for next generation sequencing data
- In the last 3 months the EBI received 10 TB of data, representing 1/8 of the total volume of data accumulated in 28 years
Challenges
- Bandwidth to overcome network constraints
- User tools to submit the data
- Validation systems to overcome unmanageable demands on manual curation efforts
- Tools to maximize data utility to users

Organization of GenBank: Traditional Divisions
Records are divided into 18 Divisions: 12 Traditional, 6 Bulk.

Code  Traditional division
PRI   Primate
PLN   Plant and Fungal (plant, fungal, and algal sequences)
BCT   Bacterial and Archaeal
INV   Invertebrate
ROD   Rodent
VRL   Viral
VRT   Other Vertebrate
MAM   Mammalian (other mammalian sequences)
PHG   Phage (bacteriophage sequences)
SYN   Synthetic (cloning vectors)
ENV   Environmental Samples
UNA   Unannotated

Traditional Divisions:
- Direct submissions (Sequin and BankIt)
- Accurate
- Well characterized
The traditional divisions are generally taxonomic. The high throughput divisions (EST, STS, GSS, HTG, HTC) are based on genome sequencing projects, and PAT is the patent division.

Organization of GenBank: Bulk Divisions
Records are divided into 18 Divisions: 12 Traditional, 6 Bulk.

Code  Bulk division
EST   Expressed Sequence Tag
GSS   Genome Survey Sequence
HTG   High Throughput Genomic
STS   Sequence Tagged Site
HTC   High Throughput cDNA (unfinished high-throughput cDNA sequencing)
PAT   Patent

Bulk Divisions:
- Batch submission ( and FTP)
- Inaccurate
- Poorly characterized
The high throughput divisions are based on genome sequencing projects. EST error rate of 1-2%.

A Traditional GenBank Record
The Flatfile Format: a line-type identifier format with three parts, the Header, the Feature Table and the Sequence.

LOCUS       AF bp mRNA linear PLN 29-JAN-2004
DEFINITION  Prunus persica ethylene receptor (ETR1) mRNA, complete cds.
ACCESSION   AF124527
VERSION     AF GI:
KEYWORDS    .
SOURCE      Prunus persica (peach)
  ORGANISM  Prunus persica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; eurosids I; Rosales; Rosaceae; Amygdaloideae; Prunus.
REFERENCE   1 (bases 1 to 2540)
  AUTHORS   Bassett,C.L., Artlip,T.S. and Callahan,A.M.
  TITLE     Characterization of the peach homologue of the ethylene receptor,
            PpETR1, reveals some unusual features regarding transcript
            processing
  JOURNAL   Planta 215 (4), (2002)
  PUBMED
REFERENCE   2 (bases 1 to 2540)
  AUTHORS   Bassett,C.B., Artlip,T.S. and Nickerson,M.L.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JAN-1999) Appalachian Fruit Research Station,
            USDA-ARS, 45 Wiltshire Road, Kearneysville, WV 25430, USA
FEATURES             Location/Qualifiers
     source          /organism="Prunus persica"
                     /mol_type="mRNA"
                     /cultivar="Loring"
                     /db_xref="taxon:3760"
                     /dev_stage="III B/C fruit"
     gene            /gene="ETR1"
     CDS             /codon_start=1
                     /product="ethylene receptor"
                     /protein_id="AAF "
                     /db_xref="GI: "
                     /translation="MEACNCIEPQWPADELLMKYQYISDFFIALAYFSIPLELIYFVK
                     KSAVFPYRWVLVQFGAFIVLCGATHLINLWTFSMHSRTVAIVMTTAKVLTAVVSCATA
                     LMLVHIIPDLLSVKTRELFLKNKAAELDREMGLIRTQEETGRHVRMLTHEIRSTLDRH
                     TILKTTLVELGRTLALEECALWMPTRTGLELQLSYTLRQQNPVGYTVPIHLPVINQVF
                     SSNRALKISPNSPVARMRPLAGKHMPGEVVAVRVPLLHLSNFQINDWPELSTKRYALM
                     VLMLPSDSARQWHVHELELVEVVADQVAVALSHAAILEESMRARDLLMEQNIALDLAR
                     REAETAIRARNDFLAVMNHEMRTPMHAIIALSSLLQETELTPEQRLMVETILKSSHLL
                     ATLINDVLDLSRLEDGSLQLEIATFNLHSVFREVHNLIKPVASVKKLSVSLNLAADLP
                     VQAVGDEKRLMQIVLNVVGNAVKFSKEGSISITAFVAKSESLRDFRAPEFFPAQSDNH
                     FYLRVQVKDSGSGINPQDIPKLFTKFAQTQSLATRNSGGSGLGLAICKRFVNLMEGHI
                     WIESEGPGKGCTAIFIVKLGFAERSNESKLPFLTKVQANHVQTNFPGLKVLVMDDNGS
                     VTKGLLVHLGCDVTTVSSIDEFLHVISQEHKVVFMDVCMPGIDGYELAVRIHEKFTKR
                     HERPVLVALTGNIDKMTKENCMRVGMDGVILKPVSVDKMRSVLSELLEHRVLFEAM"
ORIGIN
        1 gcacgagggc tcaccgagcg agctagctct tcaggagtca aggcttctgg gtgaggggaa
       61 gaagaagaag cttctttgat gtgttggggt gccaatctaa agaggaagaa gaaggcctct
      121 aatgtattga ggtcggctgt ctgggctgcc gatctgtgtt gaatggatag tttggtagag
      181 atgcttcaac gacatagggt ggctgaaaag ggtttgaaga aagtgaagga ggaaaccaag
      ...
     2401 tatactgaaa cctgtctcag ttgataaaat gaggagtgtt ttatcagaac tgttggagca
     2461 tcgagtttta tttgaggcta tgtaagatat aggaaaattg ttctagtgaa ggaaagattt
     2521 aaatggaaaa aaaaaaaaaa
//

The Header
LOCUS       AY182241 1931 bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3 (bases 1 to 1931)
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:

Header: Locus Line
LOCUS       AY182241 1931 bp mRNA linear PLN 04-MAY-2004
The locus line contains:
- Locus name: AY182241
- Length: 1931 bp
- Molecule type: mRNA
- Division: PLN
- Modification date: 04-MAY-2004
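Because the flat file is a line-type identifier format, a locus line like the apple record's can be taken apart with very little code. The sketch below splits the line on whitespace, which is a simplification: it assumes all eight fields are present, as they are in this record, and real flat files can deviate. The `Locus` type and the function name are invented for illustration.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    modification_date: str


def parse_locus_line(line):
    """Split a GenBank LOCUS line into named fields (assumes all fields are present)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    return Locus(fields[1], int(fields[2]), fields[4], fields[5], fields[6], fields[7])


if __name__ == "__main__":
    locus = parse_locus_line("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
    print(locus.name, locus.length_bp, locus.division)
```

For real work an established parser (for example the one in Biopython) is a safer choice than hand-written splitting.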

Header: Database Identifiers
ACCESSION   AY182241
VERSION     AY GI:
- Accession: stable, reportable, universal
- Version: tracks changes in sequence
- GI number: NCBI internal use

Header: Organism
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
NCBI-controlled taxonomy: this is the portion of the record that NCBI controls. It allows sequences to be retrieved in a precise and accurate way (useful for Entrez searching).

The Feature Table
FEATURES             Location/Qualifiers
     source          /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            /gene="AFS1"
     CDS             /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO "
                     /db_xref="GI: "
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
- Coding sequence: the CDS feature, running from start (atg) to stop (tag)
- Implied protein: the /translation qualifier
- GenPept identifiers: /protein_id and /db_xref
The feature table holds the biologically interesting information.

The Sequence: 99.99% Accurate
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      ...
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
Error ratio: 1/10,000 bp
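The link between the sequence and the "implied protein" in the feature table can be checked by hand or in a few lines of code. The sketch below translates the first 120 bases of the apple record with the standard genetic code, starting at the first atg. Taking the first atg as the start of the coding sequence is an assumption made for this example, because the CDS location did not survive in the transcript.

```python
# Standard genetic code, laid out in the conventional t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate lower-case DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First two sequence lines (bases 1 to 120) of the apple record, spaces removed.
ORIGIN_START = (
    "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
    "tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg"
).replace(" ", "")

if __name__ == "__main__":
    start = ORIGIN_START.find("atg")  # assumption: the CDS starts at the first atg
    print(translate(ORIGIN_START[start:]))
```

The printed peptide can be compared with the beginning of the /translation qualifier in the feature table (MEFRVHLQADNEQ...).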

Whole Genome Shotgun Projects
ftp.ncbi.nih.gov/genbank/wgs/
>600 Projects, >600 Taxa
- 423 bacteria
- 186 eukaryotes, including 62 fungi, 87 animals and 5 flowering plants

Mammalian WGS Duck-billed platypus Nine-banded armadillo
Northern tree shrew Domestic rabbit Pika Guinea pig Mouse Rat Thirteen-lined ground squirrel Small-eared galago Mouse lemur Orangutan Human Chimpanzee Gorilla Rhesus macaque Tenrec African elephant Dog Cat Horse European hedgehog Eurasian shrew Little brown bat Cow Gray short-tailed opossum

Plant WGS

Primary vs. Derivative Sequence Databases
[Figure: labs and sequencing centers submit sequences to GenBank; algorithms (UniGene), curators (RefSeq) and genome assembly build derivative databases from it.]
- GenBank: updated ONLY by submitters; ~11,000 sequences are submitted per day
- Derivative databases (UniGene, RefSeq, Genome Assembly): updated continually by NCBI

RefSeq: NCBI’s Derivative Sequence Database
- Curated transcripts and proteins: reviewed human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
- Model transcripts and proteins
- Assembled genomic regions (contigs): human genome, mouse genome, rat genome
- Chromosome records: human genome, microbial, organelle
- chicken, honeybee, sea urchin
Goal: a nonredundant set of genes/proteins for each organism represented.
Model: comes from analysis of the genomic content of an organism's assembly (reannotation of microbial genomes, for example).

RefSeq Benefits
- non-redundancy
- explicitly linked nucleotide and protein sequences
- updates to reflect current sequence data and biology
- data validation
- format consistency
- distinct accession series (NM_, NP_, NC_)

UniGene A gene-oriented view of sequence entries
- MegaBlast based automated sequence clustering
- Nonredundant set of gene oriented clusters
- Each cluster a unique gene
- Information on tissue types and map locations
- Includes known genes and uncharacterized ESTs
- Useful for gene discovery
Clusters of ESTs are built automatically on the basis of sequence similarity. Each cluster represents a gene.

EST hits: Human mRNA Thrombin mRNA 5’ EST hits 3’ EST hits

Protein Domains
Structural Domain
- Discrete independently folding unit of a protein
Conserved Domain (sequence-based)
- Protein region with recognizable position-specific pattern of sequence conservation
Sequence-based domains often roughly correspond to structural domains. Domains often have distinct, identifiable functions.

Structure Summary (Src domain)
Cn3D viewer Structure Neighbors 3D Domain Neighbors Conserved Domains

Structure vs Conserved Domain
SH2 SH3 TyrKC Conserved phosphotyrosine binding residues SH2

GEO record types
- GPL: platform descriptions (submitted by manufacturer*)
- GSM (GEO SaMple): raw/processed spot intensities from a single slide/chip, with experimental conditions (submitted by experimentalists)
- GSE (GEO SEries): grouping of slide/chip data into “a single experiment”, a set of related samples (submitted by experimentalists)
- GDS (GEO DataSet): grouping of experiments (curated by NCBI)
Search interfaces: Entrez GEO and Entrez GEO Datasets

Gene Expression Omnibus
Dataset browser

GEO Dataset Browser

GEO Dataset Report

GEO Profiles … of 12625

Genome Map: Human MLH1 Customizable EST Hits Transcripts Models
NCBI Assembly Gene Annotations

Examples of other databases

UniProtKB Protein sequences
Release 57.7 of 01-Sep-09 of UniProtKB/Swiss-Prot: sequence entries, comprising amino acids abstracted from references. UniProtKB/Swiss-Prot is the manually curated section of UniProtKB.

Annotation of data Manual versus computational annotation
A curator (domain expert) annotates databases
- Expert knowledge
- Literature review
- Use of other databases
- Use of computational tools
- Very time consuming

UniProtKB curation
Biologists with specific expertise do the annotation:
- Function(s)
- Enzyme-specific information
- Biologically relevant domains and sites
- Post-translational modifications
- Sub-cellular location(s)
- Tissue specificity
- Developmental specific expression
- Structure
- Interactions
- Splice isoform(s)
- Associated diseases or deficiencies or abnormalities
- etc
Cross-referencing
Also merging of different reports for a single protein

Can we use WIKI’s to support curation?

UniProtKB / SwissProt (HBA1)

UCSC Genome browser
This is the genome browser of the Human Genome Project at the University of California, Santa Cruz.
- Again a part of chromosome 16
- At the top you can navigate through the genome or search for specific genes
- Bar indicates coverage of the genome
- Output of several gene-finding programs is shown here
- mRNA and EST alignments, SNPs
- Blue bars are the known gene annotations

GeneCards (HBA1)

GeneCards

OMIM: Online Mendelian Inheritance in Man
- Knowledgebase of human genes and phenotypes
- Originally published as a book in 1966
- Content of OMIM is derived exclusively from the published biomedical literature
- Updated daily
Statistics (2009):
- full text entries
- 2,239 genes have mutations causing disease
- 3,770 diseases have a molecular basis
- 70 new entries added per month
- 700 entries are updated per month

OMIM: Online Mendelian Inheritance in Man

Example record: HBB (sickle cell anemia)

Gene Ontology (GO)
Set of structured controlled vocabularies
- For community use in annotating genes, gene products and sequences
- Biologically meaningful annotation
Three key biological domains:
- Molecular function
- Biological process
- Cellular component

Cellular component A cellular component is a component of a cell; this may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, a protein dimer).

Molecular function Activities that occur at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions. Examples of broad functional terms are catalytic activity, transporter activity, or binding.

Biological process Series of events accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cellular physiological process or signal transduction.

GO tree

Ontology relations
Types of relations in the Gene Ontology:
- is a (is a subtype of)
- part of
- regulates
- negatively regulates
- positively regulates

Relations allow reasoning
If A is a B, and B is part of C, we can infer that A is part of C.
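This inference rule is easy to mechanise. The sketch below applies only the single rule stated above (A is a B, B part of C, therefore A part of C) to sets of term pairs until nothing new can be derived. It uses the placeholder terms A, B and C rather than real GO identifiers, and it does not implement the other composition rules that GO tools support.

```python
def infer_part_of(is_a, part_of):
    """Apply the rule: A is_a B and B part_of C  =>  A part_of C.

    Both arguments are sets of (child, parent) pairs. The result contains the
    asserted part_of pairs plus everything the rule can derive from them.
    """
    inferred = set(part_of)
    changed = True
    while changed:  # repeat until no new pair is found
        changed = False
        for child, parent in is_a:
            for whole_part, whole in list(inferred):
                if parent == whole_part and (child, whole) not in inferred:
                    inferred.add((child, whole))
                    changed = True
    return inferred


if __name__ == "__main__":
    is_a = {("A", "B")}
    part_of = {("B", "C")}
    print(sorted(infer_part_of(is_a, part_of)))
```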

Database challenges

Challenges
- Explosive increase in number and size of databases
- Data quality
- Annotation / curation (e.g., gene names)
- Community standards
- Maintenance
- Integration (data formats and approaches)
- Transparent query across databases
- Format compatibility
- Database website usability
- Local installation of databases
- Data and/or tool integration
- (Funding)

Good value for money, yet difficult to establish and fund
Databases ensure that expensive data is not lost.

Explosive data growth: number of databases
1512 in the Molecular Biology Database Collection (NAR), but many more exist.
- How do you select the right database?
- Redundancy
- What are your considerations?

Why so many databases?
- The Web makes it (too) easy to publish
- Being a data provider is one way to make a reputation
- Many types of data; each has its own community and repositories
- Each new sub-discipline develops its own data representations skewed to its own biases
- Thus many specialist resource suppliers instead of a few centralised data centers (e.g., particle physics)
- Consequently, each data type has a multiplicity of resources: many replicating, partially overlapping or presenting slightly different views on more or less the same data types
- Technical or modelling skills of the database designer are often lacking
Goble and Stevens (2008) J. Biomedical Informatics, 41, 687

Data quality
Example: microarray experiment
- Experimental variation and noise are introduced throughout the process
- Sometimes data is processed before submission to the database
- Quality control procedures are required to ensure a minimum quality of the biological database
- But who checks at the database site?
Ji H, Davis RW. Data quality in genomics and microarrays. Nat Biotechnol. 2006, 24(9):

Community standards
There is a need for community standards. Examples:
- Minimal information recommendations (e.g., MIAME)
- Shared ontologies (National Center for Biomedical Ontologies)
- Systems biology: SBML, CellML, ...
- Pathways: BioPax

Database maintenance
Involves:
- Availability of sufficient curators
- Technicians (database, user-interface, API, ...)
- Hardware (server, storage, ...)
- Community interest
- Maintaining internal / external consistency
- Funding

Database integration

Understanding molecular processes
KEGG: Kyoto Encyclopedia of Genes and Genomes metabolite enzyme/gene

Understanding molecular processes
Integration with pathway databases (eg KEGG)

Cancer module map (Gene expression database)

Database integration: it is difficult
Solution (?): a single biological database
- Simpler
- But a poor solution
- Diverse databases reflect the expertise and interest of the groups that maintain them
- Instead, ensure that they can easily be integrated to allow cross-database queries

Integration based on (gene) names?
How do you assign and maintain the correct names of biological objects across databases?
For example: DNA-damage checkpoint-pathway gene
- S. cerevisiae (baker's yeast): Rad24
- S. pombe (fission yeast): rad24
These two genes are not orthologues.

DNA-damage checkpoint pathway genes
[Figure: DNA-damage checkpoint pathway genes in Saccharomyces cerevisiae (Sc), Schizosaccharomyces pombe (Sp) and C. elegans (Ce)]
- Rad24 (Sc) and rad24 (Sp): not orthologues
- Rad24 (Sc) and rad17 (Sp): orthologues
- rad17 (Sp) and Rad17 (Sc): not orthologues
- Rad17 (Sc) and mrt-2 (Ce): closest match
Human genes are sometimes named after S. cerevisiae orthologues, sometimes after S. pombe and sometimes have independently derived names. In C. elegans, none of the rad genes is orthologous to Rad17 of S. cerevisiae. Instead, the closest C. elegans match is mrt-2.

DNA-damage checkpoint pathway genes
Thus integration on the basis of gene name is impossible!
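The yeast example can be turned into a small demonstration of why joining databases on gene names goes wrong. The sketch below matches S. cerevisiae and S. pombe genes by case-insensitive name and compares the result with the orthology stated on the slide. The gene lists are limited to the genes mentioned there, and the function name is invented for illustration.

```python
# Gene names per species, as mentioned on the slide.
SC_GENES = ["Rad24", "Rad17"]   # S. cerevisiae
SP_GENES = ["rad24", "rad17"]   # S. pombe

# Orthology as stated on the slide: Rad24 (Sc) corresponds to rad17 (Sp).
CURATED_ORTHOLOGUES = {("Rad24", "rad17")}


def match_by_name(genes_a, genes_b):
    """Naive integration: pair genes whose names are equal ignoring case."""
    lookup = {gene.lower(): gene for gene in genes_b}
    return {(gene, lookup[gene.lower()]) for gene in genes_a if gene.lower() in lookup}


if __name__ == "__main__":
    name_matches = match_by_name(SC_GENES, SP_GENES)
    print("wrongly linked:", sorted(name_matches - CURATED_ORTHOLOGUES))
    print("missed:", sorted(CURATED_ORTHOLOGUES - name_matches))
```

Every pair found by name is wrong here, and the one true orthologue pair is missed, which is why integration needs curated identifiers or orthology tables rather than names.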

Other database integration challenges
Biological databases use different DBMSs and none provide a standard way of accessing the data
- Some provide large text dumps of their contents
- Some provide direct access to the underlying DBMS
- Some provide only web pages
Even more challenging: updates
- Biological databases are always changing, so integration must be an ongoing task
- Underlying formats and schemas may completely change
- Databases may link to different versions of the same database (inconsistencies)

Data integration problem is sociological
Meaningful scalable integration cannot be achieved without the cooperation of the data providers If they continue to produce online databases without regard for the way in which information will be aggregated then integration will stay a monumental

Data formats
- Flat text: >20 formats for sequences (GenBank, GCG, Fasta, ...)
- Relational database
- XML (e.g., SBML)
- RDF/OWL (e.g., BioPax)
Many (conversion) tools are available, but integrating data requires programming skills.

(User) interfaces Ftp Web-interface (through e.g., Internet Explorer)
Application Programming Interface (API) Web-services

Data access
- Not all queries allowed or possible (performance, limitations of the web interface)
- Database-wide mining or analysis not possible
- Integration with experimental data is often not possible
Much more is possible if databases are available on a local computer system and if people are around to develop the required software.

Errors in databases
- One very important issue is the frequency and type of errors among the entries of a database
- This depends strongly on the type of data, and on whether the database is curated (modified by a defined group of people) or not
- For the sequence databases, the errors may be either in the sequence itself or in the annotation
Be careful!

Computer exercises Explore frequently used biological databases
And how to query them via the web 13:30-16:00 L0-227
